Greetings,
As most of you already know, there has been a severe problem with MP connectivity in Civ4. The predominant explanation given by the lead designer has been that it's caused by players who don't have their firewalls or routers configured properly. As a programmer who has experience building p2p network protocols (over both UDP and TCP), I was highly skeptical of this explanation to say the least, as this is often nothing more than a cop-out for developers who are faced with a messy protocol full of bugs. That being said, it is not the intent of this article to make such an allegation against the Civ4 developers; instead, this article offers an alternative theory as to the cause of these connectivity issues, based on the available data as well as independent research on my part. Since I obviously don't have access to the source code, keep in mind that this is only a theory that I have no way of concretely proving, and that I can't supply any code specifics without the source.
First, I will give you a basic rundown of the testing I performed in the past few weeks regarding this issue, and what my findings have been thus far. Also, please note that all times of day I list are my local time, Pacific Standard Time, which is UTC-8 (three hours behind EST).
In the interest of being concise, I will give you a comparison of today's experiment with last night's, as the results are pretty consistent with all the previous data. Also please note that I have no firewall software of any kind installed on this PC, and all the Windows XP built-in firewall crap has been thoroughly disabled. For testing, I am on a Comcast cable connection via a D-Link 714P+ router, hard wired, which I disconnect for the second phase of testing.
Today, at 3:20 PM PST (or 4:20 PM daylight savings time), I counted 94 players in the lobby. At peak times, it averages around 90-110 players, so this would be considered peak. After counting, I tried joining 5 games at random. Two had only 1 peer, one had 3 peers, one had 5, and the other had 2. I was only able to get through to one game, the one that had 3 peers. All the others got stuck on the first "Contacting peer," except for the 5-peer one, which got stuck on the second peer. In total, that makes 4 good peers and 8 bad peers (assuming the peers after a stuck one are bad as well, which I know is impossible to tell, since Civ4 doesn't have any real debugging in place to help the player find out). That's a 2:1 bad-to-good peer ratio during peak hours, which is consistent with previous data. I then tried hosting a game and asked people to try to join. Three people (that I know of) tried, and none of them was able to get past "Contacting host."
I then exited Civ4 and shut down my PC. I unplugged the cable modem and router (aka power cycle), disconnected my PC from the router and connected it directly to the cable modem, then waited 60 seconds. I then booted the PC back up and ran Civ4 again. At this point I counted 93 players in the lobby, so roughly the same number. It's impossible to make it a perfect experiment, as the same games are rarely still up by this time, so I tried joining 5 different random games with roughly the same numbers of peers. Two had 1 peer, two had 4 peers, and one had 2 peers. Again, I was only able to successfully join one of the games, this time one of the 1-peer hosts. On all the rest, I wasn't able to get past the first peer. So, using the same standard as before, I got 1 good peer and 11 bad peers. I then tried hosting, and same as before, nobody could connect to me as host.
Summary: It actually got worse after taking the router out of the equation. Mind you, I believe this was a statistical anomaly, as usually the results are pretty much the same without the router, or sometimes slightly better. In other words, whether or not I have the router connected doesn't seem to make any difference whatsoever. On a few afternoons I was able to connect to the same hosts (no way of knowing if they were the same peers though), and the results were pretty much the same with and without the router. So, at the very least, we can rule out my firewall and router as the cause of my p2p connectivity issues, even though firewalls and routers are the central thesis of the lead developer's explanation.
Now let's go back to last night's experiment. At 1:00 AM PST, I counted 59 players in the lobby, so obviously not a peak hour and significantly less network traffic. With the router connected, I tried joining 5 games. One had 1 peer, three had 2 peers, and one had 6 peers. Of them, I was able to successfully join every single one except the 1-peer game. In other words, that's a solid 1:12 bad-to-good peer ratio. I then tried hosting, and within 5 minutes I had 7 players (including myself). They were a bit pissed when I bailed to finish the experiment, but I promised to rehost in a few minutes once I was finished with the second phase.
So again, I power-cycled the modem/router, connected the PC directly to the modem, and rebooted the PC. At this point there were 61 players in the lobby by my count. I tried joining 5 games. Two had 1 peer, one had 2 peers, one had 3 peers, and one had 5 peers. Again, I was able to successfully connect to all but one, the 5-peer game, on which it got stuck on the 4th peer. Assuming the 5th was bad as well, that's 2 bad peers to 10 good, still a 1:5 bad-to-good peer ratio without the router. I then tried hosting, and got 6 players (including myself) within the first 5 minutes. I wasn't able to get anyone else to join (probably cuz I bailed on them earlier).
Summary: Results virtually identical with/without router. Very little, if any, difficulty in joining/hosting games.
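For anyone who wants to re-check the peer tallies, here is a minimal Python sketch of the counting rule used above: every peer in a game that was joined counts as good, and in a game that got stuck, the peers before the stuck one count as good while the stuck peer and everything after it count as bad. The function name, the trial labels and the data layout are just illustrative choices; the game sizes and stuck positions are the ones listed in the four trials.

```python
def tally(games):
    """Count (good, bad) peers for one trial.

    games is a list of (peer_count, stuck_at) pairs, where stuck_at is the
    1-based position of the peer the join stalled on, or None if the join
    succeeded. Peers after a stalled one are assumed bad, as in the post.
    """
    good = bad = 0
    for peers, stuck_at in games:
        if stuck_at is None:
            good += peers
        else:
            good += stuck_at - 1
            bad += peers - (stuck_at - 1)
    return good, bad


# Game sizes and stall positions as reported for each trial.
trials = {
    "peak, router": [(1, 1), (1, 1), (3, None), (5, 2), (2, 1)],
    "peak, direct": [(1, None), (1, 1), (4, 1), (4, 1), (2, 1)],
    "off-peak, router": [(1, 1), (2, None), (2, None), (2, None), (6, None)],
    "off-peak, direct": [(1, None), (1, None), (2, None), (3, None), (5, 4)],
}

for name, games in trials.items():
    good, bad = tally(games)
    print(f"{name}: {good} good, {bad} bad")
```

Keep in mind the "everything after the stuck peer is bad" rule is a worst-case assumption, so the bad counts are upper bounds rather than measurements.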
Conclusion/Findings: With no firewall installed, the only variable I had affecting connectivity, according to the lead developer's explanation, was my router. However, during both peak and non-peak times, there was virtually no difference in connectivity between when I used the router and when I was connected directly to the cable modem. This fits what many people in the lobby have told me as well, namely that they have p2p communication issues with no firewall or router installed. One could argue that the bad peers I was running into were the ones with firewall/router issues, but that wouldn't explain why nobody could connect to me when I tried hosting during peak hours.
So, let us set aside the lead developer's explanation for a moment and consider another possible variable affecting connectivity: network traffic. Consistently, during the past few weeks that I've been conducting these experiments (on average about 3-4 days a week, time permitting), I have witnessed the exact same trend: better connectivity when there are fewer players in the lobby, worse connectivity when there are more. Therefore, we can infer from this preliminary data that network traffic is correlated with the peer-to-peer communication issues.
The causal relationship is difficult to pinpoint; i.e., is network traffic directly the cause, or is it that when there are more people in the lobby, there is a greater likelihood that more of them have firewalls that disrupt the flow of traffic? Given the consistency of the correlation, I would be inclined to discount the latter as a viable explanation, but it is nonetheless a possibility.
I would thus theorize that the amount of GameSpy traffic is the direct cause of the peer-to-peer (p2p) connectivity issues. For those of you familiar with the concept of Denial of Service (DoS), you know that, if you flood an open port with too much data, it can prevent other connections from successfully being made to that port. I believe this may be what's happening with Civ4, though I obviously can't confirm or deny that as I'm not a developer with access to the source code.
However, I once developed a p2p network in C that allowed various telnet-based game servers (or "MUDs") to connect to one another via a decentralized chat network. It worked fine when there were a small number of users, but when more connected and network traffic increased, it started to become very unreliable, not always relaying packets from hub to hub; the data was sent ok, but not received. Also, when traffic got really high, servers had difficulty connecting to the network during peak times. How did I solve it? Well, I never was able to completely. However, by incorporating a packet compression protocol and reducing the amount of data that was relayed, I made it so the network was able to handle much more traffic before it began to have denial of service issues. I also considered adding more specialized ports, i.e. some ports dedicated to chat packets, others dedicated to private tells, others dedicated to new connections, etc., but I never got around to implementing it.
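To give a rough idea of what compressing relayed packets can look like, here is a minimal Python sketch. It is not the protocol from the MUD network (that was written in C) and it has nothing to do with Civ4's actual wire format; the 4-byte length prefix, the function names and the sample chat message are all assumptions made up for illustration. It just compresses a payload with zlib and frames it so the receiver knows how many bytes to read.

```python
import struct
import zlib

# Hypothetical framing: a 4-byte big-endian length prefix before each body.
HEADER = struct.Struct("!I")


def pack(payload: bytes) -> bytes:
    """Compress a payload and prepend the length of the compressed body."""
    body = zlib.compress(payload, 9)
    return HEADER.pack(len(body)) + body


def unpack(frame: bytes) -> bytes:
    """Read one frame and return the original, uncompressed payload."""
    (length,) = HEADER.unpack_from(frame)
    body = frame[HEADER.size:HEADER.size + length]
    if len(body) != length:
        raise ValueError("truncated frame")
    return zlib.decompress(body)


# A made-up, repetitive chat relay message; real traffic will compress less.
message = b"[chat] hub-a -> hub-b: " + b"hello everyone " * 40
frame = pack(message)
assert unpack(frame) == message
print(len(message), "bytes raw,", len(frame), "bytes on the wire")
```

How much this saves depends entirely on the data: repetitive text shrinks a lot, while very short or already-compressed packets can actually grow once the header is added.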
There were times when I was tempted to just blame the problem on others, saying they probably didn't have the code installed properly, or had firewalls or just slow connections. But that was just a cop-out and I knew it.
Possible Solution: Instead of blaming the problem on players with poorly configured firewalls (without actually doing anything to confirm whether or not this is in fact the problem), the developers need to find a way to improve Civ4's socket handling and the efficiency thereof. If GameSpy traffic is causing an overload, which it appears to be doing based on the above data, then perhaps they could find a way to compress or reduce that data, or they could set up one or more extra ports to be used exclusively for GameSpy traffic, leaving other ports open exclusively for connecting to peers. I cite this as a "possible solution" because I have no way of testing this theory beyond what I have already done. However, based on the data at hand, I believe this solution would definitely help.
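As a toy illustration of the dedicated-port idea, here is a short Python sketch that runs entirely on the loopback interface. It says nothing about how Civ4 or GameSpy actually use ports; the socket names, the message contents and the number of datagrams are all made up. It opens one UDP socket for lobby-style traffic and another for peer handshakes, piles unread datagrams onto the first, and shows that a handshake sent to the second is received without waiting behind that backlog.

```python
import socket


def open_udp():
    """Open a UDP socket on loopback, letting the OS pick a free port."""
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.bind(("127.0.0.1", 0))
    s.settimeout(2.0)  # fail fast instead of hanging if nothing arrives
    return s


lobby = open_udp()  # hypothetical: lobby/chat traffic only
peer = open_udp()   # hypothetical: peer handshakes only
sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

# Pile up lobby datagrams that nobody reads; they queue on the lobby
# socket (or get dropped if its receive buffer fills).
for i in range(200):
    sender.sendto(b"lobby update %d" % i, lobby.getsockname())

# The handshake goes to its own port, so it has its own receive queue.
sender.sendto(b"HELLO", peer.getsockname())
data, _ = peer.recvfrom(1024)
print(data)

for s in (lobby, peer, sender):
    s.close()
```

Separate ports only keep the receive queues apart; if the link itself is saturated, both kinds of traffic still compete for the same bandwidth, which is why reducing the volume of data matters too.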
At the very least, by compressing/reducing the packet size and quantity, and perhaps rearranging the ports if necessary, they should be able to improve the connectivity situation during peak hours. I would think it would be fairly simple to implement, and putting it in the next patch couldn't hurt.
I hope this helps! =)
--Kristopher