An Alternative to the "It's All Your Fault" Theory

sirkris

Greetings,

As most of you already know, there has been a severe problem with MP connectivity in Civ4. The predominant explanation given by the lead designer has been that it's caused by players who don't have their firewalls or routers configured properly. As a programmer who has experience building p2p network protocols (both UDP and TCP/IP), I was highly skeptical of this explanation to say the least, as this is often nothing more than a cop-out for developers who are faced with a messy protocol full of bugs. That being said, it will not be the intent of this article to make such an allegation against the Civ4 developers; instead, this article will offer an alternative theory as to the cause of these connectivity issues based on the available data at hand as well as independent research on my part. Since I obviously don't have access to the source code, you must keep in mind that this is only a theory that I have no way of concretely proving, and that I won't be able to supply any code-specifics without the source.

First, I will give you a basic rundown of the testing I performed in the past few weeks regarding this issue, and what my findings have been thus far. Please note that all times of day I list are my local time, Pacific Standard Time (UTC-8, three hours behind EST).

In the interest of being concise, I will give you a comparison of today's experiment with last night's, as the results are pretty consistent with all the previous data. Also please note that I have no firewall software of any kind installed on this PC, and all the Windows XP built-in firewall crap has been thoroughly disabled. For testing, I am on a Comcast cable connection via a D-Link 714P+ router, hard wired, which I disconnect for the second phase of testing.

Today, at 3:20 PM PST (or 4:20 PM daylight saving time), I counted 94 players in the lobby. At peak times it averages around 90-110 players, so this counts as peak. After counting, I tried joining 5 games at random. Two had only 1 peer, one had 3 peers, one had 5, and the other had 2. I was only able to get through to one game, the one that had 3 peers. All the others got stuck on the first "Contacting peer," except for the 5-peer one, which got stuck on the second peer. In total, that makes 4 good peers and 8 bad peers (assuming that every peer after a stuck one is bad as well, which I know is impossible to tell, since Civ4 doesn't have any real debugging in place to help the player find out). That's a 2:1 bad-to-good peer ratio during peak hours, which is consistent with previous data. I then tried hosting a game and asked people to try to join. Three people (that I know of) tried, and none of them was able to get past "Contacting host."

I then exited Civ4 and shut down my PC. I unplugged the cable modem and router (i.e., power-cycled them), disconnected my PC from the router and connected it directly to the cable modem, then waited 60 seconds. I then booted the PC back up and ran Civ4 again. At this point I counted 93 players in the lobby, so roughly the same number. It's impossible to make it a perfect experiment, as the same games are rarely still up by this time, so I tried joining 5 different random games with roughly the same numbers of peers. Two had 1 peer, two had 4 peers, and one had 2 peers. Again, I was only able to successfully join one of the games, this time one of the 1-peer hosts. On all the rest, I wasn't able to get past the first peer. So, using the same standard as before, I got 1 good peer, and 7 bad peers. I then tried hosting, and same as before, nobody could connect to host.

Summary: It actually got worse after taking the router out of the equation. Mind you, I believe this was a statistical anomaly, as the results are usually pretty much the same without the router, or sometimes slightly better. In other words, whether or not I have the router connected doesn't seem to make any difference whatsoever. On a few afternoons I was able to connect to the same hosts (no way of knowing if they were the same peers though), and the results were pretty much the same with and without the router. So, at the very least, we can rule out my firewall and router as the cause of my p2p connectivity issues, and firewall and router misconfiguration is the central thesis of the lead developer's explanation.


Now let's go back to last night's experiment. At 1:00 AM PST, I counted 59 players in the lobby, so obviously not a peak hour and significantly less network traffic. With the router connected, I tried joining 5 games. One had 1 peer, three had 2 peers, and one had 6 peers. Of them, I was able to successfully join every single one except the 1-peer game. In other words, that's a solid 1:12 bad-to-good peer ratio. I then tried hosting, and within 5 minutes I had 7 players (including myself). They were a bit pissed when I bailed to finish the experiment, but I promised to rehost in a few minutes once I was finished with the second phase.

So again, I power-cycled the modem/router, connected the PC directly to the modem, and rebooted the PC. At this point there were 61 players in the lobby by my count. I tried joining 5 games. Two had 1 peer, one had 2 peers, one had 3 peers, and one had 5 peers. Again, I was able to successfully connect to all but one, the 5-peer game, on which it got stuck on the 4th peer. Assuming the 5th was bad as well, that's 2 bad peers against 10 good ones, so still only a 1:5 bad-to-good peer ratio without the router. I then tried hosting, and got 6 players (including myself) within the first 5 minutes. I wasn't able to get anyone else to join (probably cuz I bailed on them earlier).

Summary: Results virtually identical with/without router. Very little, if any, difficulty in joining/hosting games.
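For anyone who wants to repeat the count, here is a minimal Python sketch of the tallying standard used above: every peer in a joined game counts as good, and once "Contacting peer" stalls, that peer and every peer after it count as bad. The function names and the data layout are my own illustration, and the two sample lists are just the with-router runs from the two experiments.

```python
from fractions import Fraction

def tally(games):
    """Count good and bad peers.

    Each game is (peer_count, stalled_at), where stalled_at is the 1-based
    index of the peer on which "Contacting peer" got stuck, or None if the
    join succeeded. Peers after a stalled one are assumed bad as well.
    """
    good = bad = 0
    for peer_count, stalled_at in games:
        if stalled_at is None:
            good += peer_count
        else:
            good += stalled_at - 1
            bad += peer_count - (stalled_at - 1)
    return good, bad

def bad_to_good(good, bad):
    """Format the bad-to-good ratio in lowest terms, e.g. '2:1'."""
    if good == 0:
        return f"{bad}:0"
    ratio = Fraction(bad, good)
    return f"{ratio.numerator}:{ratio.denominator}"

# With-router runs as described in the post.
peak = [(1, 1), (1, 1), (3, None), (5, 2), (2, 1)]
off_peak = [(1, 1), (2, None), (2, None), (2, None), (6, None)]

for label, games in (("peak", peak), ("off-peak", off_peak)):
    good, bad = tally(games)
    print(label, "good:", good, "bad:", bad, "ratio:", bad_to_good(good, bad))
```

Keeping the per-game records rather than just the totals makes it easy to rerun the numbers under a different assumption, such as not counting the untested peers at all.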


Conclusion/Findings: With no firewall installed, the only variable I had affecting connectivity, according to the lead developer's explanation, was my router. However, both during peak times and non-peak, there was virtually no difference in connectivity between when I used the router and when I was connected directly to the cable modem. This fits what many people in the lobby have said to me as well; i.e. people having p2p communication issues without firewall or router installed. One could argue that the bad peers I was running into were the ones with firewall/router issues, but that wouldn't explain the issues when I tried hosting and nobody could even connect to me during peak hours.

So, let us set aside the lead developer's explanation for a moment and consider another possible variable affecting connectivity: network traffic. Consistently, during the past few weeks that I've been conducting these experiments (on average about 3-4 days a week, time permitting), I have witnessed the exact same trend: better connectivity when there are fewer players in the lobby, worse connectivity when there are more. From this preliminary data we can therefore infer that network traffic is closely related to the peer-to-peer communication issues.

The causal relationship is difficult to pinpoint; i.e., is network traffic directly the cause, or is it that when there are more people in the lobby, there is a greater likelihood that more of them have firewalls that disrupt the flow of traffic? Given the consistency of the correlation, I would be inclined to discount that as a viable explanation, but it is nonetheless a possibility.

I would thus theorize that the amount of GameSpy traffic is the direct cause of the peer-to-peer (p2p) connectivity issues. For those of you familiar with the concept of Denial of Service (DoS), you know that, if you flood an open port with too much data, it can prevent other connections from successfully being made to that port. I believe this may be what's happening with Civ4, though I obviously can't confirm or deny that as I'm not a developer with access to the source code.

However, I once developed a p2p network in C that allowed various telnet-based game servers (or "MUDs") to connect to one another via a decentralized chat network. It worked fine when there were a small number of users, but when more connected and network traffic increased, it started to become very unreliable, not always relaying packets from hub to hub; the data was sent OK, but not received. Also, when traffic got really high, servers had difficulty connecting to the network during peak times. How did I solve it? Well, I never was able to completely. However, by incorporating a packet compression protocol and reducing the amount of data that was relayed, I made it so the network was able to handle much more traffic before it began to have denial of service issues. I also considered adding more specialized ports; i.e., some ports dedicated to chat packets, others dedicated to private tells, others dedicated to new connections, etc., but I never got around to implementing it.
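To give a rough idea of what packet compression looks like, here is a minimal Python sketch. It is not the code from my old MUD network (that was written in C) and it has nothing to do with Civ4's actual protocol, which I haven't seen. The 4-byte length prefix, the function names, and the sample chat message are all assumptions made up for the illustration.

```python
import struct
import zlib

def pack(payload: bytes) -> bytes:
    """Compress a payload and prefix it with its compressed length."""
    body = zlib.compress(payload, 9)
    # "!I" is a 4-byte unsigned integer in network byte order.
    return struct.pack("!I", len(body)) + body

def unpack(frame: bytes) -> bytes:
    """Reverse pack(): read the length prefix, then decompress the body."""
    (length,) = struct.unpack("!I", frame[:4])
    return zlib.decompress(frame[4:4 + length])

# Hypothetical, highly repetitive lobby chatter.
message = b"[chat] anyone up for a 6-player pangaea game? " * 20
frame = pack(message)

assert unpack(frame) == message
print("raw bytes:", len(message), "compressed frame bytes:", len(frame))
```

Repetitive text like chat compresses very well, while very short or already-compressed packets can come out slightly larger, so a real protocol would probably only compress packets above some size threshold.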

There were times when I was tempted to just blame the problem on others, saying they probably didn't have the code installed properly, or had firewalls or just slow connections. But that was just a cop-out and I knew it.


Possible Solution: Instead of blaming the problem on players with poorly configured firewalls (without actually doing anything to confirm whether or not this is in fact the problem), the developers need to find a way to improve Civ4's socket handling and the efficiency thereof. If GameSpy traffic is causing an overload, which it appears to be doing based on the above data, then perhaps they could find a way to compress or reduce that data, or they could set up one or more extra ports to be used exclusively for GameSpy traffic, leaving other ports open exclusively for connecting to peers. I cite this as a "possible solution" because I have no way of testing this theory beyond what I have already done. However, based on the data at hand, I believe this solution would help.

By compressing/reducing the packet size and quantity, and perhaps rearranging the ports if necessary, they should at the very least be able to improve the connectivity situation during peak hours. It would be fairly simple to implement I would think, and putting it in the next patch couldn't hurt any.
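Here is a minimal Python sketch of the dedicated-port idea, under the assumption (which I can't verify) that lobby updates and peer handshakes currently compete for the same socket. Each kind of traffic gets its own UDP socket, and therefore its own receive queue, so a burst of lobby updates can't crowd out a handshake. The port roles, message contents, and function names are hypothetical.

```python
import selectors
import socket

def open_udp():
    """Open a non-blocking UDP socket on a free loopback port."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("127.0.0.1", 0))
    sock.setblocking(False)
    return sock

lobby = open_udp()  # hypothetical port for lobby/chat updates only
peers = open_udp()  # hypothetical port for peer handshakes only

sel = selectors.DefaultSelector()
sel.register(lobby, selectors.EVENT_READ, "lobby")
sel.register(peers, selectors.EVENT_READ, "peer")

def poll(timeout=0.2):
    """Read one datagram from each ready socket, peer handshakes first."""
    received = []
    for key, _events in sel.select(timeout):
        data, _addr = key.fileobj.recvfrom(2048)
        received.append((key.data, data))
    received.sort(key=lambda item: item[0] != "peer")
    return received

# Simulate a burst of lobby traffic followed by one handshake.
sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
for i in range(50):
    sender.sendto(b"lobby update %d" % i, lobby.getsockname())
sender.sendto(b"HELLO", peers.getsockname())

seen = []
for _ in range(60):
    seen.extend(poll())
    if any(kind == "peer" for kind, _data in seen):
        break

print("handshake seen after", len(seen), "datagram(s)")

sel.close()
for sock in (sender, lobby, peers):
    sock.close()
```

With a single shared socket, the handshake would sit behind all 50 lobby datagrams in one queue; with two sockets it can be picked up as soon as the peer port is polled.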


I hope this helps! =)


--Kristopher
 
Interesting post. Hopefully it will find its way to the Civ4 development team. Another thought, and this may have been mentioned before: During login, the MP server should test that the client has opened all necessary ports. If not, refuse the login and display a helpful message to the user advising which ports need to be opened.
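A rough sketch of what such a login-time check could look like in Python, assuming the server simply tries to connect back to the client. The function name and result strings are made up for the example, and it uses TCP for simplicity; checking a UDP port would need something on the client side to answer the probe.

```python
import socket

def probe(host, port, timeout=3.0):
    """Try a TCP connection and describe the outcome in plain language."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        return "closed: host reachable, but nothing is listening on this port"
    except socket.timeout:
        return "filtered: no reply, a firewall or router may be dropping the traffic"
    except OSError as exc:
        return f"unreachable: {exc}"

# Local demonstration against a throwaway listener on a free port.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))
listener.listen(1)
port = listener.getsockname()[1]

print(probe("127.0.0.1", port))
listener.close()
print(probe("127.0.0.1", port, timeout=1.0))
```

The point of separating "closed" from "filtered" is that the two call for different advice: the first means the game isn't listening, the second points at a firewall or router in between.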

Many modern firewalls also support UPnP, which provides a mechanism with which applications may dynamically request open ports from the firewall.
 
Agreed that there should be a more useful error response than an endless 'contacting peer' screen.

I knew nothing about opening ports on a firewall router until I tried to play Civ4 online, and spent about 4 hours on Sunday trying to find out how to do it. All support on various forums seems to be geared either to specific brands of routers, or to people who wouldn't need the support anyway because they know what they're doing.

eg: "Just forward the packets, noob" etc.

Suffice it to say I'm reasonably tech-savvy and have managed to set it up, but it wasn't easy and I can understand why so many people are having problems. I can now connect to virtually any game at any time, so I'm confident that there is something to the firewall issues. I've not had any connection problems since I set up the packet forwarding and virtual server options.

There's no way an average amateur user could do it though. Way too complicated imo.
 
Incidentally, I thought I might share this experience:

I hosted 2 games on Sunday night, no peer issues at all. Then, I hosted a third game, and one particular user couldn't contact me as the host, despite other people being able to. She was convinced that her settings were fine, so I asked her to host the game. Again, everyone else could join, but now I couldn't.

I draw from this that even with firewall settings set up properly, some users are just unable to contact some other users. I'm sure there is a setting either on her firewall or mine that is making this happen, but God only knows what that might be.
 
Very good points you guys made. In addition to the fixes I suggested, they should also improve the error messages to better help the user debug what exactly the problem is.

It's hard to say why you two separately would be able to connect to everyone else, but not to each other. Could be some sort of mutual firewall exclusion thing, though I doubt that in this case. These types of weird, unexplainable p2p communication issues are common when the protocol (i.e. the language each peer uses to talk to each other) is not designed very efficiently. I couldn't really go into a more detailed explanation without knowing more about how Civ4 handles port connections and whatnot. Perhaps, for example, the game itself might be rejecting the connection from that person (and vice versa on the other end) because of some arbitrary packet exclusion rule the programmers put in place. Of course, that's just a guess, and one possibility out of many.

Only the developers would be able to explain it better I believe, but so far the only explanation I've seen from them is some variation of, "Fix your router and firewall, noob."
 