Multiplayer: Hacking A Fix Around OOS Errors

Afforess

I know one of the biggest issues with Multiplayer in RAND is the inevitable OOS error that crops up. Restarting and resuming is a huge PITA.

What is an OOS?:

Civilization 4 uses an interesting and rather rare multiplayer mechanism. In order to save bandwidth, all the machines run Civilization 4 independently. They all start from the same starting point, and because the AI and RNG are deterministic, they stay in "sync". They can fall out of sync in one of two major ways:

1.) Failure to transmit player actions. When a human operator does an action, that action is broadcast to all other machines. If a modder adds a new action that needs to be transmitted, but isn't, OOS will occur after a player uses a particular feature. This is usually easy to detect, because choosing some setting in a menu leads to an OOS, and you can figure out what you just did to cause it.

2.) Non-deterministic code. This is a bug in the codebase that breaks the deterministic design of Civilization. Normally, if you start from the same starting point and RNG seed, and play the game the exact same way, you will have the exact same result, every time, without fail. Non-deterministic code in some way breaks this and alters the outcome. It is tricky to track down without dumping the entire state of every game and comparing. That is what the OOS Logger tries to do, but it can't always catch it.
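
To make the second case concrete, here is a minimal sketch of two machines running in lockstep from the same seed and comparing a checksum each turn. Everything in it (SyncRand, Machine, firstOutOfSyncTurn, the LCG constants) is a hypothetical stand-in, not the game's actual CvRandom or sync check; it only illustrates how one stray draw from the shared RNG on one machine is enough to diverge.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the synchronized RNG (not the real CvRandom).
struct SyncRand {
    std::uint32_t seed;
    // Linear congruential step: same seed in, same sequence out.
    std::uint32_t next(std::uint32_t bound) {
        assert(bound > 0);
        seed = seed * 1103515245u + 12345u;
        return (seed >> 16) % bound;
    }
};

// Hypothetical miniature "game": a gold counter driven by the sync RNG.
struct Machine {
    SyncRand rng;
    std::uint32_t gold;

    void playTurn() { gold += rng.next(100); }

    // Cheap checksum over everything that must match between machines.
    std::uint32_t checksum() const { return gold * 31u + rng.seed; }
};

// Runs two machines for `turns` turns. If strayDrawOnB is true, machine B
// makes one extra sync-RNG draw on turn 3 (say, for something purely local).
// Returns the first turn on which the checksums differ, or -1 if none.
int firstOutOfSyncTurn(int turns, bool strayDrawOnB) {
    Machine a = {{42u}, 0u};
    Machine b = {{42u}, 0u};
    for (int t = 1; t <= turns; ++t) {
        a.playTurn();
        b.playTurn();
        if (strayDrawOnB && t == 3) b.rng.next(10);  // the stray draw
        if (a.checksum() != b.checksum()) return t;
    }
    return -1;
}
```

Without the stray draw the two machines never disagree; with it, the RNG states differ from turn 3 onward even though no visible value has changed yet, which is why this kind of bug is hard to spot by eye.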

Why does saving and rejoining fix OOS (if temporarily)?

You are essentially agreeing to use a common state, which resets whatever divergence the non-deterministic code introduced.

What can be done to fix MP?

Obviously fixing the non-deterministic code would be ideal. The problem is that even base BTS has a few rare OOS bugs. Perfect determinism is hard. Even Firaxis couldn't do it.

That isn't to say we shouldn't try, we should! But we shouldn't expect to succeed, just improve.

So if success is impossible, then what?

Embrace failure.

Saving and rejoining fixes OOS, right? Why not automate that? We can simply have the host (player 1) send their state to all players anytime an OOS is encountered. Internet speeds have vastly improved since the dial-up days, and 3-5MB of data is no problem. It might cost a second or two of lag when an OOS is encountered, but other than that, it should make MP playable again.

Basically, how it would work is that we serialize all the state of the game the same way saves already serialize it, but instead of "saving" it to a file, we pipe it over the network and have the other players deserialize it and load the state over the top of their own. Then the OOS should disappear (in theory).
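
A rough sketch of that idea, assuming a drastically simplified state: the GameState struct, the byte layout, and the names serialize and loadOverTop are all made up for illustration and have nothing to do with the real save format. The host turns its state into a byte buffer, and a client replaces its own state only if the whole buffer parses cleanly.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

typedef std::vector<std::uint8_t> Bytes;

// Hypothetical, drastically simplified game state; the real one is far larger.
struct GameState {
    std::uint32_t turn;
    std::uint32_t rngSeed;
    std::vector<std::uint32_t> playerGold;
};

// Append a 32-bit value in little-endian byte order.
void putU32(Bytes& out, std::uint32_t v) {
    for (int i = 0; i < 4; ++i) out.push_back(static_cast<std::uint8_t>(v >> (8 * i)));
}

// Read a 32-bit little-endian value; returns false if the buffer is too short.
bool getU32(const Bytes& in, std::size_t& pos, std::uint32_t& v) {
    assert(pos <= in.size());
    if (in.size() - pos < 4) return false;
    v = 0;
    for (int i = 0; i < 4; ++i) v |= static_cast<std::uint32_t>(in[pos + i]) << (8 * i);
    pos += 4;
    return true;
}

// Host side: flatten the state into bytes instead of writing a save file.
Bytes serialize(const GameState& s) {
    Bytes out;
    putU32(out, s.turn);
    putU32(out, s.rngSeed);
    putU32(out, static_cast<std::uint32_t>(s.playerGold.size()));
    for (std::size_t i = 0; i < s.playerGold.size(); ++i) putU32(out, s.playerGold[i]);
    return out;
}

// Client side: parse into a temporary and overwrite the local state only on
// success, so a truncated transfer never leaves a half-loaded game.
bool loadOverTop(const Bytes& in, GameState& local) {
    GameState incoming;
    std::size_t pos = 0;
    std::uint32_t count = 0;
    if (!getU32(in, pos, incoming.turn)) return false;
    if (!getU32(in, pos, incoming.rngSeed)) return false;
    if (!getU32(in, pos, count)) return false;
    if (count > (in.size() - pos) / 4) return false;  // claimed size too large
    incoming.playerGold.resize(count);
    for (std::uint32_t i = 0; i < count; ++i) {
        if (!getU32(in, pos, incoming.playerGold[i])) return false;
    }
    if (pos != in.size()) return false;  // unexpected trailing bytes
    local = incoming;
    return true;
}
```

Parsing into a temporary first is a design choice in this sketch: if the transfer is cut short, the client keeps its old (out of sync but consistent) state rather than a mix of both.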

I'd like comments and thoughts from some of the C2C developers, since I know Koshling wrote the new save format and is most familiar with the serialization aspects of the game. But this seems do-able in theory.
 
Seems perfect except for one thing: how many times have you tried MP lately? I can't remember hot joining the game working even a single time in the last 4 years. If you do it, the game stays out of sync. In order to resync, both players have to quit the game, the host reloads, and the other players join in. At least that's what has always happened to me. I don't remember if it worked in vanilla civ.
 
45°38'N-13°47'E;13333251 said:
Seems perfect except for one thing: how many times have you tried MP lately? I can't remember hot joining the game working even a single time in the last 4 years. If you do it, the game stays out of sync. In order to resync, both players have to quit the game, the host reloads, and the other players join in. At least that's what has always happened to me. I don't remember if it worked in vanilla civ.

I'm not sure how hot-joining works. But the entire save reset process I am describing is analogous to starting from the lobby and having all other players "download save" before they can launch. Worth looking into why that is bugged though.
 
I'm not sure how hot-joining works. But the entire save reset process I am describing is analogous to starting from the lobby and having all other players "download save" before they can launch. Worth looking into why that is bugged though.
I'm sure but I fear the problem I'm describing implies that even the host should reload the game or you'll still be out of sync.
 
Now that's something I would like to see a lot. If OOSs can be dealt with better or at least faster, that would be great. Still, there is the cache cleaning, right? It seems clearing the cache helps, so how can both things be done together? Is it possible?
 
45°38'N-13°47'E;13333287 said:
I'm sure but I fear the problem I'm describing implies that even the host should reload the game or you'll still be out of sync.

The host should have the "good" state of the game. When OOS occurs, it's the other players who are out of sync. You can't be out of sync with only 1 player (that's how SP works).

Now that's something I would like to see a lot. If OOSs can be dealt with better or at least faster, that would be great. Still, there is the cache cleaning, right? It seems clearing the cache helps, so how can both things be done together? Is it possible?

XML cache should not affect MP at all.
 
The host should have the "good" state of the game. When OOS occurs, it's the other players who are out of sync. You can't be out of sync with only 1 player.

That's what I thought too, but if you try a Pitboss game you'll see two players getting out of sync errors, yet they are in sync and it's the hosting PC that's out of sync. So for some reason state of host isn't being transmitted correctly when hot joining, it only works if both quit the game and reconnect. Sometimes it only works if you quit civ4 completely, going back to main menu isn't enough. I know because I've tried multiple times.
 
45°38'N-13°47'E;13333430 said:
That's what I thought too, but if you try a Pitboss game you'll see two players getting out of sync errors, yet they are in sync and it's the hosting PC that's out of sync. So for some reason state of host isn't being transmitted correctly when hot joining, it only works if both quit the game and reconnect. Sometimes it only works if you quit civ4 completely, going back to main menu isn't enough. I know because I've tried multiple times.

Pitboss is different. Pitboss means the pitboss application acts like a server, it is the central "host", not player 1.
 
Pitboss is different. Pitboss means the pitboss application acts like a server, it is the central "host", not player 1.
I know, I was just saying that you can't hot join any game, and since pitboss always starts with players hot joining the game started by the server, it doesn't work either, probably for the same reason: hot joining isn't working.
 
Pitboss is different. Pitboss means the pitboss application acts like a server, it is the central "host", not player 1.
Except for not actually having a player playing at that computer it still does pretty much the same thing as a normal player computer would do.
So in regards to synchronization there is no real difference.

But yes, hot joining does not work properly. Nor does reloading unless both players exit the program and restart first. So something is not properly cleaned up on returning to the main menu that influences the gamestate on rejoining. Most likely that something is in the Python environment.

It might well be though that serializing the gamestate and streaming it from the host has a higher chance of fixing the OoS state. It might be slow though given that you have to cut it up into small chunks to send it over the message system that is likely not optimized for that kind of stream and might have limits we are not aware of yet.

If it works though that is a significant improvement.
 
I'd brought up a nearly identical idea with AIAndy a while back and he agreed it could be done. What we'd have to watch out for are those situations where a player that isn't the 'primary' has an effect, like say a free tech or something, that throws things OOS and only their system records the benefits. The primary resets the game and the secondary loses out on whatever just took place for them because of a failure to report. This can lead to rather unfair and much harder to figure out problems because we're now made numb to caring to resolve these royal pains in the arse because we have this nifty workaround.

Otherwise, as I theorized back then, you're absolutely right that it should be quite useful and SHOULD be easily done. 45* is right though that one player exiting and re-entering (hot-reload) while the other keeps the game going just doesn't tend to work at all. So that may be some issue in the Firaxis core (I don't think it's EVER worked well.)

AIAndy offered further insights at the time on how to go about it, and it was done in our forum SOMEWHERE (probably in one of the many threads dedicated to OOS fixing in C2C's bug threads forum), but I didn't understand it much at the time. His understanding of data management in loads and synchronized states far exceeds mine.

However, I did have a lot of success debugging a great many OOS errors recently by building a logging mechanism that let me gradually narrow in on the function creating the problem, then log the values used in that function line by line to show which line went awry. In most of those cases the cause was an uninitialized variable passed by pointer as a function call parameter (&blabla). When the function did not write a result to that parameter, a later read of it returned an undefined value that each system picked differently. Caching has also been determined to be a major contributor in some games and is usually less necessary in multi-play with simultaneous turns. The multi-core stuff can cause issues too, so it's advised for players to go into the globals and turn the core count down from 4 to 1 if playing multi.
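
For anyone who hasn't run into this pattern, here is a small sketch of it. The function and table names (tryGetYield, totalYield) are hypothetical and only illustrate the shape of the bug: an out-parameter that is written only on success, read later by a caller who never initialized it.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical lookup: writes to *pValue only when the key exists.
// On failure *pValue is left untouched, and that is the trap.
bool tryGetYield(const std::map<std::string, int>& table,
                 const std::string& key, int* pValue) {
    assert(pValue != 0);
    std::map<std::string, int>::const_iterator it = table.find(key);
    if (it == table.end()) return false;
    *pValue = it->second;
    return true;
}

// Buggy pattern, kept as a comment because reading an uninitialized int is
// undefined behaviour:
//     int bonus;                          // never initialized
//     tryGetYield(table, key, &bonus);    // may not write anything
//     return base + bonus;                // each machine may read different garbage
//
// Fixed pattern: give the variable a defined default so every machine
// computes the same result when the lookup fails.
int totalYield(const std::map<std::string, int>& table,
               const std::string& key, int base) {
    int bonus = 0;
    tryGetYield(table, key, &bonus);
    return base + bonus;
}
```

In single player the garbage value might only ever look like an odd number somewhere; in multiplayer each machine reads its own garbage, so the states diverge.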

I do think there are some issues in rev too, be it in the python or the core (and there's a major tag problem there, last I checked, between the building tag for national vs local stability modifier, which you might want to look into for AND as well). I haven't fixed that yet for C2C since I don't really 'get' rev very well; I don't play it, so as to minimize OOS errors when playing with the Missus.
 
I agree that it could be done, but it is by no means 'low hanging fruit'. You'd have to implement your own network connection, since the game provides no direct-enough access to it, and your own enforce-this-state protocol over it (which could plausibly re-use the load save code). Because the OOS can be detected anywhere, but only the host could force its state out (else you risk state clashes if multiple nodes try to impose their state when several go OOS), you'd need that protocol to support OOS notifications from the slaves to the host, to which it would respond by sending out force syncs (which would have to be to all players, I think).
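
As a sketch of the host side of such a protocol, under assumptions of my own: the message types, the HostSync class, and the rule of ignoring repeat notifications for a turn that is already being resynced are all hypothetical, not anything the game provides. Any node may report an OOS, but only the host answers, and it answers by pushing a force sync to every other player.

```cpp
#include <cassert>
#include <vector>

enum class MsgType { OosNotify, ForceSync };

// Hypothetical protocol message: who sent it and who it is addressed to.
struct Msg {
    MsgType type;
    int from;
    int to;
};

// Host-side handler. Returns the messages the host should send in response.
class HostSync {
public:
    HostSync(int hostId, int playerCount)
        : hostId_(hostId), playerCount_(playerCount), lastSyncedTurn_(-1) {
        assert(hostId >= 0 && hostId < playerCount);
    }

    std::vector<Msg> onMessage(const Msg& m, int turn) {
        std::vector<Msg> out;
        // Only OOS notifications addressed to the host trigger anything.
        if (m.type != MsgType::OosNotify || m.to != hostId_) return out;
        // Several nodes may report the same OOS; resync once per turn.
        if (turn == lastSyncedTurn_) return out;
        lastSyncedTurn_ = turn;
        // Force the host state onto every other player, not just the reporter.
        for (int p = 0; p < playerCount_; ++p) {
            if (p == hostId_) continue;
            Msg sync = {MsgType::ForceSync, hostId_, p};
            out.push_back(sync);
        }
        return out;
    }

private:
    int hostId_;
    int playerCount_;
    int lastSyncedTurn_;
};
```

Because clients never emit ForceSync themselves, two nodes that both go OOS cannot try to impose conflicting states on each other.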

Another (perhaps simpler, provided you are happy with sequential rather than simultaneous play) thing you could do is go ahead and implement a load/save via network, and run the game in effectively PBEM mode, where at the end of the turn the node completing its turn pushes a remote load of its end-of-turn state to the next player's computer. That way everything is clean at the start of each turn. From there you could modify the code (relatively simply) to allow players whose turn it is NOT to use the UI and LOOK at things, but disallow actions (until it is their turn). The result would be PBEM without the email, and where you can continue to plan when it's not your turn. I think this would be much easier.
 
I agree that it could be done, but it is by no means 'low hanging fruit'. You'd have to implement your own network connection, since the game provides no direct-enough access to it, and your own enforce-this-state protocol over it (which could plausibly re-use the load save code). Because the OOS can be detected anywhere, but only the host could force its state out (else you risk state clashes if multiple nodes try to impose their state when several go OOS), you'd need that protocol to support OOS notifications from the slaves to the host, to which it would respond by sending out force syncs (which would have to be to all players, I think).

There is a way for net packets to be sent already, can't we abuse that to send the entire game state? Admittedly I have never tried to send any large amount of data through it, I don't know how it would handle that. But we should theoretically be able to send a CvMessageData struct of any size. If that works, I don't think we need to create a separate network connection.

Another (perhaps simpler, provided you are happy with sequential rather than simultaneous play) thing you could do is go ahead and implement a load/save via network, and run the game in effectively PBEM mode, where at the end of the turn the node completing its turn pushes a remote load of its end-of-turn state to the next player's computer. That way everything is clean at the start of each turn. From there you could modify the code (relatively simply) to allow players whose turn it is NOT to use the UI and LOOK at things, but disallow actions (until it is their turn). The result would be PBEM without the email, and where you can continue to plan when it's not your turn. I think this would be much easier.

I think that is a solid plan B. I prefer sequential turns because it feels like a real board game, and it is easy to abuse simultaneous turns in war.
 
There is a way for net packets to be sent already, can't we abuse that to send the entire game state? Admittedly I have never tried to send any large amount of data through it, I don't know how it would handle that. But we should theoretically be able to send a CvMessageData struct of any size. If that works, I don't think we need to create a separate network connection.
There is a limit. I ran into it when I transferred the custom build lists (that are read from local files so they have to be sent to the other computers to work in multiplayer). For longer lists that did not work (not in single player either so it can be tested there).
I do not remember the exact limit but I seem to have split the lists into chunks of 100, so the limit seems to be somewhere around 100 OrderData structs. So the limit might actually be a standard network packet.
 
There is a limit. I ran into it when I transferred the custom build lists (that are read from local files so they have to be sent to the other computers to work in multiplayer). For longer lists that did not work (not in single player either so it can be tested there).
I do not remember the exact limit but I seem to have split the lists into chunks of 100, so the limit seems to be somewhere around 100 OrderData structs. So the limit might actually be a standard network packet.

Ok. The underlying connection is TCP, right? We can count on packets arriving in order?

If so, it should be straightforward to read the game state into memory, chunk the bytes up into CvMessageData packets, and unchunk them on the other side.
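
A minimal sketch of the chunk/unchunk step, assuming in-order delivery. The Chunk struct here is a hypothetical stand-in for whatever a CvMessageData subclass would carry, and the payload limit is a parameter because the real limit is not known.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical packet body: position in the stream plus a slice of bytes.
struct Chunk {
    std::uint32_t index;
    std::uint32_t total;
    std::vector<std::uint8_t> payload;
};

// Split a serialized state into chunks of at most maxPayload bytes each.
std::vector<Chunk> splitIntoChunks(const std::vector<std::uint8_t>& data,
                                   std::size_t maxPayload) {
    assert(maxPayload > 0);
    std::vector<Chunk> chunks;
    const std::size_t total = (data.size() + maxPayload - 1) / maxPayload;
    for (std::size_t i = 0; i < total; ++i) {
        const std::size_t begin = i * maxPayload;
        const std::size_t end = std::min(begin + maxPayload, data.size());
        Chunk c;
        c.index = static_cast<std::uint32_t>(i);
        c.total = static_cast<std::uint32_t>(total);
        c.payload.assign(data.begin() + static_cast<std::ptrdiff_t>(begin),
                         data.begin() + static_cast<std::ptrdiff_t>(end));
        chunks.push_back(c);
    }
    return chunks;
}

// Reassemble on the receiving side. Relies on in-order delivery and returns
// false (leaving `out` untouched) if a chunk is missing or out of order.
bool reassemble(const std::vector<Chunk>& chunks, std::vector<std::uint8_t>& out) {
    std::vector<std::uint8_t> buf;
    for (std::size_t i = 0; i < chunks.size(); ++i) {
        if (chunks[i].index != i || chunks[i].total != chunks.size()) return false;
        buf.insert(buf.end(), chunks[i].payload.begin(), chunks[i].payload.end());
    }
    out.swap(buf);
    return true;
}
```

Carrying index and total in every chunk costs a few bytes but lets the receiver notice a dropped or reordered packet instead of silently loading a corrupt state.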
 
I think that is a solid plan B. I prefer sequential turns because it feels like a real board game, and it is easy to abuse simultaneous turns in war.
Isn't that basically just multi-player mode with the simultaneous turns turned off anyhow? I'd be willing to bet if players played like this they'd get very few if any OOS errors as it is. In other words, I suspect this 'plan B' is already stock available.

Ok. The underlying connection is TCP, right? We can count on packets arriving in order?

If so, it should be straightforward to read the game state into memory, chunk the bytes up into CvMessageData packets, and unchunk them on the other side.
I was thinking similarly, but if you must do this it may be best to break it up into popups to control the packet feeds: a simple OK button to proceed to the next one between each, so that all the computers on the network are forced to have caught up to the last one before moving on. If the delay incurred while large amounts of data are transferred mid-game gets too long, you'll probably time out the slower computers on the network and thus defeat the whole purpose of the mechanism, as it will inadvertently sever the connection anyhow.

Perhaps even enforce some kind of feedback from all computers to let the primary know that all other computers have completed receipt and reprocessing and thus it can now proceed to the next one thus prompting the next popup.
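
That feedback idea amounts to stop-and-wait flow control. A tiny sketch, with hypothetical names (AckGate, sendAll) and a simulated network where every peer acknowledges immediately: the primary sends one chunk, then refuses to send the next until every other computer has acknowledged it.

```cpp
#include <cassert>
#include <set>
#include <vector>

// Tracks which peers still owe an acknowledgement for the current chunk.
class AckGate {
public:
    explicit AckGate(const std::vector<int>& peers)
        : peers_(peers.begin(), peers.end()) {}

    // Call after sending a chunk: every peer now owes an ack.
    void chunkSent() { pending_ = peers_; }

    // Call when a peer reports it has received and processed the chunk.
    void onAck(int peer) { pending_.erase(peer); }

    // True once all peers have acknowledged the current chunk.
    bool readyForNext() const { return pending_.empty(); }

private:
    std::set<int> peers_;
    std::set<int> pending_;
};

// Simulated transfer in which every peer acks each chunk right away.
// Returns the number of acks processed.
int sendAll(int chunkCount, const std::vector<int>& peers) {
    AckGate gate(peers);
    int acks = 0;
    for (int c = 0; c < chunkCount; ++c) {
        gate.chunkSent();
        for (std::size_t i = 0; i < peers.size(); ++i) {
            gate.onAck(peers[i]);
            ++acks;
        }
        assert(gate.readyForNext());  // only now may the next chunk go out
    }
    return acks;
}
```

In a real transfer the acks would arrive asynchronously, and the gate would be the place to hang a timeout for a peer that never answers.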

This is just a suspicion based on seeing how recalculations later in the game will cause a timeout as it simply takes too long and during this time one computer thinks the other has frozen up so forces a disconnection. At that point in time the only solution is to shut down the game on both computers and reload it so it is a significant problem with the recalculation mechanism already. Usually works fine if it's done then the game is saved and then reloaded on both though.
 
Isn't that basically just multi-player mode with the simultaneous turns turned off anyhow? I'd be willing to bet if players played like this they'd get very few if any OOS errors as it is. In other words, I suspect this 'plan B' is already stock available.

I've tried it in desperation a few times with friends and it would still OOS, seemed no better from memory.
 
FYI, I have very minor progress to report here. I have discovered one cause of the immediate OOS on Pitboss. The random seed for the CvGame object goes out of sync almost immediately (it is serialized to the clients correctly, I verified that). [Interestingly, the map CvRandom seed stays consistent]

Manually setting the seed back (by means of editing the variable value in the DEBUG dll) causes the OOS to go away. Advancing one turn, and it re-appears. So there is some inconsistency in the use of the RNG between the "normal" client and the Pitboss mode. I am attempting to track this inconsistency down.

The main progress is that I have narrowed down the OOS issue for Pitboss from the entirety of the codebase to just the use of the Soren RNG.
 
Progress Update: I have found the cause of the initial OOS when starting a new pitboss game. I have fixed it locally and will add it to the codebase soonish.

There are still tons of OOS issues I need to tackle. I would put my progress on fixing MP at ~5%.

Edit: For the curious, the OOS log at pitboss startup is caused by the Zobrist plot cache, fixed by switching it to async rands...

(From CvPlot constructor)
Code:
	//Afforess: use async rand so as to not pollute mp games
	m_zobristContribution = GC.getASyncRand().getInt();

	if ( !g_plotTypeZobristHashesSet )
	{
		for(int i = 0; i < NUM_PLOT_TYPES; i++)
		{
			//Afforess: use async rand so as to not pollute mp games
			g_plotTypeZobristHashes[i] = GC.getASyncRand().getInt();
		}

		g_plotTypeZobristHashesSet = true;
	}

I am currently tracking down an OOS caused by selecting a new construct or train order when founding a city (from the popup, not from examining the city, then choosing). The OOS is only caused from picking an item from the popup, not from inside of the city screen. The issue seems related to city ids, it is possible the client is founding the city twice, causing IDs to go out of sync. Not 100% sure on that.
 