mp games oos

gdambrauskas

Is that still the case in recent SVN versions? I fixed one major source of OOS two weeks ago that caused very frequent OOS.

Got the SVN code today and gave it a shot; it was actually way worse than our earlier game. Got an OOS on plopping a city (I suspect the free city guardian is causing it), but decided to carry on after saving and reloading. Then there was an OOS every 1-2 turns in the first 10 turns of the game. I have revs on and we play simultaneous turns. In the earlier game I think we made it past 50 turns before an OOS, but now the game is completely unplayable in MP.

I had logging on; the archive below has my and my friend's saves and logs.

http://www.filefactory.com/file/4xpoitumta7/n/Beyond_the_Sword.zip

And I think if anyone just tries to play an MP game with revs on and simultaneous turns, it should be super easy to reproduce.

This is with SVN revision 3459.
 
Also, I have a question (I've been reading up on how to debug OOS problems).

This guide uses logging to MPlog.txt:

http://forums.civfanatics.com/showthread.php?t=188460&highlight=multiplayer+guide

but that logging has actually been removed from the code:

/*************************************************************************************************/
/** Xienwolf Tweak 02/22/09 **/
/** **/
/** Horrible method of logging randoms. Removed to save from uselessly flooding filesize **/
/*************************************************************************************************/
/** ---- Start Original Code ---- **
if (pszLog != NULL)
{
    if (GC.getLogging() && GC.getRandLogging())
    {
        if (GC.getGameINLINE().getTurnSlice() > 0)
        {
            TCHAR szOut[1024];
            sprintf(szOut, "Rand = %d on %d (%s)\n", getSeed(), GC.getGameINLINE().getTurnSlice(), pszLog);


Are the OOS log and the rand number log enough? Wouldn't the above help pinpoint issues more quickly (especially problems where random number generation gets out of sync between players)? I diffed the OOS logs between me and my wife and the only thing different was the rand number, so I was looking at how to debug such problems.

Thanks for any pointers you can give (I was reading the "multiplayer woes" thread to get some ideas on nailing down OOS).
 
I'm so very happy someone is taking a direct and hardcore look into this problem. Our own game is completely useless due to this too and I'm personally lost trying to sort out oos problems. I say try what you can and see if it works and if it does, I don't give a hoot how much longer it takes to process... much better a delay than a restart EVERY FRICKIN ROUND!!! like it's become.

I figured we'd have to wait for AIAndy to be back programming for a fix, but if we get another specialist in this dept on the team I'm going to be absolutely ecstatic! I really only find this game truly enjoyable if it's played in multiplayer anyhow.

As for 'tips', all I can really say is that if the connected computers were having differing random numbers, it's going to cause a lot of OOS problems.

We were finding the biggest difference that seemed to emerge every round had to do with animal spawns. Either she'd have an animal on her game that wasn't on mine or I would have one that wasn't on hers. This led me to think it would've been simply an inability for the random results to stay constant between the connection somehow.

This must've changed because a few versions ago we did quite well with OOS errors - only had them when our units tried to route themselves through opponent territories without open borders. I know that issue was fixed so I was really baffled why we suddenly have such a host of OOS errors again.

Having an idea of what has been changed made me think this might have something to do with the viewport coding and how it interacts with spawns... but we hadn't done any log compares yet so what you're going on sounds like it may be a more accurate assessment.
 
While I can't directly try to debug this while I am away from my modding computer I can give you some pointers. First some general information:

The game code has parts that are async (they only happen on one computer) and some that are synced (they happen on all computers in the same order). Mainly user interaction (help texts and user input, including selecting stuff from popups) and graphic stuff are async.
In general, async code must not change any synced data, and especially must not use the synced RNG. If you need to change game data or initiate commands and the like from async code, you use a message. That message is received by all computers (including the one sending it) and it is handled in sync.
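To make the rule above concrete, here is a minimal sketch of the idea, not the actual Civ4 SDK code. `SyncedRng`, `Message`, `Peer`, `broadcast` and `inSync` are all hypothetical names made up for illustration. Two simulated computers stay in sync as long as changes go through a message that every peer handles; the moment one of them draws from the synced RNG on its own, the seeds no longer match.

```cpp
#include <cstdint>
#include <queue>
#include <vector>

// Hypothetical stand-in for the synced RNG: a small LCG that starts
// with the same seed on every peer.
struct SyncedRng {
    std::uint32_t seed;
    // range must be greater than zero.
    std::uint32_t next(std::uint32_t range) {
        seed = seed * 1103515245u + 12345u;
        return (seed >> 16) % range;
    }
};

// Hypothetical message: the only way async code may request a change.
struct Message {
    int goldDelta;
};

// One simulated computer holding its own copy of the synced state.
struct Peer {
    SyncedRng rng{42};
    int gold = 0;

    // Synced handler: runs on every peer, in the same order.
    void handle(const Message& m) {
        gold += m.goldDelta + static_cast<int>(rng.next(10));
    }
};

// Deliver each queued message to every peer, including the sender.
void broadcast(std::queue<Message>& outbox, std::vector<Peer>& peers) {
    while (!outbox.empty()) {
        for (Peer& p : peers) p.handle(outbox.front());
        outbox.pop();
    }
}

// True when all peers agree on both the game state and the RNG seed.
bool inSync(const std::vector<Peer>& peers) {
    for (const Peer& p : peers)
        if (p.gold != peers[0].gold || p.rng.seed != peers[0].rng.seed)
            return false;
    return true;
}
```

If UI code on one machine calls `peers[0].rng.next(10)` directly instead of queueing a `Message`, `inSync` returns false from then on, which is the kind of desync described above.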

So, how do you debug it when you get an OOS?
The OOS log shows a good part of the game state including the current RNG so you know what is actually different between the computers. If the RNG is the same, then you have to search for code that might touch the stuff that is different in the log. But keep in mind that any desync kind of snowballs. If an async code adds a unit, then that unit will be considered by the AI so it might make different decisions.
The worst kind of snowballing is when the synced RNG is used by async code, because every random number after that one will then be different. A common effect is that different animals spawn (spawning uses quite a lot of random numbers) and the like, even when that is not the source of the problem.
Luckily for that there is the random log. You don't need to log the seed of the RNG on every time slice when you instead log all synced random numbers that are generated. That is what the random log is for: Logging every use of the synced RNG with a specific note/phrase (like "animal spawn") so you know which part of the code generated a random number here.
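A rough sketch of what such a tagged random log could look like. `LoggedRng` is a hypothetical wrapper written for this post, and the log line format is an assumption, not the mod's real random log format.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical synced RNG that records every draw together with a
// caller-supplied tag such as "animal spawn".
class LoggedRng {
public:
    explicit LoggedRng(std::uint32_t seed) : seed_(seed) {}

    // Returns a value in [0, range) and appends one line to the log.
    // range must be greater than zero.
    std::uint32_t get(std::uint32_t range, const std::string& tag) {
        seed_ = seed_ * 1103515245u + 12345u;
        const std::uint32_t value = (seed_ >> 16) % range;
        log_.push_back("Rand = " + std::to_string(value) + " (" + tag + ")");
        return value;
    }

    const std::vector<std::string>& log() const { return log_; }

private:
    std::uint32_t seed_;
    std::vector<std::string> log_;
};
```

Because every call carries a tag, two machines that start from the same seed and make the same calls end up with identical logs, and an extra draw on one machine shows up as an extra tagged line.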
The first difference between the random logs is what counts (because after that everything will be different). Usually you will see a random number generated on only one computer; then you can search the code for the phrase and check whether it was used in an async context, for example by using the active player in synced code, or some Python code that uses the active player for synced stuff, or usage of an uninitialized variable, or ...
Unfortunately it can be quite hard at times to find the actual issue (on the other hand sometimes it is easy).
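Finding that first difference can be done with any diff tool; as a small sketch, assuming each random log has been read into a vector with one entry per line (`firstDivergence` is a hypothetical helper, not part of the mod):

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Returns the index of the first line where the two logs differ.
// If one log is a prefix of the other, returns the shorter length.
// If the logs are identical, returns a.size().
std::size_t firstDivergence(const std::vector<std::string>& a,
                            const std::vector<std::string>& b) {
    const std::size_t common = a.size() < b.size() ? a.size() : b.size();
    for (std::size_t i = 0; i < common; ++i)
        if (a[i] != b[i]) return i;
    return common;
}
```

The tag on the line at the returned index (on whichever machine has it) is the phrase to search the code for; everything after that index can be ignored.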
 
You don't need to log the seed of the RNG on every time slice when you instead log all synced random numbers that are generated. That is what the random log is for: Logging every use of the synced RNG

thanks. The above explains it.

Sorry to disappoint, thunderbrd: I am looking to fix a few OOS I am getting in New Dawn 1.74. It's the most stable version for me. We are able to play with it for 2-3 weeks straight in a single, continuous MP game (with an OOS once per 2-3 hours). I increased the map size in the last game and started getting more frequent OOS (one every 6-10 turns), so I want to figure out what's causing it.
Even New Dawn 1.75-2.0 have way too many OOS for me to tackle right now. The friends I play MP with are very impatient; they want to play, not spend half of the game helping me debug it, so we are sticking with the most stable version. I do like the ideas and improvements in the new versions and C2C, but I am not giving up MP game stability. The problem is that there are so many massive changes and so few people playing MP games: you get 100 changes, then someone plays an MP game and finds something not working, and it is hard to dig the cause out of 100 changes by different people. Once I get rid of the few OOS that trouble our games, I will most likely port a change or two into New Dawn, play another huge MP game, and if it's stable, add more, and so on.
 
As I said... we'd already done a lot of debugging on OOS here in C2C and it was surprising to see the latest version having such a serious problem. But AIAndy's post made a lot of sense as to how one little difference can create a lot of issues. His post there MAY have helped me to start trying to work on some of these though so it may be a win-win nevertheless. Thanks Andy!
 