How Did you Fix Out-of-Sync Errors?!

SJN

So, I'll shamefully admit that I haven't played K-Mod...

But from everything I've read, it sounds like this mod really fixed quite a few BTS multiplayer errors related to "out-of-sync." I wonder if any of the devs could share how they went about finding and fixing these kinds of errors. One of the other mods I play (Legends of Revolution) is completely MP-broken because of OOS, and I'm tempted to go in and try to fix it myself. I'm a pretty advanced developer in both C++ and Python, but have minimal modding experience. Any tips or suggestions would be appreciated.
 
A bit late, and while I haven't been involved in the fixes in question, I can give guidelines on how to fix OOS issues, or ideally avoid them showing up in the first place. I have experience fixing OOS issues in We the People (previously Religion and Revolution), which currently has no known OOS issues. I know it's Colonization, but it's the same engine as BTS, meaning there is no difference for the programmer (unless stated otherwise).

Civ4 is based on parallel execution, meaning that instead of having a host calculate everything and transmit the results, all computers do the same calculations, ideally without any network activity at all. It's great for keeping bandwidth usage down (it was coded back when a significant number of people still used dialup). The issue is that if a calculation, say deciding which plot to move an AI unit to, does not give the same result on all computers, OOS will occur.

Savegames
When a player joins a network game, the game is paused, and the host saves the game and transmits it to the new player. This means that if some game data changes when you save the game and load it again, that data can cause OOS. You need to make sure you load precisely the same game as you saved. It doesn't matter whether you save cached values or recalculate them on load, as long as you end up with the same values as the players already in the game.
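A minimal sketch of the "recalculate on load" option, not taken from any mod. CityData, save() and load() are made-up names for illustration; the real game uses its own stream classes. The point is only that whatever is not saved has to be rebuilt to exactly the value the host has.

```cpp
#include <numeric>
#include <vector>

// Hypothetical city data: per-plot food is saved, the total is a cache.
struct CityData {
    std::vector<int> plotFood;
    int cachedTotalFood;

    void updateCache() {
        cachedTotalFood = std::accumulate(plotFood.begin(), plotFood.end(), 0);
    }
};

// "Save" keeps only the source data; the cache is not written.
std::vector<int> save(const CityData& city) {
    return city.plotFood;
}

// "Load" must rebuild the cache, otherwise the joining player
// starts with a value the other players do not have.
CityData load(const std::vector<int>& saved) {
    CityData city;
    city.plotFood = saved;
    city.updateCache();
    return city;
}
```

If load() skipped updateCache(), the joining player would be out of sync from the first moment the cached value is used.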

Synced vs asynced executed code
You need an awareness of what is calling the code. If the code is called in sync (like doTurn()), then you need to make sure all computers will do the same thing. Often this isn't a problem. You write code like if production this turn + production already used for building > production needed -> build building. That will be the same for all computers because all the input data is by definition the same for all. Same input, same formula results in same output.

Async code, or local code, is code that runs on only one computer. This is often UI related, like opening a city screen, moving a unit etc. The user clicks the mouse or presses a key, which other players will not know about.

Working with synced code
Usually not a huge problem. You can do whatever you want as long as you get the same result everywhere. +1 food if plot produces at least 3 food is perfectly valid because the pre-bonus food production should be the same on all computers.

Things to look out for
Don't use system random. If the AI needs a random number from 0 to 50, it has to be the same number on all computers. However doing this without network communication enters the realm of what sounds like an oxymoron: predictable randomness. This is actually a field of research, but all you need to know is how to use it. Firaxis already wrote code to make the non-random numbers feel random.

Getting a random number is done by calling:
PHP:
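// iNum = 50 gives a value from 0 to 49; the string is only used for the random log
int iRand = GC.getGameINLINE().getSorenRandNum(50, "Describe what the number is used for");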
It's important to notice that whenever it's called, it will use the random seed to give itself a new random seed for the next call. This means if one computer calls it 8 times and another calls it 9 times, even code which calls it correctly will cause OOS because they now get random numbers based on different random seeds. In other words it only works if the code in question is indeed executed in sync.
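To show why the number of calls matters, here is a small self-contained sketch. SyncedRandom is a made-up stand-in using a plain linear congruential generator; it is not the Firaxis implementation, and seedAfterCalls() is only a helper for the demonstration.

```cpp
#include <cstdint>

// Hypothetical stand-in for the game's synced random generator.
// Not the Firaxis implementation; a plain linear congruential generator.
class SyncedRandom {
public:
    explicit SyncedRandom(std::uint32_t seed) : m_seed(seed) {}

    // Returns a value in [0, range) and advances the seed, so every call
    // changes what the next call will return. range must be above 0.
    std::uint32_t next(std::uint32_t range) {
        m_seed = m_seed * 1103515245u + 12345u;
        return ((m_seed >> 16) & 0x7FFFu) % range;
    }

    std::uint32_t seed() const { return m_seed; }

private:
    std::uint32_t m_seed;
};

// Draws `calls` numbers and returns the seed left behind.
std::uint32_t seedAfterCalls(std::uint32_t startSeed, int calls) {
    SyncedRandom rng(startSeed);
    for (int i = 0; i < calls; ++i) {
        rng.next(50);
    }
    return rng.seed();
}
```

Two computers starting from the same seed and making 8 calls each end up with the same seed. If one of them makes a 9th call, the seeds differ and every later random number differs too.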

While data might be the same everywhere, pointer values will never be the same. Don't use pointer values for anything other than what they are: pointers to data. One of the OOS fixes concerns vanilla code that sorts random events by pointer value and then picks the xth event at random. Since the pointers are different on all computers, the list order will be different, hence the xth event will not be the same.
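A sketch of the safe alternative, assuming each event has some synced ID to sort by. EventInfo, lessByID() and pickEventID() are made-up names for illustration, not the code from the actual fix.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical event record; iID stands for any value that is synced.
struct EventInfo {
    int iID;
    int iWeight;
};

// Orders by ID, which is the same on every computer.
// Comparing the pointers themselves would order by memory address instead,
// and addresses differ from machine to machine.
bool lessByID(const EventInfo* a, const EventInfo* b) {
    return a->iID < b->iID;
}

// Returns the ID of the event at position `index` after sorting.
// `index` is assumed to come from the synced random generator,
// and `events` is assumed to be non-empty.
int pickEventID(std::vector<const EventInfo*> events, std::size_t index) {
    std::sort(events.begin(), events.end(), lessByID);
    return events[index % events.size()]->iID;
}
```

With this ordering it doesn't matter where in memory the events live or in which order they were added to the list; the xth event is the same everywhere.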

Working with async code
This is mostly just user input. Any time async code writes to anything that is included in savegames (usually everything), you will cause OOS. That data can only be written to in synced mode. You can however read all data. It's not a bad idea to make all pointers const in async mode, so that you can read what you need from CvPlayer but can't alter the content.
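The const idea could look roughly like this. Player is a made-up, heavily reduced stand-in for CvPlayer, and the two free functions are only there to show the difference between the async and the synced side.

```cpp
// Hypothetical, heavily reduced stand-in for CvPlayer.
class Player {
public:
    explicit Player(int gold) : m_iGold(gold) {}
    int getGold() const { return m_iGold; }
    void changeGold(int change) { m_iGold += change; }

private:
    int m_iGold;
};

// Async (UI) code takes a const reference: reading compiles,
// while calling changeGold() here would be a compile error.
bool canAffordAsync(const Player& player, int cost) {
    return player.getGold() >= cost;
}

// Synced code takes a non-const reference and may write.
void paySynced(Player& player, int cost) {
    player.changeGold(-cost);
}
```

The compiler then catches accidental writes from UI code instead of you finding them later as an OOS.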

Moving async code into sync
This is done by sending network messages. Say we want to change a working plot in a city. First a user starts this by clicking. Since it's user input, it's async as it only runs on that user's computer. The async code then sends a network message with the following line:
PHP:
CvMessageControl::getInstance().sendDoTask(pCity->getID(), TASK_CHANGE_WORKING_PLOT, iIndex, -1, false, bAlt, bShift, bCtrl);

This will go through the CvMessage system, which in turn will forward the created message to the exe. The exe will then transmit data on the network and all computers will then call:
PHP:
void CvCity::doTask(TaskTypes eTask, int iData1, int iData2, bool bOption, bool bAlt, bool bShift, bool bCtrl)
The this pointer (the city) is pCity. This way we transferred the arguments from async execution to synced execution and we are free to write to synced data.
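The flow can be simulated without any network at all. This is a toy model under my own assumptions: TaskMessage, GameState and broadcast() are made up, and broadcast() only stands in for what the exe does when it delivers a message to everybody.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical message carrying the arguments of a user action.
struct TaskMessage {
    int iCityID;
    int iPlotIndex;
};

// Hypothetical per-computer game state: which plot each city works.
struct GameState {
    std::vector<int> workedPlot;  // indexed by city ID

    explicit GameState(std::size_t cities) : workedPlot(cities, -1) {}

    // Synced handler: runs on every computer with identical arguments.
    void doTask(const TaskMessage& msg) {
        workedPlot[msg.iCityID] = msg.iPlotIndex;
    }
};

// Stand-in for the exe: delivers one message to every computer in the game.
void broadcast(std::vector<GameState>& computers, const TaskMessage& msg) {
    for (std::size_t i = 0; i < computers.size(); ++i) {
        computers[i].doTask(msg);
    }
}
```

The click handler never touches workedPlot itself. It only builds the message, and the write happens in doTask() on all computers, including the one where the click happened.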

It should be noted that while the principle is the same, BTS and Colonization differ slightly here. Colonization doesn't have the two CvMessage classes in the DLL. Instead it uses the syntax gDLL->sendDoTask() and has all the send network functions in CvDLLUtilityIFaceBase.h.

BTS has send commands/data in:
CvDLLUtilityIFaceBase.h
CvMessageControl.h/cpp
CvMessageData.h/cpp

Python
CvEventManager.py has the ability to create network data as well. self.Events contains a number of events. Take for instance:
PHP:
CvUtil.EventEditCityName : ('EditCityName', self.__eventEditCityNameApply, self.__eventEditCityNameBegin),
It will call self.__eventEditCityNameBegin when created (should be async), where it opens a popup window. When OK is clicked, it will call self.__eventEditCityNameApply in sync.

There is also a generic int network message, though you will likely prefer to use the C++ network messages if possible. It's a nice feature though, as it allows Python-only mods to avoid OOS.
PHP:
    def onModNetMessage(self, argsList):
        'Called whenever CyMessageControl().sendModNetMessage() is called - this is all for you modders!'
       
        iData1, iData2, iData3, iData4, iData5 = argsList
 
What to do when the game causes OOS and you have no clue why

Knowing what to do to avoid OOS in the first place is good, but you will run into issues where the game causes an OOS for apparently no reason and you have no indication at all as to why it happens. This requires network sync debugging tools, and vanilla provides none. Luckily I started writing some here: https://github.com/We-the-People-civ4col-mod/Mod/commit/013bce446253d664a5d593c017e3288dd4791deb

That commit also fixes an OOS problem. It turned out to be caused by AI culture spread, meaning it just happened during doTurn when the AI cities had culture within a certain threshold. Since this happened in unexplored plots on the test savegame, searching ingame would likely never have uncovered this issue. It was triggered by the culture spread discount (from traits) using currentPlayer instead of the city owner. Current player is the one sitting at the computer. With different traits for the human players, the AI would expand culture at different rates on different computers. This is most likely a worst case scenario for hunting causes for OOS, but it happened, meaning it's not unrealistic. It was a singleplayer bug too, but nobody had noticed the occasional incorrect AI culture spread.

How to set up your computer to run network games on the same computer
One way to test network games is to set up a network, but you can actually run multiple instances of BTS/Colonization on the same computer. Most modern computers have the hardware to handle it just fine. The Civ4 engine usually uses just a single CPU core, so on a modern CPU with 4+ cores each instance gets its own core and running 2-3 instances in parallel doesn't slow them down.

What you do is create a (desktop) shortcut to the exe file. Right click it, open its properties, and add the following to the end of the target field:
Code:
 mod="modname" multiple
Make sure it starts with a space. Modname is the folder name of your mod. Multiple disables the startup check, which rejects starting the game if the exe is already running.

Once you have multiple instances running, make one start a game and the rest join the game at 127.0.0.1. This address always makes a computer connect to itself.

You can use a debugger in network games, but hitting a breakpoint has a high risk of a timeout error. In other words the debugger can be used, but not as carelessly as in single player. It's also a good idea to start one instance, attach the debugger and then start other instances. That way you know which one the debugger is attached to.

How to use the OOS debug tool
First add the code to your DLL file. There is no easy way other than following the diff and applying it yourself. Enable the button in the main interface (it's disabled by default because it's an eyesore, which is obvious once you see it). Clicking the button will send a network message, which will make all computers call CvGame::writeDesyncLog(). This function will create a txt file and start to write to it. The file is written to the base of your mod. The int value of the current player is included in the filename, which allows multiple players to write to the same mod folder. There is no need to add multiple copies of the mod for testing.

You decide what data to write. CvGame::writeDesyncLog() calls writeDesyncLog() in other classes; in the first version (from the link) it calls CvMap, which in turn calls it for all CvPlot. You can add whatever data you want. If you only write synced data, the files should be identical.
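To give an idea of the shape of such a log, here is a stripped down sketch. It is not the code from the linked commit: Plot, desyncLogName() and buildDesyncLog() are made-up names, the file name pattern is an assumption, and the log is built as a string instead of being written to disk so the example stays self-contained.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical plot with a few synced values.
struct Plot {
    int iX;
    int iY;
    int iCulture;
};

// File name includes the player ID so several instances can share a folder.
// The naming pattern is an assumption, not the one used in the linked commit.
std::string desyncLogName(int playerID) {
    std::ostringstream name;
    name << "DesyncLog_player" << playerID << ".txt";
    return name.str();
}

// Builds one line per plot. Only synced data goes in, so the text
// should be identical on all computers while the game is in sync.
std::string buildDesyncLog(const std::vector<Plot>& plots) {
    std::ostringstream out;
    for (std::size_t i = 0; i < plots.size(); ++i) {
        out << "Plot " << plots[i].iX << "," << plots[i].iY
            << " culture=" << plots[i].iCulture << "\n";
    }
    return out.str();
}
```

Keeping one value per line, always in the same order, is what makes the files easy to compare afterwards.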

In short: load a game, get the other player(s) to join, play until desync, click the button and verify that the files showed up.

How to examine the txt files
The idea is that since the files should be identical, all you need to do is locate the lines that differ; those differences are related to the OOS issue. The easiest way to get the differences is to use a diff tool. I usually use SmartSynchronize because it's free and the GUI is good. Good in this case means easier and faster to use.

Chasing the cause of an OOS might require getting the txt files, locating the data that differs, figuring out what else to write to the txt files and trying again. It can take multiple runs to pin down the variable that goes OOS first and then causes all the others to go OOS as well. You only need to fix the cause, not all the results of the cause.
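If you would rather script the comparison than use a GUI diff tool, something like the following would do. It's just a sketch; firstDifferingLine() is a made-up helper that takes the contents of two log files as strings and reports where they first disagree.

```cpp
#include <sstream>
#include <string>

// Returns the 1-based number of the first line that differs between two
// log texts, or 0 if they are identical.
int firstDifferingLine(const std::string& a, const std::string& b) {
    std::istringstream inA(a);
    std::istringstream inB(b);
    std::string lineA;
    std::string lineB;
    int lineNumber = 0;
    for (;;) {
        const bool hasA = !std::getline(inA, lineA).fail();
        const bool hasB = !std::getline(inB, lineB).fail();
        if (!hasA && !hasB) {
            return 0;  // both ended, no difference found
        }
        ++lineNumber;
        if (hasA != hasB || lineA != lineB) {
            return lineNumber;
        }
    }
}
```

The first differing line is usually the most interesting one, since later differences are often just consequences of it.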
 