[TUTORIAL] Fix and avoid OOS issues (DLL and python)

Nightinggale

One often overlooked aspect of modding is network stability, and a number of mods are unplayable in network games because they go out of sync (OOS). That's not surprising, because network documentation is close to non-existent.

Don't use vanilla

It turns out that vanilla itself has OOS issues, so for the best results you should ideally start out with a mod that has already fixed them. K-mod for BTS and We the People for Colonization match that description.

Parallel execution

The civ4 engine is based on parallel execution, meaning rather than communicating results online, each computer will calculate the same data in parallel. This means each computer will have to figure out what the AI is doing and so on instead of relying on a host to do the calculations.

This is a widespread solution for network games in general, and it makes sense when you look into it. It lowers network traffic, which was important for civ4 because it was coded in an era when a number of the intended players were still using dialup. Furthermore, the network is slow compared to local memory. Reading a number from memory can easily be a million times faster than reading it from another computer on the network, even with low ping, and high ping times make the difference even greater. Ping time only causes lag for network traffic, meaning anything that can be done without network communication won't lag even with horrible ping times.

The bad part of parallel execution is that OOS can become an issue. If the calculations somehow end up providing different results on two computers, they will no longer have the same game data and OOS will occur. An OOS game is unplayable, because once two computers are out of sync, most later calculations can start to differ as well (see random below in particular). The two games might no longer agree on where the AI is placing units, what the cities are producing and so on, which quickly turns them into two very different games.

Savegames
When a player joins a network game, the game is paused, and the host saves the game and transmits it to the new player. This means that if some game data changes as a result of saving the game and then loading it, that data can cause OOS. You need to make sure you load precisely the same game as you save. It doesn't matter whether you save cached values or recalculate them on load, just as long as you end up with the same values as the players already in the game.
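As a rough sketch of the save/load rule above, the toy class below stores only its synced data and rebuilds a cached total on load. ToyCity, its members and sameState() are all made up for illustration and are not part of the Civ4 DLL. The sketch is written in modern C++ for brevity, which the compiler used for your DLL may not accept.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <vector>

// Hypothetical city with one piece of synced data and one derived cache.
struct ToyCity
{
    std::vector<int> yields; // synced data, stored in the savegame
    int cachedTotal = 0;     // derived cache, not stored in this sketch

    void recalcCache()
    {
        cachedTotal = 0;
        for (int y : yields) cachedTotal += y;
    }

    // Normal gameplay keeps the cache up to date incrementally.
    void addYield(int y) { yields.push_back(y); cachedTotal += y; }

    void write(std::ostream& out) const
    {
        out << yields.size();
        for (int y : yields) out << ' ' << y;
    }

    void read(std::istream& in)
    {
        std::size_t count = 0;
        in >> count;
        assert(in); // a real loader would handle a broken save properly
        yields.assign(count, 0);
        for (int& y : yields) in >> y;
        recalcCache(); // rebuild the cache so it matches the host
    }
};

// True if a joining player ended up with the same state as the host.
bool sameState(const ToyCity& a, const ToyCity& b)
{
    return a.yields == b.yields && a.cachedTotal == b.cachedTotal;
}
```

Forgetting the recalcCache() call in read() would leave the joining player with a cache of 0 while the host has the real total, which is exactly the kind of difference that later shows up as OOS.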

Synced vs asynced executed code
You need to be aware of what is calling the code. If the code is called in sync (like doTurn()), then you need to make sure all computers will do the same thing. Often this isn't a problem. You write code like "if production this turn + production already spent on the building > production needed, then build the building". That will give the same result on all computers because all the input data is by definition the same for all. Same input and same formula result in the same output.

Async code, or local code, is code that only runs on one computer. This is often UI related, like opening a city screen, moving a unit etc. The user clicks the mouse or presses a key, which the other players' computers will not know about.

Working with synced code
Usually not a huge problem. You can do whatever you want as long as you get the same result everywhere. +1 food if plot produces at least 3 food is perfectly valid because the pre-bonus food production should be the same on all computers.

Things to look out for
Don't use system random. If the AI needs a random number from 0 to 50, it has to be the same number on all computers. However doing this without network communication enters the realm of what sounds like an oxymoron: predictable randomness. This is actually a field of research, but all you need to know is how to use it. Firaxis already wrote code to make the non-random numbers feel random.

Getting a random number is done by calling:
Code:
GC.getGameINLINE().getSorenRandNum(51, "AI example roll"); // returns 0 to 50; the string is only a log label
It's important to notice that whenever it's called, it will use the random seed to give itself a new random seed for the next call. This means if one computer calls it 8 times and another calls it 9 times, even code which calls it correctly will cause OOS because they now get random numbers based on different random seeds. In other words it only works if the code in question is indeed executed in sync.
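A minimal sketch of why the number of calls matters. ToyRand is a hypothetical stand-in built on a simple linear congruential generator. It is not the Firaxis implementation, and seedAfterCalls() is only a helper for the illustration.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the synced game RNG (NOT the real Civ4 code).
// Each call derives the next seed from the current one, so the sequence is
// fully determined by the starting seed and the number of calls so far.
class ToyRand
{
public:
    explicit ToyRand(std::uint32_t seed) : m_seed(seed) {}

    // Returns a number in [0, range) and advances the seed.
    std::uint32_t get(std::uint32_t range)
    {
        assert(range > 0);
        m_seed = m_seed * 1103515245u + 12345u; // classic LCG step
        return (m_seed >> 16) % range;
    }

    std::uint32_t seed() const { return m_seed; }

private:
    std::uint32_t m_seed;
};

// Calls the generator 'calls' times and returns the seed it ends up with.
std::uint32_t seedAfterCalls(std::uint32_t startSeed, int calls)
{
    ToyRand rand(startSeed);
    for (int i = 0; i < calls; ++i)
    {
        rand.get(51); // a roll from 0 to 50
    }
    return rand.seed();
}
```

Two computers that start from the same seed and make 8 calls each end up with the same seed. If one of them makes a 9th call, its seed differs, and every later roll on that computer is based on a different seed even though each individual call was made correctly.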

While data might be the same everywhere, pointer values cannot be expected to be the same. Don't use pointer values at all for anything other than what they are: pointers to data. One of the vanilla OOS bugs that has been fixed was sorting random events by pointer value and then picking the xth event at random. Since the pointers differ between computers, the order of the list differs too, hence the xth event will not be the same.
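The usual way around the pointer problem is to sort by a value that is part of the synced data. The sketch below assumes a made-up ToyEvent struct with an id field and a made-up pickEventId() function. It is not the actual random event code. It uses a C++11 lambda for brevity, and an older compiler would need a comparison functor instead.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical event record; 'id' stands for any value that is identical on
// all computers (for instance an index loaded from XML).
struct ToyEvent
{
    int id;
};

// Sorts by synced id, never by address, then picks the element at 'index'.
// In a real game 'index' would come from the synced random generator.
int pickEventId(std::vector<const ToyEvent*> events, std::size_t index)
{
    assert(index < events.size());
    std::sort(events.begin(), events.end(),
        [](const ToyEvent* a, const ToyEvent* b) { return a->id < b->id; });
    return events[index]->id;
}
```

Because the sort key is the id, the order in which the pointers were collected and where the objects happen to live in memory no longer affect which event is picked.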

Working with async code
This is mostly just user input. Any time async code writes to anything that is included in savegames (usually everything), you will cause OOS. That data can only be written to in synced mode. You can however read all data. It's not a bad idea to make all pointers const in async code. That way you can read what you need from, say, CvPlayer, but you can't alter the content.
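A small sketch of the const idea. ToyPlayer and goldForDisplay() are hypothetical names used only to show the pattern, not real DLL classes or functions.

```cpp
#include <cassert>

// Hypothetical stand-in for a class holding synced data, such as CvPlayer.
class ToyPlayer
{
public:
    ToyPlayer() : m_gold(0) {}
    int getGold() const { return m_gold; }    // reading is safe in async code
    void setGold(int gold) { m_gold = gold; } // writing belongs in synced code
private:
    int m_gold;
};

// Async (UI) helper. It takes a pointer to const, so it can read, but a
// call to pPlayer->setGold() in here would fail to compile.
int goldForDisplay(const ToyPlayer* pPlayer)
{
    assert(pPlayer != nullptr);
    return pPlayer->getGold();
}
```

The compiler then catches accidental writes from UI code at build time instead of you finding them as an OOS in a network game.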

Moving async code into sync
This is done by sending network messages. Say we want to change a working plot in a city. First a user starts this by clicking. Since it's user input, it's async, as it only runs on that user's computer. The click handler sends the request with the following line:
Code:
CvMessageControl::getInstance().sendDoTask(pCity->getID(), TASK_CHANGE_WORKING_PLOT, iIndex, -1, false, bAlt, bShift, bCtrl);

This will go through the CvMessage system, which in turn will forward the created message to the exe. The exe will then transmit data on the network and all computers will then call:
Code:
void CvCity::doTask(TaskTypes eTask, int iData1, int iData2, bool bOption, bool bAlt, bool bShift, bool bCtrl)
Here the this pointer is the city that pCity referred to on the sending computer, found on each computer from the transmitted city ID. This way we transferred the arguments from async execution to synced execution and we are free to write to synced data.
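The round trip can be pictured with the toy model below. ToyTaskMessage, ToyGame and broadcast() are invented for the illustration. The real transport lives in the exe and is not shown here.

```cpp
#include <cassert>
#include <vector>

// Hypothetical message; the real one carries similar ints and bools.
struct ToyTaskMessage
{
    int cityId;
    int plotIndex;
};

// One running copy of the game. 'workedPlot' stands for synced data.
struct ToyGame
{
    std::vector<int> workedPlot; // indexed by city id, -1 means none

    explicit ToyGame(int numCities) : workedPlot(numCities, -1) {}

    // Synced handler: runs on every computer with identical arguments.
    void doTask(const ToyTaskMessage& msg)
    {
        assert(msg.cityId >= 0 && msg.cityId < static_cast<int>(workedPlot.size()));
        workedPlot[msg.cityId] = msg.plotIndex;
    }
};

// Stand-in for the exe: delivers the same message to every computer,
// including the one that sent it.
void broadcast(std::vector<ToyGame>& computers, const ToyTaskMessage& msg)
{
    for (ToyGame& game : computers)
    {
        game.doTask(msg);
    }
}
```

The important detail is that the clicking computer does not change its own data directly. It only sends the message, and its own copy of the game is updated by the same doTask() call as everybody else's.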

It should be noted that while the principle is the same, BTS and Colonization differ slightly here. Colonization doesn't have the two CvMessage classes in the DLL. Instead it uses the syntax gDLL->sendDoTask() and has all the send network functions in CvDLLUtilityIFaceBase.h.

BTS has send commands/data in:
CvDLLUtilityIFaceBase.h
CvMessageControl.h/cpp
CvMessageData.h/cpp

Python
CvEventManager.py has the ability to create network data as well. self.Events contains a number of events. Take for instance:
Code:
CvUtil.EventEditCityName : ('EditCityName', self.__eventEditCityNameApply, self.__eventEditCityNameBegin),
It will call self.__eventEditCityNameBegin when the event is created (this should be async), where it opens a popup window. When OK is clicked, it will call self.__eventEditCityNameApply in sync.

There is also a generic int network message, though you will likely prefer to use the C++ network messages if possible. It's a nice feature though, as it allows Python-only mods to avoid OOS.
Code:
    def onModNetMessage(self, argsList):
        'Called whenever CyMessageControl().sendModNetMessage() is called - this is all for you modders!'

        iData1, iData2, iData3, iData4, iData5 = argsList

What to do when the game causes OOS and you have no clue why

Knowing what to do to avoid OOS in the first place is good, but you will run into cases where the game goes OOS for no apparent reason and you have no indication at all as to why it happens. This requires network sync debugging tools, and vanilla provides none. Luckily I started writing some here: https://github.com/We-the-People-civ4col-mod/Mod/commit/013bce446253d664a5d593c017e3288dd4791deb

This commit also fixes an OOS problem. It turned out to be caused by AI culture spread, meaning it just happened during doTurn when the AI cities had culture within a certain threshold. Since this happened in unexplored plots in the test savegame, searching ingame would likely never have uncovered the issue. It was triggered because the culture spread discount (from traits) was using currentPlayer instead of the city owner. The current player is the one sitting at the computer. With different traits for the human players, the AI would expand culture at different rates on different computers. This is most likely a worst case scenario for hunting the cause of an OOS, but it happened, meaning it's not unrealistic. It was a singleplayer bug too, but nobody had noticed the occasional incorrect AI culture spread.
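The shape of that bug can be reduced to the sketch below. All names and numbers are made up for the illustration, and the actual code in the commit looks different.

```cpp
#include <cassert>
#include <vector>

// Hypothetical per-player trait data; the percentages are invented.
struct ToyPlayerTraits
{
    int cultureDiscountPercent;
};

int discountedCost(int baseCost, int discountPercent)
{
    assert(discountPercent >= 0 && discountPercent <= 100);
    return baseCost * (100 - discountPercent) / 100;
}

// Buggy: depends on who sits at this computer, so each computer can get
// a different answer for the same city.
int buggyCultureCost(const std::vector<ToyPlayerTraits>& players, int activePlayer, int baseCost)
{
    return discountedCost(baseCost, players[activePlayer].cultureDiscountPercent);
}

// Fixed: depends only on the city owner, which is synced data.
int fixedCultureCost(const std::vector<ToyPlayerTraits>& players, int cityOwner, int baseCost)
{
    return discountedCost(baseCost, players[cityOwner].cultureDiscountPercent);
}
```

With discounts of 25, 0 and 10 percent for players 0, 1 and 2, and an AI city owned by player 2, the buggy version returns 75 on the computer of player 0 and 100 on the computer of player 1 for a base cost of 100. The fixed version returns 90 on both.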

How to set up your computer to run network games on the same computer
One way to test network games is to set up a network, but you can actually run multiple instances of BTS/Colonization on the same computer. Most modern computers have the hardware to handle it just fine. The civ4 engine usually uses just a single CPU core, so on a modern CPU with 4 or more cores each instance simply gets its own core, and running 2-3 instances in parallel doesn't slow the game down.

What you do is create a (desktop) shortcut to the exe file. Right click it, open Properties and append the following to the Target field:
Code:
 mod="modname" multiple
Make sure it starts with a space. Modname is the folder name of your mod. Multiple disables the startup check, which rejects starting the game if the exe is already running.

Once you have multiple instances running, make one start a game and have the rest join the game at 127.0.0.1. This address will always make a computer connect to itself.

You can use a debugger in network games, but hitting a breakpoint has a high risk of a timeout error. In other words the debugger can be used, but not as carelessly as in single player. It's also a good idea to start one instance, attach the debugger and then start other instances. That way you know which one the debugger is attached to.

How to use the OOS debug tool
First add the code to your DLL. There is no easy way other than following the diff and applying it yourself. Enable the button in the main interface (it's disabled by default because it's an eyesore, which is obvious once you see it). Clicking the button sends a network message, which makes all computers call CvGame::writeDesyncLog(). This function creates a txt file in the base folder of your mod and writes to it. The int value of the current player is included in the filename, which allows multiple players to write to the same mod folder. There is no need to install multiple copies of the mod for testing.

You decide what data to write. CvGame::writeDesyncLog() calls writeDesyncLog() in other classes. In the first version (from the link) it calls CvMap, which in turn calls it for every CvPlot. You can add whatever data you want. If you only write synced data, the files should be identical as long as the game is in sync.
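The pattern of one class forwarding the call to the classes it owns could look roughly like the sketch below. ToyPlot, ToyMap, the chosen fields and the line format are assumptions for the example. The real function signatures and file format are in the linked commit.

```cpp
#include <ostream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical plot with a couple of synced values.
struct ToyPlot
{
    int x;
    int y;
    int owner;   // player id, -1 for unowned
    int culture;

    // One labelled line per plot, so a diff tool shows what differs and where.
    void writeDesyncLog(std::ostream& out) const
    {
        out << "plot " << x << "," << y
            << " owner=" << owner
            << " culture=" << culture << '\n';
    }
};

struct ToyMap
{
    std::vector<ToyPlot> plots;

    // Plots are written in index order, which is the same on all computers.
    void writeDesyncLog(std::ostream& out) const
    {
        for (const ToyPlot& plot : plots)
        {
            plot.writeDesyncLog(out);
        }
    }
};

// Convenience helper for the sketch: returns the whole log as a string.
std::string desyncLogText(const ToyMap& map)
{
    std::ostringstream out;
    map.writeDesyncLog(out);
    return out.str();
}
```

Writing in a fixed order and labelling each value keeps the files line-by-line comparable, so a single differing culture value shows up as a single differing line.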

In short: load a game and get the other player(s) to join. Play until desync. Click the button and verify that the files showed up.

How to examine the txt files
The idea is that since the files should be identical, all you need to do is locate the lines that differ, and those differences are related to the OOS issue. The easiest way to get the differences is to use a diff tool. I usually use SmartSynchronize because it's free and the GUI is good. Good in this case means easier and faster to use.

Chasing the cause of an OOS might require getting the txt files, locating the data that differs, figuring out what else to write to the txt files and trying again. It can take multiple runs to pin down the variable that goes OOS first and then causes all the others to go OOS as well. You only need to fix the cause, not all the results of the cause.
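A diff tool is the comfortable option, but if you want to automate the check, finding the first differing line takes only a few lines of code. The helper below is a hypothetical sketch working on two log texts already loaded into strings. It is not part of the linked commit.

```cpp
#include <sstream>
#include <string>

// Returns the 1-based number of the first line that differs between the
// two texts, or 0 if they are identical line by line.
int firstDifferingLine(const std::string& textA, const std::string& textB)
{
    std::istringstream a(textA);
    std::istringstream b(textB);
    std::string lineA;
    std::string lineB;
    int lineNumber = 0;
    while (true)
    {
        const bool hasA = static_cast<bool>(std::getline(a, lineA));
        const bool hasB = static_cast<bool>(std::getline(b, lineB));
        if (!hasA && !hasB)
        {
            return 0; // both ended without any difference
        }
        ++lineNumber;
        if (hasA != hasB || lineA != lineB)
        {
            return lineNumber; // one file ended early or the lines differ
        }
    }
}
```

The first differing line is usually the most interesting one, since later differences are often just consequences of it.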