Modder's guide to multiplayer safe code

AIAndy

Deity
Joined
Jun 8, 2011
Messages
3,428
In this post I will give some detailed information about how the multiplayer synchronization works in Civ4 and how to avoid and find OOS errors.
Everyone dealing with C++ or Python code or importing either needs to keep this in mind.

How does the Civ4 synchronization model work?
The full gamestate is kept on all computers at the same time. That is all information that has a direct effect on gameplay like the unit and city positions, what a city currently produces and what each player researches. It also includes the full state of each AI.
To keep it that way, all code that changes the gamestate must run on all computers in the same order and it must be deterministic. This also means that code that changes the gamestate only depends on the synchronized gamestate.
So if S is the synchronized gamestate and F and G are functions that represent a change of the gamestate, then all computers must run F and G in the same order and the result G(F(S)) is synchronized as well.

Synchronization and Random Numbers
So how do you keep the code deterministic but still have random numbers? Well, pseudo random number generators actually generate a deterministic sequence of numbers depending on the seed. So if all computers start with the same seed they will get the same random numbers. That means the state of the RNG is part of the gamestate. It is synchronized. That also means using it changes the gamestate so it may only be used in code that runs on all computers, in synchronized code.
There is also an async RNG which you can use to generate random numbers in code that is not synced.

Which code is synchronized?
In general all code that changes the game state is synchronized, all code that deals with the user interface is not. There are some functions that are used both in async and in sync context. In which context they are called is determined by a bool that is passed so you need to keep that in mind and use the right RNG depending on that bool.
Pretty much all synchronized code starts either from CvGame::update or from messages.

How to initiate synced code from async code and vice versa
Async code is often started from user input like pressing a button or the result of a popup. If you only want to display some information screen, that can be done directly in async code. But lets assume pressing that button should actually initiate a command on the unit which has a gameplay effect. Then you need to start synced code from the async user input. That is done with messages. A message can contain any amount of information, which is serialized and passed to all computers, including the one sending the message. The Civ4 exe makes sure that all messages are handled on all computers in the same order so the code executed from messages is synced.
Sometimes you want to change the UI or camera on one computer only starting from synced code. In that case you can simply use getActivePlayer in CvGame which will be different on the different computers so "if (GC.getGameINLINE().getActivePlayer() == iPlayer) doSomething()" will execute doSomething on one computer only.

Caches and Synchronization
For performance reasons a lot of expensive information is cached. If you add a cache, you can choose between two synchronization variants.
Variant 1 is to consider the cache state synchronized. That means the restriction that only synchronized code may change the cache state applies. Async code may read from it, but if it is a cache miss, the result of the uncached calculation may not be added to the cache.
Variant 2 is to consider the query result synchronized. That means that the cached result and the result of the uncached calculation must be equal at all times. So if anything the calculation is based on changes, the cache entry must be marked as dirty.

Multi-Threading and Synchronization
To add multiple threads to parts of the code you usually start them at some point and then join them to the main thread at a later point. There is no guarantee on the execution order in between the threads so one might be executed entirely before the others or completely mixed. To keep the game state synchronization, the result of the threads must be deterministic and must not depend on the execution order.
This will often be the case if you are just gathering information or valueing buildings and the like with one important problem: the RNG. Using the synced RNG changes the game state so any use of it in the threads would make the result non deterministic.
You can solve this using this pattern: Generate one random number in the main thread for each worker thread. Use that as the seed of a thread-specific RNG. Now the threads can use their specific RNGs and the result will be deterministic again.

Finding out of sync errors
Out of sync errors (OOS) are very difficult to find without a good OOS report from a user. The main tools are the OOS log and the Random log. Both are generated on each computer participating in a multiplayer game if the respective INI values are set. The important information is in the difference between the logs. The OOS log contains a good part of the game state in text form that is also used to generate a checksum to recognize out of sync states. The Random log contains all uses and results of the synced RNG.
If the OOS log shows that the state of the random number generator is different, then you can look into the Random log to see at which point it started to be different. A common source is that the RNG is used in async context which will result in a RNG usage that can only be seen in one log. Use a program like the merge program in Tortoise to compare the different logs and when you have found the first different random number, then search with a program like Windows Grep where the log phrase of the RNG usage is in the code files.
Mind in general that a state out of sync snowballs, especially a bad usage of the RNG as all following random numbers will be different as well so it can sometimes be hard to find the exact root cause of an out of sync as a lot of other things have gone out of sync before it is recognized.
An example could be that a promotion is set from async context which gives a unit double movement on forest which means it generates a different path and therefore might fight against different animals and you might see a subdued animal generated on one computer but the root cause was actually a bad promotion.
 
Multi-Threading and Synchronization
To add multiple threads to parts of the code you usually start them at some point and then join them to the main thread at a later point. There is no guarantee on the execution order in between the threads so one might be executed entirely before the others or completely mixed. To keep the game state synchronization, the result of the threads must be deterministic and must not depend on the execution order.
This will often be the case if you are just gathering information or valueing buildings and the like with one important problem: the RNG. Using the synced RNG changes the game state so any use of it in the threads would make the result non deterministic.
You can solve this using this pattern: Generate one random number in the main thread for each worker thread. Use that as the seed of a thread-specific RNG. Now the threads can use their specific RNGs and the result will be deterministic again.

...provided the work the multiple threads are doing does not interact with intermediate results of one another. For example, consider trying to parallelize unit movement. You'd have to (for example) adds locks to plots (or groups of plots) to ensure that movement was legal (e.g. - stack limits) if two units were considering the same plot. However, the order in which the processing reached that plot might cause different outcomes if unit A's presence prevents that of unit B (and visa versa).

Interactions between the thread workloads like the above example are much harder to handle, and may preclude more aggressive forms of multi-threading of parts of the code. Unit movement is probably the best example of this - it should be fairly readily possible to evaluate different option for ONE unit in parallel, but trying to process two entirely different units at once will be very hard to reconcile with the Civ synchronization model. The problem then is that just evaluating different options in parallel (for one unit at a time) may well not lead to much benefit (you wind up evaluating lots of things that a serial evaluation wouldn't have bothered with because a high priority action turns out to be executable quickly). This may even slow things down (lots of threads doing work that is largely thrown away will reduce the turbo core speedups that are available with a single thread, and if that is actually the critical path the net result will be slower processing then just doing I all serially).

I suspect that we'll wind up implementing things so that certain elements (unit movement in the AI is the one I most have in mind) will be parallelized only for single player games, and run serially if in a multi-player game.
 
...provided the work the multiple threads are doing does not interact with intermediate results of one another. For example, consider trying to parallelize unit movement. You'd have to (for example) adds locks to plots (or groups of plots) to ensure that movement was legal (e.g. - stack limits) if two units were considering the same plot. However, the order in which the processing reached that plot might cause different outcomes if unit A's presence prevents that of unit B (and visa versa).

Interactions between the thread workloads like the above example are much harder to handle, and may preclude more aggressive forms of multi-threading of parts of the code. Unit movement is probably the best example of this - it should be fairly readily possible to evaluate different option for ONE unit in parallel, but trying to process two entirely different units at once will be very hard to reconcile with the Civ synchronization model. The problem then is that just evaluating different options in parallel (for one unit at a time) may well not lead to much benefit (you wind up evaluating lots of things that a serial evaluation wouldn't have bothered with because a high priority action turns out to be executable quickly). This may even slow things down (lots of threads doing work that is largely thrown away will reduce the turbo core speedups that are available with a single thread, and if that is actually the critical path the net result will be slower processing then just doing I all serially).

I suspect that we'll wind up implementing things so that certain elements (unit movement in the AI is the one I most have in mind) will be parallelized only for single player games, and run serially if in a multi-player game.
Anything that is not deterministic WILL break the synchronization. So in most cases where you would use locks on gamestate it won't work in multiplayer.
Of course you have some different possibilities for parallelization in multiplayer as there are multiple computers that can compute different things (async) and then message the result.
 
Ok. Might I ask what 'deterministic' actually means in this regard?

Honestly, I'm sure this is a really well detailed explanation, but it makes me feel like a child trying to understand adults talking about stockbrokering. In this analogy, as a child, I'm human and can get much of what's being said but the entire message is way too confusing to my current knowledge base to grasp.
 
Ok. Might I ask what 'deterministic' actually means in this regard?

Honestly, I'm sure this is a really well detailed explanation, but it makes me feel like a child trying to understand adults talking about stockbrokering. In this analogy, as a child, I'm human and can get much of what's being said but the entire message is way too confusing to my current knowledge base to grasp.
Deterministic means that if you run the code on a specific game state then the result will always be the same. That makes sure that if the game state is synchronized before between the computers, then it will also be synchronized afterwards because the result of the code is the same.

http://en.wikipedia.org/wiki/Deterministic_algorithm
 
Top Bottom