OOS problems and how to fix them

That is the pipeline multithreading Koshling added. Each pipeline work item gets its own rand seed so it does not matter if the order between different work items is swapped.
The names of the pipeline work items in this case are Cahokia and Poverty Point. If a log entry is not from a pipeline, you get Global at that point.
All log entries that have the same name at that point have to be in fixed order but different names can be swapped without problem (and you will see that frequently).
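
To illustrate why a private seed per work item makes the order between items irrelevant, here is a minimal sketch. The `WorkItemRand` struct, the `runItems` helper and the seed values are all hypothetical, and the generator is a plain LCG standing in for whatever the game really uses.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for a per-work-item random stream (a simple LCG,
// not the generator the game actually uses).
struct WorkItemRand {
    std::uint32_t seed;
    std::uint32_t next() {
        seed = seed * 1103515245u + 12345u;
        return (seed >> 16) & 0x7FFFu;
    }
};

// Runs the named work items in the given order, drawing `draws` numbers from
// each item's private stream, and returns the drawn values keyed by item name.
std::map<std::string, std::vector<std::uint32_t>> runItems(
    const std::vector<std::string>& order,
    const std::map<std::string, std::uint32_t>& seeds,
    int draws)
{
    assert(draws >= 0);
    std::map<std::string, std::vector<std::uint32_t>> results;
    for (const std::string& name : order) {
        WorkItemRand rng{seeds.at(name)};  // each item starts from its own seed
        for (int i = 0; i < draws; ++i) {
            results[name].push_back(rng.next());
        }
    }
    return results;
}
```

Running the items as Cahokia then Poverty Point, or the other way round, gives the same numbers per item, because no item ever draws from another item's stream. With a single shared stream the second item would see different numbers depending on who went first.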

Sorry AIAndy but I'm not sure I get it: are you saying that this was not an OOS, so maybe the OOS root was further down in differences found in the log? Is it correct that I see this kind of difference in the logs and the OOS is somewhere else?


One suspicious thing I found was that AI_TechValueCached, which is on the way to "AI Research", can be called both sync and async, depending on the bool bAsync that is passed to it. But it caches values in the same cache in both cases. So if that function is called in both sync and async context for the same player, it will likely go OOS.
So I would highly suggest changing
Code:
{
	MEMORY_TRACK_EXEMPT()
	m_cachedTechValues[eTech] = iValue;
}
to
Code:
if (!bAsync)
{
	MEMORY_TRACK_EXEMPT()
	m_cachedTechValues[eTech] = iValue;
}
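
A stripped-down model of the idea behind that change, as a sketch only: the `TechValueCache` class and its `value` method are hypothetical and much simpler than the real function. The assumption is that async calls are not guaranteed to happen identically on every machine, so anything they write into a shared cache can later be read back by synced code on one machine only.

```cpp
#include <cassert>
#include <map>

// Hypothetical, simplified model of a per-player tech value cache.
class TechValueCache {
public:
    // Returns the cached value if there is one. Otherwise returns the freshly
    // computed value and stores it only when the call is synchronous.
    int value(int eTech, bool bAsync, int computedValue) {
        std::map<int, int>::const_iterator it = m_cachedTechValues.find(eTech);
        if (it != m_cachedTechValues.end()) {
            return it->second;
        }
        if (!bAsync) {
            m_cachedTechValues[eTech] = computedValue;
        }
        return computedValue;
    }

    bool isCached(int eTech) const {
        return m_cachedTechValues.count(eTech) != 0;
    }

private:
    std::map<int, int> m_cachedTechValues;
};
```

With the guard in place, an async call can still read the cache but never fills it, so the cache contents depend only on synced calls.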

Great, I can try this one. Did you come to this conclusion from one of the above reports? (Never mind, I got it.) If so, I can test the new code and see if that particular OOS log appears again. Thank you. :)
 
Sorry AIAndy but I'm not sure I get it: are you saying that this was not an OOS, so maybe the OOS root was further down in differences found in the log? Is it correct that I see this kind of difference in the logs and the OOS is somewhere else?
He's saying it shouldn't be considered an error. Multithreading (never seen in CivIV before, but very common in more modern software) uses more than one processor core to get results faster, so some of the things that show up in the logs have to be allowed to take place in a different order on each computer. In theory these checks happen out of order, but each system should be brought back into a synced state afterwards. So you'll see those differences, but they don't necessarily represent a real OOS origin point, particularly when the out-of-order entries still have the same values and synchronization proves solid after the shuffling.
 
Correct :)
 
That is the pipeline multithreading Koshling added. Each pipeline work item gets its own rand seed so it does not matter if the order between different work items is swapped.
The names of the pipeline work items in this case are Cahokia and Poverty Point. If a log entry is not from a pipeline, you get Global at that point.
All log entries that have the same name at that point have to be in fixed order but different names can be swapped without problem (and you will see that frequently).


That means that, because of some if condition, the two computers followed different code paths.
Now you backtrack from both positions and try to find the point where the paths were still joined, and which condition and values would have caused them to split. If you are lucky you find one of the standard OOS situations.
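
Before backtracking, it can help to filter out the harmless swaps between different work items. The sketch below assumes each log entry has already been split into a work item name ("Global" when it is not from a pipeline) and the rest of the line; the `LogEntry` layout and both function names are hypothetical, not part of any existing tool.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>
#include <vector>

// One random-log entry: (work item name, rest of the line).
// This layout is an assumption made for the sketch.
typedef std::pair<std::string, std::string> LogEntry;
typedef std::map<std::string, std::vector<std::string> > GroupedLog;

// Splits a log into one ordered sequence of lines per work item name.
GroupedLog groupByName(const std::vector<LogEntry>& log) {
    GroupedLog grouped;
    for (const LogEntry& entry : log) {
        grouped[entry.first].push_back(entry.second);
    }
    return grouped;
}

// Returns the alphabetically first work item name whose sequences differ
// between the two logs, or an empty string when the logs only differ by
// swaps between different names.
std::string firstDivergentName(const std::vector<LogEntry>& a,
                               const std::vector<LogEntry>& b) {
    const GroupedLog ga = groupByName(a);
    const GroupedLog gb = groupByName(b);
    for (const auto& item : ga) {
        GroupedLog::const_iterator other = gb.find(item.first);
        if (other == gb.end() || other->second != item.second) {
            return item.first;
        }
    }
    for (const auto& item : gb) {
        if (ga.find(item.first) == ga.end()) {
            return item.first;
        }
    }
    return std::string();
}
```

An empty result means the random logs agree once the multithreading swaps are ignored; a non-empty result names the work item whose entries are worth backtracking from.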

One suspicious thing I found was that AI_TechValueCached, which is on the way to "AI Research", can be called both sync and async, depending on the bool bAsync that is passed to it. But it caches values in the same cache in both cases. So if that function is called in both sync and async context for the same player, it will likely go OOS.
So I would highly suggest changing
Code:
{
	MEMORY_TRACK_EXEMPT()
	m_cachedTechValues[eTech] = iValue;
}
to
Code:
if (!bAsync)
{
	MEMORY_TRACK_EXEMPT()
	m_cachedTechValues[eTech] = iValue;
}

We should also add this fix to Caveman2Cosmos.
 
He's saying it shouldn't be considered an error. Multithreading (never seen in CivIV before, but very common in more modern software) uses more than one processor core to get results faster, so some of the things that show up in the logs have to be allowed to take place in a different order on each computer. In theory these checks happen out of order, but each system should be brought back into a synced state afterwards. So you'll see those differences, but they don't necessarily represent a real OOS origin point, particularly when the out-of-order entries still have the same values and synchronization proves solid after the shuffling.

Ok, but wait a minute: I get an OOS error, I check the logs and see these differences caused by multithreading scattered all over the place in the Random Log. But after I check all of them, I see no difference aside from their order. So what caused that OOS? If the order is not important so long as the entries have the same values, there should be something else in the random log pointing at the OOS cause, correct? But from what I've seen my logs are identical for both players (besides the order of entries, as I said).
 
We should also add this fix to Caveman2Cosmos.

Definitely; I'm now testing AND with that fix in multiplayer and we have had no OOS so far after about 100 turns (usually we don't get beyond 40 turns without an OOS, and then it gets worse, going OOS almost every couple of turns).

Edit: to avoid OOS as much as possible, I've disabled multithreading in my current game. I'm also using the AI pathing system under BUG options. I'm not sure, but I think that having players' units move in a different way than AI units is causing trouble. Right now I'm at 180 turns with only 2 OOS, and one was almost certainly caused by me doing something wrong (i.e. building the same wonder in 2 cities at the same time).
 
We should also add this fix to Caveman2Cosmos.
I committed the change now.

45°38'N-13°47'E;13026538 said:
Ok, but wait a minute: I get an OOS error, I check the logs and see these differences caused by multithreading scattered all over the place in the Random Log. But after I check all of them, I see no difference aside from their order. So what caused that OOS? If the order is not important so long as the entries have the same values, there should be something else in the random log pointing at the OOS cause, correct? But from what I've seen my logs are identical for both players (besides the order of entries, as I said).
Was the random seed in the OOS logs actually different?
 
Do you mean Next Map Rand Value or Next Soren Rand Value? No, they were both identical for both players. But they're not always different when the game goes OOS from what I've seen.
 
When they are the same then the random log won't help as it will be the same except for the swaps from multithreading.
But in the OOS log something else should be different then.
 
Sure, here are the differences:

Code:
Unit Info:
----------
Player 0, Unit ID: 24577, Worker 1 (St. Petersburg)
X: 85, Y: 51
Damage: 0
Experience: 6
Level: 1
Promotions:

Player 0, Unit ID: 16386, Javelineer 2 (Moscow)


Code:
Unit Info:
----------
Player 0, Unit ID: 24577, Worker
X: 85, Y: 51
Damage: 0
Experience: 6
Level: 1
Promotions:

Player 0, Unit ID: 16386, Javelineer


And it goes on like this for a few lines; it looks like the only difference is the originating city in the unit name (it is not displayed for one of the players).
 
Name differences don't matter as they are local only.
I guess comparison would be easier if we did not include the names at all, but then it is sometimes useful to know which unit it actually was if one is different.

In other words, ignore the names and look for other differences in the OOS log.
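
One way to follow that advice when diffing is to cut the name field off each unit header line before comparing the two OOS logs. This is only a sketch: `stripUnitName` is a hypothetical helper, and it assumes unit headers always look like the `Player N, Unit ID: M, name` lines shown above.

```cpp
#include <cassert>
#include <string>

// Keeps "Player N, Unit ID: M" and drops everything after the ID, i.e. the
// unit name, which is local-only. Lines that do not look like unit headers
// are returned unchanged.
std::string stripUnitName(const std::string& line) {
    const std::string marker = "Unit ID: ";
    const std::string::size_type idPos = line.find(marker);
    if (line.compare(0, 7, "Player ") != 0 || idPos == std::string::npos) {
        return line;
    }
    const std::string::size_type comma = line.find(',', idPos + marker.size());
    if (comma == std::string::npos) {
        return line;  // no name field present
    }
    return line.substr(0, comma);
}
```

After this normalization, "Worker 1 (St. Petersburg)" and "Worker" compare equal, so only real differences such as position, damage or experience remain. Keep the original logs around to look up which unit a remaining difference belongs to.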
 
We should also add this fix to Caveman2Cosmos.

I already had it in my code, so it was destined to be included, but it looks like it's been committed already. :mischief:
 
Name differences don't matter as they are local only.
I guess comparison would be easier if we did not include the names at all, but then it is sometimes useful to know which unit it actually was if one is different.

In other words, ignore the names and look for other differences in the OOS log.

So this is to ignore too, I suppose

Code:
Player 5, Unit ID: 32779, Javelineer
X: 26, Y: 40
Damage: 0
Experience: 0
Level: 1
Promotions:

Player 5, Unit ID: 24588, Worker
X: 33, Y: 42
Damage: 0
Experience: 5
Level: 1
Promotions:

Code:
Player 5, Unit ID: 24587, Worker
X: 33, Y: 42
Damage: 0
Experience: 5
Level: 1
Promotions:

Player 5, Unit ID: 32780, Javelineer
X: 26, Y: 40
Damage: 0
Experience: 0
Level: 1
Promotions:

Then the only other kind of difference in that log is

Code:
Player 11 Food Total Yield: 30
Player 11 Production Total Yield: 12
Player 11 Commerce Total Yield: 17


Commerce:
---------
Player 11 Gold Total Commerce: 0
Player 11 Research Total Commerce: 13

against

Code:
Player 11 Food Total Yield: 28
Player 11 Production Total Yield: 14
Player 11 Commerce Total Yield: 16


Commerce:
---------
Player 11 Gold Total Commerce: 0
Player 11 Research Total Commerce: 12

But then I wonder why there was no difference in the random log. :confused:
 
Not all OOS errors will show a difference in the random log. What you DO have is an indication that the AI probably decided differently on which plots the city should work. I've noticed an OOS there before. Perhaps start by searching through how the AI determines which plots to work.

If this is a caching issue it's one that's beyond me, but I've seen AIAndy fix similar OOS errors previously. I'm sure he'll have more advice there and I'll be watching for it too.
 
That looks like the working plot assignment might have been different on the two computers, which would not have caused any difference in the random log (at least not right away).

One thing I found that might have something to do with that: in revision 5843, Koshling added a check to CvCityAI::AI_updateAssignWork for whether the city screen is up. But whether the city screen is up is local UI state, not synced state, so this can easily cause an OOS if one human player has the city screen up and the other does not.
So I will change that so the check is only done in single player.
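
A rough sketch of what restricting the check to single player looks like. The `TurnContext` struct, its flags and `maySkipAssignWork` are hypothetical, and what the real check does when the city screen is up is not stated here; the sketch simply assumes it skips the automatic reassignment.

```cpp
#include <cassert>

// Hypothetical flags; the real code reads these from the game and
// interface objects.
struct TurnContext {
    bool isHumanOwner;
    bool isCityScreenUp;        // local UI state, can differ per machine
    bool isNetworkMultiplayer;  // synced, identical on every machine
};

// Decides whether the automatic plot reassignment may be skipped.
// Local UI state is only allowed to influence the result in single player,
// so in multiplayer every machine reaches the same decision.
bool maySkipAssignWork(const TurnContext& ctx) {
    return ctx.isHumanOwner && ctx.isCityScreenUp && !ctx.isNetworkMultiplayer;
}
```

In multiplayer the function returns the same value whether or not the city screen is open, which is the property synced code needs.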
 
As the player that's out of sync is player 11, does this mean that if either player had the city screen up while the AI player (I presume player 11 would be an AI) is making a plot determination then that could lead to the OOS scenario we see here?
 
Hmm, actually there is an isHuman check in that line so if player 11 is an AI then there might be another OOS issue hidden in the plot assignment code.
 
Ah, that was already found but no one bothered to fix it in C2C ...
So that means another issue in the plot assignment somewhere.

Anyway, that thing with cachedTechValues was critical: 223 turns now (with AND), well into the medieval era at blitz speed (600 turns total), and still no OOS. I haven't gotten this far in years (I guess not since before AND 1.75, which I think is before C2C started). Thank you so much AIAndy. :goodjob:
 