
OOS problems and how to fix them

Discussion in 'Bugs and Crashes' started by 45°38'N-13°47'E, Jan 20, 2014.

  1. 45°38'N-13°47'E

    45°38'N-13°47'E Chieftain

    Joined:
    Jun 7, 2008
    Messages:
    5,833
    Location:
    Just wonder...
    Hello guys, I've recently tried some multiplayer games with my wife using A New Dawn (AND), which, as you know, uses most of the C2C DLL. After a few turns, though, it becomes impossible to play because the game goes OOS almost every turn. I've had a look in your forum and I've seen that the same applies to those brave enough to try an MP game with C2C (not a PBEM game, of course, which requires more patience than bravery ;) ).
    Now, I'd really like to try to fix some of those OOS bugs, but I'm not sure where to start. I have, or can produce, tons of logs but no experience in OOS bug fixing. Can any of you give me some help on how to debug an OOS error? I can't even tell whether the source of an OOS error is in the DLL or in Python.
    I'm asking for help here because I know that some OOS bugs have been fixed in C2C before; I think AIAndy was the OOS guru, correct? And since AND shares a good part of the DLL with C2C, if I'm able to track down some OOS bugs in AND, it could help you here too; it should also be easier to track bugs in AND because it has fewer features than C2C. The AND DLL is very similar to the C2C DLL; the only things I left out completely when importing code into AND were Traits and TB Combat Mod. I hope I can get some help or advice from you guys. :)

    P.S. I've tried enabling and disabling multithreading and it hasn't made any difference in terms of OOS. I remember that some months ago AND was pretty stable in MP, so I think I could repeat the import process from the C2C DLL to the AND DLL, starting from the last OOS-free revision of AND, to check where the problems started. I hope I can get some help and help you at the same time.
     
  2. Thunderbrd

    Thunderbrd C2C War Dog

    Joined:
    Jan 2, 2010
    Messages:
    25,316
    Gender:
    Male
    Location:
    Las Vegas
    You're right that these are BY FAR the HARDEST bugs to find and fix. I'm really going to have to get into figuring out where some of the ones we've had reports about are and repair them... There are some tutorials around the site and I've read them, but I don't think they're entirely complete.

    AIAndy wrote a good tutorial on that here (a thread with something like: The Modder's Guide to OOS Error debugging or something like that.) But at the time he wrote it I wasn't quite as good at following him. And while I've figured out how to fix SOME OOS errors, there's GOT to be something I don't know because I've been scouring the combat mechanism for months trying to find the one I know must be there somewhere.

    Anyhow, if I find any of those tutorials I mentioned I'll post them here for you and if either of us finds any fixes then we definitely need to share them! ;)
     
  3. 45°38'N-13°47'E

    45°38'N-13°47'E Chieftain

    Joined:
    Jun 7, 2008
    Messages:
    5,833
    Location:
    Just wonder...
    I guess it's this one:
    http://forums.civfanatics.com/showthread.php?t=477472
    Anyway I'll start looking at some of my logs to see if I can find something.
    I've made some quick tests and I've seen that the game goes OOS even if you connect two PCs via LAN and let one of them autoplay while playing on the other. This could speed up the process of collecting logs. I'll report back here if I'm able to find something or if I need some more direction. :)
    I've found another extensive guide here:
    http://forums.civfanatics.com/showthread.php?t=188460
    But I haven't read it all yet.
     
  4. AIAndy

    AIAndy Chieftain

    Joined:
    Jun 8, 2011
    Messages:
    3,396
    I completely forgot that I wrote that back then.
    That extensive guide you found there should be far easier to understand though (looks very nice).

    Anyway, if you want to make debugging OOS easier, I would recommend adding more to CvGame::calculateSyncChecksum. It contains mainly high level stuff that should find a lot of the OOS in the basic game mechanics but the C2C DLL has a LOT more that is not covered there.
    In principle just calculate any synchronized part of the game state into that checksum that you suspect might get desynced at some point. That way the OOS might be detected closer to the root cause.
     
  5. 45°38'N-13°47'E

    45°38'N-13°47'E Chieftain

    Joined:
    Jun 7, 2008
    Messages:
    5,833
    Location:
    Just wonder...
    Thank you AIAndy, it looks like I've got a lot to learn... but I'm willing to try. AND or C2C multiplayer would be really awesome.
     
  6. Thunderbrd

    Thunderbrd C2C War Dog

    Joined:
    Jan 2, 2010
    Messages:
    25,316
    Gender:
    Male
    Location:
    Las Vegas
    Yeah, those are the two I was referring to that I've tried to learn from - it's still tough though.
    I'm going to have to really look into this function to see what you're trying to say but anything that will help to nail these down will be awesome. This is such a frustrating issue - if I could reasonably take a whole cycle out to fix them all I would but I suck so bad at finding them that it gives me a headache! :mad:
     
  7. AIAndy

    AIAndy Chieftain

    Joined:
    Jun 8, 2011
    Messages:
    3,396
    OOS errors can be very frustrating to find indeed.

    The checksum function looks more complicated than it is. It just tries to add as much of the gamestate to the checksum as it can. The math you use to add the values in does not matter much. Those weird multiplications just make sure that something like +1 total beakers and -1 total food does not add up to the same number. Apart from that, to save time, it does not always add in everything; depending on the current time slice, it chooses a different subset. It might be worth changing that to always add in everything.
    The large OOS guide also has information about the checksum at the end of section 1.
    Btw, if you add something to the checksum, you should probably add it to OOSLogger.py as well so you don't only know as early as possible that it has desynced but also why.
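
The point about the "weird multiplications" can be illustrated with a minimal sketch. This is not the actual CvGame::calculateSyncChecksum code: the struct, its field names and the multiplier constants are all made up for illustration. It only shows why plain addition can hide a desync and how giving each value its own multiplier avoids that particular cancellation.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical slice of synchronized game state; the field names are
// illustrative and not taken from the DLL.
struct GameStateSample
{
	int iTotalBeakers;
	int iTotalFood;
};

// Naive checksum: plain addition. A +1 in beakers and a -1 in food
// cancel each other out, so the desync goes unnoticed.
inline std::uint32_t naiveChecksum(const GameStateSample& kState)
{
	return static_cast<std::uint32_t>(kState.iTotalBeakers)
	     + static_cast<std::uint32_t>(kState.iTotalFood);
}

// Weighted checksum: every value gets its own multiplier (the constants
// here are arbitrary), so offsetting changes no longer cancel.
inline std::uint32_t weightedChecksum(const GameStateSample& kState)
{
	std::uint32_t uiSum = 0;
	uiSum += static_cast<std::uint32_t>(kState.iTotalBeakers) * 1000003u;
	uiSum += static_cast<std::uint32_t>(kState.iTotalFood) * 1000033u;
	return uiSum;
}
```

With one machine at {100 beakers, 50 food} and the other at {101, 49}, naiveChecksum returns the same number for both, while weightedChecksum does not. Adding a new piece of state to the checksum would follow the same pattern with yet another multiplier.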
     
  8. 45°38'N-13°47'E

    45°38'N-13°47'E Chieftain

    Joined:
    Jun 7, 2008
    Messages:
    5,833
    Location:
    Just wonder...
    Ok, I have to learn how to modify that checksum part.
    But just to make sure I understand your post explaining how to detect and fix OOS errors, this is what I've done so far:
    - played a game up to the first OOS (both computers were logging everything);
    - compared the logs: the OOSLogs show differences from the very first line (Next Map Rand Value, Next Soren Rand Value, Total population, and others following);
    - compared the RandomLogger logs; the first entry that differs is

    Code:
    524353	103	Global	AI Maximum War	200	145
    against

    Code:
    524353	103	Global	AI Maximum War	100	72

    So I searched for "AI Maximum War" in the DLL code and came up with a single line in CvTeamAI.cpp:

    Code:
    if ((bFinancesProTotalWar || !bFinancesOpposeWar) &&
    	(GC.getGameINLINE().getSorenRandNum(iTotalWarRand, "AI Maximum War") <= iTotalWarThreshold))
    Does it mean that the problem is here?
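
The manual comparison described in this post can be sketched as a small helper that reports the first line at which two RandomLogger logs differ. This is only an illustration: the function names are hypothetical, and the comparison is deliberately naive (strictly line by line), whereas entries from different pipeline work items may legitimately appear in a different order, as explained later in the thread.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Reads a whole log into memory, one entry per line.
std::vector<std::string> readLines(std::istream& in)
{
	std::vector<std::string> lines;
	std::string line;
	while (std::getline(in, line))
	{
		lines.push_back(line);
	}
	return lines;
}

// Returns the index of the first line that differs between the two logs,
// or -1 if they are identical. If one log simply ends early, the index of
// the first missing line is returned.
int findFirstDivergence(const std::vector<std::string>& logA,
                        const std::vector<std::string>& logB)
{
	const std::size_t common = std::min(logA.size(), logB.size());
	for (std::size_t i = 0; i < common; ++i)
	{
		if (logA[i] != logB[i])
		{
			return static_cast<int>(i);
		}
	}
	return (logA.size() == logB.size()) ? -1 : static_cast<int>(common);
}
```

In practice the two streams would be std::ifstream objects opened on the log files from each computer; the text label on the diverging line (here "AI Maximum War") is then the string to search for in the DLL source.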
     
  9. AIAndy

    AIAndy Chieftain

    Joined:
    Jun 8, 2011
    Messages:
    3,396
    It means that your search for the root cause starts there. You know that at this point iTotalWarRand was 200 on one computer and 100 on the other. Now you backtrack from this point to find out what the value of iTotalWarRand depends on.
    This will lead you to CvTeamAI::AI_getWarRands. That is when it starts to get a larger and larger search. Several local variables influence it that depend on the victory strategies. Since one iTotalWarRand is twice that of the other, it might be caused by a divergence in AI_isDoVictoryStrategy(AI_VICTORY_CULTURE3).
    The problem is: There are a lot of dependencies to search and you have to make good guesses to cull that search tree down to manageable proportions.
     
  10. 45°38'N-13°47'E

    45°38'N-13°47'E Chieftain

    Joined:
    Jun 7, 2008
    Messages:
    5,833
    Location:
    Just wonder...
    I see; I had got as far as AI_getWarRands before posting, so I guess I'm starting to understand. From what I can tell from the code, AI_getWarRands depends on AI_maxWarRand, AI_limitedWarRand and AI_dogpileWarRand; those values are then adjusted by the multipliers that follow, which depend on AI_isDoVictoryStrategy. So I looked for every occurrence of AI_VICTORY_CULTURE3 and I was lucky enough: only 27 of them. I guess I can discard the occurrences of the form IF... AI_VICTORY_CULTURE3 ...., because those only read the AI_VICTORY_CULTURE3 value and shouldn't affect it. I checked them anyway just to make sure, and they don't seem to affect that value. This leaves me with only a handful of entries; what I've noted is

    in CvCityAI.cpp

    Code:
    int CvCityAI::getBuildingCommerceValue(BuildingTypes eBuilding, int iI, int* aiFreeSpecialistYield, int* aiFreeSpecialistCommerce, int* aiBaseCommerceRate, int* aiPlayerCommerceRate)
    {
    	int iResult = 0;
    	int iJ;
    	CvBuildingInfo& kBuilding = GC.getBuildingInfo(eBuilding);
    	CvPlayerAI& kOwner = GET_PLAYER(getOwnerINLINE());
    	int iLimitedWonderLimit = limitedWonderClassLimit((BuildingClassTypes)kBuilding.getBuildingClassType());
    	bool bCulturalVictory1 = kOwner.AI_isDoVictoryStrategy(AI_VICTORY_CULTURE1);
    	bool bCulturalVictory2 = kOwner.AI_isDoVictoryStrategy(AI_VICTORY_CULTURE2);
    	bool bCulturalVictory3 = kOwner.AI_isDoVictoryStrategy(AI_VICTORY_CULTURE3);
    in CvDllWidgetData.cpp

    Code:
            if (kPlayer.AI_isDoVictoryStrategy(AI_VICTORY_CULTURE3))
    I don't know if it has any meaning, but I've noticed that every other entry of AI_VICTORY_CULTURE3 is preceded by (GET_PLAYER(getOwnerINLINE()), while these two entries are not. It's a shot in the dark because I'm in no way an expert programmer, but do you think this might be connected to our problem?

    Either way there's something else I don't understand: usually OOS errors are not-repeatable; so how am I supposed to know if I've fixed one or not? :confused:
     
  11. Thunderbrd

    Thunderbrd C2C War Dog

    Joined:
    Jan 2, 2010
    Messages:
    25,316
    Gender:
    Male
    Location:
    Las Vegas
    In that particular function, kOwner is declared as a local reference to GET_PLAYER(getOwnerINLINE()), so it is the same thing. What COULD be a problem is if kOwner did not refer to GET_PLAYER(getOwnerINLINE()) but instead to getActivePlayer - that would be the problem right there if you can find that.

    The problem I have with OOS hunting is that I worry that I'm not seeing it when it's right in my face. I know of a problem existing somewhere in the combat cycle, perhaps particularly with withdrawal checks somewhere, but looking for any known cause of an OOS taking place in there is simply proving futile to me. I'm not seeing the dreaded getActivePlayer anywhere but then it may be that, like the problem you're hunting down, there's SO MANY different functions that CAN get processed.

    Hey AIAndy... does combat processing take place async when the view zooms in? If it does... how can it possibly not cause OOS errors on plain vanilla? I suspect, for this reasoning, that combat processes are generally synchronized... There is a call to getActivePlayer in Koshling's viewport coding when the battle zooms in and I'm wondering if that might be a culprit.
     
  12. AIAndy

    AIAndy Chieftain

    Joined:
    Jun 8, 2011
    Messages:
    3,396
    The CvDllWidgetData occurrence is different in that it is an async context, but one that is only used when cheating is on (it gives debugging information about the AI).
    But unless you wanted to find out whether this function is sometimes called async, you searched in the wrong direction. You need to look at what AI_isDoVictoryStrategy does and what it calls, and check whether anything in there might cause the two computers to calculate different results.

    You won't know at once but if no other OOS with a log that points in a similar direction occurs for some time then you know that you might have hit the right spot.
     
  13. AIAndy

    AIAndy Chieftain

    Joined:
    Jun 8, 2011
    Messages:
    3,396
    Combat processing changes gamestate so it must always be synced. Viewports on the other hand are not synced at all so they must not change gamestate. So within the dependency on getActivePlayer the only things called should be graphical (like moving the viewport, zooming in, displaying something).
     
  14. Thunderbrd

    Thunderbrd C2C War Dog

    Joined:
    Jan 2, 2010
    Messages:
    25,316
    Gender:
    Male
    Location:
    Las Vegas
    I figured that was the case considering the overall behaviors taking place. So his call there to getActivePlayer should be fine. And the problem is lurking elsewhere. Is there anything that explains some kind of rare random CHANCE for an OOS or is it always going to be reliably taking place when it hits the section where it causes the OOS?

    Reason I ask is because it really seems like we can have processes and functions take place just fine many times then for no apparent reason go OOS at some point even though it's run through those processes and functions previously without error even in the same game session.
     
  15. AIAndy

    AIAndy Chieftain

    Joined:
    Jun 8, 2011
    Messages:
    3,396
    The ones taking place every time are the easy ones but often specific circumstances are needed for the OOS to happen.
     
  16. Thunderbrd

    Thunderbrd C2C War Dog

    Joined:
    Jan 2, 2010
    Messages:
    25,316
    Gender:
    Male
    Location:
    Las Vegas
    As I consider how an OOS may be taking place in say, the withdrawal mechanism in combat, what kind of specific circumstances (in generic terms) might lead to an OOS?
     
  17. AIAndy

    AIAndy Chieftain

    Joined:
    Jun 8, 2011
    Messages:
    3,396
    One cause can be uninitialized variables. The most common value of uninitialized variables is 0 as a lot of the memory tends to be zeroes. But it depends on whatever value was there before and this can be different between the computers. So often you might see no difference as the value is 0 on both but sometimes you will see a desync.

    Another possibility is bad caching. When a player checks out the odds for an attack he executes quite some parts of the combat calculations in async context. Any caching that is done at that point is only done on one computer. If the result you would get changes before some actual fight happens (and the cache is not invalidated at that point), the caching would cause a desync.
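
Both causes mentioned above can be sketched in one small class. Everything here is hypothetical (UnitSketch is not CvUnit, and the "times 100" calculation is a placeholder); the sketch only shows the two habits that avoid these desyncs: initialize every member explicitly instead of relying on memory happening to be zero, and invalidate a cached value whenever one of its inputs changes.

```cpp
#include <cassert>

// Hypothetical unit with a cached derived value; not the real CvUnit.
class UnitSketch
{
public:
	explicit UnitSketch(int iBaseStrength)
		: m_iBaseStrength(iBaseStrength)
		, m_iCachedCombatStrength(0) // explicit initial value on every machine
		, m_bCacheValid(false)
	{
	}

	void setBaseStrength(int iNewValue)
	{
		m_iBaseStrength = iNewValue;
		m_bCacheValid = false; // an input changed, so the cache is stale
	}

	// May be called from async code (e.g. a player peeking at combat odds)
	// as well as from synced combat resolution.
	int getCombatStrength() const
	{
		if (!m_bCacheValid)
		{
			m_iCachedCombatStrength = m_iBaseStrength * 100; // placeholder formula
			m_bCacheValid = true;
		}
		return m_iCachedCombatStrength;
	}

private:
	int m_iBaseStrength;
	mutable int m_iCachedCombatStrength;
	mutable bool m_bCacheValid;
};
```

If one computer calls getCombatStrength() while the player looks at the odds and the other does not, both still return the same value after a later setBaseStrength(), because the setter drops the cached result. Without the invalidation line, the first computer would keep using the old value.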
     
  18. Thunderbrd

    Thunderbrd C2C War Dog

    Joined:
    Jan 2, 2010
    Messages:
    25,316
    Gender:
    Male
    Location:
    Las Vegas
  19. 45°38'N-13°47'E

    45°38'N-13°47'E Chieftain

    Joined:
    Jun 7, 2008
    Messages:
    5,833
    Location:
    Just wonder...
    More on my hunt for OOS errors. Right now I'm trying to understand what kind of OOS errors I can find. I've seen and understood (although not pinpointed and solved) the problem above where AI Maximum War was returning different values for different players. Now I've found a couple of OOS which look different:

    Player0

    Code:
    416013	54	Cahokia	AI Best Building	25	19
    416014	54	Poverty Point	AI Best Building	25	16
    Player1

    Code:
    416013	54	Poverty Point	AI Best Building	25	16
    416014	54	Cahokia	AI Best Building	25	19
    There are other entries in the random log, but it looks like values are exactly the same, only displayed in a different order. Could it be something related to multithreading? Or is it something else?

    Another kind of OOS I've got is

    Player0

    Code:
    17276	161	Global	AI Diplo Trade War	10	0
    Player1

    Code:
    17276	161	Global	AI Research	2000	149
    So here the entries are different; any idea where I should start to look? Is the problem Research? Or is it Diplo Trade War (or both)?

    Thank you for any advice you might give me. :)
     
  20. AIAndy

    AIAndy Chieftain

    Joined:
    Jun 8, 2011
    Messages:
    3,396
    That is the pipeline multithreading Koshling added. Each pipeline work item gets its own rand seed so it does not matter if the order between different work items is swapped.
    The names of the pipeline work items in this case are Cahokia and Poverty Point. If a log entry is not from a pipeline, you get Global at that point.
    All log entries that have the same name at that point have to be in fixed order but different names can be swapped without problem (and you will see that frequently).
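
The ordering rule just described can be sketched as a comparison that groups log entries by work item name and only requires the order to match within each name. This is an illustration under assumptions: the function names are hypothetical, the fields are assumed to be tab-separated with the name in the third column (as in the excerpts quoted above), and the first column is assumed to be a running counter that may be ignored.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <sstream>
#include <string>
#include <vector>

typedef std::map<std::string, std::vector<std::string> > EntriesByName;

// Groups entries by the name in the third column, keeping the original
// order within each group. The first column is dropped on the assumption
// that it is only a running counter.
EntriesByName groupByWorkItem(const std::vector<std::string>& lines)
{
	EntriesByName grouped;
	for (std::size_t i = 0; i < lines.size(); ++i)
	{
		std::istringstream fields(lines[i]);
		std::string counter, second, name, rest;
		std::getline(fields, counter, '\t');
		std::getline(fields, second, '\t');
		std::getline(fields, name, '\t');
		std::getline(fields, rest); // label and numbers, kept as one string
		grouped[name].push_back(second + '\t' + rest);
	}
	return grouped;
}

// True if the logs match once swaps between different names are ignored.
bool sameUpToWorkItemOrder(const std::vector<std::string>& logA,
                           const std::vector<std::string>& logB)
{
	return groupByWorkItem(logA) == groupByWorkItem(logB);
}
```

Applied to the Cahokia / Poverty Point excerpt, this reports a match, since each name carries the same entries on both computers. Applied to the "AI Diplo Trade War" versus "AI Research" excerpt, it reports a mismatch, because both entries are under Global and differ.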

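    In your second example (AI Diplo Trade War on one computer, AI Research on the other) both entries are Global, so they really are different. That means that because of some if condition different paths were followed on both computers.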
    Now you backtrack from both positions and try to find where the point could be on which they were joined and what condition and values would have caused them to split. If you are lucky you find one of the standard OOS situations.

    One suspicious thing I found was that AI_TechValueCached, which is on the way to "AI Research" can be called both sync and async, given by the bool bAsync that is passed to it. But it caches values in the same cache in both cases. So if for the same player that function is called in both sync and async context, it will likely go OOS.
    So I would highly suggest changing
    Code:
    {
    	MEMORY_TRACK_EXEMPT()
    	m_cachedTechValues[eTech] = iValue;
    }
    
    to
    Code:
    if (!bAsync)
    {
    	MEMORY_TRACK_EXEMPT()
    	m_cachedTechValues[eTech] = iValue;
    }
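
Why the guard matters can be shown with a self-contained sketch. This is not the real AI_TechValueCached: the class name is hypothetical, the iFreshValue parameter stands in for the actual tech value computation, and the constructor flag exists only so that the guarded and unguarded behaviour can be compared side by side.

```cpp
#include <cassert>
#include <map>

// Hypothetical stand-in for the per-player tech value cache.
class TechValueCacheSketch
{
public:
	explicit TechValueCacheSketch(bool bGuardAsyncWrites)
		: m_bGuardAsyncWrites(bGuardAsyncWrites)
	{
	}

	// iFreshValue is what the real computation would return right now.
	int techValueCached(int eTech, int iFreshValue, bool bAsync)
	{
		std::map<int, int>::const_iterator it = m_cachedTechValues.find(eTech);
		if (it != m_cachedTechValues.end())
		{
			return it->second; // cache hit
		}
		// With the guard, only synced calls may fill the cache.
		if (!m_bGuardAsyncWrites || !bAsync)
		{
			m_cachedTechValues[eTech] = iFreshValue;
		}
		return iFreshValue;
	}

private:
	bool m_bGuardAsyncWrites;
	std::map<int, int> m_cachedTechValues;
};
```

Suppose computer A makes an async call for some tech when the value is 120, and later both computers make the synced call when the value is 150. Without the guard, A returns the stale 120 and B returns 150, which is a desync. With the guard, the async call on A leaves the cache untouched and both return 150.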
    
     
