OOS Debugging

keldath

LivE LonG AnD PrOsPeR
Joined
Dec 20, 2005
Messages
7,558
Location
israel
hey,

well, started a multiplayer , on Doto.
something is off with the music,
it feels that the more zoom i make , the sound goes away.
right now, to ehar the music, i have to be at a pretty far off view / zoom.
i think musing should only be off at high globe view of almost max zoom .
right now, its less fun :) i really like the music of civ.

edit:
had an oos error just now....
right after war declaration.
save of the oos and an auto save before the oos.
let me know what logs do you need.

using Doto Candidate 3. but not using any of my special mod options but just some regular ones i played with on my last mp game ) v1.05 i think.
i wanted to do some easier mod test run.
 

Attachments

Last edited:
Note: This thread has been split off from the old AdvCiv thread at my request.

it feels that the more zoom i make , the sound goes away.
right now, to ehar the music, i have to be at a pretty far off view / zoom.
Sounds like you're using a high field-of-view value and then zoom in pretty far (flat camera angle). You should probably set a close default camera distance then on the Map tab of the BUG menu. Related post(s):
https://forums.civfanatics.com/threads/advanced-civ.614217/page-41#post-15904688
had an oos error just now.... right after war declaration.
save of the oos and an auto save before the oos.
let me know what logs do you need.
Logs are only useful if both instances of Civ 4 write them. If the error is reproducible by running two instances on the same PC, then that's the easiest way to get the logs. For the OOSLog, MessageLog=1 in the INI should suffice, for RNG calls in MPLog, RandLog=1 needs to be set in addition. I can probably get the logs from your savegames (when I get around to it).
 
Last edited:
Heya,

Alright, thanks, ill play with the sound.

ok i read this :
https://forums.civfanatics.com/threads/fix-and-avoid-oos-issues-dll-and-python.639164/

i loaded 2 instances but could not reprooduce the oos.
so, i guess its a random number issue. ill try to find it.
how will the mplog tell me whats wrong?

Ok so in the logs, ill guess ill see numbers, will that tell me where is the source function of the not synced numbers?

Also, i coulnd find anywhere the ooslog it self(i looked in the oos py file)
Other then all the ini logs set to 1, do i need to enable it somehow?
is there a button to it?
 
Last edited:
i loaded 2 instances but could not reprooduce the oos.
If no OOS error occurs, then logs won't help. The likeliest causes, in my experience, of non-reproducible OOS errors are uninitialized memory and synchronized data not being stored in savegames. Those issues can usually be detected in singleplayer games (preferably with a non-debug build, i.e. assert or release) by running AI Auto Play twice from the same savegame (e.g. initial save of a new game) and comparing the resulting game states. If they differ, then my "ReproTest" class (see ReproTest.h) can assist in identifying the cause. That said, it's well possible that an issue observed in multiplayer isn't immediately going to come up in some random singleplayer game, even if the issue could, in principle, occur in singleplayer.

I've just run 2x 300 turns from the same initial save in AdvCiv and got an "Ottoman revolt in New York" at the end of both; so chances are that these two runs have arrived at the exact same game state.

It's also possible that the problem will only ever occur in multiplayer, perhaps because certain plays (maybe certain trades between human players) can't come up singleplayer, or because of some computation that has a hardware-specific result. Though floating-point issues should already be ruled out by CvGame::doFPCheck at game start. Ultimately, such elusive problems can only be debugged by having the relevant logs enabled on both machines and then sending the log files to the person who does the debugging.

OOS errors caused by non-synchronized code modifying synchronized data usually can't be identified through AI Auto Play (because most non-synchronized code deals with UI actions by human players), but they should be reproducible with two instances on the same PC – assuming that one remembers the exact steps that lead to the error.
how will the mplog tell me whats wrong?
Ok so in the logs, ill guess ill see numbers, will that tell me where is the source function of the not synced numbers?
If you have two logs that have gone OOS (from a multiplayer game, or from two divergent AI Auto Play runs), then a diff will show where they first diverge. Oh, the BBAI log (BBAILog.h) and the Python event log (__LOG_ALL in CvEventManager.py) can also be helpful for this. As for the RandLog (MPLog.txt), one has to search the code base for the message in the log file to find the CvRandom call that has created the message. E.g. a message saying "Event Active?" would originate from getSorenRandNum(100, "Event Active?") in CvGame::initEvents. The question then becomes "why does this call not always happen?" – i.e. the proximate cause remains to be identified.
Edit: I've recently (for v0.99 maybe) written macros that write function names and line numbers to the RandLog:
github.com/f1rpo/AdvCiv/.../CvRandom.h#L85
The "Event Active?" getSorenRandNum call above would then be replaced with SyncRandNum(100).
Also, i coulnd find anywhere the ooslog it self(i looked in the oos py file)
Other then all the ini logs set to 1, do i need to enable it somehow?
is there a button to it?
It'll only get written when an OOS error occurs, i.e. when the big red letters appear on the screen. By comparing two OOS.log files, one can tell how the two game states differ. E.g. it could show that a particular unit is at coordinates (54,7) in one game state and at (56,6) in another.
 
Last edited:
hey,
then my "ReproTest" class (see ReproTest.h) can assist in identifying the cause.
how by break points?

I've just run 2x 300 turns from the same initial save in AdvCiv and got an "Ottoman revolt in New York" at the end of both; so chances are that these two runs have arrived at the exact same game state.
a. how did you managed to execute ai auto play in an mp play?
b. i ran manuall around a hundred turns on my 2 instances of civ. couldnt get oos.
c. "Ottoman revolt in New York" - you mean that its not synced or synced and just an example where you reached?

computation that has a hardware-specific result.
can oos happen due to some hardware issue? maybe network hiccup or something?

BBAI log (BBAILog.h) and the Python event log
didnt see these logs - where am i to look / or define something in the bbai.h

thats very smart.
It'll only get written when an OOS error occurs
its something you changed or darkLuna in the ooslogger right? i read some of the file.

let say, i get the plot example,
what then - how one can reverse engineer it to the source?

anyway for my next session,
ill make sure that both of us have all the logs on.
i didnt use any of the Doto code in the game i think (i wanted a dry test for the game),
i wonder what in theory could bring the oos from my additions.
maybe i should examine some soren andom i might have added?

thank you :)
 
First up, I've just asked the moderators to move this whole conversation into a separate thread in the Advanced Civ subforum. It's conceivable that it'll be of interest to some third person, but not for most players.
a. how did you managed to execute ai auto play in an mp play?
I did it in singleplayer. Just in case that it's an issue with uninitialized memory or synchronized data not getting saved (and that it occurs commonly enough to come up just like that). If it can be addressed in singleplayer, then that's much more convenient than having to juggle multiple instances.
how by break points?
If I know that AI Auto Play (in singleplayer) will diverge if I run from a particular savegame for, say, 200 turns, then I might reload the save and try 100 turns next, then 50 or 150 ... – until I find an autosave from which AI Auto Play will yield inconsistent results within a few turns. From there, ReproTest can be used to find out after which turn exactly an inconsistency occurs (can be hard to tell from just looking at the map; ReproTest checks the entire game state) and how the two game states differ. Well, I don't think you're in this situation. There are instructions in ReproTest.h,

AI Auto Play can be used in multiplayer too; that's a different kind of test, but also potentially helpful. This requires ENABLE_AUTOPLAY_MULTIPLAYER = 1 in GlobalDefines_devel.xml. There's also ENABLE_DEBUG_TOOLS_MULTIPLAYER for Debug mode and WorldBuilder; please see the description in GlobalDefines_devel.xml about that.
b. i ran manuall around a hundred turns on my 2 instances of civ. couldnt get oos.
That's also a useful (albeit tedious; but it's all tedious ...) way of testing. The AI will make offers this way that wouldn't come up in AI Auto Play.
c. "Ottoman revolt in New York" - you mean that its not synced or synced and just an example where you reached?
Getting this unusual event after exactly 300 turns both times strongly suggests to me that both AI Auto Play runs were in-sync. So no problem was evident from this test. Of course this isn't to say that the problem must lie with DotO; it could well be something in my code.
can oos happen due to some hardware issue? maybe network hiccup or something?
I doubt it. I imagine that the CvNet... messages are sent over TCP, which should all but guarantee delivery.
didnt see these logs - where am i to look / or define something in the bbai.h
CvEventManager writes to PythonDbg.log. The BBAI log requires a recompile after uncommenting //#define LOG_AI.
its something you changed or darkLuna in the ooslogger right? i read some of the file. [...]
anyway for my next session, ill make sure that both of us have all the logs on.
The OOSLogger has always checked CyInterface().isOOSVisible(), i.e. whether the game is actually out of sync. I've added the MessageLog check because, if a game is continued after an OOS error, a player could inspect the log and see some information that is supposed to be secret. (That's a pretty farfetched way of cheating, so perhaps the log should be written regardless of the MessageLog setting.)
let say, i get the plot example,
what then - how one can reverse engineer it to the source?
If that unit being in the wrong spot is the only difference, then you might check its UnitAI type in Debug mode (after going OOS and, to be on the safe side, also in the autosave on the turn before), then look at the respective CvUnitAI routine, say, AI_pillageMove if the AI type is UNITAI_PILLAGE. If there are additional logs to compare, then that could get you closer. For example, if "AI Patrol" is the last in-sync RNG call in MPLog.txt, then it would seem likely that the error occurred in or after AI_patrol. If the error is reproducible with two instances on a single PC, then you can set a debugger stop in AI_patrol and step through the code, first with the debugger attached to the one process, then, after restarting and reloading, to the other process. If you take too much time in the debugger though, i.e. longer than a few(?) seconds, the network connection times out.
i wonder what in theory could bring the oos from my additions.
maybe i should examine some soren andom i might have added?
The RNG calls themselves aren't all that error-prone. One mustn't call a synchronized RNG (SorenRand, MapRand) in a non-synchronized context and mustn't change synchronized game state based on the outcome of a non-synchronized RNG call (ASynchRand). A more common error is to change synchronized game state of the ActivePlayer (or ActiveTeam). All this would lead to reproducible errors though (with two instances on one PC). The state of the (supposedly) sync'd RNGs is the first thing that goes into the OOS checksum
Code:
int CvGame::calculateSyncChecksum()
{
   int iValue = 0;
   iValue += getMapRand().getSeed();
   iValue += getSorenRand().getSeed();
   // ...
}
and that's what causes the game to declare that there is an OOS error. No randomness involved.
 
First up, I've just asked the moderators to move this whole conversation into a separate thread in the Advanced Civ subforum. It's conceivable that it'll be of interest to some third person, but not for most players.
yes good call

ReproTest checks the entire game state) and how the two game states differ. Well, I don't think you're in this situation. There are instructions in ReproTest.h,
alright ill have a look at, thanks.
AI Auto Play can be used in multiplayer too; that's a different kind of test, but also potentially helpful. This requires ENABLE_AUTOPLAY_MULTIPLAYER = 1 in GlobalDefines_devel.xml. There's also ENABLE_DEBUG_TOOLS_MULTIPLAYER for Debug mode and WorldBuilder; please see the description in GlobalDefines_devel.xml about that.
how awesome is that... ill run some turns then, really nice.
can a debugger be attached?

That's also a useful (albeit tedious; but it's all tedious ...) way of testing. The AI will make offers this way that wouldn't come up in AI Auto Play.
tedious, but, nice to see that in the same time ai pops a question to both humans.

CvEventManager writes to PythonDbg.log. The BBAI log requires a recompile after uncommenting //#define LOG_AI.
well i got kicked, so next game is a new. so ill do a recompile than.

CyInterface().isOOSVisible()
i used ransack, to find it - i couldnt anywhere - is it an exe func?


int CvGame::calculateSyncChecksum()
{
int iValue = 0;
iValue += getMapRand().getSeed();
iValue += getSorenRand().getSeed();
// ...
}
ok i read about how to debug by compare in nightingale's post there.
also read in another thread about that there are none random numbers, which is a stamge concept.
i hope to get an oos log soon or dont hope to..:)

that was very informative post.
ill try to run ai on 2 instances for long, see what happens, hope my cpu wont blow up.[/quote]
 
can a debugger be attached?
Yes. One instance can be started from within Visual Studio as normal. Maybe one could even start a second instance of Visual Studio in order to have debuggers attached to both Civ instances; I've never tried that though. Upon hitting a breakpoint, the network connection will begin to time out. Not that much of a problem as the instance with the debugger will continue to operate normally until it reaches code that requires a network connection; it's mostly annoying. The timeout is handled entirely by the EXE, apparently.
tedious, but, nice to see that in the same time ai pops a question to both humans.
:hmm: "If you're talking to Saladin, then who am I talking to?"
i used ransack, to find it - i couldnt anywhere - is it an exe func?
Yes, it's listed in the Apolyton/BUG Python API. Without further info other than that CyInterface is "not in the SDK."
ill try to run ai on 2 instances for long, see what happens, hope my cpu wont blow up.
Shouldn't be too bad; they can each use a different core.
 
hey there,

i did some more tests,
seems the OOS is reproducable.
i ran auto play on my game, on my pc , 2 instances.
i carried out a game me and my friend started ->
auto play of 100+100 turns after around 100 actual turns played with my friend.

i checked the logs, the ones i got you prior,
i encountered the issue - pick a song async.

in this attached save,
i saw the same issue - on diplomacy, i exited diplo screen and OOS occurred.

attached are the save, screen shot and the logs, though i dunno how 2 instances write to one logs folder :) .
 

Attachments

  • Logs-solomulti.zip
    Logs-solomulti.zip
    696.3 KB · Views: 129
  • sameCPU_ooo.CivBeyondSwordSave
    sameCPU_ooo.CivBeyondSwordSave
    162 KB · Views: 108
  • ooo_same_cpu.jpeg
    ooo_same_cpu.jpeg
    569 KB · Views: 120
Thanks. Can the problem be reproduced from the savegame you've attached? What action do I have to take? Iirc, I got a diplo popup from Alexander upon loading the save, but closing the popup didn't take the game out-of-sync for me (don't recall which reply I chose in the diplo popup).
i checked the logs, the ones i got you prior,
i encountered the issue - pick a song async.
It's normal that the ASyncRand calls don't line up. AdvCiv 0.99 will write a separate logfile for ASyncRand calls; they're normally not relevant for OOS debugging. The OOS logs show that the game states differ in a lot of ways; not that helpful in this case. Seems to me that the true point of divergence in MPLog (from your post here) is
Rand = 1601794504l / 10000 (advc.104 (cap)) on 19387
being written on your PC but not on your opponent's, i.e. this line in UWAIAgent.cpp
if( ::bernoulliSuccess(prSkip, "advc.104 (cap)"))
was executed only on your PC. Unfortunately, it's not clear to me why this would happen. Maybe I can figure it out, but if the issue (and if it's indeed the same issue) can be resolved by reproducing it in the debugger, then that sounds easier.
attached are the save, screen shot and the logs, though i dunno how 2 instances write to one logs folder :) .
Both processes write to the same logfiles – which makes them pretty much unusable. OOSLogger should write separate files; it includes the player id in the name of the logfile. However, I see only one file
OOSLog - 0Keldath - Turn 131
in your Log-solomulti archive: :confused: Well, it's probably not going help anyway.
For MPLog, PER_PLAYER_MESSAGE_CONTROL_LOG (GlobalDefines_devel.xml) can be set to 1 in order to write a separate logfile for each player.
 
hey hey,

ok so, from what i did i just ran around some auto ai in double instance over my cpu.
and, once the cycle ended, i had some diply ai request, upon exiting it, the oos error occurred.
same as in my game with my friend.
we did a load game though, issue did not return (we lost anyway, so started a game with civ4 re imagined , figured ill wait for your reply).

reproducing it in the debugger, then that sounds easier.
should i run atest game as i did with a debugger dll?
how will i know where to set a break point / or will n assert pop up my come?

PER_PLAYER_MESSAGE_CONTROL_LOG
yup saw that.

i hope you can catch this oos, ill try to find more info as i go.
 
Having gotten back a bit into OOS debugging (without much success), some corrections and additions to my earlier post:
Both processes write to the same logfiles – which makes them pretty much unusable.
Probably not true. I think one process will reserve the logfile and the other will never be able to write to it.
I see only one file
OOSLog - 0Keldath - Turn 131
in your Log-solomulti archive
The reason could be that the two instances use different .ini files (via the ini=... command line argument) and that only one of them has MessageLog enabled.

OOS errors are often impossible to reproduce in games with simultaneous turns (ST) because such games constantly get numbers from the synchronized RNG while the game is idle -- in order to schedule (possible) unit moves (CvGame::updateMoves). Effectively, it's as if the "New Random Seed" option was always enabled in ST games. In the MongooseMod changelog (20 Dec 2010), LunarMongoose claims to have “devised a simple, effective way to lock random seeds in Network Multiplayer games when the NewRandomSeedOnReload option is not enabled.” As you know, the MongooseMod DLL source isn't publicly available (and I'm not going to gripe about that). For testing on a single PC, it should be fine to simply not randomize the priority of unit moves. Or one could just disable ST for such tests, and hope that most OOS errors will occur regardless of ST.

There also seems to be an issue with the timing of AI Auto Play in ST games. In several tests, I've seen the game go OOS exactly after the number of turns that I had entered into the AI Auto Play box. Not really a problem in itself; one just needs to be aware that those are false positives. When AI Auto Play gets cut short by CvGame::checkInSync, then there should be a real issue. (That function will also show a failed-assert popup.)

In games between actual people, I expect that the randomized order of unit moves is important, so, if an OOS error occurs, then I guess one can only try to diagnose it by inspecting the logfiles. In this case ...
if( ::bernoulliSuccess(prSkip, "advc.104 (cap)"))
was executed only on your PC. Unfortunately, it's not clear to me why this would happen. Maybe I can figure it out [...].
I don't get enough information from the logfile – too many possible explanations why that line may not get executed. The BBAI log and UWAI log might help.
ok so, from what i did i just ran around some auto ai in double instance over my cpu.
and, once the cycle ended, i had some diply ai request, upon exiting it, the oos error occurred.
same as in my game with my friend.
This sounds like it was caused by AI Auto Play. It may well be that a diplo popup was involved in the game with your friend, but I rather doubt that the same thing has occurred in your AI Auto Play test.
should i run atest game as i did with a debugger dll?
how will i know where to set a break point / or will n assert pop up my come?
I would only use the debugger when an error is reproducible. (If the error involving the UWAI capitulation check were reproducible, then I'd set a breakpoint at the start of UWAI::Team::considerCapitulation.)
 
Back
Top Bottom