How to find the cause of a freeze?

Maniac

Apolyton Sage
Joined
Nov 27, 2004
Messages
5,588
Location
Gent, Belgium
If there's a crash, you can simply use the debugger to check at what point the game crashes. But what can you do if the game freezes, keeps spinning indefinitely during 'Waiting on Civilizations'? Is there an easy way to check the cause of these bugs?
 
If you have the debugger attached, hit F12 and it will break the code. If it is a loop that you caused, then it should be in the DLL and you will most likely come to legible code. Sometimes there are graphical things which take up most of the loop time though, so you might get Dissasembly gibberish for a little while since it is in another DLL or the EXE. Just F10 for a while till it comes back to normal code.
 
Thanks for the reply! Unfortunately F12 didn't lead to readable code. But by enabling a couple dozen breakpoints I was able to pinpoint the problem.
 
This thread hasn't been dead too long, I suppose.

I also have a problem where a player submitted a save game, which reproducibly hangs as soon as I click "next turn". I can successfully set breakpoints according to Refar's Guide and I get correct stack traces from this, for example doHolyCity. But, when I "break all" (ctrl-alt-break) I get a stack trace which only shows mss32dll, ntdll and kernel32dll. I use F10 for a while as suggested here. However, as it goes through the assembly, I see that it is in fact looping completely in the assembly. That is, a particular address such as mss32dll!211015c1 appears about every 20 times I hit F10. Even after leaning on the F10 key for several minutes and auto-repeating, it never goes to what I can call "useful" code.

I have absolutely no idea which functions are involved, it could be anything in the whole mod. So I have no idea where to set breakpoints.

Is there any strategy for getting a useful stack trace in this case, or a strategy for setting breakpoints?
 
Once you get tossed into Assembly, it generally takes ages to get back out of it, and many times you don't REALIZE you are out of it, because it'll keep you in assembly view instead of going to the actual code unless you close the assembly window before doing the step.

Focus on what happened before it jumped to assembly. Something at that point was the issue. I would suspect graphics/sound as that is typically the main thing that someone might mod which will spend inordinate amounts of time out of the DLL. Of course, to see what comes before the Assembly you need to guess at some breakpoints, so go to CvGame::doTurn and place a breakpoint there, and then another in CvPlayer::doTurn. This should ensure that you get a break to happen, and you can at least figure out on which player's turn the issue comes up. Then you can work your way down through the CvPlayer function till you find the last call which works, go to that function and repeat.
 
That is clearly tedious, but what the heck. I set a breakpoint at the start of CvPlayer::doTurn, and observe the field m_eID to find out what player is running. It is a nine player game. I see id 3,4,5,6,7,8 complete successfully, then the next id is 50 (barbarian, surely), and when I F5 continue, it hangs. Since the players are shuffled, I assume it is coincidence that the id's appear in order, and I assume that id 0,1,2 would be called after the barbarian player.

I guess that means it is during the barbarian player's execution of CvPlayer::doTurn. This function is 150 lines long, so I choose one breakpoint every 50 lines. doResearch is called 50 lines in, verifyCivics is called 100 lines in, and the last line calls CvEventReporter::endPlayerTurn. All three of these breakpoints hit successfully, and when I F5 continue after the endPlayerTurn breakpoint hits, then I get the hang.

That says it is not in CvPlayer::doTurn, but rather after the barbarian finishes and before the next player starts. That would be in the function that calls CvPlayer::doTurn, which is CvPlayer::setTurnActive. But here I get confused. The game is a single player, not pbem, not hot seat, and not simultaneous team turns. If I trace out the logic in setTurnActive, *nothing happens* between the end of one player's doTurn and the start of the next player's doTurn.

Can you suggest how I may proceed to find a hang which occurs after the barbarian player's CvEventReporter::endPlayerTurn but before the next player's doTurn is called?
 
Players are only shuffled in a multiplayer game. Player 0 is the human, players 1 and 2 must be dead is all.

Use the "Step Out" function to go back up till you hit a layer of functions where a crash happens. Most likely you'll wind up back in CvGame::doTurn, but the question is if it makes it back around to try and start the next turn, or breaks before this one completely wraps up.

The stack trace (in the debugger) should show you the order of function calls leading to what you are observing, so you can use that instead of stepping out, but it can be nice to watch things progress.

Code:
	else
	{
		for (iI = 0; iI < MAX_PLAYERS; iI++)
		{
			if (GET_PLAYER((PlayerTypes)iI).isAlive())
			{
				if (isPbem() && GET_PLAYER((PlayerTypes)iI).isHuman())
				{
					if (iI == getActivePlayer())
					{
						// Nobody else left alive
                        GC.getInitCore().setType(GAME_HOTSEAT_NEW);
						GET_PLAYER((PlayerTypes)iI).setTurnActive(true);
					}
					else if (!getPbemTurnSent())
					{
						gDLL->sendPbemTurn((PlayerTypes)iI);
					}
				}
				else
				{
					GET_PLAYER((PlayerTypes)iI).setTurnActive(true);
					FAssert(getNumGameTurnActive() == 1);
				}

				break;
			}
		}
	}

This is what you have managed to get all the way through so far, so only this bit is left:

Code:
	testVictory();

	gDLL->getEngineIFace()->SetDirty(GlobePartialTexture_DIRTY_BIT, true);
	gDLL->getEngineIFace()->DoTurn();

	PROFILE_END();

	stopProfilingDLL();

	gDLL->getEngineIFace()->AutoSave();

And then you'll be done with this turn. So I guess if we had checked for an autosave being made for the next turn that would have told us if this finishes. Most likely you were given the Autosave with the freeze report though, so that would imply that the break happens in here. If not, there is a LOT of code before any player is given their turn for the next round (namely the barbarian spawning stuff). Basically, if the save isn't being created, then the break must be happening in testvictory, or somewhere more obscure by a longshot.
 
Can you suggest how I may proceed to find a hang which occurs after the barbarian player's CvEventReporter::endPlayerTurn but before the next player's doTurn is called?

CvPlayerAI::AI_unitUpdate()
CvUnitAI::AI_Update()

if CvUnitAI comes up several times for the same unit, you have found the problem.

It's a bit confusing because what is normally considered a turn and what is called a turn in the SDK is quite different. A lot of things like units performing missions actually take place in the update cycle and outside of the turn functions.

also put a breakpoint at player.setAutoMoves(false); in CvGame::updateMoves(). This is triggered after all barbarian units performed their mission but before the new gameturn starts (where function like createbarbarians etc. are run)
 
Thanks for the suggestions. After a long time of putting breakpoints and stepping through things, I was able to narrow it down and fix it. A lot of customization for Dune Wars is done in python, where it is not compute-intensive or in any inner loop. I eventually found that the EndGameTurn python event was being called and never returning. So then I added premature "return" statements throughout the python code to find what was hanging. As often happens, in *retrospect* it is an obvious coding error I should never have made.

The key lesson is: never continue when you need to advance an iterator:
Code:
	(pUnit, iter) = pPlay.firstUnit(false)
	while (pUnit):
		...
		if pUnit.isHasPromotion(...):
			continue
		(pUnit, iter) = pPlay.nextUnit(iter, false)
As you see, the first time isHasPromotion(...) returns true, the code will infinite loop.
 
New occurrence of same problem, totally stuck again.

In practically every game of DW 1.7.1 (not released yet) there is a hang around turn 360-400. It happens in 9-player autoplays every time, 5-player autoplays sometimes, regardless of which combinations of civilizations are included. Sometimes, reloading from the autoplay, interrupting the autoplay, clicking "next turn" a few times, then autoplaying gets past the hang point; sometimes not.

I have filled up CvGame::update, CvGame::updateMoves, CvPlayer::setAutoMoves, CvPlayer::setTurnActive and CvPlayer::doTurn with fprintf statements sent to a file which I fclose after each fprintf so there is no file buffering issue.

In a non-hanging turn, there are many calls to CvGame::update which calls CvGame::updateMoves. This is the "heartbeat" of civ. In some of these updates, there is a call chain where CvPlayer::setAutoMoves calls CvPlayer::setTurnActive calls CvPlayer::doTurn. In the rest of the updates, this call chain does not happen. After a number of updates, all of the players are looped through.

In the hanging turn, the update for player 0 is called and returns successfully. Then an infinite number of updates happen with no call to setAutoMoves. So, for some reason, the request to run player 1 never happens. The game sits there doing updates forever.

How can I find out why the game stops trying to run player turns?
 
I would check CvUnitAI::AI_update too. If the WoC is caused by a unit, you find it out this way. Maybe also CvSelectionGroup::StartMission
 
Thanks for the suggestions. I traced out the code for CvPlayerAI::AI_unitUpdate; it is only called by CvGame::updateMoves, and I know this is being called and returning. So this cannot cause my infinite loop. I tried to trace out CvSelectionGroup::startMission, but it goes a pretty long distance up.

What has me confused is that I have printf's at the start and end of CvGame::update, which is the very top of the call stack as far as I can tell. It is starting and finishing. If any call children of this function were causing the infinite loop, then CvGame::update would never finish. Is there any other way that sdk code can be called?

EDIT: When you have a hang, it can appear two different ways. I have seen hangs where the CPU is pegged at 100% and you cannot get any response from the civ4 window. This is a typical infinite loop, where an update never finishes. What I have here is different; the GUI is still active and unit animations are running, and the CPU is not pegged. However, after many minutes it never does anything. I am not sure that helps any, but it looks different from other infinite loops I have seen.
 
If you move the mouse to the edges of the screen, does the map scroll? If so, is it choppy? If so, it sounds like runaway Python code (infinite loop).
 
I think there are two types of hangs. I just retried. In my existing problem, the CPU is not pegged at 100%, it is around 15%. I can minimize and maximize the window, and I can scroll. Windows task manager says the process is "running".

I added an explicit infinite loop in python to onEndGameTurn, and the behavior is different. Although the animations keep going, the CPU is pegged at 100%, I cannot scroll with the mouse, and if I minimize, I am not able to maximize again. Task Manager shows "not responding".

Breakpoints and print statements don't seem to be able to crack the problem. I have tried cutting down the game using WB from a reproducible save. With various things cut out, the hang goes away. But there is no pattern to what makes it go away. I tried cutting things out of the python. I thought I had a lead by cutting out one function. But, this function calls the random number generator, and commenting out just that call "cures" the hang for this game. All that means is, some subsequent call to the RNG is now returning a value which happens to bypass the calculation causing the hang, with no information about which calculation that is.

I can go back a few versions (to 1.7.0.2) and it does not hang; which means I can try re-adding the changes from 1.7.0.3 - 1.7.0.6 to see when the hang reappears. That is inefficient, so if anybody has suggestions about another approach, it would be greatly appreciated.
 
I take it CvPlayer::setTurnActive() is set true for Player 0 but never false?
 
I fixed the problem, although I am not sure how it caused the effect I saw. After many more painful experiments of adding and removing multiple features, I found that the code I added to have transports unload automatically at the end of the turn was causing the problem. Sephi suggested to put this code in onEndGameTurn, but I had put it in AI_unitUpdate. The call was finishing successfully. But apparently it was having some bad side effect, in some way I do not understand. When I removed this code, the hangs never happened anymore. I moved the code into AI_turnUnitsPost, and now the hang is not happening.

There was no magic bullet, just a lot of brain time.
 
Do you still have the savegame that causes the WoC? You could try to fix it by adding this to the code (will make sure all unloaded units have a Mission). AI_turnUnitsPost is run before any unit movements are done or missions pushed, I doubt it will help here.

Spoiler :
Code:
if (!player.isHuman())
                        {
                            for(CvUnit* pLoopUnit = player.firstUnit(&iLoop); pLoopUnit != NULL; pLoopUnit = player.nextUnit(&iLoop))
                            {
                                if(pLoopUnit->canUnloadAll() && (!pLoopUnit->plot()->isWater()))
                                {
                                    bool bDanger=false;
                                    // Are we in Danger?
                                    int iOurDefense = player.AI_getOurPlotStrength(pLoopUnit->plot(), 0, true, false);
                                    int iEnemyOffense = player.AI_getEnemyPlotStrength(pLoopUnit->plot(), 4, false, true);

                                    if (iOurDefense < iEnemyOffense)
                                    {
                                        bDanger = true;
                                    }

                                    if (bDanger)
                                    {
                                        pLoopUnit->unloadAll();
[b]
    CLLNode<IDInfo>* pUnitNode;
    CvUnit* pLoopUnitUnloaded;
    pUnitNode = pLoopUnit->plot()->headUnitNode();
    while (pUnitNode != NULL)
    {
		pLoopUnitUnloaded = ::getUnit(pUnitNode->m_data);
		pUnitNode = pLoopUnit->plot()->nextUnitNode(pUnitNode);

        if (pLoopUnitUnloaded->getGroup()->getLengthMissionQueue()==0) 
        {
            pLoopUnitUnloaded->getGroup()->pushMission(MISSION_SKIP);
        }
}
[/b]
                                    }
                                }
                            }
                        }
 
Shazbot! Evidently all the time I spent tracing out the calling sequence did not prevent me from making this mistake. Thanks for pointing it out.

I really wish civ had a good regression technique so I could store all the special purpose WB setups I make when I initially test these features. When I first added your code I had a save game which had an AI transport stack about to cross a continent to attack, so it was really easy to see. But as soon as you add a unit or promotion or whatever, the save game is no longer loadable. So I did not test the code in its new position, it seemed "obviously correct". Sigh.

I did not want to put it into CvGame since it uses private CvPlayerAI functions. I can't recall exactly why I put a loop over all units into AI_unitUpdate in the first place, that seems inefficient / wrong. Adding missions to the units may help but I no longer feel AI_unitUpdate is the right place.

Where can I add this loop over all units, in CvPlayerAI, so it is only called once per turn? Or maybe CvPlayer, if that can already see the CvPlayerAI functions needed?
 
Warning: I know nothing about the AI code. That's an area I just haven't explored. This suggestion is based solely on reading CvGame::doTurn() and CvPlayer::setTurnActive().

CvPlayerAI::AI_doTurnUnitsPre() is called at the top of CvPlayer::doTurnUnits() which is called before CvPlayer::doTurn() as long as it's not a simultaneous-turn MP game. Otherwise it's called immediately after doTurn().
 
I did not want to put it into CvGame since it uses private CvPlayerAI functions. I can't recall exactly why I put a loop over all units into AI_unitUpdate in the first place, that seems inefficient / wrong. Adding missions to the units may help but I no longer feel AI_unitUpdate is the right place.

Where can I add this loop over all units, in CvPlayerAI, so it is only called once per turn? Or maybe CvPlayer, if that can already see the CvPlayerAI functions needed?

You could add a !hasReadyUnit(true) check. This should only return false the last time CvPlayerAI::AI_unitupdate is called. A few units might still perform their missions after it though in the automission cycle in updatemoves function.
 
Top Bottom