Strange CTD on cityconquest

AbsintheRed · May 3, 2016

Probably it's not really relevant anymore, since you guys already gone deeper in the code since.
Nevertheless I added another assert to CvSelectionGroup::groupMove, instead of Koshling's check:

Code:

void CvSelectionGroup::groupMove(CvPlot* pPlot, bool bCombat, CvUnit* pCombatUnit, bool bEndMove)
{
	PROFILE_FUNC();

	CLLNode<IDInfo>* pUnitNode;
	CvUnit* pLoopUnit;

	pUnitNode = headUnitNode();

	while (pUnitNode != NULL)
	{
		pLoopUnit = ::getUnit(pUnitNode->m_data);
		pUnitNode = nextUnitNode(pUnitNode);

		// adding asserts for the city conquest CTD
		[B]FAssertMsg(pLoopUnit != NULL, "NULL unit reached in the selection group");[/B]

Reached another CTD with the latest code (so all 3 asserts enabled)
The ones in CvSelectionGroup::addUnit and removeUnit didn't trigger at all.
This new one triggered exactly once, right before the crash.

AbsintheRed · May 3, 2016

alberts2 said:
The real bug is removing units from a selection group while iterating through those units. The IDInfo struct at the adress the pLoopUnitNode is pointing to is removed because a unit is removed from the group. Later a XYCords struct is allocated using new at that address and the pLoopUnitNode is now pointing to that struct. This can't work and is just poorly programmed.

I'am just not sure how to really fix this without creating other issues.

Do you have any ideas which might cause anything like this?
Even though there are various Firaxis mistakes/poor programming involved in the code, it still doesn't crash anywhere else AFAIK, only in RFCE.
There is surely something our mod has that makes it more likely to reach those corrupt memory values for pLoopUnitNode.

My guess would be it's somewhere in pyhon, but it can easily be somewhere in some very old c++ code too.
Since the bug only appeared when people started to use Win 8.1 and Win 10, noone experienced these CTDs when Win 7 was the main OS...
So it wouldn't be suprising if it's caused by something added to the mod 3-4 years ago :eek:

Nightinggale · May 3, 2016

AbsintheRed said:
Probably it's not really relevant anymore, since you guys already gone deeper in the code since.
Nevertheless I added another assert to CvSelectionGroup::groupMove, instead of Koshling's check:

AbsintheRed said:
Reached another CTD with the latest code (so all 3 asserts enabled)
The ones in CvSelectionGroup::addUnit and removeUnit didn't trigger at all.
This new one triggered exactly once, right before the crash.

That is precisely what I would expect. The assert only informs you when something goes wrong while the continue statement alters how the game runs. You changed a crash safeguard into an assert telling "I'm about to crash and it is in this line".

AbsintheRed said:
My guess would be it's somewhere in pyhon, but it can easily be somewhere in some very old c++ code too.
Since the bug only appeared when people started to use Win 8.1 and Win 10, noone experienced these CTDs when Win 7 was the main OS...
So it wouldn't be suprising if it's caused by something added to the mod 3-4 years ago

The OS dependency is really weird and I can't think of anything other than how the system allocates memory. It isn't unlikely that there is a bug, which nearly crash older versions of Windows, but because (whatever) it is actually able to keep going when it should have crashed.

Sometimes something like that only makes sense in retrospect. I was once told to test something somebody else coded. After around 6 people tested it, everybody praised it and I went "it's a shame it's that unstable. It crashes quite often while starting". It turned out to be a bug in threading, which made it crash like one in a million on single core CPUs, but it crashed 1 out of 3 on dualcore. Until that bug was found, it was a complete mystery why it was that unstable on just one computer.

I'm still not convinced with the looping argument. I still don't get how it ends up with NULL pointers :think:

AbsintheRed · May 3, 2016

Nightinggale said:
That is precisely what I would expect. The assert only informs you when something goes wrong while the continue statement alters how the game runs. You changed a crash safeguard into an assert telling "I'm about to crash and it is in this line".

Yeah, I also expected this from the 3rd assert.
But I thought the assert in addUnit will also trigger at least once.

The OS dependency is really weird and I can't think of anything other than how the system allocates memory. It isn't unlikely that there is a bug, which nearly crash older versions of Windows, but because (whatever) it is actually able to keep going when it should have crashed.

Yeah, but that's what we are suspecting since the very beginning of the thread.
Maybe it will be something obvious in retrospect, I just wish something would finally point us in the right direction where is the thing (I'm not even sure anymore that it's a real bug on my end, probably an inconsistency instead) in the mod's code that makes these memory issues happen.

olduser · May 4, 2016

Nightinggale said:
That is precisely what I would expect. The assert only informs you when something goes wrong while the continue statement alters how the game runs. You changed a crash safeguard into an assert telling "I'm about to crash and it is in this line".

But NULL is a valid return value for the getUnit function and the code should handle it.

Code:

CvUnit* getUnit(IDInfo unit)
{
	if ((unit.eOwner >= 0) && unit.eOwner < MAX_PLAYERS)
	{
		return (GET_PLAYER((PlayerTypes)unit.eOwner).getUnit(unit.iID));
	}

	[B][COLOR="Red"]return NULL;[/COLOR][/B]
}

AbsintheRed · May 4, 2016

alberts2 said:
But NULL is a valid return value for the getUnit function and the code should handle it.

Code:

CvUnit* getUnit(IDInfo unit) { if ((unit.eOwner >= 0) && unit.eOwner < MAX_PLAYERS) { return (GET_PLAYER((PlayerTypes)unit.eOwner).getUnit(unit.iID)); } [B][COLOR="Red"]return NULL;[/COLOR][/B] }

Still, I'm absolutely certain that there were no other triggers on the assert I added.
Only one, right before the crash.
I can run a few more tests if you think it was just a coincidence.

olduser · May 5, 2016

AbsintheRed said:
Still, I'm absolutely certain that there were no other triggers on the assert I added.
Only one, right before the crash.
I can run a few more tests if you think it was just a coincidence.

Adding the assert is fine and can be helpfull but the code should handle that condition as well.

I finally had time to work on a solution for your CTD i made the following change to CvSelectionGroup::groupMove.

Code:

void CvSelectionGroup::groupMove(CvPlot* pPlot, bool bCombat, CvUnit* pCombatUnit, bool bEndMove)
{
	PROFILE_FUNC();

	CLLNode<IDInfo>* pUnitNode;
	CvUnit* pLoopUnit;
	[COLOR="Red"][B]std::set<int> movedUnitIds;

	while (true)
	{
		pUnitNode = headUnitNode();
		pLoopUnit = NULL;

		while (pUnitNode != NULL)
		{
			if (movedUnitIds.empty() || std::find(movedUnitIds.begin(), movedUnitIds.end(), pUnitNode->m_data.iID) == movedUnitIds.end())
			{
				pLoopUnit = ::getUnit(pUnitNode->m_data);

				FAssertMsg(pLoopUnit != NULL, "NULL unit reached in the selection group");

				if (pLoopUnit != NULL)
					break;
			}

			pUnitNode = nextUnitNode(pUnitNode);
		}

		if (pLoopUnit == NULL)
			break;
		
		movedUnitIds.insert(pLoopUnit->getID());[/B][/COLOR]

		if ((pLoopUnit->canMove() && ((bCombat && (!(pLoopUnit->isNoCapture()) || !(pPlot->isEnemyCity(*pLoopUnit)))) ? pLoopUnit->canMoveOrAttackInto(pPlot) : pLoopUnit->canMoveInto(pPlot))) || (pLoopUnit == pCombatUnit))
		{
			pLoopUnit->move(pPlot, true);
		}

It should handle the situation that leads to the crash and a few other possible issues. Try it and report back if it works for you as well.

AbsintheRed · May 5, 2016

Wow, sounds great!
Will try it out ASAP, and report back

AbsintheRed · May 6, 2016

Well, tested with 4 different savegames.
They always went through without any crashes, and it seems there are no other issues as well :king:

At least nothing obvious, like the appearing units from before.
Hopefully there is nothing strange going on in the background either.

There were only one suspicious thing, connected to unit "teleportation" to own borders.
But those rules are strange in base BtS too, so there is a chance that it's just Firaxis inconsistency.
Should the conquering stack all move to the city before that happens? Is there a chance that your fix changed something connected to this when breaking up the group?
Will try to test it more

Anyway, that is just a very minor point
A huge thanks for the help, in the name of all the RFCE community

And of course thanks to Nightinggale and everyone else too who also chipped in.

AbsintheRed · May 6, 2016

Ehh, posted too soon :/
Reverted to a different version to test some other saves
And there are still issues with this one: http://forums.civfanatics.com/showthread.php?p=14251782#post14251782
SVN 1174, the Swedish stack is about to conquer the city.
On the beginning of the next turn the city is Norwegian, with only one city ownership logged in the logfiles.
(AFAIK if they were liberating the city, there would have been 2 ownership changes)

Of course it's also possible that this is connected to a different issue.
And there is also a slight chance that something went wrong with the save itself.

olduser · May 7, 2016

AbsintheRed said:
Well, tested with 4 different savegames.
They always went through without any crashes, and it seems there are no other issues as well
At least nothing obvious, like the appearing units from before.
Hopefully there is nothing strange going on in the background either.

There were only one suspicious thing, connected to unit "teleportation" to own borders.
But those rules are strange in base BtS too, so there is a chance that it's just Firaxis inconsistency.
Should the conquering stack all move to the city before that happens? Is there a chance that your fix changed something connected to this when breaking up the group?
Will try to test it more

Anyway, that is just a very minor point
A huge thanks for the help, in the name of all the RFCE community
And of course thanks to Nightinggale and everyone else too who also chipped in.

I would say the remaining units of the group with moves left should move to the plot. This should happen right after the first unit moved and before the old city is removed. Because the removal of the old city makes it necessary to recalculate the culture on the plots around it. That could lead to changes in the ownership of those plots and that could sometimes make it necessary to bump units to a valid plot. This and the pointer invalidation issue caused your problems in the first place.

With my changes in place it should be possible to move those remaining units without crashing.

AbsintheRed · May 9, 2016

@ alberts2: I did some more testing, in various OSs, and on various situations and savegames.
Your fix seems to be working. I mean we already knew that it solves the crash itself, but as far as I can tell there are no further issues after all.
There are a few minor differences in AI unit movement, but those doesn't affect gameplay in a bad way at all.
I'm more and more convinced that the city liberation issue I posted above is because something went wrong with the save before, and it has nothing to do with anything connected to your changes.
Will upload it to the RFCE SVN with my next commit. Thanks again for spending the time on it!

On the other hand, I noticed some strange things during my tests without the fixes in the dll.
Doesn't really matter now with the fix working, but very interesting IMO.
Even when the city conquest goes through on Win 7 without a crash, strange values might appear in seemingly unconnected things to the conqest/possible crash.
This means that while there is no CTD, so no problems with pUnitNode or pLoopUnit being NULL, some values in the memory can still go corrupted?

For example in this save: http://forums.civfanatics.com/showpost.php?p=14260300&postcount=732
The Scottish second UHV triggered with some blatantly wrong values, which also resulted on some war declarations on the player.
I'm 100% percent that it has to do something with the CTD.
Preventing the city conquest in WB, or adding the fix to the CvSelectionGroup::groupMove won't result in such strange behaviour.
It may only happen if the game goes through on Win 7 without any changes.

@ Nightinggale
Did you experience anything like this in those saves? IIRC you tested that one specifically, and it went through for you too.
(Obviously you couldn't possibly notice that strange things happened, as you don't know the mod. You only knew that no crash happened, and there are no strange things in the changelog.)
Still, at least one of the python checks got a very strange value from memory.
Do you guys have any ideas how it might happen?

Nightinggale · May 9, 2016

I searched for a crash, not strange values, so I didn't look for those at all.

AbsintheRed · May 9, 2016

Nightinggale said:
I searched for a crash, not strange values, so I didn't look for those at all.

Yeah, obviously.
I just wanted to point out that the memory issues can probably still appear for you too, even if there was no crash.
So you can do some further experiments with that save if you still want to.

olduser · May 9, 2016

AbsintheRed said:
@ alberts2: I did some more testing, in various OSs, and on various situations and savegames.
Your fix seems to be working. I mean we already knew that it solves the crash itself, but as far as I can tell there are no further issues after all.
There are a few minor differences in AI unit movement, but those doesn't affect gameplay in a bad way at all.
I'm more and more convinced that the city liberation issue I posted above is because something went wrong with the save before, and it has nothing to do with anything connected to your changes.
Will upload it to the RFCE SVN with my next commit. Thanks again for spending the time on it!

On the other hand, I noticed some strange things during my tests without the fixes in the dll.
Doesn't really matter now with the fix working, but very interesting IMO.
Even when the city conquest goes through on Win 7 without a crash, strange values might appear in seemingly unconnected things to the conqest/possible crash.
This means that while there is no CTD, so no problems with pUnitNode or pLoopUnit being NULL, some values in the memory can still go corrupted?

For example in this save: http://forums.civfanatics.com/showpost.php?p=14260300&postcount=732
The Scottish second UHV triggered with some blatantly wrong values, which resulted on some war declarations on the player.
I'm 100% percent that it has to do something with the CTD.
Preventing the city conquest in WB, or adding the fix to the CvSelectionGroup::groupMove won't result in such strange behaviour.
It may only happen if the game goes through on Win 7 without any changes.

The pUnitNode pointer was invalid after the remaining units where moved from the plot so it may have caused all kinds of strange things. This is clearly a programming error and it may have caused OOS issues as well.

Another thing that helped is the getUnit function

Code:

CvUnit* getUnit(IDInfo unit)
{
	if ((unit.eOwner >= 0) && unit.eOwner < MAX_PLAYERS)
	{
		return (GET_PLAYER((PlayerTypes)unit.eOwner).getUnit(unit.iID));
	}

	return NULL;
}

It doesn't check the eOwner of the unit it should return so if you feed it with wrong parameters a unit of another player would be returned and so on.

If we are lucky this is the only place in the dll where a pointer can be invalidated.

AbsintheRed · May 11, 2016

Alright, the fix is added to the public version of RFCE too.
While it probably doesn't matter too much, I credited you with it in the changelog, in the code itself, and in the forum too!
More importantly: I'm sure all the RFCE players are really grateful for your help :king:

This CTD was an annoyance in their gaming sessions for such a long time...

It doesn't check the eOwner of the unit it should return so if you feed it with wrong parameters a unit of another player would be returned and so on.

Do you think it would be better to have a check there too, or we should just be glad that this pointed you in the right direction?

If we are lucky this is the only place in the dll where a pointer can be invalidated.

Yeah, hopefully
I'm really optimistic ATM :cool:

Thunderbrd · Sep 19, 2016

Nightinggale's instincts are strong here. I ran into an interesting similar issue in C2C and had to investigate it deeply because my thought processes were telling me many of the same things he brought up in this discussion. The shock I had was profound when I figured it out... it's NOT a null pointer being added to the list... its a duplicate unit node being added, which then creates a null pointer when the unit is moved off that list!

alberts2 said:
alberts2 said:

Newer versions of Windows are less forgiving about certain programming errors. Take a look at this thread that CTD took place in an similar scenario.

Click to expand...

Now that was a fascinating thread! I may comment on this bug there. You should take a look at the rather simple change that this debug required. The thread you posted suggests that this may be an issue elsewhere throughout all vanilla based mods and when I found the issue I was left to wonder about that because the 'fix' was made in a function I do not suspect we've ever modified.

Mind you, the problem with the gamestate doesn't begin there. Where it actually begins is a bigger mystery. How would you get two duplicate unit IDs on the same plot in the first place? In this case, that may have even happened in rounds that took place before this save. It won't cause a problem until the unit then tries to move again (or gets killed which results in a move to an invalid plot.)

A little further investigation should be done on this still I think. Perhaps I should place an assert on CvPlot's add unit function to try and catch when a duplicate unit ID is added to the list.

The same thinking applies to the bug you were discussing in the other thread. A NULL pointer is only returned by the GetUnit function when cycling through a list, be it a plot list or a selection group list if something has gone horribly wrong. Even eliminating a unit mid-loop should still not allow such a problem since the 'next' pointer is re-established properly during the removal process. removeunit not being run while the unit is otherwise deleted would cause this problem. Most of these reasons would be painfully obvious immediately upon being incorrectly programmed.

But if you have a situation like this one where the unit was somehow added to the list twice, in this case, the plot list, you get a crash when the unit dies and the assumption there's only one instance of its ID makes it only removed once while the rest of the unit data is eliminated, making it apparent why you then get a crash the next time the unit data is asked for since the plot list is saying it should still exist given that only one of the two instances listed there was removed when the unit died.

Ultimately, my point is:

Code:

if (pLoopUnit == NULL) { // Koshling - this really isn't a valid condition I think but it's been seen // and until we can understand the underlying cause, mitigate by ignoring the // invalid unit continue; }

as a fix is terribly flawed. As Nightinggale was pointing out, this is only addressing a symptom, not addressing the disease. There are so many places in the code where node lists are used that just patching up one instance, especially on plot node lists, of a null call will simply lead to another elsewhere. There's a reason this should be safe despite getUnit having a valid NULL response, and that is that if you get that NULL response, the problem is in a corruption in the node list, and that should never happen... unless there's a big problem elsewhere. If you don't fix that initial flawed gamestate creating error, you are going to set yourself up for later crashes and other larger problems down the line.

Only by following the debugger run very carefully was I able to figure out what happened as it was so far out of the box of any presumptions I might have had coming into the issue that I had to be forced to recognize what was actually taking place by watching the code process through every single step. Took me a while to narrow in on it. I would have thought it was impossible what was actually taking place.

In this case, the effort led to deeper analysis of what was happening here:

Code:

void CvPlot::removeUnit(CvUnit* pUnit, bool bUpdate) { CLLNode<IDInfo>* pUnitNode; pUnitNode = headUnitNode(); while (pUnitNode != NULL) { if (::getUnit(pUnitNode->m_data) == pUnit) { FAssertMsg(::getUnit(pUnitNode->m_data)->at(getX_INLINE(), getY_INLINE()), "The current unit instance is expected to be at getX_INLINE and getY_INLINE"); m_units.deleteNode(pUnitNode);

at m_units.deleteNode(pUnitNode);

Code:

template <class tVARTYPE> inline CLLNode<tVARTYPE>* CLinkList<tVARTYPE>::deleteNode(CLLNode<tVARTYPE>* pNode) { CLLNode<tVARTYPE>* pPrevNode; CLLNode<tVARTYPE>* pNextNode; FAssert(pNode != NULL); pPrevNode = pNode->m_pPrev; pNextNode = pNode->m_pNext; if ((pPrevNode != NULL) && (pNextNode != NULL)) { pPrevNode->m_pNext = pNextNode; pNextNode->m_pPrev = pPrevNode; }

When pNextNode also showed up as the same player and unit ID I did a double-take. Could there REALLY be a duplicate unit ID Node??? Sure enough there was. So the easiest 'fix' to correcting this was to back up and adjust removeUnit to:

Code:

while (pUnitNode != NULL) { if (::getUnit(pUnitNode->m_data) == pUnit) { FAssertMsg(::getUnit(pUnitNode->m_data)->at(getX_INLINE(), getY_INLINE()), "The current unit instance is expected to be at getX_INLINE and getY_INLINE"); m_units.deleteNode(pUnitNode); /*break;*///TBDebug: It has been found possible that a node may have somehow duplicated itself and thus the plot may have more than 1 of the node. If these duplicates aren't destroyed, we get a crash as well. Best case would be to somehow catch how this gets setup but in the debugging case that discovered this, it likely happened long beforehand. Jumptonearestvalidplot is the suspected culprit as it was an exile that showed the issue. } //else //{ pUnitNode = nextUnitNode(pUnitNode); //} }

I'd bet quite a bit that the same thing (a duplicate ID) was happening in the CVSelection remove Unit command in their mod.

Again though, now I need to figure out what happened to create this duplication!

That different OPs and compilers handle the issue differently isn't too surprising really. It's a matter of a difference in what would happen in the loop when it hits a null unit pointer ejecting from the while sequence rather than holding on and hitting a crash due to a null reference. Either that or the gamestate altering problem in their case was based on a random being generated that was varying on one system to the next.

Much of this response in our own bugs and crashes thread in C2C was directed at the discussion here after Alberts2 brought it to my attention.

I'm still on the warpath with this. I need to find out how the duplicate was created in the first place. But clearly the vanilla programmers knew that there should be the assumption that the node lists are accurate to the units that exist there and wanted a crash to let them know when it somehow went 'off' rather than just wanting to patch over an incorrect gamestate issue. I think that was probably smart of them really.

As you can see, though, for now, I've enabled the removal code to consider the possibility that there may be duplicates and to continue looking for that possibility to remove them as well if they should exist. Sadly, this may allow for a slightly longer turn time so its still worth finding the origination of the issue. I'm hoping I can assume its not some kind of memory flaw issue.

Some more discussion between Alberts and I that bears pertinence to this thread's bug:

alberts2 said:
alberts2 said:

That was exactly the mistake because the pUnitNode pointer is invalid as soon as the unit with the id that pointer is pointing to is removed from that plot for whatever reason during the loop. The memory segment that pointer points to is deallocated then the unit gets removed from the plot. From that point on that memory segment can be overwritten with other data and the pointer points to something else.

But this problem only became visible in newer windows versions because they have more aggressive memory handling to reduce the memory usage.

The fix that i posted in that thread works around the problem in that function. But there can be lots of other possible issues and duplicates can be possible too.

Click to expand...

Theory:
Given that the node list is readjusted properly when a unit is removed from the list, the problem is not so much that a unit was removed mid-loop, but that the loop was programmed incorrectly, which you proved with your adjustment. However, you did a very interesting trick to proxy the loop to get around it. But could it be as simple as this...

The original loop code started in this manner:

Code:

while (pUnitNode != NULL) { pLoopUnit = ::getUnit(pUnitNode->m_data); pUnitNode = nextUnitNode(pUnitNode);

By putting this line : pUnitNode = nextUnitNode(pUnitNode);
immediately after : pLoopUnit = ::getUnit(pUnitNode->m_data);
we've already switched pUnitNode over to the NEXT node before the actual processing in the loop takes place.

So could their error have been more simply addressed by placing 'pUnitNode = nextUnitNode(pUnitNode);' at the END of the While brackets, AFTER the processing has already taken place?

I'm almost sure I've had to make this change a time or two already to fix some previous crashes and while it made sense at the time, I didn't bring it up in the forums because I would have assumed it was a pretty basic issue.

Perhaps if I see another crash like that I'll try to address it in that way before freaking out over it otherwise.

AbsintheRed · Sep 20, 2016

I'd bet quite a bit that the same thing (a duplicate ID) was happening in the CVSelection remove Unit command in their mod.

Again though, now I need to figure out what happened to create this duplication!

That different OPs and compilers handle the issue differently isn't too surprising really. It's a matter of a difference in what would happen in the loop when it hits a null unit pointer ejecting from the while sequence rather than holding on and hitting a crash due to a null reference. Either that or the gamestate altering problem in their case was based on a random being generated that was varying on one system to the next.

It can easily be that it was also a duplicate node in RFCE.
Surely it also happened when the game tried to teleport some units because of border change on city conquest.

Thunderbrd · Sep 20, 2016

AbsintheRed said:
It can easily be that it was also a duplicate node in RFCE.
Surely it also happened when the game tried to teleport some units because of border change on city conquest.

Alberts has assured me it was the removal of some units from the list during the list's loop that caused the problem here. That's a little different than the issue I was dealing with.

Nightinggale · Sep 20, 2016

Thunderbrd said:
Alberts has assured me it was the removal of some units from the list during the list's loop that caused the problem here.

Generally speaking, looping a list and removing elements while looping is asking for trouble. It's doable, but you do have to consider what happens if you do so. For instance say you loop using indexes. You reach index 2 and remove it and then you increase the counter. What was 3 before becomes 2 and since the counter is increased, the new 2 isn't checked in the loop. The end condition might assume 10 elements, but after one is removed, only 9 remain, resulting in a crash when accessing the 10th. There are plenty of things, which can go wrong unless you consider what you are doing quite carefully.

I haven't looked at this specific loop and can't tell precisely what goes wrong, but it would seem that it leaves NULL pointers behind. In theory that's ok if you loop afterwards and remove the NULL pointers, but if you can do that, then why not remove them before/when they are added?

Also good job at finding the issue. It took a while, but better late than never. As I wrote in one of the posts, some bugs makes a whole lot more sense in retrospect.

Strange CTD on cityconquest

Deity

Deity

Deity

Deity

Emperor

Deity

Emperor

Deity

Deity

Deity

Emperor

Deity

Deity

Deity

Emperor

Deity

C2C War Dog

Deity

C2C War Dog

Deity

Similar threads