Multi Player bugs and crashes - After the 16th of May

Ok, if that didn't work it means you're probably hitting an OOS issue that isn't related to the city changes.

Edit - your logs show it going out of sync during a unit move, so it's definitely not the city stuff. However, I cannot diagnose further without the RandomLogger logs, which you have not included (they should be called 'RandomLogger - Player 0 - Set <some number>', and similarly for Player 1).

That was what our last OOS appeared to be a result of. Something in the pathing mechanism perhaps. Without posting all the logs, I'll explain what happened.

At first I selected to move a pair of grouped units to a plot within my wife's borders. They took one step of two and stopped, presumably because they'd recognized a new animal threat exposed. I then simply re-ordered them to that spot since I wanted them to ignore it and blammo OOS. On her machine she could see the units in the spot where they first stopped but on mine they had moved on.

I dunno if that helps at all... There isn't any randomness in what path the unit chooses is there? Random logs may prove unhelpful as a result.
 
Ok, I'm looking at the random logs on the logs I gave you and it looks like:
1) one of us may have the multiple religion spread bug option on while the other does not (somehow). GRRRR... bug options are so frustrating for multi-player!!! It's either that or the multiple religion spread was handled differently on one computer to the next during the multi-threading process (and that could be the whole issue really.)

2) Somehow it seems to stay in synch despite the above. UNTIL it encounters a city evaluation where all hell seems to break loose then somehow patches itself back together again... honestly all Random logs on OOS errors I've seen before collapse completely and differ on ALL results past the first one that doesn't match so there's definitely something going on there with the multi-threading making it possible for many results to match despite some going haywire.

3) It looks like the player 1 side is doing more evaluations than the player 2... but then again it COULD be doing all of the same and just not matching up when they register - and THAT would certainly be down to something in the multi-threading reporting in an async manner. Unfortunately I don't think it's going to be helpful in telling us where that takes place except to suggest it's the delivery point that varies and that alone is causing the issue.

I suppose a line count on each could be helpful to see if any processes are not taking place on one computer vs the other.
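A line count plus an order-insensitive comparison could be sketched roughly as below. This is a minimal Python sketch, not part of the mod: the sample log lines are invented for illustration, and it assumes each log entry is a single line of text. Because it compares the two logs as multisets, the order in which threads happened to write their entries does not matter.

```python
from collections import Counter

def compare_logs(lines_a, lines_b):
    """Order-insensitive comparison of two random logs given as lists of lines."""
    count_a = Counter(line.strip() for line in lines_a if line.strip())
    count_b = Counter(line.strip() for line in lines_b if line.strip())
    only_a = count_a - count_b  # entries machine A logged more often than B
    only_b = count_b - count_a  # entries machine B logged more often than A
    return sum(count_a.values()), sum(count_b.values()), only_a, only_b

# Invented sample entries; the real RandomLogger text will look different.
log_p0 = ["Combat roll 37", "Religion spread 12", "City eval 5"]
log_p1 = ["City eval 5", "Combat roll 37"]

total_a, total_b, only_a, only_b = compare_logs(log_p0, log_p1)
print("line counts:", total_a, total_b)
print("only on player 0:", dict(only_a))
print("only on player 1:", dict(only_b))
```

With real logs, the lists would come from reading each RandomLogger file line by line. Entries that show up on only one side point at processing that happened on one machine and not the other.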


In looking at the OOS error log that does match to the other, right away we see we're starting with two differing sets of random numbers so we lost synch in the random structure somewhere. Next, we see this affected a combat result significantly - a tracker is injured on one and uninjured on the other. A couple more like that then we also see that it affected the decision of Player 5 who now has a building on one computer's report that the other doesn't show. And on one it shows a city that isn't building anything while on the other it's working on a knowledge inheritance.

It goes on and on but the really tough part here is that we CANNOT isolate WHEN the OOS takes place because the multi-threading reports the random results in an out of order manner on each computer. This is about all I can really make of it.
 
That was what our last OOS appeared to be a result of. Something in the pathing mechanism perhaps. Without posting all the logs, I'll explain what happened.

At first I selected to move a pair of grouped units to a plot within my wife's borders. They took one step of two and stopped, presumably because they'd recognized a new animal threat exposed. I then simply re-ordered them to that spot since I wanted them to ignore it and blammo OOS. On her machine she could see the units in the spot where they first stopped but on mine they had moved on.

I dunno if that helps at all... There isn't any randomness in what path the unit chooses is there? Random logs may prove unhelpful as a result.

If this is human-ordered pathing then there is no randomness involved, and it must be something wrong with the way commands are being sent - sounds like the second command is not syncing for some reason. This definitely will not have been modified by any recent changes.

I'm going to modify the random logs as follows (at least current plan):

1) Make them tabular (tab separated records)

2) Add an extra field for the generator stream they originate from (now that we have more than 1 stream due to the multi-threaded code's approach to keeping things in sync). The fields will then be:

2.1) Stream name ('Global' or the name of a city)
2.2) Reason for number being needed (the current text you get)
2.3) The range it was asked for
2.4) The number generated
2.5) Line number (shows origination order)

All but (2.1) (which is the new data) are there now (well, (2.5) is implicit currently). The change to tabular format will allow the log to be read into a spreadsheet, which can then easily be sorted by stream name to allow direct comparison.
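The same sort-by-stream comparison could also be scripted instead of done in a spreadsheet. The sketch below is only an illustration in Python under assumptions: the column order follows the field list above, the sample rows and the city name 'Rome' are invented, and the final log layout may differ. It groups each log by stream and reports, per stream, the index of the first entry that differs.

```python
import csv
import io
from collections import defaultdict

# Assumed column order, taken from the planned field list (2.1 to 2.5).
FIELDS = ["stream", "reason", "range", "value", "line"]

def load_by_stream(text):
    """Group tab-separated log rows by stream, keeping per-stream order."""
    streams = defaultdict(list)
    reader = csv.DictReader(io.StringIO(text), fieldnames=FIELDS, delimiter="\t")
    for row in reader:
        streams[row["stream"]].append((row["reason"], row["range"], row["value"]))
    return streams

def first_mismatch(streams_a, streams_b):
    """Return {stream name: index of first differing entry} for streams that differ."""
    result = {}
    for name in sorted(set(streams_a) | set(streams_b)):
        a, b = streams_a.get(name, []), streams_b.get(name, [])
        for i, (x, y) in enumerate(zip(a, b)):
            if x != y:
                result[name] = i
                break
        else:
            if len(a) != len(b):
                result[name] = min(len(a), len(b))
    return result

# Invented sample data: same streams, written in a different global order.
log_a = "Global\tCombat\t1000\t412\t1\nRome\tReligion spread\t100\t7\t2\nGlobal\tCombat\t1000\t88\t3\n"
log_b = "Rome\tReligion spread\t100\t7\t1\nGlobal\tCombat\t1000\t412\t2\nGlobal\tCombat\t1000\t90\t3\n"

mismatches = first_mismatch(load_by_stream(log_a), load_by_stream(log_b))
print(mismatches)
```

The line-number column is deliberately left out of the comparison, since it reflects the global write order, which is exactly what varies between machines.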
 
Rev. 5537
OOS every turn; I think it's related to Koshling's latest "end turn" optimizations.

View attachment 352261

On a side note:

-No revolutions.
-Identical user settings for all players.
-cleared caches with "shift" on game launch for all players.

Are there any other particular tricks to avoid OOS?

Edit: Will try again with pipeline threads set to 1 in A_New_Dawn_Globals, further reports incoming.
 
I think you can disregard this error report. On further inspection it looks like there might be something wrong with my installation of C2C, so I'm downloading it anew entirely from SVN before trying again.
 
If this is human-ordered pathing then there is no randomness involved, and it must be something wrong with the way commands are being sent - sounds like the second command is not syncing for some reason. This definitely will not have been modified by any recent changes.
Yeah, I think this one has probably been around since the changes made in the pathing engine in the first place. It's not terribly common so maybe it has something to do with a difference made in the state of the unit by the way the unit was stopped? The same could then be taking place from an AI's experience as well but not due to a random check at all.



Good ideas I'd say. It'd certainly help!
 
Yeah, I think this one has probably been around since the changes made in the pathing engine in the first place. It's not terribly common so maybe it has something to do with a difference made in the state of the unit by the way the unit was stopped? The same could then be taking place from an AI's experience as well but not due to a random check at all.

I do not believe pathing engine changes could really cause this either. Pathing just leads to a particular destination being pushed as an order. If the execution of the GOTO mission is interrupted (e.g. - by an enemy unit) the effect of that is all inside of CvSelectionGroup::continueMission() which is unchanged with the pathing engine changes.

More likely a much older bug that doesn't cope with interrupted mission execution is at work I would say. If you auto-save each turn then it sounds to me like the situation you described as to when it occurred (in your last case anyway) would be reproducible...
 
Probably so. Last time we checked autosaving every turn doesn't work either (is bugged itself.) But we can try it again.

Do you have the two computer systems to run through an OOS situation then?
 
It is also possible to run C2C twice on the same PC.
 
But wouldn't you have problems linking the two open games to each other in an IP connection where they'd have the same IP?
 
Probably so. Last time we checked autosaving every turn doesn't work either (is bugged itself.) But we can try it again.

Do you have the two computer systems to run through an OOS situation then?
I do not
It is also possible to run C2C twice on the same PC.
How does one achieve this? (a simple re-run of the main executable doesn't start a second instance, at least for me)
But wouldn't you have problems linking the two open games to each other in an IP connection where they'd have the same IP?
Given it's (I assume?) a fixed port this could be an issue
 
Presumably only the game hosting computer binds to a port, though. The other doesn't need to listen for incoming connections.
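As a general illustration of that point (not a statement about how the game's own networking layer works, which is not confirmed here), the Python sketch below runs a listening socket and a connecting socket on the same machine. Only the listening side binds a chosen port; the connecting side gets a different, OS-assigned local port. The demo uses TCP and port 0 (meaning "any free port") purely as assumptions for the example.

```python
import socket

# "Host" side: binds a port and listens. Port 0 asks the OS for any free
# port so the demo cannot clash with a running program; a real game host
# would bind its own fixed port instead.
host = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
host.bind(("127.0.0.1", 0))
host.listen(1)
host_port = host.getsockname()[1]

# "Joining" side: never binds explicitly. On connect, the OS assigns it
# an ephemeral local port that differs from the host's port.
client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
client.connect(("127.0.0.1", host_port))
conn, peer = host.accept()
client_port = client.getsockname()[1]

print("host listens on port", host_port)
print("client connected from port", client_port)

client.close()
conn.close()
host.close()
```

So two processes sharing one IP address is not a problem in itself for this kind of connection; a clash would only arise if both processes tried to bind the same fixed port.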

edit: Or processes, in this case.
 
I'm adding a bunch of logs from this morning; I hope there's something useful in there.
 

Attachments: Player2.zip (1,022.6 KB), Logs.zip (1.8 MB)

They are useful, but not to the extent that I can see the problem.

@AIAndy/TB - the random logs show turn 651 going OOS because a combat requires an extra round on one machine relative to the other. I cannot see an issue in the resolveCombat() method that might explain this, so the implication (I think) is that the units came into the fight with differing stats (health or promotions). However, I cannot think of a convincing explanation for how that situation could arise without an earlier mismatched roll. Any ideas...?

PS - one thing I'm slightly suspicious of is that some promotion check triggered by an async action (a user looking at a unit's promotion list or something) has invalidated a promotion as a result (in an unsynced manner). I don't have any concrete reason for this, but we have had a lot of issues with promotions being lost unexpectedly lately - if a check is triggerable from an async action (basically a query of the promotion state rather than an explicit attempt to change it) this might account for things.
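To show why differing starting stats alone would be enough, here is a toy model in Python. It is emphatically not the real resolveCombat(): the damage value, hit rule, health numbers and roll sequence are all invented. Both "machines" consume the same scripted roll sequence, as a synced generator would provide, but on machine B the defender enters the fight with less health.

```python
class ScriptedRolls:
    """Stand-in for a synced random stream: both machines see the same rolls."""
    def __init__(self, rolls):
        self._rolls = list(rolls)
        self.drawn = 0

    def next(self):
        value = self._rolls[self.drawn]
        self.drawn += 1
        return value

def resolve_combat_toy(rng, attacker_hp, defender_hp, damage=20):
    """Toy combat loop: a roll below 50 hurts the defender, otherwise the attacker."""
    rounds = 0
    while attacker_hp > 0 and defender_hp > 0:
        rounds += 1
        if rng.next() < 50:
            defender_hp -= damage
        else:
            attacker_hp -= damage
    return rounds

ROLLS = [10, 70, 20, 30, 80, 40, 25, 90, 5]  # invented roll sequence

rng_a = ScriptedRolls(ROLLS)
rng_b = ScriptedRolls(ROLLS)
rounds_a = resolve_combat_toy(rng_a, attacker_hp=100, defender_hp=100)
rounds_b = resolve_combat_toy(rng_b, attacker_hp=100, defender_hp=80)  # unsynced health loss

# Whatever needs a random number next now receives different values.
next_a = rng_a.next()
next_b = rng_b.next()
print(rounds_a, rounds_b, next_a, next_b)
```

In this toy run machine A needs one more combat round than machine B, and from then on every later draw from the stream is offset by one, even though no roll ever "mismatched" before the fight.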
 
Just one quick question for my understanding: would it be possible to have a "safety net" for multiplayer games, i.e. a command by which the machine that went OOS compares the differing data and accepts the other machine's version, overwriting its own? That way one machine would always hold the "dominant" data whenever an unidentified OOS occurs, and the other would comply.

A sort of in-game "master-slave autosync" without a reload?
 
One user looking over their promotions shouldn't trigger any losses. If they select an earned promotion, that could... and a recalc could. But either way that would be something that would stay in synch wouldn't it? (unless it has something to do with once a promotion is taken the ensuing check then stays asynch and of course there's no message design in that since I would have assumed this part of the selection process was back in synch... hmm... That might be something to look into perhaps.)

But if that were the case, I'd think this sort of issue would come up more often... I could be wrong. However another possible issue could be, again, differing bug settings. There's a few bug options, opportunity fire, terrain damage, archery bombard, Storms to name a few, that could account for one unit having more or less strength on one or the other.

Events can also commonly be culprits. Could one of the events that either gives out promos or heals a unit in the field be to blame?
 