What Koshling has been doing lately

Koshling · May 17, 2013

As you may have noticed, I have not been very active on the SVN for a few weeks.

This is because, shortly before the release of V30, I decided to spend some time looking into multi-core enabling aspects of the DLL. In doing this I chose to start with the path generator. This choice was for two reasons:

It's a relatively isolated piece of code, with few external dependencies
It was the single largest cost in turn processing for mature games

Although I have by no means finished working on it, I now have a sufficiently stable version to push to SVN, so today I have done just that, with a fairly extensive revision of the path generation code. Externally the main thing you should see is improved performance, but there are some bug fixes in there too.

In terms of my original intent I have not (yet) been entirely successful, but in terms of overall performance I'm reasonably happy, so I'll start there.

My primary benchmark is a save that Talin provided for a bug fix a month or so ago. This is a fairly mature game on a fairly large map. On my machine, the V30 turn time for this save is about 6 minutes, of which about 2 minutes 40 seconds is path generation. The version just pushed executes the pathing calculations in about 54 seconds, so the net result is approximately a 4 minute turn. More gratifyingly the gain is all in pathing, so that's really an improvement from 155 seconds to 54 - a factor of 3 roughly.

Although I do plan to continue with more work on the pathing (you'll see why later), I intend to move onto other aspects of turn time next (city processing is now the dominant factor, so it's next for some intensive scrutiny). This cycle (so until V31) I don't plan to do anything except improve performance and fix major bugs.

So anyway, here's a little more detail, since things did not go entirely as I had expected, or initially hoped!

My first attempt had three big problems:

1) Because I was restricting my work to path generator internals, the thread switching/scheduling latency was comparable to (or larger than) the time to generate a typical path (on the order of 1 millisecond). This meant that it was extremely hard to get performance gains from multiple threads, since the startup time of a background thread processing any request was comparable to the entire time to complete generation of a path. Although I was able to get around this to some degree, the net effect is that locking costs quickly chew up any potential benefit, and scaling with thread count was extremely poor.

2) Multi-threading introduces a degree of non-determinism in the order things get done, and unless very severe constraints are imposed, this results in some non-determinism in the path generated when there are multiple optimal paths. In a multi-player environment this would lead to OOS errors, so it's not acceptable (the worst case, of course, is that you can disable multi-core processing in MP games, but that's not ideal).

3) Most seriously of all, debugging a more complex implementation, and especially a multi-threaded one is a lot harder. I quickly found that errors were occurring that I couldn't reliably reproduce when I ran long turns.
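A rough back-of-envelope model shows why problem 1 bites so hard. The sketch below assumes an idealised case in which a path's work splits perfectly across threads and every request pays one fixed dispatch latency. The function name and the model itself are illustrative assumptions, not measurements from the DLL.

```python
def modelled_speedup(path_ms, latency_ms, threads):
    """Speedup over single-threaded execution under an assumed, idealised model.

    path_ms:    time to generate one path on a single thread
    latency_ms: fixed cost of waking/scheduling the background threads
    threads:    number of threads sharing the work (assumed to split perfectly)
    """
    parallel_ms = latency_ms + path_ms / threads
    return path_ms / parallel_ms


# A ~1 ms path with ~1 ms of scheduling latency, as described above.
for threads in (2, 4, 8):
    print(threads, round(modelled_speedup(1.0, 1.0, threads), 2))
```

Under these assumptions the modelled speedup stays below 1 however many threads are added, because the latency alone costs as much as the whole single-threaded path. For a hypothetical 100 ms unit of work the same latency would be almost negligible.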

At that point I decided to step back and construct an extensive self-testing framework for the path generation. This works as follows:

i) I added back the original path generator code, and kept the new version as a separate class, so I could instantiate both at once.
ii) I added a random-number (with fixed seed) driven path test case generator. This is triggered (in the debug version) by pressing ctrl-alt-P and generates 10000 random paths against the current map in groups of 30, each group beginning at the same random (land) starting point and each path in the group choosing a random end point within a distance of 30 from the start (choosing totally random end points from anywhere on the map would produce far too many paths that could not be completed - limiting the range both simulates better the kind of paths the game generates normally, and increases the proportion that are possible to path)
iii) Each group of paths is generated with each path generator (the 'known quantity' old one, and the under-test new one). The code flags up any discrepancies in ability to generate the path, total length of the generated path, and total cost of the generated path. Any such discrepancy is considered a test failure

Because this is a deterministic (but randomized) set of paths, it means that every map I can use will give a different test set, but for any given save the same set will always be generated.
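The shape of that test harness can be sketched roughly as follows. This is a minimal illustration in Python rather than the DLL's C++, and everything in it is an assumption: the grid map, the names `reference_path_cost`, `new_path_cost` and `run_self_test`, and the use of plain Dijkstra versus A* as stand-ins for the old and new generators. It compares reachability and total cost only; comparing path length as well would require pinning down tie-breaking between equal-cost paths.

```python
import heapq
import random


def make_map(width, height, seed):
    """Hypothetical terrain grid: 0 = impassable, 1..3 = cost to enter."""
    rng = random.Random(seed)
    return [[rng.choice([0, 1, 1, 2, 3]) for _ in range(width)]
            for _ in range(height)]


def neighbours(grid, x, y):
    for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        nx, ny = x + dx, y + dy
        if 0 <= ny < len(grid) and 0 <= nx < len(grid[0]) and grid[ny][nx] > 0:
            yield nx, ny


def search(grid, start, goal, heuristic):
    """Best-first search; returns the cheapest path cost, or None if unreachable."""
    best = {start: 0}
    frontier = [(heuristic(start, goal), 0, start)]
    while frontier:
        _, cost, node = heapq.heappop(frontier)
        if node == goal:
            return cost
        if cost > best[node]:
            continue  # stale queue entry
        for nxt in neighbours(grid, *node):
            new_cost = cost + grid[nxt[1]][nxt[0]]
            if new_cost < best.get(nxt, float("inf")):
                best[nxt] = new_cost
                heapq.heappush(frontier, (new_cost + heuristic(nxt, goal), new_cost, nxt))
    return None


def reference_path_cost(grid, start, goal):
    """Stand-in for the 'known quantity' generator: plain Dijkstra."""
    return search(grid, start, goal, lambda a, b: 0)


def new_path_cost(grid, start, goal):
    """Stand-in for the generator under test: A* with a Manhattan heuristic."""
    return search(grid, start, goal, lambda a, b: abs(a[0] - b[0]) + abs(a[1] - b[1]))


def run_self_test(grid, seed=12345, total=10000, group_size=30, max_range=30):
    """Generate fixed-seed random path cases in groups and compare both generators."""
    rng = random.Random(seed)  # fixed seed: same save, same test set
    land = [(x, y) for y, row in enumerate(grid) for x, c in enumerate(row) if c > 0]
    cases, failures = [], []
    while len(cases) < total:
        start = rng.choice(land)  # one random land start per group
        for _ in range(group_size):
            gx = min(max(start[0] + rng.randint(-max_range, max_range), 0), len(grid[0]) - 1)
            gy = min(max(start[1] + rng.randint(-max_range, max_range), 0), len(grid) - 1)
            expected = reference_path_cost(grid, start, (gx, gy))
            actual = new_path_cost(grid, start, (gx, gy))
            cases.append((start, (gx, gy), expected))
            if expected != actual:  # any discrepancy is a test failure
                failures.append((start, (gx, gy), expected, actual))
    return cases, failures


grid = make_map(40, 40, seed=7)
cases, failures = run_self_test(grid, total=300)
print(len(cases), "cases,", len(failures), "discrepancies")
```

Because both the map and the case generator are seeded, rerunning against the same map reproduces exactly the same cases, which is what makes a reported discrepancy repeatable.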

This enabled me to fairly rigorously test changes as I made them. It also served to identify bugs in the existing implementation (of which I found about half a dozen), since when differences occurred, they sometimes turned out to be bugs in the original that didn't manifest in the new version.

This process also gave me a much better feel for the circumstances and conditions that were occurring during path generation, and sparked a range of ideas for optimizations. Since it is way easier to debug a single-threaded implementation than a multi-threaded one (and since the latency issues were making good multi-core scaling hard to achieve), I decided to concentrate on making a number of single-threaded optimizations first, and then to multi-core enable the resulting optimized generator.

In the end the single-threaded optimizations were so successful that I have checked them in (that's basically today's SVN push), and my expectation is that it will turn out to not be worthwhile to multi-thread at this level (I expect to get maybe 20% more gain from throwing an entire second core at it, which is not worthwhile scaling, given that it will reduce the opportunity for the first core to turbo up on modern machines). However, I plan to continue and verify that, before deciding that multicore enablement at this level is a dead end.

In practice, to get it scaling well across multiple cores, I think the parallelism has to be undertaken at a higher level (generating multiple paths in parallel rather than several threads working on any one path). Unfortunately that is a LOT more complex to achieve (needs much more locking, and involves much less contained areas of code), and itself may have limited scaling, because currently the cost-trees generated by one path generation attempt are re-used in the next (for the same group), and parallelizing this will result in potentially significantly more overall work, as multiple parallel attempts work in sub-optimal directions because they don't have predecessor-derived information to guide them.

This latter effect may mean that good scaling cannot be achieved without processing multiple groups simultaneously, which again extends the scope of locking needed, and (at least naively) introduces race conditions that could result in multiple groups trying to address the same goal (those are all solvable by yet more locking of course, but the complexity goes up at each step). Any attempt to parallelize at this level will also be VERY, VERY hard to make OOS-safe in an MP game environment (because groups will be processed in a non-deterministic order, and since the result of one group's decision making constrains that of others, that will tend to make the overall result non-deterministic). This non-determinism will also mean that bugs will have less reproducibility.

My intended next steps are as follows:

Switch to looking at the optimization of city processing. Currently I think this CAN be done effectively in parallel, but I also have some ideas for significant speed ups that do not involve multi-threading. I expect to make significant gains here before V31, and since this is (now) the dominant cost for mature-game turn processing, I expect further gains of several 10s of percent in turn times over and above the version pushed today.
More pathing optimizations, including experimental verification that low-level parallelism isn't a sufficient gain to justify its use.
Another approach to trying to use a more modern compiler for parts of the code (initially small parts, but that is just proof-of-concept), so as to allow compiler optimization for more modern processor architectures than the current (2003) compiler performs (this is known to be worth something of the order of a doubling of overall performance if we can achieve it)

ls612 · May 17, 2013

I'll focus on one part of this for now.

Koshling said:
Another approach to trying to use a more modern compiler for parts of the code (initially small parts, but that is just proof-of-concept), so as to allow compiler optimization for more modern processor architectures than the current (2003) compiler performs (this is known to be worth something of the order of a doubling of overall performance if we can achieve it)

I thought you tried this in January and determined that it was not possible to do this due to binary incompatibilities. Something about passing objects from the DLL to the engine. Has something changed since then or am I missing something obvious?

Koshling · May 17, 2013

ls612 said:
I'll focus on one part of this for now.

I thought you tried this in January and determined that it was not possible to do this due to binary incompatibilities. Something about passing objects from the DLL to the engine. Has something changed since then or am I missing something obvious?

Correct. But I had another idea I want to try...

Thunderbrd · May 17, 2013

I've read this and understood the majority of it. All I can really say is you're doing a wickedly awesome job on improving the mod (should I say the GAME in this context?)!!!

I'm VERY happy to see how thoroughly you're giving the determinism consideration for multi-play! I know we have a ton of OOS's to fix there and I'm hoping we can soon get on top of those. So introducing a hopelessly impossible new factor to resolve into the mix is a big concern with this. But I can clearly see that you're making every effort to respect this side of the game and it's extraordinarily appreciated! I'd wonder how much you could really improve the processing speed on multi-play where pathing is concerned anyhow, given that all that pathing processing is done while players are taking their turns as it is. So if it cannot be improved there with multi-threading, I don't suspect on THIS matter it'd be much of a loss. Perhaps other issues that still get processed for the AI in the in-between turn times could potentially use some benefit there, but what would then happen if one computer on the network has multiple cores while another still does not? (Not that this would be a concern much even now as most single processor only computers still in use probably already find C2C too heavy as it is...) But what about different #'s of cores between systems on the connection? Yeah... that sounds like it could be a bit of a mess trying to multi-thread and maintain determinism. But I can't say I fully understand a whole lot on that subject so this only comes from my very limited perspective on the matter.

I know that my wife got frustrated not far into the game on single player with the turn time delays and I'm sure that there's a lot of players that would feel the same way so these kinds of turn time improvements end up being perhaps the most tremendous benefit we can possibly provide for C2C! Well done! Looking forward to more!

Hydromancerx · May 17, 2013

Keep up the great work Koshling! I hope something works out. However that is still awesome news for the stuff you pushed to the SVN! :goodjob:

AIAndy · May 17, 2013

About multithreading:
I think the best level to introduce multithreading in the AI is on the information gathering level. A lot of the AI code consists of calculating values of different possible decisions and then making a decision based on that.

As long as you don't actually act while you gather that information, there is no locking needed as the underlying data is not changed. Afterwards you can then decide based on all the return values of the different threads. This also means that the result does not depend on the order of thread execution and is therefore deterministic.

It might require some changing of the decision structure to get this done properly.
And it reacts badly to caches that are not thread specific. If the calculation of the cache value is deterministic though, and therefore not dependent on when during the threaded part the cache result is calculated, locks will still allow the cache structure to be used.
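A minimal sketch of that gather-then-decide idea, in Python for brevity. The `evaluate` scoring function, the option records and the state dictionary are all hypothetical stand-ins; the only point is that the worker threads read the state without changing it, and the final decision is made from results held in input order, so thread timing cannot affect the outcome.

```python
from concurrent.futures import ThreadPoolExecutor


def evaluate(state, option):
    """Hypothetical read-only scoring of one possible decision."""
    return sum(state[key] * weight for key, weight in option["weights"].items())


def choose(state, options, workers=4):
    # Gathering phase: score every option in parallel. Nothing is acted on
    # here, so the underlying state is never modified and needs no locks.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        scores = list(pool.map(lambda opt: evaluate(state, opt), options))
    # Decision phase: single-threaded. Executor.map returns results in input
    # order, and ties are broken by lowest index, so the choice is deterministic.
    best = max(range(len(options)), key=lambda i: (scores[i], -i))
    return options[best]["name"]


state = {"food": 3, "hammers": 2}
options = [
    {"name": "farm", "weights": {"food": 2}},
    {"name": "mine", "weights": {"hammers": 3}},
    {"name": "scout", "weights": {}},
]
print(choose(state, options))
```

Here "farm" and "mine" score the same, and the explicit tie-break is what keeps the answer identical regardless of how many worker threads are used.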

Zain · May 17, 2013

I read through it and I have to say that's truly amazing!
I can't comment much on the technical details, but on usability side of things I agree with Thunderbrd:

Thunderbrd said:
I know that my wife got frustrated not far into the game on single player with the turn time delays and I'm sure that there's a lot of players that would feel the same way so these kinds of turn time improvements end up being perhaps the most tremendous benefit we can possibly provide for C2C! Well done! Looking forward to more!

C2C is very large and still expanding, which is great. But at this point it's so large that you have to take a step back and view it from a wider angle; you have to improve overall usability to accommodate such a tremendous amount of content. And I think that's exactly what you've been doing. Improving load times, turn times, UI and other usability aspects gives the user an opportunity to enjoy the already rich content.

Koshling · May 17, 2013

AIAndy said:
About multithreading:
I think the best level to introduce multithreading in the AI is on the information gathering level. A lot of the AI code consists of calculating values of different possible decisions and then making a decision based on that.

As long as you don't actually act while you gather that information, there is no locking needed as the underlying data is not changed. Afterwards you can then decide based on all the return values of the different threads. This also means that the result does not depend on the order of thread execution and is therefore deterministic.

It might require some changing of the decision structure to get this done properly.
And it reacts badly to caches that are not thread specific. If the calculation of the cache value is deterministic though, and therefore not dependent on when during the threaded part the cache result is calculated, locks will still allow the cache structure to be used.

In principle yes, but it would require a fairly major change, since the AI structure is basically to try things in priority order, so evaluating alternate actions will often result in evaluating things that turn out to be lower priority than the one actually executed, which means you're doing more work than you did when processing serially. Depending on how much more work you wind up doing, it might simply not be worthwhile at all. Conversely, if you evaluate group actions in parallel there is very little lost effort, but a degree of determinism is lost.

There is probably a compromise where you evaluate multiple groups in parallel and queue up the results, actioning them serially (in deterministic group ordering), and recalculating when a previously calculated best action has been invalidated (probably via an optimistic locking scheme). This is probably the best way to go, but it's also going to be one of the more complex of the possible options.
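A toy sketch of that compromise, in Python for brevity. The group records, the `prefs` priority lists and the idea of "claiming" a target are hypothetical simplifications: each group proposes its best action in parallel against a snapshot, then the proposals are applied one at a time in a fixed group order, and any proposal that an earlier commit has invalidated is simply recalculated.

```python
from concurrent.futures import ThreadPoolExecutor


def best_target(group, claimed):
    """Highest-priority target that nobody has claimed yet (reads only)."""
    for target in group["prefs"]:
        if target not in claimed:
            return target
    return None


def plan_turn(groups, workers=4):
    claimed = {}              # target -> group id; mutated only in the serial phase
    snapshot = dict(claimed)  # what the parallel evaluations get to see
    # Optimistic phase: every group evaluates in parallel against the snapshot.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        proposals = list(pool.map(lambda g: best_target(g, snapshot), groups))
    # Commit phase: serial, in deterministic group order.
    recalculated = 0
    for group, target in zip(groups, proposals):
        if target is not None and target in claimed:
            # An earlier group took this target, so the proposal is stale.
            target = best_target(group, claimed)
            recalculated += 1
        if target is not None:
            claimed[target] = group["id"]
    return claimed, recalculated


groups = [
    {"id": 1, "prefs": ["city", "mine"]},
    {"id": 2, "prefs": ["city", "farm"]},
    {"id": 3, "prefs": ["fort"]},
]
print(plan_turn(groups))
```

In this example groups 1 and 2 both propose "city"; group 1 commits first, group 2's proposal is detected as stale and recalculated to "farm". The result matches what purely serial processing would give, and the lost effort is limited to the recalculated proposals.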

Snofru1 · May 18, 2013

I am very impressed, as usual when Koshling is pushing forward things!

You have mentioned that you have also found and eliminated a number of pathing errors. Can you elaborate a little bit on this? Were you able to isolate the notorious issue with paths changing immediately when you stop holding the right mouse button (and thereby changing from a good path to a bad one)?

Koshling · May 18, 2013

Snofru1 said:
I am very impressed, as usual when Koshling is pushing forward things!

You have mentioned that you have also found and eliminated a number of pathing errors. Can you elaborate a little bit on this? Were you able to isolate the notorious issue with paths changing immediately when you stop holding the right mouse button (and thereby changing from a good path to a bad one)?

No, that one isn't (at least it sounds like it isn't) a pathing engine problem per se, but rather something going wrong in the way the UI pathing display is slaved to the C2C replacement pathing engine. Because the UI display of paths is built into the main game EXE (which we cannot control) it has to use the original BTS pathing engine, so what we do is slave that to the path generated by our pathing engine by generating totally artificial node costs when asked by the game engine (essentially cost 0 for an edge on the desired path, infinity for any other). This forces the BTS pathing engine to follow the already-determined path our engine calculated. This only applies to paths displayed in the UI (not to any usage by the AI or automations etc.), and the symptom you describe sounds more likely to be a problem in that mechanism.
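The slaving trick can be illustrated with a toy search. In this sketch a generic Dijkstra stands in for the BTS pathing engine (an assumption for illustration only; the real engine lives in the EXE and is not reproduced here), and `make_slaved_cost` is a hypothetical name for the callback that hands it the artificial costs.

```python
import heapq
import math


def make_slaved_cost(desired_path):
    """Cost callback: 0 for edges on the desired path, infinity for all others."""
    on_path = set(zip(desired_path, desired_path[1:]))

    def edge_cost(a, b):
        return 0 if (a, b) in on_path else math.inf

    return edge_cost


def generic_shortest_path(start, goal, neighbours, edge_cost):
    """Plain Dijkstra that knows nothing about the desired path."""
    dist = {start: 0}
    prev = {}
    frontier = [(0, start)]
    while frontier:
        d, node = heapq.heappop(frontier)
        if node == goal:
            break
        if d > dist[node]:
            continue  # stale queue entry
        for nxt in neighbours(node):
            nd = d + edge_cost(node, nxt)
            if nd < dist.get(nxt, math.inf):  # infinite-cost edges never relax
                dist[nxt] = nd
                prev[nxt] = node
                heapq.heappush(frontier, (nd, nxt))
    if goal not in dist:
        return None
    path = [goal]
    while path[-1] != start:
        path.append(prev[path[-1]])
    return path[::-1]


def grid_neighbours(node, size=5):
    x, y = node
    for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        if 0 <= x + dx < size and 0 <= y + dy < size:
            yield (x + dx, y + dy)


# Path already chosen by "our" engine; the generic search is forced to follow it.
desired = [(0, 0), (1, 0), (1, 1), (2, 1), (2, 2)]
print(generic_shortest_path((0, 0), (2, 2), grid_neighbours, make_slaved_cost(desired)))
```

Since every off-path edge costs infinity, the only route the generic search can find is the one it was handed, which is why a mismatch between the displayed and the followed path points at this hand-off rather than at the path generator itself.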

As to the bugs that WERE fixed - I can't really describe them in symptomatic terms - they were just things that caused it to follow sub-optimal paths, as discovered by the automated tests I have now added. They would have manifested as units following longer/less desirable paths than they should have done.

Sebastian2203 · May 18, 2013

Wow, I'm happy to see you are putting all of your efforts into it. Can I donate to you somehow?

Snofru1 · May 19, 2013

Koshling said:
No, that one isn't (at least it sounds like it isn't) a pathing engine problem per se, but rather something going wrong in the way the UI pathing display is slaved to the C2C replacement pathing engine. Because the UI display of paths is built into the main game EXE (which we cannot control) it has to use the original BTS pathing engine, so what we do is slave that to the path generated by our pathing engine by generating totally artificial node costs when asked by the game engine (essentially cost 0 for an edge on the desired path, infinity for any other). This forces the BTS pathing engine to follow the already-determined path our engine calculated. This only applies to paths displayed in the UI (not to any usage by the AI or automations etc.), and the symptom you describe sounds more likely to be a problem in that mechanism.

As to the bugs that WERE fixed - I can't really describe them in symptomatic terms - they were just things that caused it to follow sub-optimal paths, as discovered by the automated tests I have now added. They would have manifested as units following longer/less desirable paths than they should have done.

Thank you for that clarifying answer! So it means that the problem could be solved, or at least improved, by your changes, with units now following better paths after releasing the right mouse key. I will look closely in the future for bad pathing and provide savegames and descriptions when I observe it. Unfortunately I don't have the time to play currently :sad:

What I saw in the past was:

Units that left my borders unnecessarily and could be killed by Barbarians/Animals.
Units that left the road system and returned on it thereby needing one or more additional turns to reach the goal.
Units not following the quickest road type available.
A typical effect was selecting a target by holding the right mouse key, seeing something like "3 turns" and then following a path that takes 4 turns.

Koshling · May 19, 2013

Snofru1 said:
Thank you for that clarifying answer! So it means that the problem could be solved, or at least improved, by your changes, with units now following better paths after releasing the right mouse key. I will look closely in the future for bad pathing and provide savegames and descriptions when I observe it. Unfortunately I don't have the time to play currently.

What I saw in the past was:

Units that left my borders unnecessarily and could be killed by Barbarians/Animals.

Units that left the road system and returned on it thereby needing one or more additional turns to reach the goal.

Units not following the quickest road type available.

A typical effect was selecting a target by holding the right mouse key, seeing something like "3 turns" and then following a path that takes 4 turns.

If you get such a case again, reload from the previous autosave and see if it is reproducible. If it is, please post with instructions on how to reproduce, and I'll work on it.

strategyonly · Jan 3, 2014

So when is it we are looking at, to get back to C2C??

Nimek · Jan 11, 2014

@Koshling

I miss you. You were my personal dll modder hero. Please come back

I am very happy that we have alberts2 now.
