As you may have noticed, I have not been very active on the SVN for a few weeks.
This is because, shortly before the release of V30, I decided to spend some time looking into multi-core enabling aspects of the DLL. In doing this I chose to start with the path generator. This choice was for two reasons:
Although I have by no means finished working on it, I now have a sufficiently stable version to push to SVN, so today I have done just that, with a fairly extensive revision of the path generation code. Externally the main thing you should see is improved performance, but there are some bug fixes in there too.
In terms of my original intent I have not (yet) been entirely successful, but in terms of overall performance I'm reasonably happy, so I'll start there.
My primary benchmark is a save that Talin provided for a bug fix a month or so ago. This is a fairly mature game on a fairly large map. On my machine, the V30 turn time for this save is about 6 minutes, of which about 2 minutes 40 seconds is path generation. The version just pushed executes the pathing calculations in about 54 seconds, so the net result is approximately a 4 minute turn. More gratifyingly the gain is all in pathing, so that's really an improvement from 155 seconds to 54 - a factor of 3 roughly.
Although I do plan to continue with more work on the pathing (you'll see why later), I intend to move onto other aspects of turn time next (city processing is now the dominant factor, so it's next for some intensive scrutiny). This cycle (so until V31) I don't plan to do anything except improve performance and fix major bugs.
So anyway, here's a little more detail, since things did not go entirely as I had expected, or initially hoped!
My first attempt had three big problems:
1) Because I was restricting my work to path generator internals, the thread switching/scheduling latency was comparable to (or larger than) the time to generate a typical path (order of 1 milli-second). This meant that it was extremely hard to get performance gains from multiple threads, since the startup time of a background thread processing any request was comparable to the entire time to complete generation of a path. Although I was able to get around this to some degree the net effect is that locking costs quickly chew up any potential benefit, and scaling with thread count was extremely poor.
2) Multi-threading introduces a degree of non-determinism in the order things get done, and unless very severe constraints are imposed, this results in some non-determinism in the path generated when there are multiple optimal paths. In a multi-player environment this would lead to OOS errors, so it's not acceptable (worst case of course, is that you can disable multi-core processing in MP games, but that's not ideal)
3) Most seriously of all, debugging a more complex implementation, and especially a multi-threaded one is a lot harder. I quickly found that errors were occurring that I couldn't reliably reproduce when I ran long turns.
At that point I decided to step back and construct an extensive self-testing framework for the path generation. This works as follows:
i) I added back the original path generator code, and kept the new version as a separate class, so I could instantiate both at once.
ii) I added a random-number (with fixed seed) driven path test case generator. This is triggered (in the debug version) by pressing ctrl-alt-P and generates 10000 random paths against the current map in groups of 30, each group beginning at the same random (land) starting point and each path in the group choosing a random end point within a distance of 30 from the start (choosing totally random end points from anywhere on the map would produce far too many paths that could not be completed - limiting the range both simulates better the kind of paths the game generates normally, and increases the proportion that are possible to path)
iii) Each group of paths is generated with each path generator (the 'known quantity' old one, and the under-test new one). The code flags up any discrepancies in ability to generate the path, total length of the generated path, and total cost of the generated path. Any such discrepancy is considered a test failure
Because this is a deterministic (but randomized) set of paths, it means that every map I can use will give a different test set, but for any given save the same set will always be generated.
This enabled me to fairly rigorously test changes as I made them. It also served to identify bugs in the existing implementation (of which I found about half a dozen), since when differences occurred, they sometimes turned out to be bugs in the original that didn't manifest in the new version.
This process also gave me a much better feel for the circumstances and condition that were occurring during path generation, and sparked a range of ideas for optimizations. Since it is way easier to debug a single-threaded implementation than a multi-threaded one (and since the latency issues were making good multi-core scaling hard to achieve), I decided to concentrate on making a number of single-threaded optimization first, and then to multi-core enable the resulting optimized generator.
In the end the single-threaded optimizations were so successful that I have checked them in (that's basically today's SVN push), and my expectation is that it will turn out to not be worthwhile to multi-thread at this level (I expect to get maybe 20% more gain from throwing an entire second core at it, which is not worthwhile scaling, given that it will reduce the opportunity for the first core to turbo up on modern machines). However, I plan to continue and verify that, before deciding that multicore enablement at this level is a dead end.
In practice, to get it scaling well across multiple cores, I think the parallelism has to be undertaken at a higher level (generating multiple paths in parallel rather than several threads working on any one path). Unfortunately that is a LOT more complex to achieve (needs much more locking, and involves much less contained areas of code), and itself may have limited scaling, because currently the cost-trees generated by one path generation attempt are re-used in the next (for the same group), and parallelizing this will result in potentially significantly more overall work, as multiple parallel attempts work in sub-optimal directions because they don't have predecessor-derived information to guide them.
This latter effect may mean that good scaling cannot be achieved without processing multiple groups simultaneously, which again extends the scope of locking needed, and (at least naively) introduces race conditions that could result in multiple groups trying to address the same goal (those are all solvable by yet more locking of course, but the complexity goes up at each step). Any attempt to parallelize at this level will also be VERY, VERY hard to make OOS-safe in an MP game environment (because groups will be processed in a non-deterministic order, and since the result of one group's decision making constrains that of other's, that will tend to make the overall result non-deterministic). This non-determinism will also mean that bugs will have less reproducibility.
My intended next steps are as follows:
This is because, shortly before the release of V30, I decided to spend some time looking into multi-core enabling aspects of the DLL. In doing this I chose to start with the path generator. This choice was for two reasons:
- It's a relatively isolated piece of code, with few external dependencies
- It was the single largest cost in turn processing for mature games
Although I have by no means finished working on it, I now have a sufficiently stable version to push to SVN, so today I have done just that, with a fairly extensive revision of the path generation code. Externally the main thing you should see is improved performance, but there are some bug fixes in there too.
In terms of my original intent I have not (yet) been entirely successful, but in terms of overall performance I'm reasonably happy, so I'll start there.
My primary benchmark is a save that Talin provided for a bug fix a month or so ago. This is a fairly mature game on a fairly large map. On my machine, the V30 turn time for this save is about 6 minutes, of which about 2 minutes 40 seconds is path generation. The version just pushed executes the pathing calculations in about 54 seconds, so the net result is approximately a 4 minute turn. More gratifyingly the gain is all in pathing, so that's really an improvement from 155 seconds to 54 - a factor of 3 roughly.
Although I do plan to continue with more work on the pathing (you'll see why later), I intend to move onto other aspects of turn time next (city processing is now the dominant factor, so it's next for some intensive scrutiny). This cycle (so until V31) I don't plan to do anything except improve performance and fix major bugs.
So anyway, here's a little more detail, since things did not go entirely as I had expected, or initially hoped!
My first attempt had three big problems:
1) Because I was restricting my work to path generator internals, the thread switching/scheduling latency was comparable to (or larger than) the time to generate a typical path (order of 1 milli-second). This meant that it was extremely hard to get performance gains from multiple threads, since the startup time of a background thread processing any request was comparable to the entire time to complete generation of a path. Although I was able to get around this to some degree the net effect is that locking costs quickly chew up any potential benefit, and scaling with thread count was extremely poor.
2) Multi-threading introduces a degree of non-determinism in the order things get done, and unless very severe constraints are imposed, this results in some non-determinism in the path generated when there are multiple optimal paths. In a multi-player environment this would lead to OOS errors, so it's not acceptable (worst case of course, is that you can disable multi-core processing in MP games, but that's not ideal)
3) Most seriously of all, debugging a more complex implementation, and especially a multi-threaded one is a lot harder. I quickly found that errors were occurring that I couldn't reliably reproduce when I ran long turns.
At that point I decided to step back and construct an extensive self-testing framework for the path generation. This works as follows:
i) I added back the original path generator code, and kept the new version as a separate class, so I could instantiate both at once.
ii) I added a random-number (with fixed seed) driven path test case generator. This is triggered (in the debug version) by pressing ctrl-alt-P and generates 10000 random paths against the current map in groups of 30, each group beginning at the same random (land) starting point and each path in the group choosing a random end point within a distance of 30 from the start (choosing totally random end points from anywhere on the map would produce far too many paths that could not be completed - limiting the range both simulates better the kind of paths the game generates normally, and increases the proportion that are possible to path)
iii) Each group of paths is generated with each path generator (the 'known quantity' old one, and the under-test new one). The code flags up any discrepancies in ability to generate the path, total length of the generated path, and total cost of the generated path. Any such discrepancy is considered a test failure
Because this is a deterministic (but randomized) set of paths, it means that every map I can use will give a different test set, but for any given save the same set will always be generated.
This enabled me to fairly rigorously test changes as I made them. It also served to identify bugs in the existing implementation (of which I found about half a dozen), since when differences occurred, they sometimes turned out to be bugs in the original that didn't manifest in the new version.
This process also gave me a much better feel for the circumstances and condition that were occurring during path generation, and sparked a range of ideas for optimizations. Since it is way easier to debug a single-threaded implementation than a multi-threaded one (and since the latency issues were making good multi-core scaling hard to achieve), I decided to concentrate on making a number of single-threaded optimization first, and then to multi-core enable the resulting optimized generator.
In the end the single-threaded optimizations were so successful that I have checked them in (that's basically today's SVN push), and my expectation is that it will turn out to not be worthwhile to multi-thread at this level (I expect to get maybe 20% more gain from throwing an entire second core at it, which is not worthwhile scaling, given that it will reduce the opportunity for the first core to turbo up on modern machines). However, I plan to continue and verify that, before deciding that multicore enablement at this level is a dead end.
In practice, to get it scaling well across multiple cores, I think the parallelism has to be undertaken at a higher level (generating multiple paths in parallel rather than several threads working on any one path). Unfortunately that is a LOT more complex to achieve (needs much more locking, and involves much less contained areas of code), and itself may have limited scaling, because currently the cost-trees generated by one path generation attempt are re-used in the next (for the same group), and parallelizing this will result in potentially significantly more overall work, as multiple parallel attempts work in sub-optimal directions because they don't have predecessor-derived information to guide them.
This latter effect may mean that good scaling cannot be achieved without processing multiple groups simultaneously, which again extends the scope of locking needed, and (at least naively) introduces race conditions that could result in multiple groups trying to address the same goal (those are all solvable by yet more locking of course, but the complexity goes up at each step). Any attempt to parallelize at this level will also be VERY, VERY hard to make OOS-safe in an MP game environment (because groups will be processed in a non-deterministic order, and since the result of one group's decision making constrains that of other's, that will tend to make the overall result non-deterministic). This non-determinism will also mean that bugs will have less reproducibility.
My intended next steps are as follows:
- Switch to looking at the optimization of city processing. Currently I think this CAN be done effectively in parallel, but I also have some ideas for significant speed ups that do not involve multi-threading. I expect to make significant gains here before V31, and since this is (now) the dominant cost for mature-game turn processing, I expect further gains of several 10s of percent in turn times over and above the version pushed today.
- More pathing optimizations, including experimental verification that low-level parallelism isn't a sufficient gain to justify its use.
- Another approach to trying to use a more modern compiler for parts of the code (initially small parts, but that is just proof-of-concept), so as to allow compiler optimization for more modern processor architectures than the current (2003) compiler performs (this is known to be worth something of the order of a doubling of overall performance if we can achieve it)