DLL optimizations I'm considering.

OPTIMIZED?

  • OPTIMIZED! 14 votes (87.5%)
  • DE-OPTIMIZED! 2 votes (12.5%)
  • Total voters: 16

billw2015

King
Joined
Jun 22, 2015
Messages
837
It's just a quick grab bag of stuff I noticed that piqued my interest (and that I can remember off the top of my head right now) and want to investigate with an eye to improving performance:
  • Sparse bool arrays - these seem to be used a lot, and in every case I have looked at so far they force the code to be slower instead of faster, and the memory usage to be higher than it should be (which also hurts performance).
  • Massive classes - the bigger an object is in memory the less of it fits in cache.
  • Vectors of pointers - this is bad performance-wise in almost all cases. Of course some design requirements might make this impossible, but, as an example, the replacement system, where this is most used, is NOT one of them (at least as far as I can tell thus far from my surface reading of the code).
  • Pointers in general - they shouldn't be the default for how to instantiate an object, but they seem to be in this code, so I'm going to look into that. Maybe most or all uses are the correct choice, I shall see.
  • Const correctness - yes, the optimizer really can perform better when it knows that code doesn't have side effects. It will also make it much easier to apply...
  • Data parallelization - once you know that the contents of a for loop *actually* don't have side effects (not that const actually guarantees it, but it is most of the way there), you can run it on more than one thread so it finishes faster. I don't know how many places could currently benefit from this, but some places that currently can't might be refactored such that they could.
  • Branching - it's slow. Especially convoluted branching will hurt the branch predictor and the optimizer. I have already noticed plenty of redundant branching as well, where constraints are checked multiple times.
  • Critical Sections - but I think alberts is removing all this stuff.
  • Bake the static portions of the AI scoring - a large part of the valuation the AI gives to buildings when deciding which to build is based on their raw numbers and not contingent on any external factors, and thus can be pre-computed once instead of computed for every city, for every building, every turn. Probably the same is true of unit builds.
Maybe I can't improve on the code in some or any of these cases, they are just my check list of things to investigate.
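To illustrate the "bake the static portions" point, here is a minimal sketch. All names here (Building, staticScore, cityScore) are hypothetical, not taken from the actual codebase: the context-free part of the score is computed once per building, and only the city-dependent part is added per city per turn.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical building definition with the "raw numbers" a score uses.
struct Building { int hammerCost; int baseCommerce; };

// Depends only on the building's raw numbers -- no external factors.
static int staticScore(const Building& b)
{
    return b.baseCommerce * 100 - b.hammerCost;
}

// Computed once, not once per city per turn.
std::vector<int> bakeStaticScores(const std::vector<Building>& buildings)
{
    std::vector<int> baked;
    baked.reserve(buildings.size());
    for (std::size_t i = 0; i < buildings.size(); ++i)
        baked.push_back(staticScore(buildings[i]));
    return baked;
}

// Per city, only the dynamic part remains to be evaluated.
int cityScore(int bakedScore, int cityModifier)
{
    return bakedScore + cityModifier;
}
```

The baked table would only need rebuilding when the building definitions themselves change, which for raw xml numbers is never during a game.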

/edit
The above stuff wouldn't change gameplay or behavior at all (intentionally at least!).
There is another set of things I want to look at (I didn't work them all out yet, I just know they are there waiting to be discovered) that *would* change gameplay behaviour. This doesn't mean they make it worse, or even noticeably different though. In fact in the end they could lead to gameplay/AI improvements implemented using the spare CPU cycles that were saved.
  • Better utilization of caches once they are formed, e.g. longer term city build planning for the AI.
  • More accurate cache invalidation rules - caches don't need to be cleared at the start of a round; instead they can be cleared or partially cleared when values known to affect them have changed.
  • Multi-level cache with standardized representation, from which invalidation rules can be derived automatically - this would be the ultimate way to model the caches. Not a trivial task.
  • Queue multiple buildings - the AI is already evaluating and assigning a score to every possible building every time it changes production. Why not just put the top 5 rated buildings in the queue straight away? Then decide each turn if the AI feels strongly about clearing the rest of the queue due to some significant change in circumstances (war, tech with an un-built wonder in it etc.) This is how most humans play after all.
  • I didn't look into unit AI at all yet, so more ideas to come hopefully!
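The "more accurate invalidation" idea amounts to something like this sketch (names are hypothetical, not the mod's actual cache code): the cache is marked dirty by the event that changes its input, instead of being cleared every round.

```cpp
// Hypothetical event-driven cache: invalidated only when an input
// actually changes, never blanket-cleared at the start of a round.
class YieldCache
{
public:
    YieldCache() : m_dirty(true), m_value(0), m_modifier(0) {}

    // Called from the (hypothetical) "modifier changed" event.
    void setModifier(int modifier)
    {
        if (modifier != m_modifier)
        {
            m_modifier = modifier;
            m_dirty = true; // invalidate only on a real change
        }
    }

    int getValue(int base)
    {
        if (m_dirty)
        {
            m_value = base * (100 + m_modifier) / 100; // the expensive part
            m_dirty = false;
        }
        return m_value;
    }

    bool isDirty() const { return m_dirty; }

private:
    bool m_dirty;
    int  m_value;
    int  m_modifier;
};
```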
 
Last edited:
Data parallelization - once you know that the contents of a for loop *actually* don't have side effects (not that const actually guarantees it, but it is most of the way there), you can run it on more than one thread so it finishes faster. I don't know how many places could currently benefit from this, but some places that currently can't might be refactored such that they could.
Should be the last thing to consider after all other options.
 
Should be the last thing to consider after all other options.
I was part way through replying to your same comment in the other thread. I'm wondering why you think this.
/edit

What I am suggesting is nothing remotely like what it appears was attempted previously, based on the fragments I have seen. That appears to have been an attempt to apply task parallelism to legacy single threaded code, which was pretty much doomed from the start.
Obviously easy to say in hindsight, but I have attempted the same thing before and know what the pitfalls are. TL;DR: globals, and if you manage to deal with them, you still probably hit so much contention from the copious locking you just added that your tasks might even be slower than they were before.

Data parallelism is a different proposition. Instead of attempting to synchronize changes to global state you are instead finding pieces of looped code where none are required.

A good example I have found is the scoring of buildings for the AI. It loops over 50+ buildings doing a set of fairly complicated calculations, NONE of which should be changing any shared state. No iteration of that loop depends on another iteration. Currently however there IS shared state being modified, in the form of various caches that this loop interacts with. The task in parallelizing it would be to root out and secure these caches, preferably in a lock-free manner, or alternatively to refactor them into a separate cache warming step run before the main loop.
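A sketch of what that loop parallelization could look like, in modern C++ for brevity (the actual VC2003 toolchain would need Win32 threads or Boost.Thread, and scoreBuilding here is a stand-in for the real scoring function): each iteration writes only its own slot, so once the caches are warmed beforehand no locking is required.

```cpp
#include <cstddef>
#include <thread>
#include <vector>

// Stand-in for the real, much more complicated scoring function.
int scoreBuilding(int building) { return building * building; }

std::vector<int> scoreAllBuildings(int numBuildings, unsigned numThreads)
{
    std::vector<int> scores(numBuildings, 0);
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < numThreads; ++t)
    {
        workers.push_back(std::thread([&scores, numBuildings, numThreads, t]
        {
            // Strided partition: thread t handles t, t+N, t+2N, ...
            for (int i = static_cast<int>(t); i < numBuildings;
                 i += static_cast<int>(numThreads))
            {
                scores[i] = scoreBuilding(i); // each slot written by one thread
            }
        }));
    }
    for (std::size_t i = 0; i < workers.size(); ++i)
        workers[i].join();
    return scores;
}
```

The only synchronization is the final join, which is the "thread overhead is the only cost" ideal case described below.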

I think there are probably a few other examples of these kinds of loops that can be parallelized.

However, this certainly isn't the first thing I am going to look into. It isn't a trivial task, but it isn't really scary to me; it's something I have done plenty of times in similar code. Of course that doesn't mean it will *actually* be faster, just that it won't break. It's always possible there are no significantly costly loops where the win can outstrip the overhead, but that seems unlikely given how much calculation is being done all over the place.
 
There are still a lot of potential optimizations to be done before I would consider any kind of multithreading, at least that's what I think.
The last time multithreading was added it was done as a miracle cure for all kinds of programming issues, which led to really, really slow turns. Instead of spending time on multithreading it would have been possible to just start fixing the slow code instead. Some of these issues were actually really easy to find, and now that they are fixed the multithreading is slowing things down instead.
 
There are still a lot of potential optimizations to be done before I would consider any kind of multithreading, at least that's what I think.
That's great if there is. I'm not about to start multithreading any kind of code I don't understand fully; I was just joking about immediately starting the next abortive implementation of multithreading :mischief:

The last time multithreading was added it was done as miracle cure for all kinds of programming issues
Yeah I wish I had been here to argue against it!

Instead of spending time on multithreading it would have been possible to just start fixing the slow code instead.
I want to do both. I really want end turn times down to <10% of what they are now, I think it is possible, but using extra cores will be a significant part of that. Optimizing on a single thread can only go so far, and however far it can go, I want to go further.

Some of these issues were actually really easy to find, and now that they are fixed the multithreading is slowing things down instead.
I'm interested in what ways the multithreading was slowing things down and by how much? Was it contention, thread overhead, the actual pipelining system (I don't know anything about it really, but I kind of get the gist from my call stacks)?
Regardless *data* parallelism (loop parallelization) suffers from these issues far less. In the ideal case the only cost is in thread overhead, which is highly predictable. A benefit is you can always reduce the scope of the parallel part of the code down until it is easy to manage or implement.
One of the reasons I am not too trepidatious about doing this is that the changes it requires (or strongly encourages) to existing code also generally improve the health of the code base. For instance:
  • making sure const is appropriately used
  • removing use of mutable (I saw it, I suspect it is there because const correctness has only been partially implemented elsewhere, which I would fix)
  • making sure caches are thread safe (preferably in a lock free manner)
  • appropriately labelling and constraining changes to global state
I want to tackle those changes regardless of multithreading, it just happens that easier multithreading would also be a consequence of them.
 
I'm interested in what ways the multithreading was slowing things down and by how much? Was it contention, thread overhead, the actual pipelining system (I don't know anything about it really, but I kind of get the gist from my call stacks)?
Mostly the critical sections but some other code changes to enable multithreading and the pipelining system itself play their part as well.
 
Sparse bool arrays - these seem to be used a lot, and in every case I so far looked at they are forcing the code to be slower instead of faster, and the memory usage to be higher than it should be (which also impacts performance).
CvTechInfo looks like a good place for improvement. Not exactly bool arrays but same idea.
Code:
    bool isRepeat() const;
    bool isTrade() const;
    bool isDisable() const;
    bool isGoodyTech() const;
    bool isExtraWaterSeeFrom() const;
    bool isMapCentering() const;
    bool isMapVisible() const;
    bool isMapTrading() const;
    bool isTechTrading() const;
    bool isGoldTrading() const;
    bool isOpenBordersTrading() const;
    bool isDefensivePactTrading() const;
    bool isPermanentAllianceTrading() const;
    bool isVassalStateTrading() const;
    bool isBridgeBuilding() const;
    bool isIrrigation() const;
    bool isIgnoreIrrigation() const;
    bool isWaterWork() const;
    bool isRiverTrade() const;
    bool isCanPassPeaks() const;
    bool isMoveFastPeaks() const;
    bool isCanFoundOnPeaks() const;
    bool isEmbassyTrading() const;
    bool isEnableDarkAges() const;
    bool isRebaseAnywhere() const;
    bool isEnablesDesertFarming() const;
They all look to me like only one tech is going to be true for each.
Personally I would take these out and handle this stuff 100% in python; there's already a tech acquired event with the tech type sent.
Or make one bool to represent them all: when the xml is loaded, if the tech type has any one of these bools set true then set the new single bool true, and save the current tech type with a single int in globals or something.
Edit: Enums
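The "Enums" idea could look like this minimal sketch (the flag names are made up for illustration): one flags int replaces the ~26 separate bools, so CvTechInfo would carry a single field instead of a bool per tag.

```cpp
// Hypothetical flags enum: one bit per former bool accessor.
enum TechFlags
{
    TECHFLAG_NONE            = 0,
    TECHFLAG_REPEAT          = 1 << 0,
    TECHFLAG_MAP_TRADING     = 1 << 1,
    TECHFLAG_TECH_TRADING    = 1 << 2,
    TECHFLAG_BRIDGE_BUILDING = 1 << 3
    // ... and so on for the rest
};

class TechInfoSketch
{
public:
    TechInfoSketch() : m_flags(TECHFLAG_NONE) {}

    // Set during xml load when the corresponding tag is true.
    void addFlag(TechFlags flag)       { m_flags |= flag; }
    bool hasFlag(TechFlags flag) const { return (m_flags & flag) != 0; }

    // The old accessors become one-liners over the flags field.
    bool isRepeat() const     { return hasFlag(TECHFLAG_REPEAT); }
    bool isMapTrading() const { return hasFlag(TECHFLAG_MAP_TRADING); }

private:
    int m_flags; // 26 bools packed into one int
};
```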
 
CvTechInfo looks like a good place for improvement. Not exactly bool arrays but same idea. [...] They all look to me like only one tech is going to be true for each. Personally I would take these out and 100% handle this stuff with python. [...] Edit: Enums
Yes, most only occur on one tech, but not always. There is normally only one tech that isRepeat() is true on, and it is (always) called TECH_FUTURE_TECH so that it does not cause issues elsewhere. However I have seen mods that have multiple repeat techs that each add something different, e.g. :).

The set isRiverTrade() and isTrade() needs updating to include Solar System and Galactic trade. These can be on multiple techs but usually only one is used.

I agree that this is another case where the input model XML should not necessarily match the game model. It should probably stay on the tech XML, since that is where modders are used to looking for such things, but in game it would be on something else. You can change the model just inside the program (dll), you know; it doesn't have to be done everywhere :lol:. I have had 40 years of experience doing just that. Keep it simple for the people entering the information and convert it to what is useful; computer time is much cheaper than human time.
 
Sparse bool arrays
I wrote a class called BoolArray, which is actually an array of unsigned ints. Each int is then used as 32 bools. I have been using it a while and it's working well. I do have a plan for an upgrade (template instead of constructor for lengths etc) and writing this post made me realize I should move the code into EnumArray. This way you can have EnumArray<UnitTypes,int> for an int for each unittype and it supports all int like types (short, char, enums like UnitClassTypes etc). Adding support for <UnitTypes, bool> to access the BoolArray storage system shouldn't be much more work. It will certainly be more intuitive for the programmer to just add it to the existing class rather than adding a new class specifically for bools.
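For reference, the storage scheme described (each unsigned int used as 32 bools) amounts to something like this sketch; the real BoolArray class presumably differs in its details:

```cpp
#include <cstddef>
#include <vector>

// Sketch of a BoolArray-style container: one bit per flag, 32 flags
// per unsigned int, so 8x less memory than a bool-per-byte array.
class BoolArraySketch
{
public:
    explicit BoolArraySketch(std::size_t count)
        : m_words((count + 31) / 32, 0u) {}

    bool get(std::size_t i) const
    {
        return (m_words[i / 32] >> (i % 32)) & 1u;
    }

    void set(std::size_t i, bool value)
    {
        if (value) m_words[i / 32] |=  (1u << (i % 32));
        else       m_words[i / 32] &= ~(1u << (i % 32));
    }

private:
    std::vector<unsigned int> m_words; // 32 flags per word
};
```

A template layer on top (as described, EnumArray<UnitTypes, bool>) would then just map the enum index to these get/set calls.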

Massive classes - the bigger an object is in memory the less of it fits in cache.
CvInfoBase uses 320 bytes for what? Some strings and a bunch of virtual function pointers. Most of this will never be used, though there is a risk that the exe expects the class memory layout to be like this... not great.

Also, why virtual functions? It would appear to be a misunderstanding of how virtual functions work, because the behavior would be the same without virtual: the virtual functions are called from template functions, meaning they will call the child classes, hence the functions in the child classes. The only exception is the virtual destructor, which is actually used and would cause child classes to not call destructors if it weren't virtual. Also, virtual functions were hot in the 90s and early 2000s, but not so much anymore due to performance issues.

Vectors of pointers
It's used for info classes because it's simple. What happens during xml loading is that a new instance is allocated, data is read into it, and then it is added to the vector; however, if the Type is already present, it replaces that index instead of push_back. If instead the read code reads the Type string, figures out the index, and then calls read on that index (which could be a newly added index or an existing one), that would allow a vector of instances instead of a vector of pointers.
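That load scheme could be sketched like this (InfoSketch is a stand-in for a real info class): replace in place when the Type already exists, so the container can hold values instead of pointers.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Stand-in for an info class loaded from xml.
struct InfoSketch
{
    std::string type;
    int value;
};

// Replace the existing entry for this Type, or append a new one --
// the by-value equivalent of the pointer-replacement described above.
void addOrReplace(std::vector<InfoSketch>& infos, const InfoSketch& loaded)
{
    for (std::size_t i = 0; i < infos.size(); ++i)
    {
        if (infos[i].type == loaded.type)
        {
            infos[i] = loaded; // overwrite the existing index in place
            return;
        }
    }
    infos.push_back(loaded); // new Type: append
}
```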

There is an issue of includes though. A vector of pointers only needs to know the class exists, while a vector of the actual data needs to know the content of each class. This is an issue because CvGlobals.h is included first, and it would need to be included after CvInfos.h if it should know the data. Some headers contain inline functions which use GC, though I'm not sure about the info classes. It's something which needs attention. It should be fixable with enough determination though.

I have been wondering how to optimize this. Part of the problem is the code is optimized for something like: you have one unit, then you get a pointer to the unit info for that unit, which you use multiple times to look up data for that unit. It's not optimized for looping over all unit infos.

One solution would be caching. Say you have build info. The worker loops all build infos to figure out what it can build on the plot in question. To do that, rather than having one build info per type, there could be one super build info which has an EnumMap for each int. This allows looping over all infos for the same int, and the values will be stored sequentially in memory, perfect for CPU caches. It could also be an array (vector?) of parts of the info class: it stores required terrain, feature etc. for a buildtype, meaning all the stuff needed for fast looping is together and nothing else. It would use, say, 20 bytes for each type (I didn't look at the details), which makes lookup of all data for one info fast, while it's still stored sequentially for all infos, so it's also fast for looping.

The problem is how to code this, particularly if it needs to avoid storing the same data multiple times.

Pointers in general - they shouldn't be the default for how to instantiate an object, but they seem to be in this code, so I'm going to look into that. Maybe most or all uses are the correct choice, I shall see.
Generally speaking, if a pointer can be NULL, it should be a pointer. If it can't be NULL, it should be a reference. There is however a problem with references: you can make a copy if you forget the &, and the compiler happily accepts that. I spent quite a while hunting for why my info classes started breaking, and I found a missing &, which meant I had made a local copy of the info class. When it went out of scope it would call the destructor and free the memory; the original still pointed to the original memory (because the pointers were just copied like they were ints) and then I would get a crash the next time they were used.

The solution turned out to be simple though.
PHP:
class CvInfoBase : private boost::noncopyable
Now the compiler will produce an error if I try to copy the info classes by mistake, and the missing & is no longer a serious danger. (Pointers can't be copied by accident if you forget the *.) This is something to keep in mind.
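For anyone without Boost at hand, this is essentially what boost::noncopyable amounts to in C++03: declare but never implement the copy operations, so an accidental copy fails at compile time instead of crashing at runtime. (Class names here are illustrative.)

```cpp
// Hand-rolled equivalent of boost::noncopyable for C++03.
class NonCopyableSketch
{
protected:
    NonCopyableSketch() {}
    ~NonCopyableSketch() {}
private:
    NonCopyableSketch(const NonCopyableSketch&);            // not implemented
    NonCopyableSketch& operator=(const NonCopyableSketch&); // not implemented
};

// An info-like class that inherits the protection.
class InfoLike : private NonCopyableSketch
{
public:
    InfoLike() : m_value(42) {}
    int value() const { return m_value; }
private:
    int m_value;
};

// InfoLike copy = someInfoRef; // <-- would now be a compile error,
//                                    catching the missing-& bug early
```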
 
would be the same without virtual because the virtual functions are called from template functions, meaning they will call the child classes
You might be thinking of some specific cases where template functions are being used on pointers to the child classes, but just to clarify in case not: this won't happen if you pass a base class pointer to a template function, the template will just treat it as the base class. Example here.
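To make that point concrete, a small sketch (Base/Derived/callName are illustrative names): the template dispatches on the static type of its argument, so a Base pointer gets Base's function even when the object is really a Derived, unless the function is virtual.

```cpp
#include <string>

// Non-virtual name() in both classes: dispatch is purely static.
struct Base            { std::string name() const { return "Base"; } };
struct Derived : Base  { std::string name() const { return "Derived"; } };

// T is deduced from the pointer's static type, not the object's real type.
template <class T>
std::string callName(const T* p) { return p->name(); }

std::string demo()
{
    Derived d;
    Base* asBase = &d;
    return callName(asBase); // T deduced as Base -> calls Base::name()
}
```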

A vector of pointers only needs to know the class exist while a vector of the actual data needs to know the content of each class.
You can use forward declaration like std::vector<struct Foo>. Example here.
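A minimal sketch of that forward-declaration trick: the function declaration only names Foo, and the full definition appears before the vector is actually populated. (Whether the C++03 compiler accepts this is exactly the question tested in the reply below; strictly, standard containers historically required complete element types.)

```cpp
#include <vector>

// The declaration compiles while Foo is still incomplete,
// as it would be in a hypothetical CvGlobals.h.
int countFoos(const std::vector<struct Foo>& foos);

// Full definition, as it would appear in CvInfos.h.
struct Foo { int value; };

int countFoos(const std::vector<struct Foo>& foos)
{
    return static_cast<int>(foos.size());
}
```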

Generally speaking, if a pointer can be NULL, it should be a pointer. If it can't be NULL, it should be a reference.
Yes, but I actually meant it should be a *value* in most cases. This isn't necessarily true for this particular project, I didn't look into each instance of pointer use yet, but a lot of coders are allergic to just copying a piece of data, preferring to use a shared instance. However as we know, dereferencing a shared instance is actually very costly if it causes a cache miss. So in some circumstances it would make sense to just copy the data and store it as a value member. Specifically when it is used a lot and modified infrequently, and doesn't require sharing semantics.
 
You might be thinking of some specific cases where template functions are being used on pointers to the child classes, but just to clarify in case not: this won't happen if you pass a base class pointer to a template function, the template will just treat it as the base class. Example here.
I'm referring specifically to the xml reading code (the vanilla code anyway), where it's like this (greatly simplified):
PHP:
struct Base
{
    void read();
};

struct Derived : Base
{
    void read();
};

template <class T>
void read(T& var)
{
    var.read();
}
The template is in the caller, not the class itself. This means the caller will call read in Derived or Base depending on which T was used to call the function, and it works without read being virtual.

You can use forward declaration like std::vector<struct Foo>
I just did a quick test and the C++03 compiler actually accepted it this time. I remember messing around with it at some point, but redesigned because it gave me issues. I can't remember the details though.

Yes, but I actually meant it should be a *value* in most cases. This isn't necessarily true for this particular project, I didn't look into each instance of pointer use yet, but a lot of coders are allergic to just copying a piece of data, preferring to use a shared instance. However as we know, dereferencing a shared instance is actually very costly if it causes a cache miss. So in some circumstances it would make sense to just copy the data and store it as a value member. Specifically when it is used a lot and modified infrequently, and doesn't require sharing semantics.
I can't remember any shared memory offhand other than pointers to info classes. Yes, they do not change, but they are also rather big. I do however agree that while copying the entire info class is stupid, copying a few specific values could be considered. If you only need 2-3 ints, then copy those instead of going through a pointer each time.

Another thing to consider is that the lookup can even be multiple layers: get an int from xml file A, then use that int as an argument to a get function for xml file B. Vanilla is quite bad at noticing unchanging inputs and just assumes everything needs to be calculated from scratch each time. A good example of this is profession yield cost (in vanilla Colonization, though BTS uses the same style). It can only change when a team gains a founding father, meaning it's an ideal candidate for calculating the array once and then looking values up instead of calculating them whenever they're needed. I did add such a cache and the AI turn went from 40 to 33 seconds.

Also, another issue with calculating numbers each time is that sometimes the code adds a bunch of ints and then multiplies and divides by 100 to apply modifiers. After adding, you can insert "if 0, return 0", because we know that 0 * X / 100 = 0; there is no need to waste time doing all those divisions when the input is 0. In fact division is slow and should be avoided as much as possible. Instead of dividing at each step, end by dividing by 100^X, assuming that won't risk overflow. X/10000 is twice as fast as (X/100)/100, and it provides the same result (except maybe some edge cases due to rounding).
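The folded-division point in a tiny sketch (the modifier values are made up): both functions apply two percentage modifiers, but the second does a single division at the end. The two can differ slightly when the intermediate division truncates, which is the rounding caveat mentioned above.

```cpp
// Two percentage modifiers, dividing after each step.
int modifierPerStep(int base, int mod1, int mod2)
{
    return ((base * mod1) / 100 * mod2) / 100; // two divisions
}

// Same modifiers, one final division (100 * 100 = 10000).
// Watch for overflow in the intermediate product.
int modifierFolded(int base, int mod1, int mod2)
{
    return (base * mod1 * mod2) / 10000; // one division
}
```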

One other thing, which isn't mentioned: do not use GC.getDefineINT(). The lookup is dead slow. Maybe not horrible in computer-science terms, but since we know the result will always be the same, storing an int somewhere (locally, in GC, globally or whatever) will speed up the code significantly. The code behind getDefineINT() is an unordered map, aka a hash table. Even if implemented perfectly, it will always be dead slow relative to reading a cached int.
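A sketch of that caching advice, with a stand-in map playing the role of the real define table (getDefineINT and the define name here are stand-ins, not the actual GC API): hoist the lookup out of the loop so the hash lookup happens once instead of per iteration.

```cpp
#include <map>
#include <string>

// Stand-in for the xml global define table behind GC.getDefineINT().
static std::map<std::string, int> g_defines;

int getDefineINT(const std::string& key) { return g_defines[key]; }

int sumWithModifier(const int* values, int count)
{
    // Hash lookup hoisted out of the loop: done once, then a plain
    // cached int is read on every iteration.
    const int modifier = getDefineINT("SOME_MODIFIER");
    int total = 0;
    for (int i = 0; i < count; ++i)
        total += values[i] * modifier; // no map lookup per iteration
    return total;
}
```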
 
In fact division is slow and should be avoided as much as possible.
This kind of thing isn't worth worrying about on a modern CPU unless your profiler tells you to. CPU pipelining and compiler instruction re-ordering mean you can't know the real-world cost of a single instruction. Also, a divide by 100 on an int won't be a divide instruction (unless you are casting to a float), as seen here.

After adding, you can add if 0, return 0, because we know that 0X / 100 = 0.
Don't do that! Branch mispredicts are as bad as a floating point divide.
You can see here that a modern optimizer just entirely removes this attempt at doing its job for it!
 
Don't do that! Branch mispredicts are as bad as a floating point divide.
You can see here that a modern optimizer just entirely removes this attempt at doing its job for it!
The real question is what the 2003 compiler is going to do. I assume compiler optimizations have improved in the last 16 years. Earlier today I started wondering if Compiler Explorer can be set up to use our ancient compiler. It would be interesting to know what it is doing.
 
One other thing, which isn't mentioned: do not use GC.getDefineINT(). The lookup is dead slow. Maybe not horrible in computer-science terms, but since we know the result will always be the same, storing an int somewhere (locally, in GC, globally or whatever) will speed up the code significantly. The code behind getDefineINT() is an unordered map, aka a hash table. Even if implemented perfectly, it will always be dead slow relative to reading a cached int.
I've heard this before. Do you have any evidence that this actually has any impact? It seems to me that most DefineINTs are not used often enough for this to make a difference. Or do you only do that selectively?
 
The real question is what is the 2003 compiler going to do.
It would be interesting to know, but it isn't the real point, which is that you shouldn't add that early-out "if" to avoid arithmetic regardless of whether the optimizer could remove it, as it is actually slower (significantly).
To see what VC2003 would generate quickly you can just put the code in DllMain and then breakpoint and switch to disassembly, but it would be nice to have it in a box to play with more easily!
 
I've heard this before. Do you have any evidence that this actually has any impact? It seems to me that most DefineINTs are not used often enough for this to make a difference. Or do you only do that selectively?
In Religion and Revolution (now We The People) I profiled and found a line which consistently used 1% of the CPU time spent by the AI. It was in CvUnit::canHaveProfession, which is called many times for each unit. All that line did was call getDefineINT. Caching that int and replacing the line with a GC call to get the now-cached int made the AI use 1% less CPU time. Sure, 1% might not sound like much, but for a single line of code I consider it to be a lot.

I can't answer how frequently you need to call getDefineINT before it matters for performance, but I do know that there are places in the code where it shouldn't be. Also, caching the result might not have to be done globally; doing it before a loop instead of in each iteration would save time without touching code outside the function in question. It's not that getDefineINT is banned, it just shouldn't be used without considering how often the function will be called.
 
After Bill made it possible for anyone to build the dll, I went and turned the xml global defines into #define.
Never put them to use though.
 

I went and turned the xml global defines into #define.
I wrote a perl script in WTP (develop-2.8 branch) which does something like that automatically at compile time, except it takes the Type from all the xml files and builds enums out of them. It doesn't support modules though, as in it's one enum per xml file.

I'm sure it wouldn't be too much work to alter this to generate those defines if that's what you want. Generally speaking the compiler optimizes better the more numbers are known at compile time meaning you are on the right path for optimization if you are willing to sacrifice the ability to alter the xml file in question without recompiling the dll. Another tradeoff is that compiling then requires people to install strawberry perl, though that one at least has a trivial installer.
 
I was actually thinking of memory usage and not just performance when I made that. Those #define things take no ram, right? And any xml value could be built into the dll like that after some modding. Seems kinda like free xml. I just don't know if it would be too much work to be a good idea, so I started off with the simplest xml.
 
I was actually thinking of memory usage and not just performance when I made that. Those #define things take no ram right? And any xml values can be built into the dll like that after some modding. Seems kinda like free xml. I just don't know if it would be too much work to be a good idea so started off with the simplest xml.

The reason these values are defined in xml files is to make it possible to tweak them without having to compile the dll each time.

This lets players change settings; putting these values into the source code as defines makes that impossible.
 