DLL optimizations I'm considering.

Poll: OPTIMIZED?
  • OPTIMIZED! 14 votes (87.5%)
  • DE-OPTIMIZED! 2 votes (12.5%)
  • Total voters: 16
Also, the amount of memory you would save by putting GlobalDefines in a header instead is going to be so trivial as to be not worth the effort (I'm just going to assert that unless someone has data showing otherwise), *and* the xml can be reloaded dynamically so modders can experiment with different values without restarting the game (I did this with the new paging system to tweak the values).
Certainly inner loops and the like should not be repeatedly looking up the same value over and over, whether it is a global define setting or any other value from a map; just look it up once before the loop and put it into a variable.
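To illustrate the hoisting, here's a minimal sketch with getDefineINT and the map behind it stubbed out (the tag name and the stub are stand-ins, not the actual DLL code):

```cpp
#include <map>
#include <string>
#include <vector>

// Stand-in for the real defines map and its accessor.
std::map<std::string, int> g_defines;

int getDefineINT(const std::string& tag)
{
    return g_defines[tag]; // map lookup: fine once, wasteful per iteration
}

int sumModified(const std::vector<int>& values)
{
    // Look the define up once, before the loop...
    const int iModifier = getDefineINT("SOME_MODIFIER");
    int iTotal = 0;
    for (int i = 0; i < (int)values.size(); ++i)
    {
        iTotal += values[i] * iModifier; // ...not once per iteration
    }
    return iTotal;
}
```

The only change from the naive version is moving the getDefineINT call above the loop; the loop body then touches no map at all.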

One thing we *could* do with global defines (and other settings) though would be to generate a class GlobalDefines header from the xml structure, along with a load function for it (probably just the constructor itself can load the xml). That way we can avoid using lookup by string entirely, get full type safety, AND fast access, all without any extra caching or variable required anywhere.
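A rough sketch of what such a generated header could look like (the tag names and the XmlSource stand-in are invented for illustration; the real generator would emit one typed member per GlobalDefines tag):

```cpp
#include <map>
#include <string>

// Stand-in for the real xml reader; in the DLL this would parse GlobalDefines.xml.
struct XmlSource
{
    std::map<std::string, int> values;
    int getInt(const std::string& tag) const
    {
        std::map<std::string, int>::const_iterator it = values.find(tag);
        return it == values.end() ? 0 : it->second;
    }
};

class GlobalDefines
{
public:
    // The constructor does the one-time string lookups; afterwards every
    // access is a plain member read: typed, fast, no map lookup anywhere.
    explicit GlobalDefines(const XmlSource& xml)
        : iCityMaxNumBuildings(xml.getInt("CITY_MAX_NUM_BUILDINGS"))
        , iMaxPlotNum(xml.getInt("MAX_PLOT_NUM"))
    {}

    int iCityMaxNumBuildings;
    int iMaxPlotNum;
};
```

Since the header is generated from the xml structure itself, adding a tag to the xml regenerates the member and the load line together, so the two can't drift apart.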
 
I was actually thinking of memory usage and not just performance when I made that. Those #define things take no ram right? And any xml values can be built into the dll like that after some modding. Seems kinda like free xml. I just don't know if it would be too much work to be a good idea so started off with the simplest xml.
You are looking at the wrong location for saving memory. When you reduce the memory usage of a class, whatever you save should be multiplied by the number of instances there are of that class. This means saving 40 bytes in CvGlobals saves 40 bytes because there is just one instance of that class. However, saving 4 bytes in CvPlot will save a whole lot more because it's 4 * the number of plots in the game.

Hardcoding numbers that are stored in CvGlobals should be done for performance reasons, not memory reasons. However, you need to start by identifying the xml-read numbers as a slow part, because it doesn't matter if you make something 90% faster if it only uses 0.00001% of the CPU time to begin with.

Also, there is no such thing as "no ram" from moving xml data into the C++ code, because the dll file is copied into memory. This means if you have 12 bytes of additional compiled code, the dll will use 12 bytes more. Sure, using a define can reduce the dll size because hardcoding a number takes less space than a call to the get function, but it's still not 0 bytes and not something you should consider for memory usage concerns. It's all about memory access time, hence performance.

It's also worth considering making an enum rather than defines because that way you tell the compiler you want an unchanging int. You can also use more strict typing with enums, allowing compiler errors if you mess up, though it's not obvious how to use the strict typing with what is currently defines.

The reason that these values are defined in xml files is to make it possible to tweak them without having to compile the dll each time.

This allows players to change settings and putting these values as defines into the source code makes that impossible.
Yeah, and that's why hardcoding xml data can be problematic. Ideally it should go the other way and defines end up in xml. This was a topic when I decided I wanted to make enums out of all the xml files. The main goal at the time was to tell the debugger all the types, since it said something like "unittype 79 has profession 11", which really isn't human readable. In order to preserve the ability to alter xml without recompiling, I added a compiler flag to turn the hardcoding on or off. To keep code that loops using enum values like NUM_UNIT_TYPES working when not hardcoding, I have added global variables of the same names, one for each xml file's enum. Setting all those variables is the first thing that happens after reading the lengths of the xml files. The setup code is inside #ifdef to prevent it from trying to assign values to enums at runtime.

Having global variables like that comes with a new problem: they can be changed at runtime. However, this can be checked by compiling a hardcoded dll, because the compiler will fail if you assign a value to an enum. This means error checking for this part of the code consists of compiling both a hardcoded and a non-hardcoded dll file, and errors will be detected at compile time.
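A sketch of the two build modes (the flag name HARDCODE_XML and the numbers are made up; the real code covers every xml file): with the flag defined, NUM_UNIT_TYPES is a compile-time enum constant and any runtime assignment to it fails to compile; without the flag, it is a global variable set after the xml lengths are read.

```cpp
#ifdef HARDCODE_XML
// Hardcoded build: the length is an enum constant, known at compile time.
enum { NUM_UNIT_TYPES = 79 };
#else
// Runtime build: the length is a global variable, assigned during startup.
int NUM_UNIT_TYPES = -1;

void setXmlLengths(int iUnitTypes)
{
    NUM_UNIT_TYPES = iUnitTypes; // would not compile in the hardcoded build
}
#endif

// Looping code is identical in both builds.
int countUnits()
{
    int iCount = 0;
    for (int i = 0; i < NUM_UNIT_TYPES; ++i)
        ++iCount;
    return iCount;
}
```

Compiling the same sources with and without the flag is the error check described above: the assignment in setXmlLengths only exists in the non-hardcoded build, so a stray write to a length constant breaks the hardcoded compile.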

Which type of dll to use for the next release is still an open question. It depends on future profiling results. It's entirely possible that there will be two dll files, one for players and one for players who want to edit xml files. The latter also has asserts enabled, because xml data is validated with asserts. Non-assert based error checking will check if hardcoded values match the xml during startup, meaning it can't silently fail. Also, it's worth noting that even without hardcoding it should still be faster, because reading a global variable is faster than calling getNum*Types() and then using that to call vector::size.

Hardcoding xml data into the dll is a valid approach and beneficial if done right. You do however need to really understand the tradeoffs and figure out how to handle those in an acceptable way.
 
Also there is no such thing as "no ram" by moving xml data into the C++ code because the dll file is copied into memory. This means if you have 12 bytes of additional compiled code, the dll will use 12 bytes more. Sure using a define can reduce the dll size because hardcoding a number takes less space than a function call to the get function, but it's still not 0 bytes and it's not something you should consider for memory usage concerns.

Thinking about this now, there actually could be significant performance AND memory savings beyond just those from removing the value itself, or the function call to read the value. The optimizer has much more options when it comes to optimizing a hardcoded value than it does when optimizing a function call, including entirely eliding if statements, unrolling loops, inlining functions. I might do a profile comparison using this technique at some point, it could offer a few performance surprises we can't predict!

This was a topic when I decided I wanted to make enums out of all the xml files. The main goal was at the time to tell the debugger all the types since it at the time said something like "unittype 79 has profession 11", which really isn't human readable. In order to preserve the ability to alter xml without recompiling I added a compiler flag to turn the hardcoding on or off.
As an alternative solution you might be interested in our c2c.natvis file. It can resolve in the debugger enum names, along with a lot of other stuff. It can even display FreeLists as arrays (that was fun to write in natvis xml language...). It shows plot and unit type and class by name, the owning civ leader name, plot contents, etc.
Example here (its old so plot wasn't implemented yet): https://cdn.discordapp.com/attachments/600993642488004619/618171971095625748/unknown.png
 
Don't do that! Branch mispredicts are as bad as a floating point divide.
I'm wondering about that statement and looked it up. Take a look at this overview of operation costs. Matt Godbolt (creator of Compiler Explorer) used this in his cppcon presentation and called it good (though he doesn't take credit for making it). I assume that means it's a trustworthy source. Also, mind you, it says the numbers depend on the CPU and various compiler optimization tricks, but it still gives a good idea of how fast each type of operation is relative to the others.

Let's look at some example code where conditionally returning early could be considered.

PHP:
int calcValue()
{
    int iVar = 0;
    iVar += getA();
    if (iVar == 0) return 0;
    iVar *= getB();
    iVar /= 100;
    return iVar;
}

*= takes 1-7 cycles, and while int division takes 15-40, we assume the compiler breaks the division by 100 down into a series of bitshifts and additions, which can be done in "a few" cycles. However, we also need to call getB(), and if it isn't inlined, that takes 15-30 cycles. If we assume a cache miss, the memory latency adds 100-150.
In total that's 116 to 187 cycles (assuming instant add and bitshift). If getB() is more complex than just returning an int, it can become even worse. A wrong branch prediction takes 10-20 cycles, though according to the text 15-20 is more likely. Still 20 is far less than 116 meaning guarding a likely cache miss with a conditional return seems like a good choice.

Sometimes the modifier-applying part (the *= getB(); /= 100 lines) applies more than a single modifier. If it's 3, and all of them have at least a 50% risk of a cache miss, then the conditional return seems like the right choice, even if we can't assume 3 modifiers take 3 times as long as a single modifier. If the get functions are inline, then it's entirely possible that all 3 are fetched at the same time, meaning 3 isn't much slower than 1. We can't tell for certain, but it's possible.

Thinking about this now, there actually could be significant performance AND memory savings beyond just those from removing the value itself, or the function call to read the value. The optimizer has much more options when it comes to optimizing a hardcoded value than it does when optimizing a function call, including entirely eliding if statements, unrolling loops, inlining functions. I might do a profile comparison using this technique at some point, it could offer a few performance surprises we can't predict!
I'm fairly certain I know of a location where hardcoding the length of xml files will make a huge difference. That's looping through a specific xml file where I know we only have a single entry. I assume that will cause the loop to simply vanish and be reduced to just an access to the first (and only) entry in the vector in question.
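A tiny sketch of the effect (names invented): with the length hardcoded as a constant 1, the optimizer sees a loop whose body runs at most once and can collapse it to a single access.

```cpp
// Hardcoded xml length for a file known to have exactly one entry.
enum { NUM_EUROPE_TYPES = 1 };

// Returns the index of the first entry with a positive value, or -1.
int firstPositive(const int* aValues)
{
    int iBest = -1;
    for (int i = 0; i < NUM_EUROPE_TYPES; ++i)
    {
        if (aValues[i] > 0)
        {
            iBest = i;
            break;
        }
    }
    // With the bound fixed at 1 the optimizer can reduce this whole loop
    // to a single check of aValues[0]: no counter, no branch-back.
    return iBest;
}
```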

Also, now that you mention inlining: when virtual functions were mentioned, part of the reason why they are bad is that they prevent inlining, but it's also the fact that, according to the link, a virtual call takes 30-60 cycles instead of the 15-30 for a regular call.

As an alternative solution you might be interested in our c2c.natvis file. It can resolve in the debugger enum names, along with a lot of other stuff. It can even display FreeLists as arrays (that was fun to write in natvis xml language...). It shows plot and unit type and class by name, the owning civ leader name, plot contents, etc.
I would have been very interested if I hadn't finished the enum approach already. Then again, maybe it is a good idea to look into it anyway, because the enum approach alters the dll while the natvis doesn't. In theory that might affect whatever bug is being investigated, though I find it highly unlikely.
 
I'm wondering about that statement and looked it up. Take a look at this overview of operation costs. Matt Godbolt (creator of Compiler Explorer) used this in his cppcon presentation and called it good (though he doesn't take credit for making it). I assume that means it's a trustworthy source. Also, mind you, it says the numbers depend on the CPU and various compiler optimization tricks, but it still gives a good idea of how fast each type of operation is relative to the others.
Yeah that is my source for CPU operation costs if I want to quote something.

Let's look at some example code where conditionally returning early could be considered.
Sure although I was saying don't try early out to avoid a single integer divide, not don't ever early out :crazyeye:

Also, now that you mention inlining: when virtual functions were mentioned, part of the reason why they are bad is that they prevent inlining, but it's also the fact that, according to the link, a virtual call takes 30-60 cycles instead of the 15-30 for a regular call.
Yeah, although make sure you have the /LTCG linker flag and /GL compiler flag (you may have them already in your makefile), or you aren't getting basically any inlining at all, except within a single compilation unit. Enabling those two flags for our FinalRelease build gives something like a 50% speed up, but increases link time to 15 minutes or so on a fast computer.

One place that helps with respect to virtual functions is to flatten the FreeList classes (remove the base class and just copy it into the main classes). I think this alone gave us a few % bump in performance.

I would have been very interested if I hadn't finished the enum approach already.
You should check it out anyway, it provides a lot more than just the enums: being able to browse unit list on a plot, city list of a player, etc., and showing only a set of useful properties by default (instead of 100 randomly ordered ones), and you just have to add it to the solution, no other work necessary.
 


For yet another way to handle GlobalDefines: I've defined a macro so that GlobalDefines can be cached at CvGlobals by just adding the tag name to the macro (GitHub link). Often, I just use a static local variable instead; many of those "global" defines should really matter only in one place, so keeping them in local variables seems appropriate.
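A guess at the shape of such a macro (the real one behind the GitHub link differs; the tag names and the getDefineINT stub here are invented): adding a tag name to the macro declares both the cached member and its getter, and a single cache function fills them at xml load time.

```cpp
#include <map>
#include <string>

// Stand-in for the real xml-backed defines lookup.
std::map<std::string, int> g_xmlDefines;

int getDefineINT(const char* szTag)
{
    return g_xmlDefines[szTag];
}

// One macro invocation per tag: declares the cached int and its getter.
#define CACHED_DEFINE(TagName) \
public: \
    int get##TagName() const { return m_i##TagName; } \
private: \
    int m_i##TagName;

class CvGlobals
{
    CACHED_DEFINE(MAX_HIT_POINTS)
    CACHED_DEFINE(CITY_FREE_CULTURE)
public:
    // Called once when the xml is (re)loaded; the string lookups happen
    // here and nowhere else.
    void cacheGlobals()
    {
        m_iMAX_HIT_POINTS = getDefineINT("MAX_HIT_POINTS");
        m_iCITY_FREE_CULTURE = getDefineINT("CITY_FREE_CULTURE");
    }
};
```

Because the cache is refilled by cacheGlobals, this approach keeps working with dynamic xml reloading, which is exactly what a static local cache (the alternative mentioned above) gives up.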
 
Often, I just use a static local variable instead;
You have to be aware of what static is doing.
PHP:
void func()
{
    static int iValue = getDefineINT("SOME_TAG");
    // use iValue...
}
It compiles to roughly the same as:
PHP:
bool bStaticSet = false;
int iValue;
void func()
{
    if (!bStaticSet)
    {
        bStaticSet = true;
        iValue = getDefineINT("SOME_TAG");
    }
    // use iValue...
}
This means initializing a static local at runtime introduces conditional branching. It's not the fastest approach, but it's still recommended on the forum because it's so simple to code and way better than calling getDefineINT each time. In most cases this falls into the "good enough" category.
 
The 2 places on the forums I like to read are here, at C2C, and the main threads where Nightinggale and f1rpo hang out. So it's convenient for me having everybody here.

The reason that these values are defined in xml files is to make it possible to tweak them without having to compile the dll each time.
Now that anyone can build the dll without setting up VS, I'd say the benefits from the xml are reduced.
Perhaps C2C is uniquely qualified for a modification like that.

I might do a profile comparison using this technique at some point, it could offer a few performance surprises we can't predict!
If you don't mind wasting time testing, I don't mind wasting time changing the getDefineINT() calls.
 
Sparse bool arrays - these seem to be used a lot, and in every case I have so far looked at they are forcing the code to be slower instead of faster, and the memory usage to be higher than it should be (which also impacts performance).
I solved this and many other issues related to arrays by writing EnumMap. While it's still wip, it seems to be done for all the common tasks (get, set, reset etc).

The class is intentionally just a pointer, meaning it's 4 bytes. It's a template class taking two arguments. The first is the array type (say UnitTypes), which is used for both strict type for functions requiring an index argument and for length. The second is what kind of variable should be stored at each index (int, short etc).

Certain functions (particularly get and set) are specialized template functions, meaning they do not actually do the same for all template instantiations. If the storage type is bool, it will create an array of unsigned 32-bit ints, each called a block in the code. It will store 32 bools in each block, meaning if you need 80 bools, it will allocate 3 blocks, or 12 bytes in total.

If the length is known at compile time and it's max 32, then a bool array will not contain a pointer. Instead it will use the fact that the pointer is in a union to replace the pointer with a block and use that directly. No need to look up memory elsewhere. I expanded on this and allow the class to use more than one block if needed, but only if the length is known at compile time and is <= some enum value, which I have set to 64. This means EnumMap<TeamTypes, bool> will not use a pointer; instead it will be an 8 byte variable, which is read directly. This is the only exception to the rule that it's a 4 byte class instance, though I'm considering adding the same feature to char arrays, which might in the future be allowed to use 8 bytes as well (not sure; time will tell).
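The block idea described above can be sketched as a simplified stand-alone class (this is not the actual EnumMap template, and it omits the union/pointer trick and the compile-time specialization):

```cpp
#include <vector>

// 32 bools packed per unsigned 32-bit block.
class BoolBlocks
{
public:
    explicit BoolBlocks(int iLength)
        : m_blocks((iLength + 31) / 32, 0u) // e.g. 80 bools -> 3 blocks
    {}

    bool get(int i) const
    {
        return (m_blocks[i / 32] >> (i % 32)) & 1u;
    }

    void set(int i, bool bValue)
    {
        if (bValue)
            m_blocks[i / 32] |= (1u << (i % 32));
        else
            m_blocks[i / 32] &= ~(1u << (i % 32));
    }

    int numBlocks() const { return (int)m_blocks.size(); }

private:
    std::vector<unsigned int> m_blocks;
};
```

80 bools cost 12 bytes here instead of 80 with a plain bool array, and a whole block of 32 flags arrives in one memory read.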

If it stores a type of enums, like EnumMap<TeamTypes, PlayerTypes>, then it will use memory like it's a short, except if the length is known at compile time and it's max 128, in which case it will store char. This means <TeamTypes, PlayerTypes> will be compiled as a char array of length MAX_TEAMS.

Written for We The People, which has the ability to optionally hardcode the lengths of all xml files, EnumMap is intentionally designed to get as much done as possible at compile time if the length is known, though it does support setting the length at runtime. Performance is still good in that case and certainly no worse than vanilla.

There is savegame code, but it's written specifically for custom code in WTP. This means anything related to savegames will have to be modified.

Feel free to use what you can and feedback is most welcome.
 
Interesting stuff. Can I copy?
As far as I'm concerned, sure. Might want to copy the latest: CvGlobals.h/.cpp
Though, apparently, I've only changed the name of the GlobalDefines macro.

I've tried a similar approach with the CvInfo classes, to make it easier to introduce new XML tags, but the result is a bit unwieldy: CvInfo_Terrain.h/.cpp
Client code would be e.g.:
Code:
int iHealthPercent = GC.getInfo(eImprovement).get(CvImprovementInfo::HealthPercent);
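For readers wondering what's behind that call, here's a guess at the minimal shape of the enum-keyed tag storage (a sketch only, not the actual CvInfo_Terrain code; the second tag name is invented):

```cpp
#include <vector>

class CvImprovementInfo
{
public:
    // One enumerator per int xml tag; NUM_INT_TAGS doubles as the array length.
    enum IntTag
    {
        HealthPercent,
        DefenseModifier, // invented example tag
        NUM_INT_TAGS
    };

    CvImprovementInfo() : m_aiTags(NUM_INT_TAGS, 0) {}

    // Typed access: passing anything but an IntTag is a compile error.
    int get(IntTag eTag) const { return m_aiTags[eTag]; }

    // The xml loader would call this once per tag while reading the file.
    void set(IntTag eTag, int iValue) { m_aiTags[eTag] = iValue; }

private:
    std::vector<int> m_aiTags;
};
```

The appeal is that adding a new xml tag only means adding one enumerator, instead of a member variable, a getter, and a read call; the cost is the slightly unwieldy client syntax shown above.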
--
A small caveat that I had meant to add about using static local variables: Those won't get updated by setDefineINT.
 
Thanks for this thread. :thumbsup:

Learned a bit again about potential performance optimization of DLL source code.
(Although I must admit that I did not even understand half of this thread.)

@billw2015
I am not sure if this is a thing for C2C as well, but in our mod WTP I noticed that some performance was lost because of wasted "caching potential",
e.g. Trait Modifiers bypassing "processTrait" (unnecessarily looping all Traits every turn instead of having a value from "processing once" cached at Player) --> see here

Maybe check if in C2C "caching potential" is also wasted. (For Traits, Buildings, Civics, Techs, Promotions, ...)
Since I do not know your source code I cannot know if that is a thing for your mod as well of course. :dunno:

Especially the more Traits, Buildings, Civics, Techs, Promotions, ... you have in your mod, the more noticeable this may get.
It is probably not the world in performance gain, but every bit can count in the end if a mod gets really big.
 
Interesting. Maybe I'll check that out.
There are lots more simple things that can be cached in c2c. I have no idea how effective anything would be though.
- Living players could be cached instead of max player loop and calling isAlive().
- Players on team can be cached in CvTeam.
- plots could be cached in CvArea.
 
There are lots more simple things that can be cached in c2c. I have no idea how effective anything would be though.
- Living players could be cached instead of max player loop and calling isAlive().
The effect on performance would be minuscule, as the loop is very small, 50 iterations, and in some games all 50 may be alive.
The "is alive" status of players changes rarely, which is an argument for caching; if the cache had to change often, maintaining it would not be worth the time saved by shortening the small loop by a couple of iterations.
- Players on team can be cached in CvTeam.
- plots could be cached in CvArea.
Caching plots in CvArea and players in CvTeam would probably be worthwhile. A change in either cache is extremely rare (terraforming and nukes can change what area a plot belongs to, I guess; some future global warming feature would do this too), and it cuts down more than half of the loop iterations in pretty much every case, usually more, as teams tend to be small (1-5 players) and there are usually many areas on a map, meaning one area is unlikely to contain most of the plots.
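A minimal sketch of the CvTeam player cache idea (the types and names are invented, not the real classes): keep the member list up to date on the rare membership changes, and every "for all players in this team" loop becomes a direct walk over the cached list.

```cpp
#include <vector>

// Instead of looping MAX_PLAYERS and testing getTeam()/isAlive() each time,
// the team keeps its own member list and updates it on the rare changes.
class Team
{
public:
    void addPlayer(int ePlayer)
    {
        m_members.push_back(ePlayer);
    }

    void removePlayer(int ePlayer)
    {
        for (std::vector<int>::size_type i = 0; i < m_members.size(); ++i)
        {
            if (m_members[i] == ePlayer)
            {
                m_members.erase(m_members.begin() + i);
                return;
            }
        }
    }

    // Loops iterate exactly the 1-5 actual members, not all 50 player slots.
    const std::vector<int>& getMembers() const { return m_members; }

private:
    std::vector<int> m_members;
};
```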
 
Regarding optimization, step one is always to profile. Use a fully optimized DLL with symbols included and then run a profiler while the AI takes a turn (well multiple with autoplay if you like). Very Sleepy might be primitive, but it gets the job done. If a function uses significantly less than 1% of the total time, then it doesn't appear to be interesting to optimize, not even if you can make it 30% faster.

isAlive() doesn't seem to be a good one to cache, because the players seem to be one of the few things in vanilla that are stored sequentially. This means the CPU will likely notice the read pattern and prefetch the next bool before it's needed. This reduces the memory latency, which is the biggest performance killer for this task.

Caching players in CvTeam seems to be a good idea because, yes, it will change rarely, and there are a bunch of cases all over where the code needs "for all players in this team". How often something is read vs how often it changes is the key here. My biggest single change in the code was caching profession cost, in what was back then called RaR. It wouldn't be wrong to say it's called number of city plots * 50 times for each citizen; that gives the correct scale. Vanilla then, in each inner iteration, calculates something slow, which can be affected by.... traits... yeah, traits. They change rarely enough to justify caching. Granted, founding fathers (techs that use the wonder approach of only being granted to one team) can give traits, but that's still rare enough not to really be a concern. This cache alone dropped the overall wait time from 40 seconds to 35 seconds in my test savegame. Yeah, this is Colonization specific, but in a sense all optimization becomes specific to a certain DLL once it has been modded enough, and it does reveal the possible impact of a single cache if the code in question is slow enough.

My easiest optimization was inside the same loop, where getDefineINT was used to get a value from XML. Caching this value in an int saved 1% of the overall wait time. This is because getDefineINT stores the string/int combos in an unordered map and the implementation sucks from a performance perspective. In general getDefineINT should be avoided, particularly in loops.

Caching plots in CvArea
How useful this is depends on how often the code needs to loop all the plots in an area. I do not actually know the answer to this question. Investigating areas further is on my todo list.
 
CvInfoBase uses 320 bytes for what? Some strings and a bunch of virtual function pointers. Most of this will never be used, though there is a risk that the exe expects the class memory layout to be like this.... not great.

I wonder if it's possible to make a base class for CvInfoBase (CvInfoBaseBase) and then make a second derived class (CvInfoBaseCustom) using the same base as CvInfoBase? Then use CvInfoBase for infos with DllExports and CvInfoBaseCustom for infos with no DllExports?
 
I wonder if it's possible to make a base class for CvInfoBase (CvInfoBaseBase) and then make a second derived class (CvInfoBaseCustom) using the same base as CvInfoBase? Then use CvInfoBase for infos with DllExports and CvInfoBaseCustom for infos with no DllExports?
There are two dangers from the exe:
  1. Exe uses a pointer to a class instance and calls a DllExport function and expects arguments to be the same as in vanilla.
  2. Exe uses a pointer to a class instance and calls a virtual function expecting the function pointer to be at a specific offset from the pointer address.
The first one is fairly easy to handle: just look for the DllExport keyword. Vanilla and modders have added way too many DllExports though. Whether the exe actually uses one can be checked with Dependency Walker. Note that both exes have to be checked, as pitboss calls a bit differently than the regular exe.

The virtual function call is a pain as we have no warning in the code about this and failure to maintain the virtual functions at specific addresses will crash the game. I tried this with CvPlayer. I'm however not sure if it matters to any other class.

If you can be sure a class isn't affected by any of this, then you are free to do whatever you want. I suppose you could have a pure virtual CvInfoBaseBase class to use as a base for everything, since a pure virtual class adds no data members. I just don't see the purpose of it. If the info class exists only in the DLL and perhaps python, then why bother having the same base class? You can just have two base classes with the functions you need.


Speaking of optimizing the info classes: they are all stored in vectors of pointers to instances. If the instances themselves were stored in the vectors, they would be stored sequentially, in which case looping over all of them would make better use of the CPU cache, hence being faster.

In general the civ4 engine seems to be designed around memory access concepts from the era when memory latency wasn't an issue; the CPU was so slow that resolving pointers was cheap or even free when counted in CPU cycles. While this was no longer true when the engine was designed, it was still taught as if it were. The whole concept of virtual functions also turned out to be flawed because of this, but very few had realized it at the time. It was back when "java will always be slower than C++, but future computers will be so fast that it won't matter" was something even university professors believed, so I don't blame Firaxis for this. They did as everybody else did at the time. However, the problem of the CPU being so fast that it has to wait for data from memory grows with each CPU generation, meaning we know the computers we use today will benefit from optimizing where data is stored in memory. The exe won't care, because we can still give it a vector of pointers if it requests it; the exe won't care where the pointers point in memory as long as the data is there.
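The layout difference can be sketched like this (a toy example, not the real info classes): both loops compute the same sum, but the instance vector reads memory sequentially while the pointer vector adds one indirection per element, each a potential cache miss.

```cpp
#include <vector>

struct Info
{
    int iValue;
};

// Instances stored contiguously: sequential, prefetch-friendly reads.
int sumByValue(const std::vector<Info>& infos)
{
    int iTotal = 0;
    for (int i = 0; i < (int)infos.size(); ++i)
        iTotal += infos[i].iValue;
    return iTotal;
}

// Pointers stored contiguously, but each element lives elsewhere:
// one extra pointer resolve (and possible cache miss) per element.
int sumByPointer(const std::vector<Info*>& infos)
{
    int iTotal = 0;
    for (int i = 0; i < (int)infos.size(); ++i)
        iTotal += infos[i]->iValue;
    return iTotal;
}
```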

This will require the xml reading code to be modded though. It is in the DLL, so it is possible. Reducing the memory size of info classes becomes more important if they are to be read sequentially as that would reduce the number of memory reads while looping even more. Personally I believe reducing memory usage for info classes matters more to performance than to actual memory usage because it's likely mainly a question of how much of an info class or how many info classes we can fit in the CPU cache.

EDIT: I just realized part of this post is also part of what the first post in this thread is about. It has been so long since I read it that I forgot and admittedly I didn't read the entire thread prior to adding one more post. Still worth repeating though.
 
Sparse bool arrays - these seem to be used a lot, and in every case I have so far looked at they are forcing the code to be slower instead of faster, and the memory usage to be higher than it should be (which also impacts performance).
I have worked on this issue in WTP. I have essentially made a block system where a uint32 is used to store 32 bools. Arrays then consist of X blocks. If the number of blocks happens to be one, then the memory allocated for the pointer will be used to store the bools instead, effectively removing the need to allocate memory elsewhere.

In the current implementation (sadly not very portable) it does compile time optimization and can use up to 2 blocks without using pointers if the length is known at compile time. This is then used for stuff like CvPlot storing which teams have ever seen said plot.

Bake the static portions of the AI scoring - a large part of the valuation that the AI gives to buildings when deciding which to build is based off their raw numbers and not contingent on any external factors, and thus can be pre-computed, instead of computed for every city, for every building, every turn. Probably the same is true of unit builds.
Pretty much what I wrote about caching yield costs of professions. The data is player specific as it depends on traits so get one 2D array (professions * yields) in each CvPlayer and boom.... much faster. In my specific case the AI went from 40 to 35 seconds based on just adding this one table. It's my impression that vanilla underused caches in general, which makes modding easier as inexperienced modders won't break caches, but at the same time it is kind of slow.

Implementing a cache isn't tricky. The tricky part is to figure out when to mark the cache dirty for recalculation. Ideally a cache is based on xml values only, in which case it can be cached the first time the main menu is opened. The most obvious way to use this is to say make a list in CvUnitClassInfo, which lists all CvUnitInfos using said class. This can make xml data stored in a way, which is fast to access at runtime without having to mess with the xml files themselves.
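The xml-only cache mentioned above can be sketched like this (the types and names are invented stand-ins for CvUnitClassInfo/CvUnitInfo): since the lists depend on xml data only, they can be built once at startup and never go dirty.

```cpp
#include <vector>

// Stand-in for CvUnitInfo: each unit info belongs to one unit class.
struct UnitInfo
{
    int iUnitClass;
};

// Build, once after xml load, a per-class list of the unit infos using
// that class. cache[eClass] holds the indices of those infos, so runtime
// code never has to scan all units to answer "which units have class X?".
std::vector<std::vector<int> > buildClassCache(const std::vector<UnitInfo>& units,
                                               int iNumClasses)
{
    std::vector<std::vector<int> > cache(iNumClasses);
    for (int i = 0; i < (int)units.size(); ++i)
        cache[units[i].iUnitClass].push_back(i);
    return cache;
}
```

Since nothing at runtime changes which class a unit info belongs to, this cache never needs a dirty flag, which is exactly what makes xml-derived caches the easy case.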
 
There is a vector of pointers to all the info classes. It's used in a couple spots. It would be nice not to break that stuff. And like you said, CvInfoBaseBase could probably be pure virtual.

std::vector<std::vector<CvInfoBase *> *> m_aInfoVectors;
 
There is a vector of pointers to all the info classes.
I'm only aware of one use case for it offhand, which is in CvGlobals' destructor. It loops through each info class instance and frees the memory to avoid leaking. If I understand it correctly, this isn't even that important, because who cares if it leaks memory while quitting? I fully agree that it should be investigated whether it is used for anything else at all prior to making any change that might break it.

And like you said, CvInfoBaseBase could probably be pure virtual.
Having a pure virtual base class could be used to add functions to read from xml. Sure it would be a frontend to the vanilla functions, but if used correctly, human readable error messages could be added. The more strict the DLL is regarding reading xml, the less risk there is of xml bugs.
 