DLL optimizations I'm considering.

PPQ_Purple · Jan 8, 2022

Speaking as someone with a little bit of C++ experience I'll share my thoughts. Now, these are my thoughts and just that, nothing more. They are NOT the absolute truth of the matter. I have been working with C++ for some 20 years now but that does not mean I know everything. But they are mine and I figured you'd be interested:

Massive classes - the bigger an object is in memory the less of it fits in cache.

Modern processors have caches which are large enough that this is not a concern. It wasn't really since like the x86 days. For reference I have an old laptop from like 10 years ago and it has an intel core i5 CPU that has a total of 4MB of L3 cache. You are not going to reasonably fill up that sort of space with code before multitasking makes you clear out.

Therefore I assume you are talking about memory and paging. Now that is a much more serious concern. But there too the size of a class just isn't a concern. You just aren't going to be filling up your RAM with code but with assets, mostly textures but also data of various types. And you just can't really get away from that. It does not matter if the data about your unit is split into 3 classes or all in one if it has to all be loaded concurrently for the program to get it's information. In fact, such division might make page faults more not less frequent.

So any memory optimizations should be directed not at the DLL but at minimizing those in terms of size, numbers and sharing where ever possible. Well that and looking into if the game really does just serialize ALL the XML and keep it in a giant clump somewhere in memory. That might need looking into. But again, carefully.

Vectors of pointers - this is bad performance wise in almost all cases. Of course some design requirements might make this impossible, but, as an example, the replacement system, where this is most used, is NOT one of them. At least as far as I can tell thus far from my surface reading of the code).

Yes and no. C++ vectors are just arrays with additional code added on to make them expandable. And I mean that quite literally. It's in the specification. What this means is that your typical CRUD operations aren't going to have worse performance than if you'd have used an array.

You only get significant overhead when you get to expanding the vector by adding elements or shrink the vector size. In those situations your program quite literally copies the whole array to a different larger one element by element. But unless you are doing that often you shouldn't worry about array vs vector performance. And if you are doing that often enough for it to be a concern you should consider using a different data structure altogether like a linked list. But that complicates access a lot.

So just saying vectors = bad is not correct. In fact, for the sort of applications I've seen CIV use them they are the right choice.

Pointers in general - they shouldn't be the default for how to instantiate an object, but they seem to be in this code, so I'm going to look into that. Maybe most or all uses are the correct choice, I shall see.

I am curious as to what else you would use. I mean, this is C++ we are talking about. Pointers of some description are literally the correct way to hold references to non primitive types. Unless you mean they are using raw pointers everywhere in which case that's not a performance concern but a memory management one.

I just hope that CIV does not actually use C style code with malloc instead of new. But that's about it.

Const correctness - yes the optimizer can in fact perform better when it knows that code doesn't have side effects. It will also lead to much easier ability to apply...

I can not say anything negative about wanting methods to not have side effects.

Data parallelization - once you know that the contents of a for loop *actually* don't have side effects (not that const actually guarantees it, but it is most of the way there), you can run it on more than one thread so it finishes faster. I don't know how many places could currently benefit from this, but some places that currently can't might be refactored such that they could.

The problem with concurrency is that it's not that simple. Sure, you can parallelize a loop here and there but that isn't going to net you much of a result in the big picture. In order to make proper use of parallelization you have to actually consider the entire algorithm that the method in question uses as well as those of the entire calling chain both front and back and change them into something that makes sense to run in parallel. It's a much larger project than just throwing in a few threads to speed up a loop. Indeed, taking the shallow approach will probably actually leave you worse off because threading it self requires a certain amount of overhead.

Not that I would be opposed to a more multi threaded DLL. But I am saying it is NOT going to be easy or simple to do and that the questions you should really be looking at are stuff like "how can I make the AI consider some of its unit actions in parallel" instead of "how do I make some small 100 item loop go in parallel".

Branching - its slow. Especially convoluted branching will hurt the branch predictor and the optimizer. I already noticed plenty of redundant branching as well, where constraints are checked multiple times.

Now this is something to look into. From what I have seen in modded CIV DLL code there is both a lot of redundant branching and a lack of result caching going on there.

Bake the static portions of the AI scoring - a large part of the valuation that the AI gives to buildings when deciding which to build is based off their raw numbers and not contingent on any external factors, and thus can be pre-computed, instead of computed for every city, for every building, every turn. Probably the same is true of unit builds.

Now THIS is exactly the sort of thing you should definitively be looking into.

Better utilization of caches once they are formed, e.g. longer term city build planning for the AI.

Would long term planning not reduce perceived performance by making the "planning" turns take longer? My concern is that even if you managed to cut down the absolute total time if that meant every Nth turn took way longer to finish than the rest that would make for a worse overall user experience.

So this is a good idea but it would need to be tested.

More accurate cache invalidation rules - caches don't need to be cleared at the start of a round, instead they can be cleared or partially cleared when values known to effect them have changed.

As long as the overhead to track this is not significant I agree. Just always remember to keep track of the overhead.

Multi-level cache with standardized representation, from which invalidation rules can be derived automatically - this would be the ultimate way to model the caches. Not a trivial task.

I am not opposed to the idea but must warn you based on my experience that you should plan this out carefully lest you create a monster class that does everything and results in more overhead than you gain by using it. This is a serious concern to keep track.

Queue multiple buildings - the AI is already evaluating and assigning a score to every possible building every time it changes production. Why not just put the top 5 rated buildings in the queue straight away? Then decide each turn if the AI feels strongly about clearing the rest of the queue due to some significant change in circumstances (war, tech with an un-built wonder in it etc.) This is how most humans play after all.

This sounds about right honestly.

Nightinggale · Jan 9, 2022

PPQ_Purple said:
Modern processors have caches which are large enough that this is not a concern. It wasn't really since like the x86 days. For reference I have an old laptop from like 10 years ago and it has an intel core i5 CPU that has a total of 4MB of L3 cache. You are not going to reasonably fill up that sort of space with code before multitasking makes you clear out.

I question this statement for three reasons:

It assumes everybody to rush out to buy a new CPU
It assumes optimal usage of the cache
L1 and L2 caches are still limited in size

When caching memory, the CPU stores a block. This means even if it only needs an int, it will store something like 4 kB. In other words with really poor memory layout management we could end up wasting 99% of the cache. In reality it's likely way better than that, but still nowhere near 0% waste. On top of that, it takes 40 CPU cycles to read from the L3 cache (Ryzen 3000 series, CPU dependent number) as opposed to 6 cycles for the L1 cache. Both are way better than the 1000+ cycles spent on reading uncached data from RAM, but using L1 instead of L3 still matters.

It should be noted that optimizing for low memory in the compiler options is still slower than optimizing for speed. Low memory has more overhead as it will more aggressively use loops and similar.

PPQ_Purple said:
Yes and no. C++ vectors are just arrays with additional code added on to make them expandable. And I mean that quite literally. It's in the specification. What this means is that your typical CRUD operations aren't going to have worse performance than if you'd have used an array.

Everything you write here is correct, but also missing the point. A vector is a number of instances of the same type of variable, all places sequentially. The issue is when the stored variable is a pointer. This means once you have the data from the vector, you would still need to follow a pointer to the actual data. This isn't the fault of vectors, but rather how the vector is used. A C style array of pointers would have the same issue.

Even without a cache miss, the extra step to resolve another pointer step is overhead, which will slow down the CPU.

PPQ_Purple said:
I am curious as to what else you would use. I mean, this is C++ we are talking about. Pointers of some description are literally the correct way to hold references to non primitive types. Unless you mean they are using raw pointers everywhere in which case that's not a performance concern but a memory management one.

I think it refers to using a pointer to an object rather than having the object itself locally. A pointer can be faster though if for instance it's a 404 bytes class it points to. It means the class containing the pointer will be reduced by 400 bytes if a pointer is used, hence not having a huge block of "unused" memory as part of the class itself when accessing multiple variables in the class instance.

Also speaking of pointers, vanilla is using pointers in cases where references should be considered instead. They compile to the same thing, but references can't be NULL. This way having an argument or return value as a reference instead of a pointer is telling future programmers that this function doesn't handle NULL.

There is one issue with references though. I found a way to screw up. Returning a reference and then store it in a non-reference. This will create a copy, but if the class contains pointers, the copy will use the same pointers and free them with the deconstructor. I fixed this by making the classes in question non-copyable, like this

PHP:

class CvInfoBase : private boost::noncopyable

This way a reference can be stored in a reference, but storing it in a normal variable will cause a compiler error. This in in a sense unrelated to performance, but in another way it isn't because you can use GC.getUnitInfo(x) and store the result in a reference to avoid making that call over and over on the same unit.

PPQ_Purple · Jan 10, 2022

Honestly, and I mean absolutely no offense by any of this, what you say strikes me as one of those theoretical problems I used to see in university back in the day. The sort of stuff where the computer science part checks out but the software engineering part is too impractical. And in my experience the sort of penny pinching with single digit CPU instructions and bytes of memory is only ever justified in situations where you have absolutely no other choice either because your software is running hilariously bad or because you are working with extremely restricted hardware. For example I had to do this once or twice back when I was doing work with x86 assembly in school or when I was working with microcontrollers that have really restricted memory on them. But I would newer try and do it in C++ on a big project like this unless forced at gunpoint.

And the reason for this is because writing your code in a way that optimizes in this way or worse yet rewriting existing code to do so always results in code that is more difficult to maintain, read, understand and generally work with. And the benefits can only usually be attained by actually sitting down next to your machine with a profiler.

To put it another way, I am not disputing your computer science but I question the cost / benefit ratio of actually putting it into practice. But if you do do it I would be very curious to see how it pans out. Here's to hoping you can prove me wrong.

As for that reference argument thing that's an interesting point and one I admit to not have considered. That's why I asked what else you'd be using. I thought you were talking about fields inside class headers and stuff rather than function parameters.

Either way I'll deffer to you on that one. It's one of those things that has its own pros and con's as you demonstrated and can go either way based on personal preference and how the existing codebase is set up.

Nightinggale · Jan 10, 2022

PPQ_Purple said:
Honestly, and I mean absolutely no offense by any of this, what you say strikes me as one of those theoretical problems I used to see in university back in the day. The sort of stuff where the computer science part checks out but the software engineering part is too impractical.

Going back to university days, I was presented with code during a lecture, which did some matrix calculations. Normal C code using loops to walk through arrays. Same code was then rewritten to be ugly and using pointers exclusively. This reduced runtime by 90% as it ended up doing the calculations without the overhead for converting to and from array indexes. This was taken from production code in a hard real time system meaning performance critical. Think self driving car kind of hard real time critical.

I do get what you are saying. Big O notation is something, which has the most focus in education. It's not wrong to consider it, but the practical value can easily be quite far from theory. For instance O(n^3) is far slower than O(n log n) so theory says we should use the latter. However O(n^3) can be faster for n=50 than for O(n log n) with n=500k. When teaching the theory, the option of changing the approach to lower n itself is never mentioned, yet you can encounter it in real life.

I will say that what is mentioned here is valid and will likely impact more than what you realize. Memory latency is a performance killer and it gets worse with each new CPU generation. Being concerned with how memory is utilized related to cache is a valid discussion. It's not a replacement to other performance improving approaches, like caching ints rather than looping to calculate it each time it's used, but nobody stated memory layout should be optimized instead of other optimization approaches.

Is it worth the time? Depends on the code and who you ask. I will say everything written in this thread is valid to include in a thread on this topic. How much comes to practical use in one or more mods is a different question, but it's valid to bring up for consideration. It's not like I have done everything mentioned here because that would quite literally take years to implement in an existing source code of this size, at least when using the unpaid free time approach that modders do.

Not wanting to try to do some or all of what is mentioned in this thread is fine. This is clearly for people who wants to improve performance. I might even go as far as to say some might only be well suited for people who view optimization as a game itself and hence CPU turn time as the score to beat. How important each type of optimization is highly depends on the mod. If the mod doesn't have performance issues to begin with, then why bother? If your turn timer is 5 minutes then profile to figure out where to focus on optimization.

PPQ_Purple said:
And the reason for this is because writing your code in a way that optimizes in this way or worse yet rewriting existing code to do so always results in code that is more difficult to maintain, read, understand and generally work with. And the benefits can only usually be attained by actually sitting down next to your machine with a profiler.

This I fully agree with. Use a profiler to figure out where you code is slow and work from there. Being concerned with memory layout is something you can look into once you have no clear places to optimize.

I also agree that using unreadable code is problematic. For starters, use more classes if the code becomes complex. If your array is more complex than just an array or vector, then make it a class, create instances and to anything outside of the class you use get and set, perhaps overloading []. If black magic happens, it happens inside that class and only there. If there is an issue, fix it inside that class and it's fixed for all the places, which uses said class. This approach is good even in cases where optimization isn't a concern.

In general civ4 makes too little use of "helper classes". Read a string from xml and turn it into an int based on the type. Make it a class/function and add human readable error messages in it. There are a whole lot of cases where small classes like this would be beneficially to error handling/reporting, clean code, code reusability and similar. If said class can be made faster than the existing code, then good. If not, then that's fine too.

MattCA · Jan 14, 2022

Nightinggale said:
vanilla is using pointers in cases where references should be considered instead. They compile to the same thing, but references can't be NULL. This way having an argument or return value as a reference instead of a pointer is telling future programmers that this function doesn't handle NULL.

I like that. Maybe I'll add that to billw's cpp style guide.
CPP Style Guide · caveman2cosmos/Caveman2Cosmos Wiki (github.com)

DLL optimizations I'm considering.

OPTIMIZED?

OPTIMIZED!

DE-OPTIMIZED!

PPQ_Purple

Purple Cube (retired)

Nightinggale

Deity

PPQ_Purple

Purple Cube (retired)

Nightinggale

Deity

MattCA

Prince

Similar threads