So the loop doesn't get unrolled, but the compiler is aware that it's exactly 3 iterations. I've given CvPlot::getYield an inline definition, everything in EnumMap is also inline, and the compiler has followed suit and eliminated those function calls.
I have been thinking a bit about this and it's actually not as bad as it seemed at first. The inlining works. The loop stop condition is a simple comparison against a fixed number rather than the return value of a function call. This means it will be faster even if the branch prediction fails. However the stop condition is fairly simple, which makes me wonder: will the branch prediction in the CPU be able to figure out it will execute 3 times? Maybe. At least the code will give the CPU a much better chance at being clever here compared to what it can do with vanilla code.
It also surprises me that this rather large function has an inline keyword.
That's because the inline keyword doesn't do what most people assume it does. It's a recommendation, not an order. This means the compiler will consider inlining the function, but it is allowed to not do so. Also (particularly true with modern compilers), the compiler is free to inline even if the keyword is missing. This means the value of the inline keyword for performance isn't what you would expect at first glance.
If a function defined in a header isn't declared inline, it will be compiled into every object file that includes the header. The linker will then detect multiple definitions of the same function and fail to link. However, if the function is declared inline, the linker will treat the copies as identical and merge them into a single instance, as if the function had been defined once in a cpp file.
Adding template functions in a cpp file means the cpp file has to be aware of which templates to compile for. In other words, the programmer needs to manually maintain a list of possible templates for EnumMap if the functions are in a cpp file.
Placing all the functions in the header will avoid the need for manual handling of which types to use with EnumMap and declaring them inline will solve the linker issue. Maybe they will be inlined, maybe not. That's up to the compiler.
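To illustrate the header-only layout, here is a minimal sketch. SimpleEnumMap and clampYield are made-up names for this example and not the real EnumMap code; the enum only mimics the three yields. Everything is written so it would sit in a header.

```cpp
#include <cassert>

// Imagine everything below lives in a header included by many cpp files.

// Non-template function defined in a header: without 'inline', every cpp
// file including the header emits its own definition and linking fails
// with a duplicate symbol. With 'inline', the copies are merged into one.
inline int clampYield(int iValue)
{
	return iValue < 0 ? 0 : iValue;
}

// Hypothetical stand-in for EnumMap. Member functions defined inside the
// class body are implicitly inline, and because the definitions are visible
// wherever the header is included, no list of instantiations is needed.
template <typename IndexType, typename T, int SIZE>
class SimpleEnumMap
{
public:
	SimpleEnumMap()
	{
		for (int i = 0; i < SIZE; ++i)
		{
			m_aValues[i] = T();
		}
	}

	T get(IndexType eIndex) const { return m_aValues[eIndex]; }
	void set(IndexType eIndex, T value) { m_aValues[eIndex] = value; }

private:
	T m_aValues[SIZE];
};

// Assumed enum layout, for illustration only
enum YieldTypes { YIELD_FOOD, YIELD_PRODUCTION, YIELD_COMMERCE, NUM_YIELD_TYPES };
```

Any cpp file can then declare SimpleEnumMap<YieldTypes, int, NUM_YIELD_TYPES> without anything else having to know about that combination in advance.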
There is also another issue to consider when looking at inline. A lot of the big functions have if-else structures (if not downright switch case) where the branch condition is set at compile time. This means a lot of the code is compiled and then removed by the optimization. In other words it's possible that less than 10% of the lines in a function will actually be used.
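A small sketch of such a compile-time branch. StorageMode and bytesPerElement are hypothetical names; the real EnumMap has its own set of compile-time settings.

```cpp
#include <cassert>

// Hypothetical storage modes, chosen only for this example
enum StorageMode { STORAGE_BYTE, STORAGE_INT };

template <StorageMode MODE>
int bytesPerElement()
{
	// MODE is a template argument, so the condition is a compile-time
	// constant. Both branches are compiled, but the optimizer can drop
	// the one that cannot be reached for a given MODE.
	if (MODE == STORAGE_BYTE)
	{
		return static_cast<int>(sizeof(char));
	}
	else
	{
		return static_cast<int>(sizeof(int));
	}
}
```

Each instantiation, say bytesPerElement<STORAGE_BYTE>(), should end up as little more than returning a constant, even though the source contains both paths.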
In this specific case adding a template function seems like a good idea if we want to optimize further. In fact for the hardcoded length arrays using templates to unroll the loop appears to be an option.
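For the record, template-based unrolling of a fixed-length array could look roughly like this. Sum is a hypothetical helper, not something that exists in the mod, and whether it actually beats the plain loop would have to be measured.

```cpp
#include <cassert>

// Recursive template: Sum<N>::of(p) is written as p[N-1] + Sum<N-1>::of(p),
// which the compiler can flatten into a chain of additions with no loop
// counter and no branch.
template <int N>
struct Sum
{
	static int of(const int* pArray)
	{
		return pArray[N - 1] + Sum<N - 1>::of(pArray);
	}
};

// Specialization that ends the recursion
template <>
struct Sum<0>
{
	static int of(const int*)
	{
		return 0;
	}
};
```

For a yield array this would be used as Sum<3>::of(aiYields), with the length supplied at compile time.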
A unit test for EnumMap shouldn't require anything from the Civ 4 EXE, so one could create a new EXE for running tests. Then again, test code for other classes might require the Civ 4 EXE to be loaded, so perhaps it's better to run all tests after loading Civ 4 and the mod and maybe even a special savegame.
I was thinking of something simpler, like calling a function after loading the xml data. This function can then declare an EnumMap, then assert(get), set, assert(get), reset, assert(get) etc. Add some code to make it compile only in debug mode to skip it in releases and EnumMap will be automatically tested every time somebody starts a debug build. If the tests can all be completed within a second, then it doesn't matter that they are tested this frequently. In fact it would be a bonus.
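Roughly what I have in mind, as a sketch only. YieldMap is a tiny stand-in so the example compiles on its own; the real test would use the actual EnumMap. It also uses the standard assert and NDEBUG, where the mod would presumably use its own assert macro and debug define.

```cpp
#include <cassert>

// Assumed enum layout, for illustration only
enum YieldTypes { YIELD_FOOD, YIELD_PRODUCTION, YIELD_COMMERCE, NUM_YIELD_TYPES };

// Stand-in for the class under test
class YieldMap
{
public:
	YieldMap() { reset(); }
	int get(YieldTypes eIndex) const { return m_aiValues[eIndex]; }
	void set(YieldTypes eIndex, int iValue) { m_aiValues[eIndex] = iValue; }
	void reset()
	{
		for (int i = 0; i < NUM_YIELD_TYPES; ++i)
		{
			m_aiValues[i] = 0;
		}
	}

private:
	int m_aiValues[NUM_YIELD_TYPES];
};

// Call once after the xml data has been loaded.
// The body compiles to nothing in release builds.
void testEnumMap()
{
#ifndef NDEBUG
	YieldMap map;
	assert(map.get(YIELD_FOOD) == 0);       // default value
	map.set(YIELD_FOOD, 3);
	assert(map.get(YIELD_FOOD) == 3);       // value was stored
	assert(map.get(YIELD_COMMERCE) == 0);   // other entries untouched
	map.reset();
	assert(map.get(YIELD_FOOD) == 0);       // back to default
#endif
}
```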
That's all I need – more template parameters.
The idea is that you don't have to consider the templates. It should be like get() calls _get<T>(). The "outside world" won't know it's a template. Sure, it's an extra function call, but all it does is return the return value from another function call. It has a really good chance of being inlined, hence vanishing from a runtime perspective.
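Something along these lines, as a sketch. SketchMap and the LENGTH parameter are hypothetical; the point is only that the template helper is private and the caller sees an ordinary get().

```cpp
#include <cassert>

template <typename T, int SIZE>
class SketchMap
{
private:
	// Internal helper: LENGTH is a compile-time constant the optimizer
	// can exploit. Nothing outside the class can see or call this.
	template <int LENGTH>
	T _get(int iIndex) const
	{
		assert(iIndex >= 0 && iIndex < LENGTH);
		return m_aValues[iIndex];
	}

	T m_aValues[SIZE];

public:
	SketchMap()
	{
		for (int i = 0; i < SIZE; ++i)
		{
			m_aValues[i] = T();
		}
	}

	// What the outside world calls: no template arguments in sight.
	// It only forwards, so it is a prime candidate for inlining.
	T get(int iIndex) const { return _get<SIZE>(iIndex); }

	void set(int iIndex, T value) { m_aValues[iIndex] = value; }
};
```

Client code just writes map.get(1) and never has to spell out a template argument for the helper.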
For the time being I won't use templates to optimize further. Instead I will write a lot of test code. That way when I get around to template optimizations, the tests will instantly tell me if I mess up, be it a logical error or a copy paste error (again). It doesn't matter how fast the code is if it is buggy, hence the need to be bug-free comes first.
Having a getTotal function at the base class sounds useful in any case. That function would recompute the sum on every call – maybe using template meta-programming (not if I had to write it). And then the client code has to say whether to instantiate a wrapper class that stores the total.
I think I will write this in a simple (slow) way, then add a bunch of tests. Only after that should fancy fast code be considered.
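A sketch of what the simple version plus an opt-in wrapper could look like. IntMap and IntMapWithTotal are hypothetical names and not the actual EnumMap design.

```cpp
#include <cassert>

// Plain map: getTotal() recomputes the sum on every call (simple, slow)
template <int SIZE>
class IntMap
{
public:
	IntMap()
	{
		for (int i = 0; i < SIZE; ++i)
		{
			m_aiValues[i] = 0;
		}
	}

	int get(int iIndex) const { return m_aiValues[iIndex]; }
	void set(int iIndex, int iValue) { m_aiValues[iIndex] = iValue; }

	int getTotal() const
	{
		int iTotal = 0;
		for (int i = 0; i < SIZE; ++i)
		{
			iTotal += m_aiValues[i];
		}
		return iTotal;
	}

private:
	int m_aiValues[SIZE];
};

// Opt-in wrapper: client code picks this when it reads the total often.
// The running total is adjusted on every set(), so getTotal() is O(1).
template <int SIZE>
class IntMapWithTotal
{
public:
	IntMapWithTotal() : m_iTotal(0) {}

	int get(int iIndex) const { return m_map.get(iIndex); }

	void set(int iIndex, int iValue)
	{
		m_iTotal += iValue - m_map.get(iIndex);
		m_map.set(iIndex, iValue);
	}

	int getTotal() const { return m_iTotal; }

private:
	IntMap<SIZE> m_map;
	int m_iTotal;
};
```

The slow getTotal() doubles as a reference for the tests: after any sequence of set() calls, both classes should report the same total.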