New savegame format

I managed to profile using the Time Stamp Counter, something which is available as __rdtsc today, but not in MSVC 2003. It turns out that for short arrays (like 3 elements), getTotal uses the same number of cycles with and without the template approach. For a large array (174 elements, specifically), the loop appears to use around 25% more cycles (though the accuracy of that measurement is questionable because the numbers are low and close to the noise level). The template approach, however, apparently multiplies the cycles used by the number of elements in the array, indicating that it stops inlining once it passes a certain number of elements. This means the template approach takes more than 150 times as many cycles as the loop when adding up the elements, which is actually consistent with the expected cost of 174 function calls.

Conclusion: the template loop unroll approach dies. Maybe it worked when the compiler was new, but it doesn't pay off on modern CPUs. Also, adding 174 elements takes around 27 cycles. Evidently the CPU does manage to unroll the loop at runtime and maximize the gain from out-of-order execution. There is no need for us to do anything.

Also, I need to write a proper implementation of the profiling I used to do this. It can actually measure differences in function calls that are so fast that clock() would round the time to 0. A proper implementation would be a class with start and stop functions, hiding the fact that they have to call asm.
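Something along these lines is what I have in mind. This is only a rough sketch (the class and member names are made up, not the eventual Profile.cpp/h), and since MSVC 2003 lacks __rdtsc, the counter is read with inline asm:
PHP:
    // Rough sketch only; the real Profile.cpp/h may end up looking different.
    // MSVC 2003 has no __rdtsc intrinsic, so the counter is read via inline asm.
    class CycleTimer
    {
    public:
        CycleTimer() : m_iStart(0), m_iElapsed(0) {}

        void start()
        {
            m_iStart = readCounter();
        }

        void stop()
        {
            m_iElapsed = readCounter() - m_iStart;
        }

        unsigned __int64 getCycles() const
        {
            return m_iElapsed;
        }

    private:
        static unsigned __int64 readCounter()
        {
            unsigned __int64 iCycles;
            __asm
            {
                rdtsc                           // cycle count ends up in EDX:EAX
                mov dword ptr [iCycles], eax
                mov dword ptr [iCycles + 4], edx
            }
            return iCycles;
        }

        unsigned __int64 m_iStart;
        unsigned __int64 m_iElapsed;
    };
Wrapping the code in question in start()/stop() then gives cycle counts even for calls that clock() can't resolve.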
 
It turns out that for short arrays (like 3), getTotal use the same amount of cycles with and without the template approach. [...] Evidently the CPU does manage to unroll the loop at runtime and maximize the gain from out of order executions.
That's a definite answer – and good news I guess. Though you were right that the recursive metaprogram wasn't so bad in terms of readability.
Also in case you wonder what test code looks like for a mod, here is what I came up with. It might not look elegant, but it did catch a few minor issues, meaning it gets the job done just fine.
For most of the code I write, I find it infeasible to add automated tests (unfortunately). For example, I'm about to revise CvPlayerAI::AI_cityTradeVal. Some sophisticated testing framework probably wouldn't get used much. The EnumMap tests look so fast that no delay should even be noticeable (and one can always disable them once the code has matured).

Thanks for elaborating about the inline keyword. I see now that you've included a comment in the code explaining that largish functions are (explicitly) inlined to avoid multiple-definition errors. I must've glanced over that part; :o otherwise I wouldn't have brought it up at all. Since I did bring it up:

An inline keyword for class member functions seems to be dispensable. At least I can't manage to provoke a linker error if I remove the inline keywords from all of them. (For the record, I'm only using EnumMap in CvArea and CvPlot so far.) Apparently, the C++ standard says that "a function defined within a class definition is an inline function." hasContent isn't actually defined within a class definition, but the compiler seems to treat it as inline wrt. multiple definitions all the same.
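For reference, a minimal illustration of the rule I mean (Example is a made-up class, not EnumMap):
PHP:
    // Made-up example to illustrate the implicit-inline rule.
    class Example
    {
    public:
        int getValue() const { return m_iValue; } // defined in the class: implicitly inline
        int getDouble() const;                    // defined below, outside the class
    private:
        int m_iValue;
    };

    // Defined in the header but outside the class definition: here the standard
    // requires an explicit inline keyword to avoid multiple-definition errors
    // when the header is included from more than one cpp file.
    inline int Example::getDouble() const
    {
        return 2 * m_iValue;
    }
For a class template like EnumMap the linker is more forgiving anyway, I believe, since member functions of templates don't trigger multiple-definition errors even without the keyword, which would explain why removing it doesn't break anything.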

If I got that right, then we could avoid recommending inline expansion for large member functions. I suppose you're right however that there is no harm in such a recommendation. Even if I use the keyword, hasContent doesn't get inlined, i.e. the debugger still steps into the function when I compile with optimizations. (And the unreachable branches are indeed removed.) Hm ... I suppose with templates, it could be that a function gets inlined for some template arguments and not for others.
 
I see now that you've included a comment in the code explaining that largish functions are (explicitly) inlined to avoid multiple-definition errors. I must've glanced over that part;
I knew that even before I wrote the comment. The answer is at 8:43


For most of the code I write, I find it infeasible to add automated tests (unfortunately).
Same here. Most code is like "if plot is grassland and no hills and unit is..., then whatever". Most of the code is really hard to isolate for automated tests. However, EnumMap doesn't interact with the game logic other than reading the array lengths from the loaded XML data. That makes it easier to test.

An inline keyword for class member functions seems to be dispensable. At least I can't manage to provoke a linker error if I remove the inline keywords from all of them. (For the record, I'm only using EnumMap in CvArea and CvPlot so far.) Apparently, the C++ standard says that "a function defined within a class definition is an inline function." hasContent isn't actually defined within a class definition, but the compiler seems to treat it as inline wrt. multiple definitions all the same.
For most purposes it doesn't really matter if the function is inside the class or an inline function after the class. However, for template functions of template classes it actually does matter. You can't make a specialized function of a non-specialized class, meaning the specialized functions need to be in the class to work. At the same time, leaving a function unspecialized seems to be a problem unless the class isn't specialized either, meaning the non-specialized functions should go outside the class declaration.
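To make the distinction concrete, here is a stripped-down illustration (Wrapper is a made-up name, not the actual EnumMap code). It shows the out-of-class form that standard C++ allows; as I understand it, this is roughly the construct the 2003 compiler chokes on, which is why the specialized versions end up inside the class body instead:
PHP:
    // Made-up example, not the actual EnumMap code.
    template <typename T>
    struct Wrapper
    {
        // Non-specialized member function: declared here, defined below.
        bool isSpecial() const;
    };

    // Non-specialized definition outside the class declaration.
    template <typename T>
    bool Wrapper<T>::isSpecial() const
    {
        return false;
    }

    // Explicit specialization for one template argument, written outside the
    // class. Legal in standard C++, but reportedly the kind of thing the 2003
    // compiler struggles with.
    template <>
    inline bool Wrapper<int>::isSpecial() const
    {
        return true;
    }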

EnumMap is pushing the limit of what the 2003 compiler can do with templates.

I suppose with templates, it could be that a function gets inlined for some template arguments and not for others.
This is very likely. For instance, getLength() is likely inlined if it returns a compile-time constant. If you make it determine the length at runtime, the chance of inlining will likely not be as great, particularly not if you use GC.getNum*Infos(). The beauty of templates is that it works either way.
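Roughly what I mean, with made-up names and signatures (the real EnumMap code differs):
PHP:
    // Hypothetical sketches, not the actual EnumMap code.

    // Variant 1: length fixed at compile time. getLength() collapses to a
    // constant after inlining, so loops over the map optimize very well.
    template <typename T, int LENGTH>
    class FixedEnumMap
    {
    public:
        int getLength() const { return LENGTH; }
    };

    // Stand-in declaration for GC.getNumBuildingInfos(); the real function
    // lives in another translation unit of the DLL.
    int getNumBuildingInfos();

    // Variant 2: length only known at runtime. The external call can't be
    // folded away, so even if getLength() itself gets inlined, its callers
    // gain far less from it.
    template <typename T>
    class DynamicEnumMap
    {
    public:
        int getLength() const { return getNumBuildingInfos(); }
    };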

My current plan is to make some sort of standardized performance test and then clean up functions that can be written in more than one way. Particularly the template unrolling stuff needs to go away, but rather than fixing random spots, I want to be systematic about this. If EnumMap is to become the standard way to handle arrays, it had better be optimized as much as possible. Even minor performance differences can add up if called frequently enough.
 
I added Profile.cpp/h to the savegame branch and updated getTotal. Now I ran into a mystery.
PHP:
   if (!bINLINE && !isAllocated())
   {
       // no need to loop through unallocated memory
       return DEFAULT * getLength();
   }
   else
   {
       int iReturnVal = 0;
       const int iLength = getLength();
       for (IndexType eIndex = First(); eIndex < iLength; ++eIndex)
       {
           iReturnVal += get(eIndex);
       }
       return iReturnVal;
   }

When used with an array length of 171, if the array isn't allocated, it takes 23 CPU cycles to call this. If the array is allocated, it takes 178. Both numbers are reasonable. If I remove the if statement and make it always go into the else part, it will take 314-330 cycles. Why is the else branch running at half speed when the if statement is removed? :confused:

Obviously I committed the fastest version, but I'm wondering why reducing code complexity slows down the execution significantly. Is it because the compiler is stupid?

Also, I removed the template-unrolled loop because it's too slow. Again it comes down to writing multiple versions of code that do the same thing and picking the one that turns out to be the fastest based on profiling.
 
Why is the else branch running at half speed when the if statement is removed? :confused:
I'm getting similar numbers in tests with an EnumMap<BuildingTypes,int> (NUM_BUILDING_TYPES unknown at compile time): 345 cycles vs. 609. The asm looks OK; I'm attaching a screenshot of a WinMerge diff (with some background color improperly removed by means of a paint bucket). The stack memory allocation upfront differs a little, but the part after iReturnVal=0 is apparently equivalent with and without the isAllocated check, in particular the actual loop (in between the "###...###" comments). The samples also seem to get recorded correctly. Weird.

I've been looking (just in CvPlot and CvArea so far) for places to use getTotal. The cached sums probably should stay cached (CvArea::m_iNumUnits, m_iNumCities, m_iTotalPopulation). So I only found CvArea::getNumTotalBonuses – and it turns out that, in BtS, the result only gets compared with 0, so a CvArea::isAnyBonus function that calls EnumMap::hasContent is better.
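Something like this is what I have in mind (the EnumMap member name is made up):
PHP:
    // Sketch only; m_aiBonuses is a placeholder name for the per-bonus EnumMap.
    bool CvArea::isAnyBonus() const
    {
        // hasContent bails out at the first non-zero value instead of
        // summing everything the way getTotal/getNumTotalBonuses would.
        return m_aiBonuses.hasContent();
    }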

Something (unrelated) I learned about the inline keyword today: The Ob1 (Kenobi *scnr*) compiler option allows inline expansion only for functions with an inline keyword. I'm tentatively using that for debug builds now. In a test with the debugger attached, this reduced AI turn times by 30%.
I knew that even before I wrote the comment. The answer is at 8:43 [video]
I wish he had given that talk about indentation at Firaxis Games in 2004.
 

Attachments

  • getTotal_asm.jpg (733.9 KB)
RAR had a 4 MB save file limitation, which limited playing on huge maps. It would be great if this could be enlarged.
I'm not entirely sure it's a size limitation. Rather it seems like it could be a bytes/second limitation. If the savegame is too big, but we use a breakpoint to make the debugger force the game to take a break while loading, then it won't crash. This is really annoying because it means that if the game runs normally, it will crash, but if we want to look at what goes wrong, then it won't crash. Something odd is going on with the savegame code in the exe.

I'm getting similar numbers in tests [...] The asm looks OK [...] The samples also seem to get recorded correctly. Weird.
I think we are at the point where we should just declare it to be some sort of CPU runtime optimization or similar, which we do not know anything about. It's an important lesson though because it shows that we should test the performance of everything and not assume anything. This time the result is counter intuitive. Doing whatever is "the fastest solution" based on profiling will have to be good enough and then we ignore the "why is it the fastest solution".

the result only gets compared with 0 [...] EnumMap::hasContent is better.
Obviously because it bails out on the first non-zero value. It's also not the same result, because if you end up with both positive and negative numbers (possibly due to a bug), you can by chance end up with a total of 0 despite not having only zero values. hasContent is vastly superior when it comes to checking bool values because it checks one block (32 bit) per comparison, meaning the number of comparisons and iterations can be divided by 32 (rounded up) compared to checking each value individually. While the design goal is to hide the EnumMap internal workings from the outside code, that doesn't mean all ways of getting the same data out will be equally fast, even if they end up with the same result.
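For illustration, a stripped-down version of the kind of loop hasContent runs for a bool map (the real implementation and member names differ):
PHP:
    // Simplified sketch of the block-wise check for a bool EnumMap.
    bool hasContent() const
    {
        if (!isAllocated())
        {
            return false; // assuming the default value is false
        }
        // Each 32-bit block packs 32 bool values, so the number of
        // comparisons is the length divided by 32, rounded up.
        const int iNumBlocks = (getLength() + 31) / 32;
        for (int i = 0; i < iNumBlocks; ++i)
        {
            if (m_aiBlocks[i] != 0)
            {
                return true; // bail out on the first non-zero block
            }
        }
        return false;
    }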

(Kenobi *scnr*)
Well, we all know you felt an urge force to write that.
 