New savegame format

Nightinggale

I mentioned a new savegame format a few times. I have now written an overview of what it's all about for people who want to read more about it. https://github.com/We-the-People-civ4col-mod/Mod/wiki/Savegame-format

I will say again: if you are interested in joining the team and doing some programming, this is a great way to start. You get to see the different parts of the code, the task doesn't depend heavily on prior knowledge of the code, and there are tasks for people of all skill levels.

It's not that this won't get done without more people working on it, but the more people we have, the faster it will be. The 2.8 release will not happen until all the savegame code has been converted.

If you have any questions or comments, then this is a thread dedicated to the savegame format, so post away.
 
RAR had a 4 MB save file limitation, which limited playing on huge maps. It would be great if this could be enlarged.
 
Are you sure it's 4 MB? When I start to do math on saved data, it doesn't seem right. Either way there is a limitation at some point, which is in the exe and we can't do anything about it. All we can do is to save more data within that limitation. Current or planned features of the new savegame format:
  • Skip saving unused data (not saving a bonus in a plot is the same as saving NO_BONUS)
  • Use fewer bytes to store some specific data (don't use 4 bytes when 1 is enough)
  • Store partial arrays (lists). Useful if the list has 150 entries, but only one is different from 0
  • Compress (zip) the saved data
For instance, if we look at the newest revision right now, the change is to how a plot stores which teams have explored the plot. Before, it used 1 byte in the savegame if no team has explored the plot and 257 bytes if at least one team has explored it. The game also used 260 bytes of memory to store this data during the game. With the update, both the savegame and game memory use 8 bytes, and that doesn't change whether a team explores the plot or not. This will reduce the savegame size by around 1 kB for every 4 plots. That's quite a lot considering it's saving precisely the same data and without compression. Data size isn't just about what you store, but also how you store it.
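To illustrate the idea, here is a minimal sketch (invented names, not the actual WTP code) of storing the explored-by-team data as one bit per team in a single 64-bit integer. Saving it then means writing those 8 bytes, or nothing at all while the value is still 0:
Code:
// Sketch: one bit per team instead of one byte (or more) per team.
class PlotExploredSketch
{
public:
    PlotExploredSketch() : m_explored(0) {}

    bool isExploredBy(int eTeam) const
    {
        return ((m_explored >> eTeam) & 1) != 0;
    }

    void setExploredBy(int eTeam, bool bNewValue)
    {
        unsigned long long mask = (unsigned long long)1 << eTeam;
        if (bNewValue)
            m_explored |= mask;  // set the team's bit
        else
            m_explored &= ~mask; // clear the team's bit
    }

private:
    unsigned long long m_explored; // 8 bytes no matter how many teams have explored the plot
};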

Does this have other effects than just memory/file size usage? Yes, it also flattens the memory layout (no pointers), which is beneficial to memory I/O speed and it will help the CPU use its cache more efficiently, which also boosts memory I/O. It does however have a negative side effect for multithreaded execution, but that is very unlikely to affect us in this case even if we add multithreading in the future.

As for compression, the exe actually has some way of compressing data, which is apparently unused. However using a modern compression library like 7z will likely produce smaller files because odds are that compression has improved at some point during the last 15 years.

There is one other aspect to savegame size which is important to remember, and that's network games. When a player joins, the host pauses (freezes) the game, saves and then transmits the savegame. The game won't unfreeze until the transmission is complete. If the savegame is smaller, the transmission should be faster, hence a shorter freeze whenever somebody joins.

The primary goal is to make the savegames less prone to breaking (either due to programming bugs or xml changes), but small savegame file size is also a declared goal.
 
The work on the new savegame format is moving forward. The biggest change is that the conversion table has been fully automated. The savegame saves the types from xml, which it uses to figure out how to load a savegame if the xml has changed. However this approach has a problem: which files to include?
Now the game creates two savegames in parallel: one is the regular savegame and the other is the data for the conversion table. It adds files as data from those files is saved, meaning a savegame can no longer break because a programmer failed to include certain data. Also if a file isn't needed, then it's not included, which keeps the file size down. Once the savegames are fully made, both are written to the same file and the player won't notice the difference. The net effect is removing a likely source of savegame-breaking bugs.

There are plenty of ways to keep the savegame size down. The strings for the conversion table only save half of each string; UNIT_FARMER will only save FARMER because all entries in that file start with UNIT_. There is no need to save lots of identical prefixes. Not new, but I haven't mentioned this before.
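As a small illustration of the prefix trick (an invented helper, not the actual WTP function):
Code:
#include <string>

// "UNIT_FARMER" -> "FARMER": the shared prefix is implied by the xml file
// and doesn't need to be stored for every entry in the conversion table.
std::string stripPrefix(const std::string& szType, const std::string& szPrefix)
{
    if (szType.compare(0, szPrefix.size(), szPrefix) == 0)
        return szType.substr(szPrefix.size());
    return szType; // no matching prefix: store the full string
}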

The variable ID is now saved in one byte instead of two. If a class ends up with more than 256 variables (present and historic), then it will automatically switch to using two bytes to fit all IDs, and it can do that without losing the ability to load savegames from before it was upgraded to 2 bytes.

Indexes into xml files are now saved in one byte instead of four. This is something like which type of unit a certain unit is. If a file has more than 250 entries (that's a lot!), it will switch to two bytes (there are a few reserved IDs like NO_TERRAIN), again without breaking the ability to load old savegames.
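A rough sketch of that width selection (invented names; the exact handling of the reserved IDs is a guess):
Code:
#include <vector>

// Save an xml index in 1 byte when the file is short enough, otherwise 2 bytes.
// The +1 shift leaves room for a reserved value such as NO_TERRAIN (-1).
void writeXmlIndex(std::vector<unsigned char>& buffer, int iIndex, int iNumTypes)
{
    if (iNumTypes <= 250)
    {
        buffer.push_back((unsigned char)(iIndex + 1));     // 1 byte
    }
    else
    {
        unsigned short iStored = (unsigned short)(iIndex + 1);
        buffer.push_back((unsigned char)(iStored & 0xFF)); // 2 bytes, little endian
        buffer.push_back((unsigned char)(iStored >> 8));
    }
}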

Saving which road and improvement the teams can see on a plot (possibly outdated in fog of war) has gone from 198 bytes to 9 bytes + 4 bytes for each team that can see something which is out of date. This change also changes how this info is stored in memory, which makes it a bit faster at runtime (yes, a faster game, but likely too minor for players to notice).

River crossing data for a plot is down from 33 to 1 byte. Same with memory usage and a slight improvement to memory I/O.

In addition to this, default data still uses 0 bytes, meaning if a plot has no route and no improvement, and nobody assumes that it has, then it's not even using 9 bytes; it's 0.

There is also a major change to how the game saves a list of numbers where there is one number for each team/player. Take for instance culture on a plot. The full list has gone from 197 to 198 bytes. What we gain from adding that byte is the ability to skip players with no culture, so the more realistic number is 5 bytes for each player with culture on the plot. This means if a European and a native fight for control of a plot, the savegame has gone from 197 to 10 bytes to save that situation.
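A sketch of how such a partial list can be written (invented code; the actual savegame engine handles this generically). Only non-zero entries are stored as (player, value) pairs plus an end marker, which is roughly where the 5 bytes per player with culture come from:
Code:
#include <vector>

void writePlotCulture(std::vector<unsigned char>& buffer, const int* aiCulture, int iNumPlayers)
{
    for (int iPlayer = 0; iPlayer < iNumPlayers; ++iPlayer)
    {
        if (aiCulture[iPlayer] == 0)
            continue;                                 // default value costs 0 bytes
        buffer.push_back((unsigned char)iPlayer);     // 1 byte: player index
        for (int i = 0; i < 4; ++i)                   // 4 bytes: the culture value
            buffer.push_back((unsigned char)(aiCulture[iPlayer] >> (8 * i)));
    }
    buffer.push_back(0xFF);                           // end-of-list marker
}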

The beauty of this is that the vast majority of those reductions are done in the savegame engine itself and not in the save code, meaning from a programming point of view this happens automatically. Once it was added, it also applied to already written save/load code without updating that code.

It might be odd to talk about a file of possibly multiple MB and then reduce the size by a few bytes, but that's how the overall size gets reduced. The game doesn't have the "1 MB section". Rather it's a whole lot of chunks of less than 1 kB, and to get an idea of the overall impact you have to look at the changes in percentages, not in absolute values.

On top of all the size reduction changes, I added a whole lot of debug information to the savegames. Some of it is text to allow humans to read what the data is supposed to be, and some of it is data which the game uses to automate bug detection. This means the savegames are currently bigger than they used to be, but this will be turned off before release. For the time being I prefer to keep the automated testing even though it has yet to actually catch a bug.

TL;DR: the progress for reducing the savegame size is going great.
 
I've just yesterday merged EnumMap and EnumMap2D into AdvCiv (Git commit). I haven't adapted the actual SavegameReader/Writer; that'll be a lot more work, I expect. Also, I'm using EnumMaps only for CvArea so far. That's a class that I knew would benefit a lot from just-in-time initialization as water areas, such as 1-tile lakes, don't need any of the BtS array members. In AdvCiv, this is especially bad because sea ice can separate water areas. I've done a benchmark test before and after, and observed a speedup of more than 4%.

I haven't scrutinized the implementation of the EnumMap classes – for a combination of lack of time and being a bit out of my depth.
 
Wow! Can this work be adapted to BtS mods too in the future?
Yes. There is nothing in it which is Colonization-only. Remember Colonization is a modded BTS, and the parts this relies on (savegame stream, CvBaseInfo etc.) are all unchanged, meaning to a certain degree it's plug-n-play. The problem is that adding it does nothing by itself. The existing savegame code needs to be rewritten to go from the vanilla stream into the CvSavegame buffer for this to work, meaning effectively all read and write functions need to be rewritten from scratch to get the full effect. That's not a Colo vs BTS issue; it would be an issue even if you move from one BTS mod to another BTS mod.

Also there are some adjustments to be made because it needs to get a list of xml files explicitly and BTS doesn't have professions etc. However this is a trivial issue, and not just by comparison; it really is just adding/removing lines in a few lists where there is one file on each line.

I've just yesterday merged EnumMap and EnumMap2D into AdvCiv (Git commit). [...] I've done a benchmark test before and after, and observed a speedup of more than 4%.
Interesting. While it's primarily written to avoid bugs and reduce memory usage, it does speed up execution in one specific case. In vanilla, get(int) will read the pointer from memory, then look up the location the pointer points to plus the index, and then return whatever is in memory there. EnumMap does the same, except if the array contains nothing but 0 it will not allocate the array at all. This saves memory, but it also means the get function can return 0 immediately, because not allocated means everything is default. At runtime the pointer in the area object is still read, but if it's NULL no further memory access happens. That's one memory lookup instead of two, and the one which is skipped is the one most likely to cause a cache miss.
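In code form, the difference is roughly this (a simplified sketch, not the real EnumMap):
Code:
#include <cstddef>

template <typename T>
class LazyArraySketch
{
public:
    explicit LazyArraySketch(int iLength) : m_pArray(NULL), m_iLength(iLength) {}
    ~LazyArraySketch() { delete[] m_pArray; }

    T get(int iIndex) const
    {
        // Only one memory read (the pointer) when nothing is allocated.
        // The second read, the one most likely to miss the cache, is skipped.
        return m_pArray == NULL ? (T)0 : m_pArray[iIndex];
    }

    void set(int iIndex, T value)
    {
        if (m_pArray == NULL)
        {
            if (value == (T)0)
                return;                    // writing the default: still nothing to allocate
            m_pArray = new T[m_iLength](); // allocate zero-initialized on the first real write
        }
        m_pArray[iIndex] = value;
    }

private:
    T* m_pArray;
    int m_iLength;
};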

In short, allocating less memory results in less memory reading delays. 4% from this alone sounds good.

Also EnumMap is not complete. I realized today I can make it an array of bools, which then internally becomes an array of unsigned ints, meaning it will use 1 bit instead of 4 bytes for each element. This means work will continue.

I haven't scrutinized the implementation of the EnumMap classes – for a combination of lack of time and being a bit out of my depth.
A tip for reading it: you need to understand templates. You also need to keep track of what is known at compile time and hence allows compiler optimization. There are some switch cases where the variable being switched on is a template int parameter, meaning the compiler knows which case it will pick every single time. This allows removing the cases which will never be reached and the check for the case which is always true. In other words if the compiler does what it's supposed to do, the code will be a whole lot faster than it appears at first glance.
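A tiny example of the pattern (invented, just to show the shape of it):
Code:
// SIZE is a template parameter, so the switch is resolved at compile time:
// readValue<1>(p) compiles down to a single byte load with no branch at all.
template <int SIZE>
int readValue(const unsigned char* pBuffer)
{
    switch (SIZE)
    {
    case 1:  return pBuffer[0];
    case 2:  return pBuffer[0] | (pBuffer[1] << 8);
    default: return pBuffer[0] | (pBuffer[1] << 8) | (pBuffer[2] << 16) | (pBuffer[3] << 24);
    }
}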
 
Yes. There is nothing in it which is Colonization-only. Remember Colonization is a modded BTS, and the parts this relies on (savegame stream, CvBaseInfo etc.) are all unchanged, meaning to a certain degree it's plug-n-play. The problem is that adding it does nothing by itself. The existing savegame code needs to be rewritten to go from the vanilla stream into the CvSavegame buffer for this to work, meaning effectively all read and write functions need to be rewritten from scratch to get the full effect. That's not a Colo vs BTS issue; it would be an issue even if you move from one BTS mod to another BTS mod.
Okay, that's great! :)
I still wonder if it was more or less efficient in AND2, which has 2 DLLs, one with a 50 civ limit and the other with a 100 civ limit.
I ask because (1) Col has drastically fewer civs and (2) I'm not a programmer, so I don't understand everything you write but still like to read it :D
 
one with a 50 civ limit and the other with a 100 civ limit. [...] Col has drastically fewer civs
49 is drastically fewer than 50. Got it :p

The savegame format isn't about numbers such as MAX_PLAYERS because it's about how to save each one. The code doesn't really care how many there are. It's no different from units, and have you ever had issues saving because you had too many units?

I'm not a programmer, so I don't understand everything you write but still like to read it
This is what makes it hard to write something like this. It should be readable by non-programmers while still on a level where programmers gain the intended information as well.
 
49 is drastically fewer than 50. Got it
Errr... sorry. For some reason I thought Col was limited to 16 civs... which really makes no sense on my part :hammer2:
This is what makes it hard to write something like this. It should be readable by non-programmers while still on a level where programmers gain the intended information as well.
Well, I understand what I need to: You are doing a marvelous job :clap:
 
For some reason I thought Col was limited to 16 civs
You are thinking of European players. However Colonization doesn't have a concept of a minor civ, meaning every single civ is a full-sized player. Each European player has a parent (aka the king), there are a bunch of natives and then barbarians and the pope. They all add up even though the majority aren't actually playable.
 
[...] That's one memory lookup instead of two, and the one which is skipped is the one most likely to cause a cache miss.
I was expecting it to run faster simply because the water area objects take up much less space in the cache than before. But I see that it's not quite as simple – the size of the objects really remains the same.

To be on the safe side, I've repeated my test with a different seed, this time paying closer attention to the final game state being the same for CvArea with and without EnumMaps. Again, the EnumMap version was close to 5% faster (611 seconds for 250 turns on AI Auto Play versus 641.5 seconds; the earlier result was 627s vs. 656s). That's still only one test game on one platform of course (no benchmark suite).
Also EnumMap is not complete. I realized today I can make it an array of bools, which then internally becomes an array of unsigned ints, meaning it will use 1 bit instead of 4 bytes for each element. This means work will continue.
Doing that for more than 32 bools sounds challenging. Well, I look forward to seeing your solution.
[...] In other words if the compiler does what it's supposed to do, the code will be a whole lot faster than it appears at first glance.
Thanks for the pointers. Sure – you wouldn't want to branch unnecessarily in the get/set/add functions. When an array contains non-default values most of the time, the ternary operators could, on the bottom line, hurt performance, but it seems that the vast majority of BtS array members will benefit from lazy initialization (or getting collapsed by hasContent). I've done a little test with CvTeam::m_pabHasTech and m_abOpenBorders, two arrays that should usually contain non-zero entries. Even in this case, EnumMaps seem to improve performance a tiny bit. Apparently, those arrays are sometimes accessed for dead teams (and there aren't always open-borders agreements).
 
I added bool support to EnumMap. Now if the template is a bool type, the array will store data in unsigned ints. This means it will be allocated in blocks of 32 bools. A length of 35 will require 2 blocks because obviously it has to be rounded up.
This got me thinking. A bunch of xml files are shorter than 32 entries, meaning such an array is a single block. I added a block to the union, meaning if the length is at most 32 at compile time, no extra memory will be allocated. Instead the array is always allocated and it lives in the pointer's memory, meaning performance is like it's always unallocated because it doesn't have to look up memory elsewhere.

I took this one step further and added an enum value for how many bits the "array in the pointer" can hold, and I have set it to 64. Now if 32 < length <= 64, the EnumMap instance will take up 8 bytes instead of 4. However it is then always allocated, with the memory I/O boost as if it's unallocated. Making the inline array too long would bloat the class it's inside, but I guess using the size of 2 ints instead of 1 is acceptable in exchange for the performance boost. Also it will still be only 4 bytes in all other cases. Now the question is when precisely will this code trigger... Well, how about EnumMap<TeamTypes, bool> :)

It will only actually use the pointer memory for the array if it can verify at compile time that it can do so. If you do it with say RouteTypes, it won't do it unless you hardcode the xml, even if there are only 3 routes. The reason for this is that not checking at runtime makes the code faster, and it's optimized for WTP's ability to hardcode all xml files.
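A rough sketch of the "array in the pointer" layout (invented names, not the actual EnumMap internals):
Code:
class BoolArraySketch
{
    enum { INLINE_BITS = 64 }; // bits that fit without a heap allocation

    union
    {
        unsigned int* m_pBlocks;                    // used when the length exceeds INLINE_BITS
        unsigned int  m_aiInline[INLINE_BITS / 32]; // used when the length is known at compile time to fit
    };

public:
    BoolArraySketch() { m_aiInline[0] = 0; m_aiInline[1] = 0; }

    // The inline case: no pointer to follow, the data is right inside the object.
    bool getInline(int iIndex) const
    {
        return ((m_aiInline[iIndex / 32] >> (iIndex % 32)) & 1) != 0;
    }

    void setInline(int iIndex, bool bValue)
    {
        unsigned int mask = 1u << (iIndex % 32);
        if (bValue)
            m_aiInline[iIndex / 32] |= mask;
        else
            m_aiInline[iIndex / 32] &= ~mask;
    }
};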

Not all functions work on bools. If you call say addAll(5), then it simply won't compile because what you are trying to do doesn't make sense. However due to the nature of templates those functions are still declared in the class. There isn't really anything I can do about that.
 
[...] meaning performance is like it's always unallocated because it doesn't have to look up memory elsewhere.
This has made me curious about the generated assembly. I'm not used to inspecting that, so I've only looked at one very simple example, namely this little loop in CvCityAI::AI_isGoodPlot (a K-Mod function). Bear in mind that BtS only has three yield types.

FOR_EACH_ENUM(Yield)
aiYields[eLoopYield] = pPlot->getYield(eLoopYield);

Assembly commented with my interpretations:
(with the O2, G7 and arch:SSE2 compiler options)
Code:
xor   eax,eax                       # eLoopYield=0
movsx ecx,byte ptr [edi+eax+78h]    # ecx = pPlot->m_aiYield[eLoopYield]
mov   dword ptr [esp+eax*4+10h],ecx # aiYields[eLoopYield] = ecx
add   eax,1                         # increment eLoopYield
cmp   eax,3                         # check loop termination
jne   CvCityAI::AI_isGoodPlot+73h   # begin next iteration
So the loop doesn't get unrolled, but the compiler is aware that it's exactly 3 iterations. I've given CvPlot::getYield an inline definition, everything in EnumMap is also inline, and the compiler has followed suit and eliminated those function calls. The yield values are indeed stored each as a single byte (no multiplier for eax in edi+eax+78h) inside the EnumMap<YieldTypes,char> object (constant offset from edi).
Suggestion (unrelated to the above): Having EnumMap (optionally) keep track of a total value would be handy in some cases. For example, CvArea has members m_iNumUnits, m_iNumCities and m_iTotalPopulation that tally m_aiUnitsPerPlayer, m_aiCitiesPerPlayer, m_aiPopulationPerPlayer, and I've added m_iTotalCulture to CvPlot to speed up the culture percentage computations. That would be another operation that doesn't make sense for booleans. Well, EnumMap<IndexType,bool>::getTotal could, in theory, count the true values.

Is this code fragment from hasContent correct? (I guess GitHub would be a better place for such a question; I'm piggy-backing it here for my own convenience.)
Code:
if (bINLINE_BOOL)
{
   for (int i = 0; i < NUM_BOOL_BLOCKS; ++i)
   {
       if (m_pArrayBool[i ...
I had to change the union access to m_InlineBoolArray to get it to work. It also surprises me that this rather large function has an inline keyword.
 
Is this code fragment from hasContent correct? (I guess GitHub would be a better place for such a question; I'm piggy-backing it here for my own convenience.)
Code:
if (bINLINE_BOOL)
{
   for (int i = 0; i < NUM_BOOL_BLOCKS; ++i)
   {
       if (m_pArrayBool[i ...
I had to change the union access to m_InlineBoolArray to get it to work. It also surprises me that this rather large function has an inline keyword.
This is already fixed in my fork, along with some other things.
https://github.com/clonkia/Mod/commit/c15cd8f43dded8b40f26cf92598a1c5ae41ce542
 
So the loop doesn't get unrolled, but the compiler is aware that it's exactly 3 iterations.
Stupid compiler. It's apparently unaware that it would be beneficial to unroll loops. We should complain to Microsoft :lol:

We should upgrade the compiler to something which isn't stupid. The problem is that the exe and dll need to use the same version of the memory allocator or we will end up with weird bugs. I'm still pondering how to get the 2017 compiler to use the 2003 memory allocator.

Having EnumMap (optionally) keep track of a total value would be handy in some cases.
When I googled MSVS2003 loop unroll, I came across something unrelated, but interesting for this. getTotal could be calculated in an unrolled approach regardless of what the compiler thinks about it.

It could also be possible to have a wrapper class, as in it inherits EnumMap but also stores an int. It then has set functions which update the int whenever something is changed. Getting the total is then as easy as reading the int.
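As a sketch of that wrapper idea (assuming set/get signatures roughly like the real EnumMap; the details may differ):
Code:
template <typename IndexType, typename T>
class EnumMapWithTotal : public EnumMap<IndexType, T>
{
public:
    EnumMapWithTotal() : m_iTotal(0) {}

    void set(IndexType eIndex, T value)
    {
        m_iTotal += value - this->get(eIndex);  // keep the cached sum in sync
        EnumMap<IndexType, T>::set(eIndex, value);
    }

    int getTotal() const { return m_iTotal; }   // no loop, just read the int

private:
    int m_iTotal;
};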

Which approach is best depends on how often you need the total amount.

Is this code fragment from hasContent correct?
Copy paste error. I had it working at some point, but apparently broke it without noticing. I have been thinking about writing automated checks to prevent the need for manually testing everything for every single change. Had I done that, then I would have caught it when I broke this.
 
This is already fixed in my fork, along with some other things.
Thanks for the confirmation. Good to know that Nightinggale doesn't have to convert all the WtP code by himself; a ton of work even for two people.
We should upgrade the compiler to something which isn't stupid. The problem is that the exe and dll need to use the same version of the memory allocator or we will end up with weird bugs. I'm still pondering how to get the 2017 compiler to use the 2003 memory allocator.
alberts2 has mentioned moving parts of the GameCore code into a new DLL. This post of his (and the rest of that page) is pretty much all I've read about that plan.

Or we could just wait for MS to release the source code of the 2003 compiler someday. :P
When I googled MSVS2003 loop unroll, I came across something unrelated, but interesting for this. getTotal could be calculated in an unrolled approach regardless of what the compiler thinks about it.
That's all I need – more template parameters.
It could also be possible to have a wrapper class [...]. Which approach is best depends on how often you need the total amount.
Having a getTotal function at the base class sounds useful in any case. That function would recompute the sum on every call – maybe using template meta-programming (not if I had to write it). And then the client code has to say whether to instantiate a wrapper class that stores the total.
I have been thinking about writing automated checks to prevent the need for manually testing everything for every single change.
A unit test for EnumMap shouldn't require anything from the Civ 4 EXE, so one could create a new EXE for running tests. Then again, test code for other classes might require the Civ 4 EXE to be loaded, so perhaps it's better to run all tests after loading Civ 4 and the mod and maybe even a special savegame.
 
So the loop doesn't get unrolled, but the compiler is aware that it's exactly 3 iterations. I've given CvPlot::getYield an inline definition, everything in EnumMap is also inline, and the compiler has followed suit and eliminated those function calls.
I have been thinking a bit about this and it's actually not as bad as it seemed at first. The inlining works. The loop stop condition is a simple comparison against a fixed number rather than the return value of a function call. This means it will be faster even if the branch prediction fails. However the stop condition is fairly simple, which makes me wonder: will the branch prediction in the CPU be able to figure out it will execute 3 times? Maybe. At least the code will give the CPU a much better chance at being clever here compared to what it can do with vanilla code.

It also suprises me that this rather large function has an inline keyword.
That's because the inline keyword doesn't do what most people assume it does. It's a recommendation, not an order. This means the compiler will consider inlining the function, but it is allowed not to. Also (particularly true with modern compilers), the compiler is free to inline even if the keyword is missing. This means the value of the inline keyword for performance isn't what you would expect at first glance.

If the function isn't inlined, then it will be compiled into every object file. Without the inline keyword, the linker would then detect multiple copies of the same function and fail to link. However if the functions are declared inline, the linker will detect that they are identical and merge them into a single instance, hence acting like a single instance from a cpp file.

Adding template functions in a cpp file means the cpp file has to be aware of which template instantiations to compile. In other words the programmer needs to manually maintain a list of the possible template arguments for EnumMap if the functions are in a cpp file.
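For reference, that manual list would look something like this at the bottom of the cpp file (a made-up example):
Code:
// Explicit instantiations: one line for every combination used anywhere in the mod,
// kept in sync by hand whenever a new combination is introduced.
template class EnumMap<YieldTypes, char>;
template class EnumMap<TeamTypes, bool>;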

Placing all the functions in the header will avoid the need for manual handling of which types to use with EnumMap and declaring them inline will solve the linker issue. Maybe they will be inlined, maybe not. That's up to the compiler.

There is also another issue to consider when looking at inline. A lot of the big functions have if-else structures (if not downright switch case) where the branch condition is set at compile time. This means a lot of the code is compiled and then removed by the optimization. In other words it's possible that less than 10% of the lines in a function will actually be used.

In this specific case adding a template function seems like a good idea if we want to optimize further. In fact for the hardcoded length arrays using templates to unroll the loop appears to be an option.

A unit test for EnumMap shouldn't require anything from the Civ 4 EXE, so one could create a new EXE for running tests. Then again, test code for other classes might require the Civ 4 EXE to be loaded, so perhaps it's better to run all tests after loading Civ 4 and the mod and maybe even a special savegame.
I was thinking of something simpler, like calling a function after loading the xml data. This function can then declare an EnumMap, then assert(get), set, assert(get), reset, assert(get) etc. Add some code to make it compile only in debug mode to skip it in releases, and EnumMap will be automatically tested every time somebody starts a debug build. If the tests can all be completed within a second, then it doesn't matter that they run this frequently. In fact it would be a bonus.
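Something along these lines (a sketch; FAssert is the existing assert macro, the rest is invented):
Code:
#ifdef _DEBUG
void testEnumMap()
{
    EnumMap<TeamTypes, bool> kTest;
    FAssert(!kTest.get((TeamTypes)0)); // default value after construction
    kTest.set((TeamTypes)0, true);
    FAssert(kTest.get((TeamTypes)0));  // the write sticks
    kTest.reset();
    FAssert(!kTest.get((TeamTypes)0)); // reset restores the default
}
#endif // only compiled (and run) in debug builds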

That's all I need – more template parameters.
The idea is that you don't have to consider the templates. It should be like get() calls _get<T>(). The "outside world" won't know it's a template. Sure, it's an extra function call, but all it does is return the return value of another function call. It has a really good chance of being inlined, hence vanishing from a runtime perspective.

For the time being I won't use templates to optimize further. Instead I will write a lot of test code. That way, when I get around to template optimizations, the tests will instantly tell me if I mess up, be it a logical error or a copy paste error (again). It doesn't matter how fast the code is if it is buggy, hence the need to be bug-free comes first.

Having a getTotal function at the base class sounds useful in any case. That function would recompute the sum on every call – maybe using template meta-programming (not if I had to write it). And then the client code has to say whether to instantiate a wrapper class that stores the total.
I think I will write this in a simple (slow) way, then add a bunch of tests. Only after that should fancy fast code be considered.
 
I added getTotal: source (note: new branch)
It's using the template approach to unroll loops. Not only does this skip the loop overhead, it also lays out the code in a way which gives the CPU a better chance of doing some decent out-of-order execution, meaning instead of adding the numbers strictly one by one, it can reorder them to get some sort of single-threaded parallel execution. Since CPUs have issues increasing clock speed, each generation boosts performance by allowing more and more instructions to be executed in parallel, meaning CPUs today benefit much more from this unroll approach than the ones in use back when the compiler/Civ4 was released.
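The unrolling itself can be done with plain template recursion; a minimal sketch of the pattern (not the actual WTP implementation):
Code:
// Sums pArray[FIRST..LAST-1] with the additions expanded at compile time.
template <int FIRST, int LAST>
struct UnrolledSum
{
    static int get(const int* pArray)
    {
        return pArray[FIRST] + UnrolledSum<FIRST + 1, LAST>::get(pArray);
    }
};

template <int LAST>
struct UnrolledSum<LAST, LAST> // recursion ends when FIRST reaches LAST
{
    static int get(const int*) { return 0; }
};

// usage: int iTotal = UnrolledSum<0, NUM_YIELD_TYPES>::get(aiYields);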

If the length isn't known at compile time, then the code will use a runtime loop. It might seem obvious, but the nice part about this is that EnumMap handles this internally. All you have to do is to call getTotal() and the code/compiler will figure out how to compile the code matching your specific case.

I wonder if setAll and addAll should get the same treatment. Particularly setAll might benefit overall because it's used by the constructor, the memory allocator and reset. In other words it's used frequently even if not called directly. I wonder if there is a good way to measure the performance of this, other than profiling, because profiling isn't the greatest approach for something like this. While good overall, it has limitations when it comes to function call overhead and other minor slowdowns, which can combine into major slowdowns when hit enough times. It also has issues with multithreaded applications, but that's not a problem here.

Also in case you wonder what test code looks like for a mod, here is what I came up with. It might not look elegant, but it did catch a few minor issues, meaning it gets the job done just fine.
 