Multithreading & 64bit memory access to increase Civ4's speed in large games?

Blake00

Hey guys,

Apologies if this is a dumb question and I've missed something obvious in my searches, but the topic came up in conversation elsewhere. I remember people using the 64bit 4GB patch tool on Civ4 (EDIT: Steam & GoG copies come pre-patched for 4GB & don't need this) and getting increased performance (& possibly less memory allocation crashes). However, I'm not sure what the current state is of the old issue of slow AI turn times in large games of Civ4 due to the lack of multithreading support?

I remember speaking to @raystuttgart a few years ago about the amazing multithreading mod developed by gurus @devolution & @Nightinggale for the awesome Civ4Col mod We The People. Since it's the Civ4 engine, it could be possible for someone (super talented) to 'back port' it to good old regular Civ4 BTS as well. I had a look to see if there's been any news on that front here on the forums but had no luck. I remember Ray saying the work happens offsite, so I'm wondering: did anyone end up taking on the challenge of bringing all that amazing work over to regular Civ4 for the fanbase to enjoy?
 
... to 'back port' it to good old regular Civ4 BTS as well.
There is no easy way to "backport" what we did in WTP to Civ4BTS. At most it may be used as a "blueprint".
It would still always need to be a game / mod specific implementation of its own and very little could be copied and pasted.

1) Some logic that uses Multi-Threading in WTP like e.g. Professions (to produce Goods in Production Chains in City) does not even exist like that in Civ4BTS.
(In WTP there are over 100 different Types of Yields that could be produced, while Civ4BTS basically just knows a few like e.g. Food, Hammers, Gold, Science and Culture.)

2) Other logic like e.g. Pathfinding might more easily be adapted but also works quite differently in WTP, because we have more complex movement rules.
(There are many more Terrains, Yields and much more complicated rules for Units to be allowed to move e.g "Large Rivers" and "New Movement System".)

--------------

Other comments:

It is necessary to understand that not everything can be changed to use multi-threading, only very selected parts of the logic.
Honestly, it is a difficult discussion where to invest the effort to implement multi-threading and where it is not worth it, or where it is even better not to do it.

The logic that is really heavy on performance in Civ4Col is not necessarily the same as the logic that is really heavy on performance in Civ4BTS.
Some stuff can be parallelized, but doing it wrongly may even slow things down. It really needs detailed analysis of the code and performance measurements.

Basically you need to work quite iteratively:

1) analyze the code to find logic that may be worth multi-threading
2) implement multi-threading for some section of the code and make it stable
3) run performance tests (using e.g. autoplay) to see if it works
4) change / optimize parts of it and check the difference
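
As a rough illustration of steps 3) and 4), here is a minimal timing sketch in Python. It is not taken from WTP: the two workload functions are hypothetical stand-ins, and in real use each would be replaced by something like a fixed autoplay run of the baseline and of the changed build. Repeating the run and taking the median is only one simple way to reduce noise.

```python
import statistics
import time


def median_runtime(func, repeats=5):
    """Run `func` several times and return the median wall-clock time in seconds."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def compare(baseline, candidate, repeats=5):
    """Return (baseline_seconds, candidate_seconds, ratio).

    A ratio above 1 means the candidate was faster in this measurement.
    """
    base = median_runtime(baseline, repeats)
    cand = median_runtime(candidate, repeats)
    ratio = base / cand if cand > 0 else float("inf")
    return base, cand, ratio


# Hypothetical stand-in workloads; both compute the same result.
def slow_sum():
    total = 0
    for i in range(100_000):
        total += i
    return total


def fast_sum():
    return sum(range(100_000))


if __name__ == "__main__":
    base, cand, ratio = compare(slow_sum, fast_sum)
    print(f"baseline {base:.6f}s, candidate {cand:.6f}s, ratio {ratio:.2f}")
```

The point of measuring both versions under the same conditions is that a change which looks faster on paper can still come out slower once overhead is included.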

--------------

Somebody would have to do a completely new implementation for Civ4BTS, as they might optimize completely different logic in a different way.
But they could of course also use and integrate TBB, and thus take a look into our code to learn how Multi-Threading with TBB can be done.

In other words:
Civ4BTS might use the same framework / technology but would need a completely different implementation.

----------------

Comments on effort:

100 hours for a team of skilled modders (including implementation and test) to get a notable difference.
Getting a similar improvement with e.g. optimization of caching and static coding may take less time and less skill.

--------------

Other comments:

1) We have been doing performance optimizations on WTP for years and using Multi-Threading is only one method we used.
There may actually be many other ways of improving performance that may be more efficient for your mod and worth trying first (e.g. optimization of caching).

2) Even if Multi-Threading is used, there is no way that this game will ever use 64Bit, as the engine and exe will simply never allow it.
So there is a technical limitation to performance improvement in this game. I estimate that at most 50% less turn time in endgame is possible.

So yeah, it is possible to get endgame maybe twice as fast. (In case you have a fully optimized code.)
But to do so you really need to combine several different coding techniques and not just Multi-Threading alone.

---------------

So yeah, Multi-Threading is worth it. It can definitely improve performance in endgame notably.
Once other easier optimizations have been done I would go for it if you have time, motivation and skills.

However it is not a magical solution for all performance issues as some people may hope.
Other performance improvements are usually less effort and less risk and thus usually tried first.
 
I remember people using the 64bit 4GB patch tool on Civ4 and getting increased performance (& possibly less memory allocation crashes)
That's the first I've seen of this being related to performance. I thought it was purely to avoid crashes. Maybe it limits memory fragmentation, but I'm not sure about that. Steam and GOG come pre-patched for 4 GB.

2) Other logic like e.g. Pathfinding might more easily be adapted but also works quite differently in WTP, because we have more complex movement rules.
(There are many more Terrains, Yields and much more complicated rules for Units to be allowed to move e.g "Large Rivers" and "New Movement System".)
If we ignore WTP/Colonization specific code, pathfinding is the only multithreaded part of WTP that would have a chance of being possible to copy to BTS. It's a modified version of K-mod pathfinding and it exists in a standalone file. I will not rule out that it can be copied as a file and then it might work.

However as already mentioned the performance boost might not be copied. Multithreading something where each thread has too little to do might backfire due to threading overhead.

Oldschool singlethreaded optimization might be a better candidate to spend time on. There is one statement regarding that with pathfinding here:
 
Thanks for explaining the situation in detail here for myself and everyone else guys. Just what I was hoping for, so much appreciated! :)

So yeah, Multi-Threading is worth it. It can definitely improve performance in endgame notably.
Once other easier optimizations have been done I would go for it if you have time, motivation and skills.

However it is not a magical solution for all performance issues as some people may hope.
Other performance improvements are usually less effort and less risk and thus usually tried first.
So in the end not much of it is transferable to regular unmodded Civ4, but possibly some parts of Civ4 could be optimised by someone starting over and using your code/blueprints as an example. However, it'll be a hell of a lot of work haha!

That's the first I've seen of this being related to performance. I thought it was purely to avoid crashes. Maybe it limits memory fragmentation, but I'm not sure about that. Steam and GOG come pre-patched for 4 GB.
Hmm yes, perhaps performance wasn't the right word, as some people might take that to mean gameplay, whereas it seemed to be improved loading times that people were praising in that thread. I didn't realise Steam & GoG copies were pre-patched for 4GB, so I guess that solves this one anyway, as most players presumably made the switch due to that Windows update annoyingly breaking the disc version lol.

If we ignore WTP/Colonization specific code, pathfinding is the only multithreaded part of WTP that would have a chance of being possible to copy to BTS. It's a modified version of K-mod pathfinding and it exists in a standalone file. I will not rule out that it can be copied as a file and then it might work.

However as already mentioned the performance boost might not be copied. Multithreading something where each thread has too little to do might backfire due to threading overhead.

Oldschool singlethreaded optimization might be a better candidate to spend time on. There is one statement regarding that with pathfinding here:
Interesting.. so there's a few possibilities/directions someone could pursue.
 
Interesting.. so there's a few possibilities/directions someone could pursue.
Of course there are possibilities to improve performance. Lots of things could further be pursued.
But the solutions will also be very mod specific, due to the nature of Civ4BTS / Civ4Col modding.

-----------

The topic "Multi-Threading" is not new in Civ4BTS / Civ4Col modding. It is not about knowledge.
The top programmers have discussed it for years already and literature for programming is widely spread.

There is one simple reason why it has been done only once so far:
Lack of skilled active modders with enough time and motivation. :dunno:

-----------

To do stuff like that you more or less need:

1) Programming skills on a (semi-)professional level
2) A team with enough time and motivation

Both of these things are rare these days, as the active modding community is in decline ...
But if you have both, almost anything is possible in Civ4BTS / Civ4Col modding.

------------

Once a team of skilled modders joins forces again to work on a joint big project we may see great things happen ...
Until then, let us be happy that we at least get a few fixes, improvements and minor updates for the few mods that are still alive from time to time.
 
The topic "Multi-Threading" is not new in Civ4BTS / Civ4Col modding. It is not about knowledge.
The top programmers have discussed it for years already and literature for programming is widely spread.

There is one simple reason why it has been done only once so far:
Lack of skilled active modders with enough time and motivation. :dunno:
There is more to it than that. All Civ4 based games are single threaded in nature, which was fine when they came out as most CPUs only had one core. Multithreading is usually planned into the core design and usually isn't a later add-on. Turning existing single threaded code into multithreaded code is often much harder than writing multithreaded code from scratch. The nature of how networked multiplayer works makes certain parts of the code impossible to run in more than one thread, because it requires the end result to be the same regardless of which thread finishes first.

Take pathfinding, for instance. It can test multiple paths in parallel and then pick the best one. It doesn't matter which one is tested first, as the outcome won't affect the other threads. Now imagine running CvCity::doTurn in parallel. Will the outcome depend on the order in which the cities are run? Yes. For starters there is the issue of money, but I also believe there is some code to keep track of how many cities produce which unit, to avoid all of them producing the same one (at least for this example, let's assume there is such a cache). If the AI wants to produce 4 units and 4 are already under production, the next city will produce a different unit. Which cities produce the first unit then depends on which threads get to pick production first. Since we can't control which thread runs first as they run in parallel, the different computers will end up going out of sync regarding what to produce. For code to be well suited to multithreading, its different parts should be able to run independently of each other. Vanilla code feels like it is aiming for the opposite, which means adding multithreading can easily become a question of rewriting the affected code from scratch.
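
The order dependence described above can be shown with a small, single-threaded Python sketch. Everything in it is hypothetical: the unit names, the cap of 4 and the shared counter just follow the assumed cache from the example and are not taken from the actual CvCity code. Instead of real threads, it simply tries every possible processing order, which is what an uncontrolled thread schedule amounts to.

```python
from itertools import permutations

MAX_SAME_UNIT = 4  # hypothetical cap, taken from the example in the post


def choose_production(city_order):
    """Let each city pick production in the given order, using shared state."""
    soldiers_in_production = 0  # stands in for the shared "who builds what" cache
    choice = {}
    for city in city_order:
        if soldiers_in_production < MAX_SAME_UNIT:
            choice[city] = "SOLDIER"
            soldiers_in_production += 1
        else:
            # The cap is reached, so this city builds something else.
            choice[city] = "WORKER"
    return choice


cities = ["A", "B", "C", "D", "E"]

# Collect every distinct end result over all possible processing orders.
outcomes = {
    tuple(sorted(choose_production(order).items()))
    for order in permutations(cities)
}

if __name__ == "__main__":
    print(len(outcomes), "different results, depending only on ordering")
```

With five cities and a cap of four, whichever city happens to go last ends up with the other unit, so there are five different results. In a networked game each computer could land on a different one. Forcing a fixed order (for example by city ID) restores determinism, but it also turns that decision back into serial code.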
 
Some fantastic information and context in the replies above! I love parallelizing my own code, but in a new code base, in a language I haven't used in years (and never in my day job), and with multiplayer synchronization to worry about? Yeah, that is a lot to consider!

The other context I'll add for the less nerdy is what was alluded to about only certain areas being able to be multithreaded. Let's say that 40% of the AI turn time inherently cannot be multithreaded, for various reasons (Nightinggale illustrated one example). Then the most you can possibly speed up the AI turn is to take it to 40% of its previous runtime, assuming you have an infinite number of processors and no overhead to run what can be parallelized. This is called Amdahl's Law in computer science. If you want a better improvement than that, you'll have to make other optimizations instead of, or in addition to, adding multithreading.

And the theoretical benefits don't always pan out. I remember making an optimization that in theory I expected would speed things up, but due to overhead, when measured, it actually slowed things down. That's part of why having someone experienced and with the proper tools and patience is important; it's all too easy to add optimizations that aren't actually optimizations in practice.
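
To put numbers on Amdahl's Law, here is a short Python sketch. The 40% serial share is the illustrative figure from the paragraph above, not a measurement of Civ4, and the formula ignores threading overhead entirely, so real results would be worse.

```python
def amdahl_speedup(serial_fraction: float, workers: int) -> float:
    """Upper bound on speedup when `serial_fraction` of the work cannot be parallelized."""
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be in [0, 1]")
    if workers < 1:
        raise ValueError("workers must be >= 1")
    parallel_fraction = 1.0 - serial_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


def remaining_runtime(serial_fraction: float, workers: int) -> float:
    """Fraction of the original runtime that is left after parallelizing."""
    return 1.0 / amdahl_speedup(serial_fraction, workers)


if __name__ == "__main__":
    for workers in (1, 2, 4, 8, 1_000_000):
        print(workers, round(remaining_runtime(0.40, workers), 3))
```

With a 40% serial share, 2 workers leave 70% of the original turn time, 4 workers leave 55%, and 8 leave 47.5%. Adding more cores only creeps toward the 40% floor and never gets below it.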

This kind of makes me want to dive into this and see if I can make anything happen, but I already have enough other side projects, and C++ is one of the few languages I've programmed in where I've never attempted to parallelize code...

64bit 4GB patch tool on Civ4 (EDIT: Steam & GoG copies come pre-patched for 4GB & don't need this)
It should be noted that this is in fact a patch to allow 32-bit programs (such as Civ IV) to use the full 4 GB of memory available in the 32-bit address space on 64-bit Windows systems, versus the 2 GB limit that Windows enforces otherwise. But it's required to be on 64-bit Windows to make use of it (something I hadn't been aware of until today, probably because I ran XP x64 back in the day and had assumed it worked on all XP variants).
 
(& possibly less memory allocation crashes)

One way to reduce memory allocation crashes (due to memory leaks) is to use an inexpensive commercial product called Process Lasso from Bitsum. It runs in the background and monitors/optimizes all the processes that are running on your PC. It manages the priorities to keep background processes from hogging the CPU from foreground processes.

One feature it has is a thing called Trim. It will execute a Trim to have certain processes release memory at designated conditions. The other feature is Process Watchdog. This is a tool to monitor a specific process and take certain defined actions when defined conditions occur.

For Civ 4, I have a Watchdog set to Trim the working set memory whenever it exceeds 825 MB. I have not had an in-game memory problem ever since. The only problem is when I try to load a saved game from a running game. It crashes me to the desktop, but then I just restart and load the saved game fresh.
 
A 64-bit multi-threaded Civ4? An idea is to "simply" re-implement the Civ4 engine and game DLL, and port the scripts to python 3 I suppose. A lot of work, but Civ4 is a very open game. Simple file formats, the DLL's source code is right there. It's surprising a re-implementation hasn't happened yet. Get all those Civ4 quirks fixed. Unit UI, the slowdown, turn times.
 
A 64-bit multi-threaded Civ4? An idea is to "simply" re-implement the Civ4 engine and game DLL, and port the scripts to python 3 I suppose. A lot of work, but Civ4 is a very open game. Simple file formats, the DLL's source code is right there. It's surprising a re-implementation hasn't happened yet. Get all those Civ4 quirks fixed. Unit UI, the slowdown, turn times.
So your simple solution is to rewrite the game from the ground up? Honestly, as someone who maintains a commercial multithreaded application as a day job I can tell you that this is pretty much what would be required.

As others have already said, multithreading is hard. You need to structure your entire application around algorithms that parallelize neatly. If you don't, you just end up with large chunks of synchronous code interspersed with bursts of parallelism. And that can easily lead to WORSE performance overall because you are paying for the synchronization costs, both visible and invisible. And it must be done carefully lest you introduce new bugs through parallelism that are incredibly difficult to debug, because your code literally contains undefined states thanks to the parallelism itself.

So yea, multiprocessing is an architectural decision and not a programming one. Trying to add it to an existing project and achieve meaningful results is akin to trying to bolt a plow onto a Formula 1 race car. Yes, you can technically do it. And you might even be able to drive it around a field somewhere. But the technical challenges of actually turning it into a functioning agricultural machine are going to be... cosmic.
 
I would imagine that you could get a decent performance improvement by just writing the same architecture better. I had my end game of 200 cities, no unit movement, no AI, but turn times were far from instant. I can only wonder what's going on. Maybe it's unavoidable. Maybe it's massive python overhead.

Rewriting an engine is something I would be interested in, but I already have a bunch of things to do. I have thrown it into Ghidra to find out how it interfaces with the DLL.

I was pondering the idea of playing with Reinforcement Learning, and for that, you need a fast engine because you need lots of data points. So I was going to port the DLL and all used python code to modern C++.
 
I would imagine that you could get a decent performance improvement by just writing the same architecture better.
I would like to apologize if I say something outrageous as I know nothing about coding, but could inspecting the code with chatgpt4 yield any improvements? If an expert does it, would he be able to assess the suggestions of chatgpt quickly enough as to whether they make any sense? If they do, it should probably be easier to implement them, as it is capable of writing code by itself and generating optimized code?

I was pondering the idea of playing with Reinforcement Learning, and for that, you need a fast engine because you need lots of data points.
Do you mean you'd like to use it for AI so it would have more agency and flexibility? Akin to AlphaGo, where it came up with novel tactics? Man, I would like to play this version of Civ... It would put the AI on much more even ground and allow cutting the insane bonuses it gets on higher difficulties. It would probably discover all these failgold techniques quickly enough and would need to be restrained to keep the game interesting. I would like to see what kind of novelties it would come up with that players had not discovered.
 
I would like to apologize if I say something outrageous as I know nothing about coding, but could inspecting the code with chatgpt4 yield any improvements? If an expert does it, would he be able to assess the suggestions of chatgpt quickly enough as to whether they make any sense? If they do, it should probably be easier to implement them, as it is capable of writing code by itself and generating optimized code?
I'm no expert but I strongly doubt that ChatGPT could decipher it. To my understanding, it is like someone (ChatGPT) whose reading skills are only good enough to understand English comic strips wanting to read and write Thai.
(Feel free to correct me if I'm wrong)

Do you mean you'd like to use it for AI so it would have more agency and flexibility? Akin to AlphaGo, where it came up with novel tactics? Man, I would like to play this version of Civ... It would put the AI on much more even ground and allow cutting the insane bonuses it gets on higher difficulties. It would probably discover all these failgold techniques quickly enough and would need to be restrained to keep the game interesting. I would like to see what kind of novelties it would come up with that players had not discovered.
Actually there is an "ultimate solution" for the superior AI - that works with chess - which is calculating every possible step and choosing the best outcome. Obviously this wouldn't work in practice with Civ4 - unless you are willing to wait days between turns OR you have a positronic computer from the 24th century :lol:
 
It would be one heck of a herculean effort for ChatGPT to figure this stuff out. Maybe it could give you some ideas just like any other investigative effort. As always, ChatGPT is a statistical machine, not an oracle.

And the engine would need reimplementing anyway, and you don't have the source code for that. Maybe you can feel brave and apply ChatGPT to reverse engineering.


Trying to do RL with Civ has been done before. There's https://people.csail.mit.edu/camato/publications/LearningInCiv-final.pdf. There is also a freeciv learning environment: https://github.com/yashbonde/freeciv-python.

Whether RL can invent new tactics will depend on how you implement it. There's the Starcraft AI written in python that is rule-based. So the AI can only do what you define is possible. Then you slowly replace the decisions within those rules with deep learning. Convolutional neural networks are used for spatial information.

On the other end, there's the gameboy pokemon AI that handles inputs and the screen directly, so anything is possible. That would be less practical for a more complex game.
 
Rewriting an engine is something I would be interested in, but I already have a bunch of things to do.
Same here. Judging by my current modding progress, I really question if I should take on even more tasks, particularly one this huge. It would allow using a modern compiler with modern C++ and modern optimization etc, so if the game is too slow it could be tempting. It's probably not the most time efficient optimization though (from a programming hours point of view).

It would be one heck of a herculean effort for ChatGPT to figure this stuff out. Maybe it could give you some ideas just like any other investigative effort. As always, ChatGPT is a statistical machine, not an oracle.
My guess is that GitHub CoPilot would be a better choice than ChatGPT. Fundamentally they build on the same idea, but CoPilot is trained on code specifically and aims at helping programmers. I have never tried it though, so I have no idea how good it is.
 
On the other hand, it should be quite simple, relatively. You can port the DLL to modern C++ with minimal modifications. Just get it compiling.

Then, you figure out exactly how the engine interfaces with it. I've used a DLL logger "API monitor" for this, but the engine makes soooo many redundant calls. And they're DLL calls, the compiler will never optimise them away. Another way to log calls is to grab VS2003 and add logging that works with the existing engine.

That step is the critical step. Once you have the DLL compiling with the latest C++, and you know how the engine interfaces with it, you can then make a new engine with minimal further RE, I guess. Just a bunch of UI and asset loading, right? At least 99% of game mechanics are in the DLL. There will be a few things left like the AStar implementation, which is in the engine.

Once that's working, then one would verify it's correct. Save file comparisons. Then look for sources of performance problems with a profiler. Maybe try reimplementing the DLL entirely with known information about the mechanics, and make it efficient. That would be the next big task.
 
I like the ideas snowern is bringing up. Post #16 outlines a potentially plausible (still not easy) way to add multithreading where it isn't there already. Assuming all the substantial AI (turn times) is in the DLL, then if you can determine its endpoints, and then profile it to figure out where the bottlenecks are, then you can start making a replacement DLL - ideally initially using the original DLL's functions for the areas you haven't optimized, so you don't have to replace everything at once.

Hopefully, the profiling phase would point to some functions that are relatively easy targets for parallelizing. I've had some old code that was a natural fit for tacking on parallelization later (e.g. a map import script for Civ III, which could be done per-row in parallel), and others where there was no obvious way to parallelize it. Although it's also possible that there are algorithmic improvement opportunities that don't require parallelization.

Easier said than done, and I have too much going on to want to tackle it myself. But intriguing. The "figure out something meaningful from the DLL" part is where I stumbled when I was looking at this for Civ III, but that was in the days before Ghidra, just IDA. Even if it never progresses beyond "here's how to use Ghidra to evaluate the DLL and add some logging/profiling, and here we can see some bottlenecks", I think that would be a very interesting write-up to read.
 
On the other hand, it should be quite simple, relatively. You can port the DLL to modern C++ with minimal modifications. Just get it compiling.

Then, you figure out exactly how the engine interfaces with it. I've used a DLL logger "API monitor" for this, but the engine makes soooo many redundant calls. And they're DLL calls, the compiler will never optimise them away. Another way to log calls is to grab VS2003 and add logging that works with the existing engine.
Some work has been done in this regard in WTP, but it's not usable yet, nor do we know if it will ever end up working as intended.

There are more than just "minimal modifications" needed to get the DLL to compile. C++03 allows a lot of questionable code, which is no longer allowed in modern C++ as it's prone to bugs. Still, it's mostly a question of compile, get error, fix error, compile again etc, so mostly doable.

The real issue is the exe calls. The exe exists in 2003 address space and code built with modern compilers doesn't. This means that if the exe is going to be compatible with something compiled on a modern compiler, what is essentially needed is a layer which converts between pointers and indexes. Making calls by value between compiler versions is allowed, so it's allowed to make a call with unit 417 (an int) as its first argument, but it's not allowed to make the same call using a CvUnit pointer.

When using two memory domains (for lack of a better term), each will have a 4 GB limit and our new one can even be 64 bit if needed, but it's likely faster to keep both as 32 bit.
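
The pointer/index layer can be sketched in a few lines of Python, purely to show its shape. A real layer would be written in C++ and would have to deal with object lifetimes and the actual exe interface. The class and function names here (HandleTable, Unit, exe_side_call) are hypothetical and do not come from WTP or Civ4.

```python
class HandleTable:
    """Maps plain integer handles to objects that live on one side of a boundary."""

    def __init__(self):
        self._objects = {}   # handle -> object
        self._handles = {}   # id(object) -> handle
        self._next = 0

    def to_handle(self, obj) -> int:
        """Return a stable integer for `obj`, registering it on first use."""
        key = id(obj)
        if key not in self._handles:
            self._handles[key] = self._next
            self._objects[self._next] = obj  # keeps obj alive, so id() stays valid
            self._next += 1
        return self._handles[key]

    def from_handle(self, handle: int):
        """Resolve an integer handle back to the object it stands for."""
        return self._objects[handle]


class Unit:
    """Hypothetical stand-in for something like CvUnit."""

    def __init__(self, name):
        self.name = name


def exe_side_call(table: HandleTable, unit_handle: int) -> str:
    """Only a plain int crosses the boundary; the object is looked up locally."""
    return table.from_handle(unit_handle).name


if __name__ == "__main__":
    table = HandleTable()
    scout = Unit("Scout")
    print(exe_side_call(table, table.to_handle(scout)))
```

The design point is that only values such as integers cross between the two sides, and each side resolves them against its own table. That matches the "call by value is allowed, call by pointer is not" rule described above.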
 
Getting the DLL to compile in modern C++ is something I've done. To some extent at least. I even got it to initialise, generate a map with C++-ised python code, and execute a few turns with it. But some desync occurred, the game started to diverge from original Civ4. Maybe I did something wrong in the DLL, or the C++-ised python code, or maybe I wasn't calling the DLL correctly.

Figuring out exactly how the DLL is interfaced with will be important. That may require more RE of the exe or a good logging system. It's not a simple interface. There's a lot going on.

what is essentially needed is a layer
If you want to do it the other way... you could write a new engine first in x86 and have it use the old DLL. You "just" need to come up with a VS2003 C++ stdlib to use with the DLL, I guess. For that extra hacky programming experience. Would need the VS2003 string type at least.
 
Okay, so I convinced myself...

How to compile the (x86) DLL in VS2022:
  • Grab the DLL's source code. Make a solution out of it. No need to include all of Boost in the project.
  • Set up the precompiled header.
  • Define these common macros: WIN32;FINAL_RELEASE;_USRDLL;_CRT_NON_CONFORMING_SWPRINTFS;_CRT_SECURE_NO_WARNINGS;_SILENCE_STDEXT_HASH_DEPRECATION_WARNINGS;_WINDOWS
  • Define these macros for debug: _DEBUG;_ITERATOR_DEBUG_LEVEL=0 (iterator debugging will change the size of containers)
  • Define these macros for release: NDEBUG
  • Pick C++14, or c++latest with _HAS_AUTO_PTR_ETC.
  • Remove register keywords.
  • Use /Zc:forScope-, or fix the code.
  • Enable Windows heap debugging in VS debugger settings. Watch out for BAADF00D.
  • Replace the calls to malloc/free with calls to msvcr71.dll in CvGameCoreDLL.cpp.
  • Replace all uses of std::string, std::wstring, and std::vector with your own type aliases that use a 4-byte allocator class. These containers are ABI-compatible except for padding in the VS2003 versions.
  • Replace std::list with a type alias that uses the 4-byte allocator class. It's part of the interface of CvPlayer and CvCity.
  • Implement your own std::map, just for CvDllTranslator. The node structure was changed, so you can't just pad the allocator this time. At least you only need operator[].
  • Remove implicit conversion operators in CvString and CvWString. They are far too dangerous to be left around and will trigger memory corruption in bad ternary expressions. Fix the massive amount of code that used them.
  • Write a python script to generate a module def file.
    • Grab a list of delay load imports from Ghidra.
    • Grab a list of your unmodified exports from DLL Export Viewer.
    • Use heuristics to match them up. Notably, wchar_t in VS2022 is unsigned short in VS2003, and you'll need to replace references to your special allocator type.
    • In the end, this means that std71::vector<wchar_t> in your code may actually be a std::vector<unsigned short> from Civ4.
    • If Civ4 fails to delay load a function, the exception info is a pointer to a structure that contains a pointer to the function name.
  • Tell VS to copy the DLL into the Civ4 installation after building.
  • Build it and get used to debugging mysterious crashes.
Once you've got a DLL working, you can then debug and profile within the comfort of VS2022.
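
For the module def file step in the list above, a minimal Python sketch of the matching heuristic could look like the following. It assumes you already have two plain lists of decorated names: the ones the exe tries to delay load and the ones your rebuilt DLL exports. The decorated names in the example are made up for illustration and are not real Civ4 exports. The only rewrite applied is the wchar_t versus unsigned short difference, which in MSVC name decoration is `_W` versus `G`. References to the special allocator type would need a similar substitution that is not shown, and whether the linker accepts every generated alias line is something to verify on a real build.

```python
def normalize(decorated: str) -> str:
    """Rewrite the wchar_t type code (_W) as unsigned short (G) for matching.

    This is a blunt textual heuristic: "_W" inside an identifier is rewritten
    too. Both sides are normalized the same way, and ambiguous matches are
    reported instead of being guessed.
    """
    return decorated.replace("_W", "G")


def build_def(imports_expected, exports_actual, library="CvGameCoreDLL"):
    """Return (def_file_text, unmatched_imports)."""
    by_normalized = {}
    for name in exports_actual:
        by_normalized.setdefault(normalize(name), []).append(name)

    lines = [f"LIBRARY {library}", "EXPORTS"]
    unmatched = []
    for wanted in imports_expected:
        candidates = by_normalized.get(normalize(wanted), [])
        if len(candidates) == 1:
            actual = candidates[0]
            # entryname=internal_name exports `actual` under the name the exe expects.
            lines.append(f"    {wanted}" if wanted == actual else f"    {wanted}={actual}")
        else:
            unmatched.append(wanted)  # missing or ambiguous: resolve by hand
    return "\n".join(lines) + "\n", unmatched


if __name__ == "__main__":
    # Hypothetical decorated names, for illustration only.
    imports = ["?setLabel@Widget@@QAEXPBG@Z", "?getId@Widget@@QBEHXZ"]
    exports = ["?setLabel@Widget@@QAEXPB_W@Z", "?getId@Widget@@QBEHXZ"]
    text, missing = build_def(imports, exports)
    print(text)
    print("unmatched:", missing)
```

Anything that ends up in the unmatched list is a candidate for the delay load failures mentioned above, so it is worth printing and checking by hand before building.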

[Attached image: 2024y04m19d - Civ4 DLL in VS2022 (flamegraph profiling).png]

This was my 200 cities end game save. No unit movement, no AI. A chunk of time in turns was spent by cities reevaluating plots it seems. Sure enough, I turned off citizen automation and halved my turn times.

I think you'd also get some general improvements from better optimisation.
 