Mini-engine progress

Many games have got reimplementations. But Civ4 will be different from most as you have the DLL source available.

Other companies don't even like having their games reverse engineered but maybe Firaxis doesn't even care about having their source code used in a project like this. Requiring players having the original assets because they own the game would certainly help to avoid trouble. Once this becomes a standalone game things could get heated.
 
Ordinarily, original assets are required, but because this is a console implementation, art assets are not needed at all (except optional sounds). You could in theory provide your own XML info files and scripts.

A reason to rush towards a graphical implementation I guess.
 
Turn times are now 2-3s for my three ultra huge "early"-game test cases (FASTER!)

But, I've reached the bedrock of optimisation where AI_foundValue is taking up 43% of time on the main thread for save Giant Pangea begin T209.
2024y11m27d - AI_foundValue.png

A lot of little things are threaded now. The two bumps are normal players calculating found values, and then the barbarians. Or the other way around.

Unfortunately, I don't think there's anything I can do about that. AI_foundValue is a big complex function that can't be incrementally updated easily. At least barbarians don't found cities all the time.

But other things can be incrementally updated. Land targets, stolen city plots. Those have disappeared from the profiler.

Here's the flame graph for Giant Pangea begin T200, which only takes 1.8s:
2024y11m27d - Flame graph.png

Which is odd, as it's the save that has 3K barb animals. But you can see AI_animalMove.

Exploration is another big improvement. Instead of looking at every plot, AI_explore and AI_exploreRange will perform a localised search that should stay reasonably close to original mechanics. AI_explore is an interesting case, as to avoid searching the whole map, you can exploit the behaviour of its value function to know when to stop a local search (in most cases).

A specific case is Charlemagne's 627ms work boat that could not explore anywhere because it was stuck between barb galleys. But it would still pathfind all over the map, eating precious CPU cycles in AI_plotDanger, defeating my plot danger early-out code because there's 800 barb galleys.

Still haven't finished an actual Huge+++ map yet. Really should get around to that, to find the rest of the slow paths, check things work.
 
Not only is this a cool project with huge potential for the future, it's also a fascinating look into what's going on deep under the hood of the game.
 
Are you using the vanilla BTS source code or a modded version?

Edit:
Have you tried tried using Clang as Compiler?
 
Last edited:
Are you telling that 43% of time game is spending on AI trying to calculate the value of founding a new city? That's ridiculous)
May be this function is too complicated to optimize itself, but we can optimize how often it is called? How many times this function has been called at this save? Looking at signature - it is called for each plot. But I assume it is called only for plots that are achievable by that AI to settle, only when it has settlers, and only once for each plot for each AI. If any of this is not true somehow ( i have seen coding like this!) - it could be a huge improvement)
 
Are you using the vanilla BTS source code or a modded version?

Edit:
Have you tried tried using Clang as Compiler?
It is the original DLL code, with modifications to make it work for a modern compiler and the new engine. The mechanic-changing performance modifications are enabled by ENABLE_FASTER_ULTRA_HUGE_MAPS.

Compiling on Linux/WSL is something I hope to do. Code will need to be made portable, and I will have to figure out CMake. It would be a kind of fun. Should be faster on native Linux. On Windows, a lot of time is spent just printing to the console.

May be this function is too complicated to optimize itself, but we can optimize how often it is called? How many times this function has been called at this save?
Yes, from a second look, it doesn't seem all that bad. It could probably be called less, as recomputation appears to be unconditional right now. And most of the function could be locally updated. But it has global dependencies, like the number of cities, and the nearest city.
 
The pathing was fast because it was wrong!

Good thing I had this verification code lying around to find out. I just didn't use it because it was slow.

:cry:

The caching system was wrongly caching some pathing state objects that depended on destination coordinate, which is a minor problem (I can probably just ignore that for long range pathing). And some pathing state objects did not invalidate when "max turns" changed, which had the effect of aborting long range pathing queries early because max turns increased.

Now, I have pathing state cache verification, which is much more rigorous, and pathing state reuse no longer depends on max turns.

Also, the coarse heuristic is inconsistent, which makes A* inefficient.

Back to the drawing board! I still have a few tricks up my sleeve...

I wonder if the Factorio heuristic is consistent. It is likely admissible, but consistency is important too. Even if my heuristic is admissible, it may be picking different estimations, causing the heuristic to jump about in an inconsistent way.

Below looks like a 30% visitation, but the logs say 51% of land was visited... the varying opacity of the overlay indicates an inconsistent heuristic.
2024y11m27d - Inconcistent heuristic.png
 
Hi,
any pre Christmas news ? :)
Oh, I'm just stuck in pathfinding hell at the moment.

How do you make pathfinding fast? There are many choices.
  • Jump Point Search - A well-known method, but it's for uniform grids.
  • Weighted Jump Point Search - This could work, but is greatly complex and might not be all that fast because of how noisy and scattered features on the map tend to be.
  • Rectangle Symmetry Reduction - Simply partition the map into uniform cost rectangles and reduce the branch factor with symmetry reduction. This might get you a speed up, but might not really be enough. Plus, it might get worse in the late game as the AI leaves holes in its road networks.
  • 8-DOP/General Convex Symmetry Reduction - I could extend the idea to increase the effectiveness of the method, but too complex for the improvement I think you'd get.
  • Exact Heuristic - How this would work is you divide the map into blocks, and then compute the shortest path between every pair of blocks. This would be great, except that precomputation and updates would be slow. It might actually be good enough in practice, though, if you pick a large block size.
  • Bidirectional Search - This seems like a reasonable idea in the face of dijkstra-like search. You will get some very very slight suboptimality though.
  • The Factorio heuristic - I've found some info in a forum post and it's simply (number of chunks to goal) * (chunk size) + (shortest possible distance to the next chunk). It's probably neither admissible nor consistent, but must be good enough for them. I might not use it for the heuristic, but I could use it to guess the shape of the path...
  • Landmarks/Differential Heuristic - Near-exact. You just place a couple of landmarks around the edge of an area, coastal plots, and do dijkstra for the whole area. Then your heuristic is a couple of table lookups.
    But, because the heuristic covers the whole map, any route or feature change will invalidate the heuristic. Lifelong Planning A* could be used to minimise the amount of updates. The worst case is the AI building a road near a landmark, which could have cascading effects. Another improvement is to cheat: allow the use of a stale heuristic if the expected "path shape" does not go near changes. This may create slightly sub-optimal paths.
  • Just make it faster - Some comment somewhere said that you can do 512x512 dijkstra in less than 1ms. Of course I was interested. It turns out that with great optimisation, I couldn't do it faster than 23ms (~100ns/node). But if you do it on a uniform grid, it turns into BFS and you can do it in 1.6ms (~6ns/node). Maybe that's what happened. It wasn't dijkstra at all.
  • Hierarchical A* - The tried and true method. What you can do is exact pathfinding up to X blocks, and then switch to hierarchical. This method would be very fast, but clearly suboptimal.
  • Optimal block-wise pathing - This is what I'm going to do next. You can simply divide up the map into blocks, compute boundary all-pairs shortest paths, and then pathfind on the boundaries only. Skipping over the interior of blocks, you will obviously vastly reduce the number of nodes in the search and still be optimal.
    But the big downside is that you have massively increased the branch factor. A 16x16 block covers 256 plots and has 60 plots on the boundary. Each boundary plot connects to the other 59 boundary plots, so the number of edges coming out of one side of a block is almost a thousand. Not great! But... it might be possible to use AVX to accelerate this.
    The second "disadvantage" is that these blocks will have to encode full validity info, which means I need to fully cache pathValid, which is a complex function. But, I should probably do that anyway, as it will give a decent speed up for even basic A*.
    The third disadvantage is memory usage. All-pairs paths is a lot. 3600 16-bit weights per 16x16 block, 7.2KB per block, ~16MB for the whole map. And it's per player, per path flags, per unit movement spec, per unit traits. It's gonna add up. This could be improved by deduplicating blocks.
    *Might be slightly suboptimal as the full pathCost function won't fit into 16 bits. Will have to divide the all-pairs costs by 1000, which mostly divides away the cost from defence bonus.
Also, I considered using C++ modules. You're lucky that didn't work out well. Maybe next time!

*I just found a magical paper: https://ojs.aaai.org/index.php/AAAI/article/view/7813. 1: Why didn't I think of that. 2: I hope it's not too good to be true.
*Too good to be true. It has the same combinatorial problem as above, so I don't know how they get their timings. But the structure of the algorithm is neat.
 
Last edited:
Just a thought (complete non-expert here). With the "Exact Heuristic", you say "precomputation and updates would be slow". How about:

1. On game start (4000BC) it precomputes only the blocks that players/AI can reach in a couple of turns or so.
2. Then during idle time (game is waiting while human thinks), it could begin precomputing the next set of blocks, radially.
3. All computed paths get added to a lookup table.
4. As the civs expand, more blocks are computed while waiting for player input (which could be quite a lot of time).
5. When roads are added to a tile, the relevant blocks are recomputed and the lookup table updated.

Then most of the hard lifting is done while the human is thinking, so they're not waiting ages between turns. They might notice the fan spin though!

(I'm sure there are plenty reasons this wouldn't work but wanted to type it out as I thought of it!)

Paul
 
Just a thought (complete non-expert here). With the "Exact Heuristic", you say "precomputation and updates would be slow". How about:

1. On game start (4000BC) it precomputes only the blocks that players/AI can reach in a couple of turns or so.
2. Then during idle time (game is waiting while human thinks), it could begin precomputing the next set of blocks, radially.
3. All computed paths get added to a lookup table.
4. As the civs expand, more blocks are computed while waiting for player input (which could be quite a lot of time).
5. When roads are added to a tile, the relevant blocks are recomputed and the lookup table updated.

Then most of the hard lifting is done while the human is thinking, so they're not waiting ages between turns. They might notice the fan spin though!

(I'm sure there are plenty reasons this wouldn't work but wanted to type it out as I thought of it!)

Paul
If the map didn't change, computing the exact heuristic for all-block-pairs would be easy at startup. And you can thread it. And you can let the computation run until it fills the map, giving you all heuristic values from one block in one go.

But the updates, those are the problem. You can't just defer them. When a unit wants a path, it needs updated values if you want optimality. And because this kind of heuristic is global, it's invalidated by distant changes.

But, it could be optimised. A path only needs the heuristic values coming from the goal, so one A* search is enough to update the heuristic for one path.

I said it might be okay in practice. But it's probably more computation than necessary. The Landmarks heuristic should be more performant overall as map updates will invalidate less stuff, and landmarks should be plenty accurate.

But if you allow sub-optimality, then you can arbitrarily defer updates. You could maybe do something really efficient like a one-turn lag for all heuristic values. You'd then update the heuristic between turns during the human's turn.

Fast pathing isn't all that difficult, ultimately. I could just slap Hierarchical A* on it and call it a day. I might have to, but I will resist being defeated by this foreboding comment!

Sounds about right. Games with bigger maps these days usually use algorithms that are much faster than A* (but non optimal)
 
Fast pathing isn't all that difficult, ultimately. I could just slap Hierarchical A* on it and call it a day. I might have to, but I will resist being defeated by this foreboding comment!
So it isn't so much an issue of coding but more of an issue of ego? 🤣
 
If the map didn't change, computing the exact heuristic for all-block-pairs would be easy at startup. And you can thread it. And you can let the computation run until it fills the map, giving you all heuristic values from one block in one go.

But the updates, those are the problem. You can't just defer them. When a unit wants a path, it needs updated values if you want optimality. And because this kind of heuristic is global, it's invalidated by distant changes.

But, it could be optimised. A path only needs the heuristic values coming from the goal, so one A* search is enough to update the heuristic for one path.

I said it might be okay in practice. But it's probably more computation than necessary. The Landmarks heuristic should be more performant overall as map updates will invalidate less stuff, and landmarks should be plenty accurate.

But if you allow sub-optimality, then you can arbitrarily defer updates. You could maybe do something really efficient like a one-turn lag for all heuristic values. You'd then update the heuristic between turns during the human's turn.

Fast pathing isn't all that difficult, ultimately. I could just slap Hierarchical A* on it and call it a day. I might have to, but I will resist being defeated by this foreboding comment!
Thanks for replying; interesting to see your thoughts :)

One of my ideas (buried in my list of steps) was the use of idle time (human player looking around and planning) to build/update the values, rather than having it take extra time while the AI turns are calculated. So the human is not sitting waiting for the work to get done, as it's done while the game is waiting for the human.

Wasn't sure if that was even possible but it sounds like from your comment that it is(?) "You'd then update the heuristic between turns during the human's turn."

Very interested in how all this turns out! No experience in pathfinding myself yet, but will encounter the need in a little side-project I'm doing (far simpler than this). Gonna review those Stanford resources you linked to.

Paul
 
Wasn't sure if that was even possible but it sounds like from your comment that it is(?) "You'd then update the heuristic between turns during the human's turn."
To avoid further suboptimality, you could defer only the updates that were not needed by the AI. This kind of thing probably isn't worth it though. Extra complexity, small gain.

Currently, I'm thinking a blend of landmarks/exact heuristics. If you have enough landmarks, then it's basically almost exact. It seems, by the demos, one landmark near (or behind) the start of a path is perfectly fine. So you could have at most one landmark for every X plots, every couple of cities. And because the landmark is near the path, updating it can't be much slower in the worst case than normal A* anyway.

And if the heuristic is near exact, then fingers crossed, that alone should make pathing much faster, before even considering hierarchical or blockwise A*. Instead of visiting, like, 50% of land area, it'll be more like 5%. And the great thing about a heuristic is that it doesn't have to care about pathValid.

I'm hoping the final pathing system will be fast enough to not need path caching. With hierarchical or blockwise A*, maybe one side of the map to the other will take less than 1ms. Suboptimal either way, though. But blockwise A* could be made (possibly) optimal if you take into account remaining movement points upon entering a block, but that means more computation and memory usage.

There is another optimisation. You could do speculative pathing, a "dry run" of all units of one player to cache all expected paths, by imitating the vanilla unit update. And because it's just a dry run, all pathing is done with the initial game state, which means it can be done in parallel if the pathing system can handle it. Then afterwards in the serial unit update, it would hopefully use those cached paths. Pretty close to the magical parallel unit update, but requires path caching, and probably won't be all that useful if pathing is fast.

*Okay, results in (from the test program): Block A* is fast.

Code:
Basic AStar: 38.4655ms (127033 expanded, 302.799ns ns/expand, 981533 edges)
34620

No SIMD:
Block AStar: 14.9763ms (805 blocks visited, 18604.1ns/block, 82.6849ns/plot)
34620

AVX-512:
Block AStar: 0.8841ms (805 blocks visited, 1098.26ns/block, 4.88116ns/plot)
34620

And it was faster without SIMD, which is surprising. And with 70% of land area visited, just the basic heuristic.

16x16 blocks is optimal. Perfect for AVX-512 and only a little slower with AVX2. AVX-512 32x32 is slower than AVX2 16x16.

Surprising bug in the paper, afaik: You must push the block onto the heap unconditionally if any one of its sides has a g-cost reduced. I assume this is because the f-cost you get from the updated edge might be greater than the f-cost after the whole block is updated.

Promising result, but this is with a purely additive cost function. No movement points, just plot movement costs. Taking into account movement points would require an LDDB lookup per boundary plot, and the LDDB would no longer be a simple 2D array.

*Maybe this would be better as the heuristic, if necessary. The LDDB might get out of control if used for pathfinding. 23MB per player, per path flags, per unit type... Deduplication and caching may get complex.

*Landmarks aren't as effective as I'd hoped. They are much better if you assume symmetric costs, but that may cause slight suboptimality. With 4 landmarks and symmetric costs, plotwise A* is ~10ms. Performance doesn't really improve with more landmarks.
*8 landmarks with non-symmetric costs is about the same actually.
*Going nuts with 16 landmarks and AVX-512 to evaluate gets it down to 6.3ms. But that's a lot of landmarks to keep up to date.
*<3ms with AVX-512 plotwise A* and basic heuristic, ignoring movement points: SimdAStar: 2.3767ms (125122 visited, 18.9951ns/visited, 2.37438ns/edge)
*Landmark heuristic may not have any performance problems. If you build a road or chop a forest, you may be able to simply compute max path cost reduction and add that to a global heuristic bias. This should allow you to use an arbitrarily stale heuristic and update it whenever you feel like.
*~1.5ms for AVX-512 plotwise A* and 16x landmark heuristic. ~2.6ms for a longer path from one side to the other. Heuristic evaluation is expensive as it's a 16x gather basically. But using a per-block heuristic selection lookup isn't much of an improvement.
*Also, the big downside of using block A*, LDDB updates, may also be ignored. Could defer and use plotwise A* instead.
*AVX-512 plotwise A* on WSL is much faster at ~0.5ms. Slightly faster with -march=alderlake. MSVC codegen isn't all that great. Longer path is 1.3ms. Probably about time to start integrating this new path finder...
 
Last edited:
Top Bottom