[Dev] Performance Improvements Thread

Quintillus · Nov 22, 2021

Splitting this out from my "not really sure which thread this goes in" post last night. Performance work will likely be off-and-on rather than tied to any particular milestone, since slowness tends to accumulate gradually. It's also an area where sharing techniques could be beneficial.

Lately, I've been focusing on improving responsiveness, with two main targets: the Game scene load, and the Disband Confirmation popup (which I'm using as representative of all popups). The main tool in my toolbox is the C# Stopwatch:

Code:

using System;
using System.Diagnostics;
using Godot;

Stopwatch loadTimer = new Stopwatch();
loadTimer.Start();

//Do some things and stuff

loadTimer.Stop();
TimeSpan stopwatchElapsed = loadTimer.Elapsed;
// Round to whole milliseconds for readability
GD.Print("Game scene load time: " + Convert.ToInt32(stopwatchElapsed.TotalMilliseconds) + " ms");

This has allowed me to see what part of the load time is slow, and thus use Amdahl's Law to target areas for improvement. I've also been using the _EnterTree() method in Godot as the start point:

Code:

    public override void _EnterTree()
    {
        loadTimer.Start();
    }

    public override void _Ready()
    {
        //do things
        loadTimer.Stop();
        GD.Print("Game scene load time: " + loadTimer.ElapsedMilliseconds + " ms");
    }

The above gives an estimate of the overall load time (although it doesn't capture time before the node enters the tree; is that significant?), and by moving the stopwatch start/end, or using additional stopwatches, we can figure out which areas are slow.
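Extending the "additional stopwatches" approach, a small helper can keep one Stopwatch per named section and report each section's share of the total, which is exactly the breakdown Amdahl's Law needs. This is a minimal sketch: SectionTimer and the section names are hypothetical, and in Godot the report string would go to GD.Print.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;
using System.Text;

// Hypothetical helper: accumulates elapsed time per named section.
public class SectionTimer
{
    private readonly Dictionary<string, Stopwatch> sections = new Dictionary<string, Stopwatch>();
    // First-seen order, so the report lists sections in execution order.
    private readonly List<string> order = new List<string>();

    // Times one block of work and adds it to the named section's running total.
    public void Time(string name, Action work)
    {
        if (!sections.TryGetValue(name, out Stopwatch sw))
        {
            sw = new Stopwatch();
            sections[name] = sw;
            order.Add(name);
        }
        sw.Start();
        try { work(); }
        finally { sw.Stop(); }
    }

    public TimeSpan Elapsed(string name) => sections[name].Elapsed;

    public IReadOnlyList<string> SectionNames => order;

    // One line per section: total milliseconds and percentage of all timed work.
    public string Report()
    {
        double total = sections.Values.Sum(s => s.Elapsed.TotalMilliseconds);
        var sb = new StringBuilder();
        foreach (string name in order)
        {
            double ms = sections[name].Elapsed.TotalMilliseconds;
            double pct = total > 0 ? 100.0 * ms / total : 0.0;
            sb.AppendLine($"{name}: {ms:F1} ms ({pct:F0}%)");
        }
        return sb.ToString();
    }
}
```

In the Game scene, each phase between _EnterTree() and _Ready() could be wrapped as timer.Time("PCX decode", () => LoadPcxs()), where LoadPcxs stands in for whatever that phase really does, with GD.Print(timer.Report()) at the end of _Ready().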

Using this methodology, I've made a couple of targeted improvements:

- Optimizing the PCXToGodot.cs file so that when creating an Image/ImageTexture from a PCX, it only converts the cropped size of the image (when cropping is used), instead of the whole PCX. In some cases such as popup border graphics, this results in 98% fewer pixels being converted, with a corresponding increase in speed.
- Caching ImageTexture and PCX conversion results, so we don't have to do them more than once. This can make the second display of a popup take almost 80% less time than the first, and also saves time when graphics such as leaderheads are used in multiple places.
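Both optimizations can be sketched in plain C#: convert only the pixels inside the crop rectangle from palette indices to RGBA bytes, and memoize the result keyed by file and crop rectangle. This is a hedged sketch, not the actual PCXToGodot.cs code. It assumes the PCX is already decoded into one palette index per pixel plus a 256-entry RGB palette, and the class and method names are hypothetical. The output is four bytes per pixel, the layout Godot's Image.Format.Rgba8 describes, so it could presumably be handed to something like Image.CreateFromData (worth checking against the Godot 3 docs).

```csharp
using System;
using System.Collections.Generic;

// Hypothetical converter: indexed pixels -> RGBA8 bytes, cropped and cached.
public static class CroppedPaletteConverter
{
    // Key: source file plus crop rectangle, so different crops of one PCX are cached separately.
    // Assumes a file's contents don't change while the game is running.
    private static readonly Dictionary<(string, int, int, int, int), byte[]> cache =
        new Dictionary<(string, int, int, int, int), byte[]>();

    public static int ConversionCount { get; private set; }

    // indices: one palette index per pixel, row-major, srcWidth pixels per row.
    // palette: 256 * 3 bytes (R, G, B).
    public static byte[] GetCroppedRgba(string file, byte[] indices, int srcWidth, byte[] palette,
                                        int x, int y, int w, int h)
    {
        var key = (file, x, y, w, h);
        if (cache.TryGetValue(key, out byte[] cached))
            return cached;

        byte[] rgba = ConvertCrop(indices, srcWidth, palette, x, y, w, h);
        cache[key] = rgba;
        return rgba;
    }

    // Touches only w * h pixels, regardless of the full image size.
    private static byte[] ConvertCrop(byte[] indices, int srcWidth, byte[] palette,
                                      int x, int y, int w, int h)
    {
        ConversionCount++;
        var rgba = new byte[w * h * 4];
        for (int row = 0; row < h; row++)
        {
            int srcRow = (y + row) * srcWidth;
            for (int col = 0; col < w; col++)
            {
                int index = indices[srcRow + x + col];
                int dst = (row * w + col) * 4;
                rgba[dst] = palette[index * 3];         // R
                rgba[dst + 1] = palette[index * 3 + 1]; // G
                rgba[dst + 2] = palette[index * 3 + 2]; // B
                rgba[dst + 3] = 255;                    // opaque; transparency handling omitted
            }
        }
        return rgba;
    }
}
```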

Combined, these have cut the disband popup's load time from 225 ms on every load to 38 ms on first load (more than 80% less) and 8 ms on subsequent loads (96% less).

These improvements should be broadly applicable to PCX loads as well, particularly ones that use cropping.

Next I plan to look at implementing some similar improvements for FLC loading.

Quintillus · Nov 23, 2021

I've now had a chance to examine FLCs a little bit. Stopwatch timing showed that loading the Warrior takes a little over 700 milliseconds, and that's just the Civ3Unit; once it's loaded, the move/animate/etc. is negligible. With PCXs, I found an easy optimization with cropping, but it appears that the FLC is not cropped, which makes sense. So what do we have?

Spoiler Overview of Current Status :

Spoiler Ideas Inspired by a Pentium II :

Progress may be delayed by playing more turns in COTM 157 (Hittites, Regent).

-------

Update:

Spoiler Unsuccessful micro-optimizations, ideas derived from failures :

------

Update:

Spoiler Amdahl's Law Illuminates Potential Improvement Areas :

WildWeazel · Nov 23, 2021

Quintillus said:
On my machine (2011, Core i5, 3.6 GHz), that equates to 6 million pixels processed per second.

Vanilla Civ3 has 77 units (counting industrial/modern variants of leaders, settlers, etc.; there are no king units in Vanilla). That gives us a rough guesstimate of 54 seconds to load all the FLCs on my machine. That sounds less good than < 1 ms per image does.

I'd be curious to benchmark it on mine (Ryzen 4.7GHz) for comparison. Do you know if it's IO or memory bound, and will it parallelize well?

Quintillus said:

We could also consider loading FLCs/directions as needed, on demand. E.g. we only load Warrior Run the first time the Warrior runs, and perhaps we only load the specific direction that is needed. This might be what Civ does; I recall a slight animation lagginess on my Pentium II, which may reflect it processing an FLC on demand on its 20th-century processor, and would make sense given that it had to run in 128 MB of memory, which isn't enough to fit all the FLC files.

Possible, but that seems like unnecessary optimization these days. Modern games load multiple GB; surely we can handle a few hundred MB faster than they did on a Pentium. Would a one-time file conversion help with this at all? (I've always felt we could do more that way, and have been reading up on Godot's WebP support.)

Quintillus said:

Progress may be delayed by playing more turns in COTM 157 (Hittites, Regent).

Ha, I've been meaning to play that too.

Quintillus said:

(Another thought that occurs to me is that perhaps calling SetPixel on the Godot.Image a million times [literally!] is a limiting factor in performance. Could it be more efficient to use LoadBmpFromBuffer(byte[] buffer), after having created an in-memory byte array representation of the BMP? It might sound crazy, but I wouldn't be shocked if that were more efficient.)

Seems likely

Quintillus · Nov 24, 2021

WildWeazel said:
I'd be curious to benchmark it on mine (Ryzen 4.7GHz) for comparison. Do you know if it's IO or memory bound, and will it parallelize well?

Best guess is memory, perhaps somewhat CPU as well. Disk IO falls into the "everything else" category, so it's not a major factor at this point.

I'd also be curious to see how it does on such a modern system. My memory is DDR3 at 1600 MHz, with 9-9-9-24 timings. Which means the latency is pretty decent even by today's standards, but the bandwidth is much inferior to a modern kit of DDR4. The computer nerd in me is tempted to turn off my XMP profile and see how that affects things, but I probably won't actually do that.

We could parallelize it pretty easily (splitting the work by FLC would parallelize it 8 ways), but whether we'd see a big speedup I'm less sure of. When I added multi-threading to PCX imports in my editor, the speedup was less than one might hope for: on my quad-core, dual-channel memory system, it was somewhere in the 1.5-2.25x range, IIRC.
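Loading one FLC per task could look roughly like the sketch below, using Parallel.For from the Task Parallel Library. It assumes the decoder is thread-safe and touches no Godot objects; creating the ImageTextures would likely still happen afterwards on the main thread. The decode delegate is a stand-in for the real FLC reader, and ParallelFlcLoader is a hypothetical name. If the work is memory-bound, as suspected above, the speedup may stay well below the core count.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical loader: decodes each file independently on the thread pool.
public static class ParallelFlcLoader
{
    // results[i] corresponds to paths[i].
    // decode must be thread-safe: it should only read its own file and allocate its own buffers.
    // maxThreads must be positive (or -1 for no limit).
    public static TResult[] LoadAll<TResult>(string[] paths, Func<string, TResult> decode,
                                             int maxThreads)
    {
        var results = new TResult[paths.Length];
        var options = new ParallelOptions { MaxDegreeOfParallelism = maxThreads };
        // Each iteration writes a distinct array slot, so no locking is needed.
        Parallel.For(0, paths.Length, options, i =>
        {
            results[i] = decode(paths[i]);
        });
        return results;
    }
}
```

A call might look like ParallelFlcLoader.LoadAll(warriorFlcPaths, DecodeFlc, Environment.ProcessorCount), where warriorFlcPaths and DecodeFlc are placeholders. Capping maxThreads makes it easy to measure whether 2, 4, or 8 threads actually help on a given machine.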

It's always interesting seeing where bottlenecks are. I'm reminded of how CFC member T.A. Jones posted that Civ3 ran much quicker on Cedar Mill Pentium IV processors than on earlier Pentium IVs; the big difference was that Cedar Mill has significantly more L2 cache. Apparently some of the data used by the AI fit in the larger cache, but had to come from main memory on chips with smaller caches, at least for the map sizes he was playing.

WildWeazel said:

Possible, but that seems like unnecessary optimization these days. Modern games load multiple GB; surely we can handle a few hundred MB faster than they did on a Pentium. Would a one-time file conversion help with this at all? (I've always felt we could do more that way, and have been reading up on Godot's WebP support.)

It does feel like it should be unnecessary these days. And yes, a one-time conversion to a format Godot can slurp up natively would likely improve start-up time. Whether that's a WebP, PNG, or a serialized version of ImageTexture, I don't know.

Although, civ color is another variable that we haven't fully wrangled AFAIK, and which probably (?) would impact a one-time conversion. That might mean we have to convert for each civ color (which we'd have to do in a non-one-time live conversion too, though perhaps with some optimizations to reduce the impact), and then keep track of whether a mod used non-default civ color files (ntp##.pcx). Which really isn't any worse than how Civ4 uses cached Python scripts for the non-modded game, and uncached Python when you load a mod (caveat: it's been a year since I played Civ4, and longer for modded Civ4).

Quintillus · Nov 24, 2021

I've made a breakthrough in FLC import performance by following the "could we save time by using LoadBmpFromBuffer instead of SetPixel?" route.

The tl;dr is FLCs now import in 60% of the time they did before (420 ms, versus a bit over 700 ms).

The code is on a branch here, showing the revamped ByteArrayToImage which uses a helper method to get a validly-formatted in-memory BMP file that Godot can then slurp up directly and much more quickly.

I'd written a BMP parser previously (in Java), so the format was already familiar. It's a pretty simple format, too, which is why I gravitated towards it to try to prove this concept. And it's only a few hundred KB in memory at a time, so the fact that it uses more space than more modern formats doesn't matter.

One major call-out is that the getBmpBuffer function I added is low level, and as written is unsafe. This allows flexibility in treating the addresses within the byte buffer as whichever type makes sense for that location in the buffer, and the result is an 80% speedup over the old SetPixel calls. However, it is a new direction that we may not want to pursue. It may also be possible, albeit a bit more convoluted in reasoning (splitting ints into their constituent bytes to shove them into a byte array, for example), to convert that basic concept to safe code.
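For comparison, a safe-code version of the in-memory BMP idea can lean on BinaryWriter, which always writes multi-byte values little-endian (as BMP requires) and so takes care of the "splitting ints into bytes" step. This is a sketch, not the getBmpBuffer on the branch. It assumes pixels are already in top-down BGRA order and writes an uncompressed 32-bit BMP with a negative height to mark top-down rows. Two things are assumptions worth testing: whether Godot's BMP loader keeps the fourth byte as alpha, and whether it accepts top-down BMPs (if not, write rows bottom-up with a positive height). BmpWriter is a hypothetical name.

```csharp
using System;
using System.IO;

// Hypothetical safe-code BMP builder (no unsafe pointers).
public static class BmpWriter
{
    private const int FileHeaderSize = 14;
    private const int InfoHeaderSize = 40;

    // bgra: width * height * 4 bytes, rows top-to-bottom, each pixel B, G, R, A.
    public static byte[] Build32BitBmp(int width, int height, byte[] bgra)
    {
        if (bgra.Length != width * height * 4)
            throw new ArgumentException("Pixel buffer size does not match dimensions.");

        int pixelOffset = FileHeaderSize + InfoHeaderSize;
        int fileSize = pixelOffset + bgra.Length;

        using (var stream = new MemoryStream(fileSize))
        using (var writer = new BinaryWriter(stream)) // always little-endian
        {
            // BITMAPFILEHEADER
            writer.Write((byte)'B');
            writer.Write((byte)'M');
            writer.Write(fileSize);       // total file size
            writer.Write((ushort)0);      // reserved
            writer.Write((ushort)0);      // reserved
            writer.Write(pixelOffset);    // offset of pixel data

            // BITMAPINFOHEADER
            writer.Write(InfoHeaderSize); // header size
            writer.Write(width);
            writer.Write(-height);        // negative height = top-down row order
            writer.Write((ushort)1);      // color planes
            writer.Write((ushort)32);     // bits per pixel
            writer.Write(0);              // BI_RGB, no compression
            writer.Write(bgra.Length);    // image size in bytes
            writer.Write(2835);           // horizontal resolution (~72 DPI), informational
            writer.Write(2835);           // vertical resolution
            writer.Write(0);              // palette colors used
            writer.Write(0);              // important colors

            // 32-bit rows are already 4-byte aligned, so no row padding is needed.
            writer.Write(bgra);
            writer.Flush();
            return stream.ToArray();
        }
    }
}
```

The resulting array is what would be passed to Image.LoadBmpFromBuffer. The main cost compared with the unsafe version is the extra copy out of the MemoryStream.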

I'm unsure whether I'll continue down the FLC optimization route in the near future; a 40% improvement is at least decent. If so, it will probably be time to check with Amdahl again, or perhaps to try loading a few in parallel and seeing how that goes.
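As a reminder of what "checking with Amdahl" gives us: if a fraction p of the load time is spent in a part that becomes s times faster, the overall speedup is 1 / ((1 - p) + p / s). So if one step accounts for half the load time, even making it free caps the overall speedup at 2x. A tiny helper (the name is illustrative):

```csharp
using System;

public static class Amdahl
{
    // p: fraction of total time spent in the part being optimized (0..1).
    // s: speedup factor of that part; double.PositiveInfinity means "made free".
    public static double OverallSpeedup(double p, double s)
    {
        if (p < 0 || p > 1) throw new ArgumentOutOfRangeException(nameof(p));
        if (s <= 0) throw new ArgumentOutOfRangeException(nameof(s));
        return 1.0 / ((1.0 - p) + p / s);
    }
}
```

For example, OverallSpeedup(0.5, 2) is about 1.33, and OverallSpeedup(0.5, double.PositiveInfinity) is 2.0.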

I also checked the Godot docs for ImageTexture.CreateFromImage(ImgTxtr, 7) to see what 7 means, since that is now almost half the time. The answer is that it creates mipmaps (smaller versions of the texture to use when zoomed out), repeats the texture (versus clamping to the edge; I don't know the implications of that yet), and "Uses a magnifying filter, to enable smooth zooming in of the texture". Removing the mipmaps takes the load time down from 420 ms to 265 ms (-37%); the magnifying filter and repeating do not have a measurable impact on performance. We might want the mipmaps since we have a zoom function, but it may be worth reading/experimenting more with that flag enabled and disabled, to see what impact it makes in practice. Do the mipmaps save video memory when zoomed out, or make it look better, or something else, and based on what the answer is, is it worth a longer load time?
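The 7 is a bitmask. The sketch below mirrors the Godot 3 texture flag bits as a local [Flags] enum (Mipmaps = 1, Repeat = 2, Filter = 4, per the Godot 3 docs). It is a stand-in, not Godot's own type, and shows how 7 decomposes and how to drop only the mipmaps. The resulting number is what would go in the flags argument of ImageTexture.CreateFromImage.

```csharp
using System;

// Local mirror of Godot 3's texture flag bits (values taken from the Godot 3 docs).
[Flags]
public enum TextureFlags : uint
{
    None = 0,
    Mipmaps = 1,  // generate mipmaps, used when the texture is drawn smaller
    Repeat = 2,   // repeat instead of clamping at the edges
    Filter = 4,   // smooth (linear) magnification filter
    Default = Mipmaps | Repeat | Filter // == 7
}

// Hypothetical helper name.
public static class TextureFlagHelper
{
    // Default flags with mipmap generation removed (7 -> 6).
    public static uint DefaultWithoutMipmaps() =>
        (uint)(TextureFlags.Default & ~TextureFlags.Mipmaps);
}
```

So the 420 ms to 265 ms experiment corresponds to passing 6 instead of 7, keeping Repeat and Filter.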

Puppeteer · Nov 24, 2021

Quintillus said:
(Another thought that occurs to me is that perhaps calling SetPixel on the Godot.Image a million times [literally!] is a limiting factor in performance. Could it be more efficient to use LoadBmpFromBuffer(byte[] buffer), after having created an in-memory byte array representation of the BMP? It might sound crazy, but I wouldn't be shocked if that were more efficient.)

Quintillus said:
It does feel like it should be unnecessary these days. And yes, a one-time conversion to a format Godot can slurp up natively would likely improve start-up time. Whether that's a WebP, PNG, or a serialized version of ImageTexture, I don't know.

Oh yeah, that's ringing a bell. At the time I made the PCX and FLC readers I was just trying to get the image data out, and byte arrays made the most sense as a general purpose intermediate state to wherever it goes next. (I was also making output images and videos with ImageSharp and/or ffmpeg.)

PoolByteArray is the Godot byte array thing, and you can call Image.CreateFromData() with a PoolByteArray and then ImageTexture.CreateFromImage() to get the texture, probably more efficiently than using byte arrays and SetPixel. Off the top of my head I'm not sure how we go from bytes to RGBA Color values, but the path seems to be there...ok, yeah, it looks like we'd just e.g. add 255, 0, 0, 255 to the PoolByteArray for a fully opaque red pixel with Format.RGBA8 (or Format.Rgba8, unsure of the actual case), or maybe use Format.RGB8 if we don't need to specify opacity/alpha, but I think we do.

Edit: Cross-posted with @Quintillus. Mipmaps make it scale better in one direction... down, I think? Found this on the ImageTexture page:

If the image does not have mipmaps, they will be generated and used internally, but no mipmaps will be generated on the resulting image.

Note: If you intend to scale multiple copies of the original image, it's better to call generate_mipmaps on it in advance, to avoid wasting processing power in generating them again and again.

Puppeteer · Nov 24, 2021

Here is Flag.Mipmaps plus Flag.Filter (the one that helps with scaling up) versus not. I *think* these are groups of four at the same scale, and I *think* the second from the left is 1:1 scale.

Quintillus · Nov 24, 2021

Now that you mention it, I feel like I saw PoolByteArray in the docs a few weeks ago. It makes sense that Godot would have a class designed to address this general sort of problem. That might be a good alternative to what I have done. Although I am left rather confused by this part of the docs:

Code:

void append ( int byte )

Appends an element at the end of the array (alias of push_back).

Then when I click on the int link, it tells me int is a "Signed 64-bit integer type." So shouldn't it be called something other than byte in the documentation? It also seems a bit odd that it would take a 64-bit word rather than 32-bit; at least for our purposes 32-bit would be preferable.

IOW, it may work, but I'm a little concerned by the type inconsistency in the docs.

It also occurred to me that in Java, when I need to convert a byte array, I use the DataOutputStream class, which has utility methods for writing all the primitive types to a byte buffer. A C# equivalent for that might be a good alternative to what I wrote should PoolByteArray not be the right answer (although there's a good chance PoolByteArray will do the job).
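One caveat when porting DataOutputStream-style code: Java's DataOutputStream writes big-endian, while C#'s closest analogue, BinaryWriter, always writes little-endian, which happens to be what BMP needs. When explicit control is wanted, System.Buffers.Binary.BinaryPrimitives offers both orders (built into .NET Core 2.1+; on .NET Framework it comes from the System.Memory package). A small comparison, with a hypothetical class name:

```csharp
using System;
using System.Buffers.Binary;
using System.IO;

public static class EndianDemo
{
    // BinaryWriter: always little-endian, regardless of platform.
    public static byte[] WithBinaryWriter(int value)
    {
        using (var stream = new MemoryStream())
        using (var writer = new BinaryWriter(stream))
        {
            writer.Write(value);
            writer.Flush();
            return stream.ToArray();
        }
    }

    // Same byte order as Java's DataOutputStream.writeInt.
    public static byte[] BigEndian(int value)
    {
        var buffer = new byte[4];
        BinaryPrimitives.WriteInt32BigEndian(buffer, value);
        return buffer;
    }

    // Same byte order as BinaryWriter, but chosen explicitly.
    public static byte[] LittleEndian(int value)
    {
        var buffer = new byte[4];
        BinaryPrimitives.WriteInt32LittleEndian(buffer, value);
        return buffer;
    }
}
```

Writing 0x01020304 gives the bytes 04 03 02 01 through BinaryWriter, but 01 02 03 04 through the big-endian call.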

(And regardless of which way we choose, it's been fun diving into this, and learning more C# in the process)

(Edit: Crosspost with Puppeteer.

Very nice demo of the difference mipmaps makes! It's kind of the opposite of what I'd expected based on the wording; it seems to do interpolation on zoomed-in/enlarged versions of the texture, to make it less blocky, albeit a bit blurred to achieve that)

Puppeteer · Nov 24, 2021

As far as the type weirdness...that seems to be a feature of Godot. Each Pixel of the image is a Color, and there is no way to set Color with four bytes! Each part of Color is a float from 0 to 1, but if you try to pass it values higher than 1 it seems to normalize based on the highest number which works as expected if the highest number is 255, but not if it's 128 for example. I ran into that when viewing the tile bytes as intensity values.

~~So I'm really confused why you can't SetPixel with int colors, but you can use a PoolByteArray~~. Oh wait, I just now understood for the first time ever that there *is* an int-based Color constructor...but it's an int32 (effectively, but probably just the lower 32 bits of an int64?) that represents the four 8-bit RGBA values. Still kinda confusing. At least I see why PoolByteArray might work, but now I'm wondering if it's coded as such, too, which is why you saw int64? So maybe my example of a red opaque pixel would be adding one 0xFF0000FF element to the PoolByteArray instead of 0xFF, 0x00, 0x00, 0xFF elements. Dunno, haven't tried it.
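One way to reconcile the two readings: Godot documents Rgba8 as one byte per channel, so a pixel is four separate bytes (R, G, B, A), while 0xFF0000FF is the packed RRGGBBAA form that the int-based Color constructor takes. A packed value would have to be split before going into a byte buffer; an int appended to a byte array would presumably keep only its low byte. The helpers below sketch that split, the reverse, and the 0-1 float form Color uses (each channel divided by 255). They are plain C# with a hypothetical class name, not Godot APIs.

```csharp
using System;

public static class RgbaPacking
{
    // Splits a packed 0xRRGGBBAA value into R, G, B, A bytes, in that order.
    public static byte[] Unpack(uint rgba) => new[]
    {
        (byte)((rgba >> 24) & 0xFF), // R
        (byte)((rgba >> 16) & 0xFF), // G
        (byte)((rgba >> 8) & 0xFF),  // B
        (byte)(rgba & 0xFF)          // A
    };

    // Inverse of Unpack.
    public static uint Pack(byte r, byte g, byte b, byte a) =>
        ((uint)r << 24) | ((uint)g << 16) | ((uint)b << 8) | (uint)a;

    // Color-style float channels: dividing by 255 maps 255 to exactly 1.0f.
    public static float[] ToFloats(byte[] rgbaBytes)
    {
        var result = new float[rgbaBytes.Length];
        for (int i = 0; i < rgbaBytes.Length; i++)
            result[i] = rgbaBytes[i] / 255f;
        return result;
    }

    // Builds an RGBA8-style byte buffer: four bytes per packed pixel.
    public static byte[] ToRgba8Buffer(uint[] packedPixels)
    {
        var buffer = new byte[packedPixels.Length * 4];
        for (int i = 0; i < packedPixels.Length; i++)
            Array.Copy(Unpack(packedPixels[i]), 0, buffer, i * 4, 4);
        return buffer;
    }
}
```

Dividing by a fixed 255 (rather than by the largest value present) also avoids the normalization surprise described above when the brightest channel is less than 255.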

Flintlock · Nov 25, 2021

Speaking of performance problems, I'm not satisfied with how MapView is doing. Dragging the map while zoomed out all the way is choppy, at least on my machine. Also I noticed this got worse in Godot 3.4 compared to 3.3. Do you guys have the same experience?

So I tried a little experiment (it's uploaded to the ExperimentsWithMapView branch), creating a loose layer to display terrain sprites and using that instead of the existing terrain layer which is a Godot TileMap. This improved performance significantly, measured subjectively by the fact that I can now drag the map around without stuttering. The loose terrain layer isn't yet a full replacement for the old one since it doesn't pick the terrain sprites based on neighboring terrain, but I doubt doing so would have a noticeable performance impact. Switching between textures might be a perf problem but that's solvable by sorting by texture before submitting the draw commands. The loose layer approach has other advantages like it's simpler, stateless, and more modular, so it's definitely the way to go unless we can find something even better.
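Sorting by texture before submitting draws could be as simple as a stable sort on a texture key. This sketch uses a hypothetical DrawCommand type (in Godot the actual draw call would presumably be CanvasItem.DrawTextureRectRegion in the layer's draw code). LINQ's OrderBy is stable, so tiles that share a texture keep their original relative order. One assumption to check: if overlapping sprites from different textures must be painted back-to-front, reordering across textures could change what ends up on top.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical draw command: which texture, where to draw, and which sprite to sample.
public struct DrawCommand
{
    public int TextureId;   // stand-in for a texture reference
    public int TileX, TileY;
    public int SourceIndex; // sprite index within the texture's sheet

    public DrawCommand(int textureId, int tileX, int tileY, int sourceIndex)
    {
        TextureId = textureId; TileX = tileX; TileY = tileY; SourceIndex = sourceIndex;
    }
}

public static class DrawBatcher
{
    // Stable sort: commands with the same texture keep their original order.
    public static List<DrawCommand> SortByTexture(IEnumerable<DrawCommand> commands) =>
        commands.OrderBy(c => c.TextureId).ToList();

    // How many texture switches a renderer would see when drawing in this order.
    public static int CountTextureSwitches(IList<DrawCommand> commands)
    {
        int switches = 0;
        for (int i = 1; i < commands.Count; i++)
            if (commands[i].TextureId != commands[i - 1].TextureId)
                switches++;
        return switches;
    }
}
```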

As for what even better might be, I'd like to try rendering the terrain with a triangle mesh and custom shader. I hope that can solve the problem of the seams between tiles that appear whenever the zoom scale is not 1.0. I was hoping the loose terrain layer would fix the seams but it didn't. There might be an easier way to fill the seams, like changing a rendering setting somewhere. I saw Puppeteer mention this problem before, @Puppeteer do you have any ideas as to a solution? A mesh might be overkill for this and it might cause additional problems like make it difficult to get pixel-perfect sprite drawing at 1.0 scale and limit how modular the terrain sprite selection code is (i.e. make it more difficult for modders to add terrain types). On the other hand, we might want to draw units using custom shaders to make coloring them easy so this could be a useful experiment in preparation for that.

Puppeteer · Nov 25, 2021

There are flags on the texture I think (or somewhere) for mipmaps, interpolation, and I *think* somewhere there is an edge/alpha blend setting. We should try turning all those off for base terrain and see how it changes. I definitely recall having some unwanted blending, and my suspicion is that Godot is feathering, antialiasing, or interpolating the edges against their 0 alpha neighbors, and that's causing the gaps when scaled.

In this post in the One Turn Deserves Another III thread I posted some 10x scaled test images illustrating what I think the problem is here. I recall fixing it, but I scanned through that thread multiple times and don't see where I specifically said which setting it was, but I'm pretty sure the Flag.Filter value had been the default 7 and it got better at flag 0...that will be somewhere in creating the ImageTexture object I think.

Also, I forgot that I did touch Godot before this year. That post is in 2018, and I had even coded a PCX reader in GDScript! So I guess my 2021 C# PCX code is a port of that.

Interesting that TileMap appears to be slower than individual Sprites, but now that I think about it, that makes sense. Godot's TileMap is intended to do a lot more than what we're using it for, so there is probably some extra baggage dragging the back end down a bit. I think we're using it because it's a convenient way to turn the terrain files into Sprites, but it shouldn't take too much code for us to manage dividing the image up into tiles ourselves and making individual textures or cropped textures, whatever works best.

Edit: Oh...I found that we are setting the flag at 0 for the full ImageTexture here: https://github.com/C7-Game/Prototyp...50e2d2d8e17867904cda85c2/C7/PCXToGodot.cs#L97 ... buuuut it probably needs to be set again when dividing it into tiles?

Puppeteer · Nov 25, 2021

Hmm, I don't see any per-tile setting in TileSet or TileMap that would affect filtering, and we seem to have filtering turned off at the ImageTexture level.

I now remember why I picked TileMap in the first place: I was hoping the built-in bitmask feature would have Godot select the proper sprite based on the surrounding terrain values, but that doesn't work as hoped for a couple of reasons:

- In Civ3, more than one tile may fit any given bitmask; e.g. Ocean has well over 81 all-ocean tiles, not just one.
- Original Civ3 stores a file and index reference to the correct tile for each map square, so the would-be bitmask selection happens only at map generation time. After that we have a direct pointer to the correct tile for each map square, so the display code shouldn't have to worry about that at all.
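Since each map square already carries a file and index, the display side reduces to a lookup from index to a source rectangle in the sprite sheet. This sketch assumes Civ3's base terrain sheets are laid out as 9 columns by 9 rows of 128x64 sprites (81 per file). Treat those numbers and the type names as assumptions to check against the actual PCXs.

```csharp
using System;

// Hypothetical per-square reference, as read from a save: which terrain file, which sprite in it.
public readonly struct TerrainSpriteRef
{
    public readonly int FileIndex;   // selects which sheet/texture to draw from
    public readonly int SpriteIndex; // position within that sheet

    public TerrainSpriteRef(int fileIndex, int spriteIndex)
    {
        FileIndex = fileIndex;
        SpriteIndex = spriteIndex;
    }
}

public static class TerrainSheet
{
    // Assumed sheet layout: 9 x 9 sprites of 128 x 64 pixels.
    public const int Columns = 9;
    public const int Rows = 9;
    public const int TileWidth = 128;
    public const int TileHeight = 64;

    // (x, y, width, height) of the sprite within its sheet.
    public static (int x, int y, int w, int h) SourceRect(TerrainSpriteRef tile)
    {
        int index = tile.SpriteIndex;
        if (index < 0 || index >= Columns * Rows)
            throw new ArgumentOutOfRangeException(nameof(tile));
        int col = index % Columns;
        int row = index / Columns;
        return (col * TileWidth, row * TileHeight, TileWidth, TileHeight);
    }
}
```

No neighbor inspection happens at draw time; FileIndex picks the texture and SourceRect picks the region.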

So yeah, this evolved slowly enough that I didn't realize there is very little point in messing around with TileMap and TileSet anymore.

Vuldacon · Nov 25, 2021

... in all new developments there will always be trials and errors with discoveries and in the end direct simplicity that works will prevail.

Keep up the Good Work :thumbsup:

Puppeteer · Nov 26, 2021

@Flintlock , hold up on optimizing the map display for the moment, please. I'm currently (today, right now) trying to get the map display code from TempTiles (and ~~LegacyMap~~ Civ3Map) in the map-tile-overhaul branch.

As a side effect, I'm also having to finally invent a save file format, but I'm integrating it with what we already have as far as code structure.

Quintillus · Nov 26, 2021

Flintlock said:
Speaking of performance problems, I'm not satisfied with how MapView is doing. Dragging the map while zoomed out all the way is choppy, at least on my machine. Also I noticed this got worse in Godot 3.4 compared to 3.3. Do you guys have the same experience?

I don't think I'd tried that before, as usually I zoom in almost right away. But now that I try it, yes, it is choppy on the 3.6 GHz 2500k. Not sure if it's worse than with 3.3, since I hadn't tried it before. Seems to be pegging a CPU core, when I do it with Task Manager up next to it. The skiing Warrior stops skiing while I drag it around, too.

It makes me wonder what we can do to more objectively measure performance. I'm sure there's an FPS monitor of some sort in Godot. But even if not, we may be able to find a way. I implemented one in my editor, which had no such concept natively, to be able to measure rendering performance. Which is a lot lower than where I'd like it to be, but it might actually be better at a similar zoom as I compare them side-by-side, suggesting there is indeed a lot of opportunity for improvement.
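For an objective number, Godot 3 does expose Engine.GetFramesPerSecond(), but a rolling window of frame times is also easy to keep ourselves. It shows stutter more directly through the worst frame, which an average can hide. This sketch assumes it is fed the delta (in seconds) that Godot passes to _Process; FrameTimeMonitor is a hypothetical name.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical rolling frame-time monitor.
public class FrameTimeMonitor
{
    private readonly Queue<double> frameTimes = new Queue<double>();
    private readonly int windowSize;

    public FrameTimeMonitor(int windowSize = 120)
    {
        if (windowSize <= 0) throw new ArgumentOutOfRangeException(nameof(windowSize));
        this.windowSize = windowSize;
    }

    // Call once per frame with that frame's duration in seconds.
    public void AddFrame(double deltaSeconds)
    {
        frameTimes.Enqueue(deltaSeconds);
        if (frameTimes.Count > windowSize)
            frameTimes.Dequeue();
    }

    // Average FPS over the window.
    public double AverageFps => frameTimes.Count == 0 ? 0 : frameTimes.Count / frameTimes.Sum();

    // Slowest frame in the window, in milliseconds; spikes here are what feels choppy.
    public double WorstFrameMs => frameTimes.Count == 0 ? 0 : frameTimes.Max() * 1000.0;
}
```

Logging AverageFps and WorstFrameMs while dragging the zoomed-out map would give comparable before/after numbers for the TileMap and loose-layer versions.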

I do think it's a good idea to hit pause while Puppeteer merges the changes we want from LegacyMap. I took a look at it and saw the point that we'd want stuff from it, but have been celebrating Thanksgiving too much to do more than that.

Puppeteer · Nov 26, 2021

Puppeteer said:
I'm also having to finally invent a save file format

FYI, "Inventing a save file format" has mostly reduced down to serializing GameData after a couple of false starts. I have an importer to read a Civ3 save into a GameData–at least with enough map info to display the base terrain–and think I'm close to the point of swapping the Development display code out for what was called LegacyMap but is now called Civ3Map.

I do think we'll want to break out BIQ-analog stuff into a different hierarchy because that's basically what a mod is, and right now terrain types are in GameMap. But I was successfully able to prevent myself from getting lost in coding minutiae for now.

Puppeteer · Nov 27, 2021

Ok, I merged my map overhaul. It actually works and looks very much like before, but we're no longer generating terrain, we're loading it from JSON. (This might break; we may have to move the json file in the C7 folder for exporting, but not sure yet.) And of course we have desert, tundra, sea, and ocean as well as the plains, grass, and coast we've been staring at.

I was concerned about integrating it with what Flintlock did with MapView, but actually that was easy once I realized I just needed to pass the TileSet and the tile reference matrix which work exactly the same between the old and new version.

So CreateGame() now loads a GameData (and its map) from JSON, and then it runs createDummyGameData() which has been modified to remove map generation and therefore no longer needs a noise generator parameter.

I copied the three logical tile types into ImportCiv3 (new), and it maps desert and plains to plains, grass and tundra to grass, and all water to coast. So the game logic is unchanged, and the find-locations loop still finds only land tiles.

I don't have an interface yet, but the code is there to read a Civ3 file directly into GameData, and to save and load GameData as JSON. (Well, actually C7SaveFormat, which has a GameData as its most prominent field.)

So on the one hand, kinda huge changes. On the other hand, it's pretty much in the exact same working state it was before.

Edit: I kinda went out-of-scope for this thread. Oops. Basically I'm saying everyone can try optimizing again.

Flintlock · Nov 27, 2021

Puppeteer said:
Hmm, I don't see any per-tile setting in TileSet or TileMap that would affect filtering, and we seem to have filtering turned off at the ImageTexture level.

It occurred to me that we might be able to fix the seams easily by slightly scaling up the target rects for texture drawing. That should work if the screen pixels in the seams are just on the edge of getting filled by opaque pixels from the sprites.
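Slightly enlarging each destination rect could look like the sketch below. Each side grows by a fraction of a screen pixel, divided by the zoom so the overlap stays about the same size on screen. It assumes rects are in map (world) coordinates and the layer is drawn at scale zoom. RectF is a hypothetical stand-in (Godot's Rect2 has a Grow method for the enlarging part), and the 0.5-pixel default is a guess to tune, not a known-good value.

```csharp
using System;

// Hypothetical minimal rect type; in Godot, Rect2 would be used instead.
public readonly struct RectF
{
    public readonly float X, Y, Width, Height;

    public RectF(float x, float y, float width, float height)
    {
        X = x; Y = y; Width = width; Height = height;
    }
}

public static class SeamPadding
{
    // Grows a world-space destination rect so it covers extra screen pixels on every side.
    // zoom: world-to-screen scale (1.0 = unscaled). screenPixels: overlap per side.
    public static RectF Pad(RectF dest, float zoom, float screenPixels = 0.5f)
    {
        if (zoom <= 0) throw new ArgumentOutOfRangeException(nameof(zoom));
        float worldPad = screenPixels / zoom;
        return new RectF(dest.X - worldPad, dest.Y - worldPad,
                         dest.Width + 2 * worldPad, dest.Height + 2 * worldPad);
    }
}
```

Since pixel-perfect drawing at 1.0 scale matters, the padding could be skipped when zoom is exactly 1.0, where the seams don't appear.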

Puppeteer said:
Interesting that TileMap appears to be slower than individual Sprites

One of the reasons it's so slow is that MapView doesn't use TileMaps the way they're (presumably) intended to be used. Whenever the camera moves or anything in view changes, MapView will clear and refill the terrain TileMap with the tiles in view. Even so, it's bafflingly slow. Our current "dummy" map is only 3200 tiles and each terrain tile code is just an int. It's absurd that storing <= 3200 ints could somehow take a substantial fraction of a second. Maybe Godot is blocking while it waits to upload the changes to VRAM, for every single change? Just a guess but there must be some reason like that.

Puppeteer · Nov 27, 2021

I would suggest optimizing for speed before trying to fix the gaps. I feel really strongly that it's an interpolation or antialiasing issue and we just need to find the right lever to flip. And if we're getting rid of TileMap we may accidentally stumble across it, or maybe find that TileMap was trying to blend what it thinks are rectangular tiles behind our backs, while we're cheating and overlapping the rectangles because of the fake isometric view.

Flintlock · Nov 27, 2021

The first thing I'm going to do is recreate that optimization, changing the terrain layer over to a loose layer, and I'm going to go ahead and merge it in instead of leaving it on an experimental branch. (As an aside, I tried merging the current version of Development into ExperimentsWithMapView but oddly I didn't see MapView.cs listed as part of the merge. Is that because the merge would revert it to an earlier version? Anyway I expect it'd be easier to redo the changes manually.)

By the way, do you have any plans for the Civ3Map class beyond what it's currently being used for? Because once the TileMap is dropped from the terrain drawer, TerrainAsTileMap won't be needed anymore, and I intend to move the terrain sprite sheet loading into the terrain layer constructor. That would leave Civ3Map almost empty.

Edit: Done. The map can now be dragged around without a hitch while displaying all the terrain sprites. Unfortunately the seams are still there.
