Profiling for turn time processing speed evaluation

Thunderbrd · Jul 17, 2019

This post is for the programmers here. I'll start with the discussion that Koshling and I had a long time back:

Koshling said:
Ok - you'll need Chipotle enabled (I assume you generally have it anyway?) and you'll need a debug build and a spreadsheet to view the results in (I use Excel but any spreadsheet should suffice).

When running the debug version and wanted to profile a turn end enable profiling (ctrl-D as I recall pops up the window [or maybe its shift - some meta-D combo]). Tick DLL profiling and the other obvious sounding one (I forget what the exact options are). After the end-turn has completed and it's your turn again you should find the profile output file in your logs folder - it's called IFP_log.txt

It will contain multiple sections separated by header lines of the form:

Code:

[518965.125] Fn Time (mS) Main thread time (mS) Avg time #calls Child time Self time Parent Alternate Time

These are profiles for each time slice. There will be one big one for the actual end-turn processing - basically scan down visually until you find a large section of data, which will be the one you want and then chop everything before and after it out of the file (leave its header line but remove all previous section, and remove the sections following including their starting header line). Once that is done you should be able to import it into a spreadsheet and get properly columned formatting.

Generally I sort the spreadsheet on the 'Fn time' column, which includes time in things called from that function (so its time in a subtree rooted in each function). Hot-spots will hopefully be visible, especially comparing a save that has a performance problem to one that does not.

I said:
Where do you put the Chipotle code again? It was something I read about when I was just getting setup but I've never used it because I'd perceived it as something purely for enabling Python writing mid-game.

Koshling said:
Chipotle also enables tons of extra shift/ctrl combo hover texts which can be very useful when debugging to see a bit more about what is going on (for instance they tell you the unit and group ids of units, which is invaluable to know how to set breakpoints on those units when trying to debug what they do). If you search these boards for 'Chipotle' you should find what you need (and also a summary of the extra commands it enables, like autoplay and swap players etc.). It goes in your INI file:

Code:

CheatCode = chipotle

Alberts2(after reading the above) said:
Koshling said you have to use a debug build, but doing so could give different results as profiling a release or final release build. This is because of debug build specific code(the release build also has some of that) and compiler optimizations. This means to be sure there is a real performance problem you have to also profile at least with release build and if the same function stands out you have a winner. Profiling the debug build can be usefull to see and track the impact changes make, but to see their real life impact you have to profile the optimized build as well because that is what people use. The profiler is at the moment active in the debug and release builds but not in the final release.

I said:
I had similar thoughts on the subject - but would the debug dll have that many processes that differ so greatly that it couldn't still point out hotspots with fair accuracy? I wouldn't begin to know how one would be able to turn it on for the final release vs the debug dll...

Alberts2 said:
It could differ alot in some places. You have to set the define USE_INTERNAL_PROFILER also for the final release build to enable the profiler for it.

I think the conversation ended there and I have probably not done any further profiling for myself since then because well... I did not know where or how to set the define he was talking about. Perhaps someone can clarify that for me and hopefully all of this serves as a good enough start to the conversation about how to profile CivIV properly.

KaTiON_PT · Jul 17, 2019

Thanks for starting this thread.

There's also an alternative called Very Sleepy brought forth by @Nightinggale a while ago in this thread which has shown very good results even compared to the profiler that exists in Visual Studio. @billw2015 was able to discern all functions without having broken ones.

billw2015 · Jul 17, 2019

KaTiON_PT said:
Thanks for starting this thread.

There's also an alternative called Very Sleepy brought forth by @Nightinggale a while ago in this thread which has shown very good results even compared to the profiler that exists in Visual Studio. @billw2015 was able to discern all functions without having broken ones.

Although caveats are required, just because it is resolving most function calls, doesn't mean it got it correct, and the results it gives show some weird artifacts. It appears to be treating wait operations as cpu time, causing most of the top results in that image I think. In one way this is useful, assuming it is correct: 6 seconds waiting at critical section is pretty bad over a 130s end turn time, it indicates some fairly bad contention potentially.

I'm going to try the inbuilt profiler now and see what it looks like.

Thunderbrd said:
I think the conversation ended there and I have probably not done any further profiling for myself since then because well... I did not know where or how to set the define he was talking about.

I definitely wouldn't worry about separately profiling final release vs release. I find about a 5% difference in performance between them (based on just timing a round, so take it with a grain of salt), which means nothing in there is likely to show up much different on a profile, it is probably mostly the profile code itself that is making the difference. It is debug vs release that is the real change and the internal profiler is enabled for both of these currently.

Thunderbrd · Jul 17, 2019

billw2015 said:
I definitely wouldn't worry about separately profiling final release vs release. I find about a 5% difference in performance between them (based on just timing a round, so take it with a grain of salt), which means nothing in there is likely to show up much different on a profile, it is probably mostly the profile code itself that is making the difference. It is debug vs release that is the real change and the internal profiler is enabled for both of these currently.

We were talking about final release vs debug.

billw2015 · Jul 17, 2019

Thunderbrd said:
We were talking about final release vs debug.

Yeah I meant forget profiling final release full stop, the only difference between release and final release (that I can see) is that release has profiling enabled. This makes sense, it is pretty common in games to have a build that has profiling (+ other dev tools in some cases) enabled, which are then disabled in shipping build, and for there to be no other difference.
All I needed to do is enable cheatmode, Ctrl+D and this:

Here are my results from the internal profiler:
https://docs.google.com/spreadsheets/d/1S5vmLsKS1hnL1ipSzOMw0bjc4b3A6GNJsE5SyVPWN6E/edit?usp=sharing

Nightinggale · Jul 17, 2019

billw2015 said:
Although caveats are required, just because it is resolving most function calls, doesn't mean it got it correct, and the results it gives show some weird artifacts. It appears to be treating wait operations as cpu time, causing most of the top results in that image I think. In one way this is useful, assuming it is correct: 6 seconds waiting at critical section is pretty bad over a 130s end turn time, it indicates some fairly bad contention potentially.

You should set the root to CvGame::doTurn (I think that's the one, didn't look it up right now) and then profile while hitting end turn. Stop when the next turn starts and you have a measurement of the most time critical part of the game: the wait for the AI. Since the active CPU core is supposed to be 100% loaded during the AI turn, counting waiting time shouldn't be an issue.

I admit I haven't actually tried measuring multithreaded code. Maybe it doesn't work if threads are waiting for each other if it counts the wait as CPU load. This is actually an interesting question I haven't considered until right now. So far mods for the civ4 engine have all been single threaded when I looked at them.

Thunderbrd said:
We were talking about final release vs debug.

It makes no sense to profile debug. You need to know which parts of the code causes slowdowns after the compiler has run optimization. The debug DLL skips optimization because the debugger can't link code to source lines if optimized. You need to profile the release DLL. I added a profile DLL in Makefile 2, which is just as optimized as the release, but it has the symbols, meaning profiling can be linked to the lines in the code. That would be the ideal DLL for profiling.

Thunderbrd · Jul 17, 2019

Nightinggale said:
It makes no sense to profile debug. You need to know which parts of the code causes slowdowns after the compiler has run optimization. The debug DLL skips optimization because the debugger can't link code to source lines if optimized. You need to profile the release DLL. I added a profile DLL in Makefile 2, which is just as optimized as the release, but it has the symbols, meaning profiling can be linked to the lines in the code. That would be the ideal DLL for profiling.

We have a very unique Makefile in use in our setup to help with enabling different file paths in use in an SVN team environment. Someone would need to adapt yours into ours for this but I'd invite them to do it!

billw2015 said:
Yeah I meant forget profiling final release full stop, the only difference between release and final release (that I can see) is that release has profiling enabled. This makes sense, it is pretty common in games to have a build that has profiling (+ other dev tools in some cases) enabled, which are then disabled in shipping build, and for there to be no other difference.
All I needed to do is enable cheatmode, Ctrl+D and this:

I think Alberts2 then did some things to enable this with the final release. As I understand it we only use the final release and debug dlls. Previous to this discussion with Alberts2, Koshling had instructed me on using the debug dll for the profile.

billw2015 · Jul 17, 2019

Nightinggale said:
...and then profile while hitting end turn. Stop when the next turn starts and you have a measurement of the most time critical part of the game: the wait for the AI.

Yeah all my profiles are just from when I click end turn to when the next turn starts, this is what I have posted here.

Nightinggale said:
I added a profile DLL in Makefile 2, which is just as optimized as the release, but it has the symbols, meaning profiling can be linked to the lines in the code.

Release in the Makefile, appears to be what I would usually call a profile build: it has optimization on compiler, and /DEBUG and pdb on linker, and the internal profiler enabled. Can you please post your Makefile2 or at least the compile and linker flags so I can compare if it isn't too much trouble?

Thunderbrd · Jul 17, 2019

billw2015 said:
Release in the Makefile, appears to be what I would usually call a profile build: it has optimization on compiler, and /DEBUG and pdb on linker, and the internal profiler enabled. Can you please post your Makefile2 or at least the compile and linker flags so I can compare if it isn't too much trouble?

A link is in his signature line.

billw2015 · Jul 17, 2019

Thunderbrd said:
Someone would need to adapt yours into ours for this but I'd invite them to do it!

Yeah I will have a go, it looks good. At least I can steal the nice features anyway like multicore build! I would like to put the fastdep.exe into the repo as well, it is tiny and makes no sense you have to download it separately to build the dll.

Thunderbrd · Jul 17, 2019

billw2015 said:
Yeah I will have a go, it looks good. At least I can steal the nice features anyway like multicore build! I would like to put the fastdep.exe into the repo as well, it is tiny and makes no sense you have to download it separately to build the dll.

Cool... works for me. I imagine @alberts2 won't have any issue with that either but I figure he should say if he has any objection.

billw2015 · Jul 17, 2019

What is the purpose of Release build vs Final_Release? It looks like it is meant to be the profile build really?

Thunderbrd · Jul 17, 2019

billw2015 said:
What is the purpose of Release build vs Final_Release? It looks like it is meant to be the profile build really?

That's possible. I have never used the release build myself so it's something another modder has been using, I presume Alberts2.

Nightinggale · Jul 17, 2019

billw2015 said:
Yeah I will have a go, it looks good.

It should be good. It took ages to get right even though I at that time was already experienced with GNU make. Most of the remaining imperfections are due to limitations with nmake rather than the makefile itself. Stuff like copy pasted lines instead of writing it once and use wildcards. Apparently nmake is primitive when it comes to wildcards.

billw2015 said:
At least I can steal the nice features anyway like multicore build!

Go for it. I wrote the makefile to be used. If you feel like you need to modify it, then do so, but do make sure the top of the file clearly identifies which makefile it is. I do have a modified version of Makefile 2.5 as well (mod specific) and the top states it is mod specific and not the general distributed makefile.

Be aware that compiling with symbols is single threaded, meaning it boosts compile speed of release, but not debug or profile. Also it compiles the precompiled header (first file) and linking (last step) on a single core as well. Between that, it can compile one file on each core. Having said that, I tend to compile twice as fast when using 4 cores than when using a single core.

Btw I highly recommend the assert DLL build. It's optimized like the release, but it has all the assert checks. Playing with that one is fast enough to play for fun (like 30% slower) and you will be notified whenever an assert check fails. This makes it much easier to locate bugs and make the game run more stable.

Thunderbrd · Jul 17, 2019

Nightinggale said:
Btw I highly recommend the assert DLL build. It's optimized like the release, but it has all the assert checks. Playing with that one is fast enough to play for fun (like 30% slower) and you will be notified whenever an assert check fails. This makes it much easier to locate bugs and make the game run more stable.

Interesting - some of these asserts we have that have been going on forever may actually find some resolution if we start using this.

billw2015 · Jul 17, 2019

Nightinggale said:
Btw I highly recommend the assert DLL build. It's optimized like the release, but it has all the assert checks.

Yeah definitely I will add it. Probably I won't bother with the multicore actually seeing how much extra stuff it requires (i.e. not just a compile flag!) and that it doesn't work for the builds we are regularly compiling anyway!

f1rpo · Jul 18, 2019

Is C2C already being compiled with whole program optimization (/GL)? If not, then @devolution's final release configuration could be of interest as well: GitHub link
As a result, we've seen a ~10% speed-up of AI turns.

olduser · Jul 18, 2019

f1rpo said:
Is C2C already being compiled with whole program optimization (/GL)? If not, then @devolution's final release configuration could be of interest as well: GitHub link
As a result, we've seen a ~10% speed-up of AI turns.

Doesn't work in C2C because C2C uses parts of boost 1.32 which break whole program optimization.

olduser · Jul 18, 2019

billw2015 said:
Although caveats are required, just because it is resolving most function calls, doesn't mean it got it correct, and the results it gives show some weird artifacts. It appears to be treating wait operations as cpu time, causing most of the top results in that image I think. In one way this is useful, assuming it is correct: 6 seconds waiting at critical section is pretty bad over a 130s end turn time, it indicates some fairly bad contention potentially.

The multithreading in C2C is really inefficent causing lots of waits and it is no longer needed. It was added years ago them the turn times where alot longer as they are now. The reason for those long turn times was really inefficient coding in some places which i optimized alot.

That multithreading will be removed in a big commit including some other performance improvements some time after v39 is released .

billw2015 · Jul 18, 2019

alberts2 said:
That multithreading will be removed in a big commit including some other performance improvements some time after v39 is released .

Thats good, can I get these changes from a branch or something? I don't want to do much coding if I don't have an accurate profile, or I will get a bunch of conflicts!

Profiling for turn time processing speed evaluation

C2C War Dog

Emperor

King

C2C War Dog

King

Deity

C2C War Dog

King

C2C War Dog

King

C2C War Dog

King

C2C War Dog

Deity

C2C War Dog

King

plastics

Emperor

Emperor

King

Similar threads