DLL development discussions

@Koshling, did you know that since multicore processors arrived, there are no atomic memory instructions, even on aligned addresses? Even a simple read or write isn't atomic! So I have to analyse everything from the beginning again. :(

Or do we assume the game can, and always will, use only one core at a time? I have a feeling it could crash if not even a single CPU instruction were atomic, but I may be wrong.

Not true. On a single-CPU (not single-core) system, memory reads/writes are atomic (well, serialized via the cache, so logically atomic). On a multi-CPU system you have to use interlocked instructions, which is why you'll find several places where things like InterlockedIncrement() and so on are used. These force cache convergence on multi-CPU systems (so they carry a penalty), and thus guarantee consistency.

Also, the use of critical sections in the current code ensures that multiple threads are not operating on the same data at the same time in most cases (the exceptions should all be using Interlocked instructions as discussed above, but for the most part critical sections guard the bits that matter). The result is that most of the time the various threads operate concurrently, but without trying to access the same data. On the (relatively rare) occasions they contend for access to a piece of data, they are serialized via critical sections (or, in a few simple cases, use of Interlocked methods).
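The InterlockedIncrement() pattern described above can be sketched in portable C++11 with std::atomic, which compiles down to the same LOCK-prefixed read-modify-write on x86. This is an illustrative stand-in, not the DLL's actual code:

```cpp
#include <atomic>

// Shared counter. In the Win32 DLL this would be a LONG touched only via
// InterlockedIncrement(); std::atomic is the portable C++11 equivalent.
std::atomic<long> g_counter(0);

// Safe under concurrent callers: the increment is a single atomic
// read-modify-write, so no update can be lost between threads.
long safe_increment() {
    // fetch_add returns the OLD value; InterlockedIncrement returns the
    // NEW value, so add 1 to match its convention.
    return g_counter.fetch_add(1) + 1;
}
```

With a plain `long` and `++`, two threads could both read the same old value and one increment would be lost; the atomic read-modify-write makes that impossible.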
 
Not true. On a single-CPU (not single-core) system, memory reads/writes are atomic (well, serialized via the cache, so logically atomic).
I'm not aware of exactly how the cache works, but on StackOverflow I got an answer like this when I asked about a simple MOV reg, mem (Intel operand order):
P6+ also guarantees atomicity of unaligned dword reads that do not cross cache line boundaries. But you don't have that guarantee either
So does that mean that if a dword were out of the cache and two cores requested it at exactly the same moment, that would de-atomize them?
 

Yes, unless the accessing instructions are interlocked (which causes the cache to serialize access). That's why you must either ensure you don't have two threads working on the same data at the same time (critical sections etc.) OR ensure that they only manipulate it via Interlocked instructions.
 
Do you know union-find? I believe so. My structure accesses elements in a similar way. So even fetching involves writing, and since I have a lock per element rather than per whole structure, this complicates the case. If I can't assume the code runs on one core, it will require two locked instructions per get-element operation. And locked instructions are heavy, from what I've read: something like ~100 cycles.

Or maybe... would it be safe to read the byte that holds a lock (just a one-bit flag) and then do a Bit-Test-and-Set instruction on it? Would the first read guarantee the value is still cached one instruction later?

But even so, if this code needs to be safe for multi-CPU machines, that wouldn't help with the locking, if I understand things correctly. And I'm not sure whether two LOCKs per object fetch wouldn't eat up the speed advantage I gained.
 

You can use the InterlockedBitTestAndSet() function (see http://msdn.microsoft.com/en-us/library/windows/apps/ms683549(v=vs.85).aspx). This uses intrinsics to interlock the instruction at the machine level, and achieves what you want. Like all Interlocks it (necessarily) creates a memory fence though, so it costs tens of cycles.
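What InterlockedBitTestAndSet() does can be sketched portably with std::atomic::fetch_or standing in for the LOCK BTS intrinsic. This is an illustration of the semantics only; the real Win32 function takes a `LONG volatile *` and a bit offset:

```cpp
#include <atomic>
#include <cstdint>

// Portable sketch of InterlockedBitTestAndSet(): atomically set bit
// `bit` in *flags and return the bit's PREVIOUS value, all as one
// indivisible read-modify-write (a LOCK BTS on x86 does the same).
bool interlocked_bit_test_and_set(std::atomic<std::uint32_t>* flags, int bit) {
    std::uint32_t mask = 1u << bit;
    std::uint32_t old = flags->fetch_or(mask);  // atomic OR, returns old word
    return (old & mask) != 0;                   // previous value of that bit
}
```

Because the old value comes back from the same atomic operation that sets the bit, two threads racing on the same bit can never both see it as clear, which is exactly the per-element lock acquisition you need.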
 
I'm still curious about this idea with the dummy read. Do you know whether it would work at least for applications guaranteed to run on a single CPU?

Precisely, would this work?
Code:
MOV ax, [ebx + n]          ; dummy read, to pull the word into cache
retry:
BTS WORD PTR [ebx + n], 0  ; test-and-set bit 0; CF gets the old bit value
JC retry                   ; bit was already set: lock held, so retry
OK, there is one instruction between the read and the BTS here, but it would still be strange if the cache could forget a value that fast.

Like all Interlocks it (necessarily) creates a memory fence though, so it costs tens of cycles.
I've read that MFENCE also costs something like 100 cycles. Or could that be outdated information?

Hmm, or maybe I should use a fence in the code I posted? I'm still not aware of what exactly it does. Does it ensure that the data used by previous instructions is loaded into the cache before the next instruction?
 

It ensures that on multi-processor systems the test-and-set is atomic with respect to another thread reading the same WORD at the same time. Without it, your code can have two threads execute the same instruction at the same time on the same WORD, and both see the same value and the same test-and-set result. I'm not 100% certain whether ALL multi-CORE machines that are NOT multi-CPU require it; it's a very architecture-dependent question. However, the use of compiler intrinsics means the compiler should only issue the interlock for CPU architectures that require it, I think.

In general your code is unsafe without it.
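For what it's worth, the "dummy read" idea is essentially the classic test-and-test-and-set optimization, which is safe as long as the final set is still an interlocked operation. A portable C++11 sketch (illustrative, not C2C code):

```cpp
#include <atomic>

// Test-and-test-and-set spin lock: the "dummy read" idea made safe.
// The plain load spins cheaply out of the local cache; the expensive
// interlocked exchange (LOCK BTS/XCHG equivalent) is only attempted
// once the lock looks free.
class SpinBitLock {
    std::atomic<bool> m_locked;
public:
    SpinBitLock() : m_locked(false) {}

    void lock() {
        for (;;) {
            // cheap read: no bus locking while we wait
            while (m_locked.load(std::memory_order_relaxed)) { /* spin */ }
            // the actual atomic test-and-set; may still fail if another
            // thread grabbed the lock between our read and this exchange
            if (!m_locked.exchange(true, std::memory_order_acquire))
                return;
        }
    }
    bool try_lock() {
        return !m_locked.exchange(true, std::memory_order_acquire);
    }
    void unlock() {
        m_locked.store(false, std::memory_order_release);
    }
};
```

Note the plain read only reduces contention; correctness still comes entirely from the interlocked exchange, which is why the read alone is not enough on multi-CPU machines.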
 
I've read the Intel documentation a little, and it seems it could be safe for multi-core single-CPU machines if the cores had a single shared cache. But if that has to hold for the L1 cache, then it's indeed useless.

OK, going back to C2C. From what I recall, you've said there can't be two assignments to one id field at once. But a read plus a write is also a threat. And even a read and a write of a single int would be, if we can't assume atomicity even of a single instruction. So are all such operations in critical sections?

If I could assume that nothing writes references (the equivalents of ids) while something else reads them, I could reduce the number of locked operations.
 

All such accesses (concurrent read/write) are (or should be, anyway) either:

1) In critical sections; or
2) Using Interlocked methods; or
3) Unable to occur concurrently for any particular instance. Where we do multi-thread, we tend to divide along entity lines so that this is the case for most things, e.g. one thread deals with one city while another deals with another. The only locking needed is then on commonly accessed player-level data and similar (where a write is possible; multiple reads obviously don't require locks if no writes can happen in the concurrent period), rather than on the (mostly accessed) city data. This is why FFreeListTrashArray needs locking: multiple cities processing at once can cause units to be created, which is a potential concurrent write (it can lead to a read-write or write-write conflict) on the CvPlayer::m_units instance (just an example, of course).
 
So can I assume there are no simultaneous reads/writes (like id1 = id2 and id2 = id3, or calls to getSomething(x, y) and setSomething(x, y, something_else) happening at once)?

But by the way, why are cities handled in distinct threads? I thought the game could currently use only one core. In that situation multiple threads only make sense when one thread is waiting for something, and I don't see what city evaluation could be waiting for.

Or when we want multiple results to arrive more or less simultaneously. But those things happen during turn evaluation, so who cares about simultaneous results?
 

V32 added multi-core processing of cities. They are now evaluated in parallel by a thread pool (default size 4 threads). Property evaluation has also been multi-threaded since AIAndy added it about a year ago, and animal spawning is multi-threaded too (and has been for a while).

No, you cannot assume there are no concurrent reads/writes or accessor method calls. What you can assume is that such reads/writes/accessor methods either:

1) Occur within call chains protected at a higher level (e.g. by critical sections higher up that ensure they are only called by one thread at once); or

2) Are protected locally in their implementation (either by use of Interlocked methods to access the variables, or, more commonly, by critical sections in the accessor methods themselves, as in FFreeListTrashArray).
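Option 2, locking inside the accessor itself in the FFreeListTrashArray style, might look like the following portable sketch. std::mutex stands in for the Win32 CRITICAL_SECTION, and the class and methods here are hypothetical illustrations, not the DLL's real API:

```cpp
#include <map>
#include <mutex>

// Hypothetical id-to-value container whose accessors lock internally,
// so callers need no external synchronization for single get/set calls.
class IdTable {
    mutable std::mutex m_lock;   // stands in for a Win32 CRITICAL_SECTION
    std::map<int, int> m_data;
public:
    void set(int id, int value) {
        std::lock_guard<std::mutex> guard(m_lock);  // serializes writers
        m_data[id] = value;
    }
    int get(int id) const {
        // readers lock too, since a concurrent write may be in progress
        std::lock_guard<std::mutex> guard(m_lock);
        std::map<int, int>::const_iterator it = m_data.find(id);
        return it == m_data.end() ? -1 : it->second;
    }
};
```

Note this only makes each individual call atomic; a read-modify-write sequence spanning two calls still needs protection at a higher level, which is exactly option 1 above.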
 
Sorry for asking, but I'm really interested:
In general, a multi-core system must use some CPU resources to keep data consistent, correct?
The graphical engine will use the first core no matter how you program the DLL/Python?
If I'm right, would it be more useful to offload all possible CPU usage to the second core only (if the gamer has it)?

With such an approach there would be no need to program multi-core support, and the mod would be sped up... IMHO.
 
@Koshling, by simultaneous I meant exactly the same point in time. If there is some protecting mechanism like critical sections, they aren't done at exactly the same time. ;)

@masaykh, when you click a turn the graphical engine isn't much use anymore until your next turn, so it's actually good that all cores can be used for turn evaluation. And during your turn there is really not much to be done beyond the graphical engine, so there is no point optimizing it like that.

It would probably even be good to free some CPU cycles from the graphical engine during turn evaluation. But that is rather outside our possibilities. Koshling, can we somehow slow down the main loop of the graphics engine, by reducing the FPS or something?
 
@masaykh, when you click a turn the graphical engine isn't much use anymore until your next turn, so it's actually good that all cores can be used for turn evaluation. And during your turn there is really not much to be done beyond the graphical engine, so there is no point optimizing it like that.

Mmm... no. I mean the graphical engine is beyond our ability to usefully control.
It only knows about one core, even if you have four of them.

So maybe if we use the second core only for all evaluations, or the most unused one (if that can be determined), and don't use multi-core instructions at all, can we free some resources that way?
 
@Koshling, by simultaneous I meant exactly the same point in time. If there is some protecting mechanism like critical sections, they aren't done at exactly the same time. ;)
You're splitting hairs. I mean at the same time relative to the programming model. If you want to talk about physical events occurring in the hardware, then regardless of locks the reads/writes do not happen at the same time, because they are serialized by latches on the cache line buffers.

@masaykh, when you click a turn the graphical engine isn't much use anymore until your next turn, so it's actually good that all cores can be used for turn evaluation. And during your turn there is really not much to be done beyond the graphical engine, so there is no point optimizing it like that.

It would probably even be good to free some CPU cycles from the graphical engine during turn evaluation. But that is rather outside our possibilities. Koshling, can we somehow slow down the main loop of the graphics engine, by reducing the FPS or something?
Masaykh is quite correct: because we have no control over the graphics engine we are somewhat constrained. Specifically, this is what occurs:

1) All computation that has to interact with the external model (i.e. the engine) is serialized (for that interaction) so that it only occurs while the MAIN thread (i.e. the one the graphics engine knows about) is inside CvGame::update(). Since a call to CvGame::update() spans the entire processing of a turn for an AI, this is actually quite a long time, but indeed most background threads (in particular the city-processing ones) must all rendezvous back with the main thread before it can be allowed to exit from CvGame::update().

2) Computation that does NOT (directly) interact with the game engine (i.e. just changes internal state of the DLL's model, subject to suitable locking) can free-run, but by definition it must be running on a different core to the main thread (or only run when that thread is suspended), or at least a different virtual core in the case of hyper-threaded CPUs.

The net effect is to make the game engine believe everything is happening on one thread (since that's all it can cope with), by having the main thread act as a sort of proxy for any interactions.
 
Don't teach your grandmother to suck eggs. I understand what you are saying. If we move "all evaluations" to one core, we lose the power of the other cores. Expensive locking instructions are rather rare compared to ordinary instructions. So even if we saved 10% of one core's cycles (which I don't believe we would), we would lose the whole potential of the other cores.

Something like this would only be possible if we could split the work into groups that use disjoint subsets of memory. Then we could run each of these groups on a distinct core. But I don't know what the chances are of doing that; it depends on the code.

Koshling, seriously, do you think there are so many critical sections and interlocked calls that we could gain more than 1% speedup with this?

If we organized the code in some good order, we could gain from critical sections not blocking each other. But we would need to call EnterCriticalSection and interlocked instructions quite often for there to be a significant advantage just from dropping LOCKs and FENCEs.
 

The speedup in city processing when I added it in V32 was over 50%, so yes.
 
@Koshling, I first thought masaykh was only talking about replacing EnterCriticalSection with an unlocked CMPXCHG plus JZ and moving all non-engine evaluation to another core. This could help only if the power currently consumed by the threads running on the other cores is small. But the gain would certainly not come from giving up LOCK and FENCE; it would come from moving more work to another core. Still, I am not sure now. Maybe he meant something else.
 

You're right, "moving whole non-engine evaluation to another core" is exactly what I mean.

Can you suggest an accurate way to see how much CPU the non-engine code uses? I use Process Explorer to see how many cycles it takes, but it only shows the overall figure for the process... I'll recheck it.
 

The built-in profiler gives accurate enough timings. However, moving it all to a non-engine core won't work for the most part, because you can only make callbacks to the engine on its own thread (or else crashes result), and much of the DLL evaluation (i.e. AI turn processing) has to make those callbacks (to mark plots or UI entities dirty, retrieve text, add messages, call Python, ...). To a very large extent that means you can only run the non-engine evaluation code when the engine thinks it should be running (i.e. while the engine thread is in CvGame::update()). By necessity that means the game engine itself is doing nothing at that time: it's in CvGame::update() and thinks it is executing the AI turn, whether in fact it is, or several other threads actually are. For that reason there is little point trying to move AI processing entirely off the main thread, because the main thread cannot be doing anything anyway while the AI is processing. We'd really need a new game engine to resolve that (or a very extensive asynchronous proxy layer between the DLL game model and the APIs to/from the main engine).
 