DLL development discussions

@Koshling, sure, but I'm not saying it's a problem for us. I am wondering why something like this happens at all. In my test unit the TrashArray is deleted before exit and the test stops to show statistics, so I could observe it.

At a guess, the runtime heap implementation doesn't release everything, but keeps a pool allocated up to some threshold. That, and/or the few remaining allocations (possibly by the runtime itself), may have fragmented things, making total release impossible. How much are we talking about? If you use a DEBUG runtime I think there are functions to walk the heap and enumerate the blocks still allocated, if you feel the need to analyze further.
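
Something along these lines would do it - just a sketch using the CRT's _heapwalk (a DEBUG runtime adds the _Crt* reporting functions on top of this):

[code]
#include <malloc.h>
#include <stdio.h>

// Rough sketch only: enumerate the CRT heap with _heapwalk and total up the
// blocks that are in use vs. free-but-retained by the heap.
void DumpHeapSummary()
{
	_HEAPINFO info;
	info._pentry = NULL;                 // start from the first block

	size_t usedBytes = 0;
	size_t freeBytes = 0;
	int status;
	while ((status = _heapwalk(&info)) == _HEAPOK)
	{
		if (info._useflag == _USEDENTRY)
			usedBytes += info._size;
		else
			freeBytes += info._size;
	}

	printf("in use: %u bytes, free but still held by the heap: %u bytes (end status %d)\n",
		(unsigned)usedBytes, (unsigned)freeBytes, status);
}
[/code]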
 
Sad news. The memory statistics are not actually reliable in this case. Indeed, using my structure there is much less memory allocated by the process. But the excess memory in the TrashArray case is actually freed and just retained by the heap for future use. The same probably goes for the heap tests I did before. :rolleyes: If only I had known it then, I wouldn't have put so much effort into this structure.

The structure is good, and if tweaked more it would probably be much nicer to use than TrashArray. But it would only make sense for new code, or if we would benefit from shrinking IDInfos down to one pointer, I guess.

The speed is indeed 3.3-5x greater, but the effective memory saving is about 3-4%, and about 4% in the number of page faults, when testing with highly random accesses.
 
Why, of course Balancing Array Linked List. My structure. :p Well, actually it should be Balancing Linked Array, but BLA doesn't sound so cute.

I guess it will not give anything special here, but I would like to test it in real life.
 
Go right ahead please! Feel free to try things experimentally in general.
 
I am wondering about data races. In FFreeListTrashArray there are critical sections everywhere, but from what I can see there is no full protection against races. E.g. when getAt returns the pointer to an object, it is possible for another thread to delete this object before we read anything through it. So, do we care about this in the code, or are there actually no data races here?
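
What I mean is a pattern like this (a tiny stand-in, not the real FFreeListTrashArray code):

[code]
#include <windows.h>

// Tiny stand-in, NOT the real FFreeListTrashArray: the lock protects the
// lookup, but the returned pointer outlives the critical section.
struct Obj { int iValue; };

static CRITICAL_SECTION g_cs;    // assume InitializeCriticalSection(&g_cs) was called at startup
static Obj* g_slot = NULL;       // a single "slot" standing in for the table

Obj* getAt()                     // thread A calls this...
{
	EnterCriticalSection(&g_cs);
	Obj* p = g_slot;
	LeaveCriticalSection(&g_cs);
	return p;                    // the pointer escapes the lock's protection here
}

void removeAt()                  // ...while thread B calls this
{
	EnterCriticalSection(&g_cs);
	delete g_slot;
	g_slot = NULL;
	LeaveCriticalSection(&g_cs);
}

// Thread A: Obj* p = getAt();  /* thread B runs removeAt() here */  int v = p->iValue;  // dangling read
[/code]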
 
The code only operates on one unit at a time per thread, and never operates on the same object in different threads (but can concurrently operate on different objects in different threads). Also, (currently) unit processing is not parallel at all, though that will be changing in the next couple of months (but still respecting the condition I started this paragraph with).
 
Pfffff, it's quite hard to make the structure thread-safe without losing performance and without using critical sections everywhere. Tell me please, is there an assumption that two colliding IDs are never assigned at once? Because ID assignment isn't atomic now, and performing id1 = id2 and id2 = id3 at the same time does not look one hundred percent safe.

I actually have a similar problem with my references, as operations on them are heavily non-atomic. :(
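
To show what I mean by non-atomic (the layout only mirrors the SDK's IDInfo from memory, so treat it as illustrative):

[code]
// Illustrative only -- an owner plus an index, with PlayerTypes replaced by a plain int.
typedef int PlayerTypes;

struct IDInfo
{
	PlayerTypes eOwner;
	int iID;
};

// The default copy "id1 = id2" compiles to two separate stores:
//   id1.eOwner = id2.eOwner;
//   id1.iID    = id2.iID;
// so a concurrent "id2 = id3" (or a reader of id1) can observe the owner of
// one object paired with the index of another.
[/code]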
 
There is indeed an assumption that two threads allocating concurrently will not assign the same id (i.e. allocation must be thread safe). That's basically why the crit sect is there. Provided you use a critical section initialized with a spin-wait, the performance hit should be relatively small (compared to a heap allocation) - i.e. one initialized using InitializeCriticalSectionAndSpinCount(), NOT the basic InitializeCriticalSection(). Such a crit section will not cause a kernel switch if there is no contention (or it becomes uncontended within the spin count), and effectively just uses the Interlocked... functions up to the point that the spin count expires. These still have some overhead (they cause a pipeline flush at CPU level) which is large compared to a non-interlocked instruction, but it's very small compared to a heap allocation, so the amortized penalty of protecting an object allocation (which is doing a heap allocation anyway) is small.
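
Roughly like this (just a sketch - the spin count and the allocation body are placeholders, not the actual code):

[code]
#include <windows.h>

// Sketch only: protect ID allocation with a spin-count critical section so
// the uncontended path never enters the kernel.
static CRITICAL_SECTION g_allocLock;
static int g_nextId = 0;

void InitAllocLock()
{
	InitializeCriticalSectionAndSpinCount(&g_allocLock, 4000);
}

int AllocateId()
{
	EnterCriticalSection(&g_allocLock);  // spins first; kernel wait only under sustained contention
	int iNewId = g_nextId++;             // stands in for the real free-list allocation
	LeaveCriticalSection(&g_allocLock);
	return iNewId;
}
[/code]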
 
Well, I'm a little terrified about using critical sections and locking the whole structure on each ref1 = ref2; -- where ref1 and ref2 are equivalents of ids. I am thinking about putting a mutex into each index -- an index is the header for an object, and is what the references point to. But I do not know the implementation of Windows' critical sections, and am afraid having so many crit. sections may be somehow heavy for the system. -- Besides, I do not want to fatten the indexes too much, and would like to have 1-byte mutexes.

Or I've even been thinking about 2 mutexes per 1 byte, but am not sure I will manage that with x86 instructions. -- Really, they could have developed some smarter instructions by now. :rolleyes:
 
Critical sections are WAY cheaper than mutexes. Mutexes require a kernel switch on every check. It's about 2 orders of magnitude of difference when uncontended.
 
Oh my, but I am not talking about system mutexes. I am thinking about implementing my own actively-waiting mutexes with the XADD or CMPXCHG instruction. My approach to the structure does not require copying the whole array as the trash array does, and mostly needs only a few instructions per operation to be done in the crit. section, so I think active waiting without putting the thread to sleep will do the job in this case.

Yup, I know I may be overdoing it, but actually it's fun. I haven't programmed in assembler since high school. :)

Still, ref1 = ref2; using a locking mechanism makes me feel sad.
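
Something like this is what I have in mind (a sketch with MSVC x86 inline assembly; the names are made up, not the structure's real code):

[code]
// Actively-waiting 1-byte lock, sketch only.
typedef volatile unsigned char ByteLock;

void LockByte(ByteLock* pLock)
{
	__asm {
		mov edx, pLock
	spin:
		mov al, 1
		xchg al, byte ptr [edx]   ; XCHG with a memory operand is implicitly LOCKed
		test al, al
		jnz spin                  ; it was already held - keep spinning
	}
}

void UnlockByte(ByteLock* pLock)
{
	*pLock = 0;                   // a single byte store releases the lock on x86
}
[/code]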
 
Critical sections in Windows are implemented using compare-exchange and interlocked increment, and only if there is contention that lasts longer than the specified spin wait will they go to the kernel (and at that point essentially be equivalent to a system mutex). You're very unlikely to do better than the critical section implementation, and furthermore it is CPU-architecture sensitive, so it gets decent performance even on older machines.
 
I disagree. I do not need to increment and check an additional counter that measures the time until sleeping.

But what's more important, I do not want an additional 24 bytes per index! (Or even 48, as I am thinking about a feature which would need another lock.)
 
I take your point about the memory, but no need for assembler - just use the compiler intrinsics (InterlockedIncrement(), InterlockedDecrement(), InterlockedCompareExchange() etc.)
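
E.g. something like this (just a sketch; note it works on a LONG, so it costs 4 bytes per lock and the value must be 32-bit aligned):

[code]
#include <windows.h>

// The same kind of spin lock built from the Interlocked functions, no assembler.
void Lock(volatile LONG* pLock)
{
	// Atomically swap in 1; keep retrying while the previous value was already 1.
	while (InterlockedCompareExchange(pLock, 1, 0) != 0)
	{
		// busy-wait; a YieldProcessor() or Sleep(0) could go here
	}
}

void Unlock(volatile LONG* pLock)
{
	InterlockedExchange(pLock, 0);   // release by writing 0 back atomically
}
[/code]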
 
I was not aware of these functions. Thanks. But still, it is said that they require the value to be 32-bit aligned, and I am using #pragma pack(1) for the indexes.
 
Are the structures in the index arrays not multiples of 32 bits? If they are then byte-packing an array of them saves very little (maximum of 3 bytes across an entire table), but costs an extra cycle or 2 per access if not cached. Might not be a worthwhile packing for such a structure...
 
Mostly they are not. In the average case they are 10 bytes, but they may be 9, 11 or 12, depending on the maximal block size given by a template parameter.
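
For illustration, a 10-byte index packed that way looks roughly like this (invented field names, not the real layout):

[code]
#pragma pack(push, 1)
// Purely illustrative 10-byte index.
struct PackedIndex
{
	unsigned int   iObjectOffset;   // 4 bytes
	unsigned int   iNextFree;       // 4 bytes
	unsigned short usFlags;         // 2 bytes -> sizeof(PackedIndex) == 10
};
#pragma pack(pop)

// Without the pack(1) the same struct would be padded up to 12 bytes, and its
// 4-byte members would stay 32-bit aligned - which is what the Interlocked
// functions want.
[/code]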
 
:badcomp:
@Koshling, did you know that since multicore processors came along, there are no atomic memory instructions, even on aligned addresses? Even a plain read or write is not atomic! And I must analyse everything again from the beginning. :(

Or do we assume the game can, and always will be able to, use only one core at once? -- I have a feeling it could crash if not a single CPU instruction were atomic, but I could be wrong.
 