Nightinggale
I managed to profile using the Time Stamp Counter, which is available as __rdtsc in current compilers but not in MSVC 2003. It turns out that for short arrays (3 elements, for example), getTotal uses the same number of cycles with and without the template approach. For a large array (174 elements specifically), the loop appears to use around 25% more cycles (though the accuracy of the measurement is questionable because the numbers are so low that noise matters). The template approach, however, apparently multiplies the cycle count by the number of elements in the array, which indicates that it is no longer being inlined once the array passes a certain number of elements. As a result, the template approach takes more than 150 times as many cycles as the loop to add up the elements, which is consistent with the expected overhead of 174 function calls.
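For anyone who hasn't seen the two variants side by side, here is a minimal sketch of what was compared. Only getTotal is a name from the actual code; getTotalLoop, Sum and the plain int array are made up for illustration, so treat this as the general shape of the technique rather than the real implementation.

```cpp
#include <cstddef>

// Plain loop version (hypothetical stand-in for the loop-based getTotal).
inline int getTotalLoop(const int* values, std::size_t count)
{
    int total = 0;
    for (std::size_t i = 0; i < count; ++i)
        total += values[i];
    return total;
}

// Template "unroll" version (hypothetical): Sum<N>::get adds element N-1
// and then defers to Sum<N-1>. This only behaves like an unrolled loop if
// the compiler inlines every level; otherwise it is N real function calls.
template <std::size_t N>
struct Sum
{
    static int get(const int* values)
    {
        return values[N - 1] + Sum<N - 1>::get(values);
    }
};

// Recursion terminator: an empty range sums to zero.
template <>
struct Sum<0>
{
    static int get(const int*) { return 0; }
};
```

Calling Sum<174>::get(values) instantiates 174 nested functions. If the compiler gives up on inlining partway down, each remaining level costs a real call, which would match the measurement above.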
Conclusion: the template loop unroll approach is dead. Maybe it worked when the compiler was new, but it doesn't pay off on modern CPUs. Also, adding 174 elements with the loop takes around 27 cycles. Evidently the CPU manages to unroll the loop at runtime and get the most out of out-of-order execution. There is no need for us to do anything.
Also, I need to write a proper implementation of the profiling I used for this. It can measure differences between function calls that are so fast that clock() rounds the time to 0. A proper implementation would be a class with start and stop functions, hiding the fact that they have to call asm.
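A rough sketch of what such a class could look like. The name CycleTimer and its members are hypothetical, and this version assumes a compiler that provides the __rdtsc intrinsic and <cstdint>; for MSVC 2003 the body of readCounter would have to be replaced with inline asm. On non-x86 targets it falls back to std::chrono, which counts clock ticks rather than CPU cycles.

```cpp
#include <cstdint>

#if defined(_MSC_VER) && (defined(_M_IX86) || defined(_M_X64))
#include <intrin.h>      // __rdtsc on newer MSVC
#define CYCLE_TIMER_HAS_RDTSC 1
#elif defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
#include <x86intrin.h>   // __rdtsc on GCC/Clang
#define CYCLE_TIMER_HAS_RDTSC 1
#else
#include <chrono>        // assumption: fallback when no TSC is available
#define CYCLE_TIMER_HAS_RDTSC 0
#endif

// Hypothetical profiling helper: start() and stop() hide how the
// counter is actually read.
class CycleTimer
{
public:
    CycleTimer() : m_start(0), m_elapsed(0) {}

    void start() { m_start = readCounter(); }
    void stop()  { m_elapsed = readCounter() - m_start; }

    // Counter ticks between the last start() and stop() calls.
    std::uint64_t elapsed() const { return m_elapsed; }

private:
    static std::uint64_t readCounter()
    {
#if CYCLE_TIMER_HAS_RDTSC
        return __rdtsc();
#else
        return static_cast<std::uint64_t>(
            std::chrono::steady_clock::now().time_since_epoch().count());
#endif
    }

    std::uint64_t m_start;
    std::uint64_t m_elapsed;
};
```

Usage would be timer.start(), the code under test, timer.stop(), then reading timer.elapsed(). Since the numbers are small and noisy, it probably makes sense to repeat the measurement several times and compare the lowest values.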