C++ - Assembly Performance Tuning


I am writing a compiler (more for fun than anything else), but I want to try to make it as efficient as possible. For example, I was told that on Intel architecture the use of any register other than EAX for performing math incurs a cost (presumably because it swaps into EAX to do the actual piece of math). Here is at least one source that states the possibility (http://www.swansontec.com/sregisters.html).

I would like to verify and measure these differences in performance characteristics. Thus, I have written this program in C++:

    #include "stdafx.h"
    #include <intrin.h>
    #include <iostream>

    using namespace std;

    int _tmain(int argc, _TCHAR* argv[])
    {
        __int64 startval;
        __int64 stopval;
        unsigned int value; // keep the value so the math is not optimized out

        startval = __rdtsc(); // CPU tick counter, read with the RDTSC opcode

        // simple math: a = (a << 3) + 0x0054e9
        _asm {
            mov ebx, 0x1e532 // seed
            shl ebx, 3
            add ebx, 0x0054e9
            mov value, ebx
        }

        stopval = __rdtsc();
        __int64 val = (stopval - startval);
        cout << "result: " << value << " -> " << val << endl;

        int i;
        cin >> i;

        return 0;
    }

I tried this code swapping eax and ebx, but I'm not getting a "stable" number. I would hope that the test would be deterministic (the same number every time), because it's so short that it's unlikely a context switch is occurring during the test. As it stands there is no statistical difference, but the number fluctuates so wildly that it would be impossible to make that determination. Even if I take a large number of samples, the number is still impossibly varied.

I'd also like to test xor eax, eax vs. mov eax, 0, but I have the same problem.

Is there any way to do these kinds of performance tests on Windows (or anywhere else)? When I used to program Z80 for my TI-Calc, I had a tool where I could select some assembly and it would tell me how many clock cycles it would take to execute the code. Can that not be done with our new-fangled modern processors?

Edit: There are a lot of answers indicating to run the loop a million times. To clarify, this actually makes things worse. The CPU has more time to context switch, and the test becomes about everything but what I am testing.

To even have a hope of repeatable, deterministic timing at the level that RDTSC gives, you need to take some extra steps. First, RDTSC is not a serializing instruction, so it can be executed out of order, which will usually render it meaningless in a snippet like the one above.

You normally want to use a serializing instruction, then your RDTSC, then the code in question, another serializing instruction, and the second RDTSC.

Nearly the only serializing instruction available in user mode is CPUID. That, however, adds one more minor wrinkle: CPUID is documented by Intel as requiring varying amounts of time to execute; the first couple of executions can be slower than the others.

As such, the normal timing sequence for your code would be something like this:

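    xor eax, eax
    cpuid
    xor eax, eax
    cpuid
    xor eax, eax
    cpuid             ; Intel says by the third execution, the timing will be stable
    rdtsc             ; read the clock
    push eax          ; save the start time (low half)
    push edx          ; save the start time (high half)

    mov ebx, 0x1e532  ; seed; execute the test sequence
    shl ebx, 3
    add ebx, 0x0054e9
    mov value, ebx

    xor eax, eax      ; serialize
    cpuid
    rdtsc             ; get the end time
    pop ecx           ; get the start time (high half)
    pop ebp           ; start time (low half); pick another scratch register if ebp is your frame pointer
    sub eax, ebp      ; find end - start
    sbb edx, ecx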
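For comparison, here is a rough C++ sketch of the same serialize / read / run / serialize / read pattern using compiler intrinsics rather than hand-written assembly. The names `serialize`, `read_ticks`, `test_sequence`, `Timing` and `measure` are made up for illustration. Collecting many samples and reporting the minimum and median is my own addition, and the `std::chrono` fallback for non-x86 targets is only there so the sketch compiles everywhere. Treat it as a starting point under those assumptions, not a definitive harness.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

#if defined(_MSC_VER) && (defined(_M_IX86) || defined(_M_X64))
#include <intrin.h>
// MSVC: serialize with CPUID (leaf 0), then read the time-stamp counter.
inline void serialize() { int regs[4]; __cpuid(regs, 0); }
inline std::uint64_t read_ticks() { return __rdtsc(); }
#elif defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
#include <x86intrin.h>
// GCC/Clang: issue CPUID directly so the compiler cannot drop or move it.
inline void serialize() {
    unsigned a = 0, b, c = 0, d;
    __asm__ __volatile__("cpuid" : "+a"(a), "=b"(b), "+c"(c), "=d"(d) : : "memory");
    (void)b; (void)d;
}
inline std::uint64_t read_ticks() { return __rdtsc(); }
#else
#include <chrono>
// Assumption: on non-x86 targets fall back to steady_clock, with no serialization.
inline void serialize() {}
inline std::uint64_t read_ticks() {
    return static_cast<std::uint64_t>(
        std::chrono::steady_clock::now().time_since_epoch().count());
}
#endif

// The sequence under test, written in C++: a = (a << 3) + 0x0054e9.
inline unsigned test_sequence(unsigned seed) { return (seed << 3) + 0x0054e9u; }

struct Timing {
    std::uint64_t min;     // smallest observed tick delta
    std::uint64_t median;  // middle observed tick delta
};

// Times `code` `samples` times and returns the minimum and median delta.
template <typename F>
Timing measure(F&& code, std::size_t samples) {
    assert(samples > 0);
    for (int i = 0; i < 3; ++i) serialize();  // warm up CPUID before timing
    std::vector<std::uint64_t> deltas;
    deltas.reserve(samples);
    for (std::size_t i = 0; i < samples; ++i) {
        serialize();
        const std::uint64_t start = read_ticks();
        code();
        serialize();
        const std::uint64_t stop = read_ticks();
        deltas.push_back(stop - start);
    }
    std::sort(deltas.begin(), deltas.end());
    return Timing{deltas.front(), deltas[deltas.size() / 2]};
}
```

A call might look like `measure([&] { sink = test_sequence(0x1e532u); }, 1000)` with `sink` declared as `volatile unsigned` so the computation is not optimized away. Every delta includes the cost of the second CPUID and of RDTSC itself, so the numbers are only useful for comparing two variants against each other, not as absolute cycle counts.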

Now we're starting to get close, but there's one last point that's difficult to deal with using inline code on most compilers: there can also be some effects from crossing cache lines, so you usually want to force your code to be aligned to a 16-byte (paragraph) boundary. Any decent assembler will support that, but inline assembly in a compiler usually won't.
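The arithmetic behind that alignment point can be sketched in a few lines of C++. The helper names `align_up` and `crosses_line` are hypothetical, and the 64-byte default line size is an assumption about the target CPU (the only figure given above is the 16-byte boundary).

```cpp
#include <cassert>
#include <cstdint>

// True when x is a non-zero power of two.
constexpr bool is_power_of_two(std::uint64_t x) {
    return x != 0 && (x & (x - 1)) == 0;
}

// Rounds `address` up to the next multiple of `alignment` (a power of two).
inline std::uint64_t align_up(std::uint64_t address, std::uint64_t alignment) {
    assert(is_power_of_two(alignment));
    return (address + alignment - 1) & ~(alignment - 1);
}

// True when the byte range [address, address + length) spans more than one
// cache line. The 64-byte default is an assumption, not a universal constant.
inline bool crosses_line(std::uint64_t address, std::uint64_t length,
                         std::uint64_t line_size = 64) {
    assert(is_power_of_two(line_size) && length > 0);
    return (address / line_size) != ((address + length - 1) / line_size);
}
```

For example, a hypothetical 12-byte sequence starting at address 0x103A would straddle two 64-byte lines, while the same sequence moved to `align_up(0x103A, 16)`, which is 0x1040, fits inside one.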

Having said all that, I think you're wasting your time. As you can guess, I've done a fair amount of timing at this level, and I'm quite certain that what you've heard is an outright myth. In reality, all recent x86 CPUs use a set of what are called "rename registers". To make a long story short, this means the name you use for a register doesn't really matter much: the CPU has a much larger set of registers (e.g., around 40 for Intel) that it uses for the actual operations, so putting a value in EBX vs. EAX has little effect on the register that the CPU is really going to use internally. Either could be mapped to any rename register, depending primarily on which rename registers happen to be free when that instruction sequence starts.
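A toy model may make the renaming idea more concrete. This is only an illustration, not a description of any real Intel design: the class `RenameTable` and its methods are invented here, the 40-entry physical register file simply borrows the rough figure above, and releasing the old physical register immediately is a simplification (a real CPU would hold it until the instruction retires).

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <deque>

// Architectural register names visible to the programmer.
enum ArchReg { EAX, ECX, EDX, EBX, ESP, EBP, ESI, EDI };
constexpr std::size_t kArchRegCount = 8;

// Toy rename table: maps each architectural name to a physical register.
class RenameTable {
public:
    explicit RenameTable(std::size_t physical_count = 40) {
        assert(physical_count > kArchRegCount);
        for (std::size_t p = 0; p < physical_count; ++p) {
            if (p < kArchRegCount) {
                map_[p] = p;         // initial identity mapping
            } else {
                free_.push_back(p);  // the rest start out free
            }
        }
    }

    // A write to an architectural register takes the next free physical
    // register; which one it gets depends on the free list, not on the name.
    std::size_t rename_dest(ArchReg reg) {
        assert(!free_.empty());
        const std::size_t fresh = free_.front();
        free_.pop_front();
        free_.push_back(map_[reg]);  // toy simplification: freed immediately
        map_[reg] = fresh;
        return fresh;
    }

    // Physical register currently backing an architectural name.
    std::size_t lookup(ArchReg reg) const { return map_[reg]; }

private:
    std::array<std::size_t, kArchRegCount> map_{};
    std::deque<std::size_t> free_;
};
```

In this model, two fresh tables hand out the same physical register whether the first write targets EAX or EBX, which is the sense in which the name stops mattering.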

