c++ - SSE2 double multiplication slower than with standard multiplication

I'm wondering why the following code with SSE2 instructions performs the multiplication slower than the standard C++ implementation. Here is the code:

        m_win = (double*)_aligned_malloc(size*sizeof(double), 16);
        __m128d* pdata = (__m128d*)input().data;
        __m128d* pwin = (__m128d*)m_win;
        __m128d* pout = (__m128d*)m_output.data;
        __m128d tmp;
        int i=0;
        for(; i<m_size/2;i++)
            pout[i] = _mm_mul_pd(pdata[i], pwin[i]);

The memory of m_output.data and input().data has been allocated with _aligned_malloc.

The time to execute this code for a 2^25 array is, however, identical to the time for this code (350 ms):

for(int i=0;i<m_size;i++)
    m_output.data[i] = input().data[i] * m_win[i];

How is that possible? It should theoretically take only 50% of the time, right? Or is the overhead for the memory transfer from the SIMD registers to the m_output.data array so expensive?

If I replace the line from the first snippet

pout[i] = _mm_mul_pd(pdata[i], pwin[i]);

with

tmp = _mm_mul_pd(pdata[i], pwin[i]);

where __m128d tmp; the code executes blazingly fast, in less than the resolution of my timer function. Is that because everything is just stored in registers and not in memory?

And even more surprising, if I compile in debug mode, the SSE code takes only 93 ms while the standard multiplication takes 309 ms.

Debug: 93 ms (SSE2) / 309 ms (standard multiplication)
Release: 350 ms (SSE2) / 350 ms (standard multiplication)

What's going on here?

I'm using MSVC2008 with QtCreator 2.2.1 in release mode. Here are my compiler switches for release:

cl -c -nologo -Zm200 -Zc:wchar_t- -O2 -MD -GR -EHsc -W3 -w34100 -w34189

And these are for debug:

cl -c -nologo -Zm200 -Zc:wchar_t- -Zi -MDd -GR -EHsc -W3 -w34100 -w34189

Edit regarding the release vs. debug issue: I just wanted to note that I profiled the code, and the SSE code is in fact slower in release mode! This somehow confirms the hypothesis that VS2008 can't handle intrinsics with the optimizer properly. Intel VTune gives me 289 ms for the SSE loop in debug and 504 ms in release mode. Wow...

First of all, VS 2008 is a bad choice for intrinsics, as it tends to add many more register moves than necessary and in general does not optimize very well (for instance, it has issues with loop induction variable analysis when SSE instructions are present).

So, my wild guess is that the compiler generates scalar multiply instructions (mulsd for double operands), which the CPU can trivially reorder and execute in parallel because there are no dependencies between iterations, while the intrinsics result in lots of register moves and complex SSE code that might even blow the trace cache on modern CPUs. VS2008 is notorious for doing all its calculations in registers, and I guess there will be some hazards the CPU cannot skip (like xor reg, move mem->reg, xor, mov mem->reg, mul, mov mem->reg, which is a dependency chain, while the scalar code might be move mem->reg, mul with mem operand, mov). You should definitely look at the generated assembly, or try VS 2010, which has much better support for intrinsics.
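To compare the generated assembly, it helps to isolate both loops as free functions and compile them to a listing (for example with cl /FAs, or -S on GCC/Clang). The following is only a sketch: the function names are made up for illustration, and it uses unaligned loads and stores so it works with any buffer, whereas the code in the question relies on 16-byte alignment.

```cpp
#include <cassert>
#include <cstddef>
#include <emmintrin.h>  // SSE2 intrinsics

// Scalar reference: one multiply per element.
void multiply_scalar(const double* a, const double* b, double* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        out[i] = a[i] * b[i];
}

// SSE2 version: two doubles per iteration.
// Assumption for this sketch: n is even, so no scalar tail loop is needed.
void multiply_sse2(const double* a, const double* b, double* out, std::size_t n) {
    assert(n % 2 == 0 && "sketch assumes an even element count");
    for (std::size_t i = 0; i < n; i += 2) {
        __m128d va = _mm_loadu_pd(a + i);  // load a[i], a[i+1]
        __m128d vb = _mm_loadu_pd(b + i);  // load b[i], b[i+1]
        _mm_storeu_pd(out + i, _mm_mul_pd(va, vb));
    }
}
```

Both functions should produce bit-identical results, so any difference in run time comes from the generated instructions and the memory traffic, not from the arithmetic.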

Finally, and most important: your code is not compute bound at all, so no amount of SSE will make it significantly faster. On each iteration, you are reading four double values and writing two, which means FLOPs are not your problem. In this case, you're at the mercy of the cache/memory subsystem, and that probably explains the variance you see. The debug multiplication shouldn't be faster than release; if you see it being faster, you should do more runs and check what else is going on (be careful if your CPU supports a turbo mode, which adds another 20% variation). A context switch that empties the cache might be enough in this case.
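The memory-bound argument can be checked with back-of-the-envelope arithmetic using the numbers from the question: 2^25 doubles in each of three arrays (two read, one written) moved in 350 ms. The helper names below are hypothetical, and the sketch ignores write-allocate traffic, so the real figure may be somewhat higher.

```cpp
#include <cstddef>

// Bytes moved by out[i] = a[i] * b[i] over n doubles:
// two arrays are read and one is written (write-allocate traffic ignored).
double bytes_moved(std::size_t n) {
    return 3.0 * static_cast<double>(n) * sizeof(double);
}

// Effective bandwidth in GB/s (1 GB = 1e9 bytes) for a run lasting `seconds`.
double effective_bandwidth_gbs(std::size_t n, double seconds) {
    return bytes_moved(n) / seconds / 1e9;
}
```

For n = 2^25 and 0.350 s this gives about 805 MB moved and roughly 2.3 GB/s. Whether that is close to what the machine can sustain depends on hardware the question does not specify, but both loops hitting the same figure is what you would expect if memory, not the multiplier, is the limit.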

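So, overall, the test you made is pretty meaningless; it just shows that for memory-bound cases it makes no difference whether you use SSE or not. You should use SSE if there is code that is actually compute-dense and parallel, and even then I would spend a lot of time with a profiler to nail down the exact location to optimize. A simple element-wise product like this one is not suitable for seeing any performance improvements with SSE.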
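For the "do more runs" advice, a small timing helper makes the measurements less sensitive to a single disturbed run. This is a sketch under stated assumptions: it needs a C++11 compiler for std::chrono (which VS2008 is not), and best_of_ms and checksum are illustrative names, not part of any library.

```cpp
#include <chrono>
#include <cstddef>
#include <limits>

// Runs `work` `runs` times and returns the fastest wall-clock time in
// milliseconds. Taking the minimum filters out runs that were disturbed
// by context switches or clock-frequency changes.
template <typename F>
double best_of_ms(F work, int runs) {
    double best = std::numeric_limits<double>::infinity();
    for (int r = 0; r < runs; ++r) {
        auto start = std::chrono::steady_clock::now();
        work();
        auto stop = std::chrono::steady_clock::now();
        double ms = std::chrono::duration<double, std::milli>(stop - start).count();
        if (ms < best) best = ms;
    }
    return best;
}

// Sums the output so the optimizer cannot discard the loop that produced it.
double checksum(const double* data, std::size_t n) {
    double sum = 0.0;
    for (std::size_t i = 0; i < n; ++i) sum += data[i];
    return sum;
}
```

Passing each loop as a lambda to best_of_ms and printing checksum of the output afterwards keeps the result observable. A loop whose result is never read can be removed entirely by the optimizer, which may be what happened with the tmp variant in the question.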
