C++ - SSE2 double multiplication slower than standard multiplication
I'm wondering why the following code using SSE2 instructions performs the multiplication slower than the standard C++ implementation. Here is the code:
    m_win = (double*)_aligned_malloc(size*sizeof(double), 16);
    __m128d* pdata = (__m128d*)input().data;
    __m128d* pwin = (__m128d*)m_win;
    __m128d* pout = (__m128d*)m_output.data;
    __m128d tmp;
    int i=0;
    for(; i<m_size/2;i++)
        pout[i] = _mm_mul_pd(pdata[i], pwin[i]);

The memory for m_output.data and input().data has been allocated with _aligned_malloc.
The time to execute this code for a 2^25 array is identical to the time for this code (350ms):

    for(int i=0;i<m_size;i++)
        m_output.data[i] = input().data[i] * m_win[i];

How is that possible? It should theoretically take only 50% of the time, right? Or is the overhead of the memory transfer from the SIMD registers to the m_output.data array that expensive?
If I replace the line from the first snippet

    pout[i] = _mm_mul_pd(pdata[i], pwin[i]);

by

    tmp = _mm_mul_pd(pdata[i], pwin[i]);

where tmp is declared as __m128d tmp;, the code executes blazingly fast, in less than the resolution of my timer function. Is that because everything is just stored in registers and not in memory?
And even more surprising, if I compile in debug mode, the SSE code takes only 93ms while the standard multiplication takes 309ms.
- Debug: 93ms (SSE2) / 309ms (standard multiplication)
- Release: 350ms (SSE2) / 350ms (standard multiplication)
What's going on here???
I'm using MSVC2008 with QtCreator 2.2.1 in release mode. Here are my compiler switches for release:

    cl -c -nologo -Zm200 -Zc:wchar_t- -O2 -MD -GR -EHsc -W3 -w34100 -w34189

and these are for debug:

    cl -c -nologo -Zm200 -Zc:wchar_t- -Zi -MDd -GR -EHsc -W3 -w34100 -w34189

Edit regarding the release vs debug issue: I just wanted to note that I profiled the code and the SSE code is in fact slower in release mode! This somehow confirms the hypothesis that VS2008 can't handle intrinsics with the optimizer properly. Intel VTune gives me 289ms for the SSE loop in debug and 504ms in release mode. Wow... just wow...
First of all, VS 2008 is a bad choice for intrinsics, as it tends to add many more register moves than necessary and in general does not optimize them very well (for instance, it has issues with loop induction variable analysis when SSE instructions are present).
So, my wild guess is that the compiler generates scalar multiply instructions (mulsd for doubles), which the CPU can trivially reorder and execute in parallel (there are no dependencies between the iterations), while the intrinsics result in lots of register moves and complex SSE code; it might even blow the trace cache on modern CPUs. VS2008 is notorious for doing all its calculations in registers, and I guess there will be some hazards which the CPU cannot skip (like xor reg, mov mem->reg, xor, mov mem->reg, mul, mov mem->reg, which is a dependency chain, while the scalar code might be mov mem->reg, mul with a mem operand, mov). You should definitely look at the generated assembly or try VS 2010, which has much better support for intrinsics.
Finally, and most important: your code is not compute bound at all, so no amount of SSE will make it significantly faster. On each iteration, you are reading four double values and writing two, which means FLOPs are not your problem. In this case, you're at the mercy of the cache/memory subsystem, and that probably explains the variance you see. The debug multiplication shouldn't be faster than release; if you see it being faster, you should do more runs and check what else is going on (be careful if your CPU supports a turbo mode, which adds another 20% variation). A context switch that empties the cache might be enough in this case.
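To put a rough number on the memory-bound argument, here is a minimal sketch that estimates the traffic of the multiply loop. The helper names bytes_moved and effective_bandwidth_gbs are made up for illustration, and the model is an assumption: it counts two input streams and one output stream of doubles and ignores cache effects such as write-allocate reads.

```cpp
#include <cassert>
#include <cstddef>

// Bytes moved by out[i] = a[i] * b[i] over n doubles:
// two arrays are read and one is written (simplified model, caches ignored).
inline double bytes_moved(std::size_t n) {
    return 3.0 * static_cast<double>(n) * sizeof(double);
}

// Effective bandwidth in GB/s (1 GB = 1e9 bytes) for a run time in milliseconds.
inline double effective_bandwidth_gbs(std::size_t n, double milliseconds) {
    assert(milliseconds > 0.0);
    return bytes_moved(n) / (milliseconds * 1e-3) / 1e9;
}
```

With the numbers from the question (2^25 elements in 350ms), this model gives about 2.3 GB/s of memory traffic but fewer than 0.1 billion multiplications per second, which is far below what either the scalar or the SSE2 unit can sustain.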
So, overall, the test you made is pretty meaningless and just shows that for memory-bound cases it makes no difference whether you use SSE or not. You should use SSE if there is actually some code which is compute-dense and parallel, and even then I would spend a lot of time with a profiler to nail down the exact location to optimize. A simple element-wise product is not suitable for seeing any performance improvement with SSE.
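If you still want to compare the two loops, a sketch along these lines keeps the comparison fair. The function names multiply_scalar, multiply_sse2 and checksum are hypothetical and not from the question; unaligned loads and stores are used here (instead of the __m128d pointer casts in the question) so that any buffer works, and a scalar tail handles odd lengths. Reading the output through a checksum matters: when the result only goes into a local that is never read, as in the tmp variant, the optimizer is free to drop the whole loop, which is a likely reason that variant appears to take no time.

```cpp
#include <cassert>
#include <cstddef>
#include <emmintrin.h>  // SSE2 intrinsics

// Scalar reference: out[i] = a[i] * b[i].
void multiply_scalar(const double* a, const double* b, double* out, std::size_t n) {
    assert(n == 0 || (a && b && out));
    for (std::size_t i = 0; i < n; ++i)
        out[i] = a[i] * b[i];
}

// SSE2 version: two doubles per iteration, then a scalar tail for odd n.
void multiply_sse2(const double* a, const double* b, double* out, std::size_t n) {
    assert(n == 0 || (a && b && out));
    std::size_t i = 0;
    for (; i + 2 <= n; i += 2) {
        __m128d va = _mm_loadu_pd(a + i);  // unaligned load of a[i], a[i+1]
        __m128d vb = _mm_loadu_pd(b + i);  // unaligned load of b[i], b[i+1]
        _mm_storeu_pd(out + i, _mm_mul_pd(va, vb));
    }
    for (; i < n; ++i)
        out[i] = a[i] * b[i];
}

// Sum of all results. Using (for example printing) this value after the
// timed loop keeps the stores observable, so the loop cannot be discarded.
double checksum(const double* data, std::size_t n) {
    double sum = 0.0;
    for (std::size_t i = 0; i < n; ++i)
        sum += data[i];
    return sum;
}
```

Time each function over several runs on the same buffers and compare the best or median time rather than a single measurement; both versions should produce the same checksum.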