cuda - Correct Effective Bandwidth calculations of y = Ax+b?
I want to calculate the effective bandwidth of a matrix-vector multiplication with addition, y = A*x + b (assume A is m times n).
But I am a bit confused about how to count the number of bytes read from and written to global memory.
Is the effective bandwidth based on:
bytesreadwrite = m*n (for reading A) + n (for reading x) + m (for reading b) + m (for writing y)
or is it
bytesreadwrite = m*n (for reading A) + m*n (for reading x) + m (for reading b) + m (for writing y)
The m*n for x is because I read the whole of x once for each row (also, if I work with shared memory, I still have to read the whole x vector once per row).
Does anyone have advice on which is the right choice? I don't really...
I tend to use the first calculation, but why? Does that make sense?
Thanks a lot!
It's none of the above. In terms of memory bandwidth, modern processors load each of the items operated on once into the level 2 cache and operate on them there, after which the results are written out to memory for the items that changed. So, effectively, the bandwidth is the sum total of the size of the elements involved.

Note: this is an oversimplification, because it doesn't take into account the effects of streaming, not to mention memory pagination. With streaming, it's not uncommon to have a single matrix operate on a large set of data (3D graphics calculations, for example); in that case, the matrix gets loaded into the L2 cache (and presumably, in reasonably optimized code, into registers from there) once, and the vectors are loaded through.

Once again, the model isn't complete without an understanding of modern memory paging techniques; there's a gigantic difference in the above if the matrix and vectors are stored in different memory pages, for example, not to mention the serious optimizations in packing vectors for "streaming" into the L2 cache.

And then, that's all assuming a CPU model of performing the matrix math; bringing a GPU into the picture changes things once again dramatically.
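To make the "sum total of the size of the elements involved" count concrete, here is a minimal host-side sketch in C++ (it should also compile as part of a CUDA source file). The function names `min_bytes_moved` and `effective_bandwidth_gbs` are hypothetical, not from the original post. Note that the terms in the question are element counts, so they have to be multiplied by the element size to get bytes. The sketch assumes every operand crosses the global memory bus exactly once, which only holds if repeated reads of x are served from cache or shared memory.

```cpp
// Sketch only: effective bandwidth for y = A*x + b, counting each operand once.
// Function names are hypothetical and chosen for illustration.
#include <cassert>
#include <cstddef>

// Bytes of compulsory global-memory traffic, assuming A (m*n elements),
// x (n elements) and b (m elements) are each read once and y (m elements)
// is written once.
inline std::size_t min_bytes_moved(std::size_t m, std::size_t n,
                                   std::size_t elem_size) {
    return (m * n + n + m + m) * elem_size;
}

// Effective bandwidth in GB/s (1 GB = 1e9 bytes) for a kernel that took
// `seconds` to run, e.g. a cudaEventElapsedTime result converted from ms.
inline double effective_bandwidth_gbs(std::size_t m, std::size_t n,
                                      std::size_t elem_size, double seconds) {
    assert(seconds > 0.0);  // guard against a zero or negative timing
    return static_cast<double>(min_bytes_moved(m, n, elem_size)) / seconds / 1e9;
}
```

For example, with a 1000 x 1000 matrix of 4-byte floats and a measured time of 1 ms, this counts 4,012,000 bytes, which gives about 4.0 GB/s. If the kernel really does re-read x from global memory for every row, the actual traffic is higher than this count, and the figure then understates the throughput the hardware delivered.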