configuration - General Method for Determining Hadoop Conf Settings on a Single Node Cluster


I am wondering how to best determine the appropriate number of map and reduce tasks, and the corresponding maximum size of the JVM heap. I am new to Hadoop; these properties are set in the mapred-site.xml file. Is there a general formula I can follow based on the number of (virtual) cores and the amount of RAM?
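For concreteness, here is a minimal sketch of the kind of entries I mean, assuming the classic MRv1 property names (mapred.tasktracker.map.tasks.maximum, mapred.tasktracker.reduce.tasks.maximum, mapred.child.java.opts); the values below are placeholders, not recommendations:

<configuration>
  <!-- maximum number of map tasks this TaskTracker runs at once -->
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>2</value>
  </property>
  <!-- maximum number of reduce tasks this TaskTracker runs at once -->
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>2</value>
  </property>
  <!-- JVM options (including the maximum heap) for each spawned map/reduce child -->
  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx512m</value>
  </property>
</configuration>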

In your response, please consider the various additional Hadoop processes created before/during job processing and their impact on RAM usage (see: https://forums.aws.amazon.com/thread.jspa?threadid=49024).

How does the answer change when shifting from a single-machine cluster to a two-machine cluster?

thanks,

setjmp

Some time has passed and no one has tried to formulate an answer, so I will put forth my own ideas in the hope that others will point out flaws if they exist.

The most important thing in configuring Hadoop is not to allow too many resources to be consumed; otherwise jobs fail, and the exceptions are not helpful in determining what went wrong. Memory in particular is a resource that can cause an immediate crash, and as pointed out in the question, a JVM may try to request an unnecessary amount of memory.

We must also account for processes other than map and reduce (like the sorting that occurs between map and reduce). Unfortunately, no one has come forward with a proposal for how many such processes may exist at the same time.

So here is my proposal. Let the number of mappers be M, the number of reducers be R, and the total virtual RAM on the box be G. Allocate G/(2*M + R) of RAM to each process. The factor of 2 assumes there is one process sorting the output of each map process or performing other supporting work. Also ensure that 2*M + R < P, where P is the number of processors on the box (take hyper-threading into account when computing P), to prevent excessive context switching.
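As a rough illustration (the machine size and the resulting values are my own assumptions, not measurements): on a box with P = 8 virtual cores and G = 16 GB of RAM, choosing M = 3 and R = 1 gives 2*M + R = 7 < 8, and G/(2*M + R) is about 2.3 GB per process, so each child JVM could be capped at roughly 2 GB to leave headroom for the OS. Using the same MRv1 property names as above, mapred-site.xml might then look like:

<configuration>
  <!-- M = 3 concurrent map slots on this TaskTracker -->
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>3</value>
  </property>
  <!-- R = 1 concurrent reduce slot on this TaskTracker -->
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>1</value>
  </property>
  <!-- per-child heap: G/(2*M + R) = 16 GB / 7 ~= 2.3 GB, rounded down to 2 GB -->
  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx2048m</value>
  </property>
</configuration>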

So far this approach hasn't taken down the box.

