General Method for Determining Hadoop Conf Settings on a Single Node Cluster
I am wondering how best to determine the appropriate number of map and reduce tasks, and the corresponding maximum size of the JVM heap. Being new to Hadoop, I understand that these properties are set in the mapred-site.xml file. Is there a general formula I can follow based on the number of (virtual) cores and the amount of RAM?
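For reference, here is a minimal sketch of the kind of mapred-site.xml entries I mean (the property names below are the MRv1-style ones; the values are placeholders, not recommendations):

    <configuration>
      <!-- maximum number of map task slots the TaskTracker will run at once -->
      <property>
        <name>mapred.tasktracker.map.tasks.maximum</name>
        <value>4</value>
      </property>
      <!-- maximum number of reduce task slots the TaskTracker will run at once -->
      <property>
        <name>mapred.tasktracker.reduce.tasks.maximum</name>
        <value>2</value>
      </property>
      <!-- JVM options for each child task, including the maximum heap size -->
      <property>
        <name>mapred.child.java.opts</name>
        <value>-Xmx512m</value>
      </property>
    </configuration>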
In your response, please consider the various additional Hadoop processes created before/during job processing and their impact on RAM usage (see: https://forums.aws.amazon.com/thread.jspa?threadid=49024).
How would the answer change when shifting from a single-machine cluster to a two-machine cluster?
Thanks,
setjmp
Time has passed and no one has tried to formulate an answer, so I will put forth my own ideas in the hope that others will point out flaws if any exist.
The important thing in configuring Hadoop is not to allow too many resources to be consumed; otherwise jobs fail, and the exceptions are not helpful in determining what went wrong. Memory in particular is the resource that causes an immediate crash, and as pointed out in the question, the JVM may try to request an unnecessary amount of memory.
We must also account for processes other than map and reduce (such as the sorting that occurs between map and reduce). Unfortunately, no one has come forward with a proposal for how many such processes may exist at the same time.
So here is a proposal. Let the number of mappers be M, the number of reducers be R, and the total virtual RAM on the box be G. Allocate G/(2*M + R) of RAM to each process. The factor of 2 assumes there is one process sorting the output of each map process or performing other supporting work. Also ensure that 2*M + R < P, where P is the number of processors on the box (consider hyper-threading when computing P), to prevent context switching.
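As a rough sanity check of the formula (all numbers below are assumptions, not measurements from my box):

    P = 8 virtual cores, G = 16 GB of RAM      (assumed)
    choose M = 2, R = 2
    2*M + R = 6 < P = 8                        (no oversubscription of cores)
    per-process RAM = G / (2*M + R) = 16 / 6 ≈ 2.6 GB

In mapred-site.xml terms this would mean setting the map and reduce slot maximums to 2 each and the child heap (mapred.child.java.opts) to something like -Xmx2560m, rounded down to leave headroom for the daemons mentioned in the question.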
So far I have not taken down the box with this approach.