Rules and Parameters to Improve Hadoop Job Performance:
Default values below are from Hortonworks Hadoop 2.0; please check your configuration files for other distributions and versions.
io.sort.mb: Buffer size for sorting
Default Size: 100 MB
Description: This is the total memory available for sorting, per map task.
Suggestion 1: This should be around 10 times the value of io.sort.factor. For large jobs (jobs in which map output is very large), its value should be increased (considering the total memory available to each node) so that fewer spills to disk happen.
Suggestion 2: about 2x the block size.
To avoid a "task out of memory" error, set io.sort.mb to a value greater than 0.25*mapred.child.java.opts and less than 0.5*mapred.child.java.opts.
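As a sketch, the buffer could be raised in mapred-site.xml; the 200 MB value here is illustrative, not a universal recommendation (on Hadoop 2 the preferred property name is mapreduce.task.io.sort.mb, but the old name is still accepted):

```xml
<!-- mapred-site.xml: enlarge the map-side sort buffer (illustrative value) -->
<property>
  <name>io.sort.mb</name>
  <value>200</value>
</property>
```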
dfs.blocksize: File system block size
Default Size: 128 MB
Description: If input data size = 2000 GB and dfs.blocksize = 256 MB, then the minimum no. of maps = (2000*1024)/256 = 8,000 maps.
Suggestion: If you have fewer resources and a large data set, dfs.blocksize should be larger, because of the considerable overhead of map task creation.
For a large number of small files: if the files are smaller than the block size, aggregate them.
Advantage: A larger block size can potentially let a client read/write more data without interacting with the Namenode, and it also reduces the metadata size held by the Namenode, reducing Namenode load (this can be an important consideration for extremely large file systems).
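A sketch of the corresponding hdfs-site.xml entry, with 256 MB expressed in bytes (the value is illustrative):

```xml
<!-- hdfs-site.xml: raise the default block size to 256 MB = 268435456 bytes -->
<property>
  <name>dfs.blocksize</name>
  <value>268435456</value>
</property>
```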
io.sort.factor: Stream merge factor
Default Size: 10
Description: Increasing its value helps the reducer, because the rear-most collection of streams (up to io.sort.factor of them) is sent directly to the reduce function without merging, thus saving time in merging.
For a job that has a large number of maps whose output is also large, its value should be increased such that it remains about one-tenth of the value of io.sort.mb.
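For example (illustrative value, keeping roughly the 10:1 ratio with an io.sort.mb of 200 MB; on Hadoop 2 the preferred name is mapreduce.task.io.sort.factor):

```xml
<!-- mapred-site.xml: merge up to 20 spill streams at once (illustrative) -->
<property>
  <name>io.sort.factor</name>
  <value>20</value>
</property>
```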
mapred.compress.map.output: Map Output Compression
Default Value: False
Description: For large jobs, and if the compressed data is splittable, its value should be set to true: disk I/O is usually the performance bottleneck, and compressed data transfers faster between nodes.
Recommended codec: LZO.
If you are emitting sequence files for your output, you can set the mapred.output.compression.type property to control the type of compression to use. The default is RECORD, which compresses individual records. Changing this to BLOCK, which compresses groups of records, is recommended since it compresses better.
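A sketch of the relevant mapred-site.xml settings (the LZO codec class assumes the separate hadoop-lzo library is installed on the cluster):

```xml
<!-- mapred-site.xml: compress intermediate map output with LZO -->
<property>
  <name>mapred.compress.map.output</name>
  <value>true</value>
</property>
<property>
  <name>mapred.map.output.compression.codec</name>
  <value>com.hadoop.compression.lzo.LzoCodec</value>
</property>
```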
mapred.map.tasks.speculative.execution: Map/Reduce task speculative
execution
Default value: True
Description: On a busy cluster, speculative execution can reduce overall throughput, since redundant tasks are executed in an attempt to bring down the execution time of a single job.
Suggestions: For large jobs where the average task completion time is significant (> 1 hr) due to complex and large calculations, and high throughput is required, speculative execution should be set to false.
For high-priority or time-sensitive jobs, it should remain TRUE.
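A sketch disabling speculative execution for both map and reduce tasks (the reduce-side property is the MRv1 counterpart of the map-side one named above):

```xml
<!-- mapred-site.xml: disable speculative execution for long, heavy jobs -->
<property>
  <name>mapred.map.tasks.speculative.execution</name>
  <value>false</value>
</property>
<property>
  <name>mapred.reduce.tasks.speculative.execution</name>
  <value>false</value>
</property>
```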
mapreduce.reduce.shuffle.parallelcopies
Default value: 5
Description: The number of parallel threads used to copy map outputs to the reducer.
Suggestion: Its value should be roughly the square root of the number of nodes in your cluster.
Suggestion: For large jobs (jobs in which map output is very large), the value of this property can be increased, keeping in mind that it will increase total CPU usage.
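For example, on a hypothetical 100-node cluster the square-root rule above gives about 10 copier threads (illustrative value):

```xml
<!-- mapred-site.xml: ~sqrt(100 nodes) = 10 parallel shuffle copiers -->
<property>
  <name>mapreduce.reduce.shuffle.parallelcopies</name>
  <value>10</value>
</property>
```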
Other Performance Parameters:
Use the CombineFileInputFormat class in the org.apache.hadoop.mapreduce.lib.input package: encourage users to aggregate a large number of small files so that the number of mappers can be reduced (recommended especially when there is a large number of small files that will be processed more than once).
Use combiners whenever possible: a combiner is a mini reducer on the map side. It reduces the amount of data sent to the reducers. It is not required to be the same class as the reducer.
Encourage users to configure reducers in such a way that each of them generates at least 1 GB of output. This will increase performance, as the number of reducers will be smaller.
Number of task slots per node:
Mappers:Reducers = 4:3
Map slots per node = 0.75 * total no. of cores
Reduce slots per node = 0.50 * total no. of cores
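A sketch for a hypothetical 16-core node using the ratios above, via the MRv1 TaskTracker slot properties:

```xml
<!-- mapred-site.xml: 0.75*16 = 12 map slots, 0.50*16 = 8 reduce slots -->
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>12</value>
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>8</value>
</property>
```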
Feel free to contact me at: priyanks@asu.edu