Configuring the Hadoop Distribution Parameters
You can configure parameters related to the Hadoop distribution, such as split size, in the configuration file.
The number of MapReduce jobs that process the input file depends on the split size. The larger the split size, the longer the run time for a single job. If you split the input file into multiple parts based on the split size, a separate job processes each part in parallel, which reduces the overall run time.
For example, suppose the input file size is 112 MB. If the block size is 128 MB, HDFS stores the input file in a single block, and a single job processes the file. If you set the split size to 32 MB, the input file is split into four parts, and four jobs process the file.
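The arithmetic behind this example can be sketched as follows. This is an illustrative calculation, not part of Hadoop itself; the function name is hypothetical:

```python
import math

def num_input_splits(file_size_bytes, split_size_bytes):
    """Return the number of input splits, and therefore the number of
    map jobs, needed to cover a file of the given size."""
    return math.ceil(file_size_bytes / split_size_bytes)

MB = 1024 * 1024

# A 112 MB file with a 128 MB block size fits in one block: one job.
print(num_input_splits(112 * MB, 128 * MB))  # 1

# The same file with a 32 MB split size: four parts, four jobs.
print(num_input_splits(112 * MB, 32 * MB))   # 4
```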
To configure the Hadoop distribution, add the following parameters to the section in the configuration file:
- Optional. Name for the configuration that you create.
- Optional. Minimum valid size in bytes to split a file. Default is 0. The MinInputSplitSize parameter overrides the mapred.min.split.size property of Hadoop when you run a job.
- Optional. Maximum valid size in bytes to split a file. By default, the split size is equal to the HDFS block size. The MaxInputSplitSize parameter overrides the mapred.max.split.size property of Hadoop when you run a job.
The following sample code shows the parameters for the Hadoop distribution:
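A minimal sketch of such a fragment, assuming an XML-style configuration file; the element names follow the parameters described above, but the exact schema and the configuration name are illustrative assumptions:

```xml
<!-- Illustrative Hadoop distribution parameters; element names assumed -->
<HadoopDistribution name="MyHadoopConfig">
    <!-- Overrides mapred.min.split.size; 0 is the default -->
    <MinInputSplitSize>0</MinInputSplitSize>
    <!-- Overrides mapred.max.split.size; 33554432 bytes = 32 MB -->
    <MaxInputSplitSize>33554432</MaxInputSplitSize>
</HadoopDistribution>
```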