You can configure parameters related to the Hadoop distribution, such as split size, in the configuration file.
The number of MapReduce jobs that process the input file depends on the split size. The larger the split size, the longer the run time for a single job. If you split the input file into multiple parts based on a smaller split size, a separate job processes each part, which improves the run time.
For example, consider an input file of 112 MB. If the block size is 128 MB, HDFS stores the input file in a single block, and a single job processes the file. If you set the split size to 32 MB, the input file is split into four parts and four jobs process the file.
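For reference, the following sketch (illustrative only, not part of the product or the configuration file) reproduces the arithmetic from this example: the number of parts, and therefore jobs, is the file size divided by the split size, rounded up.

    import math

    MB = 1024 * 1024

    def split_count(file_size_bytes, split_size_bytes):
        # Number of parts (and therefore jobs): file size / split size, rounded up.
        return math.ceil(file_size_bytes / split_size_bytes)

    print(split_count(112 * MB, 128 * MB))  # 1 -> the whole file fits in one split
    print(split_count(112 * MB, 32 * MB))   # 4 -> 32 + 32 + 32 + 16 MB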
To configure the Hadoop distribution, add the following parameters to the HadoopConfiguration section in the configuration file:
JobName
Optional. Name for the configuration that you create.
MinInputSplitSize
Optional. Minimum valid size in bytes to split a file. Default is 0.
The MinInputSplitSize parameter overrides the mapred.min.split.size property of Hadoop when you run a job.
MaxInputSplitSize
Optional. Maximum valid size in bytes to split a file.
By default, the split size is equal to the HDFS block size.
The MaxInputSplitSize parameter overrides the mapred.max.split.size property of Hadoop when you run a job.
The following sample code shows the parameters for the Hadoop distribution:
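The exact element syntax depends on your configuration file format; this is a minimal sketch that assumes XML-style elements, with an illustrative job name and a 32 MB (33554432-byte) maximum split size rather than product defaults:

    <HadoopConfiguration>
        <!-- Illustrative values; replace with values for your environment. -->
        <JobName>SampleHadoopJob</JobName>
        <MinInputSplitSize>0</MinInputSplitSize>
        <MaxInputSplitSize>33554432</MaxInputSplitSize>
    </HadoopConfiguration>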