Performance Tuning and Sizing Guidelines for Informatica® Big Data Management 10.2.2

MapR Issue: Slowness Due to File Copy from tmp to Target Directory

Consider the following troubleshooting tips for the MapR distribution.

Copying a file from the tmp directory to the target

While a job is in progress, Spark creates an intermediate target directory to which all processing tasks write data. Each task creates its own temporary file in this intermediate directory.

Source splits are determined by the dfs.blockSize setting. For example, a 256 MB block size results in 4 splits per GB of data. The number of files written to the tmp directory depends on the number of source partitions for non-shuffle mappings, and on the number of shuffle partitions for mappings that shuffle data. Due to a limitation in MapR, when the number of partitions is large, the sequential copying of files from the temporary directory to the target can slow performance. The YARN logs contain a warning message for each copied file.
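The split arithmetic above can be illustrated with a minimal sketch. The function names and the partition counts in the usage lines are hypothetical and chosen only for illustration. Actual split counts also depend on file boundaries and the input format, so treat the result as a rough estimate rather than an exact figure.

```python
from typing import Optional

MB = 1024 * 1024
GB = 1024 * MB


def estimate_source_splits(data_size_bytes: int, block_size_bytes: int) -> int:
    """Rough estimate: one split per block, rounded up (integer ceiling)."""
    if data_size_bytes < 0:
        raise ValueError("data size must not be negative")
    if block_size_bytes <= 0:
        raise ValueError("block size must be positive")
    return -(-data_size_bytes // block_size_bytes)


def estimate_tmp_files(source_splits: int,
                       shuffle_partitions: Optional[int] = None) -> int:
    """Rough estimate of the files written to the tmp directory.

    Non-shuffle mapping: one file per source partition.
    Shuffle mapping: one file per shuffle partition.
    """
    if shuffle_partitions is None:
        return source_splits
    return shuffle_partitions


# A 256 MB block size gives 4 splits per GB of data.
splits_per_gb = estimate_source_splits(1 * GB, 256 * MB)

# Hypothetical 1000 GB source, non-shuffle mapping.
files_non_shuffle = estimate_tmp_files(estimate_source_splits(1000 * GB, 256 * MB))

# Same source with a hypothetical 200 shuffle partitions.
files_shuffle = estimate_tmp_files(4000, shuffle_partitions=200)

print(splits_per_gb, files_non_shuffle, files_shuffle)
```

Because the files are copied sequentially, the estimated file count gives a rough sense of how many copy operations the job performs at the end.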

For example, the YARN logs might contain a warning like the following message:

maprfs:///user/hive/warehouse/tpch_text_1000.db/lineitem_tgt/.hive-staging_hive_2018-12-20_13-13-45_870_781706737290042111-1/-ext-10000/part-00953-4b1fe48f-a41b-4525-ae7a-662d08cb5963-c000 to maprfs:/user/hive/warehouse/tpch_text_1000.db/lineitem_tgt/part-00953-4b1fe48f-a41b-4525-ae7a-662d08cb5963-c000 because HDFS encryption zones are different
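To gauge how many files were copied this way, one option is to count these warnings in a saved copy of the application logs. The following is a minimal sketch that assumes the warning text always ends with the phrase shown in the sample message. The function name and the sample lines are hypothetical.

```python
from typing import Iterable

# Phrase taken from the sample warning; assumed to appear in every such warning.
WARNING_MARKER = "because HDFS encryption zones are different"


def count_copy_warnings(log_lines: Iterable[str]) -> int:
    """Count the log lines that contain the file-copy warning phrase."""
    return sum(1 for line in log_lines if WARNING_MARKER in line)


# Hypothetical log excerpt used only to exercise the function.
sample_lines = [
    "INFO starting job",
    "maprfs:///tmp/stage/part-00001 to maprfs:/tgt/part-00001 "
    "because HDFS encryption zones are different",
    "maprfs:///tmp/stage/part-00002 to maprfs:/tgt/part-00002 "
    "because HDFS encryption zones are different",
    "INFO job finished",
]

warning_count = count_copy_warnings(sample_lines)
print(warning_count)
```

For a real log, the same function can be given an open file object, for example a file produced by saving the aggregated YARN logs for the application. A count close to the number of partitions suggests that the sequential copy is responsible for the slowdown.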
