Spark Engine Optimization for Sqoop Pass-Through Mappings
When you run a Sqoop pass-through mapping on the Spark engine, the Data Integration Service optimizes mapping performance in the following scenarios:
You read data from a Sqoop source and write data to a Hive target that uses the Text format.
You read data from a Sqoop source and write data to an HDFS target that uses the Flat, Avro, or Parquet format.
If you want to disable the performance optimization, set the --infaoptimize argument to false in the JDBC connection or in the Sqoop mapping. For example, you might disable the optimization if you see data type issues after you run an optimized Sqoop mapping.
Use the following syntax:
--infaoptimize false
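For example, you might append the argument after the other Sqoop arguments that the JDBC connection already specifies. In the following sketch, the driver class and connection string are placeholder values, not values from this topic:
--driver com.mysql.jdbc.Driver --connect jdbc:mysql://<host>/<db> --infaoptimize false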
Rules and Guidelines for Sqoop Spark Engine Optimization
Consider the following rules and guidelines when you run Sqoop mappings on the Spark engine:
The Data Integration Service does not optimize mapping performance in the following scenarios:
The mapping contains unconnected ports between the source and the target.
The data types of the source and target in the mapping do not match.
You write data to a partitioned Hive target table.
You run a mapping on an Azure HDInsight cluster that uses WASB to write data to an HDFS complex file target of the Parquet format.
If you configure Hive-specific Sqoop arguments to write data to a Hive target, Sqoop ignores the arguments.
If you configure a delimiter for a Hive target table that is different from the default delimiter, Sqoop ignores the delimiter.
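For example, the following Hive-specific and delimiter arguments, shown here only as an illustration, would have no effect on the Spark engine because Sqoop ignores them when writing to a Hive target:
--hive-drop-import-delims --fields-terminated-by ','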