Spark Engine Optimization for Sqoop Pass-Through Mappings
When you run a pass-through mapping with a Sqoop source on the Spark engine, the Data Integration Service optimizes mapping performance in the following scenarios:
You write data to a Hive target that uses the Text format.
You write data to a Hive target that was created with a custom DDL query.
You write data to a Hive target that is either partitioned with a custom DDL query, or both partitioned and bucketed with a custom DDL query.
You write data to an existing Hive target that is both partitioned and bucketed.
You write data to an HDFS target that uses the Flat, Avro, or Parquet format.
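For example, a custom DDL query for a partitioned and bucketed Hive target might resemble the following sketch. The database, table, and column names here are illustrative, not from the product documentation, and the connection URL is a placeholder:

```shell
# Hypothetical example: create a partitioned and bucketed Hive target
# with a custom DDL query. All names and the JDBC URL are illustrative.
beeline -u "jdbc:hive2://hiveserver:10000/sales_db" -e "
CREATE TABLE orders (
  order_id INT,
  amount   DECIMAL(10,2)
)
PARTITIONED BY (order_date STRING)
CLUSTERED BY (order_id) INTO 8 BUCKETS
STORED AS TEXTFILE;"
```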
To disable the performance optimization, set the --infaoptimize argument to false in the JDBC connection or the Sqoop mapping. For example, you might disable the optimization if you see data type issues after you run an optimized Sqoop mapping.
Use the following syntax:
--infaoptimize false
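For instance, the argument might be appended to the other Sqoop arguments that you configure for the connection or mapping. The connection details in this sketch are illustrative placeholders:

```shell
# Hypothetical Sqoop arguments with the performance optimization
# disabled; the JDBC URL and user name are illustrative.
--connect jdbc:mysql://dbhost:3306/sales_db --username sqoop_user --infaoptimize false
```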
Rules and Guidelines for Sqoop Spark Engine Optimization
Consider the following rules and guidelines when you run Sqoop mappings on the Spark engine:
The Data Integration Service does not optimize mapping performance in the following scenarios:
There are unconnected ports between the source and target in the mapping.
The data types of the source and target in the mapping do not match.
You write data to an existing Hive target table that is either partitioned or bucketed.
You run the mapping on an Azure HDInsight cluster that uses WASB and write data to an HDFS complex file target in the Parquet format.
The date or time data type in the Sqoop source is mapped to the timestamp data type in the Hive target.
The Sqoop source contains a decimal column and the target is a complex file.
If you configure Hive-specific Sqoop arguments to write data to a Hive target, Sqoop ignores the arguments.
If you configure a delimiter for a Hive target table that is different from the default delimiter, Sqoop ignores the delimiter.
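For instance, Hive-specific and delimiter arguments such as the following are ignored when the optimized mapping writes to a Hive target. These are standard Apache Sqoop arguments; the delimiter value shown is illustrative:

```shell
# Hypothetical argument string: these Hive-specific and formatting
# arguments are ignored by the optimization; the '|' delimiter is
# an illustrative non-default value.
--hive-import --hive-overwrite --fields-terminated-by '|'
```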