Table of Contents

  1. Preface
  2. Introduction to Mass Ingestion
  3. Prepare
  4. Create
  5. Deploy
  6. Run
  7. Monitor
  8. infacmd mi Command Reference

Mass Ingestion Guide

Connections

When you run mass ingestion jobs, the mass ingestion components use the following connections:
JDBC
In a mass ingestion job, a JDBC connection accesses the tables in a relational database.
The source connection that you use for a mass ingestion job must be a JDBC connection. For example, to access an Oracle schema, you must configure a JDBC connection that uses an Oracle driver to connect to an Oracle database. You cannot use an Oracle connection.
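For illustration, the following minimal sketch shows the kind of access a JDBC connection provides to an Oracle source. The URL, credentials, and query are placeholder values, not settings from this guide, and the Oracle JDBC driver must be on the classpath.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class JdbcSourceCheck {
    public static void main(String[] args) throws Exception {
        // Hypothetical Oracle thin-driver URL; host, port, and service name are placeholders.
        String url = "jdbc:oracle:thin:@//dbhost.example.com:1521/ORCLPDB1";
        try (Connection conn = DriverManager.getConnection(url, "mi_user", "mi_password");
             Statement stmt = conn.createStatement();
             // List the tables in the schema that a mass ingestion specification would read.
             ResultSet rs = stmt.executeQuery("SELECT table_name FROM user_tables")) {
            while (rs.next()) {
                System.out.println(rs.getString("table_name"));
            }
        }
    }
}
```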
Sqoop
When you configure a JDBC connection with Sqoop arguments, the work is divided between JDBC and Sqoop: JDBC imports the metadata from the relational database, and Sqoop reads the data.
If you use an incremental load to ingest data using a Sqoop connection, the Mass Ingestion Service leverages Sqoop's incremental import mode. When the Mass Ingestion Service configures the filter for incremental data, the filter is pushed down to the Sqoop source.
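As a rough sketch of what Sqoop's incremental import mode does, the following example invokes Sqoop's Java entry point with the standard incremental arguments, assuming Sqoop 1.x on the classpath. The connection string, table, and column names are placeholders; in a mass ingestion job, the Mass Ingestion Service constructs the equivalent filter for you.

```java
import org.apache.sqoop.Sqoop;

public class IncrementalImportSketch {
    public static void main(String[] args) {
        // Standard Sqoop incremental import arguments; the host, table, and
        // column names below are placeholders, not values from this guide.
        String[] sqoopArgs = {
            "import",
            "--connect", "jdbc:oracle:thin:@//dbhost.example.com:1521/ORCLPDB1",
            "--username", "mi_user",
            "--password", "mi_password",
            "--table", "ORDERS",
            "--incremental", "append",      // Sqoop's incremental import mode
            "--check-column", "ORDER_ID",   // column used to filter new rows
            "--last-value", "1000",         // high-water mark from the previous run
            "--target-dir", "/data/orders"
        };
        int exitCode = Sqoop.runTool(sqoopArgs);
        System.exit(exitCode);
    }
}
```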
If you use a Sqoop connection, consider the following limitations:
  • You cannot ingest a source table if the table metadata contains special characters.
  • You cannot ingest Blob data types.
Hadoop
A Hadoop connection allows the Data Integration Service to push mass ingestion jobs to the Hadoop environment where the jobs run on the Spark engine.
Hive
A Hive connection accesses Hive data and allows a mass ingestion job to write to a Hive target.
HDFS
An HDFS connection accesses data on the Hadoop cluster to allow a mass ingestion job to write flat-file data to the cluster.
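As a sketch of what a flat-file write to the cluster involves, the following example uses the Hadoop FileSystem API directly. The NameNode address, target path, and sample data are placeholders; in a mass ingestion job, the HDFS connection supplies the cluster details.

```java
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsFlatFileWriteSketch {
    public static void main(String[] args) throws Exception {
        // fs.defaultFS and the target path are placeholder values.
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020");
        try (FileSystem fs = FileSystem.get(conf);
             FSDataOutputStream out = fs.create(new Path("/ingest/orders/part-00000.csv"))) {
            // Write a sample flat-file record to the cluster.
            out.write("order_id,amount\n1001,250.00\n".getBytes(StandardCharsets.UTF_8));
        }
    }
}
```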
For information about connection properties, see the "Connections" appendix in the Informatica Data Engineering Integration User Guide.
