Table of Contents

  1. Preface
  2. Introduction to Mass Ingestion
  3. Prepare
  4. Create
  5. Deploy
  6. Run
  7. Monitor
  8. Appendix A: infacmd mi Command Reference

Mass Ingestion Guide

Hive Target
Configure a Hive target to ingest source data to Hive target tables.
When you configure the mass ingestion specification to ingest data to a Hive target, you configure a Hive connection and Hive properties to define the target.
You can ingest data to an internal or external Hive table. Internal Hive tables are managed by Hive. External Hive tables are unmanaged tables. You can specify an external location for an external Hive table such as Amazon S3, Microsoft Azure Data Lake Store, or HBase.
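For readers unfamiliar with the distinction, the following HiveQL sketch shows how Hive itself expresses the two table types. The table names, columns, and the S3 bucket are illustrative assumptions, not names that the Mass Ingestion tool generates.

    -- Internal (managed) table: Hive manages the data lifecycle, and
    -- DROP TABLE also deletes the underlying data files.
    CREATE TABLE sales_internal (
        id     INT,
        amount DECIMAL(10,2)
    );

    -- External (unmanaged) table: the data lives at an external location,
    -- such as Amazon S3, and DROP TABLE removes only the table metadata.
    CREATE EXTERNAL TABLE sales_external (
        id     INT,
        amount DECIMAL(10,2)
    )
    LOCATION 's3a://example-bucket/warehouse/sales_external/';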
If you enable incremental load in the definition of the mass ingestion specification, you must configure incremental load options for the Hive target to select a mode to ingest the data. You can also choose to propagate schema changes on the source.
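In Hive terms, the two incremental load modes are conceptually comparable to the difference between INSERT INTO and INSERT OVERWRITE, and a propagated schema change is comparable to an ALTER TABLE statement. The following sketch assumes an illustrative table named product; it describes the observable behavior, not the statements that the tool actually runs.

    -- Append mode: incremental rows are added to the existing target data.
    INSERT INTO TABLE product SELECT * FROM product_increment;

    -- Overwrite mode: incremental data replaces the existing target data.
    INSERT OVERWRITE TABLE product SELECT * FROM product_increment;

    -- Propagating a newly added source column is comparable to:
    ALTER TABLE product ADD COLUMNS (discount DECIMAL(10,2));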
The following image shows the Target page for a Hive target:
[Image: the Target page of the mass ingestion specification for a Hive target, showing the properties that define the Hive target and, at the bottom of the page, a section for Incremental Load Options.]
The following table describes the properties that you can configure to define the Hive target:
Target Connection
  Required. The Hive connection used to access the Hive storage target. If changes are made to the available Hive connections, refresh the browser or log out and log back in to the Mass Ingestion tool.
Target Schema
  Required. The schema that defines the target tables.
Target Table Prefix
  The prefix added to the names of the target tables. Enter a string of alphanumeric and underscore characters. The prefix is not case sensitive. For example, with the prefix MI_, a source table named PRODUCT is ingested to a target table named MI_PRODUCT.
Target Table Suffix
  The suffix added to the names of the target tables. Enter a string of alphanumeric and underscore characters. The suffix is not case sensitive.
Hive Options
  Select this option to configure the Hive target location.
DDL Query
  Select this option to configure a custom DDL query that defines how data from the source tables is loaded to the target tables. An illustrative sample query appears after this table.
Storage Format
  Required. The storage format of the target tables. You can select Cluster default, Text, Avro, Parquet, or ORC. Default is Cluster default. If you select Cluster default, the specification uses the default storage format on the Hadoop cluster.
External Table
  Select this option if the table is external.
External Location
  The external location of the Hive target. By default, tables are written to the default Hive warehouse directory. A sub-directory is created under the specified external location for each source that is ingested. For example, if you enter /temp, a source table named PRODUCT is ingested to the external location /temp/PRODUCT/.
Mode
  Required if you enable incremental load. Select Append or Overwrite. Append mode appends the incremental data to the target. Overwrite mode overwrites the data in the target with the incremental data. Default is Append.
Propagate schema changes on the source
  Optional. If new columns are added to the source tables or existing columns are changed, the changes are propagated to the target tables.
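The following HiveQL sketch shows the kind of statement a custom DDL query might define, combining the storage format and external location options described above with an illustrative partition column. The table name and columns are assumptions; refer to the product documentation for the placeholder syntax that substitutes source table names at run time.

    -- An illustrative custom DDL query for one target table.
    CREATE EXTERNAL TABLE IF NOT EXISTS PRODUCT (
        product_id   INT,
        product_name STRING,
        updated_at   TIMESTAMP
    )
    PARTITIONED BY (ingest_date STRING)
    STORED AS PARQUET
    LOCATION '/temp/PRODUCT/';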
Configure partition and cluster properties for specific target tables when you configure the transformation override.
When you ingest to a Hive target, consider the following guidelines:
  • If a source table is ingested to a Hive target and the name of the source table contains a Hive reserved keyword, the data in the source table is ingested to a target table that has a randomly generated name.
  • A source table cannot be ingested into Hive if the table metadata contains UTF-8 characters. To resolve the issue, configure the Hive metastore for UTF-8 data processing.
  • A source table cannot be ingested to an Avro file in a Hive target if the source table contains a column with a timestamp data type or if the incremental load is configured with a timestamp key. To ingest timestamp data to an Avro file, the third-party Hive JDBC driver must support a Hive version later than 1.1.
  • When you run a full load to ingest data to a Hive target in an external location, all rows in the source table are added to the target Hive table on every run. For example, if the source table contains 500 rows and you run a full load twice, the Hive table contains 1,000 rows. To reset the table, you must clear the data in the external location, as shown in the sketch after this list.
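Because TRUNCATE TABLE applies only to managed tables in most Hive versions, resetting an external target means removing the data files themselves. A minimal sketch, reusing the /temp/PRODUCT example from this page:

    -- Works for an internal (managed) table only; fails on an external
    -- table in most Hive versions.
    TRUNCATE TABLE PRODUCT;

    -- For an external table, drop the metadata, then remove the data files
    -- at the external location outside of Hive, for example with:
    --     hadoop fs -rm -r /temp/PRODUCT
    DROP TABLE IF EXISTS PRODUCT;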
