Hive and HDFS Data Sources

You can perform data movement, data domain discovery, and data masking operations on Hive and Hadoop Distributed File System (HDFS) data sources.
You can use Hive and HDFS connections in a Hadoop plan.
You can create Hive and HDFS connections in Test Data Manager and import the Hadoop data sources into a project. In a Hadoop plan, you can select Hive and HDFS connections as the source, the target, or both.
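Connection properties themselves are entered in Test Data Manager. As an illustrative sketch only, outside of TDM, you can verify that equivalent Hive connection details work by querying the server directly. The following example uses the PyHive library; the host name, port, user, and database are hypothetical placeholders:

    # Illustrative only: verify Hive connection details outside TDM.
    # Host, port, user, and database are hypothetical placeholders.
    from pyhive import hive

    conn = hive.connect(host="hadoop-node01.example.com", port=10000,
                        username="tdm_user", database="default")
    cursor = conn.cursor()
    cursor.execute("SHOW TABLES")            # list tables visible to this connection
    for (table_name,) in cursor.fetchall():
        print(table_name)
    cursor.close()
    conn.close()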
You must configure a cluster configuration in the Administrator tool before you perform TDM operations on Hive and HDFS sources. A cluster configuration is an object that contains configuration information about the Hadoop cluster. The cluster configuration enables the Data Integration Service to push mapping logic to the Hadoop environment.
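A cluster configuration is typically built from the Hadoop cluster's *-site.xml files. As a minimal sketch, assuming a hypothetical path to core-site.xml, the following example reads the kind of property name-value pairs that such a configuration contains:

    # Illustrative only: read Hadoop configuration properties from a
    # *-site.xml file. The file path is hypothetical.
    import xml.etree.ElementTree as ET

    def read_site_xml(path):
        root = ET.parse(path).getroot()
        return {p.findtext("name"): p.findtext("value")
                for p in root.iter("property")}

    props = read_site_xml("/etc/hadoop/conf/core-site.xml")
    print(props.get("fs.defaultFS"))   # for example, hdfs://namenode:8020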
The Hive database schema might contain temporary junk tables that are created when you run a mapping. The following are sample formats of junk tables in a Hive database schema:
w1413372528_infa_generatedsource_1_alpha_check
w1413372528_write_employee1_group_cast_alpha_check
Ensure that you do not select any temporary tables when you import data sources.
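The temporary table names follow a recognizable pattern: a w prefix, a numeric run identifier, and an _alpha_check suffix. As an illustrative sketch, assuming the pattern holds (it is inferred from the sample names above, not a documented naming contract), you can filter such tables out of a candidate list before import:

    import re

    # Pattern inferred from the sample junk-table names above; not a
    # documented naming contract.
    JUNK_TABLE = re.compile(r"^w\d+_.*_alpha_check$")

    def importable_tables(tables):
        """Drop temporary junk tables that mapping runs leave behind."""
        return [t for t in tables if not JUNK_TABLE.match(t)]

    tables = [
        "employee1",
        "w1413372528_infa_generatedsource_1_alpha_check",
        "w1413372528_write_employee1_group_cast_alpha_check",
    ]
    print(importable_tables(tables))   # ['employee1']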
You can create a Hadoop plan to move data from Hive, HDFS, flat files, or relational databases such as Oracle, DB2, ODBC-Sybase, and ODBC-Microsoft SQL Server into Hive or HDFS targets. You can also create a Hadoop plan to move data between Hive and HDFS sources and targets: if the source is Hive or HDFS, you can move data to either a Hive or an HDFS target. In addition, you can extract data from Hive and HDFS to a flat file in a Hadoop plan.
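The supported movement paths form a small matrix. The following sketch encodes the combinations described above as a simple lookup; it is illustrative only and not part of any TDM API:

    # Source-to-target combinations for a Hadoop plan, per the description above.
    # Illustrative check only; not a TDM API.
    HADOOP_PLAN_PATHS = {
        "hive":       {"hive", "hdfs", "flat_file"},
        "hdfs":       {"hive", "hdfs", "flat_file"},
        "flat_file":  {"hive", "hdfs"},
        "relational": {"hive", "hdfs"},  # e.g., Oracle, DB2, ODBC-Sybase, ODBC-SQL Server
    }

    def is_supported(source: str, target: str) -> bool:
        return target in HADOOP_PLAN_PATHS.get(source, set())

    assert is_supported("hdfs", "hive")
    assert is_supported("hive", "flat_file")
    assert not is_supported("relational", "flat_file")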
To run a Hadoop plan, TDM uses a Data Integration Service that is configured for pushdown optimization. When you generate and run the Hadoop plan, TDM generates the mappings, and the Data Integration Service pushes the mappings to the Hadoop cluster to improve performance. You can use a Blaze or a Hive execution engine to run Hadoop mappings. When you select an HDFS target connection, you can use the Avro or Parquet resource format to mask data.
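Masked data written to an HDFS target in Parquet format is ordinary Parquet data. As an illustrative sketch, assuming the output files have been copied to a hypothetical local path, you can inspect the result with the pyarrow library:

    # Illustrative only: inspect masked Parquet output from an HDFS target.
    # The file path is hypothetical; pyarrow must be installed.
    import pyarrow.parquet as pq

    table = pq.read_table("/tmp/masked_output/employee1.parquet")
    print(table.schema)                    # column names and types
    print(table.num_rows)                  # row count of the masked data
    print(table.slice(0, 5).to_pylist())   # peek at a few masked rows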

Hive Inplace Masking

You can perform an inplace masking operation on Hive data sources. Use a Hive or a Spark execution engine to run the mappings in the cluster. With a Hive execution engine, you can use all the data masking techniques when you perform inplace masking on Hive data sources. With a Spark engine, you cannot perform shuffle or substitution masking.
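Because the engine choice constrains the available techniques, it can help to check a plan's masking assignments against the engine up front. The following sketch captures the restriction stated above; it is illustrative only and not a TDM API:

    # Engine restrictions for Hive inplace masking, per the description above.
    # Illustrative check only; not a TDM API.
    SPARK_UNSUPPORTED = {"shuffle", "substitution"}

    def unsupported_techniques(engine: str, techniques: set) -> set:
        """Return the requested techniques that the chosen engine cannot run."""
        if engine.lower() == "spark":
            return techniques & SPARK_UNSUPPORTED
        return set()   # the Hive engine supports all masking techniques

    print(unsupported_techniques("spark", {"substitution", "randomization"}))
    # {'substitution'} -> use the Hive engine or change the technique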
Before you perform an inplace masking operation on Hive data sources, you must back up the source tables. If the data movement from the staging tables to the source tables fails, TDM truncates the source tables and data might be lost.
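One way to take that backup is a CREATE TABLE ... AS SELECT statement in Hive before the plan runs. A minimal sketch with the PyHive library, using hypothetical host and table names:

    # Illustrative only: back up a Hive source table before inplace masking.
    # Host and table names are hypothetical placeholders.
    from pyhive import hive

    conn = hive.connect(host="hadoop-node01.example.com", port=10000,
                        username="tdm_user", database="default")
    cursor = conn.cursor()
    # CTAS copies the data so the source can be restored if the
    # staging-to-source move fails and TDM truncates the source table.
    cursor.execute("CREATE TABLE employee1_backup AS SELECT * FROM employee1")
    cursor.close()
    conn.close()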