To conduct near real-time analytics in data lakes and enterprise data warehouses, it is necessary to perform frequent incremental loads. In most cases, Hive is the preferred analytic store.
To perform incremental loads on Hive, you must use ACID transactions, which require bucketed tables stored in the ORC file format.
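For reference, a Hive table that meets these requirements might be defined as follows. The table name, columns, and bucket count are illustrative only:

CREATE TABLE customer_target (
  id INT,
  name STRING,
  amount DECIMAL(10,2)
)
CLUSTERED BY (id) INTO 8 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional'='true');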
The following issues constrain incremental loads:
Cloudera CDH does not support ORC file formats or ACID-enabled tables.
Updates are not compatible with partitioned tables on Amazon EMR due to a Hive issue.
To work around these constraints, you can implement alternate strategies to perform incremental loads. One strategy is to perform an ACID merge through Hive MERGE statements in an Update Strategy transformation.
When you use MERGE statements in mappings that run on the Spark engine, the statements perform ACID transactions on Hive tables. You can use the statements to efficiently perform record-level INSERT, UPDATE, and DELETE operations.
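For reference, the following is a simplified sketch of the kind of MERGE statement that Hive runs for such an operation. The table names, join key, and operation flag are illustrative and do not represent the exact statement that the transformation generates:

MERGE INTO customer_target AS t
USING customer_staging AS s
ON t.id = s.id
WHEN MATCHED AND s.op_type = 'DELETE' THEN DELETE
WHEN MATCHED THEN UPDATE SET name = s.name, amount = s.amount
WHEN NOT MATCHED THEN INSERT VALUES (s.id, s.name, s.amount);

A single statement of this form applies record-level inserts, updates, and deletes against the ACID-enabled target table.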
To use Hive MERGE statements, select the option in the advanced properties of the Update Strategy transformation. You can toggle the option to use a MERGE statement or a traditional update strategy approach. The traditional update strategy approach does not employ ACID transactions.
The following image shows the Hive MERGE option in the advanced properties of an Update Strategy transformation:
When you enable Hive MERGE in an Update Strategy transformation, the following changes in performance occur:
~ 25% decrease in performance for INSERT statements.
~ 40% increase in performance for UPDATE statements.
Comparable performance for DELETE statements.
Based on this analysis, Hive MERGE performs better when the update strategy task involves a large number of UPDATE statements. When the task involves a combination of INSERT, UPDATE, and DELETE statements, Hive MERGE is also expected to produce better results than the traditional update strategy approach.
When you use Hive MERGE statements to perform update strategy tasks, the mapping might fail at run time.
The MERGE statement internally performs a map join during transactional operations on Hive tables. Because of this behavior, mapping tasks that contain an Update Strategy transformation and a partitioned, bucketed Hive table can fail at run time.
Task failures occur due to the following error:
java.lang.OutOfMemoryError: Java heap space
The following image shows a log file that contains the error:
If you encounter the error, you can use one of the following workarounds:
Option 1. Disable map join.
To disable map join, set hive.auto.convert.join=false in the Environment SQL of the Hive connection properties.
Option 2. Reduce the map join threshold.
If the map join threshold has been modified, reduce it. The default value is 10 MB. Set a lower threshold value in the Environment SQL of the Hive connection properties.
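For reference, the Environment SQL for these workarounds might look like the following. The threshold property shown for Option 2 is an assumption based on the 10 MB default of hive.auto.convert.join.noconditionaltask.size; verify the property and value that apply to your Hive version:

-- Option 1: disable map join entirely.
set hive.auto.convert.join=false;

-- Option 2 (assumed property): lower the map join threshold below the 10 MB default.
set hive.auto.convert.join.noconditionaltask.size=5000000;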
In addition to Hive MERGE, you can use the following alternate strategies to perform incremental loads:
Use a traditional update strategy approach.
Define expressions in an Update Strategy transformation with IIF or DECODE functions to set rules for updating rows.
Because the approach has restrictions on Cloudera CDH and Amazon EMR distributions, you might want to use a different strategy.
Perform updates using the partition merge solution.
Detect, stage, and update only the impacted partitions based on the incremental records from upstream systems, as shown in the sketch after this list.
Perform updates using key-value stores.
Use inserts into key-value stores. The inserts are treated as upserts, so they are a convenient choice for handling updates.
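The following is a minimal sketch of the partition merge pattern, assuming a target table sales_target partitioned by sale_date and an incremental staging table sales_incremental. The table and column names are illustrative:

-- Allow all partition values to be resolved dynamically.
SET hive.exec.dynamic.partition.mode=nonstrict;

-- Rebuild only the partitions that the incremental records touch.
-- Incremental values win over existing target values for matching keys.
-- In practice, also filter sales_target to the impacted partitions to limit the scan.
INSERT OVERWRITE TABLE sales_target PARTITION (sale_date)
SELECT
  COALESCE(i.id, t.id) AS id,
  COALESCE(i.amount, t.amount) AS amount,
  COALESCE(i.sale_date, t.sale_date) AS sale_date
FROM sales_target t
FULL OUTER JOIN sales_incremental i
  ON t.id = i.id
WHERE COALESCE(i.sale_date, t.sale_date) IN
  (SELECT DISTINCT sale_date FROM sales_incremental);

Partitions that do not appear in the incremental data are not rewritten.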
For detailed information on these alternate approaches, see the Informatica® How-To Library article "Strategies for Incremental Updates on Hive in Big Data Management 10.2.1" on https://docs.informatica.com/.
Update Strategy Transformations Using Relational Table Targets
To tune the performance of an Update Strategy transformation with JDBC-connected relational table targets, use the following properties:
Spark.JdbcNumPartitions
The number of partitions to write to the relational database.
Each Spark task opens a connection to the database. This might result in too many connections. Use this property to limit the number of open connections to the database.
To calculate the value of Spark.JdbcNumPartitions, use the formula
Min(X,Y)
where:
X = (Data volume in GB) * 4
and
Y = (Max connections allowed on database)
For example, where Data volume = 2 GB, and Max connections allowed on database = 10, the formula is:
Min(8,10)
The value of Spark.JdbcNumPartitions should be 8.
Spark.JdbcBatchSize
Number of rows to be sent to the database target as a batch. Must be an integer. Default: 10000.
When the number of columns/fields is large and results in a large row size, consider reducing the value of this property to ~5000-8000.