Many organizations want to build data lakes and enterprise data warehouses on Hadoop clusters to perform near-real-time analytics driven by business requirements. Building a data lake on a Hadoop cluster requires a one-time initial load from legacy warehouse systems, followed by frequent incremental loads. In most cases, Hive is the preferred analytic store.
Although Hive versions 0.13 and later support transactions, these transactions pose challenges for incremental loads: ACID compliance is limited, and transactional tables must be stored in the ORC file format and be bucketed.
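For illustration, the following is a minimal sketch of the DDL such a transactional table requires on those Hive versions; the table and column names (customer_target, customer_id, and so on) are hypothetical:

```sql
-- Hypothetical target table; Hive 0.13-2.x ACID requires ORC storage,
-- bucketing, and the transactional table property.
CREATE TABLE customer_target (
  customer_id INT,
  name        STRING,
  email       STRING
)
CLUSTERED BY (customer_id) INTO 8 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional' = 'true');

-- Sessions that read or write the table typically also need:
-- SET hive.support.concurrency = true;
-- SET hive.txn.manager = org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
```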
This article describes strategies for updating Hive tables to support incremental loads and keep targets in sync with source systems.
Informatica Big Data Management supports the following methods to perform incremental updates:
- Update Strategy transformation
- Update Strategy transformation using the MERGE statement (see the sketch after this list)
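As a rough illustration of the MERGE-based approach, the sketch below upserts change records from a staging table into the transactional target shown earlier. The table and column names (customer_target, customer_updates) are hypothetical, and Hive's MERGE statement requires Hive 2.2 or later with an ACID target table:

```sql
-- Hypothetical upsert: apply staged incremental records to the ACID target.
MERGE INTO customer_target AS t
USING customer_updates AS s          -- staging table with incremental records
ON t.customer_id = s.customer_id
WHEN MATCHED THEN
  UPDATE SET name = s.name, email = s.email
WHEN NOT MATCHED THEN
  INSERT VALUES (s.customer_id, s.name, s.email);
```

A single MERGE handles both updates and inserts in one pass over the staging data, which is why it is often preferred over separate UPDATE and INSERT statements for incremental loads.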