Amazon Redshift Connector Best Practices

Batch Data Integration

When the data warehouse is in the same network as the source data, data flow patterns often involve multiple interactions with the data warehouse.
Even though data is loaded to the data warehouse in batches, some mappings interact with it at the record level, for example through lookups or customized update statements. With the data warehouse on the cloud, it is important to review these record-level interactions and determine whether the read or write operations can be performed in bulk. Because Amazon Redshift is best suited for bulk insert, update, and read operations, design the mapping to perform such operations in bulk.
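As an illustration of the bulk pattern recommended above, the following sketch generates a single set-based UPDATE that applies all staged changes at once, instead of issuing one UPDATE per record. The table and column names are illustrative, not part of the connector.

```python
# Sketch of a set-based update for Amazon Redshift (table/column names are
# hypothetical). Changed rows are first loaded into a staging table; a single
# UPDATE ... FROM then applies every change in one bulk statement, which is
# the access pattern Redshift handles best.

def build_bulk_update_sql(target: str, staging: str, key: str, columns: list) -> str:
    """Return one UPDATE ... FROM statement that applies all staged changes."""
    assignments = ", ".join(f"{c} = {staging}.{c}" for c in columns)
    return (
        f"UPDATE {target} SET {assignments} "
        f"FROM {staging} WHERE {target}.{key} = {staging}.{key};"
    )

print(build_bulk_update_sql("sales", "sales_stage", "order_id", ["amount", "status"]))
```

Running thousands of row-level UPDATE statements against Redshift is slow because each statement carries full query overhead; one statement over a staging table replaces them all.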
The recommended data load pattern for the data warehouse involves staging the source data first, transforming it either before or after it is loaded to the staging location (ETL or ELT). Informatica conforms to the Amazon Redshift best practices and implements several of the recommendations in Amazon Redshift Connector, the PowerExchange for Amazon Redshift adapter, and the PowerExchange for Amazon Redshift for PowerCenter adapter.
Data fetched from on-premises and cloud sources is staged in Amazon Redshift, usually with only a few transformations, such as filters, that ensure that only the required data is fetched. By default, data is staged on Amazon S3. However, Informatica recommends that you stage the data in Amazon Redshift directly.
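The S3 staging step mentioned above ultimately results in a Redshift COPY statement, which is the standard bulk-load command for the service. The sketch below builds such a statement; the bucket, prefix, and IAM role ARN are placeholders, not values the connector actually uses.

```python
# Minimal sketch of the S3-to-Redshift bulk-load step. COPY is Redshift's
# standard bulk-ingest statement; the S3 URI and IAM role shown here are
# illustrative placeholders.

def build_copy_sql(table: str, s3_uri: str, iam_role: str) -> str:
    """Return a Redshift COPY statement that bulk-loads CSV data from S3."""
    return (
        f"COPY {table} FROM '{s3_uri}' "
        f"IAM_ROLE '{iam_role}' FORMAT AS CSV;"
    )

print(build_copy_sql(
    "stage_orders",
    "s3://example-bucket/stage/orders/",
    "arn:aws:iam::123456789012:role/redshift-load",
))
```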
Note that staging data on Amazon S3 is an internal step performed to follow the AWS performance recommendations. In this case, we are discussing the logical staging of source data in an Amazon Redshift table. You can also choose to use an SQL query when you load the data. During staging, the data is split across multiple threads and loaded to the staging area in parallel. Informatica also uses the node and slice information of the cluster to configure the number of threads.
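The slice-aware parallelism described above can be sketched as follows: derive the chunk count from the cluster's node and per-node slice counts (the values here are illustrative), then partition the records round-robin so each slice can load a chunk in parallel.

```python
# Sketch of slice-aware load parallelism (node and slice counts are
# illustrative). The number of chunks is nodes * slices_per_node, and
# records are distributed round-robin across the chunks so that every
# slice receives roughly the same volume to load in parallel.

def split_for_slices(records: list, nodes: int, slices_per_node: int) -> list:
    n_chunks = max(1, nodes * slices_per_node)
    chunks = [[] for _ in range(n_chunks)]
    for i, rec in enumerate(records):
        chunks[i % n_chunks].append(rec)
    return chunks

# A 2-node cluster with 2 slices per node yields 4 parallel load chunks.
chunks = split_for_slices(list(range(10)), nodes=2, slices_per_node=2)
```

Matching the chunk count to the total slice count follows the AWS guidance that load files be a multiple of the number of slices, so no slice sits idle during the load.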
For source data that resides on the cloud, you can use the Informatica Secure Agent. For data that resides inside the firewall, Informatica recommends that you download and install the Informatica Secure Agent locally. Once the data is in the staging area, depending on the design, it can be either processed through intermediate steps or loaded directly into the data warehouse table. You design these steps in the Mapping Designer; at run time, you can choose to run the mapping either with the Informatica Secure Agent (ETL) or with Amazon Redshift SQL (ELT).
The following image illustrates batch data integration:
While the image shows an Amazon Redshift staging storage and an intermediate Amazon Redshift storage, whether the data is staged largely depends on the data transformation patterns. In most cases, you might need to stage the data in one of the staging locations.
