Table of Contents

  1. Abstract
  2. Supported Versions
  3. Implementing Informatica® Big Data Management 10.2 in an Amazon Cloud Environment

Implementing Informatica® Big Data Management 10.2 in an Amazon Cloud Environment

Guidelines and Recommendations for Utilizing Clusters and Storage

Recommendation for Cluster Architecture

Amazon EMR has two cluster types: transient and persistent. Transient or ephemeral clusters load input data, process the data, store the output results in a persistent data store, and then automatically shut down. Persistent clusters continue to run even after data processing is complete.
The following qualities characterize each cluster type:
Transient clusters
Launch transient or ephemeral clusters for data processing only, then transfer the mapping results to S3 storage. Use a script to launch a transient cluster for each mapping run and terminate it when the data transfer is complete, as in the sketch after the recommendation below.
For more information, see Ephemeral Clusters.
Persistent clusters
Persistent clusters are always available for processing or for storing data that requires quick access. Each cluster node is charged by the second, so costs accumulate quickly.
Recommendation: Informatica recommends a transient cluster architecture, in which you retain data in a persistent store such as S3 or Redshift and use the EMR cluster only for processing.
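
The following is a minimal boto3 sketch of such a launch-and-terminate script. It is not part of the Informatica product; the cluster name, instance types, IAM roles, S3 path, and EMR release are all placeholders to adapt to your environment. Setting KeepJobFlowAliveWhenNoSteps to False is what makes the cluster transient: EMR terminates it automatically once all steps complete.

    import boto3

    emr = boto3.client("emr", region_name="us-east-1")

    response = emr.run_job_flow(
        Name="bdm-transient-mapping-run",      # hypothetical cluster name
        ReleaseLabel="emr-5.8.0",              # example EMR release
        Applications=[{"Name": "Hadoop"}, {"Name": "Spark"}],
        Instances={
            "MasterInstanceType": "m4.xlarge",
            "SlaveInstanceType": "m4.xlarge",
            "InstanceCount": 3,
            # False => EMR shuts the cluster down as soon as all steps
            # finish, which is what makes it transient.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        Steps=[
            {
                "Name": "run-mapping",
                "ActionOnFailure": "TERMINATE_CLUSTER",
                "HadoopJarStep": {
                    "Jar": "command-runner.jar",
                    # Placeholder command; in practice this would invoke
                    # the mapping run and write its results to S3.
                    "Args": ["spark-submit",
                             "s3://my-bucket/jobs/mapping_job.py"],
                },
            }
        ],
        JobFlowRole="EMR_EC2_DefaultRole",
        ServiceRole="EMR_DefaultRole",
    )
    print("Launched transient cluster:", response["JobFlowId"])

Because the step's ActionOnFailure is TERMINATE_CLUSTER, the cluster also shuts down if the step fails, so per-second node charges stop either way.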

Guidelines for Using S3 Storage

Consider the following guidelines for using S3 storage:
  • Avoid uploading many small files; upload a smaller number of larger files instead. Reducing the number of files stored on Amazon S3 or on HDFS improves performance when a mapping processes data on Amazon EMR.
  • Use data compression for data sources and targets. Compression reduces S3 storage costs and bandwidth costs.
  • Partition the data. Partitioning creates distinct buckets of data and lets a processing job read only the partitions it needs instead of the entire data set.
  • Use the AWS multipart upload feature (for example, the --multipart-chunk-size-mb option) to upload large files (>100 MB) to S3. The default chunk size is 15 MB; the minimum allowed chunk size is 5 MB and the maximum is 5 GB. In Informatica testing, multipart upload improved the upload of a 77 GB file to an S3 bucket by 10-12%. A scripted equivalent is sketched after this list.
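
If you script uploads rather than using a command-line tool, boto3 exposes equivalent multipart controls through TransferConfig. The following is a minimal sketch; the bucket, key, and local file name are placeholders, and the 64 MB part size is only an example within the allowed 5 MB to 5 GB range.

    import boto3
    from boto3.s3.transfer import TransferConfig

    # Multipart settings: files above the threshold are split into
    # parts of multipart_chunksize bytes and uploaded in parallel.
    config = TransferConfig(
        multipart_threshold=100 * 1024 * 1024,  # multipart for files > 100 MB
        multipart_chunksize=64 * 1024 * 1024,   # 64 MB parts (minimum 5 MB)
        max_concurrency=10,                      # parts uploaded in parallel
    )

    s3 = boto3.client("s3")
    s3.upload_file(
        Filename="/data/output/results.csv",    # hypothetical local file
        Bucket="my-bucket",                     # placeholder bucket
        Key="bdm/output/results.csv",           # placeholder key
        Config=config,
    )

Larger part sizes mean fewer parts and less per-part overhead for big files, which is the same trade-off the --multipart-chunk-size-mb option controls.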
