Implementing Informatica® Big Data Management 10.2 in an Amazon Cloud Environment

Performance Best Practices

To achieve the best performance for Big Data Management on the AWS cloud, implement the following practices:
  • Create the Informatica domain on an EC2 instance in the same region as the EMR cluster.
  • When you create an EMR cluster, allocate 90% of each node's CPU vCores and memory to YARN in yarn-site.xml.
    For example, for an instance type of m4.4xlarge, with 32 vCores (90% = 29 vCores) and 64 GB memory (90% = ~58 GB):
    [{"Classification":"yarn-site","Properties":{"yarn.nodemanager.resource.cpu-vcores":"29","yarn.nodemanager.resource.memory-mb":"58000","yarn.scheduler.maximum-allocation-mb":"16384","yarn.scheduler.minimum-allocation-mb":"256","yarn.nodemanager.vmem-check-enabled":"false"},"Configurations":[]}]
  • Because HDFS data durability is not guaranteed, always use S3 buckets as persistent data storage.
  • With data residing in S3 buckets, the EMR cluster can be terminated after the job is completed, providing significant cost savings.
  • Locate S3 storage in the same region as the EMR cluster. Cross-region access is 1.5x to 4x slower. To write to an S3 bucket located in another region, enable cross-region replication for the S3 buckets.
  • If writing to an S3 bucket is slow, use a data copying utility like S3DistCp to move data from HDFS to S3.
  • The Spark shuffle service is enabled by default if you add Spark as an application when you create the EMR cluster.
  • To run Spark jobs, enable the dynamic allocation parameter in the hadoopEnv.properties file at the following path on the Data Integration Service node in EC2:
    $INFA_HOME/services/shared/hadoop/<Hadoop distribution>/infaConf
  • For large-volume data processing, set the network device ring parameters on the EMR core nodes to their maximum values. The default settings are 512 for Rx and 1024 for Tx.
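
The 90% allocation rule above can be sketched in a few lines of Python. This is only an illustration, not part of the Informatica or EMR tooling: the function name `yarn_site_configuration` and its parameters are hypothetical, and rounding to whole vCores and whole gigabytes is an assumption chosen because it reproduces the 29 vCores and 58000 MB figures from the m4.4xlarge example. The scheduler and vmem settings are copied from that example.

```python
import json


def yarn_site_configuration(total_vcores, total_memory_gb, fraction=0.9):
    """Build a 'yarn-site' configuration list that gives YARN a fraction of a node.

    Hypothetical helper: rounding to whole vCores and whole GB is an assumption
    made to match the m4.4xlarge example (32 vCores, 64 GB).
    """
    vcores = round(total_vcores * fraction)        # 32 * 0.9 = 28.8 -> 29
    memory_gb = round(total_memory_gb * fraction)  # 64 * 0.9 = 57.6 -> 58
    memory_mb = memory_gb * 1000                   # the example uses 58000, not 58 * 1024

    return [
        {
            "Classification": "yarn-site",
            "Properties": {
                "yarn.nodemanager.resource.cpu-vcores": str(vcores),
                "yarn.nodemanager.resource.memory-mb": str(memory_mb),
                # The remaining values are taken unchanged from the example above.
                "yarn.scheduler.maximum-allocation-mb": "16384",
                "yarn.scheduler.minimum-allocation-mb": "256",
                "yarn.nodemanager.vmem-check-enabled": "false",
            },
            "Configurations": [],
        }
    ]


if __name__ == "__main__":
    # Print the JSON in a form that can be pasted into the cluster configuration.
    print(json.dumps(yarn_site_configuration(32, 64), indent=2))
```

For other instance types, pass the vCore and memory figures that YARN reports for that node type, and review the scheduler limits separately, because they are not derived from the 90% rule.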
