Performance Tuning and Sizing Guidelines for Informatica® Big Data Management 10.2.2

Case Study: Amazon EMR Auto-Scaling
The following case study analyzes the performance of a cluster workflow that is deployed to an ephemeral Amazon EMR cluster.

Test Setup

The study uses the following cluster workflow:
[Image: A cluster workflow in the Developer tool containing the tasks Create_Cluster, Delete_Connection, Create_Connection, Mapping_Task_Run_1, Mapping_Task_Run_2, Mapping_Task_Run_3, and Delete_Cluster.]
The cluster workflow contains three mapping tasks, Task 1, Task 2, and Task 3. The mapping tasks are identical mappings that process the same data volume. The tasks run sequentially.
The workflow is submitted to an ephemeral Amazon EMR cluster that employs auto-scaling. The auto-scaling rules set a minimum capacity of 4 nodes and a maximum capacity of 10 nodes, and add nodes to the cluster in increments of 2.
The following image shows the auto-scaling policy:
[Image: The auto-scaling configuration, listed as a hierarchy of properties. MinCapacity (4) and MaxCapacity (10) are nested under Constraints.]
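The policy shown above maps onto the Amazon EMR auto-scaling policy schema. As a minimal sketch (not taken from the study), the following Python code uses boto3 to attach an equivalent policy to a task instance group. The cluster ID, instance group ID, rule name, and the YARN-memory trigger threshold are illustrative assumptions.

  import boto3

  emr = boto3.client("emr")  # uses the default AWS region and credentials

  # Attach an auto-scaling policy to a task instance group.
  # The cluster and instance group IDs below are placeholders.
  emr.put_auto_scaling_policy(
      ClusterId="j-XXXXXXXXXXXXX",
      InstanceGroupId="ig-XXXXXXXXXXXXX",
      AutoScalingPolicy={
          # Matches the constraints shown above: 4 to 10 nodes.
          "Constraints": {"MinCapacity": 4, "MaxCapacity": 10},
          "Rules": [
              {
                  "Name": "ScaleOutOnLowYarnMemory",  # hypothetical rule name
                  "Action": {
                      "SimpleScalingPolicyConfiguration": {
                          "AdjustmentType": "CHANGE_IN_CAPACITY",
                          "ScalingAdjustment": 2,  # add nodes in increments of 2
                          "CoolDown": 300,
                      }
                  },
                  "Trigger": {
                      "CloudWatchAlarmDefinition": {
                          "ComparisonOperator": "LESS_THAN",
                          "EvaluationPeriods": 1,
                          "MetricName": "YARNMemoryAvailablePercentage",
                          "Namespace": "AWS/ElasticMapReduce",
                          # CloudWatch periods for EMR auto-scaling are
                          # defined in multiples of 300 seconds (5 minutes).
                          "Period": 300,
                          "Statistic": "AVERAGE",
                          "Threshold": 15.0,  # assumed trigger threshold
                          "Unit": "PERCENT",
                      }
                  },
              }
          ],
      },
  )

Because CloudWatch evaluation periods are defined in 5-minute multiples, scaling decisions are made at 5-minute granularity, which matches the rule behavior described in the conclusions below.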

Environment

Chipset: Intel® Xeon® Processor X5675 @ 3.2 GHz
Cores: 2 x 6 cores
Memory: 256 GB
Operating system: Red Hat Enterprise Linux 7.0
Hadoop distribution: Cloudera Enterprise 5.11.1
Hadoop cluster: 7 nodes

Performance Chart

The following performance chart shows a timeline of the workflow execution on the Amazon EMR cluster:
[Image: Timeline of the mapping tasks on the Amazon EMR cluster: a 7-minute period with no task activity at the start, followed by Task 1 at around 12.5 minutes, Task 2 at around 8 minutes, Task 3 at around 6.5 minutes, and a 2-minute period at the end during which the cluster is deleted.]
The performance chart shows the following events:
  • Creating the ephemeral cluster took 7 minutes.
  • The first mapping task ran on 4 data nodes and took around 13 minutes to complete.
  • While the first task ran, the auto-scaling policy was triggered to add 2 data nodes to the cluster, but it took around 8 minutes for the newly commissioned nodes to become available for processing.
  • By the time the additional nodes became available, the first mapping task had already completed.
  • The second mapping task leveraged the additional nodes and ran on 6 data nodes. The additional computational resources reduced the run time from around 13 minutes to around 8 minutes (see the rough check after this list).
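As a rough sanity check (an illustration, not part of the original study), if run time scaled inversely with the number of data nodes, going from 4 to 6 nodes would predict a run time of about 13 × 4 / 6 ≈ 8.7 minutes:

  # Back-of-envelope check, assuming run time scales inversely with node count.
  baseline_minutes = 13                 # Task 1 on 4 data nodes (observed)
  predicted = baseline_minutes * 4 / 6  # scale from 4 nodes to 6 nodes
  print(f"Predicted run time on 6 nodes: {predicted:.1f} minutes")  # ~8.7

The observed run time of around 8 minutes for Task 2 is consistent with near-linear scaling.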

Conclusions

Auto-scaling rules for ephemeral clusters on AWS can be defined in multiples of 5 minutes. After the auto-scaling rules are triggered, it can take between 5 and 10 minutes for the newly commissioned nodes to become available.
Based on these observations, Informatica recommends that you implement auto-scaling rules only when you process large volumes of data, such as Spark applications that run mapping jobs over multiple iterations.
