Tuning and Sizing Guidelines for Data Engineering Integration (10.4.x)

Case Study: Amazon EMR Auto-Scaling

The following case study analyzes the performance of a cluster workflow that is deployed to an ephemeral Amazon EMR cluster.

Test Setup

The study uses the following cluster workflow:
This image shows a cluster workflow in the Developer tool. The workflow contains the following tasks: Create_Cluster, Delete_Connection, Create_Connection, Mapping_Task_Run_1, Mapping_Task_Run_2, Mapping_Task_Run_3, and Delete_Cluster.
The cluster workflow contains three mapping tasks: Task 1, Task 2, and Task 3. The mapping tasks are identical mappings that process the same data volume. The tasks are executed sequentially.
The workflow is submitted to an ephemeral Amazon EMR cluster that employs auto-scaling. The auto-scaling rules set the minimum capacity to 4 nodes and the maximum capacity to 10 nodes, and add nodes to the cluster in increments of 2.
The following image shows the auto-scaling policy:
This image shows an auto-scaling configuration. The configuration properties are listed in a hierarchy. The MinCapacity and MaxCapacity values are nested under Constraints. MinCapacity is set to 4, and MaxCapacity is set to 10.
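Based on the values in the image, the policy could be sketched as the following Amazon EMR auto-scaling policy JSON. Only the MinCapacity of 4, the MaxCapacity of 10, and the scaling increment of 2 come from this case study; the rule name, trigger metric, thresholds, and cooldown shown here are illustrative assumptions.

```json
{
  "Constraints": {
    "MinCapacity": 4,
    "MaxCapacity": 10
  },
  "Rules": [
    {
      "Name": "ScaleOutOnLowMemory",
      "Description": "Illustrative rule: add 2 nodes when available YARN memory runs low.",
      "Action": {
        "SimpleScalingPolicyConfiguration": {
          "AdjustmentType": "CHANGE_IN_CAPACITY",
          "ScalingAdjustment": 2,
          "CoolDown": 300
        }
      },
      "Trigger": {
        "CloudWatchAlarmDefinition": {
          "ComparisonOperator": "LESS_THAN",
          "EvaluationPeriods": 1,
          "MetricName": "YARNMemoryAvailablePercentage",
          "Namespace": "AWS/ElasticMapReduce",
          "Period": 300,
          "Statistic": "AVERAGE",
          "Threshold": 15.0,
          "Unit": "PERCENT"
        }
      }
    }
  ]
}
```

A policy in this form can be attached to an instance group, for example with the AWS CLI `aws emr put-auto-scaling-policy` command.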

Environment

Chipset: Intel® Xeon® Processor X5675 @ 3.2 GHz
Cores: 2 x 6 cores
Memory: 256 GB
Operating system: Red Hat Enterprise Linux 7.0
Hadoop distribution: Cloudera Enterprise 5.11.1
Hadoop cluster: 7 nodes

Performance Chart

The following performance chart shows a timeline of the workflow execution on the Amazon EMR cluster:
This image shows a timeline of the mapping tasks on the Amazon EMR cluster. At the beginning of the timeline, there is a 7 minute period where no tasks are executed. This period is followed by the execution of Task 1 which is around 12.5 minutes long, the execution of Task 2 which is around 8 minutes long, the execution of Task 3 which is around 6.5 minutes long, and a 2 minute period at the end that deletes the cluster.
The performance chart shows the following events:
  • Creating an ephemeral cluster took 7 minutes.
  • The first mapping task ran on 4 data nodes and took around 13 minutes to complete.
  • While the first task ran, the auto-scaling policy was triggered to add 2 data nodes to the cluster, but it took around 8 minutes for the newly commissioned nodes to become available for processing.
  • By the time that the additional nodes became available, the execution of the first mapping task was completed.
  • The second mapping task leveraged the additional nodes and ran on 6 data nodes. The additional computational resources reduced the execution run time from around 13 minutes to around 8 minutes.
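As an illustrative sanity check, not part of the original study, the observed figures can be used to estimate whether scaling out paid off in this run. The assumption that Task 3 saved about as much time as Task 2 is ours, not the source's.

```python
# Illustrative arithmetic using the figures from this case study.
# All values are minutes taken from the performance chart.
scale_up_latency = 8          # time for the 2 new nodes to become usable
runtime_on_4_nodes = 13       # Task 1 on 4 data nodes
runtime_on_6_nodes = 8        # Task 2 on 6 data nodes

savings_per_task = runtime_on_4_nodes - runtime_on_6_nodes  # 5 minutes

# The tasks run sequentially, so only tasks that start after the new
# nodes become available can benefit. Here, Tasks 2 and 3 ran on 6 nodes.
# We assume Task 3 saved roughly as much as Task 2.
tasks_that_benefit = 2
total_savings = tasks_that_benefit * savings_per_task       # 10 minutes

# Scaling out paid off only because multiple tasks ran after the
# 8-minute commissioning delay.
print(total_savings > scale_up_latency)
```

A single short-running task would not have benefited at all, which motivates the recommendation in the Conclusions section below.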

Conclusions

Auto-scaling rules for ephemeral clusters on AWS can be defined in multiples of 5 minutes. After the auto-scaling rules are triggered, it can take between 5 and 10 minutes for the additionally commissioned nodes to become available.
Based on these observations, Informatica recommends that you implement auto-scaling rules only when processing large volumes of data, such as Spark applications that run mapping jobs over multiple iterations.