Implementing Informatica® Big Data Management 10.2 in an Amazon Cloud Environment

Guidelines for Selecting EC2 Instances for the EMR Cluster

When provisioning Big Data Management on the EMR cluster, you choose from available Amazon EC2 instance types for the core and task nodes. These instance types have varying combinations of CPU, memory, storage, and networking capacity. The wide selection of Amazon EC2 instance types can make choosing the right instance type for an EMR cluster challenging.
Consider the following factors when selecting EC2 instances:
  • EC2 instance types
  • Workload types and use cases
  • Storage types
  • Cluster node types
After a short discussion of each of these categories, this section recommends EC2 instance types for each workload type.

EC2 Instance Types

Amazon refers to EC2 instance types by names that represent categories and sizes. For example, the available instances in the “M series” are m1.small, m1.medium, m1.large, and so on. The available instances in the “C series” are c1.medium, c1.xlarge, and so on.
Each instance type corresponds to an instance configured with a default amount of memory, number of cores, storage capacity and type, and other characteristics, along with a price representing the hourly cost of using the instance.
Informatica requires a minimum of 8 CPU VCores and 30 GB of memory for the product to function. This minimal configuration is intended for demo scenarios. The m4.2xlarge instance is appropriate.
For production scenarios, the recommended minimum is 16 CPU VCores with at least 30 GB of memory. The c3.4xlarge instance is appropriate.
For larger workloads, Informatica recommends a minimum of 32 CPU VCores and 60 GB of memory. The c3.8xlarge or m4.10xlarge instances are appropriate. You might also consider instances from the newer generation of compute-optimized C4 instances.
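
For illustration, the following sketch uses the AWS SDK for Python (boto3) to launch a small EMR cluster with the demo-scale m4.2xlarge instances described above. The cluster name, key pair, EMR release label, and instance counts are assumptions; adjust them to the EMR release and sizing that your Big Data Management version supports.

    import boto3

    # Assumes AWS credentials, a default region, the default EMR roles, and an
    # EC2 key pair named "bdm-demo-key" (hypothetical) already exist.
    emr = boto3.client("emr")

    response = emr.run_job_flow(
        Name="bdm-demo-cluster",                 # hypothetical cluster name
        ReleaseLabel="emr-5.8.0",                # adjust to the EMR release your BDM version supports
        Applications=[{"Name": "Hadoop"}, {"Name": "Spark"}, {"Name": "Hive"}],
        Instances={
            "MasterInstanceType": "m4.2xlarge",  # 8 VCores / 32 GB: demo-scale minimum
            "SlaveInstanceType": "m4.2xlarge",
            "InstanceCount": 3,                  # 1 master + 2 core nodes
            "Ec2KeyName": "bdm-demo-key",
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        JobFlowRole="EMR_EC2_DefaultRole",
        ServiceRole="EMR_DefaultRole",
        VisibleToAllUsers=True,
    )
    print("Cluster ID:", response["JobFlowId"])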

Mapping Workload Types and Use Cases

Mappings can be categorized using the following workload types:
CPU-bound
A mapping that is limited by the speed of the EC2 instance's CPU is CPU-bound. For example, a mapping that includes pass-through components, expression evaluations, or log parsing is typically CPU-bound.
I/O-bound
A mapping that is limited by network or disk I/O speed is I/O-bound. For example, a mapping that includes aggregations, joins, sorting, or ranking is typically I/O-bound.
Mixed
A mixed-type mapping has a combination of CPU-bound and I/O-bound characteristics, for example, a mapping that combines expression functions with cache-based transformations.

Types of Cluster Nodes

Cluster nodes can be categorized using the following types:
Master node
The master node manages the cluster, distributes tasks to core and task nodes, monitors tasks, and monitors cluster health.
Slave node
Slave nodes can be one of the following types:
  • Core nodes. Core nodes host the persistent data on Hadoop Distributed File System (HDFS) and run Hadoop tasks. Core nodes should be reserved for the capacity that is required until your cluster completes.
  • Task nodes. Task nodes do not have a persistent data store. They only run Hadoop tasks.
When you create the cluster, AWS chooses one of the EC2 instances as the master node. You can designate the number of core nodes and task nodes as part of the cluster configuration.
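
To make the node roles concrete, the following boto3 sketch declares explicit master, core, and task instance groups as part of the cluster configuration. The group names, instance types, and counts are illustrative assumptions, not sizing guidance beyond what is stated above.

    import boto3

    # Assumes AWS credentials, a default region, and the default EMR roles already exist.
    emr = boto3.client("emr")

    instance_groups = [
        {"Name": "master", "InstanceRole": "MASTER", "Market": "ON_DEMAND",
         "InstanceType": "m4.2xlarge", "InstanceCount": 1},
        # Core nodes host HDFS; keep them for the lifetime of the cluster.
        {"Name": "core", "InstanceRole": "CORE", "Market": "ON_DEMAND",
         "InstanceType": "c3.4xlarge", "InstanceCount": 4},
        # Task nodes have no persistent data store; add or remove them as needed.
        {"Name": "task", "InstanceRole": "TASK", "Market": "ON_DEMAND",
         "InstanceType": "c3.4xlarge", "InstanceCount": 2},
    ]

    emr.run_job_flow(
        Name="bdm-cluster",                      # hypothetical cluster name
        ReleaseLabel="emr-5.8.0",                # adjust to your supported EMR release
        Applications=[{"Name": "Hadoop"}],
        Instances={"InstanceGroups": instance_groups,
                   "KeepJobFlowAliveWhenNoSteps": True},
        JobFlowRole="EMR_EC2_DefaultRole",
        ServiceRole="EMR_DefaultRole",
    )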

Storage Types

Informatica tested the following types of Amazon Elastic Block Store (EBS) volumes:
Throughput-optimized HDD
Throughput-optimized hard disk drives (HDD) are inexpensive magnetic storage, optimized for frequently accessed, throughput-intensive workloads. Informatica tests showed throughput of up to 160 MB/sec.
This volume type is a good fit for large, sequential workloads such as Amazon EMR, ETL, data warehouses, and log processing. Recommended if minimizing cost is an important concern.
Transformation-specific tests showed that HDD is 8-10% slower than general purpose SSD.
General purpose SSD
General-purpose solid state drives (SSD), also known as GP2 volumes, are intended to balance low cost and performance for a variety of transactional workloads. GP2 volumes showed throughput of up to 200 MB/sec for sequential read and write operations. Recommended for best performance for most batch use cases.
Provisioned IOPS SSD
Provisioned input-output operations-per-second SSD, also known as IO1 drives, are intended for mission-critical applications that require high performance. IO1 volumes support up to 20,000 IOPS and throughput of 330 MB/sec for sequential read and write operations. They are the fastest option and tend to be the most expensive.
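
As an illustration of how these volume types are requested when you provision a cluster, the following sketch shows an EbsConfiguration block for a core instance group, using the standard AWS volume type codes: st1 for throughput-optimized HDD, gp2 for general-purpose SSD, and io1 for provisioned IOPS SSD. The group name, instance type, volume sizes, and IOPS value are illustrative assumptions.

    # A core instance group definition with additional EBS volumes attached.
    # Pass this dict inside Instances={"InstanceGroups": [...]} when calling
    # run_job_flow with boto3 (see the earlier sketch).
    core_group_with_ebs = {
        "Name": "core-with-ebs",
        "InstanceRole": "CORE",
        "Market": "ON_DEMAND",
        "InstanceType": "c3.4xlarge",
        "InstanceCount": 4,
        "EbsConfiguration": {
            "EbsBlockDeviceConfigs": [
                {
                    # Throughput-optimized HDD (st1): lowest cost, good for
                    # large sequential ETL workloads.
                    "VolumeSpecification": {"VolumeType": "st1", "SizeInGB": 500},
                    "VolumesPerInstance": 2,
                },
                # For general-purpose SSD, use "gp2"; for provisioned IOPS SSD,
                # use "io1" and also set "Iops", for example:
                # {"VolumeSpecification": {"VolumeType": "io1", "SizeInGB": 500,
                #                          "Iops": 10000},
                #  "VolumesPerInstance": 1},
            ]
        },
    }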

Recommendations for Choosing EC2 Instances

The following recommendations for each workload type are supported by Informatica performance testing:
CPU-bound mapping and pass-through mapping
Use C series instances for task nodes. Task nodes are cheaper than core nodes and do not have a persistent data store.
I/O-bound mapping
For processing a low volume of data, up to 5 TB, use core nodes with default storage. For example, the d2.2xlarge instance has default storage of 6 x 2000 GB HDD. In general, HDD storage is appropriate for I/O-bound mapping workloads.
For a large volume of data (10 TB or higher), add additional EBS HDD volumes to core nodes.
Mixed load
Use core nodes with additional EBS HDD volumes, and dynamically add task nodes for computational needs to meet the cluster's varying capacity requirements, as shown in the sketch below.
General-purpose SSD volumes are faster than HDD but more expensive.
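
For the mixed-load case, the following boto3 sketch adds task nodes to a running cluster. The cluster ID is a placeholder, and the group name, instance type, and count are illustrative assumptions.

    import boto3

    # Assumes AWS credentials and a default region are configured, and that the
    # target cluster is already running.
    emr = boto3.client("emr")

    emr.add_instance_groups(
        JobFlowId="j-XXXXXXXXXXXXX",          # placeholder cluster ID
        InstanceGroups=[
            {
                "Name": "extra-task-nodes",
                "InstanceRole": "TASK",       # compute only, no HDFS
                "Market": "ON_DEMAND",
                "InstanceType": "c3.4xlarge",
                "InstanceCount": 4,
            }
        ],
    )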
