You can configure the following parameters to optimize the Amazon Redshift mapping performance:
Instance size for Informatica Data Integration Service in an Amazon EMR cluster
The Amazon EMR cluster configures each instance with the appropriate Hadoop configuration settings, such as Java memory size, number of mappers, and number of reducers. When you provision an Amazon EMR cluster instance, you must choose the instance size of your nodes carefully, because some workloads are CPU intensive while others are disk I/O or memory intensive.
Use the following guidelines to choose the right instance size for Informatica Data Integration Service in an Amazon EMR cluster:
For memory-intensive jobs, you must ensure that the m2 family instance size has enough memory and CPU power to perform the task.
For CPU-intensive jobs, you must ensure that you have a c2-series instance.
If the jobs are both memory and CPU intensive, you must ensure that you use the cg1.4xlarge or cc2.8xlarge instance, as they have suitable memory and CPU power to handle a heavy workload.
Informatica recommends that you use the pre-configured Amazon EMR cluster parameters, even though you can change your Hadoop configuration by bootstrapping the Amazon EMR cluster.
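The instance-size guidelines above can be sketched as a small lookup. This is only an illustration: the function name `recommend_instance` and its boolean parameters are hypothetical and are not part of any Informatica or AWS API. The returned strings are the instance names listed in the guidelines.

```python
from typing import List


def recommend_instance(memory_intensive: bool, cpu_intensive: bool) -> List[str]:
    """Return the instance choices named in the guidelines for a workload profile.

    Hypothetical helper for illustration only; it encodes the guidelines
    above and does not query AWS or Informatica.
    """
    if memory_intensive and cpu_intensive:
        # Jobs that are both memory and CPU intensive
        return ["cg1.4xlarge", "cc2.8xlarge"]
    if memory_intensive:
        return ["m2 family"]
    if cpu_intensive:
        return ["c2-series"]
    # The guidelines give no recommendation for other workload profiles.
    return []


if __name__ == "__main__":
    print(recommend_instance(memory_intensive=True, cpu_intensive=True))
```

A lookup like this could be kept alongside provisioning scripts so that the chosen instance size is recorded together with the workload profile that justified it.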
Amazon Redshift cluster
When you provision an Amazon Redshift cluster, you must choose the Redshift Compute Node Type and Number of Redshift Compute Nodes parameters appropriately.
Choose the Amazon Redshift cluster in the same region where the Data Integration Service machine runs.
Disk space
Amazon Redshift recommends that you use the cc2, cg1, or D2 series, which are ideal for Hadoop distributed storage and MapR distributed HDD storage. Use two disks of 2 TB each instead of two disks of 1 TB and 4 TB to reduce disk I/O, improve write performance, and minimize downtime.
Informatica recommends that you use multiple disks with smaller capacity instead of a single disk with larger capacity.
CPU clock speed
If your mappings consist of multiple CPU-bound transformations on medium or large data sets, Informatica recommends that you use a processor with a high clock speed.
Network
If you need to shuffle or sort a large amount of data, you must use cg1.4xlarge or cc2.8xlarge instances with 10 Gb of network speed. When the data is in memory, multiple applications are network bound.
To make Distributed Reduce applications such as group-bys, reduce-bys, and SQL joins perform faster, use a 10 Gb or faster network.
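To see why network speed matters for shuffle-heavy jobs, the following minimal sketch estimates the ideal time to move a given volume of data over a link. The function name `shuffle_transfer_seconds` and the 500 GB figure are illustrative assumptions, not values from Informatica or Amazon. The estimate uses decimal units and ignores protocol overhead, contention, and parallel links, so real transfers take longer.

```python
def shuffle_transfer_seconds(data_gb: float, link_gbps: float) -> float:
    """Ideal time in seconds to move data_gb gigabytes over a link_gbps link.

    Hypothetical helper for illustration only. One byte is 8 bits, so
    gigabytes * 8 gives gigabits; dividing by gigabits per second gives seconds.
    """
    if link_gbps <= 0:
        raise ValueError("link_gbps must be positive")
    return data_gb * 8 / link_gbps


if __name__ == "__main__":
    # Assumed example volume: 500 GB shuffled across the network.
    for speed in (1, 10):
        seconds = shuffle_transfer_seconds(500, speed)
        print(f"{speed} Gb network: {seconds:.0f} s")
    # prints "1 Gb network: 4000 s" then "10 Gb network: 400 s"
```

Under these assumptions, moving from a 1 Gb to a 10 Gb network cuts the ideal transfer time by a factor of ten, which is the effect the recommendation above relies on.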
You can see how much data shuffles or sorts across the network from the