Performance Tuning Guidelines to Read Data from or Write Data to Amazon S3

Use Hive Connection for Read and Write Operations

You can use a Hive connection to read data from and write data to Amazon S3.
When you run a mapping on the Blaze engine, a Hive connection reads and writes data directly, without creating staging files. Writing data directly to the Amazon S3 target improves performance. For the Amazon EMR distribution, Informatica recommends that you use a Hive connection instead of an Amazon S3 connection for read and write operations. The Amazon S3 bucket and the Hadoop cluster must reside in the same region.
To use a Hive connection for read and write operations on Amazon S3, perform the following steps:
  • Create a Hive table on Amazon S3
  • Configure core-site.xml
Create a Hive table on Amazon S3
Before you create a mapping to read or write data, create a Hive table for the Amazon S3 source and target. The following snippet shows the SQL command to create a sample table in Hive:
CREATE EXTERNAL TABLE lineitem (
  l_orderkey int,
  l_partkey int,
  l_suppkey int,
  l_linenumber int,
  l_quantity decimal(15,2),
  l_extendedprice decimal(15,2),
  l_discount decimal(15,2),
  l_tax decimal(15,2),
  l_returnflag string,
  l_linestatus string,
  l_shipdate string,
  l_commitdate string,
  l_receiptdate string,
  l_shipinstruct string,
  l_shipmode varchar(10),
  l_comment string
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY "|"
LOCATION 's3a://us-standard-bucket-test/S3_hive/';
The command creates a table named lineitem that points to the Amazon S3 location us-standard-bucket-test/S3_hive/. Informatica recommends that you use the s3a URI scheme to connect to Amazon S3 through a Hive connection.
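After you create the table, you can optionally verify from the Hive shell that the table metadata points to the expected s3a path and that the data is readable. The following is a minimal sketch; the SELECT assumes that delimited data files already exist under the bucket path:
-- Confirm that the table metadata points to the expected S3 location
DESCRIBE FORMATTED lineitem;
-- Read a few rows to verify that Hive can reach the s3a path
SELECT l_orderkey, l_shipdate, l_quantity FROM lineitem LIMIT 10;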
Configure core-site.xml
On each Hadoop node, update the core-site.xml file to add the following details to connect to Amazon S3:
  • Filesystem
  • S3 Endpoint
  • Access Key ID
  • Secret Access Key
The following snippet shows the sample core-site.xml:
<property>
  <name>fs.s3a.impl</name>
  <value>com.amazon.ws.emr.hadoop.fs.EmrFileSystem</value>
</property>
<property>
  <name>fs.s3a.endpoint</name>
  <value>s3.amazonaws.com</value>
</property>
<property>
  <name>fs.s3a.awsAccessKeyId</name>
  <value>***************</value>
</property>
<property>
  <name>fs.s3a.awsSecretAccessKey</name>
  <value>*****************</value>
</property>
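The property names above apply to the EMR file system implementation. On clusters that run the Apache Hadoop S3AFileSystem instead, the standard credential properties are fs.s3a.access.key and fs.s3a.secret.key. The following sketch assumes such a cluster:
<property>
  <name>fs.s3a.impl</name>
  <value>org.apache.hadoop.fs.s3a.S3AFileSystem</value>
</property>
<property>
  <name>fs.s3a.access.key</name>
  <value>***************</value>
</property>
<property>
  <name>fs.s3a.secret.key</name>
  <value>*****************</value>
</property>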
After creating the Hive table and updating the core-site.xml file, create a Hive connection. You can use the Hive connection in a mapping as a reader or writer.
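Outside of a mapping, you can also exercise both the read and the write path from the Hive shell. The following sketch assumes a hypothetical target table named lineitem_copy that mirrors the lineitem schema on a different, illustrative S3 prefix:
-- Hypothetical target table that mirrors lineitem on a different S3 prefix
CREATE EXTERNAL TABLE lineitem_copy LIKE lineitem
LOCATION 's3a://us-standard-bucket-test/S3_hive_copy/';
-- Read from one S3-backed table and write to the other
INSERT OVERWRITE TABLE lineitem_copy
SELECT * FROM lineitem;
-- Confirm the row count after the write
SELECT COUNT(*) FROM lineitem_copy;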
Case Study
The following images show the Writer and Reader performance comparison between the PowerExchange for Amazon S3 connection and Hive connection on Amazon S3:
