Performance Tuning Guidelines to Read Data from or Write Data to Amazon S3

Use Hive Connection for Read and Write Operations

You can use a Hive connection to read data from and write data to Amazon S3.
When you run a mapping on the Blaze engine, a Hive connection reads and writes data directly, without creating staging files. Writing data directly to the Amazon S3 target improves performance. For the Amazon EMR distribution, Informatica recommends that you use a Hive connection instead of an Amazon S3 connection for read and write operations. The Amazon S3 bucket and the Hadoop cluster must reside in the same region.
To use a Hive connection for read and write operations on Amazon S3, perform the following steps:
  • Create a Hive table on Amazon S3
  • Configure core-site.xml
Create a Hive table on Amazon S3
Before you create a mapping to read or write data, create a Hive table for the Amazon S3 source and target. The following snippet shows the SQL command to create a sample table in Hive:
CREATE EXTERNAL TABLE lineitem (
  l_orderkey int,
  l_partkey int,
  l_suppkey int,
  l_linenumber int,
  l_quantity decimal(15,2),
  l_extendedprice decimal(15,2),
  l_discount decimal(15,2),
  l_tax decimal(15,2),
  l_returnflag string,
  l_linestatus string,
  l_shipdate string,
  l_commitdate string,
  l_receiptdate string,
  l_shipinstruct string,
  l_shipmode varchar(10),
  l_comment string
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY "|"
LOCATION 's3a://us-standard-bucket-test/S3_hive/';
The command creates a table named lineitem that points to the Amazon S3 location us-standard-bucket-test/S3_hive/. Informatica recommends that you use the s3a URI scheme to connect to Amazon S3 through a Hive connection.
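After you create the table, you can optionally verify from the Hive shell that the table metadata points to the expected s3a path and that the data is readable. The following is a minimal sketch; the SELECT assumes that delimited data files already exist under the bucket path:
-- Confirm that the table metadata points to the expected S3 location
DESCRIBE FORMATTED lineitem;
-- Read a few rows to verify that Hive can reach the s3a path
SELECT l_orderkey, l_shipdate, l_quantity FROM lineitem LIMIT 10;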
Configure core-site.xml
On each Hadoop node, update the core-site.xml file to add the following details to connect to Amazon S3:
  • Filesystem
  • S3 Endpoint
  • Access Key ID
  • Secret Access Key
The following snippet shows the sample core-site.xml:
<property>
  <name>fs.s3a.impl</name>
  <value>com.amazon.ws.emr.hadoop.fs.EmrFileSystem</value>
</property>
<property>
  <name>fs.s3a.endpoint</name>
  <value>s3.amazonaws.com</value>
</property>
<property>
  <name>fs.s3a.awsAccessKeyId</name>
  <value>***************</value>
</property>
<property>
  <name>fs.s3a.awsSecretAccessKey</name>
  <value>*****************</value>
</property>
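The property names above apply to the EMR file system implementation. On clusters that run the Apache Hadoop S3AFileSystem instead, the standard credential properties are fs.s3a.access.key and fs.s3a.secret.key. The following sketch assumes such a cluster:
<property>
  <name>fs.s3a.impl</name>
  <value>org.apache.hadoop.fs.s3a.S3AFileSystem</value>
</property>
<property>
  <name>fs.s3a.access.key</name>
  <value>***************</value>
</property>
<property>
  <name>fs.s3a.secret.key</name>
  <value>*****************</value>
</property>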
After creating the Hive table and updating the core-site.xml file, create a Hive connection. You can use the Hive connection in a mapping as a reader or writer.
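Outside of a mapping, you can also exercise both the read and the write path from the Hive shell. The following sketch assumes a hypothetical target table named lineitem_copy that mirrors the lineitem schema on a different, illustrative S3 prefix:
-- Hypothetical target table that mirrors lineitem on a different S3 prefix
CREATE EXTERNAL TABLE lineitem_copy LIKE lineitem
LOCATION 's3a://us-standard-bucket-test/S3_hive_copy/';
-- Read from one S3-backed table and write to the other
INSERT OVERWRITE TABLE lineitem_copy
SELECT * FROM lineitem;
-- Confirm the row count after the write
SELECT COUNT(*) FROM lineitem_copy;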
Case Study
The following images show the Writer and Reader performance comparison between the PowerExchange for Amazon S3 connection and Hive connection on Amazon S3:
