Table of Contents

  1. Preface
  2. Introduction to PowerExchange for Amazon S3
  3. PowerExchange for Amazon S3 Configuration Overview
  4. Amazon S3 Connections
  5. PowerExchange for Amazon S3 Data Objects
  6. PowerExchange for Amazon S3 Mappings
  7. PowerExchange for Amazon S3 Lookups
  8. Appendix A: Amazon S3 Data Type Reference
  9. Appendix B: Troubleshooting

PowerExchange for Amazon S3 User Guide

Data Compression in Amazon S3 Sources and Targets

You can decompress data when you read from Amazon S3 or compress data when you write to Amazon S3. Data compression is applicable when you run a mapping in the native environment or on the Spark and Databricks Spark engines.
Configure the compression format in the Compression Format option under the advanced properties of an Amazon S3 data object read or write operation. The source or target file in Amazon S3 has the same extension as the compression format that you select in the Compression Format option.
When you perform a read operation, the Data Integration Service reads the data from the Amazon S3 bucket and decompresses it. When you perform a write operation, the Data Integration Service compresses the data and then writes it to the Amazon S3 bucket.
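The Data Integration Service performs the compression and decompression for you when you set the Compression Format option. The following Python sketch is only an outside-the-mapping illustration of what compressed data at rest in Amazon S3 looks like; the bucket name, object keys, and file name are placeholders, and it assumes boto3 is installed and configured with valid AWS credentials.

    # Illustrative only: gzip-compress a local file, upload it to S3, then
    # download and decompress it to verify the round trip.
    import gzip
    import boto3

    s3 = boto3.client("s3")
    bucket = "my-example-bucket"   # placeholder bucket name

    # Compress and upload: the object stored in S3 holds gzip-compressed bytes.
    with open("customers.csv", "rb") as src:
        compressed = gzip.compress(src.read())
    s3.put_object(Bucket=bucket, Key="input/customers.csv.GZ", Body=compressed)

    # Download and decompress: reverses the compression applied at write time.
    obj = s3.get_object(Bucket=bucket, Key="input/customers.csv.GZ")
    original = gzip.decompress(obj["Body"].read())
    print(original[:100])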
The following table lists the compression formats that are supported for read and write operations and for each file format in the native environment and on the Spark and Databricks Spark engines:
Compression format | Read | Write | Avro File | JSON File | ORC File | Parquet File
------------------ | ---- | ----- | --------- | --------- | -------- | ------------
None               | Yes  | Yes   | Yes       | No        | Yes      | Yes
Bzip2              | No   | No    | No        | Yes       | No       | No
Deflate            | Yes  | Yes   | Yes       | Yes       | No       | No
Gzip               | Yes  | Yes   | No        | Yes       | No       | Yes
Lzo                | Yes  | Yes   | No        | No        | No       | Yes
Snappy             | Yes  | Yes   | Yes       | Yes       | Yes      | Yes
Zlib               | Yes  | Yes   | No        | No        | Yes      | No
When you read files that use the deflate, snappy, or zlib compression format, decompression is implicit. You must select None to read files that use these compression formats. For example, to read a Parquet file that uses snappy compression, select None.
You can compress and decompress a binary file that uses gzip compression.
You can compress or decompress a flat file that uses the none, deflate, gzip, snappy, or zlib compression format when you run a mapping in the native environment. You can compress or decompress a flat file that uses the none, gzip, bzip2, or lzo compression format when you run a mapping on the Spark engine.
When you run a mapping on the Spark or Databricks Spark engine to write multiple Avro files of different compression formats, the Data Integration Service does not write the data to the target properly. You must ensure that you use the same compression format for all the Avro files.
In the native environment, when you create a mapping to read or write an ORC file and select Lzo as the compression format, the mapping fails.
To read a compressed file from Amazon S3 on the Spark engine, the compressed file must have a valid extension for its compression format. If the file does not have a valid extension, the Data Integration Service does not process the file.
The following table describes the extensions that are appended based on the compression format that you use:
Compression Format | File Name Extension
------------------ | -------------------
Gzip               | .GZ
Deflate            | .deflate
Bzip2              | .BZ2
Lzo                | .LZO
Snappy             | .snappy
Zlib               | .zlib
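If a mapping on the Spark engine skips source files, a missing or unexpected extension is a common cause. The following sketch, which assumes boto3 with valid credentials and placeholder bucket and prefix names, lists the object keys under a prefix that do not end with the extension associated with a compression format. Matching the extension case-insensitively is an assumption made for the example, not documented product behavior.

    # Illustrative only: find S3 object keys that lack the extension expected
    # for a given compression format. The mapping of formats to extensions
    # mirrors the table above.
    import boto3

    EXPECTED_EXTENSIONS = {
        "gzip": ".gz",
        "deflate": ".deflate",
        "bzip2": ".bz2",
        "lzo": ".lzo",
        "snappy": ".snappy",
        "zlib": ".zlib",
    }

    def keys_missing_extension(bucket, prefix, compression):
        """Return object keys under the prefix that lack the expected extension."""
        suffix = EXPECTED_EXTENSIONS[compression]
        s3 = boto3.client("s3")
        paginator = s3.get_paginator("list_objects_v2")
        bad_keys = []
        for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
            for obj in page.get("Contents", []):
                if not obj["Key"].lower().endswith(suffix):
                    bad_keys.append(obj["Key"])
        return bad_keys

    # Example: report gzip files under input/ that do not end with .GZ or .gz.
    print(keys_missing_extension("my-example-bucket", "input/", "gzip"))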
