Amazon S3 Connector Guide

Delimited Content Parser Process Objects

After the event source is configured and published, it begins to monitor the specified S3 bucket for new objects. When a new object is found, its contents are parsed and an event is generated that contains the S3 object information, local copy information (if any), and a set of process objects that represent the parsed content. The generated event is sent to the processes that listen for these events.
As described above, you can handle generated objects as built-in process objects or custom process objects.
Use these factors as a guide to choose your approach:
  • Use Built-in Process Objects to process S3 objects that have different structures or whose content headers you do not know in advance.
  • Use Custom Object Fields to process S3 objects that share the same structure when you know the headers in advance. This method enables you to extract only the required data. The generated objects are simpler and require less code to handle in your processes.
Whichever method you choose, be aware of the differences in the generated process objects.

Built-in Process Object Output

When each S3 Delimited Content object is found and processed, it generates an event (S3DelimitedContentParserEvent) with a single parameter (delimitedContent) to indicate whether you are using built-in process objects (S3DelimitedContent) or custom process objects (CustomS3DelimitedContent).
For example, suppose you process a simple delimited content object that contains the following data:
Country Capital Area Region
USA Washington 11111 "North America"
Ukraine Kiev 22222 Europe
Japan Tokyo 33333 "Asia Pacific"
The S3 object metadata is always added to the generated events. Local file information is included when a local copy is available. The generated process object is similar to the following:
<S3DelimitedContent>
  <!-- S3 object information -->
  <s3ObjectInfo>
    <lastModified>2015-06-23T14:20:22Z</lastModified>
    <contentControl/>
    <s3VersionId/>
    <s3Key>test.csv</s3Key>
    <contentType>text/plain</contentType>
    <contentEncoding>UTF-8</contentEncoding>
    <contentDisposition/>
    <contentLength>130</contentLength>
    <bucketName>myBucket</bucketName>
    <s3ETag>ddfadc6de31dd30a6588bee8c01e157e</s3ETag>
  </s3ObjectInfo>
  <!-- local copy file information, will be empty if local copying is not used -->
  <localFileInfo>
    <lastModified>2015-06-23T14:20:22.086Z</lastModified>
    <dir>D:/camel_test/s3 monitor</dir>
    <name>test</name>
    <path>D:/camel_test/s3 monitor/test.csv</path>
    <fullName>test.csv</fullName>
    <ext>csv</ext>
    <size>130</size>
  </localFileInfo>
  <!-- total number of rows in the result -->
  <totalRowsCount>3</totalRowsCount>
  <!-- list of file headers with names and indexes -->
  <header>
    <name>Country</name>
    <fieldIndex>1</fieldIndex>
  </header>
  <header>
    <name>Capital</name>
    <fieldIndex>2</fieldIndex>
  </header>
  <header>
    <name>Area</name>
    <fieldIndex>3</fieldIndex>
  </header>
  <header>
    <name>Region</name>
    <fieldIndex>4</fieldIndex>
  </header>
  <!-- delimited content records -->
  <record>
    <field>
      <value>USA</value>
    </field>
    <field>
      <value>Washington</value>
    </field>
    <field>
      <value>11111</value>
    </field>
    <field>
      <value>North America</value>
    </field>
  </record>
  ...
</S3DelimitedContent>
The generated event contains an S3DelimitedContent process object. Local copy information might be empty if the event source does not store local copies of processed S3 objects.
The generated result contains information about the source object, a list of header objects, a list of records, and the total row count. The S3 object content is represented by the header names and the value of each field.
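To work with this structure in code, you typically pair each header name with the value at the matching position in a record. The following Python sketch illustrates that pairing on a trimmed copy of the sample above. It is not connector code: the function name records_as_dicts is hypothetical, and it assumes that fieldIndex is 1-based and that the fields in each record appear in header order.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sample output (object and local file info omitted).
SAMPLE = """<S3DelimitedContent>
  <totalRowsCount>1</totalRowsCount>
  <header><name>Country</name><fieldIndex>1</fieldIndex></header>
  <header><name>Capital</name><fieldIndex>2</fieldIndex></header>
  <header><name>Area</name><fieldIndex>3</fieldIndex></header>
  <header><name>Region</name><fieldIndex>4</fieldIndex></header>
  <record>
    <field><value>USA</value></field>
    <field><value>Washington</value></field>
    <field><value>11111</value></field>
    <field><value>North America</value></field>
  </record>
</S3DelimitedContent>"""


def records_as_dicts(xml_text):
    """Return one dict per <record>, keyed by header name."""
    root = ET.fromstring(xml_text)
    # Sort headers by their declared index (assumed to be 1-based).
    headers = sorted(
        (int(h.findtext("fieldIndex")), h.findtext("name"))
        for h in root.findall("header")
    )
    rows = []
    for record in root.findall("record"):
        values = [f.findtext("value", default="") for f in record.findall("field")]
        # Assumption: the Nth <field> belongs to the header with fieldIndex N.
        rows.append({
            name: values[index - 1] if index - 1 < len(values) else ""
            for index, name in headers
        })
    return rows


rows = records_as_dicts(SAMPLE)
print(rows[0]["Capital"])
```

Each resulting dictionary maps a header name to the corresponding field value, which is often easier to handle than positional fields.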
If you work in split rows mode for delimited content, the S3 object content is divided into separate rows. For each row, the S3 connection produces a delimited content object that contains the headers and one record.
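The effect of split rows mode can be pictured with a short Python sketch that turns the sample data into one object per row. This only models the idea: split_rows and the dictionary keys are hypothetical names, and the space delimiter with double-quote quoting is an assumption based on how the sample data is shown.

```python
import csv
import io

SAMPLE = (
    'Country Capital Area Region\n'
    'USA Washington 11111 "North America"\n'
    'Ukraine Kiev 22222 Europe\n'
    'Japan Tokyo 33333 "Asia Pacific"\n'
)


def split_rows(text, delimiter=" "):
    """Yield one object per data row, each carrying the shared headers."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter, quotechar='"')
    headers = next(reader)  # the first line holds the header names
    for row in reader:
        # Every emitted object repeats the headers and holds exactly one record.
        yield {"headers": headers, "record": row}


objects = list(split_rows(SAMPLE))
print(len(objects))
```

A process that listens for these events would then handle three small objects instead of one object with three records.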

Custom Object Fields Output

If you process the same S3 object with Custom Object Fields instead, you might provide the header names "Country, Capital, Area" (but not "Region").
In that case, the results look similar to this:
<AwsS3DelimitedContentParserContent>
  <!-- S3 object information -->
  <s3ObjectInfo>
    <lastModified>2015-06-23T14:20:22Z</lastModified>
    <contentControl/>
    <s3VersionId/>
    <s3Key>test2.csv</s3Key>
    <contentType>text/plain</contentType>
    <contentEncoding>UTF-8</contentEncoding>
    <contentDisposition/>
    <contentLength>130</contentLength>
    <bucketName>myBucket</bucketName>
    <s3ETag>ddfadc6de31dd30a6588bee8c01e157e</s3ETag>
  </s3ObjectInfo>
  <!-- local copy file information, will be empty if local copying is not used -->
  <localFileInfo>
    <lastModified>2015-06-23T14:20:22.086Z</lastModified>
    <dir>D:/camel_test/s3 monitor</dir>
    <name>test</name>
    <path>D:/camel_test/s3 monitor/test.csv</path>
    <fullName>test2.csv</fullName>
    <ext>csv</ext>
    <size>130</size>
  </localFileInfo>
  <!-- total number of rows in the result -->
  <totalRowsCount>3</totalRowsCount>
  <!-- custom delimited content records with the fields specified by the user -->
  <record>
    <index>1</index>
    <Area>11111</Area>
    <Capital>Washington</Capital>
    <Country>USA</Country>
  </record>
  <record>
    <index>2</index>
    <Area>22222</Area>
    <Capital>Kiev</Capital>
    <Country>Ukraine</Country>
  </record>
  <record>
    <index>3</index>
    <Area>33333</Area>
    <Capital>Tokyo</Capital>
    <Country>Japan</Country>
  </record>
</AwsS3DelimitedContentParserContent>
Here, the process object contains the S3 object information, local copy information, and total number of records. "Region" data was excluded because it was not specified as a custom object field.
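Because each custom record carries its values in named elements, reading it requires no header lookup. The following Python sketch shows this on a trimmed copy of the custom output above. It is an illustration only, and custom_records is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the custom output (object and local file info omitted).
SAMPLE = """<AwsS3DelimitedContentParserContent>
  <totalRowsCount>2</totalRowsCount>
  <record>
    <index>1</index>
    <Area>11111</Area>
    <Capital>Washington</Capital>
    <Country>USA</Country>
  </record>
  <record>
    <index>2</index>
    <Area>22222</Area>
    <Capital>Kiev</Capital>
    <Country>Ukraine</Country>
  </record>
</AwsS3DelimitedContentParserContent>"""


def custom_records(xml_text):
    """Return one dict per <record>, keyed by the custom field names."""
    root = ET.fromstring(xml_text)
    rows = []
    for record in root.findall("record"):
        # Each child element other than <index> is one custom object field.
        rows.append({
            child.tag: (child.text or "")
            for child in record
            if child.tag != "index"
        })
    return rows


rows = custom_records(SAMPLE)
print(rows[1]["Country"])
```

Compared with the built-in output, the field name is available directly on each value, which is why custom records usually need less handling code.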
Notice that when you use custom object fields:
  • The generated output still includes S3 object information.
  • The process object name takes the format <sourceName>Content. The custom record object uses the format <sourceName>Record. If you have several delimited content event sources, each of them uses its own custom content and record objects.
  • Field names are converted to NCName format to remove any prohibited characters from the delimited content header names and ensure they are valid process object field names.
  • If the source S3 object does not contain the required header, this field is empty in the generated output. In the above example, if "Region" had been specified as a custom object field and the source S3 object did not contain the "Region" header, the field would appear in the process object but be empty.
  • As shown above, to skip some fields, simply omit them from the list of Custom Object Fields.
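The field rules in this list can be approximated with a small Python sketch. The exact conversion that the connector applies is not documented here, so to_ncname is a hypothetical, ASCII-only approximation of NCName rules (it drops characters outside letters, digits, underscore, hyphen, and period, and prefixes an underscore when the name would not start with a letter or underscore). The project function, also hypothetical, shows how omitted fields are skipped and missing headers produce empty values.

```python
import re


def to_ncname(header):
    """Approximate an NCName: ASCII-only simplification, not the connector's rules."""
    # Drop characters that are not allowed in this simplified NCName alphabet.
    cleaned = re.sub(r"[^A-Za-z0-9_.\-]", "", header)
    # An NCName must not start with a digit, hyphen, or period.
    if not cleaned or not re.match(r"[A-Za-z_]", cleaned):
        cleaned = "_" + cleaned
    return cleaned


def project(headers, row, wanted):
    """Keep only the wanted fields; a missing header yields an empty value."""
    by_name = dict(zip(headers, row))
    return {to_ncname(name): by_name.get(name, "") for name in wanted}


headers = ["Country", "Capital", "Area"]
row = ["USA", "Washington", "11111"]

# "Capital" is skipped because it is not requested;
# "Region" is requested but absent, so it stays empty.
result = project(headers, row, ["Country", "Area", "Region"])
print(result)
```

In this sketch a header such as "Area (km2)" would become "Areakm2". The real connector might map prohibited characters differently, so check the generated process object for the actual field names.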
