Table of Contents

  1. Preface
  2. Introduction to Microsoft Azure Data Lake Storage Gen2 Connector
  3. Connections for Microsoft Azure Data Lake Storage Gen2
  4. Mappings for Microsoft Azure Data Lake Storage Gen2
  5. Migrating a mapping
  6. Data type reference
  7. Troubleshooting

Microsoft Azure Data Lake Storage Gen2 Connector

Microsoft Azure Data Lake Storage Gen2 sources in mappings

In a mapping, you can configure a source transformation to represent a single Microsoft Azure Data Lake Storage Gen2 object.
The following table describes the Microsoft Azure Data Lake Storage Gen2 source properties that you can configure in a source transformation:
Connection
  Name of the source connection. Select a source connection or click New Parameter to define a new parameter for the source connection.
  If you want to overwrite the parameter at run time, select the Allow parameter to be overridden at run time option when you create the parameter. When the task runs, the agent uses the parameters from the file that you specify in the task advanced session properties. Ensure that the parameter file is in the correct format.
  When you switch between a non-parameterized and a parameterized Microsoft Azure Data Lake Storage Gen2 connection, the advanced property values are retained.
Source Type
  Select Single Object or Parameter.
Object
  Name of the source object.
  Ensure that the headers and the file data do not contain special characters.
Parameter
  Select an existing parameter for the source object or click New Parameter to define a new parameter for the source object. The Parameter property appears only if you select Parameter as the source type.
  When you parameterize the source object, specify the complete object path, including the file system, in the default value of the parameter.
  If you want to overwrite the parameter at run time, select the Allow parameter to be overridden at run time option when you create the parameter. When the task runs, the agent uses the parameters from the file that you specify in the task advanced session properties. Ensure that the parameter file is in the correct format.
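As a sketch only, a parameter file entry for a parameterized source object could look like the following. The section header, parameter name, and object path are placeholder examples, not values from this guide; the key point from the property description is that the value includes the file system at the start of the object path:

```
[Global]
$$SrcObject=myfilesystem/dir1/customers.csv
```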
Format
  Specifies the file format that Microsoft Azure Data Lake Storage Gen2 Connector uses to read data from Microsoft Azure Data Lake Storage Gen2.
  You can select the following file format types:
    • Flat
    • Avro
    • Parquet
    • JSON
    • ORC
    • Discover Structure (1)
  Default is None. If you select None as the format type, Microsoft Azure Data Lake Storage Gen2 Connector reads data from Microsoft Azure Data Lake Storage Gen2 files in binary format.
  You cannot read a JSON file that exceeds 1 GB.
  Ensure that the source file is not empty.
  For more information, see File formatting options.
Intelligent Structure Model (1)
  Applies to the Discover Structure format type. Determines the underlying patterns in a sample file and auto-generates a model for files with the same data and structure.
  Select one of the following options to associate a model with the transformation:
    • Select. Select an existing model.
    • New. Create a new model. Select Design New to create the model. Select Auto-generate from sample file for Intelligent Structure Discovery to generate a model based on sample input that you select.
  Select one of the following options to validate the XML source object against an XML-based hierarchical schema:
    • Source object doesn't require validation.
    • Source object requires validation against a hierarchical schema. Select to validate the XML source object against an existing or a new hierarchical schema.
  When you create a mapping task, you configure on the Runtime Options tab how Data Integration handles a schema mismatch. You can choose to skip the mismatched files and continue to run the task, or to stop the task when it encounters the first file that does not match.
  For more information, see Components.
(1) Applies only to mappings in advanced mode.
The following table describes the Microsoft Azure Data Lake Storage Gen2 source advanced properties:
Concurrent Threads (1)
  Number of concurrent connections used to extract data from Microsoft Azure Data Lake Storage Gen2. When you read a large file or object, you can spawn multiple threads to process the data. Configure Block Size to divide a large file into smaller parts.
  Default is 4. Maximum is 10.
Filesystem Name Override
  Overrides the default file system name.
Source Type
  Type of source from which you want to read data. You can select the following source types:
    • File
    • Directory
  Default is File.
Allow Wildcard Characters
  Indicates whether you want to use wildcard characters for the directory source type.
  For more information, see Wildcard characters.
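The connector's exact wildcard rules are in the Wildcard characters topic; as a general illustration of how typical glob-style wildcards select files (the file names here are made up), the standard behavior of * and ? looks like this:

```python
# Illustration of typical glob-style wildcard matching:
# * matches any run of characters, ? matches exactly one character.
# File names are hypothetical examples.
from fnmatch import fnmatch

files = ["sales_2023.csv", "sales_2024.csv", "inventory.csv"]

# Select only the sales files.
matched = [name for name in files if fnmatch(name, "sales_*.csv")]
print(matched)  # ['sales_2023.csv', 'sales_2024.csv']
```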
Directory Override
  Microsoft Azure Data Lake Storage Gen2 directory from which you read data. Default is the root directory. The directory path specified at run time overrides the path specified when you created the connection.
  You can specify an absolute or a relative directory path:
    • Absolute path. The Secure Agent searches this directory path in the specified file system. Example: Dir1/Dir2
    • Relative path. The Secure Agent searches this directory path in the native directory path of the object. Example: /Dir1/Dir2
      When you use a relative path, the imported object path is appended to the file path used during the metadata fetch at run time.
  Do not specify the root directory (/) to override the directory.
File Name Override
  Source object. Select the file from which you want to read data. The file specified at run time overrides the file specified in Object.
Block Size (1)
  Applicable to the flat file format. Divides a large file into smaller parts of the specified block size. When you read a large file, divide the file into smaller parts and configure concurrent connections to spawn the required number of threads to process the data in parallel.
  Specify an integer value for the block size. Default is 8388608 bytes.
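To see how Block Size and Concurrent Threads interact, the arithmetic for a hypothetical 100 MiB file with the default settings works out as follows (the file size is an example; the defaults are from the table above):

```python
# Worked example: splitting a 100 MiB file with the default
# Block Size (8388608 bytes, i.e. 8 MiB) and the default
# Concurrent Threads (4). The file size is a made-up example.
import math

file_size = 100 * 1024 * 1024   # example source file: 100 MiB
block_size = 8388608            # default Block Size in bytes
threads = 4                     # default Concurrent Threads

blocks = math.ceil(file_size / block_size)
print(blocks)                   # 13 blocks to read

waves = math.ceil(blocks / threads)
print(waves)                    # 4 rounds of up to 4 parallel reads
```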
Timeout Interval
  Not applicable.
Recursive Directory Read
  Indicates whether you want to read objects stored in subdirectories in mappings.
  For more information, see Reading files from subdirectories.
Incremental File Load (2)
  Indicates whether you want to incrementally load files when you use a directory as the source for mappings in advanced mode.
  When you incrementally load files, the mapping task reads and processes only those files in the directory that have changed since the mapping task last ran.
  For more information, see Incrementally loading files.
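The idea behind incremental file load can be sketched as "pick up only files whose modification time is newer than the previous run". The helper below is a hypothetical illustration of that idea, not the connector's implementation; the mapping task tracks the last-run state itself:

```python
# Conceptual sketch of incremental file loading: return only the
# files in a directory that changed after a given point in time.
# Hypothetical helper for illustration, not the connector's code.
import os

def changed_since(directory, last_run_epoch):
    """Return names of files modified after last_run_epoch (seconds)."""
    changed = []
    for entry in os.scandir(directory):
        if entry.is_file() and entry.stat().st_mtime > last_run_epoch:
            changed.append(entry.name)
    return sorted(changed)
```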
Compression Format
  Reads compressed data from the source. Select one of the following options:
    • None. Select to read Avro, ORC, and Parquet files that use Snappy compression. The compressed files must have the .snappy extension. You cannot read compressed JSON files.
    • Gzip. Select to read flat files and Parquet files that use Gzip compression. The compressed files must have the .gz extension.
  You cannot preview data for a compressed flat file.
Interim Directory (1)
  Optional. Applicable to flat files and JSON files.
  Path to the staging directory on the Secure Agent machine. Specify the staging directory where you want to stage the files when you read data from Microsoft Azure Data Lake Storage Gen2. Ensure that the directory has sufficient space and that you have write permission on it.
  Default staging directory is /tmp.
  You cannot specify an interim directory when you use the Hosted Agent.
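Because the staging directory needs free space and write permission, it can be worth checking a candidate path before setting Interim Directory. A hypothetical pre-flight check (the 1 GiB threshold is an arbitrary example) might look like:

```python
# Hypothetical pre-flight check for a staging directory: confirm the
# path is writable and has at least min_free_bytes of free space.
# The 1 GiB default threshold is an arbitrary example value.
import os
import shutil

def staging_dir_ok(path, min_free_bytes=1 << 30):
    usage = shutil.disk_usage(path)
    return os.access(path, os.W_OK) and usage.free >= min_free_bytes
```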
Tracing Level
  Sets the amount of detail that appears in the log file. You can choose terse, normal, verbose initialization, or verbose data. Default is normal.
(1) Doesn't apply to mappings in advanced mode.
(2) Applies only to mappings in advanced mode.
