Data Ingestion and Replication
Property | Description |
---|---|
Output Format | Select the format of the output file. Options are CSV, AVRO, and PARQUET. The default value is CSV. Output files in CSV format use double-quotation marks ("") as the delimiter for each field. See the CSV sketch after this table. |
Add Headers to CSV File | If CSV is selected as the output format, select this check box to add a header with source column names to the output CSV file. |
Parquet Compression Type | If the PARQUET output format is selected, you can select a compression type that Parquet supports. The default value is None, which means no compression is used. |
Avro Format | If AVRO is selected as the output format, select the format of the Avro schema that is created for each source table. The default value is Avro-Flat. |
Avro Serialization Format | If AVRO is selected as the output format, select the serialization format of the Avro output file. The default value is Binary. |
Avro Schema Directory | If AVRO is selected as the output format, specify the local directory where Application Ingestion and Replication stores an Avro schema definition file for each source table. Schema definition files follow a fixed naming pattern. If this directory is not specified, no Avro schema definition file is produced. |
File Compression Type | Select a file compression type for output files in CSV or AVRO output format. The default value is None, which means no compression is used. |
Avro Compression Type | If AVRO is selected as the output format, select an Avro compression type. The default value is None, which means no compression is used. |
Deflate Compression Level | If Deflate is selected in the Avro Compression Type field, specify a compression level from 0 to 9. The default value is 0. See the compression sketch after this table. |
Add Directory Tags | For incremental load and combined initial and incremental load tasks, select this check box to add the "dt=" prefix to the names of apply cycle directories to be compatible with the naming convention for Hive partitioning. This check box is cleared by default. |
Task Target Directory | For incremental load and combined initial and incremental load tasks, the root directory for the other directories that hold output data files, schema files, and CDC cycle contents and completed files. You can use it to specify a custom root directory for the task. If you enable the Connection Directory as Parent option, you can still optionally specify a task target directory to use with the parent directory specified in the connection properties. This field is required if the {TaskTargetDirectory} placeholder is specified in patterns for any of the following directory fields. |
Data Directory | For initial load tasks, define a directory structure for the directories where Application Ingestion and Replication stores output data files and optionally stores the schema. The default directory pattern is {TableName}_{Timestamp}. To customize the directory pattern, click the Edit icon and select from the listed path types and values. If you manually enter the directory expression, ensure that you enclose placeholders in curly brackets { }. Placeholder values are not case sensitive. For incremental load and combined initial and incremental load tasks, define a custom path to the subdirectory that contains the cdc-data data files. The default directory pattern is {TaskTargetDirectory}/data/{TableName}/data. To customize the directory pattern, click the Edit icon and select from the listed path types and values. See the path-expansion sketch after this table. For Amazon S3 and Microsoft Azure Data Lake Storage Gen2 targets, Application Ingestion and Replication uses the directory specified in the target connection properties as the root of the data directory path when Connection Directory as Parent is selected. For Google Cloud Storage targets, Application Ingestion and Replication uses the Bucket name that you specify in the target properties for the ingestion task. For Microsoft Fabric OneLake targets, the parent directory is the path specified in the Lakehouse Path field in the Microsoft Fabric OneLake connection properties. For Amazon S3 targets with Open Table format, the data directory field is not applicable. Enabling Connection Directory as Parent includes the connection directory before the warehouse base path. If it is disabled, files are saved directly under the warehouse base directory. |
Connection Directory as Parent | Select this check box to use the directory value that is specified in the target connection properties as the parent directory for the custom directory paths specified in the task target properties. For initial load tasks, the parent directory is used in the Data Directory and Schema Directory. For incremental load and combined initial and incremental load tasks, the parent directory is used in the Data Directory, Schema Directory, Cycle Completion Directory, and Cycle Contents Directory. This check box is selected by default. If you clear it, for initial loads, define the full path to the output files in the Data Directory field. For incremental loads, optionally specify a root directory for the task in the Task Target Directory field. |
Schema Directory | Specify a custom directory in which to store the schema file if you want to store it in a directory other than the default directory. For initial loads, previously used values, if available, are shown in a list for your convenience. This field is optional. For initial loads, the schema is stored in the data directory by default. For incremental loads and combined initial and incremental loads, the default directory for the schema file is {TaskTargetDirectory}/data/{TableName}/schema. You can use the same placeholders as for the Data Directory field. If you manually enter placeholders, ensure that you enclose them in curly brackets { }. If you include the toUpper or toLower function, put the placeholder name in parentheses and enclose both the function and the placeholder in curly brackets, for example: {toLower(SchemaName)}. The schema is written only to output data files in CSV format. Data files in Parquet and Avro formats contain their own embedded schema. |
Cycle Completion Directory | For incremental load and combined initial and incremental load tasks, the path to the directory that contains the cycle completed file. Default is {TaskTargetDirectory}/cycle/completed. |
Cycle Contents Directory | For incremental load and combined initial and incremental load tasks, the path to the directory that contains the cycle contents files. Default is {TaskTargetDirectory}/cycle/contents. |
Use Cycle Partitioning for Data Directory | For incremental load and combined initial and incremental load tasks, causes a timestamp subdirectory to be created for each CDC cycle, under each data directory. If this option is not selected, individual data files are written to the same directory without a timestamp, unless you define an alternative directory structure. |
Use Cycle Partitioning for Summary Directories | For incremental load and combined initial and incremental load tasks, causes a timestamp subdirectory to be created for each CDC cycle, under the summary contents and completed subdirectories. |
List Individual Files in Contents | For incremental load and combined initial and incremental load tasks, lists individual data files under the contents subdirectory. If Use Cycle Partitioning for Summary Directories is cleared, this option is selected by default. All of the individual files are listed under the contents subdirectory unless you configure custom subdirectories by using placeholders, such as for timestamp or date. If Use Cycle Partitioning for Data Directory is selected, you can still optionally select this check box to list individual files and group them by CDC cycle. |
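The CSV options above control both the field quoting and the optional header row. As a rough illustration of output shaped that way, the following Python sketch uses the standard csv module with every field quoted; the column names and rows are made-up examples, not anything the product generates.

```python
import csv
import sys

# Hypothetical source columns and rows, for illustration only.
columns = ["ORDER_ID", "CUSTOMER", "AMOUNT"]
rows = [
    ["1001", "Acme Corp", "250.00"],
    ["1002", "Globex", "99.95"],
]

writer = csv.writer(sys.stdout, quoting=csv.QUOTE_ALL)  # quote every field
writer.writerow(columns)  # header row: written only if Add Headers is selected
writer.writerows(rows)
```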
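For the Deflate Compression Level field, 0 through 9 correspond to the standard DEFLATE levels, which trade compression speed for output size. The zlib comparison below illustrates that trade-off in plain Python; it is an analogy only, not the product's Avro writer.

```python
import zlib

data = b"change-record " * 10_000  # repetitive sample payload

# Compare DEFLATE output sizes across levels (0 = stored, 9 = maximum).
for level in (0, 1, 6, 9):
    print(f"level {level}: {len(zlib.compress(data, level)):>7} bytes")
```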
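The Data Directory and Schema Directory patterns are assembled from placeholders such as {TaskTargetDirectory}, {TableName}, and {Timestamp}, optionally wrapped in the toLower or toUpper function, and placeholder names are not case sensitive. The following Python sketch shows one plausible way such a pattern could expand; the resolver, the value set, and the timestamp format are assumptions for illustration, not the product's implementation.

```python
import re
from datetime import datetime, timezone

def expand_pattern(pattern: str, values: dict) -> str:
    """Expand {Placeholder}, {toLower(Placeholder)}, and {toUpper(Placeholder)}
    tokens in a directory pattern. Placeholder names match case-insensitively."""
    lookup = {k.lower(): v for k, v in values.items()}
    token = re.compile(r"\{(?:(?P<func>toLower|toUpper)\()?(?P<name>\w+)\)?\}")

    def resolve(match):
        value = lookup.get(match.group("name").lower())
        if value is None:
            return match.group(0)  # leave unknown placeholders untouched
        if match.group("func") == "toLower":
            return value.lower()
        if match.group("func") == "toUpper":
            return value.upper()
        return value

    return token.sub(resolve, pattern)

# Hypothetical values; the real timestamp format is defined by the product.
values = {
    "TaskTargetDirectory": "sales_task",
    "SchemaName": "SALES",
    "TableName": "ORDERS",
    "Timestamp": datetime.now(timezone.utc).strftime("%Y%m%d%H%M%S"),
}

print(expand_pattern("{TableName}_{Timestamp}", values))
# -> e.g. ORDERS_20250101120000
print(expand_pattern("{TaskTargetDirectory}/data/{toLower(TableName)}/data", values))
# -> sales_task/data/orders/data
```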
Property | Description |
---|---|
Add Operation Type | Select this check box to add a metadata column that records the source SQL operation type in the output that the job propagates to the target. For incremental loads, the job writes "I" for insert, "U" for update, or "D" for delete. For initial loads, the job always writes "I" for insert. By default, this check box is selected for incremental load and combined initial and incremental load jobs, and cleared for initial load jobs. See the metadata sketch after this table. |
Add Operation Time | Select this check box to add a metadata column that records the source SQL operation timestamp in the output that the job propagates to the target. For initial loads, the job always writes the current date and time. By default, this check box is not selected. |
Add Orderable Sequence | Select this check box to add a metadata column that records a combined epoch value and incremental numeric value for each change operation that the job inserts into the target tables. The sequence value is always ascending but is not guaranteed to be sequential; gaps may exist. Use the sequence value to identify the order of activity in the target records. By default, this check box is not selected. |
Add Before Images | Select this check box to include UNDO data in the output that a job writes to the target. For initial loads, the job writes nulls. By default, this check box is not selected. |
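Together, these options add metadata columns alongside each propagated row. The sketch below fabricates what one decorated change record could look like, with an operation type, an operation time, and an orderable sequence built from an epoch value combined with an incrementing counter; the column names and the sequence layout are assumptions for illustration, not the product's actual schema.

```python
import itertools
import time
from datetime import datetime, timezone

# Epoch value combined with an incrementing counter: always ascending,
# but gaps can appear, so treat it as an ordering key, not a row count.
_counter = itertools.count(1)

def decorate(row: dict, op_type: str) -> dict:
    """Attach hypothetical metadata columns to one change record.
    op_type is "I" (insert), "U" (update), or "D" (delete)."""
    sequence = (int(time.time()) << 20) | next(_counter)
    return {
        **row,
        "OP_TYPE": op_type,                                # Add Operation Type
        "OP_TIME": datetime.now(timezone.utc).isoformat(), # Add Operation Time
        "SEQUENCE": sequence,                              # Add Orderable Sequence
    }

print(decorate({"ORDER_ID": 1001, "AMOUNT": "250.00"}, "U"))
```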