Application Ingestion and Replication

Custom directory structure for output files on Amazon S3, Google Cloud Storage, Microsoft Fabric OneLake, and ADLS Gen2 targets

You can configure a custom directory structure for the output files that initial load, incremental load, and combined initial and incremental load jobs write to Amazon S3, Google Cloud Storage, Microsoft Azure Data Lake Storage (ADLS) Gen2, and Microsoft Fabric OneLake targets if you do not want to use the default structure.

Initial loads

By default, initial load jobs write output files to tablename_timestamp subdirectories under the parent directory. For Amazon S3 and ADLS Gen2 targets, the parent directory is specified in the target connection properties if the Connection Directory as Parent check box is selected on the Target page of the task wizard.

In an Amazon S3 connection, this parent directory is specified in the Folder Path field.

In an ADLS Gen2 connection, the parent directory is specified in the Directory Path field.

For Google Cloud Storage targets, the parent directory is the bucket container specified in the Bucket field on the Target page of the task wizard.

For Microsoft Fabric OneLake targets, the parent directory is the path specified in the Lakehouse Path field in the Microsoft Fabric OneLake connection properties.

You can customize the directory structure to suit your needs. For example, for initial loads, you can write the output files under a root directory or directory path that is different from the parent directory specified in the connection properties to better organize the files for your environment or to find them more easily. Or you can consolidate all output files for a table directly in a directory with the table name rather than write the files to separate timestamped subdirectories, for example, to facilitate automated processing of all of the files.

To configure a directory structure, you must use the Data Directory field on the Target page of the ingestion task wizard. The default value is {TableName}_{Timestamp}, which causes output files to be written to tablename_timestamp subdirectories under the parent directory. You can configure a custom directory path by creating a directory pattern that consists of any combination of case-insensitive placeholders and directory names.

By default, the target schema is also written to the data directory. If you want to use a different directory for the schema, you can also define a directory pattern in the Schema Directory field.

You can manually enter the directory pattern for storing data or schema, or use the Edit icon located next to the Data Directory and Schema Directory fields to specify directory patterns using predefined placeholders.

The following table lists the available placeholders that you can use to build your data and schema directory path:

Path Type	Value
Folder Path	Enter a folder name or use variables to create a folder name. For example, to organize your data by date, schema, table, and load time, enter {yyyy}/{mm}/{dd}/{SchemaName}/{TableName}/{Timestamp}
Timestamp	You can select from the following values: {Timestamp} for the date and time, in the format yyyymmdd_hhmissms, at which the initial load job started to transfer data to the target. {Schema} for the target schema name. {yy} for a two-digit year. {yyyy} for a four-digit year. {mm} for a two-digit month value. {dd} for a two-digit day in the month.
Schema Name	You can select from the following values: {SchemaName} for the target schema name. toUpper(SchemaName) to use uppercase for the values represented by the placeholder in parentheses. toLower(SchemaName) to use lowercase for the values represented by the placeholder in parentheses.
Table Name	You can select from the following values: {TableName} for the target table name. toUpper(TableName) to use uppercase for the values represented by the placeholder in parentheses. toLower(TableName) to use lowercase for the values represented by the placeholder in parentheses.
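
To make the placeholder behavior concrete, the following Python sketch expands a directory pattern the way the table above describes. It is only an illustration, not the product's implementation: the function name expand_pattern and the sample schema, table, and timestamp values are hypothetical, and the toUpper and toLower functions are not modeled.

```python
import re


def expand_pattern(pattern: str, values: dict[str, str]) -> str:
    """Replace {Placeholder} tokens in a pattern, matching names case-insensitively."""
    lookup = {name.lower(): value for name, value in values.items()}

    def substitute(match: re.Match) -> str:
        name = match.group(1)
        key = name.lower()
        if key not in lookup:
            raise KeyError(f"unknown placeholder: {{{name}}}")
        return lookup[key]

    return re.sub(r"\{(\w+)\}", substitute, pattern)


# Hypothetical sample values; the timestamp string is only an assumed
# rendering of the yyyymmdd_hhmissms format.
values = {
    "SchemaName": "sales",
    "TableName": "orders",
    "Timestamp": "20240131_101500000",
    "yyyy": "2024",
    "mm": "01",
    "dd": "31",
}

# Default pattern
print(expand_pattern("{TableName}_{Timestamp}", values))
# prints orders_20240131_101500000

# Pattern from the Folder Path row
print(expand_pattern("{yyyy}/{mm}/{dd}/{SchemaName}/{TableName}/{Timestamp}", values))
# prints 2024/01/31/sales/orders/20240131_101500000
```

Because the lookup lowercases placeholder names, {tablename} and {TableName} resolve to the same value, which mirrors the case-insensitive behavior described above.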

Example 1

You are using an Amazon S3 target and want to write output files and the target schema to the same directory, which is under the parent directory specified in the Folder Path field of the connection properties. In this case, the parent directory is idr-test/DEMO/. You want to write all of the output files for a table to a directory that has a name matching the table name, without a timestamp. You must complete the Data Directory field and select the Connection Directory as Parent check box.

Based on this configuration, the resulting directory structure is:

Example 2

You are using an Amazon S3 target and want to write output data files to a custom directory path and write the target schema to a separate directory path. To use the directory specified in the Folder Path field in the Amazon S3 connection properties as the parent directory for the data directory and schema directory, select Connection Directory as Parent. In this case, the parent directory is idr-test/DEMO/. In the Data Directory and Schema Directory fields, define directory patterns by using a specific directory name, such as data_dir and schema_dir, followed by the default {TableName}_{Timestamp} placeholder value. The placeholder creates tablename_timestamp destination directories.

Based on this configuration, the resulting data directory structure is:

And the resulting schema directory structure is:

Incremental loads and combined initial and incremental loads

By default, incremental load and combined initial and incremental load jobs write cycle files and data files to subdirectories under the parent directory. However, you can create a custom directory structure to organize the files to best suit your organization's requirements.

This feature applies to application ingestion and replication incremental load jobs that have a Salesforce source and Amazon S3, Google Cloud Storage, Microsoft Fabric OneLake, or Microsoft Azure Data Lake Storage (ADLS) Gen2 targets.

For Amazon S3 and ADLS Gen2 targets, the parent directory is set in the target connection properties if the Connection Directory as Parent check box is selected on the Target page of the task wizard.

In an Amazon S3 connection, the parent directory is specified in the Folder Path field.

In an ADLS Gen2 connection, the parent directory is specified in the Directory Path field.

For Google Cloud Storage targets, the parent directory is the bucket container specified in the Bucket field on the Target page of the task wizard.

For Microsoft Fabric OneLake targets, the parent directory is the path specified in the Lakehouse Path field in the Microsoft Fabric OneLake connection properties.

You can customize the directory structure to suit your needs. For example, you can write the data and cycle files under a target directory for the task instead of under the parent directory specified in the connection properties. Alternatively, you can 1) consolidate table-specific data and schema files under a subdirectory that includes the table name, 2) partition the data files and summary contents and completed files by CDC cycle, or 3) create a completely customized directory structure by defining a pattern that includes literal values and placeholders. For example, if you want to run SQL-type expressions to process the data based on time, you can write all data files directly to timestamp subdirectories without partitioning them by CDC cycle.

To configure a custom directory structure for an incremental load task, define a pattern for any of the following optional fields on the Target page of the ingestion task wizard:

Field	Description	Default
Task Target Directory	Name of a root directory to use for storing output files for an incremental load task. If you select the Connection Directory as Parent option, you can still optionally specify a task target directory. It will be appended to the parent directory to form the root for the data, schema, cycle completion, and cycle contents directories. This field is required if the {TaskTargetDirectory} placeholder is specified in patterns for any of the following directory fields.	None
Connection Directory as Parent	Select this check box to use the parent directory specified in the connection properties. This field is not available for the Microsoft Fabric OneLake target.	Selected
Data Directory	Path to the subdirectory that contains the data files. In the directory path, the {TableName} placeholder is required if data and schema files are not partitioned by CDC cycle.	{TaskTargetDirectory}/data/{TableName}/data
Schema Directory	Path to the subdirectory in which to store the schema file if you do not want to store it in the data directory. In the directory path, the {TableName} placeholder is required if data and schema files are not partitioned by CDC cycle.	{TaskTargetDirectory}/data/{TableName}/schema
Cycle Completion Directory	Path to the directory that contains the cycle completed file.	{TaskTargetDirectory}/cycle/completed
Cycle Contents Directory	Path to the directory that contains the cycle contents files.	{TaskTargetDirectory}/cycle/contents
Use Cycle Partitioning for Data Directory	Causes a timestamp subdirectory to be created for each CDC cycle, under each data directory. If this option is not selected, individual data files are written to the same directory without a timestamp, unless you define an alternative directory structure.	Selected
Use Cycle Partitioning for Summary Directories	Causes a timestamp subdirectory to be created for each CDC cycle, under the summary contents and completed subdirectories.	Selected
List Individual Files in Contents	Lists individual data files under the contents subdirectory. If Use Cycle Partitioning for Summary Directories is cleared, this option is selected by default. All of the individual files are listed in the contents subdirectory unless you configure custom subdirectories by using placeholders, such as for timestamp or date. If Use Cycle Partitioning for Data Directory is selected, you can still optionally select this check box to list individual files and group them by CDC cycle.	Not selected if Use Cycle Partitioning for Summary Directories is selected. Selected if Use Cycle Partitioning for Summary Directories is cleared.
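
The two requirements stated in the table (a task target directory is needed whenever a pattern uses {TaskTargetDirectory}, and {TableName} is needed in the data and schema patterns when files are not partitioned by CDC cycle) can be sketched as a small pre-check in Python. This is a hedged illustration only: the function names are hypothetical, and it assumes that "not partitioned by CDC cycle" corresponds to clearing Use Cycle Partitioning for Data Directory.

```python
import re


def placeholders(pattern: str) -> set[str]:
    """Return the placeholder names used in a pattern, lowercased."""
    return {name.lower() for name in re.findall(r"\{(\w+)\}", pattern)}


def check_settings(patterns: dict[str, str],
                   task_target_directory: str = "",
                   cycle_partition_data: bool = True) -> list[str]:
    """Collect problems with a set of directory patterns (hypothetical helper)."""
    problems = []
    # A pattern that uses {TaskTargetDirectory} needs a task target directory.
    for field, pattern in patterns.items():
        if "tasktargetdirectory" in placeholders(pattern) and not task_target_directory:
            problems.append(f"{field}: Task Target Directory is required")
    # Assumption: without cycle partitioning, data and schema patterns need {TableName}.
    if not cycle_partition_data:
        for field in ("Data Directory", "Schema Directory"):
            if "tablename" not in placeholders(patterns.get(field, "")):
                problems.append(f"{field}: {{TableName}} is required")
    return problems


# Default patterns from the table above
defaults = {
    "Data Directory": "{TaskTargetDirectory}/data/{TableName}/data",
    "Schema Directory": "{TaskTargetDirectory}/data/{TableName}/schema",
    "Cycle Completion Directory": "{TaskTargetDirectory}/cycle/completed",
    "Cycle Contents Directory": "{TaskTargetDirectory}/cycle/contents",
}

print(check_settings(defaults))  # one problem per default pattern
print(check_settings(defaults, task_target_directory="s3_target"))  # prints []
```

With the default patterns, every field references {TaskTargetDirectory}, so leaving the task target directory empty is flagged four times.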

The directory pattern can consist of any combination of case-insensitive placeholders, shown in curly brackets { }, and specific directory names. To specify a custom data directory or schema directory expression, use the Edit icon next to the field and build the directory pattern from the listed placeholders:

Path Type	Value
Folder Path	Enter {TaskTargetDirectory} for a task-specific base directory on the target to use instead of the S3 folder path specified in the connection properties.
Timestamp	You can select from the following values: {Timestamp} for the date and time, in the format yyyymmdd_hhmissms. {Schema} for the target schema name. {yy} for a two-digit year. {yyyy} for a four-digit year. {mm} for a two-digit month value. {dd} for a two-digit day in the month. Use {Timestamp}, {yy}, {yyyy}, {mm}, and {dd} in directory patterns to insert specific date and time information into directory names for organizing data. When you specify these placeholders in directory patterns for data, contents, and completed directories, they represent the time when the CDC cycle began. For the schema directory, they represent the time when the entire CDC job started, not just the cycle.
Schema Name	You can select from the following values: {SchemaName} for the target schema name. toUpper(SchemaName) to use uppercase for the values represented by the placeholder in parentheses. toLower(SchemaName) to use lowercase for the values represented by the placeholder in parentheses.
Table Name	You can select from the following values: {TableName} for the target table name. toUpper(TableName) to use uppercase for the values represented by the placeholder in parentheses. toLower(TableName) to use lowercase for the values represented by the placeholder in parentheses.
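
As a rough illustration of how the parent directory, the task target directory, and the default patterns combine, the Python sketch below resolves the four default directory patterns for one table. It assumes sample values (parent directory idr-test/dbmi/, task target directory s3_target, table pgs001_src_allint_init). The resolve function is hypothetical, and the per-cycle timestamp subdirectories created by cycle partitioning are not modeled.

```python
import posixpath
import re

# Default patterns for incremental load directory fields
DEFAULT_PATTERNS = {
    "Data Directory": "{TaskTargetDirectory}/data/{TableName}/data",
    "Schema Directory": "{TaskTargetDirectory}/data/{TableName}/schema",
    "Cycle Completion Directory": "{TaskTargetDirectory}/cycle/completed",
    "Cycle Contents Directory": "{TaskTargetDirectory}/cycle/contents",
}


def resolve(parent: str, pattern: str, values: dict[str, str]) -> str:
    """Expand placeholders (case-insensitively), then prefix the parent directory."""
    lookup = {name.lower(): value for name, value in values.items()}
    expanded = re.sub(r"\{(\w+)\}", lambda m: lookup[m.group(1).lower()], pattern)
    return posixpath.join(parent, expanded)


values = {
    "TaskTargetDirectory": "s3_target",
    "TableName": "pgs001_src_allint_init",
}

for field, pattern in DEFAULT_PATTERNS.items():
    print(f"{field}: {resolve('idr-test/dbmi/', pattern, values)}")
# Data Directory: idr-test/dbmi/s3_target/data/pgs001_src_allint_init/data
# Schema Directory: idr-test/dbmi/s3_target/data/pgs001_src_allint_init/schema
# Cycle Completion Directory: idr-test/dbmi/s3_target/cycle/completed
# Cycle Contents Directory: idr-test/dbmi/s3_target/cycle/contents
```

Clearing Connection Directory as Parent would correspond to passing an empty parent string in this sketch, which leaves the task target directory as the root of each path.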

Example 1

You want to accept the default directory settings for incremental load jobs as displayed in the task wizard. The target type is Amazon S3. Because the Connection Directory as Parent check box is selected by default, the parent directory path that is specified in the Folder Path field of the Amazon S3 connection properties is used. This parent directory is idr-test/dbmi/. You also must specify a task target directory name, in this case, s3_target, because the {TaskTargetDirectory} placeholder is used in the default patterns in the subsequent directory fields. The files in the data directory and schema directory will be grouped by table name because the {TableName} placeholder is included in their default patterns. Also, because cycle partitioning is enabled, the files in the data directory, schema directory, and cycle summary directories will be subdivided by CDC cycle.

Based on this configuration, the resulting data directory structure is:

If you drill down on the data folder and then on a table in that folder (pgs001_src_allint_init), you can access the data and schema subdirectories:

If you drill down on the data folder, you can access the timestamp directories for the data files:

If you drill down on cycle, you can access the summary contents and completed subdirectories:

Example 2

You want to create a custom directory structure for incremental load jobs that adds the subdirectories "demo" and "d1" in all of the directory paths except in the schema directory so that you can easily find the files for your demos. Because the Connection Directory as Parent check box is selected, the parent directory path (idr-test/dbmi/) that is specified in the Folder Path field of the Amazon S3 connection properties is used. You also must specify the task target directory because the {TaskTargetDirectory} placeholder is used in the patterns in the subsequent directory fields. The files in the data directory and schema directory will be grouped by table name. Also, because cycle partitioning is enabled, the files in the data, schema, and cycle summary directories will be subdivided by CDC cycle.

Based on this configuration, the resulting data directory structure is: