Data Profiling

Advanced Options

If you choose a source object such as Amazon S3, Azure Data Lake Store, Snowflake Data Cloud, Microsoft Azure Synapse SQL, or Amazon Redshift V2, you can configure the following advanced options for the source object.
Amazon Athena
The following table lists the options that you can configure for an Amazon Athena source object:
Property
Description
Retain Athena Query Result On S3 File
Specifies whether you want to retain the Amazon Athena query result on the Amazon S3 file. Select the check box to retain the Amazon Athena query result on the Amazon S3 file.
The Amazon Athena query result is stored in the CSV file format.
By default, the Retain Athena Query Result on S3 File check box is not selected.
S3OutputLocation
Specifies the location of the Amazon S3 file that stores the result of the Amazon Athena query.
You can also specify the Amazon S3 file location in the S3OutputLocation parameter in the JDBC URL connection property.
If you specify the Amazon S3 output location in both the connection and the advanced source properties, the Secure Agent uses the Amazon S3 output location specified in the advanced source properties.
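For example, an Amazon Athena JDBC URL that sets the output location might look similar to the following, where the region and bucket values are illustrative placeholders and the remaining parameters depend on your connection configuration:
jdbc:awsathena://AwsRegion=us-west-2;S3OutputLocation=s3://<my_bucket>/<query_results>/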
Fetch Size
Determines the number of rows to read in one result set from Amazon Athena.
Default is 10000.
Encryption Type
Encrypts the data in the Amazon S3 staging directory.
You can select the following encryption types:
  • None
  • SSE-S3
  • SSE-KMS
  • CSE-KMS
Default is None.
Schema Name
Overrides the schema name of the source object.
Source Table Name
Overrides the table name used in the metadata import with the table name that you specify.
SQL Query
Overrides the default SQL query.
Enclose column names in double quotes. The SQL query is case sensitive. Specify an SQL statement supported by the Amazon Athena database.
When you specify the columns in the SQL query, ensure that the column name in the query matches the source column name in the mapping.
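For example, a hypothetical override query for a table with the customer_id and order_total columns might look like the following, with the column names enclosed in double quotes:
SELECT "customer_id", "order_total" FROM "sales_orders" WHERE "order_total" > 100
Replace the table and column names with the names from your Amazon Athena database, and ensure that the column names match the source column names in the mapping.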
Amazon S3 v2
The following table lists the options that you can configure for an Amazon S3 source object:
Option
Description
Source Type
Type of the source from which you want to read data.
You can select the following source types:
  • File
  • Directory
Default is File.
For more information about the source type, see Source types in Amazon S3 V2 sources.
Folder Path
Optional. Overwrites the bucket name or folder path of the Amazon S3 source file. If applicable, include the folder name that contains the source file in the <bucket_name>/<folder_name> format.
If you do not provide the bucket name and specify the folder path starting with a slash (/) in the /<folder_name> format, the folder path is appended to the folder path that you specified in the connection properties.
For example, if you specify the /<dir2> folder path in this property and the <my_bucket1>/<dir1> folder path in the connection property, the resulting folder path is <my_bucket1>/<dir1>/<dir2>.
File Name
Optional. Overwrites the Amazon S3 source file name.
Allow Wildcard Characters
Use the ? and * wildcard characters to specify the folder path or file name if you run a mapping in advanced mode to read data from an Avro, flat, JSON, ORC, or Parquet file.
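For example, to read all the daily order files for a year from a hypothetical folder, you might specify a file name pattern such as orders_??_2023*.csv, where ? matches a single character and * matches multiple characters. The folder and file names shown here are placeholders.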
Enable Recursive Read
Use the recursive read option for flat, Avro, JSON, ORC, and Parquet files. The files that you read using recursive read must have the same metadata. Enable recursive read when you specify wildcard characters in a folder path or file name. To enable recursive read, select the source type as Directory.
Incremental File Load
Incrementally load source files in a directory to read and process only the files that have changed since the last time the mapping task ran.
Staging Directory
Optional. Path of the local staging directory. Ensure that the user has write permissions on the directory. In addition, ensure that there is sufficient space to enable staging of the entire file. Default staging directory is the /temp directory on the machine that hosts the Secure Agent.
When you specify the directory path, the Secure Agent creates folders depending on the number of partitions that you specify, in the following format:
InfaS3Staging<00/11><timestamp>_<partition number>
where 00 represents a read operation and 11 represents a write operation.
For example,
InfaS3Staging000703115851268912800_0
The temporary files are created within the new directory. The staging directory in the source property does not apply to an advanced cluster. However, you must specify a staging directory on Amazon S3 in the advanced configuration.
For more information, see Administrator.
Hadoop Performance Tuning Options
Optional. This property is not applicable for Amazon S3 V2 Connector.
Compression Format
Decompresses data when you read data from Amazon S3.
You can choose to decompress the data in the following formats:
  • None
  • Gzip
Default is None.
Amazon S3 V2 Connector does not support the Lzo and Bzip2 compression formats even though the options appear in this property.
For more information about the compression format, see Data compression in Amazon S3 V2 sources and targets.
Download Part Size
The part size, in bytes, that the Secure Agent uses to download an Amazon S3 object.
Default is 5 MB. Use this property when you run a mapping to read a file of flat format type.
Multipart Download Threshold
Minimum threshold size to download an Amazon S3 object in multiple parts.
To download the object in multiple parts in parallel, ensure that the file size of an Amazon S3 object is greater than the value you specify in this property. Default is 10 MB.
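For example, with the default threshold of 10 MB, a hypothetical 50 MB object can be downloaded in multiple parts in parallel, while a 4 MB object is downloaded in a single request. The object sizes here are illustrative.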
Temporary Credential Duration
The time duration during which an IAM user can use the dynamically generated temporary credentials to access the AWS resource. Enter the time duration in seconds.
Default is 900 seconds. If you require more than 900 seconds, you can set the time duration up to a maximum of 12 hours in the AWS console and then enter the same time duration in this property.
Azure Data Lake Store Gen2
The following table lists the options that you can configure for an Azure Data Lake Store source object:
Option
Description
Concurrent Threads
Optional. Number of concurrent connections to load data from Microsoft Azure Data Lake Storage Gen2. When reading a large file, you can spawn multiple threads to process data. Configure Block Size to divide a large file into smaller parts. Default is 4. Maximum is 10.
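For example, with 4 concurrent threads and the default Block Size of 8 MB, a hypothetical 32 MB file is divided into four 8 MB parts that can be processed in parallel. The file size here is illustrative.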
Filesystem Name Override
Optional. Overrides the default file system name.
Source Type
Type of the source from which you want to read data.
You can select the following source types:
  • File
  • Directory
Default is File.
For more information about the source type, see Directory Source in Microsoft Azure Data Lake Storage Gen2 Sources.
Allow Wildcard Characters
Use the ? and * wildcard characters to specify the folder path or file name if you run a mapping in advanced mode to read data from an Avro, flat, JSON, ORC, or Parquet file.
Directory Override
Optional. Microsoft Azure Data Lake Storage Gen2 directory that you use to read data. Default is root directory. The Secure Agent creates the directory if it does not exist. The directory path specified at run time overrides the path specified while creating a connection.
File Name Override
Optional. Source object. Select the file from which you want to read data. The file specified at run time overrides the file specified in Object.
Block Size
Optional. Divides a large file or object into smaller parts, each of the specified block size. When reading a large file, consider dividing the file into smaller parts and configuring concurrent connections to spawn the required number of threads to process data in parallel. Default is 8 MB.
Compression Format
Optional. Decompresses data when you read data from Microsoft Azure Data Lake Storage Gen2. Select Gzip for flat files.
Timeout Interval
Optional. The number of seconds to wait when attempting to connect to the server. A timeout will occur if the connection cannot be established in the specified amount of time.
Interim Directory
Optional. Path to the staging directory in the Secure Agent machine.
Specify the staging directory where you want to stage the files when you read data from Microsoft Azure Data Lake Store. Ensure that the directory has sufficient space and you have write permissions to the directory.
Default staging directory is /tmp.
You cannot specify an interim directory for an advanced cluster.
Incremental File Load
Incrementally load source files in a directory to read and process only the files that have changed since the last time the mapping task ran.
Enable Recursive Read
Use the recursive read option for flat, Avro, JSON, ORC, and Parquet files. The files that you read using recursive read must have the same metadata. Enable recursive read when you specify wildcard characters in a folder path or file name. To enable recursive read, select the source type as Directory.
Google Cloud Storage
The following table lists the options that you can configure for a Google Cloud Storage source object:
Option
Description
Google Cloud Storage Path
Overrides the Google Cloud Storage path that you specified in the connection. This property is required when the source is not a flat file. Use the following format:
gs://<bucket name> or gs://<bucket name>/<folder name>
Source File Name
Optional. Overrides the Google Cloud Storage source file name that you specified in the Source transformation.
Does not apply when you configure the Is Directory option to read multiple files from a directory.
Is Directory
Select this property to read all the files available in the folder specified in the Google Cloud Storage Path property.
Encryption Type
Method to decrypt data.
You can select one of the following encryption types:
  • Informatica Encryption
  • None
Default is None.
Snowflake Data Cloud
The following table lists the options that you can configure for a Snowflake Data Cloud source object:
Option
Description
Database
Overrides the database specified in the connection.
Schema
Overrides the schema specified in the connection.
Warehouse
Overrides the Snowflake warehouse name specified in the connection.
Role
Overrides the Snowflake role assigned to user specified in the connection.
Table Name
Overrides the table name of the imported Snowflake Data Cloud source table.
Amazon Redshift V2
The following table lists the options that you can configure for an Amazon Redshift V2 source object:
Option
Description
S3 Bucket Name
Amazon S3 bucket name for staging the data. You can also specify the bucket name with the folder path.
Enable Compression
Compresses the staging files into the Amazon S3 staging directory. The task performance improves when the Secure Agent compresses the staging files. Default is selected.
Staging Directory Location
Location of the local staging directory. When you run a task in a Secure Agent runtime environment, specify a directory path that is available on the corresponding Secure Agent machine in the runtime environment. Specify the directory path in the following manner:
<staging directory>
For example, C:\Temp. Ensure that you have write permissions on the directory.
Does not apply to an advanced cluster.
Temporary Credential Duration
The time duration during which an IAM user can use the dynamically generated temporary credentials to access the AWS resource. Enter the time duration in seconds. Default is 900 seconds.
If you require more than 900 seconds, you can set the time duration up to a maximum of 12 hours in the AWS console and then enter the same time duration in this property.
Encryption Type
Encrypts the data in the Amazon S3 staging directory. You can select the following encryption types:
  • None
  • SSE-S3
  • SSE-KMS
  • CSE-KMS
You can only use SSE-S3 encryption in a mapping that runs on an advanced cluster.
Default is None.
Download S3 Files in Multiple Parts
Downloads large Amazon S3 objects in multiple parts. When the file size of an Amazon S3 object is greater than 8 MB, you can choose to download the object in multiple parts in parallel. Default is 5 MB. Does not apply to an advanced cluster.
Multipart Download Threshold Size
The maximum threshold size to download an Amazon S3 object in multiple parts. Default is 5 MB. Does not apply to an advanced cluster.
Schema Name
Overrides the default schema name.
You cannot configure a custom query when you use the schema name.
Source Table Name
Overrides the default source table name.
When you select the source type as Multiple Objects or Query, you cannot use the Source Table Name option.
Databricks Delta
The following table lists the options that you can configure for a Databricks Delta source object:
Option
Description
Database Name
Overrides the database name provided in the connection and the database name provided during the metadata import.
To read from multiple objects, ensure that you have specified the database name in the connection properties.
Table Name
Overrides the table name used in the metadata import with the table name that you specify.
SQL Override
Overrides the default SQL query used to read data from a Databricks Delta custom query source.
The column names in the SQL override query must match the column names in the custom query in a SQL transformation.
The metadata of the source must be the same as the metadata of the SQL override query.
You can use the option when you run the profiling task on a Data Integration Server.
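For example, a hypothetical override query might look like the following, where the schema, table, and column names are placeholders that must match the metadata of the source:
SELECT customer_id, order_total FROM sales.orders WHERE order_total > 100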
Staging Location
Relative directory path to the staging file storage.
  • If the Databricks cluster is deployed on AWS, use the path relative to the Amazon S3 staging bucket.
  • If the Databricks cluster is deployed on Azure, use the path relative to the Azure Data Lake Store Gen2 staging filesystem name.
When you use the Unity Catalog, you must provide a pre-existing location on your cloud storage in the Staging Location. The Staging Location is not required for the Unity Catalog when you run the profiling task on a Data Integration Server.
Job Timeout
Maximum time in seconds for the Spark job to complete processing. If the job does not complete within the specified time, the Databricks cluster terminates the job and the mapping fails.
If the job timeout is not specified, the mapping shows success or failure based on the job completion.
Job Status Poll Interval
Poll interval in seconds at which the Secure Agent checks the status of the job completion.
Default is 30 seconds.
DB REST API Timeout
The maximum time in seconds for which the Secure Agent retries the REST API calls to Databricks when there is an error due to the network connection or when the REST endpoint returns a 5xx HTTP error code.
Default is 10 minutes.
DB REST API Retry Interval
The time interval in seconds at which the Secure Agent must retry the REST API call when there is an error due to the network connection or when the REST endpoint returns a 5xx HTTP error code.
This value does not apply to the job status REST API. Use the Job Status Poll Interval value for the job status REST API.
Default is 30 seconds.
Microsoft Azure Synapse SQL
The following table lists the options that you can configure for a Microsoft Azure Synapse SQL source object:
Option
Description
Azure Blob Container Name
Microsoft Azure Blob Storage container name. Required if you select Azure Blob storage in the connection properties.
ADLS FileSystem Name
The name of the file system in Microsoft Azure Data Lake Storage Gen2.
Required if you select ADLS Gen2 storage in the connection properties. You can also provide the path of the directory under the given file system.
Schema Name Override
Overrides the schema specified in the connection.
Table Name Override
Overrides the table name of the imported Microsoft Azure Synapse SQL source table.
Field Delimiter
Character used to separate fields in the file. Default is 0x1e. You can specify 'TAB' or 0-256 single-character printable and non-printable ASCII characters. Non-printable characters must be specified in hexadecimal.
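For example, you might specify | to use a pipe character as the field delimiter, or a hexadecimal value such as 0x1f to use a non-printable character.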
Number of concurrent connections to Blob Store
Number of concurrent connections to extract data from the Microsoft Azure Blob Storage. When reading a large-size blob, you can spawn multiple threads to process data. Configure Blob Part Size to partition a large-size blob into smaller parts. Default is 4. Maximum is 10.
Blob Part Size
Partitions a blob into smaller parts, each of the specified part size. When reading a large-size blob, consider partitioning the blob into smaller parts and configuring concurrent connections to spawn the required number of threads to process data in parallel. Default is 8 MB.
Quote Character
The Secure Agent skips the specified character when you read data from Microsoft Azure Synapse SQL. Default is 0x1f.
Interim Directory
Optional. Path to the staging directory in the Secure Agent machine. Specify the staging directory where you want to stage the files when you read data from Microsoft Azure Synapse SQL. Ensure that the directory has sufficient space and you have write permissions to the directory.
Default staging directory is /tmp.
You cannot specify an interim directory for an advanced cluster.
JDBC V2
The following table lists the options that you can configure for a JDBC V2 source object:
Option
Description
Pre SQL
The SQL query that the Secure Agent runs before reading data from the source.
Post SQL 1
The SQL query that the Secure Agent runs after reading data from the source.
Fetch Size
The number of rows that the Secure Agent fetches from the database in a single call.
Table Name
Overrides the table name used in the metadata import with the table name that you specify.
Schema Name
Overrides the schema name of the source object.
If you specify the schema name both in the connection and the source properties, the Secure Agent uses the schema name specified in the source properties.
SQL Override
The SQL statement to override the default query and the object name that is used to read data from the JDBC V2 source.
1 Doesn't apply to mappings in advanced mode.
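For example, hypothetical Pre SQL, Post SQL, and SQL override statements might look like the following, where the schema, table, and column names are placeholders:
Pre SQL: DELETE FROM staging.load_audit WHERE load_date < CURRENT_DATE
Post SQL: INSERT INTO staging.load_audit (load_date) VALUES (CURRENT_DATE)
SQL Override: SELECT customer_id, order_total FROM sales.orders WHERE order_total > 100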
Oracle Cloud Object Storage
The following table lists the options that you can configure for an Oracle Cloud Object Storage source object:
Option
Description
Folder Path
Overrides the folder path value in the Oracle Cloud Object Storage connection.
File Name
Overrides the Oracle Cloud Object Storage source file name.
Staging Directory
Path of the local staging directory. Ensure that the user has write permissions on the directory. In addition, ensure that there is sufficient space to enable staging of the entire file. Default staging directory is the /temp directory on the machine that hosts the Secure Agent. The temporary files are created within the new directory.
Multipart Download Threshold
Minimum threshold size to download an Oracle Cloud Object Storage object in multiple parts. To download the object in multiple parts in parallel, ensure that the file size of an Oracle Cloud Object Storage object is greater than the value you specify in this property.
Range:
  • Minimum: 4 MB
  • Maximum: 5 GB
Default is 64 MB.
Download Part Size
The part size, in bytes, that the Secure Agent uses to download an Oracle Cloud Object Storage object.
Range:
  • Minimum: 4 MB
  • Maximum: 1 GB
Default is 32 MB.
SAP BW
The following table lists the options that you can configure for an SAP BW source object:
Option
Description
Packet size in MB
Size of the HTTP packet that SAP sends to the Secure Agent. The unit is MB. Default is 10 MB.
Package size in ABAP in rows
Number of rows that are read and buffered in SAP at a time. Default is 1000 rows.
Enable Compression
When selected, the ABAP program compresses the data in the gzip format before it sends the data to the Secure Agent. If the Secure Agent and the SAP system are not on the same network, you might want to enable the compression option to optimize performance. Default is not selected.
SAP Table
The following table lists the options that you can configure for an SAP ERP or SAP S/4 HANA source object:
Option
Description
Number of rows to be fetched
The number of rows that are randomly retrieved from the SAP Table. Default value of zero retrieves all the rows in the table.
Number of rows to be skipped
The number of rows to be skipped.
Packet size in MB
Packet size. Default is 10 MB.
Data extraction mode
You can use one of the following modes to read data from an SAP Table:
  • Normal Mode. Use this mode to read small volumes of data from an SAP Table.
  • Bulk Mode. Use this mode to read large volumes of data from an SAP Table. Use bulk mode for better performance.
For more information about the data extraction mode, see the Data Extraction mode section in the Performance Tuning Guidelines for SAP Table Reader Connector How-To Library article.
Enable Compression
Enables compression.
If the Secure Agent and the SAP System are not located in the same network, you may want to enable the compression option to optimize performance.
Update Mode
When you read data from SAP tables, you can configure a mapping to perform delta extraction. You can use one of the following options based on the update mode that you want to use:
  • 0- Full. Use this option when you want to extract all the records from an SAP table instead of reading only the changed data.
  • 1- Delta initialization without transfer. Use this option when you do not want to extract any data but want to record the latest change number in the Informatica custom table /INFADI/TBLCHNGN for subsequent delta extractions.
  • 2- Delta initialization with transfer. Use this option when you want to extract all the records from an SAP table to build an initial set of the data and subsequently run a delta update session to capture the changed data.
  • 3- Delta update. Use this option when you want to read only the data that changed since the last data extraction.
  • 4- Delta repeat. Use this option if you encountered errors in a previous delta update and want to repeat the delta update.
  • Parameter. When you use this option, the Secure Agent uses the update mode value from a parameter file.
Default is 0- Full.
For more information about the update mode, see the Update modes for delta extraction section in the SAP connector help.
Parameter Name for Update Mode
The parameter name that you defined for update mode in the parameter file.
Override Table Name for Delta Extraction
Overrides the SAP table name with the SAP structure name from which you want to extract delta records that are captured with the structure name in the CDPOS table.
