When you configure a mass ingestion job to write to a target, the number of files written with each COPY command to the target affects performance.
In the source properties of the mass ingestion task, specify a batch size, which is the maximum number of files to transfer in a batch. An appropriate batch size optimizes the performance of the task.
The default batch size is 5. When you use Snowflake Cloud Data Warehouse V2 Connector to write from Amazon S3 or Azure Blob Storage sources to a Snowflake target, you can specify a maximum batch size of 1000 in the Amazon S3 V2 or Azure Blob Storage V3 source properties.
Similarly, when you use Databricks DB SQL Connector to write from Amazon S3 or Azure Blob Storage sources to a Databricks target, you can specify a maximum batch size of 1000 in the Amazon S3 V2 or Azure Blob Storage V3 source properties.
For other sources that mass ingestion supports, specify a batch size between 1 and 20.
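As an illustration of how the batch size determines the number of COPY commands that the writer issues, consider the following Python sketch. The file counts and the dictionary labels are hypothetical and only reflect the connector maximums described above; they are not product identifiers.

    import math

    # Maximum batch sizes described above; keys are illustrative labels, not product identifiers.
    MAX_BATCH_SIZE = {
        "snowflake_from_s3_or_blob": 1000,
        "databricks_from_s3_or_blob": 1000,
        "other_sources": 20,
    }

    def number_of_batches(num_files, batch_size, scenario):
        """Return how many batches (COPY commands) are needed to transfer num_files."""
        limit = MAX_BATCH_SIZE[scenario]
        if not 1 <= batch_size <= limit:
            raise ValueError(f"Batch size must be between 1 and {limit} for {scenario}")
        return math.ceil(num_files / batch_size)

    # 1,000 source files with the default batch size of 5 require 200 COPY commands;
    # with a batch size of 1000 they fit in a single batch.
    print(number_of_batches(1000, 5, "snowflake_from_s3_or_blob"))     # 200
    print(number_of_batches(1000, 1000, "snowflake_from_s3_or_blob"))  # 1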
The following examples illustrate the optimization of a mass ingestion task when you configure a batch size to write to a target.
Impact of Batch Size on Mass Ingestion Performance - Amazon S3 Source and Snowflake Target
The following graph depicts the impact of the batch size on the performance of a mass ingestion task when the source is Amazon S3 and the target is Snowflake:
You can observe an improvement of around 10X in the writer performance when you specify an appropriate batch size.
Impact of Batch Size on Mass Ingestion Performance - Local Flat File Source and Snowflake Target
The following graph depicts the impact of the batch size on the performance of a mass ingestion task when the source is a local flat file and the target is Snowflake:
You can observe around 4X improvement in the mass ingestion performance with an increase in the batch size.
Impact of Batch Size on Mass Ingestion Performance - Azure Blob Source and SQL Data Warehouse Target
The following graph illustrates the impact of the batch size on the performance of a mass ingestion task when the source is Azure Blob and the target is SQL Data Warehouse:
NOPQ (Number of Polybase Queries) = Number of Files/Batch Size.
You can observe around 2X improvement in the writer performance.
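For example, applying the NOPQ formula above to an illustrative load of 1,000 files, a batch size of 5 results in 200 Polybase queries, while a batch size of 20 reduces that to 50 queries.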
Performance Comparison Between a Mass Ingestion Task and a Mapping Task - Flat File Source and Microsoft Azure Data Lake Store Target
The following graph illustrates the impact of the batch size in a mapping task and in a mass ingestion task when the source is a flat file and the target is Microsoft Azure Data Lake Store:
The mass ingestion task performs around 9X faster than the mapping task.
Impact of Batch Size on Mass Ingestion Performance - ADLS Gen2 Source and Databricks Target
The following graph depicts the impact of the batch size on the performance of a mass ingestion task when the source is ADLS Gen2 and the target is Databricks:
You can observe an improvement of around 2X in the writer performance when you specify an appropriate batch size.
Impact of Tuning the Number of Parallel Batches with the Batch Size - ADLS Gen2 Source and Databricks DB SQL Target
The following graph illustrates the impact of tuning the number of parallel batches together with the batch size on the performance of a mass ingestion task when the source is ADLS Gen2 and the target is Databricks DB SQL:
NOPB = Number of Parallel Batches.
You can observe a significant improvement in the writer performance.
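As a rough, illustrative model of why this helps, assume that batches are spread evenly across the parallel batch slots; the following sketch, with hypothetical file counts and settings, estimates how many sequential waves of COPY operations remain:

    import math

    def copy_waves(num_files, batch_size, parallel_batches):
        """Estimate sequential waves of COPY operations when batches run in parallel.

        Assumes batches are distributed evenly across the parallel batch slots,
        which simplifies the actual scheduling behavior.
        """
        batches = math.ceil(num_files / batch_size)       # total batches to issue
        return math.ceil(batches / parallel_batches)      # waves that run back to back

    # 2,000 hypothetical files with a batch size of 100: raising the number of
    # parallel batches (NOPB) from 1 to 4 cuts the waves from 20 to 5.
    print(copy_waves(2000, 100, 1))  # 20
    print(copy_waves(2000, 100, 4))  # 5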
Guidelines for optimizing the mass ingestion task performance to a Microsoft Azure Data Lake Store target
Consider the following guidelines to optimize the performance of the mass ingestion task while uploading files to a Microsoft Azure Data Lake Store target:
The batch size must not exceed 20 when the source is a flat file.
The batch size must be less than or equal to the number of VCores in the agent virtual machine. You can validate both conditions as shown in the sketch after these guidelines.
To avoid garbage collection overhead and out-of-memory errors, set the JVM maximum heap size to the recommended value of 1 GB. With the default JVM heap size, the mass ingestion task might fail with an out-of-memory error, and an inappropriate JVM maximum heap size can also degrade mass ingestion performance.
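A minimal sketch of these checks, assuming the VCore count of the agent virtual machine can be approximated with os.cpu_count() and that you know whether the source is a flat file; the function name and values are illustrative, not part of the product:

    import os

    def validate_adls_batch_size(batch_size, source_is_flat_file):
        """Check a batch size against the Azure Data Lake Store target guidelines."""
        vcores = os.cpu_count() or 1  # VCores on the agent virtual machine (approximation)
        if source_is_flat_file and batch_size > 20:
            raise ValueError("Batch size must not exceed 20 for a flat file source")
        if batch_size > vcores:
            raise ValueError(f"Batch size {batch_size} exceeds the {vcores} VCores on the agent")

    # Separately, set the JVM maximum heap size on the agent to the recommended 1 GB,
    # for example with the standard JVM option -Xmx1024m; where you set the option
    # depends on your agent configuration.
    validate_adls_batch_size(2, source_is_flat_file=True)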
Guidelines for optimizing the mass ingestion task performance to a Databricks DB SQL target
Consider the following guidelines to optimize the performance of the mass ingestion task while writing data to a Databricks DB SQL target:
The batch size must not exceed 1000 when the source is ADLS Gen2 or Amazon S3.
For large data sets, larger clusters give better performance.
General guidelines for configuring the batch size
Consider the following guidelines to optimize performance:
When you specify a batch size, also ensure that the source file is split into multiple smaller files to optimize performance, as illustrated in the sketch after these guidelines.
For optimal performance, the batch size should be equal to or close to the number of files to be loaded to the target.
Increasing the number of parallel batches along with the batch size can further improve performance.
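The following sketch, assuming a hypothetical CSV source file with a header row, shows one way to split a large source file into smaller files so that the batch size can stay close to the number of files; the file name and row count are illustrative.

    import csv

    def split_csv(path, rows_per_file):
        """Split a large CSV into smaller files, repeating the header in each part."""
        part_paths = []
        with open(path, newline="") as src:
            reader = csv.reader(src)
            header = next(reader)
            part, rows = 0, []
            for row in reader:
                rows.append(row)
                if len(rows) == rows_per_file:
                    part_paths.append(_write_part(path, part, header, rows))
                    part, rows = part + 1, []
            if rows:
                part_paths.append(_write_part(path, part, header, rows))
        return part_paths

    def _write_part(path, part, header, rows):
        part_path = f"{path}.part{part:04d}.csv"
        with open(part_path, "w", newline="") as out:
            writer = csv.writer(out)
            writer.writerow(header)
            writer.writerows(rows)
        return part_path

    # Splitting the source into 40 files and using a batch size of 40 keeps the
    # batch size close to the number of files, per the guideline above.
    # split_csv("large_source.csv", rows_per_file=250000)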
For information about the appropriate batch size value that you can specify for different sources and targets in a mass ingestion task, see the Local folder source properties topic in the Mass Ingestion help.