When you configure a Databricks SQL file mass ingestion task to load large data sets to Databricks, query performance can be impacted. To optimize performance, dedicate separate warehouses for load operations and for query operations. The number and capacity of clusters in a warehouse determine the number of data files that can be processed in parallel.
You can split larger data files so that the load scales linearly. A smaller warehouse is generally sufficient unless you need to concurrently load a large number of files, ranging from a few hundred to several thousand. A larger warehouse such as X-Large or 2X-Large consumes more credits and might not improve performance.
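For illustration only, the following is a minimal Python sketch of pre-splitting a large delimited file into smaller chunk files before staging them for the load. The file name, output directory, and chunk size are hypothetical placeholders, not values that the mass ingestion task requires.

import csv
from pathlib import Path

def split_csv(source: Path, out_dir: Path, rows_per_chunk: int = 1_000_000) -> list[Path]:
    """Split one large CSV into smaller chunk files that a warehouse can load in parallel."""
    out_dir.mkdir(parents=True, exist_ok=True)
    chunks = []
    with source.open(newline="") as src:
        reader = csv.reader(src)
        header = next(reader)
        chunk_idx, row_count, writer, out_file = 0, 0, None, None
        for row in reader:
            if writer is None or row_count >= rows_per_chunk:
                if out_file:
                    out_file.close()
                chunk_idx += 1
                row_count = 0
                chunk_path = out_dir / f"{source.stem}_part{chunk_idx:04d}.csv"
                out_file = chunk_path.open("w", newline="")
                writer = csv.writer(out_file)
                writer.writerow(header)  # repeat the header in every chunk
                chunks.append(chunk_path)
            writer.writerow(row)
            row_count += 1
        if out_file:
            out_file.close()
    return chunks

# Example: split a hypothetical large extract into ~1M-row chunks before staging to ADLS Gen2.
# chunks = split_csv(Path("orders.csv"), Path("orders_chunks"))

Smaller chunk files let the clusters in the load warehouse pick up files in parallel instead of serializing the load on a single large file.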
To understand how the cluster size of a warehouse affects query processing, refer to the guidelines and adjust the configuration based on the complexity and volume of the data that you want to process.
In the example, the source is ADLS Gen2 and the target is Databricks hosted on Microsoft Azure.
The results show that increasing the cluster size of the warehouse considerably improved the performance of the file mass ingestion task when writing to the Databricks target. The performance improved by nearly 3X.
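As a point of reference, the following Python sketch shows the underlying load pattern that the example relies on: a load statement issued against a SQL warehouse that is dedicated to loads, reading staged files from ADLS Gen2. This is not the mass ingestion task's internal implementation; it assumes the databricks-sql-connector package, and the workspace URL, warehouse HTTP path, access token, target table, and storage location are placeholders.

from databricks import sql

LOAD_WAREHOUSE_HTTP_PATH = "/sql/1.0/warehouses/<load-warehouse-id>"  # warehouse dedicated to loads, not queries

with sql.connect(
    server_hostname="<workspace>.azuredatabricks.net",
    http_path=LOAD_WAREHOUSE_HTTP_PATH,
    access_token="<personal-access-token>",
) as connection:
    with connection.cursor() as cursor:
        # COPY INTO picks up the staged chunk files and loads them in parallel,
        # bounded by the number and capacity of clusters in the warehouse.
        cursor.execute("""
            COPY INTO my_catalog.my_schema.orders
            FROM 'abfss://landing@<storage-account>.dfs.core.windows.net/orders_chunks/'
            FILEFORMAT = CSV
            FORMAT_OPTIONS ('header' = 'true')
        """)

Keeping the load on its own warehouse prevents large ingestion jobs from competing with interactive queries, which is the configuration the example measures.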