To achieve the best performance for Big Data Management on the Azure cloud, consider the following best practices and recommendations:
Create the following resources in the same geographic location and virtual network (VNet):
Azure SQL databases for the domain, monitoring, and Model repositories
Azure VM for the Informatica domain
Azure Storage (ADLS or GPv2)
HDInsight cluster
Azure Windows VM with Developer tool installation
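A co-location check like the one this list implies can be sketched in a few lines of Python. This is only an illustration: the Resource type, the find_misplaced helper, and the inventory values (resource names, regions, VNet name) are hypothetical placeholders, and in practice the inventory would come from whatever tooling you use to list your Azure resources.

```python
from typing import List, NamedTuple


class Resource(NamedTuple):
    """Hypothetical record describing where a deployed resource lives."""
    name: str
    location: str
    vnet: str


def find_misplaced(resources: List[Resource]) -> List[str]:
    """Return names of resources whose location or VNet differs from the first entry."""
    if not resources:
        return []
    reference = resources[0]
    expected = (reference.location.lower(), reference.vnet.lower())
    return [
        r.name
        for r in resources[1:]
        if (r.location.lower(), r.vnet.lower()) != expected
    ]


# Placeholder inventory; the names, regions, and VNet are assumptions for illustration.
inventory = [
    Resource("informatica-domain-vm", "eastus", "bdm-vnet"),
    Resource("hdinsight-cluster", "eastus", "bdm-vnet"),
    Resource("developer-tool-vm", "westus", "bdm-vnet"),
]

print(find_misplaced(inventory))  # ['developer-tool-vm']
```

The first entry is treated as the reference, so list the Informatica domain VM first and compare everything else against it.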
Choose between ADLS and General-Purpose Storage (GPv2) for persistent data storage, depending on your use case. For example, ADLS is more commonly used for a data analytics use case.
Because data in ADLS or GPv2 persists independently of the cluster, you can terminate the HDInsight cluster with a Delete Cluster task after the job is completed, which can provide significant cost savings.
To replicate data in Azure Storage in different locations, use cross-regional replication with RA-GRS. RA-GRS replicates your data to another data center in a secondary region and also provides you with the option to read from the secondary region. See the Azure documentation.
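As a small illustration of the read option, an RA-GRS account exposes a read-only secondary Blob endpoint whose host name is the account name with a -secondary suffix. The sketch below derives both endpoints from an account name. The function names and the example account name are hypothetical, and the sketch assumes the public Azure cloud domain (core.windows.net); sovereign clouds use different suffixes.

```python
import re

# Storage account names are 3-24 characters, lowercase letters and digits only.
_ACCOUNT_NAME = re.compile(r"^[a-z0-9]{3,24}$")


def primary_blob_endpoint(account_name: str) -> str:
    """Build the primary Blob endpoint URL for a storage account."""
    if not _ACCOUNT_NAME.match(account_name):
        raise ValueError(f"invalid storage account name: {account_name!r}")
    return f"https://{account_name}.blob.core.windows.net"


def secondary_blob_endpoint(account_name: str) -> str:
    """Build the read-only secondary Blob endpoint URL available with RA-GRS."""
    if not _ACCOUNT_NAME.match(account_name):
        raise ValueError(f"invalid storage account name: {account_name!r}")
    return f"https://{account_name}-secondary.blob.core.windows.net"


# "bdmstorage" is a placeholder account name.
print(primary_blob_endpoint("bdmstorage"))
print(secondary_blob_endpoint("bdmstorage"))
```

Reads against the secondary endpoint can return slightly stale data, because replication to the secondary region is asynchronous.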
The Spark shuffle service is enabled by default if you select Spark as the cluster type during the HDInsight cluster configuration process. Choose Spark version 2.3.0 (HDI 3.6).
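If you want to confirm the setting on a running cluster, one approach is to inspect the Spark configuration properties. The sketch below is a minimal check, assuming you have already collected the properties into a dictionary (for example from spark-defaults.conf); the function name and the sample values are hypothetical. The property names spark.shuffle.service.enabled and spark.dynamicAllocation.enabled are standard Spark settings, and both default to false when unset.

```python
from typing import Dict, List


def _is_true(conf: Dict[str, str], key: str) -> bool:
    """Interpret a Spark property as a boolean; unset properties count as false."""
    return conf.get(key, "false").strip().lower() == "true"


def shuffle_service_issues(conf: Dict[str, str]) -> List[str]:
    """Return human-readable problems with the shuffle service settings, if any."""
    issues = []
    shuffle_on = _is_true(conf, "spark.shuffle.service.enabled")
    dynamic_on = _is_true(conf, "spark.dynamicAllocation.enabled")
    if not shuffle_on:
        issues.append("spark.shuffle.service.enabled is not true")
    if dynamic_on and not shuffle_on:
        issues.append("dynamic allocation is on but the shuffle service is off")
    return issues


# Sample values are placeholders, not taken from a real cluster.
good = {"spark.shuffle.service.enabled": "true", "spark.dynamicAllocation.enabled": "true"}
bad = {"spark.dynamicAllocation.enabled": "true"}

print(shuffle_service_issues(good))  # []
print(len(shuffle_service_issues(bad)))  # 2
```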