Recommendations for Batch Job Optimization

A batch job is a program in the MDM Hub that you can run to complete a discrete unit of work. You can launch batch jobs individually or as a group from the Hub Console or with the SIF APIs. You can configure settings to optimize the performance of batch jobs.
The following table lists the batch job parameters and the recommended settings to achieve baseline performance:
Parameter
Recommended Setting
Description
Cleanse Thread Count
Used in the following batch jobs:
  • Match Job
  • Generate Match Tokens process on Load job
  • Stage job
Start with the number of available CPU cores. You can increase the number of threads based on CPU utilization.
Default is 1.
Available in “Process Server > Threads for Cleanse Processing”.
Total number of threads that the master or slave Process Server uses when it runs Generate Match Tokens after Load, Match jobs, and Stage jobs.
Threads for Batch Processing
Used in the following batch jobs:
  • Automerge Job
  • Load Job
  • Batch Delete
  • Batch Unmerge
  • Batch Revalidate
Specify a value that is equivalent to four times the number of CPU cores on the system on which the Process Server is deployed.
Default is 20.
Available in “Process Server > Threads for Batch Processing”.
Maximum number of threads to use for a batch process.
For example, if the host machine has 16 CPU cores, set the Threads for Batch Processing in the Process Server registration to 64. Applicable only if the Process Server is marked for batch processing.
Of the total number of threads available on the Process Server, you dedicate n threads to batch jobs by setting the Threads for Batch Processing property.
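The two sizing rules above (cleanse threads start at the core count, batch threads at four times the core count) can be sketched as a small helper. This is only an illustration: the function name and the dictionary keys are hypothetical and are not MDM Hub property names.

```python
import os


def recommended_thread_settings(cpu_cores: int) -> dict:
    """Return starting values based on the rules of thumb in the table."""
    if cpu_cores < 1:
        raise ValueError("cpu_cores must be at least 1")
    return {
        # Cleanse Thread Count: start with the number of available cores.
        "threads_for_cleanse_processing": cpu_cores,
        # Threads for Batch Processing: four times the number of cores.
        "threads_for_batch_processing": 4 * cpu_cores,
    }


if __name__ == "__main__":
    # os.cpu_count() can return None, so fall back to 1 in that case.
    print(recommended_thread_settings(os.cpu_count() or 1))
```

For a host with 16 CPU cores, the helper returns 16 cleanse threads and 64 batch threads, which matches the example in the table. Treat both numbers as starting points and adjust them based on observed CPU utilization.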
Controller Thread Time Out
Used in the following batch jobs:
  • Automerge Job
  • Load Job
  • Batch Delete
  • Batch Unmerge
  • Batch Recalculate
300000 (5 minutes).
Default is 300000.
com.informatica.mdm.loadbalance.ControllerThread.timeout
This property is found in the cmxcleanse.properties file.
When the MDM Hub distributes the load to different slave Process Servers, all slave Process Servers that are processing blocks must complete the job within the timeout period after the last block is sent to a slave Process Server.
If a block is not completed in time, it is marked with ‘No Action’ in the batch result. Note that the batch is not marked as failed, because the remaining blocks are successfully loaded.
Load analyze threshold rate
Used in the following batch jobs:
  • Automerge Job
  • Load Job
  • Batch Delete
  • Batch Unmerge
  • Batch Recalculate
Default is 10.
cmx.server.batch.load.analyze_threshold_rate
Available in cmxserver.properties.
For Oracle only. Available from MDM 10.0 HotFix 1.
Specifies how frequently the MDM Hub gathers analytical statistics for tables affected by a batch Load job. Set to 0 to disable statistics collection. Set to 1 to collect statistics only at the end of a Load job for base object and cross-reference tables.
For example, if the threshold is 10, the MDM Hub gathers statistics at every 10^n records, that is, whenever the inserted record count reaches 100, 1000, 10000, and so on.
Recycler Thread Max Idling
Used in the following batch jobs:
  • Automerge Job
  • Load Job
  • Batch Delete
  • Batch Unmerge
  • Batch Recalculate
300000 (5 minutes).
Default is 300000 (5 minutes).
com.informatica.mdm.batchserver.RecyclerThread.max_idling
This property is found in the cmxcleanse.properties file.
If a slave Process Server thread is processing a block of a batch job and remains idle for the duration specified in this attribute, the thread is marked as 'dead'.
If a slave Process Server times out as noted earlier, the corresponding block is marked with ‘No Action’ in the batch result. Note that the batch is not marked as failed, because the remaining blocks are successfully loaded.
Automerge: Automerge Threads Per Job
Default is 1.
cmx.server.automerge.threads_per_job
This property is found in the cmxserver.properties file.
Maximum number of threads distributed across different Process Servers to process the automerge job.
For example, if this value is 20, the automerge job could be distributed across two Process Servers with 10 threads each. The distribution depends on factors such as the CPU weightage of the Process Server and other jobs running on the Process Server.
This value must be less than the value of the 'Threads for Batch Processing' attribute specified for the Process Server.
The optimum value for a database server with a 16-core processor and a solid-state drive (SSD) set up in a RAID is 20. Based on CPU utilization on the different Process Servers, you can increase the number of threads.
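The constraint that the threads-per-job value must stay below each Process Server's batch thread setting is easy to check before you change the property. The following is a minimal sketch; the function name and the server names are hypothetical, and the thread limits are supplied by hand rather than read from the MDM Hub.

```python
def servers_violating_limit(threads_per_job: int,
                            threads_for_batch_by_server: dict) -> list:
    """Return the names of Process Servers whose batch thread setting is
    not greater than the proposed threads-per-job value."""
    return [
        name
        for name, limit in sorted(threads_for_batch_by_server.items())
        # The table requires threads_per_job to be strictly less than the limit.
        if threads_per_job >= limit
    ]


if __name__ == "__main__":
    # Hypothetical servers and their batch thread settings.
    limits = {"process-server-1": 64, "process-server-2": 20}
    print(servers_violating_limit(20, limits))
```

An empty list means that the proposed value satisfies the rule on every listed Process Server.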
Automerge: Automerge Block Size
Default is 250.
cmx.server.automerge.block_size
This property is found in the cmxserver.properties file.
Maximum number of records to be sent for merges to each Process Server in one block.
For example, consider the scenario of two Process Servers with 1000 records to be merged. If this value is 250, each Process Server first gets 250 records and then another 250 records.
Increasing this value can improve performance, depending on how powerful the application servers and database servers are.
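The block example above can be reproduced with a short simulation. The sketch assumes a plain round-robin assignment of blocks to Process Servers, which is a simplification: the actual distribution also depends on factors such as CPU weightage and other running jobs. The function and server names are hypothetical.

```python
from itertools import cycle


def distribute_blocks(total_records: int, block_size: int, servers: list) -> dict:
    """Split the records into blocks and assign them to servers in turn.

    Round-robin assignment is an assumption made for illustration only.
    """
    if block_size < 1 or not servers:
        raise ValueError("block_size must be positive and servers non-empty")
    assignments = {server: [] for server in servers}
    next_server = cycle(servers)
    for start in range(0, total_records, block_size):
        # The last block can be smaller than block_size.
        size = min(block_size, total_records - start)
        assignments[next(next_server)].append(size)
    return assignments


if __name__ == "__main__":
    print(distribute_blocks(1000, 250, ["ps1", "ps2"]))
    # {'ps1': [250, 250], 'ps2': [250, 250]}
```

With a block size of 500, the same 1000 records would reach each Process Server as a single block, which reduces the number of round trips but increases the work per block.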
Automerge: STRP table management
Default is 10.
cmx.server.batch.strp_upd_threads_per_job
This property is found in the cmxcleanse.properties file.
Applicable for IBM DB2.
For a larger data set participating in Automerge, you can make faster updates to the STRP table by increasing this thread count based on the CPU of your Process Server.
Load: Batch Threads Per Job
Default is 1.
cmx.server.batch.threads_per_job
This property is found in the cmxserver.properties file.
Maximum number of threads distributed across different Process Servers to process the load job.
For example, if this value is 20, the load process could be distributed across two Process Servers with 10 threads each. The distribution depends on factors such as the CPU weightage of the Process Server and other jobs running on the Process Server.
This value must be less than the value of the 'Threads for Batch Processing' attribute specified for the Process Server.
The optimum value for a database server with a 16-core processor and a solid-state drive (SSD) set up in a redundant array of independent disks (RAID) is 20. Based on CPU utilization on the different Process Servers, you can increase the number of threads.
Load: Batch Block Size
Default is 250.
cmx.server.batch.load.block_size
This property is found in the cmxserver.properties file.
Maximum number of records to be sent for load to each Process Server in one block.
For example, consider the scenario of two Process Servers with 1000 records to be loaded. If this value is 250, each Process Server first gets 250 records and then another 250 records.
Increasing this value can improve performance, depending on how powerful the application servers and database servers are.
Load: Threads per job for generate tokens, if the 'Generate Match Tokens on Load' attribute is enabled on the base object
Same as 'Threads for Cleanse Processing'.
See the 'Threads for Cleanse Processing' attribute described earlier.
Note that this thread attribute is different from the core threads per job attribute of the Load job described earlier.
If 'Generate Match Tokens on Load' is not selected, this attribute does not affect the performance of the Load job.
Batch Recalculate (SIF API Request): Recalculate Threads Per Job
Same property, re-used from LOAD Job. See LOAD Job section for more details.
cmx.server.batch.threads_per_job
This property is found in the cmxserver.properties file.
Same property, re-used from LOAD Job. See LOAD Job section for more details.
Batch Recalculate (SIF API Request): Recalculate Block Size
Default is 250.
cmx.server.batch.recalculate.block_size
This property is found in the cmxserver.properties file.
Maximum number of records to be sent to each Process Server in one block to recalculate the BVT.
For example, consider the scenario of two Process Servers with 1000 records to be recalculated. If this value is 250, each Process Server first gets 250 records and then another 250 records.
Increasing this value can improve performance, depending on how powerful the application servers and database servers are.
Batch Unmerge (SIF API Request): Threads Per Job
Same property, re-used from LOAD Job. Refer to LOAD Job section for more details.
cmx.server.batch.threads_per_job
Available in cmxserver.properties.
Same property, re-used from LOAD Job. Refer to LOAD Job section for more details.
Batch Unmerge (SIF API Request): Unmerge Block Size
Default is 250.
cmx.server.batch.batchunmerge.block_size
This property is found in the cmxserver.properties file.
Maximum number of records to be sent for unmerges to each Process Server in one block.
For example, consider the scenario of two Process Servers with 1000 records to be unmerged. If this value is 250, each Process Server first gets 250 records and then another 250 records.
Increasing this value can improve performance, depending on how powerful the application servers and database servers are.
Batch Delete (SIF API Request): Threads per job
Same property, re-used from LOAD Job. See LOAD Job section for more details.
cmx.server.batch.threads_per_job
This property is found in the cmxserver.properties file.
Same property, re-used from LOAD Job. See LOAD Job section for more details.
Batch Delete (SIF API Request): Delete Batch Block Size
Default is 250.
cmx.server.batch.delete.block_size
This property is found in the cmxserver.properties file.
Maximum number of records to be sent for deletion to each Process Server in one block.
For example, consider the scenario of two Process Servers with 1000 records to be deleted. If this value is 250, each Process Server first gets 250 records and then another 250 records.
Increasing this value can improve performance, depending on how powerful the application servers and database servers are.
Tokenize: Tokenization File Loader Option
Default is true.
cmx.server.tokenize.file_load
This property is found in the cmxcleanse.properties file.
Applicable for Oracle and DB2.
If true, the DB2 file loader or Oracle SQL Loader is used to load the records during the tokenization job.
If file writing causes a performance issue, you can change this value to false so that data is written directly to the database every time instead of through the file loader. Generally, the file loader is faster than a direct database write. Choose the option that suits your environment.
Tokenize: STRP table management
Default is 10.
cmx.server.batch.strp_ins_threads_per_job
Available in cmxcleanse.properties.
Currently applicable only for DB2.
For a larger data set, you can make inserts into the STRP table faster by increasing this thread count based on the CPU of your Process Server.
Tokenize: STRP table management
Default is 10.
cmx.server.batch.strp_del_threads_per_job
Available in cmxcleanse.properties.
Currently applicable only for DB2.
For a larger data set, you can make deletes from the STRP table faster by increasing this thread count based on the CPU of your Process Server.
Tokenize: STRP table management
Default is 2500.
cmx.server.batch.sql.block_size
This property is found in the cmxcleanse.properties file.
Currently applicable only for DB2.
For a larger data set, you can make faster inserts or deletes on the STRP table by increasing this block size. Each thread inserts or deletes as many records as specified in this block size.
For example, for an insert of 50,000 records with 10 threads and a block size of 2,500, each thread runs a SQL batch of 2,500 records roughly twice.
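The arithmetic behind this example can be written out as a short calculation. The sketch assumes that SQL batches are spread evenly across the threads, which is why the table describes the result as approximate. The function name is hypothetical.

```python
import math


def sql_batches_per_thread(total_records: int, threads: int, block_size: int) -> int:
    """Estimate how many SQL batches each thread runs.

    Assumes batches are spread evenly across threads; the real scheduling
    inside the MDM Hub may differ.
    """
    if threads < 1 or block_size < 1:
        raise ValueError("threads and block_size must be positive")
    total_batches = math.ceil(total_records / block_size)
    return math.ceil(total_batches / threads)


if __name__ == "__main__":
    # 50,000 records, 10 threads, block size 2,500: 20 batches, 2 per thread.
    print(sql_batches_per_thread(50_000, 10, 2_500))
```

Doubling the block size to 5,000 in the same scenario would leave each thread with a single, larger SQL batch.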
Stage: Threads per job
See 'Cleanse Thread Count' attribute described earlier.
See 'Cleanse Thread Count' attribute described earlier.
Stage: Cleanse Minimum Distribution
Default is 1000.
cmx.server.cleanse.min_size_for_distribution
This property is found in the cmxcleanse.properties file.
The MDM Hub distributes the cleanse job across different cleanse servers only if the number of records is higher than this minimum size.
When the MDM Hub distributes the load, each slave Process Server uses the Cleanse Thread Count as its number of worker threads.
Stage: Stage JDBC Loader
Default is false.
Usually, file writing is faster than direct database writing.
cmx.server.java_jdbc_loader
Applicable for Oracle and DB2.
This property is found in the cmxcleanse.properties file.
If true, DB2 and Oracle use direct database connections during the stage job instead of the DB2 file loader or Oracle SQL Loader.
If file writing causes a performance issue, you can change this value to true so that data is written directly to the database every time instead of through the file loader. Note that the file loader is generally faster than a direct database write. Choose the option that suits your environment.
Match: Threads per job
See 'Cleanse Thread Count' attribute described earlier.
See 'Cleanse Thread Count' attribute described earlier.
Match: Match Distribution Flag
Set this flag to 1 if the MDM Hub must distribute the match job load across different cleanse servers.
cmx.server.match.distributed_match
This property is found in the cmxcleanse.properties file.
The MDM Hub distributes the match job across different cleanse servers only if this value is set to 1.
When the MDM Hub distributes the load, each slave Process Server uses the Cleanse Thread Count as its number of worker threads.
Match: Match File Loader Option
Default is true.
Usually, file writing is faster than direct database writing.
cmx.server.match.file_load
Applicable for Oracle and DB2.
This property is found in the cmxcleanse.properties file.
If true, the DB2 file loader or Oracle SQL Loader is used to load the records during the match job.
If file writing causes a performance issue, you can change this value to false so that data is written directly to the database every time instead of through the file loader. Generally, the file loader is faster than a direct database write. Choose the option that suits your environment.
Match: Match Loader Batch Size
Default is 250.
cmx.server.match.loader_batch_size
This property is found in the cmxcleanse.properties file.
Applicable if JDBC load is used in match processing instead of the file loader option.
Maximum number of records to be sent for match in each worker thread.
Increasing this value can improve performance, depending on how powerful the application servers and database servers are.
Match: Match Elapsed Time
Default is 20 (minutes).
Hub Console > Base Object > Max Elapsed Match Minutes.
The execution timeout, in minutes, when executing a match rule. If this time is reached, the match process exits. Increase this value only if the match rule and the data are very complex. Generally, rules should complete within 20 minutes.
Match: Match Batch Size
Default is 20000000.
Hub Console > Base Object > Match/Merge Setup > Number of rows per match job batch cycle.
Maximum number of records that the MDM Hub processes for matching. This number affects the duration of the match process.
Also, the lower the match batch size, the more times you have to run the match process.
When you run large Match jobs with large match batch sizes and the application server or the database fails, you must re-run the entire batch.
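The trade-off between match batch size and the number of match runs can be estimated with a simple calculation. The sketch assumes that each run of the match process handles at most one batch of records, and the record count in the example is hypothetical.

```python
import math


def match_runs_needed(records_to_match: int, match_batch_size: int) -> int:
    """Estimate how many times the match process must run.

    Assumes each run processes at most match_batch_size records.
    """
    if match_batch_size < 1:
        raise ValueError("match_batch_size must be positive")
    return math.ceil(records_to_match / match_batch_size)


if __name__ == "__main__":
    # Hypothetical volume: 50 million records with the default batch size.
    print(match_runs_needed(50_000_000, 20_000_000))  # 3
```

A smaller batch size increases the number of runs but limits how much work you must repeat if a run fails.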
Match: Maximum records per ranger node
Default is 5000.
max_records_per_ranger_node
This property is found in the cmxcleanse.properties file.
Number of records per match ranger node (limits memory use). Ranger is an internal component used within the match process where sorting and merging operations are performed based on this maximum records attribute.
You can optimize this value to get better performance based on the memory available in your application server.
Initially Index Smart Search Data: Block Size
10000.
Default is 250.
cmx.server.batch.smartsearch.initial.block_size
Available in cmxserver.properties.
Maximum number of records that the "Initially Index Smart Search Data" batch job can process in each block. This property does not apply to regular indexing outside this specific batch job.
When you index a large data set, you can set the value to 10000.
This property is available only from MDM 10.0 HotFix 2.
Initially Index Smart Search Data: Smart search threads
Default is 1.
Same property, re-used from LOAD Job.
cmx.server.batch.threads_per_job
Available in cmxserver.properties.
Maximum number of threads distributed across different Process Servers to process the "Initially Index Smart Search Data" batch job. You can increase this value to improve the performance of this batch job. This property does not apply to regular indexing outside this specific batch job.
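To review which of the cmxserver.properties settings in this table have been changed from their defaults, you can compare the file contents against the default values listed above. The following is a minimal sketch: the parser handles only simple key=value lines and comments, not the full Java properties syntax, and the sample text is a hypothetical excerpt rather than a real configuration.

```python
# Defaults as listed in this table for properties in cmxserver.properties.
DOCUMENTED_DEFAULTS = {
    "cmx.server.batch.load.analyze_threshold_rate": "10",
    "cmx.server.automerge.threads_per_job": "1",
    "cmx.server.automerge.block_size": "250",
    "cmx.server.batch.threads_per_job": "1",
    "cmx.server.batch.load.block_size": "250",
    "cmx.server.batch.recalculate.block_size": "250",
    "cmx.server.batch.batchunmerge.block_size": "250",
    "cmx.server.batch.delete.block_size": "250",
    "cmx.server.batch.smartsearch.initial.block_size": "250",
}


def parse_properties(text: str) -> dict:
    """Parse simple key=value lines, skipping blank lines and comments."""
    props = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith(("#", "!")):
            continue
        key, separator, value = line.partition("=")
        if separator:
            props[key.strip()] = value.strip()
    return props


def non_default_settings(props: dict) -> dict:
    """Return the documented properties whose value differs from the default."""
    return {
        key: value
        for key, value in props.items()
        if key in DOCUMENTED_DEFAULTS and value != DOCUMENTED_DEFAULTS[key]
    }


# Hypothetical excerpt, not taken from a real installation.
SAMPLE = """
# batch tuning
cmx.server.batch.threads_per_job=20
cmx.server.batch.load.block_size=250
cmx.server.automerge.block_size=500
"""

if __name__ == "__main__":
    print(non_default_settings(parse_properties(SAMPLE)))
```

For the sample text, the result lists cmx.server.batch.threads_per_job and cmx.server.automerge.block_size, because only those two values differ from the defaults in the table.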
