Recommendations for the MDM Hub

You might be able to improve performance by changing how the MDM Hub operates within the environment.

MDM Hub Environment

The following table lists the recommendations related to the MDM Hub environment:
Parameter
Recommended Setting
Description
Obsolete Data Director applications and Operational Reference Store databases
Remove obsolete items.
Obsolete Data Director applications and Operational Reference Store schemas impact the performance of server startup, run-time memory, and Security Access Manager profile caching.
Order of authentication provider in Security Providers
Configure the security providers in order with the first provider being the provider that authenticates the heaviest user load.
Configuration > Security Providers > Authentication Providers.
The MDM Hub authenticates users based on the order of the configured security providers. If most users are authenticated by a custom security provider (if applicable), move that provider to the first position.
Each authentication request costs a few milliseconds. The User Profile Cache significantly reduces the number of authentication requests.

Application Server

The following table lists the recommendations for the application server configuration:
Parameter
Recommended Setting
Description
enable-http2
Disable
Applicable for JBoss. Disable http2 on the undertow listener in the standalone-full.xml file of JBoss. Default is true.
<http-listener name="default" socket-binding="http" redirect-socket="https" enable-http2="false"/>
See the Informatica Knowledge Base article number 618247 for further details.
Maximum thread count for the thread pool
300 or higher
For example, in the JBoss application, set the following property in the standalone-full.xml file:
<thread-pools> <thread-pool name="default"> <max-threads count="300"/> </thread-pool></thread-pools>
Maximum connections in HTTP connection pool
300 or higher
For example, in the JBoss application, set the following property in the standalone-full.xml file:
<connector name="http" protocol="HTTP/1.1" scheme="http" socket-binding="http" max-connections="300"/>
JDBC logging level
OFF
For example, in the JBoss application, set the following log level property in the standalone-full.xml file:
<subsystem xmlns="urn:jboss:domain:logging:1.2">
<logger category="com.microsoft.sqlserver.jdbc"> <level name="OFF"/></logger>
Transaction timeout
Greater than 3600 seconds.
Set the transaction timeout to at least 3600 seconds (1 hour).
For example, in the JBoss application, set the following property in the standalone-full.xml file:
<coordinator-environment default-timeout="3600"/>
Maximum beans instance pool size in JBoss
20 or higher.
The maximum bean instance pool size for the data source connection is based on the off-peak resource demand.
This setting controls the stateless session bean pool. Increase it to improve the concurrency and performance of the Cleanse and CleansePut APIs.
In JBoss, set the following property in the standalone-full.xml file:
<bean-instance-pools>
<strict-max-pool name="slsb-strict-max-pool" max-pool-size="20" instance-acquisition-timeout="5" instance-acquisition-timeout-unit="MINUTES"/>
<strict-max-pool name="mdb-strict-max-pool" max-pool-size="20" instance-acquisition-timeout="5" instance-acquisition-timeout-unit="MINUTES"/>
</bean-instance-pools>
Load jobs and user exits in WebSphere
Update the JAR files.
Due to a conflict between the IBM Java transaction JAR files and the MDM cleanse libraries, Infinispan is not initialized for some components.
See the Informatica Knowledge Base article number 567591 for further details.

Operational Reference Store

The following table lists the recommendations for the ORS configuration:
Parameter
Recommended Setting
Description
Production Mode
Enable this property in Production.
[Configuration > Database > Database Properties].
Enable this property to remove additional overhead of pre-scheduled daemons that refresh the metadata cache.
Batch API Interoperability
Enable if both real time and batches are used. You can also enable the setting for concurrent real time API performance improvement.
[Configuration > Database > Database Properties].
Enabling this configuration has an impact on performance. Enable it if batch jobs and real-time API calls are used together or if Data Director is used. If your application uses neither real-time API updates nor Data Director, do not enable Batch API Interoperability.
During Initial Data Load, disable this property for faster loading of data.
Auditing
Disable auditing completely.
Auditing introduces additional overhead, so disable it completely.
Write lock monitor Interval
cmx.server.writelock.monitor.interval=10
When more than one Hub Console uses the same ORS, a write lock on a Hub Server does not disable caching on the other Hub Servers. The unit is in seconds.
For more information, see the Multidomain MDM Configuration Guide.

Schema Design

The following table lists the recommendations for the schema design:
Parameter
Recommended Setting
Description
Child Base Objects
Avoid too many child base objects for a particular parent base object.
The performance of load, tokenize, and automerge batch jobs decreases as the number of child base objects for a base object increases.
Business Entities
Avoid creating cyclic or recursive relationships.
Ensure that you do not create cyclic relationships, such as A has a referenceOne link to B and B has a referenceOne link to A, to avoid recursive calls. Recursive calls affect performance when these objects are referenced.
Match columns
Avoid too many match columns.
The performance of tokenize and match jobs decreases with the increase in the number of match columns.
Lookup Indicator
Enable Lookup Indicator only for 'Lookup' tables and not for any other base objects unrelated to lookup.
Schema > [base object] > Advanced > Lookup indicator.
Enabling the lookup indicator for non-lookup base objects unnecessarily caches the base object data in memory. Doing so results in out-of-memory errors, slow Data Director performance, and a slower lookup cache refresh rate.
Lookup Display Name
Configure Lookup Display Name to be the same as the lookup column.
For high volume lookup tables:
If you set the lookup display name to any column other than the column on which the relationship is built, SIF PUT calls must send the lookup display name values in the SIF call. When data is inserted into the base object, the lookup value is validated by querying the lookup table. The order of lookup is predefined: the lookup display column value is checked first, followed by the actual column value. In high-volume lookup tables, this becomes an overhead.
History
Enable History if you want to retain historical data for the specific base object. Otherwise, disable it.
If you enable History for a base object, the MDM Hub additionally maintains history tables for base objects and for cross-reference tables. The MDM Hub already maintains some system history tables to provide detailed change-tracking options, including merge and unmerge history. The system history tables are always maintained.
Over time, history in the database keeps growing. Consider keeping only months, or at most a few years, of history in the system to preserve database access performance.
History
To avoid very large history tables that cause performance issues, you can partition the tables.
For more information, search the Informatica Knowledge Base for article number 306525.
Cross Reference Promotion History
Enable Cross Reference Promotion History if you want to retain historical data for the specific base object.
Schema > [base object] > Advanced > Enable History of Cross Reference Promotion.
Enabling history incurs a performance cost for both real-time and batch operations. Use the history option cautiously and only if required.
Trust
Configure trust only for required columns.
A higher number of trust columns and validation rules on a single base object incur higher overhead during the Load process and the Merge process.
If more trusted and validated columns are implemented on a particular base object, longer SQL statements (in terms of lines of code) are generated to update the _CTL control table and the _VCT validation control table.
Minimize the number of trust and validation columns to maintain good performance.
Case-insensitive Search
Enable case-insensitive search only for character columns, for example, VARCHAR2 columns.
Ensure that the search query includes only character columns, such as VARCHAR columns, and no columns of other data types.
Enabling case-insensitive search for non-character columns hinders performance. In Microsoft SQL Server, if the collation is case-insensitive, do not enable this property.
Message Trigger Setup
Avoid configuring multiple message triggers for different event types.
Do an in-depth analysis before configuring message triggers. There is a performance cost associated with them during the execution of load jobs.
Tune Message Trigger Query.
The best approach to tuning the query used in the package views is to use Explain Plan. Add custom indexes wherever required to avoid full table scans, and analyze the tables and schema on a regular basis. When you use Explain Plan, retrieve the plan by wrapping the package view query in an outer query whose WHERE clause filters on a specific rowid_object value.
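For example, the following sketch assumes an Oracle database; the package view name C_PARTY_ADDRESS_PKG and the rowid_object value are hypothetical placeholders:
-- Wrap the package view query in an outer query that filters on one rowid_object.
EXPLAIN PLAN FOR
SELECT *
FROM (SELECT * FROM C_PARTY_ADDRESS_PKG) PKG
WHERE PKG.ROWID_OBJECT = '1001';
-- Display the execution plan that the optimizer chose for the wrapped query.
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);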
For more information about message triggers, search the Informatica Knowledge Base for article number 142115.
Throughput can be greatly improved if you increase the Receive Batch Size and reduce the Message Check Interval.
The Message Queue Monitoring settings have a major impact on message posting throughput. Configure these settings from the Hub Console in the Master Reference Manager (MRM) Master Database (CMX_SYSTEM), in the Configuration section.
Avoid unnecessary column selection in message trigger.
Do not select "Trigger message if change on any column" if you do not need to monitor all the columns. Also, try to minimize the selection of columns.
'Read Database' Cleanse Function
Use the following recommended settings:
  • Use with caution.
  • Enable ‘cache’ if used.
The 'Read Database' cleanse function incurs a performance overhead compared to using a similar MDM Cleanse Function to perform the same function. The performance overhead is more pronounced on a high volume table. The overhead is caused by the creation of a new database connection and the corresponding transmit to, processing by, and receipt of the results from the database. These would otherwise be managed within the Process Server application layer.
If use of this function cannot be avoided, enable the caching behavior of the Read Database function if applicable. Pass a Boolean 'false' value to the 'clear cache' input field of the Read Database function. Doing so reduces the performance lag by enabling future operations to use the cached value rather than creating a new database connection on each access of the function.
Cleanse Functions
Do not make cleanse functions very complex.
The performance of batch jobs improves as the number and complexity of cleanse functions decrease.
Timeline (Versioning)
Enable dynamic timeline (versioning) only on entity base objects that strictly require it. The base object must be a regular base object and not a Hierarchy Manager relationship base object.
Versioning has a performance impact on the base object associated with it.
For Hierarchy Manager relationship base objects, versioning is always enabled and cannot be disabled.
To maintain the fastest performance possible, enable versioning only on those entity base objects (regular MDM Hub base objects) that strictly need it. The additional metadata and processing associated with versioning add a significant amount of complexity to any process that runs on a version-enabled base object. Enabling versioning on a base object therefore brings an additional performance cost to all processing performed on that base object.
For more information, search the Informatica Knowledge Base for article numbers 138458 and 140206.
State Management
Disable state management if you do not require it.
State Management carries an associated performance overhead.
If you use Data Director with workflows, you must enable State Management.
However, enabling History for State Management Promotion at the cross-reference level is optional.
Delta Detection
Enable it only on the minimum number of columns that strictly need it.
Delta Detection carries a sizable associated overhead on performance.
If a landing table has only new and updated records in every staging job, you can disable delta detection. If you want to enable Delta Detection, the least impactful approach is to use the last_update_date column. For each additional column you enable, analyze whether the involvement of that column is worth the associated performance overhead. Avoid blindly enabling Delta Detection on all the columns.
Cleanse Mappings
Minimize the complexity of Mappings.
Minimize the complexity of the mapping for better performance.
If you use a Cleanse List in a Cleanse mapping, use static data in the Cleanse List.
Consider using lookup tables only for dynamic data.
Validation Rules
Optimize Validation Rule SQL code.
The detection piece of each Validation Rule SQL runs against every record during the Load process to determine if it applies.
Poorly performing validation rule SQL affects the performance of every loaded record.
User Exits and External Calls
Optimize user exit or external call code for performance.
User exit or external call code influences performance if it is not optimized.
Applicable to Data Director, batch user exits, and external calls. See the best practices when you implement user exits and external calls.
Packages
Optimize the SQL code written in each MDM Hub query that is called from an MDM Hub package.
These MDM Hub packages are used in Data Director, SIF API calls, Data Manager, Merge Manager, and search operations.
If you do not tune these queries for performance, the result is an expensive operation whenever the package is called.
Custom Indexes
Use caution when adding custom indexes. Each index added has an associated cost. Ensure that the gain received outweighs the cost of each additional custom index.
Index management has an associated performance cost.
Perform the following steps to decide where a custom index is worth that cost:
  1. Get a log of the real queries run on the base object (content data) and base object shadow tables (content metadata) on a typical day. Ignore temporary T$% tables and system C_REPOS_% tables.
  2. Identify indexes which exist on these tables to avoid unnecessary overlap.
  3. Before adding any indexes, review a regular day of logs and take an inventory of:
    1. SIF API or BES call duration.
    2. Data Director
      process durations.
    3. Batch jobs:
      1. Duration of each batch job.
      2. Duration of each cycle within that batch job.
      3. Duration of longest running statements within a batch job.
  4. Consider the longest running process for potential benefit from a custom index.
  5. Consider adding indexes so the longest running SQL query or queries hit the new index in their execution plan. Avoid indexing fields which have many updates or inserts.
After you add each new custom index, return to Step 3 and assess whether there is still potential to improve performance by adding more custom indexes.
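As a sketch of steps 2 through 5 on an Oracle database, where the base object name C_CUSTOMER, the column SOURCE_SYSTEM_CODE, and the index name are hypothetical placeholders:
-- Step 2: list the indexes that already exist on the base object to avoid overlap.
SELECT INDEX_NAME, COLUMN_NAME, COLUMN_POSITION
FROM USER_IND_COLUMNS
WHERE TABLE_NAME = 'C_CUSTOMER'
ORDER BY INDEX_NAME, COLUMN_POSITION;
-- Step 5: if the longest running query filters on a column that no index covers,
-- add a custom index on that column and re-check the execution plan.
CREATE INDEX IX_C_CUSTOMER_SRC ON C_CUSTOMER (SOURCE_SYSTEM_CODE);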
Parallel Degree on Base Object
Between one and the number of CPU cores on the database machine.
Parallel degree is an advanced base object property. For optimum performance of batch jobs, set a value between one and the number of processor cores on the database server machine.
For more information, search the Informatica Knowledge Base for article number 181313.

Match and Merge

The following table lists the recommendations for the match and merge configuration:
Parameter
Recommended Setting
Description
Match Path Filter
Filter on match path instead of at the match rule level.
If you need to exclude records from the match process, filter on the appropriate match path instead of at the match rule level.
When you filter at the match path level, the records are excluded from tokenization and therefore do not participate in the match.
Check for missing children
Use it with caution.
This match path property indicates whether parent records must be considered for matching based on the existence of child records.
If you need a fuzzy match on a base object, tokenization of the parent base object record must occur. The parent record is tokenized only if every child base object on which the check for missing children option is disabled has a related child record. If a parent base object record has a child base object with the check for missing children option disabled but no related child record, the parent record is not tokenized.
The MDM Hub performs an outer join between the parent and the child tables when the option to check for missing children is enabled. This option has an impact on the performance on each match path component on which the option is enabled. Therefore, when not needed, it is more efficient to disable this option.
Match Key
The tighter the key the better the performance.
The width of the match key determines the number of rows in the tokenization table (the number of tokenized records used to match each record) and the number of candidate records considered for each match. Usually, the standard key width is enough.
Search Level: Use the narrowest possible search level to generate acceptable matches. Usually, the typical search level is enough.
Match Rules: For each match rule, add one or more exact match columns to act as a filter and improve the performance of each rule.
Match Rule
Search Level: Narrow
Match or Search Strategy: Filtered
Search Level. To generate acceptable matches, use the narrowest possible search level. Run a sample test with the required level to avoid under-matching. Based on your requirements, you can fine-tune the settings.
For each match rule, add one or more exact match columns that act as a filter. This might improve the performance.
Filtered. Use a filtered match rule instead of an exact match rule to make use of fuzzy filtering on exact columns. The filtered rule enables bulk processing in the application server rather than the database server. For information about implementing the match rule, see the Informatica Knowledge Base article number 121044.
Dynamic Match Analysis Threshold (DMAT)
Change if required.
Default is 0.
To set the DMAT, identify any large ranges that cause bottlenecks. After you identify the ranges, use the "Comparison Max Range" count from the logs.
Proper analysis is required to change this value. See the Informatica Knowledge Base article number 90740.
STRIP_CTAS_DELETE_RATIO
Change if required.
Default is 10%.
C_REPOS_TABLE [base object] > STRIP_CTAS_DELETE_RATIO.
Proper analysis is required if you decide to change the default 10% value. See the Informatica Knowledge Base article number 145312.
If the volume of data change in the _STRP table is more than this percentage, tokenization uses 'Create Table … As Select …' statements to re-create the _STRP table with the needed changes, rather than delete and insert operations, to arrive at the same result in less time. The optimal value might vary for each implementation and depends on the size of the table and the percentage of records that must be updated.
STRIP_CTAS_DELETE_UPPER_LIMIT
10% of the BO records.
You must set the value above the peak daily incremental delta, such as 1 million records, or more.
COMPLETE_STRIP_RATIO
Change if required.
Default is 60%.
[Model > Schema > [base object] > Advanced > Complete Tokenize Ratio].
Proper analysis is required if you decide to change the default 60% value.
If the volume of data change in the _STRP table is more than this percentage, the tokenization process drops and re-creates the entire _STRP table rather than retokenizing only the updated records.
AUTOMERGE_CTAS_RATIO
Change if required.
Default is -1.
C_REPOS_TABLE [base object] > AUTOMERGE_CTAS_RATIO.
Proper analysis is required before you decide to change the default -1 value.
If the volume of records queued for merge is greater than this percentage, automerge uses the 'Create Table As Select' option for faster merging instead of the regular delete and insert operations.
The default value is -1, which indicates that the Create Table As Select (CTAS) feature is off for automerge.
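As an illustration only, the following Oracle SQL sketch reviews and adjusts both CTAS ratios discussed above. The base object name C_CUSTOMER, the new values, and the assumption that the base object row is identified by the TABLE_NAME column are placeholders; proper analysis is still required before changing either value:
-- Review the current CTAS thresholds for one base object.
SELECT TABLE_NAME, STRIP_CTAS_DELETE_RATIO, AUTOMERGE_CTAS_RATIO
FROM C_REPOS_TABLE
WHERE TABLE_NAME = 'C_CUSTOMER';
-- Illustrative change: raise the strip ratio to 20% and enable CTAS for automerge at 50%.
UPDATE C_REPOS_TABLE
SET STRIP_CTAS_DELETE_RATIO = 20,
    AUTOMERGE_CTAS_RATIO = 50
WHERE TABLE_NAME = 'C_CUSTOMER';
COMMIT;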
For more information about tuning match and merge, search the Informatica Knowledge Base for article number 357214.

Services Integration Framework (SIF) APIs

The following table lists the recommendations for the SIF API:
Parameter
Recommended Setting
Description
Protocol
Use EJB protocol over HTTP or SOAP.
The EJB protocol is faster and more reliable.
See the Informatica Knowledge Base article number 138526 for further details.
Disable Paging
Set to True if not required.
SIF Request parameter: disablePaging
If paging is not required (if the results will not return many records), set this flag to true. If it is set to false (the default value), each SIF call incurs two database calls.
Return Total
Do not set any value.
SIF request parameter: returnTotal
If you do not require the total count, do not set this flag. If it is set to true, each SIF call incurs two database calls.

Hub Server Properties

The following table lists the recommendations for the Hub Server properties:
Parameter
Recommended Setting
Description
User Profile Cache
True
Default is true.
cmx.server.provider.userprofile.cacheable. This property is found in the cmxserver.properties file.
When you set this flag to true, once a user profile is authenticated, it is cached. Set the flag to true to suppress the need for explicit user authentication requests for every SIF or BES call.
User Profile Life Span
60000
Default is 60000.
cmx.server.provider.userprofile.lifespan. This property is found in the cmxserver.properties file.
The time, in milliseconds, to retain the cached user profile before refreshing it. A few minutes is adequate; avoid setting this to longer durations.
Security Access Manager (SAM) cache refresh interval
5 clock ticks
Default is 5 clock ticks at a rate of 60,000 milliseconds for 1 clock tick, which is equivalent to 5 minutes.
cmx.server.sam.cache.resources.refresh_interval. This property is found in the cmxserver.properties file.
Refreshes the SAM cache after the specified clock ticks. To specify the number of milliseconds for 1 clock tick, use the cmx.server.clock.tick_interval property.
For more information, see the Multidomain MDM Configuration Guide.
Cleanse Poller
30 (seconds)
Default is 30.
cmx.server.poller.monitor.interval. This property is found in the cmxserver.properties file.
At the configured interval, in seconds, the Hub Server polls the availability of each Process Server and flags the status of the Process Server as valid or invalid.
For more information, search the Informatica Knowledge Base for article number 151925.
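Taken together, these Hub Server properties live in the cmxserver.properties file. The following sketch shows the default or recommended values listed in this table:
# Cache authenticated user profiles and retain them for one minute.
cmx.server.provider.userprofile.cacheable=true
cmx.server.provider.userprofile.lifespan=60000
# Refresh the SAM cache every 5 clock ticks; one clock tick is 60,000 milliseconds.
cmx.server.sam.cache.resources.refresh_interval=5
cmx.server.clock.tick_interval=60000
# Poll Process Server availability every 30 seconds.
cmx.server.poller.monitor.interval=30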

Infinispan Metadata Caching

The following table lists the recommendations for Infinispan metadata caching parameters, which are available in the <MDM Hub installation directory>/hub/server/resources/infinispanConfig.xml file:
Parameter
Recommended Setting
Description
expiration lifespan
86400000 (milliseconds)
Maximum lifespan of a cache entry in milliseconds. When a cache entry exceeds its lifespan, the entry expires within the cluster.
You can increase the lifespan for the following caches: DISABLE_WHEN_LOCK, DATA_OBJECTS, and REPOS_OBJECTS. For example, you can increase a lifespan from one hour (3600000) to one day (86400000).
Each cache has its own default value for this parameter. To find the default values, open the infinispanConfig.xml file.
expiration interval
300000 (milliseconds)
Maximum interval for checking the lifespan.
For example, you can increase an interval from five seconds (5000) to five minutes (300000).
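For example, inside the existing cache definition for one of these caches in infinispanConfig.xml, the expiration element might look like the following sketch; only the attribute values change, and the surrounding cache element stays as the file already defines it:
<!-- Expire cached entries after one day and sweep for expired entries every five minutes. -->
<expiration lifespan="86400000" interval="300000"/>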
For more information about Infinispan parameters, search the Informatica Knowledge Base for article number 509572.

Logging

The following table lists the recommendations for the logging:
Parameter
Recommended Setting
Description
Hub Server Logging
Set to ERROR mode.
Change the log4j.xml file to use the ERROR mode. If clustered, update the log4j.xml file on all nodes under <INFAHOME>/hub/server/conf/log4j.xml.
After the log4j configuration file is updated, you can see the changes reflected in the log within a few minutes.
Process Server Logging
Set to ERROR mode.
Change the log4j.xml file to use the ERROR mode. If clustered, update the log4j.xml file on all nodes under <INFAHOME>/hub/cleanse/conf/log4j.xml.
After the log4j configuration file is updated, you can see the changes reflected in the log within a few minutes.
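A minimal sketch of the log4j.xml change for either server; the appender name FILE is a placeholder for whatever appender the file already defines:
<!-- Log only errors from the Hub Server or Process Server components. -->
<root>
<priority value="ERROR"/>
<appender-ref ref="FILE"/>
</root>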
For more information about logging, search the Informatica Knowledge Base for article number 120879.

Search

The following table lists the recommendations for search:
Parameter
Recommended Setting
Description
Limit Searchable Fields
Configure only the fields that you need.
Do not index unnecessary searchable fields. Multiple searchable fields increase the indexing and searching time, so configure only the required fields as searchable fields. Also keep only the required fields and facets. Configure facets only on fields with low entropy. Also limit the number of fuzzy fields.

Task Assignment

The following table lists the recommendations for task assignments:
Parameter
Recommended Setting
Description
task.creation.batch.size
Default is 1000.
In MDM 10.0 and earlier, the default value is 50.
Available in cmxserver.properties.
Sets the maximum number of records to process for each match table.
If more tasks need to be assigned at run time, you can increase this value.
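For example, the corresponding entry in cmxserver.properties with the default value:
# Maximum number of records processed per match table when tasks are created.
task.creation.batch.size=1000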

Operational Reference Store and SIF APIs

The following table lists the recommendations for ORS-specific SIF API generation:
Parameter
Recommended Setting
Description
Required objects
As needed.
ORS-specific API generation time depends on the number of objects selected. Add only the required objects to improve performance during the SIF API generation.
SIF API (Java Doc Generation) Heap Size
Default is 256m.
sif.jvm.heap.size. Available in cmxserver.properties.
Sets the heap size used during the creation of the Javadoc. Because Javadoc creation uses a lot of heap memory, you can increase this value if required. This heap size setting is not connected to the heap size of the MDM application, which is set during server startup.
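For example, in cmxserver.properties; the value 512m is only an illustration of a setting higher than the 256m default:
# Heap size used when generating the ORS-specific SIF API Javadoc.
sif.jvm.heap.size=512m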

Informatica Data Quality

The following table lists the recommendations for Informatica Data Quality cleansing:
Parameter
Recommended Setting
Description
Batch Size
Default is 50.
cmx.server.cleanse.number_of_recs_batch. Available in cmxcleanse.properties.
If the workflow supports mini-batches, set this value based on the number of records to be cleansed at a time. If this attribute is set, MDM automatically groups the records for cleansing.
This property can also be used with other cleanse engines if they support mini-batches.
For more information, search the Informatica Knowledge Base for article number 153419.
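For example, the corresponding entry in cmxcleanse.properties, shown with its default value; raise it only if the cleanse workflow supports mini-batches:
# Number of records grouped into a single cleanse call when mini-batch is supported.
cmx.server.cleanse.number_of_recs_batch=50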

Initial Data Load

The following table lists the recommendations for the initial data load for database or environment settings:
Parameter
Recommended Setting
Description
Distributed Matching
1
Set the cmx.server.match.distributed_match=1 parameter in the cmxcleanse.properties file. This often improves performance by spreading the match load across multiple servers. Ensure that you configure the Cleanse Match Servers that you expect to share the match load to perform the match process in batch mode.
UNDO Tablespace
Sufficient UNDO tablespace must be available.
The Match process utilizes a lot of UNDO tablespace. Ensure that sufficient UNDO tablespace is available in the database. You must adjust your batch match size accordingly.
Database Archive Log
Off
During the initial data load, switch off or disable archive logging to improve performance.
Batch API Interoperability
Disable
During the initial data load, disable this setting.
Application Server Performance
See the related sections in this guide, such as JVM settings, thread counts, and block size.
siperian.performance in log4j.xml
Off
Switch OFF the siperian.performance parameter in the log4j.xml file. For better performance, also ensure that the default async appenders are used at all levels.
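A sketch of the corresponding log4j.xml entry, assuming the file uses the standard log4j category element:
<!-- Turn off performance logging during the initial data load. -->
<category name="siperian.performance">
<priority value="OFF"/>
</category>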
Match Setting
Match Only Previous Rowid Objects. Enable it during the initial load match process for additional performance gain. Ensure that you disable this property for the incremental match process.
Match Only Once. Enable it during the initial match job. You must assess the trade-off between performance and match quantity based on the characteristics of your data and your particular match requirements.
Match Rule Settings
Default settings
Use the default settings of the match configuration. If you want to modify the settings, analyze the requirement and perform extensive testing.
Connection Pool Size
Ensure that the connection pool is sized sufficiently. See the connection pool recommendations in this guide.
Custom Indexes
Disable
Disable the custom indexes.
Constraints
Disable
To disable the constraints on NI indexes on the base object and all the indexes on XREF, you must choose the option Allow constraints to be disabled.
History
Disable
Disable history.
Analyze Schema
Prior to the initial data load, analyze the database schema.
Production Mode
Enable
Enable the production mode flag.
Elasticsearch Server
Disable
During the initial data load with high volume of data, you must disable the Elasticsearch server.
During incremental load, you must enable the Elasticsearch server.
For further information, see the Informatica Knowledge Base article numbers 158622 and 158822.
