Application and Database Design Guide

Transliteration and Data Matching

Transliteration can assist with data retrieval and data matching of identity data stored in foreign scripts. However, there are good and bad techniques.
Do not expect to achieve reliability and performance by transliterating multiple foreign scripts into a common character set and applying a localized matching algorithm to the result. There is too much conflict and compromise in the rules. Search and matching on data from different countries and languages should be handled by algorithms tuned for each country/language.
Even a technique that attempts to detect the source language of transliterated data in order to choose strategies and algorithms has inherent problems. How does one safely choose the source language for the name "Mohammed Smith" or "Charles Wong"?
If original script and/or informally transliterated data is available, do not discard it; such data provides an additional source of information useful for search and matching.
The real value of transliteration and transliterated data is realized when they are used in conjunction with the source language. A solution that indexes, searches, and matches on all available forms uses this inherent redundancy to multiply the opportunities for success.
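
The idea of indexing and searching on all available forms can be illustrated with a minimal sketch. It is not the product's implementation: the class `MultiFormIndex`, the helper `normalize`, and the sample records are hypothetical, and exact key lookup stands in for the tuned, per-language matching algorithms a real solution would use.

```python
import unicodedata
from collections import defaultdict


def normalize(text: str) -> str:
    """Script-agnostic cleanup only: Unicode NFKC, case folding, whitespace collapse."""
    text = unicodedata.normalize("NFKC", text)
    return " ".join(text.casefold().split())


class MultiFormIndex:
    """Hypothetical index that stores every available form of a name."""

    def __init__(self):
        # normalized form -> set of record ids stored under that form
        self._index = defaultdict(set)

    def add(self, record_id, forms):
        # Keep the original script and every transliteration; discard nothing.
        for form in forms:
            if form:
                self._index[normalize(form)].add(record_id)

    def search(self, *query_forms):
        # A hit on ANY supplied form counts, so each extra form is an extra chance.
        hits = set()
        for form in query_forms:
            if form:
                hits |= self._index.get(normalize(form), set())
        return hits


# Illustrative records: original script, formal and informal transliterations.
index = MultiFormIndex()
index.add(1, ["Мухаммед Смит", "Mukhammed Smit", "Mohammed Smith"])
index.add(2, ["王查尔斯", "Wang Chaersi", "Charles Wong"])
```

In this sketch a query that supplies only an unseen transliteration, such as "Charles Wang", finds nothing, while the same query accompanied by the original-script form still reaches record 2. That is the redundancy at work: each stored form is an independent path to the record.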
