Text processing functions make text data cleaner and semantically more consistent for
vector embedding by focusing on words that are informative to the meaning of the text and by
reducing variability to aid NLP. In RAG use cases, text processing ensures that text is
clean, consistent, and easily comparable to user queries.
Text processing functions can clean text by removing noise such as whitespace and
diacritics, and they can convert text to a standard format by lemmatizing words to their
base forms.
You can use the following text processing functions:
Cleanse text
Cleanses the text by removing redundant whitespace and sequences of dots and by
converting letters to lowercase.
Remove diacritics
Removes diacritics including accents and other marks that change a letter's
pronunciation. For example, "café" becomes "cafe".
Check spelling
Checks for spelling errors based on the context of the data and corrects
them.
Lemmatize
Converts words to their base form. For example, "better" becomes "good" and
"running" becomes "run".
Lemmatization preserves the semantic accuracy of the data, so it's useful for
sentiment analysis and machine translation.
Remove stop words
Removes common stop words like pronouns, articles, prepositions, and
conjunctions. For example, "This is a sample text" becomes "sample text".
Converting words to lowercase and removing stop words are simple and effective ways to
reduce data complexity, and they apply to most NLP tasks.
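
To make three of the functions above concrete, the following is a minimal Python sketch of cleansing, diacritic removal, and stop word removal using only the standard library. It is an illustration, not the product's implementation: the function names, the tiny STOP_WORDS set, and the choice to treat "sets of dots" as runs of two or more periods are all assumptions. Spell checking and lemmatization are omitted because they need a dictionary or a language model.

```python
import re
import unicodedata

# Hypothetical, deliberately small stop word list for illustration only.
# Real stop word lists are much longer and language-specific.
STOP_WORDS = {"a", "an", "the", "this", "that", "is", "are",
              "of", "in", "on", "and", "or", "to"}


def cleanse(text: str) -> str:
    """Collapse whitespace, drop runs of dots, and lowercase the text."""
    # Assumption: a "set of dots" is two or more consecutive periods.
    text = re.sub(r"\.{2,}", " ", text)
    # Collapse any run of whitespace into one space and trim the ends.
    text = re.sub(r"\s+", " ", text).strip()
    return text.lower()


def remove_diacritics(text: str) -> str:
    """Strip combining marks after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def remove_stop_words(text: str) -> str:
    """Drop whitespace-separated tokens found in STOP_WORDS."""
    return " ".join(w for w in text.split() if w.lower() not in STOP_WORDS)


def preprocess(text: str) -> str:
    """Apply the three steps in sequence."""
    return remove_stop_words(remove_diacritics(cleanse(text)))
```

For example, under these assumptions `remove_diacritics("café")` returns `"cafe"` and `remove_stop_words("This is a sample text")` returns `"sample text"`, matching the examples in the list. Splitting on whitespace is a simplification; punctuation attached to a word would keep it from matching the stop word list.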