Mass Ingestion Guide

10.2.1


Regular Expressions

A regular expression describes a range or pattern of values.

You can use a regular expression to specify the columns that you want to parameterize in a mass ingestion specification. Use a regular expression when the columns in different source tables have varying names but contain the same information. If you choose to replace columns, you also use a regular expression to specify the pattern in the replace criteria.

For example, you might want to drop the columns that contain Social Security numbers. All of the column names contain SSN, but the column names have different prefixes depending on the source table where a column appears. To specify all variations in the column names, you can use a regular expression such as .*SSN.
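As a rough illustration of the SSN example, the following Python sketch applies the pattern .*SSN to a list of column names. The column names are hypothetical, and Python's re module is used only as a stand-in: whether Mass Ingestion requires the whole column name to match, as fullmatch does here, is an assumption.

```python
import re

# Hypothetical column names drawn from different source tables.
columns = ["CUST_SSN", "EMP_SSN", "SSN", "FIRST_NAME", "SSN_HASH"]

# The pattern from the text: any prefix (possibly empty) followed by SSN.
pattern = re.compile(r".*SSN")

# Assumption: the whole column name must match, so fullmatch is used.
matched = [name for name in columns if pattern.fullmatch(name)]

print(matched)  # ['CUST_SSN', 'EMP_SSN', 'SSN']
```

Under this assumption, SSN_HASH is not selected because the name does not end with SSN. A pattern such as .*SSN.* would be needed to cover it.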

The following table describes the metacharacters that you can use in a regular expression:

Metacharacter	Description
.	Matches any single character.
[ ]	Indicates a character class. Matches any character inside the brackets. For example, [abc] matches “a,” “b,” and “c.”
^	If this metacharacter occurs at the start of a character class, it negates the character class. A negated character class matches any character except those inside the brackets. For example, [^abc] matches all characters except “a,” “b,” and “c.” If this metacharacter occurs at the beginning of the regular expression, it matches the beginning of the input. For example, ^[abc] matches the input that begins with “a,” “b,” or “c.”
-	Indicates a range of characters in a character class. For example, [0-9] matches any of the digits “0” through “9.”
?	Indicates that the preceding expression is optional. It matches the preceding expression zero or one time. For example, [0-9][0-9]? matches “2” and “12.”
+	Indicates that the preceding expression matches one or more times. For example, [0-9]+ matches “1,” “13,” “666,” and similar combinations.
*	Indicates that the preceding expression matches zero or more times. For example, <abc*> matches “<ab>,” “<abc>,” “<abcc>,” and similar inputs in which “c” occurs zero or more times.
??, +?, *?	Modified versions of ?, +, and *. These match as little as possible, unlike the unmodified versions, which match as much as possible. For example, given the input “<abc><def>,” <.*?> matches “<abc>” and <.*> matches “<abc><def>.”
( )	Grouping operator. For example, (\d+,)*\d+ matches a list of numbers separated by commas such as “1” or “1,23,456.”
{ }	Indicates a match group.
\	An escape character, which interprets the next metacharacter literally. For example, [0-9]+ matches one or more digits, but [0-9]\+ matches a digit followed by a plus character. Also used for abbreviations such as \a for any alphanumeric character. If \ is followed by a number n, it matches the nth match group, starting from 0. For example, <{.*?}>.*?</\0> matches “<head>Contents</head>”. In C++ string literals, two backslashes must be used: “\\+,” “\\a,” “<{.*?}>.*?</\\0>.”
$	At the end of a regular expression, this character matches the end of the input. For example, [0-9]$ matches a digit at the end of the input.
|	Alternation operator that separates two expressions, one of which matches. For example, T|the matches “The” or “the.”
!	Negation operator. The expression following ! does not match the input. For example, a!b matches “a” not followed by “b.”
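The greedy and lazy quantifiers, the optional quantifier, and the grouping operator in the table can be tried out with the short Python sketch below. Python's re module is an assumption made for illustration. It shares these particular metacharacters with the syntax described here, but it does not support the { } match group or ! negation forms, so those are left out.

```python
import re

text = "<abc><def>"

# Greedy: .* matches as much as possible.
greedy = re.match(r"<.*>", text).group(0)

# Lazy: .*? matches as little as possible.
lazy = re.match(r"<.*?>", text).group(0)

print(greedy)  # <abc><def>
print(lazy)    # <abc>

# Optional quantifier: one digit, optionally followed by a second digit.
one_or_two_digits = re.compile(r"[0-9][0-9]?")
optional_ok = all(one_or_two_digits.fullmatch(s) for s in ["2", "12"])

# Grouping: a comma-separated list of numbers.
number_list = re.compile(r"(\d+,)*\d+")
list_ok = all(number_list.fullmatch(s) for s in ["1", "1,23,456"])

print(optional_ok, list_ok)  # True True
```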

The following table describes the abbreviations that you can use in the regular expressions:

Abbreviation	Definition
\a	Any alphanumeric character, ([a-zA-Z0-9]).
\b	White space (blank), ([ \t]).
\c	Any alphabetic character, ([a-zA-Z]).
\d	Any decimal digit, ([0-9]).
\h	Any hexadecimal digit, ([0-9a-fA-F]).
\n	Newline, (\r|(\r?\n)).
\q	Quoted string, (\"[^\"]*\")|(\'[^\']*\').
\w	Simple word, ([a-zA-Z]+).
\z	Integer, ([0-9]+).
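Several of these abbreviations mean something different in other regular expression dialects. In Python's re module, for example, \a is the bell character, \b is a word boundary, and \w also matches digits and underscores. To test a pattern outside the product, one option is to expand the abbreviations into explicit character classes first. The sketch below does this for a subset of the table. The expand helper is hypothetical and is not part of Mass Ingestion.

```python
import re

# Expansions copied from the abbreviation table (a subset only).
ABBREVIATIONS = {
    "a": "[a-zA-Z0-9]",   # any alphanumeric character
    "c": "[a-zA-Z]",      # any alphabetic character
    "d": "[0-9]",         # any decimal digit
    "h": "[0-9a-fA-F]",   # any hexadecimal digit
    "w": "[a-zA-Z]+",     # simple word
}

def expand(pattern: str) -> str:
    """Replace known \\x abbreviations with explicit character classes.

    Escapes that are not in ABBREVIATIONS (for example \\. or \\+) are
    left unchanged.
    """
    def replace(match: re.Match) -> str:
        return ABBREVIATIONS.get(match.group(1), match.group(0))

    # Match a backslash followed by any single character.
    return re.sub(r"\\(.)", replace, pattern)

expanded = expand(r"\c\c\d+")
print(expanded)  # [a-zA-Z][a-zA-Z][0-9]+
print(bool(re.fullmatch(expanded, "AB123")))  # True
```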