Article
· Jul 22, 2016 16m read

Using Regular Expressions in Caché

1. About this article

Just like Caché pattern matching, regular expressions can be used in Caché to identify patterns in text data, only with much greater expressive power. This article provides a brief introduction to regular expressions and what you can do with them in Caché. The information provided here is based on various sources, most notably the book “Mastering Regular Expressions” by Jeffrey Friedl and, of course, the Caché online documentation. The article is not intended to discuss all the possibilities and details of regular expressions. Please refer to the information sources listed in chapter 5 if you would like to learn more. If you prefer to read offline, you can also download the PDF version of this article.

Text processing using patterns can sometimes become complex. When dealing with regular expressions, we typically have several kinds of entities: the text we are searching for patterns, the pattern itself (the regular expression) and the matches (the parts of the text that match the pattern). To make it easy to distinguish between these entities, the following conventions are used throughout this document:

Text samples are printed in a monospace typeface on separate lines, without additional quotes:

This is a "text string" in which we want to find "something".

Where they could otherwise be mistaken for ordinary text, regular expressions within the body text are shown with a gray background, as in this example: \".*?\".

Matches are highlighted in different colors when needed:

This is a "text string" in which we want to find "something".

Larger code samples are printed in boxes like in the following example:

set t="This is a ""text string"" in which we want to find ""something""."
set r="\"".*?\"""
w $locate(t,r,,,tMatch)
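
As a hedged sketch of how the three entities (text, pattern and matches) fit together in code, the following loop reuses the text t and the pattern r from the sample above and lists every match instead of only the first one. It assumes the five-argument form of $locate, in which the fourth argument receives the position just past the match and the fifth receives the matched substring. The variable names pos and tEnd are illustrative and not part of the original sample. The lines are indented because they are meant to run inside a routine or method rather than at the terminal prompt.

```objectscript
    // The text we search in (doubled quotes are literal quotes in ObjectScript)
    set t="This is a ""text string"" in which we want to find ""something""."
    // The pattern: a quote, as few characters as possible, then a quote
    set r="\"".*?\"""
    // pos: where the next search starts (illustrative name)
    set pos=1
    for {
        // Returns the start of the next match, or 0 if there is none;
        // tEnd is set to the position after the match, tMatch to the match itself
        set pos=$locate(t,r,pos,tEnd,tMatch)
        quit:pos=0
        write "Match at position ",pos,": ",tMatch,!
        // Continue searching right after the current match
        set pos=tEnd
    }
```

Because this pattern always consumes at least two characters, tEnd always moves past pos and the loop terminates. A pattern that can match the empty string would need an extra guard.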

2. Some history (and some trivia)

In the early 1940s, neurophysiologists developed models of the human nervous system. Some years later, the mathematician Stephen Kleene described these models with an algebra he called “regular sets”. The notation for this algebra was named “regular expressions”.

In 1965, regular expressions were mentioned for the first time in the context of computers. They started to spread with qed, an editor that became the basis for the UNIX editor ed. The ed editor provided a command sequence g/regular expression/p (global, regular expression, print) that searched all text lines for matches of the regular expression and printed the results. This command sequence eventually became the stand-alone UNIX command-line program “grep”.

Today, various implementations of regular expressions (RegEx) exist for many programming languages (see section 3.3).