Appendix A: Regular expressions
MetaEdit+ uses regular expressions to check property values
and symbol element conditions. Regular expressions (or regex for short) provide
a way to define patterns of characters in formal fashion so that strings can be
matched against them. Because regex is a general technique rather than something
specific to MetaEdit+, and is familiar to most software developers, it is not
covered in full detail in the MetaEdit+ manuals. For basic information and a
tutorial about regex, we recommend
www.regular-expressions.info.
MetaEdit+ uses the open source Regex framework, originally written
by Vassili Bykov. Below is its documentation. Comparisons are presented as in
MERL conditions: a literal string is compared with the =/ operator to a regular
expression, which is also written as a literal string.
The simplest regular expression is a single character. It
matches exactly that character. A sequence of characters matches a string with
exactly the same sequence of characters:
'a' =/ 'a' -- true
'foobar' =/ 'foobar' -- true
'blorple' =/ 'foobar' -- false
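For readers who want to try these comparisons outside MetaEdit+, they can be approximated in Python. This is only a sketch: the helper name `matches` is hypothetical, Python's `re` module is a different engine from the Regex framework used by MetaEdit+, and the code assumes, as the examples in this appendix suggest, that `=/` succeeds only when the whole string matches the pattern.

```python
import re

def matches(text: str, pattern: str) -> bool:
    """Rough stand-in for  text =/ pattern  (assumes a whole-string match)."""
    return re.fullmatch(pattern, text) is not None

# The three comparisons from the text above.
results = [
    matches('a', 'a'),
    matches('foobar', 'foobar'),
    matches('blorple', 'foobar'),
]
print(results)  # [True, True, False]
```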
The above paragraph
introduced a primitive regular expression (a character), and an operator
(sequencing). Operators are applied to regular expressions to produce more
complex regular expressions. Sequencing (placing expressions one after another)
as an operator is, in a certain sense, 'invisible'--yet it is arguably the most
common.
A more 'visible' operator is Kleene closure, more often
simply referred to as 'a star'. A regular expression followed by an asterisk
matches any number (including 0) of matches of the original expression. For
example:
'ab' =/ 'a*b' -- true
'aaaaab' =/ 'a*b' -- true
'b' =/ 'a*b' -- true
'aac' =/ 'a*b' -- false: b does not match
A star's
precedence is higher than that of sequencing. A star applies to the shortest
possible subexpression that precedes it. For example, 'ab*' means 'a followed by
zero or more occurrences of b', not 'zero or more occurrences of ab':
'abbb' =/ 'ab*' -- true
'abab' =/ 'ab*' -- false
To actually make a regex
matching 'zero or more occurrences of ab', 'ab' is enclosed in
parentheses:
'abab' =/ '(ab)*' -- true
'abcab' =/ '(ab)*' -- false: c spoils the fun
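The precedence rule can be checked with a short Python sketch. The helper `whole` is a hypothetical name, and Python's `re` module is used here only as a stand-in for the MetaEdit+ engine, so details may differ.

```python
import re

def whole(text, pattern):
    # Hypothetical helper: True when the entire text matches the pattern.
    return re.fullmatch(pattern, text) is not None

# Without parentheses the star applies to 'b' only.
star_on_b = [whole('abbb', 'ab*'), whole('abab', 'ab*')]

# Parentheses make the star apply to the whole sequence 'ab'.
star_on_group = [whole('abab', '(ab)*'), whole('abcab', '(ab)*')]

print(star_on_b, star_on_group)  # [True, False] [True, False]
```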
A regex
that is enclosed in parentheses is called a subexpression. By default,
subexpressions are 'capturing'; however, a subexpression can be marked as 'hidden'
to become non-capturing. This is done by adding '?:' immediately after the
opening parenthesis. Marking a subexpression as non-capturing has no effect on
the matching:
'abab' =/ '(?:ab)*' -- true
'abcab' =/ '(?:ab)*' -- false: c spoils the fun
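Python's `re` module happens to use the same '?:' notation, so the difference between capturing and hidden subexpressions can be illustrated there. This is a sketch under the assumption that the two engines behave alike on this point; the variable names are made up for the example.

```python
import re

capturing = re.compile('(ab)*')
hidden = re.compile('(?:ab)*')

# Both patterns accept and reject the same strings...
same_result = all(
    bool(capturing.fullmatch(s)) == bool(hidden.fullmatch(s))
    for s in ['', 'ab', 'abab', 'abcab']
)

# ...but only the first one records a group.
print(same_result, capturing.groups, hidden.groups)  # True 1 0
```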
Two
other operators similar to '*' are '+' and '?'. '+' (positive closure, or simply
'plus') matches one or more occurrences of the original expression. '?'
('optional') matches zero or one, but never more, occurrences.
'ac' =/ 'ab*c' -- true
'ac' =/ 'ab+c' -- false: need at least one b
'abbc' =/ 'ab+c' -- true
'abbc' =/ 'ab?c' -- false: too many b's
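The three closure operators can be compared side by side. The following Python sketch uses `re.fullmatch` as an approximation of `=/`; the sample strings and the `accepted` dictionary are chosen for illustration only.

```python
import re

samples = ['ac', 'abc', 'abbc']

# For each pattern, collect the samples it accepts as a whole string.
accepted = {
    pattern: [s for s in samples if re.fullmatch(pattern, s)]
    for pattern in ['ab*c', 'ab+c', 'ab?c']
}

for pattern, strings in accepted.items():
    print(pattern, strings)
# ab*c ['ac', 'abc', 'abbc']
# ab+c ['abc', 'abbc']
# ab?c ['ac', 'abc']
```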
Repetitions
can also be specified explicitly using numbers:
{3}    exactly three occurrences
{2,5}  two to five occurrences
{3,}   three or more occurrences
{,2}   zero, one or two occurrences
'abbbc' =/ 'ab{3}c' -- true
'abc' =/ 'ab{2,5}c' -- false: need at least 2 b’s
'abbbbbbc' =/ 'ab{2,5}c' -- false: may not have more than 5 b’s
'abbbc' =/ 'ab{2,5}c' -- true
'abbbbbbc' =/ 'ab{3,}c' -- true
'ac' =/ 'ab{,2}c' -- true
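To see which repetition counts each form allows, one can test strings with an increasing number of b's. The sketch below uses Python's `re` module, which accepts the same four notations; the helper `count_ok` and the range of 0 to 7 are assumptions made for the example.

```python
import re

def count_ok(pattern, n_b):
    # Build 'a', then n_b copies of 'b', then 'c', and test the whole string.
    return re.fullmatch(pattern, 'a' + 'b' * n_b + 'c') is not None

allowed = {
    pattern: [n for n in range(8) if count_ok(pattern, n)]
    for pattern in ['ab{3}c', 'ab{2,5}c', 'ab{3,}c', 'ab{,2}c']
}

for pattern, counts in allowed.items():
    print(pattern, counts)
# ab{3}c [3]
# ab{2,5}c [2, 3, 4, 5]
# ab{3,}c [3, 4, 5, 6, 7]
# ab{,2}c [0, 1, 2]
```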
As we have seen,
characters '*', '+', '?', '(', and ')' have special meanings in regular
expressions. If any of these is to be used literally, it should be escaped:
preceded with a backslash. (Thus, backslash is also a special character, and
needs to be escaped for a literal match.) The full set of characters that
require or may require backslash is:
^$\()*+.?[{|:
'ab*' =/ 'ab*' -- false: star in the right string is special
'ab*' =/ 'ab\*' -- true
'a\c' =/ 'a\\c' -- true
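The effect of escaping can be reproduced in Python, whose `re.escape` function adds the backslashes automatically. This is a sketch only: `re.escape` follows Python's own list of special characters, which may not coincide exactly with the set given above, and raw strings (`r'...'`) are a Python convention, not part of MERL.

```python
import re

literal = 'ab*'                 # text containing a special character
escaped = re.escape(literal)    # backslash-escapes the star

# Used as a pattern, the unescaped text treats the star as an operator.
unescaped_ok = re.fullmatch(literal, literal) is not None
# The escaped form matches the star literally.
escaped_ok = re.fullmatch(escaped, literal) is not None
# A literal backslash is matched by an escaped backslash.
backslash_ok = re.fullmatch(r'a\\c', r'a\c') is not None

print(escaped, unescaped_ok, escaped_ok, backslash_ok)
```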
The last operator is '|'
meaning 'or'. It is placed between two regular expressions, and the resulting
expression matches if one of the expressions matches. It has the lowest possible
precedence (lower than sequencing). For example, 'ab*|ba*' means 'a followed by
any number of b's, or b followed by any number of a's':
'abb' =/ 'ab*|ba*' -- true
'baa' =/ 'ab*|ba*' -- true
'baab' =/ 'ab*|ba*' -- false
A bit more complex example
is the following expression, matching the name of any of the Lisp-style 'car',
'cdr', 'caar', 'cadr', ... functions:
c(a|d)+r
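Both the precedence of '|' and the Lisp accessor pattern can be tried out in Python. The helper `whole` and the list of candidate names are hypothetical, and `re.fullmatch` is only an approximation of the `=/` comparison.

```python
import re

def whole(text, pattern):
    return re.fullmatch(pattern, text) is not None

# '|' has the lowest precedence: the alternatives are 'ab*' and 'ba*'.
alternation = {s: whole(s, 'ab*|ba*') for s in ['abb', 'baa', 'baab']}

# Lisp-style accessor names: 'c', one or more of 'a' or 'd', then 'r'.
names = ['car', 'cdr', 'caar', 'cadr', 'cddar', 'cr', 'cxr']
accessors = [n for n in names if whole(n, 'c(a|d)+r')]

print(alternation)  # {'abb': True, 'baa': True, 'baab': False}
print(accessors)    # ['car', 'cdr', 'caar', 'cadr', 'cddar']
```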
It is possible to write an expression
matching an empty string, for example: 'a|'. However, it is an error to apply
'*', '+', or '?' to such an expression: '(a|)*' is an invalid expression.
So far, we have used only characters as the 'smallest'
components of regular expressions. There are other, more 'interesting',
components.
A character set is a string of characters enclosed in
square brackets. It matches any single character that appears between the
brackets. For example, '[01]' matches either '0' or '1':
'0' =/ '[01]' -- true
'3' =/ '[01]' -- false
'11' =/ '[01]' -- false: a set matches only one char
Using
the plus operator, we can build the following binary number recognizer:
'10010100' =/ '[01]+' -- true
'10001210' =/ '[01]+' -- false
If the first character
after the opening bracket is '^', the set is inverted: it matches any single
character *not* appearing between the brackets:
'0' =/ '[^01]' -- false
'3' =/ '[^01]' -- true
For convenience, a set may
include ranges: pairs of characters separated with '-'. This is equivalent to
listing all characters between them: '[0-9]' is the same as
'[0123456789]'.
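Sets, inverted sets and ranges behave the same way in most regex engines, so they can be explored with a small Python sketch. The helper `whole` and the sample inputs are assumptions for illustration; Python's `re` module stands in for the MetaEdit+ engine.

```python
import re

def whole(text, pattern):
    return re.fullmatch(pattern, text) is not None

# Only strings made up entirely of 0s and 1s pass the recognizer.
binary = [s for s in ['10010100', '10001210', ''] if whole(s, '[01]+')]

# The inverted set accepts every character except '0' and '1'.
not_binary_digit = [c for c in '0123' if whole(c, '[^01]')]

# A range is shorthand for listing every character in it.
range_same = all(whole(c, '[0-9]') == whole(c, '[0123456789]')
                 for c in '0123456789abc-')

print(binary, not_binary_digit, range_same)
# ['10010100'] ['2', '3'] True
```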
The special characters within a set are '^', '-', and the ']' that
closes the set. Below are examples of how to use them literally in a
set:
[01^] -- put the caret anywhere except the beginning
[01-] -- put the dash last
[]01] -- put the closing bracket first
[^]01] (empty and universal sets cannot be specified)
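Python's `re` module follows the same placement rules for these three characters, which allows a quick check. The sketch assumes the two engines agree here; the `cases` table simply pairs each set with the special character it is meant to match.

```python
import re

cases = {
    '[01^]': '^',    # caret not in first position
    '[01-]': '-',    # dash in last position
    '[]01]': ']',    # closing bracket in first position
}
literal_ok = {p: re.fullmatch(p, ch) is not None for p, ch in cases.items()}

# The inverted set [^]01] rejects ']' but accepts other characters.
inverted = (re.fullmatch('[^]01]', ']') is None,
            re.fullmatch('[^]01]', 'x') is not None)

print(literal_ok, inverted)
```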
Regular
expressions can also include the following backslash escapes to refer to common
classes of characters:
\w any word constituent character (= [a-zA-Z0-9_])
\W any character but a word constituent
\d a digit (same as [0-9])
\D anything but a digit
\s a whitespace character (same as [:space:] below)
\S anything but a whitespace character
These escapes are
also allowed in character classes: '[\w+-]' means 'any character that is either
a word constituent, or a plus, or a minus'.
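The escapes can be tried in Python as follows. Note the assumptions: Python's `\w`, `\d` and `\s` cover Unicode by default, so the sketch passes `re.ASCII` to stay close to the ASCII definitions listed above, and its notion of whitespace may still differ in detail from the [:space:] class of MetaEdit+. The helper `whole` is a hypothetical name.

```python
import re

def whole(text, pattern):
    # re.ASCII keeps \w, \d and \s to their ASCII definitions.
    return re.fullmatch(pattern, text, re.ASCII) is not None

word_like = [c for c in 'a_9+- ' if whole(c, r'\w')]
plus_minus_word = [c for c in 'a_9+-* ' if whole(c, r'[\w+-]')]
non_digits = [c for c in 'a1 ' if whole(c, r'\D')]

print(word_like)        # ['a', '_', '9']
print(plus_minus_word)  # ['a', '_', '9', '+', '-']
print(non_digits)       # ['a', ' ']
```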
Character classes can also include the following
grep(1)-compatible elements to refer to classes of characters:
[:alnum:] any alphanumeric character
(same as [a-zA-Z0-9])
[:alpha:] any alphabetic character
(same as [a-zA-Z])
[:cntrl:] any control character
(any character with code < 32)
[:digit:] any decimal digit (same as [0-9])
[:graph:] any graphical character
(any character with code >= 32)
[:lower:] any lowercase character
(including non-ASCII characters)
[:print:] any printable character. In this version,
this is the same as [:graph:]
[:punct:] any punctuation character
(. , ! ? ; : ' - ( ) ` and double quotes)
[:space:] any whitespace character (space, tab, CR,
LF, null, form feed, Ctrl-Z, 16r2000-
16r200B, 16r3000)
[:upper:] any uppercase character
(including non-ASCII characters)
[:xdigit:] any hexadecimal character
(same as [a-fA-F0-9])
Note that these
elements are components of the character classes, i.e. they have to be enclosed
in an extra set of square brackets to form a valid regular expression. For
example, a non-empty string of digits would be represented as
'[[:digit:]]+'.
As an example, so far we have seen the following
equivalent ways to write a regular expression that matches a non-empty string of
digits:
[0-9]+
\d+
[\d]+
[[:digit:]]+
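Python's `re` module does not support the grep(1)-style elements, so checking the equivalence there needs a small translation step. The sketch below is hypothetical throughout: `POSIX_CLASSES` and `expand_posix` are invented for the example and cover only the classes whose ASCII equivalents are spelled out in the list above.

```python
import re

# Hypothetical translation table, limited to classes with an explicit
# ASCII equivalent in the list above.
POSIX_CLASSES = {
    '[:alnum:]': 'a-zA-Z0-9',
    '[:alpha:]': 'a-zA-Z',
    '[:digit:]': '0-9',
    '[:xdigit:]': 'a-fA-F0-9',
}

def expand_posix(pattern: str) -> str:
    """Replace supported [:name:] elements with explicit ranges."""
    for name, ranges in POSIX_CLASSES.items():
        pattern = pattern.replace(name, ranges)
    return pattern

forms = ['[0-9]+', r'\d+', r'[\d]+', expand_posix('[[:digit:]]+')]

# All four forms should give the same verdict on every sample.
agree = all(
    len({re.fullmatch(f, s, re.ASCII) is not None for f in forms}) == 1
    for s in ['2024', '', '12a', '007']
)
print(forms[3], agree)  # [0-9]+ True
```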
The last group of special primitive
expressions includes:
. matching any character except a NULL;
^ matching an empty string at the beginning of a line;
$ matching an empty string at the end of a line.
\b an empty string at a word boundary
\B an empty string not at a word boundary
\< an empty string at the beginning of a word
\> an empty string at the end of a word
'axyzb' =/ 'a.+b' -- true
'ax zb' =/ 'a.+b' -- true (space is matched by '.')
'ax
zb' =/ 'a.+b' -- true (newline is matched by '.')
Again,
the dot ., caret ^ and dollar $ characters are special and must be escaped with
a backslash to be matched literally.
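These primitives can be explored in Python, with two caveats that make this a sketch rather than an exact equivalent: Python's dot does not match a newline unless `re.DOTALL` is given (whereas the example above shows the dot matching a newline), and Python's `re` module has no `\<` or `\>`, so only `\b` and `\B` are shown.

```python
import re

# Python's dot skips newlines unless re.DOTALL is given.
dot_default = re.fullmatch('a.+b', 'ax\nzb') is not None
dot_all = re.fullmatch('a.+b', 'ax\nzb', re.DOTALL) is not None

# ^ and $ match at line starts and ends when re.MULTILINE is given.
lines = re.findall(r'^\w+$', 'one\ntwo', re.MULTILINE)

# \b matches at a word boundary, \B everywhere else.
whole_word = re.search(r'\bcat\b', 'concatenate') is not None
inside_word = re.search(r'\Bcat\B', 'concatenate') is not None

print(dot_default, dot_all, lines, whole_word, inside_word)
# False True ['one', 'two'] False True
```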
EXAMPLES
As the introduction said, a great use for
regular expressions is user input validation. Following are a few examples of
regular expressions that might be handy in checking input entered by the user in
an input field.
Checking if aString may represent a nonnegative integer
number:
[0-9]+
or \d+
Checking if aString may represent an integer
number with an optional sign in front:
(\+|-)?\d+
Checking if aString is a fixed-point
number, with at least one digit required after the dot:
(\+|-)?\d+(\.\d+)?
The same, but allow notation like
'123.':
(\+|-)?\d+(\.\d*)?
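The numeric patterns above can be wrapped in a small validation helper. The following Python sketch is illustrative only: the constant names and `is_valid` are hypothetical, and `re.fullmatch` approximates checking the whole input field.

```python
import re

INTEGER = r'(\+|-)?\d+'
FIXED_STRICT = r'(\+|-)?\d+(\.\d+)?'   # digits required after the dot
FIXED_LAX = r'(\+|-)?\d+(\.\d*)?'      # also accepts '123.'

def is_valid(text: str, pattern: str) -> bool:
    """True when the whole input matches the given pattern."""
    return re.fullmatch(pattern, text) is not None

print(is_valid('-42', INTEGER))         # True
print(is_valid('4.2', INTEGER))         # False
print(is_valid('123.', FIXED_STRICT))   # False
print(is_valid('123.', FIXED_LAX))      # True
```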
Recognizer for a string that
might be a name: one word with a capital first letter, no blanks, no digits:
[A-Z][A-Za-z]*
A date in format MMM DD, YYYY with
any number of spaces in between, in the 20th century:
(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[ ]+(\d\d?)[ ]*,[ ]*19(\d\d)
Note the parentheses around
some components of the expression above. In a generator to...endto mapping,
these will allow us to obtain the actual strings that have matched them (i.e.
month name, day number, and year number).
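As an analogy for how captured subexpressions expose the matched parts, here is the same date pattern in Python, where the captures are read with `groups()`. This does not show MERL syntax; `DATE` and `parse_date` are hypothetical names used only for this sketch.

```python
import re

# The date pattern, split over two lines for readability.
DATE = (r'(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)'
        r'[ ]+(\d\d?)[ ]*,[ ]*19(\d\d)')

def parse_date(text):
    """Return (month, day, two-digit year), or None if text does not match."""
    m = re.fullmatch(DATE, text)
    return m.groups() if m else None

print(parse_date('Feb  7 , 1984'))  # ('Feb', '7', '84')
print(parse_date('Feb 7, 2004'))    # None
```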
For dessert, coming back to numbers: here is a recognizer
for a general number format: anything like 999, or 999.999, or
-999.999e+21.
(\+|-)?\d+(\.\d*)?((e|E)(\+|-)?\d+)?
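A quick way to get a feel for this recognizer is to run it over a few candidate inputs. The Python sketch below uses hypothetical names (`NUMBER`, `is_number`) and sample strings chosen for illustration.

```python
import re

NUMBER = r'(\+|-)?\d+(\.\d*)?((e|E)(\+|-)?\d+)?'

def is_number(text: str) -> bool:
    """True when the whole text fits the general number format."""
    return re.fullmatch(NUMBER, text) is not None

candidates = ['999', '999.999', '-999.999e+21', '1E5', '1e', '.5', 'e10']
accepted = [s for s in candidates if is_number(s)]
print(accepted)  # ['999', '999.999', '-999.999e+21', '1E5']
```

The rejected inputs show the limits of the pattern: an exponent marker must be followed by digits, and at least one digit is required before the dot.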