Appendix A: Regular expressions

MetaEdit+ uses regular expressions to check property values and symbol element conditions. Regular expressions (regex for short) provide a formal way to define patterns of characters so that strings can be matched against them. Because regex is not specific to MetaEdit+ and is common knowledge among software developers, the MetaEdit+ manuals do not cover it in full detail. For basic information and a tutorial about regex, we recommend www.regular-expressions.info .

MetaEdit+ uses the open source Regex framework, originally written by Vassili Bykov. Its documentation follows. Comparisons are presented as in MERL conditions: a literal string is compared with a regular expression, itself written as a literal string, using the =/ operator.

The simplest regular expression is a single character. It matches exactly that character. A sequence of characters matches a string with exactly the same sequence of characters:

'a'       =/ 'a'			-- true
'foobar'  =/ 'foobar'		-- true
'blorple' =/ 'foobar'		-- false
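
For readers who want to experiment outside MetaEdit+, the comparisons above can be approximated in Python. This is only a sketch: the helper name matches is hypothetical, and Python's re module is a different engine from the Regex framework used by MetaEdit+, so the two can be expected to agree only on common syntax such as this. re.fullmatch is used on the assumption, suggested by the examples in this appendix, that =/ succeeds only when the whole string matches.

```python
import re

def matches(text: str, pattern: str) -> bool:
    """Hypothetical helper: True when the whole of text matches pattern."""
    return re.fullmatch(pattern, text) is not None

print(matches('a', 'a'))             # True
print(matches('foobar', 'foobar'))   # True
print(matches('blorple', 'foobar'))  # False
```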

The above paragraph introduced a primitive regular expression (a character), and an operator (sequencing). Operators are applied to regular expressions to produce more complex regular expressions. Sequencing (placing expressions one after another) as an operator is, in a certain sense, 'invisible'--yet it is arguably the most common.

A more 'visible' operator is Kleene closure, more often simply referred to as 'a star'. A regular expression followed by an asterisk matches any number (including 0) of matches of the original expression. For example:

'ab'     =/ 'a*b'	 	-- true
'aaaaab' =/ 'a*b'		-- true
'b'      =/ 'a*b'	 	-- true
'aac'    =/ 'a*b'		-- false: b does not match

A star's precedence is higher than that of sequencing. A star applies to the shortest possible subexpression that precedes it. For example, 'ab*' means 'a followed by zero or more occurrences of b', not 'zero or more occurrences of ab':

'abbb' =/ 'ab*'	 		-- true
'abab' =/ 'ab*'		 	-- false

To actually make a regex matching 'zero or more occurrences of ab', 'ab' is enclosed in parentheses:

'abab'  =/ '(ab)*'		-- true
'abcab' =/ '(ab)*'	 	-- false: c spoils the fun
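
The star and grouping examples above can be checked mechanically. The sketch below uses Python's re module as a stand-in (an assumption: it is not the MetaEdit+ engine, and the helper matches is hypothetical) and compares each result with the value stated in the text.

```python
import re

def matches(text, pattern):
    # Hypothetical stand-in for  text =/ pattern  (whole-string match).
    return re.fullmatch(pattern, text) is not None

# (string, pattern, expected result) taken from the examples above
STAR_CASES = [
    ('ab',     'a*b',   True),
    ('aaaaab', 'a*b',   True),
    ('b',      'a*b',   True),
    ('aac',    'a*b',   False),
    ('abbb',   'ab*',   True),   # star binds to b only
    ('abab',   'ab*',   False),
    ('abab',   '(ab)*', True),   # parentheses make the star apply to ab
    ('abcab',  '(ab)*', False),
]

star_results = [matches(s, p) for s, p, _ in STAR_CASES]
for (s, p, expected), got in zip(STAR_CASES, star_results):
    print(f'{s!r} =/ {p!r} -> {got} (expected {expected})')
```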

A regex that is enclosed in parentheses is called a subexpression. By default, subexpressions are 'capturing'; however, a subexpression can be marked as 'hidden' to make it non-capturing. This is done by adding '?:' immediately after the opening parenthesis. Marking a subexpression as non-capturing has no effect on the matching:

'abab'  =/ '(?:ab)*'		-- true
'abcab' =/ '(?:ab)*'	 	-- false: c spoils the fun
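
The difference between capturing and hidden subexpressions is only visible when the captured text is inspected. As an illustration in Python's re module (an assumption: the way MetaEdit+ exposes captures is not shown here), groups() lists what each capturing subexpression matched.

```python
import re

capturing = re.fullmatch('(ab)*', 'abab')
hidden = re.fullmatch('(?:ab)*', 'abab')

# Both forms match the same strings.
print(capturing is not None, hidden is not None)  # True True

print(capturing.groups())  # ('ab',)  the last repetition is captured
print(hidden.groups())     # ()       nothing is captured

print(re.fullmatch('(?:ab)*', 'abcab'))  # None
```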

Two other operators similar to '*' are '+' and '?'. '+' (positive closure, or simply 'plus') matches one or more occurrences of the original expression. '?' ('optional') matches zero or one, but never more, occurrences.

'ac'   =/ 'ab*c'	 		-- true
'ac'   =/ 'ab+c'	 		-- false: need at least one b
'abbc' =/ 'ab+c'		 	-- true
'abbc' =/ 'ab?c'		 	-- false: too many b's

Repetitions can also be specified explicitly using numbers in braces:

	Exactly three occurrences: {3}
	Two to five occurrences: {2,5}
	Three or more occurrences: {3,}
	Zero, one or two occurrences: {,2}

'abbbc'    =/ 'ab{3}c'	-- true
'abc'      =/ 'ab{2,5}c'	-- false: need at least 2 b's
'abbbbbbc' =/ 'ab{2,5}c'	-- false: may not have more
                              than 5 b's
'abbbc'    =/ 'ab{2,5}c'	-- true
'abbbbbbc' =/ 'ab{3,}c'	-- true
'ac'       =/ 'ab{,2}c'	-- true
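
One way to compare the repetition operators is to list, for each pattern, how many b's it accepts between 'a' and 'c'. The sketch below does this with Python's re module, which happens to accept the same '*', '+', '?' and brace forms; the helper names matches and accepted_counts are hypothetical, and the MetaEdit+ engine itself is not involved.

```python
import re

def matches(text, pattern):
    return re.fullmatch(pattern, text) is not None

def accepted_counts(pattern, limit=8):
    """Return the numbers n (0..limit) for which 'a' + n b's + 'c' matches."""
    return [n for n in range(limit + 1)
            if matches('a' + 'b' * n + 'c', pattern)]

print(accepted_counts('ab*c'))      # [0, 1, 2, 3, 4, 5, 6, 7, 8]
print(accepted_counts('ab+c'))      # [1, 2, 3, 4, 5, 6, 7, 8]
print(accepted_counts('ab?c'))      # [0, 1]
print(accepted_counts('ab{3}c'))    # [3]
print(accepted_counts('ab{2,5}c'))  # [2, 3, 4, 5]
print(accepted_counts('ab{3,}c'))   # [3, 4, 5, 6, 7, 8]
print(accepted_counts('ab{,2}c'))   # [0, 1, 2]
```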

As we have seen, characters '*', '+', '?', '(', and ')' have special meanings in regular expressions. If any of these is to be used literally, it should be escaped: preceded with a backslash. (Thus, backslash is also a special character, and needs to be escaped for a literal match.) The full set of characters that require or may require backslash is: ^$\()*+.?[{|:

'ab*' =/ 'ab*'		 	-- false: star in the right
                              string is special
'ab*' =/ 'ab\*'	 		-- true
'a\c' =/ 'a\\c'		 	-- true
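
The escaping rules can be tried out in Python as a rough analogue (an assumption: Python's re is a different engine, and matches is a hypothetical helper). Raw strings (r'...') are used so that Python's own string syntax does not consume the backslashes. re.escape is a Python standard-library function that adds the backslashes automatically; no equivalent in MetaEdit+ is implied.

```python
import re

def matches(text, pattern):
    return re.fullmatch(pattern, text) is not None

print(matches('ab*', 'ab*'))     # False: the star is an operator
print(matches('ab*', r'ab\*'))   # True
print(matches('a\\c', r'a\\c'))  # True: the text is a, backslash, c

# Escaping a literal string so that it matches only itself.
literal = 'a+b(c)'
escaped = re.escape(literal)
print(escaped)                    # a\+b\(c\)
print(matches(literal, escaped))  # True
```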

The last operator is '|' meaning 'or'. It is placed between two regular expressions, and the resulting expression matches if one of the expressions matches. It has the lowest possible precedence (lower than sequencing). For example, 'ab*|ba*' means 'a followed by any number of b's, or b followed by any number of a's':

'abb'  =/ 'ab*|ba*'	 	-- true
'baa'  =/ 'ab*|ba*'	 	-- true
'baab' =/ 'ab*|ba*'	 	-- false

A bit more complex example is the following expression, matching the name of any of the Lisp-style 'car', 'cdr', 'caar', 'cadr', ... functions:

c(a|d)+r
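
The alternation examples, including the Lisp accessor pattern, can be exercised with the following Python sketch. The helper matches, the constant name LISP_ACCESSOR and the list of sample names are illustrative assumptions, and Python's re stands in for the MetaEdit+ engine.

```python
import re

def matches(text, pattern):
    return re.fullmatch(pattern, text) is not None

print(matches('abb', 'ab*|ba*'))   # True
print(matches('baa', 'ab*|ba*'))   # True
print(matches('baab', 'ab*|ba*'))  # False

LISP_ACCESSOR = 'c(a|d)+r'
candidates = ['car', 'cdr', 'caar', 'cadr', 'cddar', 'cr', 'cons']
accepted = [name for name in candidates if matches(name, LISP_ACCESSOR)]
print(accepted)  # ['car', 'cdr', 'caar', 'cadr', 'cddar']
```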

It is possible to write an expression matching an empty string, for example: 'a|'. However, it is an error to apply '*', '+', or '?' to such an expression: '(a|)*' is an invalid expression.

So far, we have used only characters as the 'smallest' components of regular expressions. There are other, more 'interesting', components.

A character set is a string of characters enclosed in square brackets. It matches any single character that appears between the brackets. For example, '[01]' matches either '0' or '1':

'0'  =/ '[01]'	    -- true
'3'  =/ '[01]'	    -- false
'11' =/ '[01]'	    -- false: a set matches only one char

Using the plus operator, we can build the following binary number recognizer:

'10010100' =/ '[01]+'	 	-- true
'10001210' =/ '[01]+'	 	-- false

If the first character after the opening bracket is '^', the set is inverted: it matches any single character *not* appearing between the brackets:

'0' =/ '[^01]'		  	-- false
'3' =/ '[^01]'		 	-- true

For convenience, a set may include ranges: pairs of characters separated with '-'. This is equivalent to listing all characters between them: '[0-9]' is the same as '[0123456789]'.
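
Sets, inverted sets and ranges behave the same way in Python's re module, so they can be tried out there. This is a sketch with a hypothetical helper matches, not the MetaEdit+ engine. The last check confirms, for the ASCII characters, that the range '[0-9]' and the explicit list accept exactly the same single characters.

```python
import re

def matches(text, pattern):
    return re.fullmatch(pattern, text) is not None

print(matches('10010100', '[01]+'))  # True
print(matches('10001210', '[01]+'))  # False: contains a 2
print(matches('3', '[^01]'))         # True
print(matches('0', '[^01]'))         # False

# A range and the equivalent explicit list accept the same characters.
same = all(matches(ch, '[0-9]') == matches(ch, '[0123456789]')
           for ch in map(chr, range(128)))
print(same)  # True
```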

Special characters within a set are '^', '-', and ']' that closes the set. Below are the examples of how to use them literally in a set:

[01^]		-- put the caret anywhere except the beginning
[01-]		-- put the dash last 
[]01]		-- put the closing bracket first 
[^]01]		-- likewise in an inverted set, right after the caret
		   (so empty and universal sets cannot be specified)
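
The placement rules above also hold in Python's re module, which makes them easy to verify outside MetaEdit+. This is a sketch with a hypothetical helper matches; agreement with the MetaEdit+ engine beyond these four sets is not implied.

```python
import re

def matches(text, pattern):
    return re.fullmatch(pattern, text) is not None

print(matches('^', '[01^]'))   # True: caret not at the beginning
print(matches('-', '[01-]'))   # True: dash placed last
print(matches(']', '[]01]'))   # True: closing bracket placed first
print(matches(']', '[^]01]'))  # False: ']' is excluded by the inverted set
print(matches('x', '[^]01]'))  # True
```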

Regular expressions can also include the following backslash escapes to refer to popular classes of characters:

\w	any word constituent character (= [a-zA-Z0-9_])
\W	any character but a word constituent
\d	a digit (same as [0-9])
\D	anything but a digit
\s 	a whitespace character (same as [:space:] below)
\S	anything but a whitespace character

These escapes are also allowed in character classes: '[\w+-]' means 'any character that is either a word constituent, or a plus, or a minus'.
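
A short Python sketch of these escapes follows. Two assumptions apply: Python's re is used instead of the MetaEdit+ engine, and the re.ASCII flag is passed because Python otherwise extends \w and \d to non-ASCII letters and digits, which would not agree with the definitions given above. The helper matches is hypothetical.

```python
import re

def matches(text, pattern, flags=re.ASCII):
    # re.ASCII restricts \w and \d to the ASCII sets listed above.
    return re.fullmatch(pattern, text, flags) is not None

print(matches('x_1', r'\w+'))  # True
print(matches('x-1', r'\w+'))  # False: '-' is not a word constituent
print(matches('42', r'\d\d'))  # True
print(matches(' \t', r'\s+'))  # True

# A set combining an escape with literal plus and minus characters.
in_set = {ch: matches(ch, r'[\w+-]') for ch in 'a+-*'}
print(in_set)  # {'a': True, '+': True, '-': True, '*': False}
```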

Character classes can also include the following grep(1)-compatible elements, which refer to predefined classes of characters:

[:alnum:]		any alphanumeric character
                (same as [a-zA-Z0-9])
[:alpha:]		any alphabetic character
                (same as [a-zA-Z])
[:cntrl:]		any control character
                (any character with code < 32)
[:digit:]		any decimal digit (same as [0-9])
[:graph:]		any graphical character
                (any character with code >= 32)
[:lower:]		any lowercase character
                (including non-ASCII characters)
[:print:]		any printable character. In this version,
                this is the same as [:graph:]
[:punct:]		any punctuation character
                (. , ! ? ; : ' - ( ) ` and double quotes)
[:space:]		any whitespace character (space, tab, CR,
                LF, null, form feed, Ctrl-Z, 16r2000-
                16r200B, 16r3000)
[:upper:]		any uppercase character
                (including non-ASCII characters)                   
[:xdigit:]		any hexadecimal character
                (same as [a-fA-F0-9])

Note that these elements are components of the character classes, i.e. they have to be enclosed in an extra set of square brackets to form a valid regular expression. For example, a non-empty string of digits would be represented as '[[:digit:]]+'.

As an example, so far we have seen the following equivalent ways to write a regular expression that matches a non-empty string of digits:

[0-9]+
\d+
[\d]+
[[:digit:]]+
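
Python's re module does not understand the [:name:] elements, so to compare the four forms there, the named element has to be rewritten first. The sketch below does this with a small, hypothetical translation table built only from the 'same as' notes in the list above; it covers just the four elements that have an exact ASCII range and is not part of MetaEdit+.

```python
import re

# Hypothetical translation table built from the "same as" notes above.
POSIX_TO_RANGES = {
    '[:alnum:]': 'a-zA-Z0-9',
    '[:alpha:]': 'a-zA-Z',
    '[:digit:]': '0-9',
    '[:xdigit:]': 'a-fA-F0-9',
}

def translate(pattern):
    """Replace the supported [:name:] elements with explicit ranges."""
    for name, ranges in POSIX_TO_RANGES.items():
        pattern = pattern.replace(name, ranges)
    return pattern

digit_patterns = ['[0-9]+', r'\d+', r'[\d]+', translate('[[:digit:]]+')]
print(digit_patterns[-1])  # [0-9]+

# Each form should accept '2024' and reject '20x4'.
for p in digit_patterns:
    print(p, re.fullmatch(p, '2024') is not None,
          re.fullmatch(p, '20x4') is not None)
```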

The last group of special primitive expressions includes:

.	matching any character except a NULL
^	matching an empty string at the beginning of a line
$	matching an empty string at the end of a line
\b	an empty string at a word boundary
\B	an empty string not at a word boundary
\<	an empty string at the beginning of a word
\>	an empty string at the end of a word

'axyzb' =/ 'a.+b'	-- true
'ax zb' =/ 'a.+b'	-- true (space is matched by '.')
'ax
zb' =/ 'a.+b'		-- true (newline is matched by '.')

Again, the dot ., caret ^ and dollar $ characters are special and should be escaped with a backslash to be matched literally.
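
These primitives are where regex engines tend to differ, so the following Python sketch is best read as a contrast, not as a description of MetaEdit+. In Python, '.' matches a newline only with the re.DOTALL flag, '^' and '$' work per line only with re.MULTILINE, and '\<' and '\>' are not available at all. re.search is used for the boundary and anchor examples because they concern positions inside a longer string.

```python
import re

text = 'ax\nzb'
print(re.fullmatch('a.+b', text) is not None)             # False by default
print(re.fullmatch('a.+b', text, re.DOTALL) is not None)  # True

print(re.search(r'\bcat\b', 'a cat sat') is not None)    # True: whole word
print(re.search(r'\bcat\b', 'concatenate') is not None)  # False
print(re.search(r'\Bcat\B', 'concatenate') is not None)  # True

lines = 'one two\nthree four'
print(re.findall(r'^\w+', lines, re.MULTILINE))  # ['one', 'three']
print(re.findall(r'\w+$', lines, re.MULTILINE))  # ['two', 'four']
```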

EXAMPLES

As the introduction said, a great use for regular expressions is user input validation. Following are a few examples of regular expressions that might be handy in checking input entered by the user in an input field.

Checking if aString may represent a nonnegative integer number:

	[0-9]+
or 	\d+

Checking if aString may represent an integer number with an optional sign in front:

(\+|-)?\d+

Checking if aString is a fixed-point number, with at least one digit required after the dot:

(\+|-)?\d+(\.\d+)?

The same, but allow notation like '123.':

(\+|-)?\d+(\.\d*)?
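
The four number patterns above can be compared side by side. The sketch below runs them through Python's re module (an assumption: the MetaEdit+ engine is not involved, and the constant names and the helper matches are hypothetical) and prints one row per sample input, with one column per pattern.

```python
import re

NONNEGATIVE_INT = r'\d+'
SIGNED_INT = r'(\+|-)?\d+'
FIXED_POINT = r'(\+|-)?\d+(\.\d+)?'
FIXED_POINT_LOOSE = r'(\+|-)?\d+(\.\d*)?'   # also allows '123.'

PATTERNS = [NONNEGATIVE_INT, SIGNED_INT, FIXED_POINT, FIXED_POINT_LOOSE]

def matches(text, pattern):
    return re.fullmatch(pattern, text) is not None

def row(text):
    """Result of each pattern, in the order of PATTERNS, for one input."""
    return [matches(text, p) for p in PATTERNS]

for sample in ['42', '-42', '+3.14', '123.', '.5', '']:
    print(repr(sample), row(sample))
```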

Recognizer for a string that might be a name: one word with a capital first letter, no blanks, no digits:

[A-Z][A-Za-z]*

A date in the format MMM DD, YYYY, with any number of spaces in between, in the 20th century:

(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[ ]+(\d\d?)[ ]*,[ ]*19(\d\d)

Note the parentheses around some components of the expression above. In a generator to...endto mapping, these will allow us to obtain the actual strings that have matched them (i.e. month name, day number, and year number).
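
To illustrate what the parenthesized components capture, here is the date expression run through Python's re module. This is a sketch only: how a MERL generator accesses the captured strings is not shown, and the constant name DATE and the sample inputs are assumptions. The pattern is split over two adjacent string literals purely for line length; Python joins them into a single expression.

```python
import re

DATE = (r'(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)'
        r'[ ]+(\d\d?)[ ]*,[ ]*19(\d\d)')

m = re.fullmatch(DATE, 'Mar  5, 1987')
print(m.groups())  # ('Mar', '5', '87')

month, day, year = m.groups()
print(month, int(day), 1900 + int(year))  # Mar 5 1987

print(re.fullmatch(DATE, 'Mar 5, 2001'))  # None: not a 19xx year
print(re.fullmatch(DATE, 'Mar5, 1987'))   # None: a space must follow the month
```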

For dessert, coming back to numbers: here is a recognizer for a general number format: anything like 999, or 999.999, or -999.999e+21.

(\+|-)?\d+(\.\d*)?((e|E)(\+|-)?\d+)?
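
As a final check, the general number recognizer can be tried on a few inputs in Python. The names GENERAL_NUMBER and is_number and the sample list are hypothetical, and Python's re is again only a stand-in for the MetaEdit+ engine.

```python
import re

GENERAL_NUMBER = r'(\+|-)?\d+(\.\d*)?((e|E)(\+|-)?\d+)?'

def is_number(text):
    return re.fullmatch(GENERAL_NUMBER, text) is not None

samples = ['999', '999.999', '-999.999e+21', '1.E5',
           '1e', '.5', 'e10', '1.2.3']
accepted = [s for s in samples if is_number(s)]
print(accepted)  # ['999', '999.999', '-999.999e+21', '1.E5']
```

Note that the expression requires at least one digit before the dot, so '.5' is rejected, and an exponent marker must be followed by digits, so '1e' is rejected as well.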