PCRE Pattern Syntax

Here is a quick reference to the more common patterns you can use in PCRE regular expressions. The most commonly used is ".*", meaning any number of any character (except newline, unless dotall mode is set). This is roughly equivalent to the wildcard "*" in the shell.

QUOTING - To prevent a character from being interpreted as a pattern meta-character, quote it.

\x where x is non-alphanumeric, indicates a literal x
\Q...\E treat enclosed characters as literal
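
To illustrate why quoting matters, here is a minimal sketch using Python's `re` module. Python's `re` is not PCRE, but it handles backslash quoting the same way. It has no \Q...\E, so `re.escape` is used here as an assumed stand-in for quoting a whole string.

```python
import re

# Unquoted, "." is a meta-character and matches any character.
unquoted = re.fullmatch(r"3.14", "3x14")

# Quoted with a backslash, "." only matches a literal dot.
quoted = re.fullmatch(r"3\.14", "3x14")
literal = re.fullmatch(r"3\.14", "3.14")

# re.escape quotes every meta-character in a string,
# similar in spirit to wrapping it in \Q...\E.
escaped = re.fullmatch(re.escape("a+b*c"), "a+b*c")

print(unquoted is not None, quoted is None, literal is not None, escaped is not None)
```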

CHARACTERS - How to specify characters, including non-printable ones, by escape sequence or by character code.

\a alarm, that is, the BEL character (hex 07)
\cx "control-x", where x is any character
\e escape (hex 1B)
\f formfeed (hex 0C)
\n newline (hex 0A)
\r carriage return (hex 0D)
\t tab (hex 09)
\ddd character with octal code ddd, or backreference
\xhh character with hex code hh
\x{hhh..} character with hex code hhh..

CHARACTER TYPES - Match based on type of character.

. any character except newline; in dotall mode, any character whatsoever

\C one byte, even in UTF-8 mode (best avoided)
\d a decimal digit
\D a character that is not a decimal digit
\h a horizontal whitespace character (e.g. space, tab, but not newline)
\H a character that is not a horizontal whitespace character
\p{xx} a character with the xx property (see below)
\P{xx} a character without the xx property (see below)
\R a newline sequence
\s a whitespace character
\S a character that is not a whitespace character
\v a vertical whitespace character (e.g. newline or CR)
\V a character that is not a vertical whitespace character
\w a "word" character
\W a "non-word" character
\X an extended Unicode sequence

In PCRE, \d, \D, \s, \S, \w, and \W recognize only ASCII characters.
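
The character types can be tried out with a short sketch in Python's `re` module, used here only as a stand-in for PCRE. By default Python lets \d, \s and \w match Unicode characters, so the `re.ASCII` flag is passed to mirror the ASCII-only behaviour described above. The sample text is made up for illustration.

```python
import re

text = "Order 66: 2 items\tshipped"

# re.ASCII keeps \d, \s and \w limited to ASCII characters.
digits = re.findall(r"\d+", text, re.ASCII)
words = re.findall(r"\w+", text, re.ASCII)
pieces = re.split(r"\s+", text, flags=re.ASCII)

# U+0663 (ARABIC-INDIC DIGIT THREE) is a digit, but not an ASCII one.
non_ascii_digit = re.findall(r"\d", "\u0663", re.ASCII)

print(digits, words, pieces, non_ascii_digit)
```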

GENERAL CATEGORY PROPERTY CODES for use with \p and \P

C Other
Cc Control
Cf Format
Cn Unassigned
Co Private use
Cs Surrogate
L Letter
Ll Lower case letter
Lm Modifier letter
Lo Other letter
Lt Title case letter
Lu Upper case letter
L& Ll, Lu, or Lt
M Mark
Mc Spacing mark
Me Enclosing mark
Mn Non-spacing mark
N Number
Nd Decimal number
Nl Letter number
No Other number
P Punctuation
Pc Connector punctuation
Pd Dash punctuation
Pe Close punctuation
Pf Final punctuation
Pi Initial punctuation
Po Other punctuation
Ps Open punctuation
S Symbol
Sc Currency symbol
Sk Modifier symbol
Sm Mathematical symbol
So Other symbol
Z Separator
Zl Line separator
Zp Paragraph separator
Zs Space separator

CHARACTER CLASSES - Match a range or set of characters. For example, "[abc]" would match either a, b or c.

[...] positive character class
[^...] negative character class
[x-y] range (can be used for hex characters)
[[:xxx:]] positive POSIX named set
[[:^xxx:]] negative POSIX named set

POSIX named sets for use in character classes:

alnum alphanumeric
alpha alphabetic
ascii 0-127
blank space or tab
cntrl control character
digit decimal digit
graph printing, excluding space
lower lower case letter
print printing, including space
punct printing, excluding alphanumeric
space whitespace
upper upper case letter
word same as \w
xdigit hexadecimal digit

In PCRE, POSIX character set names recognize only ASCII characters. You can use \Q...\E inside a character class.
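
A minimal sketch of positive classes, negative classes and ranges, again using Python's `re` module as a stand-in for PCRE. POSIX named sets such as [[:alpha:]] are left out because Python's `re` does not implement them.

```python
import re

# Positive class: any one of the listed characters.
vowels = re.findall(r"[aeiou]", "regex")

# Negative class: any one character that is NOT listed.
consonants = re.findall(r"[^aeiou]", "regex")

# Range: any one character from a through c, repeated.
runs = re.findall(r"[a-c]+", "abcdcba")

# Several ranges can be combined, for example for hex digits.
hex_digits = re.findall(r"[0-9A-Fa-f]+", "ff 7G")

print(vowels, consonants, runs, hex_digits)
```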

QUANTIFIERS - Control how many times an element may repeat, and whether the match takes as much or as little as possible. For example, given the string "The quick brown fox slyly jumped over the lazy dog", the greedy pattern "T.*e" would return "The quick brown fox slyly jumped over the", while the lazy pattern "T.*?e" would simply return "The". Possessive quantifiers are like greedy ones, except that they never give back what they have matched: ".*+" consumes everything to the end of the string, so if there is more to the pattern after it, that part cannot be satisfied and the match fails. "T.*+e" would not match the above string at all.

? 0 or 1, greedy
?+ 0 or 1, possessive
?? 0 or 1, lazy
* 0 or more, greedy
*+ 0 or more, possessive
*? 0 or more, lazy
+ 1 or more, greedy
++ 1 or more, possessive
+? 1 or more, lazy
{n} exactly n
{n,m} at least n, no more than m, greedy
{n,m}+ at least n, no more than m, possessive
{n,m}? at least n, no more than m, lazy
{n,} n or more, greedy
{n,}+ n or more, possessive
{n,}? n or more, lazy
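
The greedy, lazy and possessive results described above can be reproduced with a short sketch. It uses Python's `re` module rather than PCRE itself. Python only accepts possessive quantifiers from version 3.11 onward, so that part is guarded by a version check.

```python
import re
import sys

text = "The quick brown fox slyly jumped over the lazy dog"

# Greedy: .* grabs everything, then backs off to the last "e".
greedy = re.search(r"T.*e", text).group(0)

# Lazy: .*? stops at the first "e" it can.
lazy = re.search(r"T.*?e", text).group(0)

# Possessive: .*+ grabs everything and never backs off, so "e" cannot match.
# Possessive quantifiers need Python 3.11+; skipped on older versions.
if sys.version_info >= (3, 11):
    possessive = re.search(r"T.*+e", text)
else:
    possessive = None

print(repr(greedy), repr(lazy), possessive)
```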

ANCHORS and SIMPLE ASSERTIONS - Match based on the position within the string.

\b word boundary
\B not a word boundary

^ start of subject; also after an internal newline in multiline mode
\A start of subject
$ end of subject; also before a newline at end of subject, and before an internal newline in multiline mode
\Z end of subject; also before a newline at end of subject
\z end of subject
\G first matching position in subject
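
The effect of multiline mode on ^ and $ can be sketched as follows, using Python's `re` module as a stand-in for PCRE. \Z and \z are left out because Python's \Z matches only at the very end of the string, which differs from PCRE's \Z. The sample strings are made up for illustration.

```python
import re

text = "first line\nsecond line\n"

# Without multiline mode, ^ matches only at the start of the subject.
default_starts = re.findall(r"^\w+", text)

# In multiline mode, ^ also matches after each internal newline.
multiline_starts = re.findall(r"^\w+", text, re.MULTILINE)

# $ matches before a newline at the end of the subject.
ends_before_final_newline = re.search(r"line$", text) is not None

# \A matches only at the start of the subject, even in multiline mode.
only_at_start = re.findall(r"\A\w+", text, re.MULTILINE)

# \b requires a word boundary on both sides of "line".
whole_words = re.findall(r"\bline\b", "line, lines, online")

print(default_starts, multiline_starts, ends_before_final_newline, only_at_start, whole_words)
```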

ALTERNATION - Match any of several possible expressions.

expr|expr|expr...

CAPTURING - Return submatches.

(...) capturing group
(?<name>...) named capturing group (Perl)
(?'name'...) named capturing group (Perl)
(?P<name>...) named capturing group (Python)
(?:...) non-capturing group
(?|...) non-capturing group; reset group numbers for capturing groups in each alternative
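
A minimal sketch of named and non-capturing groups, using Python's `re` module as a stand-in for PCRE. Python's `re` accepts only the (?P<name>...) spelling of named groups. The group names `year` and `month` are made up for illustration.

```python
import re

# Named groups can be read back by name or by number.
m = re.match(r"(?P<year>\d{4})-(?P<month>\d{2})", "2024-05")
by_name = (m.group("year"), m.group("month"))
by_number = (m.group(1), m.group(2))

# (?:...) groups without capturing, so only (c) gets a number.
groups = re.match(r"(?:ab)+(c)", "ababc").groups()

print(by_name, by_number, groups)
```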

COMMENT - You shouldn't need this in Gambas, but it may be useful if you're writing regular expressions to be used in several different languages.

(?#....) comment (not nestable)

OPTION SETTING - You should use the constants found in the RegExp class as arguments to the Compile or Exec methods to make your code more readable, but if you're generating regular expressions at run time, these may be useful.

(?i) caseless
(?J) allow duplicate names
(?m) multiline
(?s) single line (dotall)
(?U) default ungreedy (lazy)
(?x) extended (ignore white space)
(?-...) unset option(s)
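
The difference between passing an option as a constant and embedding it in the pattern can be sketched as follows. This uses Python's `re` module and its flags as an assumed analogue of the Gambas RegExp constants, not the Gambas API itself.

```python
import re

# Caseless matching set inside the pattern...
inline = re.search(r"(?i)pcre", "Uses PCRE")

# ...or passed as a separate option, which is usually more readable.
flagged = re.search(r"pcre", "Uses PCRE", re.IGNORECASE)

# By default "." does not match a newline; (?s) turns on dotall mode.
dot_default = re.search(r"a.b", "a\nb")
dot_all = re.search(r"(?s)a.b", "a\nb")

print(inline.group(0), flagged.group(0), dot_default, dot_all.group(0) == "a\nb")
```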

BACKREFERENCES - Refer to previous submatches in the current match.

\n reference by number (can be ambiguous)
\gn reference by number
\g{n} reference by number
\g{-n} relative reference by number
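
As a minimal sketch, a backreference can be used to find an accidentally doubled word. The example uses Python's `re` module as a stand-in for PCRE and shows only the \n form, since Python's `re` does not accept \g{n} inside a pattern. The sample sentence is made up for illustration.

```python
import re

# (\w+) captures a word; \1 requires the same text to appear again.
doubled = re.search(r"\b(\w+) \1\b", "it was the the best of times")

print(doubled.group(0), doubled.group(1))
```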

For more detailed information about the library, see http://www.regular-expressions.info/pcre.html or http://www.pcre.org. The above quick reference was adapted from the "pcresyntax" man page.