
PCRE Pattern Syntax

This is a quick reference to the more common patterns you can use in PCRE regular expressions. The most commonly used is ".*", which matches any number of any character (newlines excepted unless dotall mode is set). It is roughly equivalent to the wildcard "*" in the shell.

QUOTING - To prevent a character from being interpreted as a pattern meta-character, quote it.

  • \x where x is non-alphanumeric, indicates a literal x

  • \Q...\E treat enclosed characters as literal
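
The effect of quoting can be sketched as follows. This is only an illustration in Python, whose re module is used as a stand-in because it shares the backslash quoting rule with PCRE; it is not Gambas code. Python's re does not support \Q...\E, so re.escape is used to play a similar role, and the subject string is made up for the example.

```python
import re

subject = "Total: 3.50 (approx.)"

# Unquoted, "." is a meta-character, so "3.50" also matches "3x50".
unquoted = re.search(r"3.50", "3x50")

# Quoted with a backslash, "\." matches only a literal dot.
literal_dot = re.search(r"3\.50", subject)
no_match = re.search(r"3\.50", "3x50")

# re.escape quotes every meta-character in a string. It stands in for
# PCRE's \Q...\E here, which Python's re module does not provide.
quoted = re.escape("(approx.)")
whole = re.search(quoted, subject)

print(unquoted.group(0))     # 3x50
print(literal_dot.group(0))  # 3.50
print(no_match)              # None
print(whole.group(0))        # (approx.)
```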

CHARACTERS - How to specify characters, including non-printable ones, by escape sequence or by character code.

  • \a alarm, that is, the BEL character (hex 07)

  • \cx "control-x", where x is any character

  • \e escape (hex 1B)

  • \f formfeed (hex 0C)

  • \n newline (hex 0A)

  • \r carriage return (hex 0D)

  • \t tab (hex 09)

  • \ddd character with octal code ddd, or backreference

  • \xhh character with hex code hh
    \x{hhh..} character with hex code hhh..
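
A few of these escapes in use, as a minimal sketch. Python's re module is assumed as a stand-in for PCRE; it understands \t, \r, \n and \xhh in the same way, but not \e, \cx or \x{hhh..}, so those are left out. The subject string is invented for the example.

```python
import re

# "name", a tab, "value", CR LF, then "A=A" (the last A written as hex 41).
subject = "name\tvalue\r\nA=\x41"

tab = re.search(r"\t", subject)        # tab, hex 09
crlf = re.search(r"\r\n", subject)     # carriage return followed by newline
hex_a = re.search(r"=\x41", subject)   # \x41 is the letter "A"

print(tab.start())       # 4
print(crlf.start())      # 10
print(hex_a.group(0))    # =A
```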
    

CHARACTER TYPES - Match based on type of character.

  • . any character except newline;
    in dotall mode, any character whatsoever
    

  • \C one byte, even in UTF-8 mode (best avoided)

  • \d a decimal digit

  • \D a character that is not a decimal digit

  • \h a horizontal whitespace character (e.g. space, tab, but not newline)

  • \H a character that is not a horizontal whitespace character

  • \p{xx} a character with the xx property (see below)

  • \P{xx} a character without the xx property (see below)

  • \R a newline sequence

  • \s a whitespace character

  • \S a character that is not a whitespace character

  • \v a vertical whitespace character (e.g. newline or CR)

  • \V a character that is not a vertical whitespace character

  • \w a "word" character

  • \W a "non-word" character

  • \X an extended Unicode sequence

    In PCRE, \d, \D, \s, \S, \w, and \W recognize only ASCII characters.
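
The most common character types can be tried out like this. The sketch assumes Python's re module as a stand-in for PCRE; the re.ASCII flag is passed so that \d, \s and \w are limited to ASCII, which mirrors the behaviour described in the note above. \h, \R, \X and \p{xx} are not available in Python's re and are not shown.

```python
import re

subject = "Order 66:\titem_7 costs 12 EUR"

# re.ASCII keeps \d, \s and \w limited to ASCII characters.
digits = re.findall(r"\d+", subject, re.ASCII)   # runs of decimal digits
words = re.findall(r"\w+", subject, re.ASCII)    # runs of "word" characters
gaps = re.findall(r"\s", subject, re.ASCII)      # single whitespace characters
non_digits = re.findall(r"\D+", "a1b22c")        # runs of non-digits

print(digits)      # ['66', '7', '12']
print(words)       # ['Order', '66', 'item_7', 'costs', '12', 'EUR']
print(len(gaps))   # 5
print(non_digits)  # ['a', 'b', 'c']
```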
    

GENERAL CATEGORY PROPERTY CODES for use with \p and \P

  • C Other

  • Cc Control

  • Cf Format

  • Cn Unassigned

  • Co Private use

  • Cs Surrogate

  • L Letter

  • Ll Lower case letter

  • Lm Modifier letter

  • Lo Other letter

  • Lt Title case letter

  • Lu Upper case letter

  • L& Ll, Lu, or Lt

  • M Mark

  • Mc Spacing mark

  • Me Enclosing mark

  • Mn Non-spacing mark

  • N Number

  • Nd Decimal number

  • Nl Letter number

  • No Other number

  • P Punctuation

  • Pc Connector punctuation

  • Pd Dash punctuation

  • Pe Close punctuation

  • Pf Final punctuation

  • Pi Initial punctuation

  • Po Other punctuation

  • Ps Open punctuation

  • S Symbol

  • Sc Currency symbol

  • Sk Modifier symbol

  • Sm Mathematical symbol

  • So Other symbol

  • Z Separator

  • Zl Line separator

  • Zp Paragraph separator

  • Zs Space separator

CHARACTER CLASSES - Match a range or set of characters. For example, "[abc]" would match either a, b or c.

  • [...] positive character class

  • [^...] negative character class

  • [x-y] range (can be used for hex characters)

  • [[:xxx:]] positive POSIX named set

  • [[:^xxx:]] negative POSIX named set

POSIX named sets for use in character classes:

  • alnum alphanumeric

  • alpha alphabetic

  • ascii 0-127

  • blank space or tab

  • cntrl control character

  • digit decimal digit

  • graph printing, excluding space

  • lower lower case letter

  • print printing, including space

  • punct printing, excluding alphanumeric

  • space whitespace

  • upper upper case letter

  • word same as \w

  • xdigit hexadecimal digit

    In PCRE, POSIX character set names recognize only ASCII characters. You
    can use \Q...\E inside a character class.
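
Positive classes, negative classes and ranges behave as in this small sketch. Python's re module is assumed as a stand-in for PCRE; it has no POSIX named sets, so the equivalent of [[:xdigit:]] is spelled out as a range. The subject string is made up.

```python
import re

subject = "cab #1F9a, zone 7"

# Positive class: runs made up only of a, b or c.
set_hits = re.findall(r"[abc]+", subject)

# Ranges: hexadecimal digits, written out instead of [[:xdigit:]].
hex_run = re.search(r"#([0-9A-Fa-f]+)", subject)

# Negative class: anything that is not a letter, a digit or a space.
not_alnum = re.findall(r"[^0-9A-Za-z ]", subject)

print(set_hits)           # ['cab', 'a']
print(hex_run.group(1))   # 1F9a
print(not_alnum)          # ['#', ',']
```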
    

QUANTIFIERS - Control how many times an item may repeat, and whether it matches as much (greedy) or as little (lazy) as possible. For example, given the string "The quick brown fox slyly jumped over the lazy dog", the pattern "T.*e" would return "The quick brown fox slyly jumped over the", while the pattern "T.*?e" would simply return "The". Possessive quantifiers behave like greedy ones, except that they never give back what they have matched: ".*+" consumes everything up to the end of the string, so whatever follows it in the pattern has nothing left to match and the match fails. "T.*+e" would not match the above string at all.

  • ? 0 or 1, greedy

  • ?+ 0 or 1, possessive

  • ?? 0 or 1, lazy

  • * 0 or more, greedy

  • *+ 0 or more, possessive

  • *? 0 or more, lazy

  • + 1 or more, greedy

  • ++ 1 or more, possessive

  • +? 1 or more, lazy

  • {n} exactly n

  • {n,m} at least n, no more than m, greedy

  • {n,m}+ at least n, no more than m, possessive

  • {n,m}? at least n, no more than m, lazy

  • {n,} n or more, greedy

  • {n,}+ n or more, possessive

  • {n,}? n or more, lazy
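
The three behaviours can be compared side by side on the example sentence. This is a sketch in Python, used as a stand-in for PCRE; greedy and lazy quantifiers work the same way there, while possessive quantifiers are only accepted by Python 3.11 and later, so that part is guarded by a version check (an assumption of this example, not something Gambas requires).

```python
import re
import sys

subject = "The quick brown fox slyly jumped over the lazy dog"

# Greedy: .* takes as much as it can, then backs off to the last "e".
greedy = re.search(r"T.*e", subject).group(0)

# Lazy: .*? takes as little as it can, stopping at the first "e".
lazy = re.search(r"T.*?e", subject).group(0)

# Possessive: .*+ takes everything and never backs off, so "e" cannot match.
# Python only understands this syntax from version 3.11 on.
if sys.version_info >= (3, 11):
    possessive = re.search(r"T.*+e", subject)
else:
    possessive = None  # not evaluated on older Python versions

print(greedy)
print(lazy)
print(possessive)
```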

ANCHORS and SIMPLE ASSERTIONS - Match based on the position within the string.

  • \b word boundary

  • \B not a word boundary

  • ^ start of subject
    also after internal newline in multiline mode
    

  • \A start of subject

  • $ end of subject
    also before newline at end of subject
    also before internal newline in multiline mode
    

  • \Z end of subject
    also before newline at end of subject
    

  • \z end of subject

  • \G first matching position in subject
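
How the anchors interact with multiline mode can be sketched as below. Python's re module is assumed as a stand-in for PCRE; ^, $, \A and \b behave the same way there. The end-of-subject escapes and \G are left out because Python's re spells or supports them differently.

```python
import re

subject = "alpha one\nbeta two\n"

# Without multiline mode, ^ matches only at the start of the subject.
first_only = re.findall(r"^\w+", subject)

# With (?m), ^ also matches after each internal newline ...
each_line = re.findall(r"(?m)^\w+", subject)

# ... and $ also matches before each internal newline.
line_ends = re.findall(r"(?m)\w+$", subject)

# \A is not affected by multiline mode.
start_only = re.findall(r"(?m)\A\w+", subject)

# Without multiline mode, $ still matches before a newline at the very end.
last_word = re.search(r"\w+$", subject).group(0)

# \b keeps "one" from matching inside "bone".
whole_word = re.findall(r"\bone\b", "one bone one")

print(first_only, each_line, line_ends, start_only, last_word, whole_word)
```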

ALTERNATION - Match any of several possible expressions.

  • expr|expr|expr...

CAPTURING - Return submatches.

  • (...) capturing group

  • (?<name>...) named capturing group (Perl)

  • (?'name'...) named capturing group (Perl)

  • (?P<name>...) named capturing group (Python)

  • (?:...) non-capturing group

  • (?|...) non-capturing group; reset group numbers for
    capturing groups in each alternative
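
Alternation and the main kinds of group can be combined as in this sketch. Python's re module is assumed as a stand-in for PCRE, so the Python-style (?P<name>...) form is used for the named group; the (?|...) form is not available in Python and is not shown. The log line and the group name "msg" are invented for the example.

```python
import re

subject = "2024-05-17 error: disk full"

pattern = (
    r"(\d{4})-(\d{2})-(\d{2}) "   # three numbered capturing groups
    r"(?:warning|error): "        # alternation inside a non-capturing group
    r"(?P<msg>.+)"                # named capturing group
)
m = re.search(pattern, subject)

year = m.group(1)
msg = m.group("msg")
count = len(m.groups())  # the non-capturing group is not counted

print(year)    # 2024
print(msg)     # disk full
print(count)   # 4
```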
    

COMMENT - You shouldn't need this in Gambas, but it can be useful if you are writing regular expressions to be used in several different languages.

  • (?#....) comment (not nestable)

OPTION SETTING - You should use the constants found in the RegExp class as arguments to the Compile or Exec methods to make your code more readable, but if you're generating regular expressions at run time, these may be useful.

  • (?i) caseless

  • (?J) allow duplicate names

  • (?m) multiline

  • (?s) single line (dotall)

  • (?U) default ungreedy (lazy)

  • (?x) extended (ignore white space)

  • (?-...) unset option(s)
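
Inline options placed at the start of a pattern can be sketched like this. Python's re module is assumed as a stand-in for PCRE; it accepts (?i), (?m), (?s) and (?x) with the same meaning, but not (?J) or (?U), so only the first four are shown. The subject string is made up.

```python
import re

subject = "First line\nSECOND line"

# (?i) caseless: "second" matches "SECOND".
caseless = re.search(r"(?i)second", subject).group(0)

# Without (?s), "." stops at the newline, so this pattern cannot match.
dot_default = re.search(r"First.*line$", subject)

# (?s) dotall: "." also matches the newline, so the match spans both lines.
dot_all = re.search(r"(?s)First.*line$", subject).group(0)

# (?m) multiline: ^ matches at the start of each line.
line_starts = re.findall(r"(?m)^\w+", subject)

# (?x) extended: white space in the pattern is ignored.
extended = re.search(r"(?x) (\w+) \s+ line $", subject).group(1)

print(caseless, dot_default, line_starts, extended)
```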

BACKREFERENCES - Refer to previous submatches in the current match.

  • \n reference by number (can be ambiguous)

  • \gn reference by number

  • \g{n} reference by number

  • \g{-n} relative reference by number
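
A backreference matches the same text that an earlier group captured, as in this sketch. Python's re module is assumed as a stand-in for PCRE; it supports only the plain numbered form inside a pattern, so the \g variants are not shown. Both subject strings are invented.

```python
import re

subject = "it was was a very very long day"

# (\w+) captures a word; \1 then requires the same word again.
doubled = re.findall(r"\b(\w+) \1\b", subject)

# The closing quote must be the same character as the opening one.
quoted = re.search(r"""(["'])(.*?)\1""", 'say "it\'s" now').group(2)

print(doubled)  # ['was', 'very']
print(quoted)   # it's
```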

For more detailed information about the library, see http://www.regular-expressions.info/pcre.html or http://www.pcre.org. The above quick reference was adapted from the "pcresyntax" man page.