Perl Compatible Regular Expressions

Original author(s): Philip Hazel
Stable release(s):
PCRE1: 8.45 / June 15, 2021 [1]
PCRE2: 10.43 / February 16, 2024 [2]
Written in: C
Operating system: Cross-platform
Type: Pattern matching library
License: BSD
Website: pcre.org

Perl Compatible Regular Expressions (PCRE) is a library written in C, which implements a regular expression engine, inspired by the capabilities of the Perl programming language. Philip Hazel started writing PCRE in summer 1997. [3] PCRE's syntax is much more powerful and flexible than either of the POSIX regular expression flavors (BRE, ERE) [4] and than that of many other regular-expression libraries.


While PCRE originally aimed at feature-equivalence with Perl, the two implementations are not fully equivalent. During the PCRE 7.x and Perl 5.9.x phase, the two projects coordinated development, with features being ported between them in both directions. [5]

In 2015, a fork of PCRE was released with a revised programming interface (API). The original software, now called PCRE1 (the 1.xx–8.xx series), has received bug fixes but no further development. As of 2020, it is considered obsolete, and the current 8.45 release is likely to be the last. The new PCRE2 code (the 10.xx series) has had a number of extensions and coding improvements and is where development takes place.

A number of prominent open-source programs, such as the Apache and Nginx HTTP servers, and the PHP and R scripting languages, incorporate the PCRE library; proprietary software can do likewise, as the library is BSD-licensed. As of Perl 5.10, PCRE is also available as a replacement for Perl's default regular-expression engine through the re::engine::PCRE module.

The library can be built on Unix, Windows, and several other environments. PCRE2 is distributed with a POSIX C wrapper, [Note 1] several test programs, and the utility program pcre2grep that is built in tandem with the library.

Features

Just-in-time compiler support

This optional feature is available if enabled when the PCRE2 library is built. Large performance benefits are possible when (for example) the calling program utilizes the feature with compatible patterns that are executed repeatedly. The just-in-time compiler support was written by Zoltan Herczeg and is not addressed in the POSIX wrapper.

Flexible memory management

The use of the system stack for backtracking can be problematic in PCRE1, which is why this feature of the implementation was changed in PCRE2. The heap is now used for this purpose, and the total amount can be limited. The problem of stack overflow, which came up regularly with PCRE1, is no longer an issue with PCRE2 from release 10.30 (2017).

Consistent escaping rules

Like Perl, PCRE2 has consistent escaping rules: any non-alpha-numeric character may be escaped to mean its literal value by prefixing a \ (backslash) before the character. Any alpha-numeric character preceded by a backslash typically gives it a special meaning. In the case where the sequence has not been defined to be special, an error occurs. This is different to Perl, which gives an error only if it is in warning mode (PCRE2 does not have a warning mode). In basic POSIX regular expressions, sometimes backslashes escaped non-alpha-numerics (e.g. \.), and sometimes they introduced a special feature (e.g. \(\)).

Extended character classes

Single-letter character classes are supported in addition to the longer POSIX names. For example, \d matches any digit exactly as [[:digit:]] would in POSIX regular expressions.

Minimal matching (a.k.a. "ungreedy")

A ? may be placed after any repetition quantifier to indicate that the shortest match should be used. The default is to attempt the longest match first and backtrack through shorter matches: e.g. a.*?b would match the first "ab" in "ababab", whereas a.*b would match the entire string.

If the U flag is set, then quantifiers are ungreedy (lazy) by default, while ? makes them greedy.
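
The greedy and lazy behavior described above can be tried in any engine that shares this quantifier syntax. The following is a minimal sketch using Python's built-in re module, which is not PCRE but treats a ? after a quantifier in the same way; the variable names are illustrative only, and Python's re has no counterpart to the U flag, so that part is not shown.

```python
import re

subject = "ababab"

# Greedy: .* takes as much as possible, then backtracks to the last "b".
greedy = re.search(r"a.*b", subject).group(0)

# Lazy: .*? takes as little as possible, stopping at the first "b".
lazy = re.search(r"a.*?b", subject).group(0)

print(greedy)  # ababab
print(lazy)    # ab
```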

Unicode character properties

Unicode defines several properties for each character. Patterns in PCRE2 can match these properties: e.g. \p{Ps}.*?\p{Pe} would match a string beginning with any "opening punctuation" and ending with any "close punctuation", such as [abc].

Matching of certain "normal" metacharacters can be driven by Unicode properties when the compile option PCRE2_UCP is set. The option can be set for a pattern by including (*UCP) at the start of the pattern. It alters the behavior of the following metacharacters: \B, \b, \D, \d, \S, \s, \W, \w, and some of the POSIX character classes. For example, the set of characters matched by \w (word characters) is expanded to include letters and accented letters as defined by Unicode properties. Such matching is slower than the normal (ASCII-only) non-UCP alternative. Note that the UCP option requires the library to have been built to include Unicode support (this is the default for PCRE2).

Very early versions of PCRE1 supported only ASCII. Later, UTF-8 support was added. Support for UTF-16 was added in version 8.30, and support for UTF-32 in version 8.32. PCRE2 has always supported all three UTF encodings.

Multiline matching

^ and $ can match at the beginning and end of a string only, or at the start and end of each "line" within the string, depending on what options are set.
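
As a rough illustration of the two anchoring modes, the sketch below uses Python's built-in re module as a stand-in engine (it is not PCRE, but its re.MULTILINE flag plays the role of the multiline option here). The sample text and variable names are made up for the example.

```python
import re

subject = "first line\nsecond line"

# Default: ^ anchors only at the very start of the subject string.
default_hits = re.findall(r"^\w+", subject)

# Multiline: ^ also anchors immediately after each newline.
multiline_hits = re.findall(r"^\w+", subject, flags=re.MULTILINE)

print(default_hits)    # ['first']
print(multiline_hits)  # ['first', 'second']
```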

Newline/linebreak options

When PCRE is compiled, a newline default is selected. The newline/linebreak convention in effect determines where PCRE detects line beginnings (^) and line ends ($) in multiline mode, as well as what the dot matches (regardless of multiline mode, unless the dotall option (?s) is set). It also affects PCRE's matching procedure (since version 7.0): when an unanchored pattern fails to match at the start of a newline sequence, PCRE advances past the entire newline sequence before retrying the match. If the newline option in effect includes CRLF as one of the valid linebreaks, PCRE does not skip the \n in a CRLF if the pattern contains specific \r or \n references (since version 7.3). Since version 8.10, the metacharacter \N always matches any character other than linebreak characters. It has the same behavior as . when the dotall option, (?s), is not in effect.

The newline option can be altered with external options when PCRE is compiled and when it is run. Some applications using PCRE provide users with the means to apply this setting through an external option. The newline option can also be stated at the start of the pattern using one of the following: (*LF), (*CR), (*CRLF), (*ANYCRLF), or (*ANY).

When not in UTF-8 mode, the linebreaks corresponding to (*ANY) can be matched with (?:\r\n?|\n|\x0B|\f|\x85) [Note 2] or \R.

In UTF-8 mode, two additional characters are recognized as line breaks with (*ANY): LS (line separator, U+2028) and PS (paragraph separator, U+2029).

On Windows, in non-Unicode data, some of the ANY linebreak characters have other meanings.

For example, \x85 can match a horizontal ellipsis, and if encountered while the ANY newline is in effect, it would trigger newline processing.
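
The explicit alternation quoted above for the ANY linebreaks can be exercised outside PCRE as well. The following sketch uses Python's built-in re module on a Python str, where \x85 denotes the code point U+0085; it only illustrates which sequences the alternation treats as a single linebreak, and says nothing about how PCRE itself handles newlines internally. The names ANY_LINEBREAK, text and lines are illustrative.

```python
import re

# The alternation quoted above: CRLF or lone CR, LF, VT, FF, and NEL (0x85).
# \r\n? is listed first so that a CRLF pair counts as one linebreak.
ANY_LINEBREAK = re.compile(r"(?:\r\n?|\n|\x0B|\f|\x85)")

text = "one\r\ntwo\nthree\rfour\x85five"
lines = ANY_LINEBREAK.split(text)

print(lines)  # ['one', 'two', 'three', 'four', 'five']
```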

See below for configuration and options concerning what matches backslash-R.

Backslash-R options

When PCRE is compiled, a default is selected for what matches \R. The default can be either to match the linebreaks corresponding to ANYCRLF or those corresponding to ANY. The default can be overridden when necessary by including (*BSR_UNICODE) or (*BSR_ANYCRLF) at the start of the pattern. When providing a (*BSR..) option, you can also provide a (*newline) option, e.g., (*BSR_UNICODE)(*ANY)rest-of-pattern. The backslash-R options also can be changed with external options by the application calling PCRE2, when a pattern is compiled.

Beginning of pattern options

Several options can be given at the beginning of a pattern: linebreak options such as (*LF), documented above; backslash-R options such as (*BSR_ANYCRLF), documented above; the Unicode character properties option (*UCP), documented above; and the (*UTF) option. If your PCRE2 library has been compiled with UTF support, you can specify (*UTF) at the beginning of a pattern instead of setting an external option to invoke UTF-8, UTF-16, or UTF-32 mode.

Backreferences

A pattern may refer back to the results of a previous match. For example, (a|b)c\1 would match either "aca" or "bcb" and would not match, for example, "acb".
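
The backreference example above can be checked with a short sketch. It uses Python's built-in re module, which is a different engine from PCRE but handles \1 in the same way for this pattern; the results dictionary is just a convenient way to show the three cases.

```python
import re

# \1 must repeat exactly the text that group 1 captured.
pattern = re.compile(r"(a|b)c\1")

results = {s: bool(pattern.fullmatch(s)) for s in ("aca", "bcb", "acb")}
print(results)  # {'aca': True, 'bcb': True, 'acb': False}
```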

Named subpatterns

A sub-pattern (surrounded by parentheses, like (...)) may be named by including a leading ?P<name> after the opening parenthesis. Named subpatterns are a feature that PCRE adopted from Python regular expressions.

This feature was subsequently adopted by Perl, so now named groups can also be defined using (?<name>...) or (?'name'...), as well as (?P<name>...). Named groups can be backreferenced with, for example: (?P=name) (Python syntax) or \k'name' (Perl syntax).
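
Since the (?P<name>...) syntax originated in Python, it can be illustrated directly with Python's built-in re module. This is only a sketch of the Python-style spellings (?P<name>...) and (?P=name); the alternative spellings listed above are not exercised here, and the group names year, month and word are invented for the example.

```python
import re

# (?P<year>...) and (?P<month>...) give the groups names.
date = re.match(r"(?P<year>\d{4})-(?P<month>\d{2})", "2024-02-16")
print(date.group("year"), date.group("month"))  # 2024 02

# (?P=word) is a backreference to the named group "word".
doubled = re.search(r"\b(?P<word>\w+) (?P=word)\b", "it was was late")
print(doubled.group("word"))  # was
```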

Subroutines

While a backreference provides a mechanism to refer to that part of the subject that has previously matched a subpattern, a subroutine provides a mechanism to reuse an underlying previously defined subpattern. The subpattern's options, such as case independence, are fixed when the subpattern is defined. (a.c)(?1) would match "aacabc" or "abcadc", whereas using a backreference (a.c)\1 would not, though both would match "aacaac" or "abcabc". PCRE also supports a non-Perl Oniguruma construct for subroutines. They are specified using \g<subpat-number> or \g<subpat-name>.
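
The difference between re-running a subpattern and re-matching captured text can be approximated in an engine without subroutine calls. The sketch below uses Python's built-in re module, which has no (?1) construct, so the subroutine call is emulated by repeating the subpattern text; this is an approximation for illustration, not PCRE's mechanism, and the names SUB, subroutine_like and backreference are hypothetical.

```python
import re

SUB = r"a.c"  # the subpattern that (?1) would re-run

# Approximation of (a.c)(?1): the subpattern itself is applied a second time.
subroutine_like = re.compile(rf"({SUB}){SUB}")

# Backreference: the second part must repeat the text captured by group 1.
backreference = re.compile(r"(a.c)\1")

outcomes = {
    s: (bool(subroutine_like.fullmatch(s)), bool(backreference.fullmatch(s)))
    for s in ("aacabc", "abcadc", "aacaac", "abcabc")
}
for subject, (as_subroutine, as_backref) in outcomes.items():
    print(subject, as_subroutine, as_backref)
```

The first two subjects are accepted only by the subroutine-style pattern, while the last two are accepted by both, in line with the description above.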

Atomic grouping

Atomic grouping is a way of preventing backtracking in a pattern. For example, a++bc (the possessive-quantifier shorthand for the atomic group (?>a+)bc) will match as many "a"s as possible and never back up to try one fewer.

Look-ahead and look-behind assertions

Look-behind and look-ahead assertions in Perl regular expressions

Assertion   Lookbehind      Lookahead
Positive    (?<=pattern)    (?=pattern)
Negative    (?<!pattern)    (?!pattern)

Patterns may assert that previous text or subsequent text contains a pattern without consuming matched text (zero-width assertion). For example, /\w+(?=\t)/ matches a word followed by a tab, without including the tab itself.
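
The zero-width nature of these assertions can be seen in a small sketch. It uses Python's built-in re module as a stand-in (not PCRE itself); the sample string and variable names are invented for the example.

```python
import re

subject = "name\tvalue"

# Lookahead: require a tab after the word without consuming it.
before_tab = re.search(r"\w+(?=\t)", subject)
print(repr(before_tab.group(0)))  # 'name'
print(before_tab.end())           # 4 (the tab at index 4 is not part of the match)

# Lookbehind: require a tab before the word without including it.
after_tab = re.search(r"(?<=\t)\w+", subject)
print(repr(after_tab.group(0)))   # 'value'
```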

Look-behind assertions cannot be of uncertain length, though (unlike Perl) each branch can be a different fixed length.

\K can be used in a pattern to reset the start of the current whole match. This provides a flexible alternative approach to look-behind assertions because the discarded part of the match (the part that precedes \K) need not be fixed in length.

Escape sequences for zero-width assertions

For example, \b matches zero-width "word boundaries", similar to (?<=\W)(?=\w)|(?<=\w)(?=\W)|^|$.

Comments

A comment begins with (?# and ends at the next closing parenthesis.
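
A short sketch of inline comments follows. It relies on Python's built-in re module, which is not PCRE but accepts the same (?#...) form; the pattern and comment text are made up for the example.

```python
import re

# Each (?#...) is ignored by the engine; it ends at the first ")".
pattern = re.compile(r"\d{4}(?# four-digit year)-\d{2}(?# two-digit month)")

matched = bool(pattern.fullmatch("2024-02"))
print(matched)  # True
```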

Recursive patterns

A pattern can refer back to itself recursively or to any subpattern. For example, the pattern \((a*|(?R))*\) will match any combination of balanced parentheses and "a"s.
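
Python's built-in re module does not support (?R), so the recursive pattern above cannot be run there directly. As a rough sketch of what the recursion expresses, the hand-written recognizer below accepts the same whole strings as \((a*|(?R))*\); it is a swapped-in recursive-descent function for illustration, not how PCRE implements recursion, and the names match_group and is_balanced are hypothetical.

```python
def match_group(s: str, i: int = 0) -> int:
    """Match one "(" ... ")" group starting at index i.

    Returns the index just past the closing ")", or -1 on failure.
    Inside the group, any mix of "a" characters and nested groups is allowed.
    """
    if i >= len(s) or s[i] != "(":
        return -1
    i += 1
    while i < len(s):
        if s[i] == "a":
            i += 1                      # the a* alternative
        elif s[i] == "(":
            i = match_group(s, i)       # the (?R) alternative: recurse
            if i < 0:
                return -1
        else:
            break
    if i < len(s) and s[i] == ")":
        return i + 1
    return -1


def is_balanced(s: str) -> bool:
    """True if the whole string is a single balanced group."""
    return match_group(s) == len(s)


print(is_balanced("(a(aa)())"))  # True
print(is_balanced("(a(a)"))      # False
```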

Generic callouts

PCRE expressions can embed (?Cn), where n is some number. This will call out to an external user-defined function through the PCRE API and can be used to embed arbitrary code in a pattern.

Differences from Perl

Differences between PCRE2 and Perl (as of Perl 5.9.4) include but are not limited to: [6]

Until release 10.30, recursive matches were atomic in PCRE and non-atomic in Perl

This meant that "<<!>!>!>><>>!>!>!>" =~ /^(<(?:[^<>]+|(?3)|(?1))*>)()(!>!>!>)$/ would match in Perl but not in PCRE2 until release 10.30.

The value of a capture buffer deriving from the ? quantifier (match 1 or 0 times) when nested in another quantified capture buffer is different

In Perl, "aba" =~ /^(a(b)?)+$/; will result in $1 containing "a" and $2 containing undef, but in PCRE it will result in $2 containing "b".

PCRE allows named capture buffers to be given numeric names; Perl requires the name to follow the rule of barewords

This means that \g{} is unambiguous in Perl, but potentially ambiguous in PCRE.

This is no longer a difference since PCRE 8.34 (released on 2013-12-15), which no longer allows group names to start with a digit. [7]

PCRE allows alternatives within lookbehind to be different lengths

Within lookbehind assertions, both PCRE and Perl require fixed-length patterns.

That is, both PCRE and Perl disallow variable-length patterns using quantifiers within lookbehind assertions.

However, Perl requires all alternative branches of a lookbehind assertion to be the same length as each other, whereas PCRE allows those alternative branches to have different lengths from each other as long as each branch still has a fixed length.

PCRE does not support certain "experimental" Perl constructs

These include (??{...}) (a callback whose return value is evaluated as part of the pattern) and the (?{}) construct, although the latter can be emulated using (?Cn).

Recursion control verbs added in the Perl 5.9.x series are also not supported.

Support for experimental backtracking control verbs (added in Perl 5.10) is available in PCRE since version 7.3.

They are (*FAIL), (*F), (*PRUNE), (*SKIP), (*THEN), (*COMMIT), and (*ACCEPT).

Perl's corresponding use of arguments with backtracking control verbs is not generally supported.

Note however that since version 8.10, PCRE supports the following verbs with a specified argument: (*MARK:markName), (*SKIP:markName), (*PRUNE:markName), and (*THEN:markName).

Since version 10.32 PCRE2 has supported (*ACCEPT:markName), (*FAIL:markName), and (*COMMIT:markName).

PCRE and Perl are slightly different in their tolerance of erroneous constructs

Perl allows quantifiers on the (?!...) construct, which is meaningless but harmless (albeit inefficient); PCRE produces an error in versions before 8.13.

PCRE has a hard limit on recursion depth, Perl does not

With default build options "bbbbXcXaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"=~ /.X(.+)+X/ will fail to match due to the limit, but Perl will match this correctly.

Perl uses the heap for recursion and has no hard limit for recursion depth, whereas PCRE2 has a compile-time default limit that can be adjusted up or down by the calling application.

Verification

With the exception of the above points, PCRE is capable of passing the tests in the Perl "t/op/re_tests" file, one of the main syntax-level regression tests for Perl's regular expression engine.

Notes and references

Notes

  1. The core PCRE2 library provides both matching and match and replace functionality.
  2. It is unclear whether the \x85 part should instead be \xC2\x85, i.e. (?:\r\n?|\n|\x0B|\f|\xC2\x85), since the code point U+0085 is not the single byte 0x85 in UTF-8.

    Caveat: If the pattern \xC2\x85 failed to work: experiment with the RegEx implementation's Unicode settings, or try substituting with the following:
    • \x{0085}
    • \u0085


References

  1. Final release of PCRE1: https://lists.exim.org/lurker/message/20210615.162400.c16ff8a3.en.html
  2. Releases: https://github.com/PCRE2Project/pcre2/releases
  3. Exim and PCRE: How free software hijacked my life (1999-12), by Philip Hazel, p. 7: https://www.ukuug.org/events/winter99/proc/PH.ps
    What about PCRE?
    • Written summer 1997, placed on ftp site.
    • People found it, and started a mailing list.
    • There has been a trickle of enhancements.
  4. PCRE2 - Perl-compatible regular expressions (revised API) (2020), by University of Cambridge: https://pcre.org/pcre2.txt
  5. Differences Between PCRE2 and Perl (2019-07-13), by Philip Hazel: https://www.pcre.org/current/doc/html/pcre2compat.html
  6. Quote PCRE changelog (https://www.pcre.org/original/changelog.txt): "Perl no longer allows group names to start with digits, so I have made this change also in PCRE."
  7. ChangeLog for PCRE2: https://www.pcre.org/changelog.txt
