RE2 (software)

RE2
Original author(s): Google
Initial release: March 11, 2010 [1]
Stable release: 2021-04-01 (April 1, 2021) [2]
Repository: github.com/google/re2
Written in: C++
Operating system: Cross-platform
Type: Pattern matching library
License: BSD
Website: github.com/google/re2

RE2 is a software library that implements a regular expression engine using finite automata, in contrast to almost all other regular expression libraries, which use backtracking implementations. It provides a C++ interface.

RE2 was implemented and used by Google. The library uses an "on-the-fly" deterministic finite-state automaton algorithm based on Ken Thompson's Plan 9 grep. [3]
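A minimal sketch of the library's C++ interface, using the static RE2::FullMatch and RE2::PartialMatch helpers (the pattern and input strings below are illustrative only):

    #include <iostream>
    #include <string>
    #include <re2/re2.h>

    int main() {
        // Whole-string match: the pattern must cover the entire input.
        bool whole = RE2::FullMatch("hello re2", "h.*2");

        // Substring match with one capturing group extracted into a std::string.
        std::string date;
        bool part = RE2::PartialMatch("release 2021-04-01", "(\\d{4}-\\d{2}-\\d{2})", &date);

        std::cout << whole << " " << part << " " << date << "\n";  // prints: 1 1 2021-04-01
        return 0;
    }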

Comparison to PCRE

RE2's performance is generally comparable to that of Perl Compatible Regular Expressions (PCRE), and for certain regular expression operators, such as | (alternation, or logical disjunction), it is faster. On the other hand, unlike PCRE, which supports features such as backreferences, RE2 can only recognize regular languages, due to its construction using the Thompson DFA [3] algorithm. It is also slightly slower than PCRE for parenthetic capturing operations.
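For illustration, a pattern that requires a backreference is rejected at compile time by RE2, whereas a backtracking engine such as PCRE accepts it (a sketch; the exact error text varies between RE2 versions):

    #include <iostream>
    #include <re2/re2.h>

    int main() {
        // "(a)\1" needs a backreference, which cannot be expressed as a
        // finite automaton, so RE2 refuses to compile the pattern.
        RE2 re("(a)\\1");
        if (!re.ok()) {
            std::cout << "rejected: " << re.error() << "\n";
        }
        return 0;
    }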

PCRE's backtracking can use a large recursive stack, with correspondingly high memory usage, and its runtime can grow exponentially on certain patterns. In contrast, RE2 uses a fixed stack size and guarantees that its runtime grows linearly, not exponentially, with the size of the input. The maximum memory RE2 allocates is configurable.
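The memory budget can be set through RE2::Options; the 1 MB limit in this sketch is an arbitrary example value:

    #include <iostream>
    #include <re2/re2.h>

    int main() {
        RE2::Options opts;
        opts.set_max_mem(1 << 20);  // limit memory for the compiled pattern to about 1 MB

        // With a very small budget, a sufficiently large pattern fails to
        // compile; at match time RE2 also keeps its DFA state cache within
        // the configured budget instead of growing without bound.
        RE2 re("(foo|bar|baz)+", opts);
        if (!re.ok()) {
            std::cout << "pattern rejected: " << re.error() << "\n";
            return 1;
        }

        std::cout << RE2::FullMatch("foobarbaz", re) << "\n";  // prints: 1
        return 0;
    }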

RE2 has a slightly smaller set of features than PCRE, but offers predictable runtime and a bounded maximum memory allotment. This can make it more suitable for server applications, which require bounds on memory usage and computational time. PCRE, on the other hand, supports features such as lookarounds, backreferences and recursion, but its runtime and memory usage are unpredictable and can grow without bound.

Adoption

Use in Google products

RE2 is used by Google products such as Gmail, Google Docs and Google Sheets. [4] The supported syntax is documented in the RE2 repository on GitHub.

In Google Sheets, it is used in the functions RegexMatch(), RegexReplace() and RegexExtract(), as well as in the find-and-replace feature. RegexExtract() does not use grouping.

The RE2 algorithm has been rewritten in Rust as the package "regex". Cloudflare's web application firewall uses this package because the RE2 algorithm is immune to ReDoS. [5]

Russ Cox also wrote RE1, an earlier regular expression engine based on a bytecode interpreter. [6] OpenResty uses a fork of RE1 called "sregex". [7]

See also

Regular expression
sed
String-searching algorithm
SNOBOL
grep
Pattern matching
Flex (lexical analyzer generator)
Glob (programming)
.htaccess
Perl Compatible Regular Expressions
Outline of computer programming
Raku rules
Parser Grammar Engine
Google Code Search
Comparison of regular expression engines
Regular expression denial of service
TRE (library)
Matching wildcards
RE/flex
Tagged deterministic finite automaton

References

  1. Cox, Russ (March 11, 2010). "RE2: a principled approach to regular expression matching". Google Open Source Blog. Retrieved 2020-05-29.
  2. "Releases". GitHub. Retrieved 2021-05-03.
  3. Cox, Russ. "Regular Expression Matching in the Wild". swtch.com.
  4. "Search and use find and replace". Retrieved 24 March 2020.
  5. "Making the WAF 40% faster". The Cloudflare Blog. 1 July 2020.
  6. "Regular Expression Matching: the Virtual Machine Approach". swtch.com.
  7. "openresty/sregex: A non-backtracking NFA/DFA-based Perl-compatible regex engine matching on large data streams". OpenResty. 6 February 2024.