RE2 (software)

Last updated
RE2
Original author(s) Google
Initial releaseMarch 11, 2010;14 years ago (2010-03-11) [1]
Stable release
2021-04-01 / April 1, 2021;3 years ago (2021-04-01) [2]
Repository
Written in C++
Operating system Cross-platform
Type Pattern matching library
License BSD
Website github.com/google/re2 OOjs UI icon edit-ltr-progressive.svg

RE2 is a software library for regular expressions via a finite-state machine using automata theory, in contrast to almost all other regular expression libraries, which use backtracking implementations. It provides a C++ interface.

Contents

RE2 was implemented and used by Google. The library uses an "on-the-fly" deterministic finite-state automaton algorithm based on Ken Thompson's Plan 9 grep. [3]

Comparison to PCRE

RE2 generally compares to Perl Compatible Regular Expressions (PCRE) in performance. For certain regular expression operators like | (logical disjunction or boolean "or") it exceeds PCRE. On the other hand, RE2 does not support back-references because the Thompson DFA [3] algorithm cannot implement those efficiently. It is also slightly slower than PCRE for parenthetic capturing operations.

PCRE can use a large recursive stack with corresponding high memory use and have exponential runtime on certain patterns. In contrast, RE2 uses a fixed stack and guarantees that run-time increases linearly (not exponentially) with the size of the input. The maximum memory allocated with RE2 is configurable.

RE2 has a slightly smaller set of features than PCRE, but has very predictable run-time and a maximum memory allotment. This makes it suitable for use in server applications, which require boundaries on memory usage and computational time. PCRE, on the other hand, has almost all of the features that a regular expression library can have, but has unpredictable run-time and memory usage and can grow unbounded.

Adoption

Use in Google products

RE2 is used by Google products like Gmail, Google Documents and Google Sheets. [4] See GitHub for a documentation of the syntax: RE2 syntax.

In Google Sheets, it is used in the functions RegexMatch(), RegexReplace(), RegexExtract() and the find and replace feature. RegexExtract(), does not use grouping.

The RE2 algorithm has been rewritten in Rust as the package "regex". CloudFlare's web application firewall uses this package because the RE2 algorithm is immune to ReDoS. [5]

Russ Cox also wrote RE1, an earlier regular expression based on a bytecode interpreter. [6] OpenResty uses a RE1 fork called "sregex". [7]

See also

Related Research Articles

<span class="mw-page-title-main">Regular expression</span> Sequence of characters that forms a search pattern

A regular expression, sometimes referred to as rational expression, is a sequence of characters that specifies a match pattern in text. Usually such patterns are used by string-searching algorithms for "find" or "find and replace" operations on strings, or for input validation. Regular expression techniques are developed in theoretical computer science and formal language theory.

sed Standard UNIX utility for editing streams of data

sed is a Unix utility that parses and transforms text, using a simple, compact programming language. It was developed from 1973 to 1974 by Lee E. McMahon of Bell Labs, and is available today for most operating systems. sed was based on the scripting features of the interactive editor ed and the earlier qed. It was one of the earliest tools to support regular expressions, and remains in use for text processing, most notably with the substitution command. Popular alternative tools for plaintext string manipulation and "stream editing" include AWK and Perl.

In computer science, string-searching algorithms, sometimes called string-matching algorithms, are an important class of string algorithms that try to find a place where one or several strings are found within a larger string or text.

grep is a command-line utility for searching plaintext datasets for lines that match a regular expression. Its name comes from the ed command g/re/p, which has the same effect. grep was originally developed for the Unix operating system, but later available for all Unix-like systems and some others such as OS-9.

In computer science, pattern matching is the act of checking a given sequence of tokens for the presence of the constituents of some pattern. In contrast to pattern recognition, the match usually has to be exact: "either it will or will not be a match." The patterns generally have the form of either sequences or tree structures. Uses of pattern matching include outputting the locations of a pattern within a token sequence, to output some component of the matched pattern, and to substitute the matching pattern with some other token sequence.

Parsing, syntax analysis, or syntactic analysis is the process of analyzing a string of symbols, either in natural language, computer languages or data structures, conforming to the rules of a formal grammar. The term parsing comes from Latin pars (orationis), meaning part.

Flex is a free and open-source software alternative to lex. It is a computer program that generates lexical analyzers . It is frequently used as the lex implementation together with Berkeley Yacc parser generator on BSD-derived operating systems, or together with GNU bison in *BSD ports and in Linux distributions. Unlike Bison, flex is not part of the GNU Project and is not released under the GNU General Public License, although a manual for Flex was produced and published by the Free Software Foundation.

In computer programming, glob patterns specify sets of filenames with wildcard characters. For example, the Unix Bash shell command mv *.txttextfiles/ moves all files with names ending in .txt from the current directory to the directory textfiles. Here, * is a wildcard and *.txt is a glob pattern. The wildcard * stands for "any string of any length including empty, but excluding the path separator characters ".

In computer science, a parsing expression grammar (PEG) is a type of analytic formal grammar, i.e. it describes a formal language in terms of a set of rules for recognizing strings in the language. The formalism was introduced by Bryan Ford in 2004 and is closely related to the family of top-down parsing languages introduced in the early 1970s. Syntactically, PEGs also look similar to context-free grammars (CFGs), but they have a different interpretation: the choice operator selects the first match in PEG, while it is ambiguous in CFG. This is closer to how string recognition tends to be done in practice, e.g. by a recursive descent parser.

Perl Compatible Regular Expressions (PCRE) is a library written in C, which implements a regular expression engine, inspired by the capabilities of the Perl programming language. Philip Hazel started writing PCRE in summer 1997. PCRE's syntax is much more powerful and flexible than either of the POSIX regular expression flavors and than that of many other regular-expression libraries.

Raku rules are the regular expression, string matching and general-purpose parsing facility of the Raku programming language, and are a core part of the language. Since Perl's pattern-matching constructs have exceeded the capabilities of formal regular expressions for some time, Raku documentation refers to them exclusively as regexes, distancing the term from the formal definition.

Google Code Search was a free beta product from Google which debuted in Google Labs on October 5, 2006, allowing web users to search for open-source code on the Internet. Features included the ability to search using operators, namely lang:, package:, license:, and file:.

A syntactic predicate specifies the syntactic validity of applying a production in a formal grammar and is analogous to a semantic predicate that specifies the semantic validity of applying a production. It is a simple and effective means of dramatically improving the recognition strength of an LL parser by providing arbitrary lookahead. In their original implementation, syntactic predicates had the form “( α )?” and could only appear on the left edge of a production. The required syntactic condition α could be any valid context-free grammar fragment.

STklos is a Scheme implementation that succeeded STk. It is a bytecode compiler with an ad hoc virtual machine which aims to be fast as well as light.

This is a comparison of regular expression engines.

A regular expression denial of service (ReDoS) is an algorithmic complexity attack that produces a denial-of-service by providing a regular expression and/or an input that takes a long time to evaluate. The attack exploits the fact that many regular expression implementations have super-linear worst-case complexity; on certain regex-input pairs, the time taken can grow polynomially or exponentially in relation to the input size. An attacker can thus cause a program to spend substantial time by providing a specially crafted regular expression and/or input. The program will then slow down or become unresponsive.

The structure of the Perl programming language encompasses both the syntactical rules of the language and the general ways in which programs are organized. Perl's design philosophy is expressed in the commonly cited motto "there's more than one way to do it". As a multi-paradigm, dynamically typed language, Perl allows a great degree of flexibility in program design. Perl also encourages modularization; this has been attributed to the component-based design structure of its Unix roots, and is responsible for the size of the CPAN archive, a community-maintained repository of more than 100,000 modules.

TRE is an open-source library for pattern matching in text, which works like a regular expression engine with the ability to do approximate string matching. It was developed by Ville Laurikari and is distributed under a 2-clause BSD-like license.

In computer science, an algorithm for matching wildcards is useful in comparing text strings that may contain wildcard syntax. Common uses of these algorithms include command-line interfaces, e.g. the Bourne shell or Microsoft Windows command-line or text editor or file manager, as well as the interfaces for some search engines and databases. Wildcard matching is a subset of the problem of matching regular expressions and string matching in general.

RE/flex is a free and open source computer program written in C++ that generates fast lexical analyzers in C++. RE/flex offers full Unicode support, indentation anchors, word boundaries, lazy quantifiers, and performance tuning options. RE/flex accepts Flex lexer specifications and offers options to generate scanners for Bison parsers. RE/flex includes a fast C++ regular expression library.

References

  1. Cox, Russ (March 11, 2010). "RE2: a principled approach to regular expression matching". Google Open Source Blog. Retrieved 2020-05-29.
  2. "Releases". Github. Retrieved 2021-05-03.
  3. 1 2 Cox, Russ. "Regular Expression Matching in the Wild". swtch.com.
  4. "Search and use find and replace" . Retrieved 24 March 2020.
  5. "Making the WAF 40% faster". The Cloudflare Blog. 1 July 2020.
  6. "Regular Expression Matching: the Virtual Machine Approach". swtch.com.
  7. "openresty/sregex: A non-backtracking NFA/DFA-based Perl-compatible regex engine matching on large data streams". OpenResty. 6 February 2024.