RE2 (software)

Last updated
RE2
Original author(s) Google
Initial releaseMarch 11, 2010;14 years ago (2010-03-11) [1]
Stable release
2021-04-01 / April 1, 2021;3 years ago (2021-04-01) [2]
Repository
Written in C++
Operating system Cross-platform
Type Pattern matching library
License BSD
Website github.com/google/re2 OOjs UI icon edit-ltr-progressive.svg

RE2 is a software library which implements a regular expression engine. It uses finite-state machines, in contrast to most other regular expression libraries. RE2 supports a C++ interface.

Contents

RE2 was implemented by Google and Google uses RE2 for Google products. [3] RE2 uses an "on-the-fly" deterministic finite-state automaton algorithm based on Ken Thompson's Plan 9 grep. [4]

Comparison to PCRE

RE2 performs comparably to Perl Compatible Regular Expressions (PCRE). For certain regular expression operators like | (the operator for alternation or logical disjunction) it is superior to PCRE. Unlike PCRE, which supports features such as lookarounds, backreferences and recursion, RE2 is only able to recognize regular languages due to its construction using the Thompson DFA [4] algorithm. It is also slightly slower than PCRE for parenthetic capturing operations.

PCRE can use a large recursive stack with corresponding high memory usage and result in exponential runtime on certain patterns. In contrast, RE2 uses a fixed stack size and guarantees that its runtime increases linearly (not exponentially) with the size of the input. The maximum memory allocated with RE2 is configurable. This can make it more suitable for use in server applications, which require boundaries on memory usage and computational time.

Adoption

Use in Google products

RE2 is available to users of Google Docs and Google Sheets. [5] Google Sheets supports RE2 except Unicode character class matching. [6] RegexExtract does not use grouping.

Use in Go

The built-in "regexp" package uses the same patterns and implementation as RE2, though it is written in Go. [7] This is unsurprising, given Go's common staff from the Plan 9 team.

The RE2 algorithm has been rewritten in Rust as the package "regex". CloudFlare's web application firewall uses this package because the RE2 algorithm is immune to ReDoS. [8]

Russ Cox also wrote RE1, an earlier regular expression based on a bytecode interpreter. [9] OpenResty uses a RE1 fork called "sregex". [10]

See also

Related Research Articles

<span class="mw-page-title-main">Regular expression</span> Sequence of characters that forms a search pattern

A regular expression, sometimes referred to as rational expression, is a sequence of characters that specifies a match pattern in text. Usually such patterns are used by string-searching algorithms for "find" or "find and replace" operations on strings, or for input validation. Regular expression techniques are developed in theoretical computer science and formal language theory.

grep is a command-line utility for searching plaintext datasets for lines that match a regular expression. Its name comes from the ed command g/re/p, which has the same effect. grep was originally developed for the Unix operating system, but later became available for all Unix-like systems and some others such as OS-9.

agrep is an open-source approximate string matching program, developed by Udi Manber and Sun Wu between 1988 and 1991, for use with the Unix operating system. It was later ported to OS/2, DOS, and Windows.

Flex is a free and open-source software alternative to lex. It is a computer program that generates lexical analyzers . It is frequently used as the lex implementation together with Berkeley Yacc parser generator on BSD-derived operating systems, or together with GNU bison in *BSD ports and in Linux distributions. Unlike Bison, flex is not part of the GNU Project and is not released under the GNU General Public License, although a manual for Flex was produced and published by the Free Software Foundation.

In computer programming, glob patterns specify sets of filenames with wildcard characters. For example, the Unix Bash shell command mv *.txttextfiles/ moves all files with names ending in .txt from the current directory to the directory textfiles. Here, * is a wildcard and *.txt is a glob pattern. The wildcard * stands for "any string of any length including empty, but excluding the path separator characters ".

<span class="mw-page-title-main">Deterministic finite automaton</span> Finite-state machine

In the theory of computation, a branch of theoretical computer science, a deterministic finite automaton (DFA)—also known as deterministic finite acceptor (DFA), deterministic finite-state machine (DFSM), or deterministic finite-state automaton (DFSA)—is a finite-state machine that accepts or rejects a given string of symbols, by running through a state sequence uniquely determined by the string. Deterministic refers to the uniqueness of the computation run. In search of the simplest models to capture finite-state machines, Warren McCulloch and Walter Pitts were among the first researchers to introduce a concept similar to finite automata in 1943.

Perl Compatible Regular Expressions (PCRE) is a library written in C, which implements a regular expression engine, inspired by the capabilities of the Perl programming language. Philip Hazel started writing PCRE in summer 1997. PCRE's syntax is much more powerful and flexible than either of the POSIX regular expression flavors and than that of many other regular-expression libraries.

The Multidimensional hierarchical toolkit or Multi-Dimensional and Hierarchical (MDH) Database Toolkit is a Linux-based, open-sourced, toolkit of portable software that supports very fast, flexible, multi-dimensional and hierarchical storage, retrieval and manipulation of information in databases ranging in size up to 256 terabytes. The package is written in C and C++ and is available under the GNU GPL/LGPL/Free Documentation licenses in source code form. The distribution kit contains demonstration implementations of network-capable, interactive text and sequence retrieval tools that function with very large genomic data bases and illustrate the toolkit's capability to manipulate massive data sets of genomic information.

In computer programming, leaning toothpick syndrome (LTS) is the situation in which a quoted expression becomes unreadable because it contains a large number of escape characters, usually backslashes ("\"), to avoid delimiter collision.

Raku rules are the regular expression, string matching and general-purpose parsing facility of the Raku programming language, and are a core part of the language. Since Perl's pattern-matching constructs have exceeded the capabilities of formal regular expressions for some time, Raku documentation refers to them exclusively as regexes, distancing the term from the formal definition.

The Parser Grammar Engine is a compiler and runtime system for Raku rules for the Parrot virtual machine. PGE uses these rules to convert a parsing expression grammar into Parrot bytecode. It is therefore compiling rules into a program, unlike most virtual machines and runtimes, which store regular expressions in a secondary internal format that is then interpreted at runtime by a regular expression engine. The rules format used by PGE can express any regular expression and most formal grammars, and as such it forms the first link in the compiler chain for all of Parrot's front-end languages.

Google Code Search was a free beta product from Google which debuted in Google Labs on October 5, 2006, allowing web users to search for open-source code on the Internet. Features included the ability to search using operators, namely lang:, package:, license:, and file:.

This is a comparison of regular expression engines.

BSON is a computer data interchange format. The name "BSON" is based on the term JSON and stands for "Binary JSON". It is a binary form for representing simple or complex data structures including associative arrays, integer indexed arrays, and a suite of fundamental scalar types. BSON originated in 2009 at MongoDB. Several scalar data types are of specific interest to MongoDB and the format is used both as a data storage and network transfer format for the MongoDB database, but it can be used independently outside of MongoDB. Implementations are available in a variety of languages such as C, C++, C#, D, Delphi, Erlang, Go, Haskell, Java, JavaScript, Julia, Lua, OCaml, Perl, PHP, Python, Ruby, Rust, Scala, Smalltalk, and Swift.

A regular expression denial of service (ReDoS) is an algorithmic complexity attack that produces a denial-of-service by providing a regular expression and/or an input that takes a long time to evaluate. The attack exploits the fact that many regular expression implementations have super-linear worst-case complexity; on certain regex-input pairs, the time taken can grow polynomially or exponentially in relation to the input size. An attacker can thus cause a program to spend substantial time by providing a specially crafted regular expression and/or input. The program will then slow down or become unresponsive.

TRE is an open-source library for pattern matching in text, which works like a regular expression engine with the ability to do approximate string matching. It was developed by Ville Laurikari and is distributed under a 2-clause BSD-like license.

re2c is a free and open-source lexer generator for C, C++, D, Go, Haskell, Java, JavaScript, OCaml, Python, Rust, V and Zig. It compiles declarative regular expression specifications to deterministic finite automata. Originally written by Peter Bumbulis and described in his paper, re2c was put in public domain and has been since maintained by volunteers. It is the lexer generator adopted by projects such as PHP, SpamAssassin, Ninja build system and others. Together with the Lemon parser generator, re2c is used in BRL-CAD. This combination is also used with STEPcode, an implementation of ISO 10303 standard.

In computer science, an algorithm for matching wildcards is useful in comparing text strings that may contain wildcard syntax. Common uses of these algorithms include command-line interfaces, e.g. the Bourne shell or Microsoft Windows command-line or text editor or file manager, as well as the interfaces for some search engines and databases. Wildcard matching is a subset of the problem of matching regular expressions and string matching in general.

RE/flex is a free and open source computer program written in C++ that generates fast lexical analyzers in C++. RE/flex offers full Unicode support, indentation anchors, word boundaries, lazy quantifiers, and performance tuning options. RE/flex accepts Flex lexer specifications and offers options to generate scanners for Bison parsers. RE/flex includes a fast C++ regular expression library.

In the automata theory, a tagged deterministic finite automaton (TDFA) is an extension of deterministic finite automaton (DFA). In addition to solving the recognition problem for regular languages, TDFA is also capable of submatch extraction and parsing. While canonical DFA can find out if a string belongs to the language defined by a regular expression, TDFA can also extract substrings that match specific subexpressions. More generally, TDFA can identify positions in the input string that match tagged positions in a regular expression.

References

  1. Cox, Russ (March 11, 2010). "RE2: a principled approach to regular expression matching". Google Open Source Blog. Retrieved 2020-05-29.
  2. "Releases". Github. Retrieved 2021-05-03.
  3. "Search and use find and replace: Find and replace items using regular expressions". support.google.com. Retrieved 30 November 2024.
  4. 1 2 Cox, Russ. "Regular Expression Matching in the Wild". swtch.com.
  5. "Search and use find and replace" . Retrieved 24 March 2020.
  6. "RegMatch".
  7. "regexp package - regexp - Go Packages" . Retrieved 8 Nov 2024.
  8. "Making the WAF 40% faster". The Cloudflare Blog. 1 July 2020.
  9. "Regular Expression Matching: the Virtual Machine Approach". swtch.com.
  10. "openresty/sregex: A non-backtracking NFA/DFA-based Perl-compatible regex engine matching on large data streams". OpenResty. 6 February 2024.