This article has multiple issues. Please help improve it or discuss these issues on the talk page . (Learn how and when to remove these messages)
|
Original author(s) | |
---|---|
Initial release | March 11, 2010 [1] |
Stable release | 2021-04-01 / April 1, 2021 [2] |
Repository | |
Written in | C++ |
Operating system | Cross-platform |
Type | Pattern matching library |
License | BSD |
Website | github |
RE2 is a software library which implements a regular expression engine via finite-state machines using automata theory, in contrast to almost all other regular expression libraries, which use backtracking implementations. It provides a C++ interface.
RE2 was implemented and used by Google. The library uses an "on-the-fly" deterministic finite-state automaton algorithm based on Ken Thompson's Plan 9 grep. [3]
RE2 generally compares to Perl Compatible Regular Expressions (PCRE) in performance. For certain regular expression operators like |
(the operator for alternation or logical disjunction) it exceeds PCRE. On the other hand, unlike PCRE which supports features such as backreferences, RE2 is only able to recognize regular languages due to its construction using the Thompson DFA [3] algorithm. It is also slightly slower than PCRE for parenthetic capturing operations.
PCRE can use a large recursive stack with corresponding high memory usage and result in exponential runtime on certain patterns. In contrast, RE2 uses a fixed stack size and guarantees that its runtime increases linearly (not exponentially) with the size of the input. The maximum memory allocated with RE2 is configurable.
RE2 has a slightly smaller set of features than PCRE, but has very predictable run-time and a maximum memory allotment. This can make it more suitable for use in server applications, which require boundaries on memory usage and computational time. PCRE, on the other hand supports more features such as lookarounds, backreferences and recursion, but has unpredictable runtime and memory usage which can grow unbounded.
RE2 is used by Google products like Gmail, Google Documents and Google Sheets. [4] See GitHub for a documentation of the syntax: RE2 syntax.
In Google Sheets, it is used in the functions RegexMatch(), RegexReplace(), RegexExtract() and the find and replace feature. RegexExtract(), does not use grouping.
The built-in "regexp" package uses the same patterns and implementation as RE2, though it is written in Go. [5] This is unsurprising, given Go's background from Google and common staff from the Plan 9 team.
The RE2 algorithm has been rewritten in Rust as the package "regex". CloudFlare's web application firewall uses this package because the RE2 algorithm is immune to ReDoS. [6]
Russ Cox also wrote RE1, an earlier regular expression based on a bytecode interpreter. [7] OpenResty uses a RE1 fork called "sregex". [8]
A regular expression, sometimes referred to as rational expression, is a sequence of characters that specifies a match pattern in text. Usually such patterns are used by string-searching algorithms for "find" or "find and replace" operations on strings, or for input validation. Regular expression techniques are developed in theoretical computer science and formal language theory.
sed is a Unix utility that parses and transforms text, using a simple, compact programming language. It was developed from 1973 to 1974 by Lee E. McMahon of Bell Labs, and is available today for most operating systems. sed was based on the scripting features of the interactive editor ed and the earlier qed. It was one of the earliest tools to support regular expressions, and remains in use for text processing, most notably with the substitution command. Popular alternative tools for plaintext string manipulation and "stream editing" include AWK and Perl.
SNOBOL is a series of programming languages developed between 1962 and 1967 at AT&T Bell Laboratories by David J. Farber, Ralph Griswold and Ivan P. Polonsky, culminating in SNOBOL4. It was one of a number of text-string-oriented languages developed during the 1950s and 1960s; others included COMIT and TRAC.
grep
is a command-line utility for searching plaintext datasets for lines that match a regular expression. Its name comes from the ed command g/re/p
, which has the same effect. grep
was originally developed for the Unix operating system, but later became available for all Unix-like systems and some others such as OS-9.
In computer science, pattern matching is the act of checking a given sequence of tokens for the presence of the constituents of some pattern. In contrast to pattern recognition, the match usually has to be exact: "either it will or will not be a match." The patterns generally have the form of either sequences or tree structures. Uses of pattern matching include outputting the locations of a pattern within a token sequence, to output some component of the matched pattern, and to substitute the matching pattern with some other token sequence.
agrep is an open-source approximate string matching program, developed by Udi Manber and Sun Wu between 1988 and 1991, for use with the Unix operating system. It was later ported to OS/2, DOS, and Windows.
Flex is a free and open-source software alternative to lex. It is a computer program that generates lexical analyzers . It is frequently used as the lex implementation together with Berkeley Yacc parser generator on BSD-derived operating systems, or together with GNU bison in *BSD ports and in Linux distributions. Unlike Bison, flex is not part of the GNU Project and is not released under the GNU General Public License, although a manual for Flex was produced and published by the Free Software Foundation.
In computer programming, glob patterns specify sets of filenames with wildcard characters. For example, the Unix Bash shell command mv *.txttextfiles/
moves all files with names ending in .txt
from the current directory to the directory textfiles
. Here, *
is a wildcard and *.txt
is a glob pattern. The wildcard *
stands for "any string of any length including empty, but excluding the path separator characters ".
Perl Compatible Regular Expressions (PCRE) is a library written in C, which implements a regular expression engine, inspired by the capabilities of the Perl programming language. Philip Hazel started writing PCRE in summer 1997. PCRE's syntax is much more powerful and flexible than either of the POSIX regular expression flavors and than that of many other regular-expression libraries.
In computer programming, leaning toothpick syndrome (LTS) is the situation in which a quoted expression becomes unreadable because it contains a large number of escape characters, usually backslashes ("\"), to avoid delimiter collision.
Raku rules are the regular expression, string matching and general-purpose parsing facility of the Raku programming language, and are a core part of the language. Since Perl's pattern-matching constructs have exceeded the capabilities of formal regular expressions for some time, Raku documentation refers to them exclusively as regexes, distancing the term from the formal definition.
Google Code Search was a free beta product from Google which debuted in Google Labs on October 5, 2006, allowing web users to search for open-source code on the Internet. Features included the ability to search using operators, namely lang:, package:, license:, and file:.
This is a comparison of regular expression engines.
The following is a comparison of TeX editors.
A regular expression denial of service (ReDoS) is an algorithmic complexity attack that produces a denial-of-service by providing a regular expression and/or an input that takes a long time to evaluate. The attack exploits the fact that many regular expression implementations have super-linear worst-case complexity; on certain regex-input pairs, the time taken can grow polynomially or exponentially in relation to the input size. An attacker can thus cause a program to spend substantial time by providing a specially crafted regular expression and/or input. The program will then slow down or become unresponsive.
TRE is an open-source library for pattern matching in text, which works like a regular expression engine with the ability to do approximate string matching. It was developed by Ville Laurikari and is distributed under a 2-clause BSD-like license.
Racket has been under active development as a vehicle for programming language research since the mid-1990s, and has accumulated many features over the years. This article describes and demonstrates some of these features. Note that one of Racket's main design goals is to accommodate creating new programming languages, both domain-specific languages and completely new languages. Therefore, some of the following examples are in different languages, but they are all implemented in Racket. Please refer to the main article for more information.
re2c is a free and open-source lexer generator for C, C++, Go, and Rust. It compiles declarative regular expression specifications to deterministic finite automata. Originally written by Peter Bumbulis and described in his paper, re2c was put in public domain and has been since maintained by volunteers. It is the lexer generator adopted by projects such as PHP, SpamAssassin, Ninja build system and others. Together with the Lemon parser generator, re2c is used in BRL-CAD. This combination is also used with STEPcode, an implementation of ISO 10303 standard.
In computer science, an algorithm for matching wildcards is useful in comparing text strings that may contain wildcard syntax. Common uses of these algorithms include command-line interfaces, e.g. the Bourne shell or Microsoft Windows command-line or text editor or file manager, as well as the interfaces for some search engines and databases. Wildcard matching is a subset of the problem of matching regular expressions and string matching in general.
RE/flex is a free and open source computer program written in C++ that generates fast lexical analyzers in C++. RE/flex offers full Unicode support, indentation anchors, word boundaries, lazy quantifiers, and performance tuning options. RE/flex accepts Flex lexer specifications and offers options to generate scanners for Bison parsers. RE/flex includes a fast C++ regular expression library.