Comparison of regular expression engines

This is a comparison of regular expression engines.

Libraries

List of regular expression libraries
Name | Official website | Programming language | Software license | Used by
Boost.Regex [Note 1] | Boost C++ Libraries | C++ | Boost | Notepad++ >= 6.0.0, EmEditor
Boost.Xpressive | Boost C++ Libraries | C++ | Boost
DEELX | RegExLab | C++ | Proprietary
FREJ [Note 2] | Fuzzy Regular Expressions for Java | Java | LGPL
GLib/GRegex [Note 3] | GLib reference manual | C | LGPL
GNU regex | Gnulib reference manual | C | LGPL | GNU libc, GNU programs
GRETA | Microsoft Research | C++ | Proprietary
Gregex | Grovf Inc. | RTL, HLS | Proprietary | FPGA accelerated >100 Gbit/s regex engine for cybersecurity, financial, e-commerce industries.
Hyperscan | Intel | C, x86-specific assembly (SSSE3+ [1]) | 3-clause BSD | Rspamd
ICU | International Components for Unicode | C, C++ [Note 4] | ICU | Foundation (Apple and Swift open-source versions)
Jakarta Regexp | The Apache Jakarta Project | Java | Apache
java.util.regex | Java's User manual | Java | GNU GPLv2 with Classpath exception | jEdit
JRegex | JRegex | Java | BSD
MATLAB | Regular Expressions | MATLAB Language | Proprietary
Oniguruma | Kosako | C | BSD | Atom, Take Command Console, Tera Term, TextMate, Sublime Text, SubEthaEdit, EmEditor and jq
Pattwo | Stevesoft | Java (compatible with Java 1.0) | LGPL
PCRE | pcre.org | C, C++ [Note 5] | BSD | Apache HTTP Server, Nginx, BBEdit, Edbrowse, Julia, HHVM, Notepad++ < 6.0.0, PHP, Delphi, R, Exim, SWI-Prolog
Qt/QRegExp | Digia (archived 2013-12-12 at the Wayback Machine) | C++ | Qt GNU GPL v. 3.0, Qt GNU LGPL v. 2.1, Qt Commercial | Kate, Kile
regex - Henry Spencer's regular expression libraries | ArgList | C | BSD
RE2 | RE2 | C++ | BSD | Go, Google Sheets, Gmail, G Suite
Henry Spencer's Advanced Regular Expressions | Tcl | C | BSD
RGX | RGX | C++ based component library | P6R
RXP | Titan IC | RTL | Proprietary | hardware-accelerated search acceleration using RegEx available for ASIC, FPGA and cloud. Enables massively parallel content processing at ultra-high speeds.
SubReg | Matt Bucknall | C | MIT
TPerlRegEx | TPerlRegEx VCL Component | Object Pascal | MPLv1.1
TRE [Note 2] | Ville Laurikari | C | BSD | musl
TRegExpr | TRegExpr, documentation, (RegExp Studio) | Object Pascal | Dual-license: freeware, or LGPL with static linking exception | Total Commander
Wolfram Language (Mathematica) | Wolfram Language Documentation Center | Wolfram Language | Proprietary | Mathematica, the Wolfram Development Platform
XRegExp | XRegExp | JavaScript | MIT
  1. Formerly called Regex++.
  2. One of the fuzzy regular expression engines.
  3. Included since version 2.13.0.
  4. ICU4J, the Java version, does not support regular expressions.
  5. C++ bindings were developed by Google and became officially part of PCRE in 2006.

Languages

List of languages and frameworks including regular expression support
Language | Official website | Software license | Remarks
ActionScript 3 | ActionScript Technology Center | Free
APL (APLX, Dyalog, GNU) | APL Wiki | Licensed by the respective implementation | ⎕SS (PCRE), ⎕R/⎕S (PCRE), ⎕SS (PCRE2), respectively
C++11 (C++) | C++ standards website | Licensed by the respective implementation | Since ISO/IEC 14882:2011(E); similar to ECMAScript by default (Grammar Description)
D | D | Boost Software License [Note 1]
Free Pascal (Object Pascal) | freepascal.org | LGPL with static linking exception | Free Pascal 2.6+ ships with TRegExpr from Sorokin and two other regular expression libraries; see wiki.lazarus.freepascal.org/Regexpr.
Go | Golang.org | BSD-style
Haskell | Haskell.org | BSD3 | Omitted in the language report, and in GHC's Hierarchical Libraries
Java | Java | GNU General Public License | REs are written as strings in source code: all backslashes must be doubled, harming readability.
JavaScript (ECMAScript) | ECMA-262 | BSD3 | Limited, but REs are first-class citizens of the language with a specific /.../mod syntax.
Julia | JuliaLang.org | MIT License | REs are part of the language core library, which uses PCRE; an optional wrapper for ICU (C code) is available.
Lua | Lua.org | MIT License | Uses a simplified, limited dialect; can be bound to a more powerful library, like PCRE, or an alternative parser like LPeg.
Mathematica | Wolfram | Proprietary
.NET | MSDN | MIT License [Note 2] [Note 3]
Nim | nim-lang.org | MIT License | Standard library includes PCRE-based re and nre modules, as well as various alternatives (e.g. strutils, pegs (Parsing Expression Grammar matching), strscans, parseutils).
OCaml | Caml | LGPL | As of 2010, the standard module is generally regarded as deprecated; [2] often recommended libraries are pcre (with full support for PCRE) and re (which is not as complete but claims better performance and provides frontends to popular syntaxes: PCRE, Perl, POSIX, Emacs, shell globbing).
Perl | Perl.com | Artistic License, or GNU General Public License | Full, central part of the language
PHP | PHP.net | PHP License | Has two implementations; PCRE is the more efficient in speed and functions
POSIX C (C) | POSIX.1 web publication | Licensed by the respective implementation | Supports POSIX BRE and ERE syntax
Python | python.org | Python Software Foundation License | Python has two major implementations: the built-in re module and the regex library.
Ruby | ruby-doc.org | GNU Library General Public License | Ruby 1.8, Ruby 1.9, and Ruby 2.0 and later versions use different engines; Ruby 1.9 integrates Oniguruma, Ruby 2.0 and later integrate Onigmo, a fork of Oniguruma.
Rust | docs.rs | MIT License | The primary regex crate does not allow look-around expressions. There is an Oniguruma binding called onig that does.
SAP ABAP | SAP.com | Proprietary
Tcl | tcl.tk | Tcl/Tk License (BSD-style) | Tcl library doubles as a regular expression library.
Wolfram Language | Wolfram Research | Proprietary: usable for free on a limited scale on the Wolfram Development platform
XML Schema | W3C | Licensed by the respective implementation
XPath 3/XQuery | W3C | Licensed by the respective implementation
  1. "STD.regex - D Programming Language - Digital Mars".
  2. "Dotnet/Corefx". GitHub. 16 February 2022.
  3. "Dotnet/Corefx". GitHub. 16 February 2022.

Language features

NOTE: An application that uses a library for regular expression support does not necessarily expose the library's full feature set. For example, GNU grep uses PCRE but supports no lookahead, even though PCRE itself does.

Part 1

Language feature comparison (part 1)
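Engine | "+" quantifier | Negated character classes | Non-greedy quantifiers [Note 1] | Shy groups [Note 2] | Recursion | Look-ahead | Look-behind | Backreferences [Note 3] | >9 indexable captures
Boost.Regex | Yes | Yes | Yes | Yes | Yes [Note 4] | Yes | Yes | Yes | Yes
Boost.Xpressive | Yes | Yes | Yes | Yes | Yes [Note 5] | Yes | Yes | Yes | Yes
CL-PPCRE | Yes | Yes | Yes | Yes | No | Yes | Yes | Yes | Yes
EmEditor | Yes | Yes | Yes | Yes | No | Yes | Yes | Yes | No
FREJ | No [Note 6] | No | Some [Note 6] | Yes | No | No | No | Yes | Yes
GLib/GRegex | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes
GNU grep | Yes | Yes | Yes | Yes | No | Yes | Yes | Yes
Haskell | Yes | Yes | Yes | Yes | No | Yes | Yes | Yes | Yes
RXP | Yes | Yes | Yes | Yes | No | No | No | Yes | Yes
ICU Regex | Yes | Yes | Yes | Yes | No | Yes | Yes | Yes | Yes
Java | Yes | Yes | Yes | Yes | No | Yes | Yes | Yes | Yes
JavaScript (ECMAScript) | Yes | Yes | Yes | Yes | No | Yes | Yes [Note 7] | Yes | Yes
JGsoft | Yes | Yes | Yes | Yes | Yes [3] | Yes | Yes | Yes | Yes
Lua | Yes | Yes | Some [Note 8] | No | No | No | No | Yes | No
.NET | Yes | Yes | Yes | Yes | No | Yes | Yes | Yes | Yes
OCaml | Yes | Yes | No | No | No | No | No | Yes | No
PCRE | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes
Perl | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes
PHP | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes
Python | Yes | Yes | Yes | Yes | Yes [Note 9] | Yes | Yes | Yes | Yes
Qt/QRegExp | Yes | Yes | Yes | Yes | No | Yes | No | Yes | Yes
RE2 | Yes | Yes | Yes | Yes | No | No | No | No | Yes
Ruby, Onigmo | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes
TRE | Yes | Yes | Yes | Yes | No | No | No | Yes | No
Vim | Yes | Yes | Yes | Yes | No | Yes | Yes | Yes | No
RGX | Yes | Yes | Yes | Yes | No | Yes | Yes | Yes | Yes
Tcl | Yes | Yes | Yes | Yes | No | Yes | Yes | Yes | Yes
TRegExpr | Yes | ? | Yes | ? | ? | ? | ? | ? | ?
XML Schema | Yes | Yes | No | No | No | No | No
XPath 3/XQuery | Yes | Yes | Yes | Yes | No | No | No | Yes | Yes
XRegExp | Yes | Yes | Yes | Yes | No | Yes | Yes [Note 7] | Yes | Yes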
  1. Non-greedy quantifiers match as few characters as possible, instead of the default of as many as possible. Many older, pre-POSIX engines were non-greedy and had no greedy quantifiers at all.
  2. Shy groups, also called non-capturing groups, cannot be referred to with backreferences; they are used to speed up matching where the group's content does not need to be accessed later.
  3. Backreferences enable referring to previously matched groups in later parts of the regex and/or replacement string (where applicable). For instance, ([ab]+)\1 matches "abab" but not "abaab".
  4. "Perl Regular Expression Syntax - 1.47.0".
  5. "User's Guide - 1.47.0".
  6. FREJ has no repetition quantifiers, but it has an "optional" element that behaves like a simple "?" quantifier.
  7. As of ES2018.
  8. Lua's only non-greedy quantifier is -, which is a non-greedy version of *. It does not have non-greedy versions of + or ?; in the former case, the non-greedy effect can be achieved by repeating the token followed by -, but in the latter case, there is no equivalent.
  9. Supported by the optional regex library only.
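
The first three notes above can be illustrated with a short sketch. It uses Python's built-in re module, which is only one of the engines in the table; the sample strings other than "abab" and "abaab" are made up for illustration, and other engines may spell or support these features differently.

```python
import re

# Greedy vs. non-greedy: ".+" takes as much as possible, ".+?" as little.
text = "<a><b>"
greedy = re.match(r"<.+>", text).group(0)
lazy = re.match(r"<.+?>", text).group(0)

# A shy (non-capturing) group groups tokens without creating a numbered capture,
# so the only capture here is the "(c)" group.
shy_groups = re.match(r"(?:ab)+(c)", "ababc").groups()

# Backreference: \1 must repeat exactly what group 1 matched.
backref = re.compile(r"([ab]+)\1")
matches_abab = backref.fullmatch("abab") is not None
matches_abaab = backref.fullmatch("abaab") is not None

print(greedy, lazy, shy_groups, matches_abab, matches_abaab)
```

The sketch uses fullmatch for the backreference test because the note's claim is about matching the whole string; an unanchored search would still find the substring "aa" inside "abaab".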

Part 2

Language feature comparison (part 2)
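Engine | Directives [Note 1] | Conditionals | Atomic groups [Note 2] | Named capture [Note 3] | Comments | Embedded code | Unicode property support [4] | Balancing groups [Note 4] | Variable-length look-behinds [Note 5]
Boost.Regex | Yes | Yes | Yes | Yes | Yes | No | Some [Note 6] | No | No
Boost.Xpressive | Yes | No | Yes | Yes | Yes | No | No | No | No
CL-PPCRE | Yes | Yes | Yes | Yes | Yes | Yes | Some [Note 6] | No | No
EmEditor | Yes | Yes | ? | ? | Yes | No | ? | No | No
FREJ | No | No | Yes | Yes | Yes | No | ? | No | No
GLib/GRegex | Yes | Yes | Yes | Yes | Yes | No | Some [Note 6] | No | No
GNU grep | Yes | Yes | ? | Yes | Yes | No | No | No | No
Haskell | ? | ? | ? | ? | ? | No | No | No | No
RXP | Yes | Yes | No | Yes | Yes | No | No | No | No
ICU Regex | Yes | No | Yes | Yes [Note 7] | Yes | No | Yes | No | No
Java | Yes | No | Yes | Yes [Note 8] | Yes | No | Some [Note 6] | No | No
JavaScript (ECMAScript) | No | No | No | Yes | No | No | Some [Note 6] [Note 9] [5] | No | Yes
JGsoft | Yes | Yes | Yes | Yes | Yes | No | Some [Note 6] | No | Yes
Lua | No | No | No | No | No | No | No | No | No
.NET | Yes | Yes | Yes | Yes | Yes | No | Some [Note 6] | Yes | Yes
OCaml | No | No | No | No | No | No | No | No | No
PCRE | Yes | Yes | Yes | Yes | Yes | Yes | Yes | No | No
Perl | Yes | Yes | Yes | Yes | Yes | Yes | Yes | No | No [Note 10]
PHP | Yes | Yes | Yes | Yes | Yes | No | No | No | No
Python | Yes | Yes | Yes [Note 11] | Yes | Yes | No | Yes [Note 12] | No | Yes [Note 13]
Qt/QRegExp | No | No | No | No | No | No | No | No | No
RE2 | Yes | No | ? | Yes | No | No | Some [Note 6] | No | No
Ruby, Onigmo | Yes | Yes | Yes | Yes | Yes | No | Some [Note 6] | No | No
Tcl | Yes | No | Yes | No | Yes | No | Yes | No | No
TRE | Yes | No | No | No | Yes | No | ? | No | No
Vim | Yes | No | Yes | No | No | No | No | No | Yes
RGX | Yes | Yes | Yes | Yes | Yes | No | Yes | No | No
XML Schema | No | No | No | No | No | No | Yes | No | No
XPath 3/XQuery | No | No | No | No | No | No | Yes | No | No
XRegExp | Leading only | No | No | Yes | Yes | No | Yes | No | Yes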
  1. Also known as flag modifiers, mode modifiers or option letters. Example pattern: "(?i:test)".
  2. Also called independent sub-expressions.
  3. Similar to backreferences, but with names instead of indices.
  4. A special feature that allows matching balanced constructs without recursion.
  5. Refers to the possibility of including quantifiers in look-behinds, which makes their length unpredictable.
  6. Unicode property support may be incomplete, since products are continuously updated. Every engine is incomplete after a new Unicode revision is released, until it is updated to comply.
  7. Available as of ICU55.
  8. Available as of JDK7.
  9. The support and range of properties is dependent on implementation.
  10. Experimental support added in v5.29.9.
  11. Supported by Python v3.11 and later, and the optional regex library only.
  12. May only be available in the regex library when used with Python versions after 3.3.
  13. Supported by the optional regex library only.
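
As a rough illustration of several columns in this table, the sketch below exercises directives, named capture, comments and look-behind in Python's built-in re module. The helper name lookbehind_is_rejected and all sample strings are hypothetical; the exact syntax (for example (?P<name>...) for named groups) is specific to Python and differs in other engines.

```python
import re

# Scoped directive: (?i:...) applies case-insensitivity only inside the group.
directive = re.compile(r"(?i:test)ing")
directive_hit = directive.fullmatch("TeSting") is not None
directive_miss = directive.fullmatch("TestING") is not None

# Named capture: the group is retrieved by name instead of by index.
date = re.match(r"(?P<year>\d{4})-(?P<month>\d{2})", "2018-06")
year = date.group("year")

# Inline comment: the (?#...) construct is ignored by the engine.
commented = re.fullmatch(r"\d+(?# one or more digits)", "42") is not None

def lookbehind_is_rejected(pattern: str) -> bool:
    """Return True if the built-in re module refuses to compile the pattern."""
    try:
        re.compile(pattern)
    except re.error:
        return True
    return False

# A fixed-width look-behind compiles; one containing a quantifier does not.
fixed_ok = not lookbehind_is_rejected(r"(?<=ab)c")
variable_rejected = lookbehind_is_rejected(r"(?<=a+)c")
```

The last two lines reflect the table entry for Python: the built-in module accepts only fixed-width look-behinds, which is why variable-length look-behind is credited to the optional regex library.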

API features

API feature comparison
Engine | Native UTF-16 support [Note 1] | Native UTF-8 support [Note 1] | Multi-line matching | Partial match [Note 2]
Boost.Regex | No | No | Yes | Yes
GLib/GRegex | Yes | Yes | Yes | Yes
RXP | Yes | Yes | No | Yes
ICU Regex | Yes | No | Yes | ?
Java | Yes [Note 3] | Yes [Note 3] | Yes | Yes
.NET | No [Note 4] | Yes | Yes | ?
PCRE | Yes [Note 5] | Yes | Yes | Yes
Qt/QRegExp | Yes | No | No | Yes [Note 6]
Qt/QRegularExpression | Yes | Yes | Yes | Yes
Tcl | Yes | Yes [Note 7] | Yes | ?
TRE | Yes | Yes | Yes | ?
RGX | No | No | Yes | ?
wxWidgets::wxRegEx [Note 8] | Yes | Yes | Yes | ?
XRegExp | Yes | Yes | Yes | No
  1. Means the format can be used internally without explicit conversion.
  2. Partial match of the whole regular expression. For example, the pattern ".*END$" will match any string partially, but only strings ending with END fully.
  3. Supports Unicode 15.0 standard from 2023.
  4. The implementation uses the original UCS-2 support and features, so it recognizes only 64K characters in total (versus UTF-16's 1,112,064 code points). A Microsoft developer representative answered a bug report on this as "will not fix" in 2010.
  5. Since version 8.30.
  6. Partial matching is performed implicitly, requiring a separate call to matchedLength() if an exact match fails.
  7. Tcl includes facilities to convert to and from UTF-8.
  8. wxRegEx uses any system supplied POSIX library or if not available and for Unicode mode uses Henry Spencer's library.
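
The figures in the UCS-2 note above can be checked with a little arithmetic. The sketch below is only an illustration in Python; the constant and function names are made up here, and the emoji is an arbitrary example of a character outside the first 64K code points.

```python
# UCS-2 addresses one 16-bit unit per character: 2**16 = 65,536 values ("64K").
UCS2_LIMIT = 2 ** 16

# UTF-16 reserves the range U+D800..U+DFFF for surrogate pairs.
SURROGATES = 0xDFFF - 0xD800 + 1

# Valid code points: everything up to U+10FFFF, minus the surrogates.
UTF16_CODE_POINTS = 0x10FFFF + 1 - SURROGATES

def utf16_units(ch: str) -> int:
    """Number of 16-bit code units needed to encode one character in UTF-16."""
    return len(ch.encode("utf-16-le")) // 2

bmp_units = utf16_units("A")               # inside the first 64K code points
astral_units = utf16_units("\U0001F600")   # outside them, needs a surrogate pair
```

An engine that treats each 16-bit unit as a whole character sees the second example as two separate characters, which is the limitation the note describes.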
