TRE (computing)

Last updated
TRE
Original author(s) Ville Laurikari [1]
Repository
Written in C
Type Approximate string matching
License 2-clause BSD-like license
Website laurikari.net/tre/

TRE is an open-source library for pattern matching in text, [2] which works like a regular expression engine with the ability to do approximate string matching. [3] It was developed by Ville Laurikari [1] and is distributed under a 2-clause BSD-like license.

Contents

The library [4] is written in C and provides functions which allow using regular expressions for searching over input text lines. The main difference from other regular expression engines is that TRE can match text fragments in an approximate way, that is, supposing that text could have some number of typos.

Features

TRE uses extended regular expression syntax with the addition of "directions" for matching preceding fragment in approximate way. Each of such directions specifies how many typos are allowed for this fragment.

Approximate matching [5] is performed in a way similar to Levenshtein distance, which means that there are three types of typos 'recognized': [6]

TypoExampleData
insertion of an extra characterregullarexperessionextra l, extra e
missing a character from patternreglarexpessionmissing u, missing r
replacement of some characterregolarexprezsionuo, sz

TRE allows specifying of cost for each of three typos type independently.

The project comes with a command-line utility, a reimplementation of agrep.

Though approximate matching requires some syntax extension, when this feature is not used, TRE works like most of other regular expression matching engines. This means that

Predictable time and memory consumption

The library's author states [8] that time spent for matching grows linearly with increasing of input text length, while memory requirement is constant during matching and does not depend on the input, only on the pattern.

Other

Other features, common for most regular expression engines could be checked in regex engines comparison tables or in list of TRE features on its web-page.

Usage example

Approximate matching directions are specified in curly brackets and should be distinguishable from repetitive quantifiers (possibly with inserting a space after opening bracket):

Language bindings

Apart from C, TRE is usable through bindings for Perl, Python and Haskell. [9] It is the default regular expression engine in R. [10] However, if the project should be cross-platform, each target platform would need a separate interface.

Disadvantages

Since other regular expression engines usually do not provide approximate matching ability, there is almost no concurrent implementation with which TRE could be compared. However, there are a few things which programmers may wish to see implemented in future releases: [11]

See also

Related Research Articles

<span class="mw-page-title-main">Regular expression</span> Sequence of characters that forms a search pattern

A regular expression, sometimes referred to as rational expression, is a sequence of characters that specifies a match pattern in text. Usually such patterns are used by string-searching algorithms for "find" or "find and replace" operations on strings, or for input validation. Regular expression techniques are developed in theoretical computer science and formal language theory.

sed Standard UNIX utility for editing streams of data

sed is a Unix utility that parses and transforms text, using a simple, compact programming language. It was developed from 1973 to 1974 by Lee E. McMahon of Bell Labs, and is available today for most operating systems. sed was based on the scripting features of the interactive editor ed and the earlier qed. It was one of the earliest tools to support regular expressions, and remains in use for text processing, most notably with the substitution command. Popular alternative tools for plaintext string manipulation and "stream editing" include AWK and Perl.

grep Unix command line utility for text search

grep is a command-line utility for searching plaintext datasets for lines that match a regular expression. Its name comes from the ed command g/re/p, which has the same effect. grep was originally developed for the Unix operating system, but later became available for all Unix-like systems and some others such as OS-9.

In computer science, pattern matching is the act of checking a given sequence of tokens for the presence of the constituents of some pattern. In contrast to pattern recognition, the match usually has to be exact: "either it will or will not be a match." The patterns generally have the form of either sequences or tree structures. Uses of pattern matching include outputting the locations of a pattern within a token sequence, to output some component of the matched pattern, and to substitute the matching pattern with some other token sequence.

agrep is an open-source approximate string matching program, developed by Udi Manber and Sun Wu between 1988 and 1991, for use with the Unix operating system. It was later ported to OS/2, DOS, and Windows.

<span class="mw-page-title-main">Levenshtein distance</span> Computer science metric for string similarity

In information theory, linguistics, and computer science, the Levenshtein distance is a string metric for measuring the difference between two sequences. The Levenshtein distance between two words is the minimum number of single-character edits required to change one word into the other. It is named after Soviet mathematician Vladimir Levenshtein, who defined the metric in 1965.

CRM114 is a program based upon a statistical approach for classifying data, and especially used for filtering email spam.

glob (programming) Patterns used in computer programming

glob is a libc function for globbing, which is the archetypal use of pattern matching against the names in a filesystem directory such that a name pattern is expanded into a list of names matching that pattern. Although globbing may now refer to glob -style pattern matching of any string, not just expansion into a list of filesystem names, the original meaning of the term is still widespread.

<span class="mw-page-title-main">Delimiter</span> Characters that specify the boundary between regions in a data stream

A delimiter is a sequence of one or more characters for specifying the boundary between separate, independent regions in plain text, mathematical expressions or other data streams. An example of a delimiter is the comma character, which acts as a field delimiter in a sequence of comma-separated values. Another example of a delimiter is the time gap used to separate letters and words in the transmission of Morse code.

In computer science, a parsing expression grammar (PEG) is a type of analytic formal grammar, i.e. it describes a formal language in terms of a set of rules for recognizing strings in the language. The formalism was introduced by Bryan Ford in 2004 and is closely related to the family of top-down parsing languages introduced in the early 1970s. Syntactically, PEGs also look similar to context-free grammars (CFGs), but they have a different interpretation: the choice operator selects the first match in PEG, while it is ambiguous in CFG. This is closer to how string recognition tends to be done in practice, e.g. by a recursive descent parser.

Perl Compatible Regular Expressions (PCRE) is a library written in C, which implements a regular expression engine, inspired by the capabilities of the Perl programming language. Philip Hazel started writing PCRE in summer 1997. PCRE's syntax is much more powerful and flexible than either of the POSIX regular expression flavors and than that of many other regular-expression libraries.

The bitap algorithm is an approximate string matching algorithm. The algorithm tells whether a given text contains a substring which is "approximately equal" to a given pattern, where approximate equality is defined in terms of Levenshtein distance – if the substring and pattern are within a given distance k of each other, then the algorithm considers them equal. The algorithm begins by precomputing a set of bitmasks containing one bit for each element of the pattern. Then it is able to do most of the work with bitwise operations, which are extremely fast.

In computer science, a Levenshtein automaton for a string w and a number n is a finite-state automaton that can recognize the set of all strings whose Levenshtein distance from w is at most n. That is, a string x is in the formal language recognized by the Levenshtein automaton if and only if x can be transformed into w by at most n single-character insertions, deletions, and substitutions.

<span class="mw-page-title-main">Approximate string matching</span> Finding strings that approximately match a pattern

In computer science, approximate string matching is the technique of finding strings that match a pattern approximately. The problem of approximate string matching is typically divided into two sub-problems: finding approximate substring matches inside a given string and finding dictionary strings that match the pattern approximately.

This is a comparison of regular expression engines.

A regular expression denial of service (ReDoS) is an algorithmic complexity attack that produces a denial-of-service by providing a regular expression and/or an input that takes a long time to evaluate. The attack exploits the fact that many regular expression implementations have super-linear worst-case complexity; on certain regex-input pairs, the time taken can grow polynomially or exponentially in relation to the input size. An attacker can thus cause a program to spend substantial time by providing a specially crafted regular expression and/or input. The program will then slow down or become unresponsive.

RE2 is a software library which implements a regular expression engine. It uses finite-state machines, in contrast to most other regular expression libraries. RE2 supports a C++ interface.

In computer science, the Krauss wildcard-matching algorithm is a pattern matching algorithm. Based on the wildcard syntax in common use, e.g. in the Microsoft Windows command-line interface, the algorithm provides a non-recursive mechanism for matching patterns in software applications, based on syntax simpler than that typically offered by regular expressions.

In computer science, an algorithm for matching wildcards is useful in comparing text strings that may contain wildcard syntax. Common uses of these algorithms include command-line interfaces, e.g. the Bourne shell or Microsoft Windows command-line or text editor or file manager, as well as the interfaces for some search engines and databases. Wildcard matching is a subset of the problem of matching regular expressions and string matching in general.

RE/flex is a free and open source computer program written in C++ that generates fast lexical analyzers in C++. RE/flex offers full Unicode support, indentation anchors, word boundaries, lazy quantifiers, and performance tuning options. RE/flex accepts Flex lexer specifications and offers options to generate scanners for Bison parsers. RE/flex includes a fast C++ regular expression library.

References

  1. 1 2 "R: Pattern Matching for Raw Vectors". MIT.edu.
  2. "Tre for Windows".
  3. 1 2 3 "Using fuzzy searches with tre-agrep". Linux Magazine.
  4. 1 2 "tre 0.8.0-6 (x86_64)". July 7, 2020.
  5. Andoni, Alexandr; Krauthgamer, Robert; Onak, Krzysztof (2010). Polylogarithmic approximation for edit distance and the asymmetric query complexity. IEEE Symp. Foundations of Computer Science (FOCS). arXiv: 1005.4033 . Bibcode:2010arXiv1005.4033A. CiteSeerX   10.1.1.208.2079 .
  6. "TRE web-page - Regex Syntax".
  7. "Tre-agrep has all of grep's functionality but can also do ambiguous or fuzzy"
  8. "TRE web-page - About".
  9. "TRE web-page - FAQ".
  10. "Regular Expressions as used in R".
  11. Trofimovich, Ulya (2019). "Tagged Deterministic Finite Automata with Lookahead". arXiv: 1907.08837 [cs.FL]. practical improvements .. Lurikari algorithm, notably ..

Further reading