Agrep

Last updated
agrep
Developer(s)
Initial release1988;33 years ago (1988)
Stable release
3.41.5
Repository
Written inC
Operating system
Type Pattern matching
License ISC open source license
Website www.tgries.de/agrep

agrep (approximate grep) is an open-source approximate string matching program, developed by Udi Manber and Sun Wu between 1988 and 1991, [1] for use with the Unix operating system. It was later ported to OS/2, DOS, and Windows.

Contents

It selects the best-suited algorithm for the current query from a variety of the known fastest (built-in) string searching algorithms, including Manber and Wu's bitap algorithm based on Levenshtein distances.

agrep is also the search engine in the indexer program GLIMPSE. agrep is under a free ISC License. [2]

Alternative implementations

A more recent agrep is the command-line tool provided with the TRE regular expression library. TRE agrep is more powerful than Wu-Manber agrep since it allows weights and total costs to be assigned separately to individual groups in the pattern. It can also handle Unicode. [3] Unlike Wu-Manber agrep, TRE agrep is licensed under a 2-clause BSD-like license.

FREJ (Fuzzy Regular Expressions for Java) open-source library provides command-line interface which could be used in the way similar to agrep. Unlike agrep or TRE it could be used for constructing complex substitutions for matched text. [4] However its syntax and matching abilities differs significantly from ones of ordinary regular expressions.

See also

Related Research Articles

AWK (awk) is a domain-specific language designed for text processing and typically used as a data extraction and reporting tool. Like sed and grep, it is a filter, and is a standard feature of most Unix-like operating systems.

Regular expression Sequence of characters that forms a search pattern

A regular expression is a sequence of characters that specifies a search pattern. Usually such patterns are used by string-searching algorithms for "find" or "find and replace" operations on strings, or for input validation. It is a technique developed in theoretical computer science and formal language theory.

sed is a Unix utility that parses and transforms text, using a simple, compact programming language. sed was developed from 1973 to 1974 by Lee E. McMahon of Bell Labs, and is available today for most operating systems. sed was based on the scripting features of the interactive editor ed and the earlier qed. sed was one of the earliest tools to support regular expressions, and remains in use for text processing, most notably with the substitution command. Popular alternative tools for plaintext string manipulation and "stream editing" include AWK and Perl.

In computer science, string-searching algorithms, sometimes called string-matching algorithms, are an important class of string algorithms that try to find a place where one or several strings are found within a larger string or text.

SNOBOL is a series of programming languages developed between 1962 and 1967 at AT&T Bell Laboratories by David J. Farber, Ralph E. Griswold and Ivan P. Polonsky, culminating in SNOBOL4. It was one of a number of text-string-oriented languages developed during the 1950s and 1960s; others included COMIT and TRAC.

grep is a command-line utility for searching plain-text data sets for lines that match a regular expression. Its name comes from the ed command g/re/p, which has the same effect. grep was originally developed for the Unix operating system, but later available for all Unix-like systems and some others such as OS-9.

The cron command-line utility, also known as cron job is a job scheduler on Unix-like operating systems. Users who set up and maintain software environments use cron to schedule jobs to run periodically at fixed times, dates, or intervals. It typically automates system maintenance or administration—though its general-purpose nature makes it useful for things like downloading files from the Internet and downloading email at regular intervals.

In computer programming, glob patterns specify sets of filenames with wildcard characters. For example, the Unix Bash shell command mv *.txt textfiles/ moves all files with names ending in .txt from the current directory to the directory textfiles. Here, * is a wildcard standing for "any string of characters" and *.txt is a glob pattern. The other common wildcard is the question mark (?), which stands for one character. For example, mv ?.txt shorttextfiles/ will move all files named with a single character followed by .txt from the current directory to directory shorttextfiles, while ??.txt would match all files whose name consists of 2 characters followed by .txt.

The bitap algorithm is an approximate string matching algorithm. The algorithm tells whether a given text contains a substring which is "approximately equal" to a given pattern, where approximate equality is defined in terms of Levenshtein distance – if the substring and pattern are within a given distance k of each other, then the algorithm considers them equal. The algorithm begins by precomputing a set of bitmasks containing one bit for each element of the pattern. Then it is able to do most of the work with bitwise operations, which are extremely fast.

In computer science, a Levenshtein automaton for a string w and a number n is a finite-state automaton that can recognize the set of all strings whose Levenshtein distance from w is at most n. That is, a string x is in the formal language recognized by the Levenshtein automaton if and only if x can be transformed into w by at most n single-character insertions, deletions, and substitutions.

Approximate string matching finding strings that approximately match a pattern

In computer science, approximate string matching is the technique of finding strings that match a pattern approximately. The problem of approximate string matching is typically divided into two sub-problems: finding approximate substring matches inside a given string and finding dictionary strings that match the pattern approximately.

expr is a command line utility on Unix and Unix-like operating systems which evaluates an expression and outputs the corresponding value. It first appeared in Unix v7. The expr command has also been ported to the IBM i operating system.

Template processor Software designed to combine templates with a data model to produce result documents

A template processor is software designed to combine templates with a data model to produce result documents. The language that the templates are written in is known as a template language or templating language. For purposes of this article, a result document is any kind of formatted output, including documents, web pages, or source code, either in whole or in fragments. A template engine is ordinarily included as a part of a web template system or application framework, and may be used also as a preprocessor or filter.

This is a comparison of regular-expression engines.

CUBRID ( "cube-rid") is an open-source SQL-based relational database management system (RDBMS) with object extensions developed by CUBRID Corp. for OLTP. The name CUBRID is a combination of the two words cube and bridge, cube standing for a space for data and bridge standing for data bridge.

ngrep

ngrep is a network packet analyzer written by Jordan Ritter. It has a command-line interface, and relies upon the pcap library and the GNU regex library.

TRE is an open-source library for pattern matching in text, which works like a regular expression engine with the ability to do approximate string matching. It was developed by Ville Laurikari and is distributed under a 2-clause BSD-like license.

The following outline is provided as an overview of and topical guide to the Perl programming language:

Brenda Sue Baker is an American computer scientist. She is known for Baker's technique for approximation algorithms on planar graphs, for her early work on duplicate code detection, and for her research on two-dimensional bin packing problems.

In computer science, an algorithm for matching wildcards is useful in comparing text strings that may contain wildcard syntax. Common uses of these algorithms include command-line interfaces, e.g. the Bourne shell or Microsoft Windows command-line or text editor or file manager, as well as the interfaces for some search engines and databases. Wildcard matching is a subset of the problem of matching regular expressions and string matching in general.

References

  1. Wu, Sun; Manber, Udi (20–24 January 1992). Agrep -- a fast approximate pattern-matching tool. 1992 Winter USENIX Conference. San Francisco, California. CiteSeerX   10.1.1.89.5424 .
  2. WebGlimpse, Glimpse and also AGREP license since 18.09.2014 (ISC License).
  3. "TRE - TRE regexp matching package - Features".
  4. "FREJ - Fuzzy Regular Expressions for Java - Guide and Examples".