Wildmat

wildmat
Developer(s) Rich Salz
Type Pattern matching

wildmat is a pattern matching library developed by Rich Salz. Based on the wildcard syntax already used in the Bourne shell, wildmat provides a uniform mechanism for matching patterns across applications with simpler syntax than that typically offered by regular expressions. Patterns are implicitly anchored at the beginning and end of each string when testing for a match.

In June 2019, Rich Salz released the original version of the now-defunct library on GitHub under a public domain dedication. [1]

Pattern matching operations

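Besides a strict character-for-character match between the pattern and the string being checked, wildmat supports five pattern matching operations: an asterisk (*) matches any sequence of zero or more characters; a question mark (?) matches any single character; a set in square brackets ([...]) matches any one character from the listed characters or ranges; a set that begins with a caret ([^...]) matches any one character not in the set; and a backslash (\) removes the special meaning of the character that follows it.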
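As an illustration, the following Python sketch implements a small recursive matcher for these operations, assuming they are the ones described in the wildmat(3) manual page: * for any run of characters, ? for one character, [...] for a set, [^...] for a negated set, and \ to quote the next character. It is not a port of Salz's C code. The function names (wildmat, _match, _parse_set) and the choice to reject an unterminated set are assumptions made for the example.

```python
def _parse_set(pat, pi):
    """Parse the set that starts at pat[pi] == '['.

    Returns (negated, ranges, next_index), or None if the set is never closed.
    """
    i = pi + 1
    negated = i < len(pat) and pat[i] == "^"
    if negated:
        i += 1
    ranges = []
    first = True
    while i < len(pat):
        ch = pat[i]
        if ch == "]" and not first:
            return negated, ranges, i + 1
        first = False  # a ']' in first position is an ordinary member
        if i + 2 < len(pat) and pat[i + 1] == "-" and pat[i + 2] != "]":
            ranges.append((ch, pat[i + 2]))  # range such as a-z
            i += 3
        else:
            ranges.append((ch, ch))  # single character (including a stray '-')
            i += 1
    return None


def _match(text, ti, pat, pi):
    while pi < len(pat):
        c = pat[pi]
        if c == "*":
            # Collapse consecutive stars, then try every possible split point.
            while pi < len(pat) and pat[pi] == "*":
                pi += 1
            if pi == len(pat):
                return True
            return any(_match(text, k, pat, pi) for k in range(ti, len(text) + 1))
        if ti >= len(text):
            return False
        if c == "[":
            parsed = _parse_set(pat, pi)
            if parsed is None:
                return False  # sketch choice: an unclosed set never matches
            negated, ranges, nxt = parsed
            hit = any(lo <= text[ti] <= hi for lo, hi in ranges)
            if hit == negated:
                return False
            ti += 1
            pi = nxt
            continue
        if c == "\\":
            if pi + 1 >= len(pat):
                return False  # sketch choice: a trailing backslash never matches
            pi += 1
            if pat[pi] != text[ti]:
                return False
        elif c != "?" and c != text[ti]:
            return False
        ti += 1
        pi += 1
    # Patterns are anchored at both ends: all of the text must be consumed.
    return ti == len(text)


def wildmat(text, pattern):
    """Return True if the whole of text matches pattern."""
    return _match(text, 0, pattern, 0)
```

With this sketch, wildmat("foobar", "*foo*") is true while wildmat("foobar", "foo") is false, because the pattern must cover the whole string. The recursion on * can take exponential time on adversarial patterns, so this form is suited to illustration rather than production use.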

Examples

Usage

wildmat is most commonly seen in NNTP implementations such as Salz's own INN, and also in unrelated software such as GNU tar and Transmission. GNU tar replaced wildmat with the POSIX fnmatch glob matcher in September 1992. The early version contained a potential out-of-bounds access when a pattern has an unclosed [. [2]
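A malformed pattern of this kind can be caught before matching begins. The Python sketch below scans a pattern for a [ that is never closed, such as the "foo[a-" case mentioned in the wildmat.c source comment. The function name find_unclosed_set is hypothetical. The set syntax it assumes (an optional leading ^, a leading ] taken literally, \ quoting outside sets) is a simplification, not the behaviour of any particular implementation.

```python
from typing import Optional


def find_unclosed_set(pattern: str) -> Optional[int]:
    """Return the index of the first '[' that is never closed, or None."""
    i = 0
    n = len(pattern)
    while i < n:
        ch = pattern[i]
        if ch == "\\":
            i += 2  # skip the quoted character
            continue
        if ch == "[":
            start = i
            i += 1
            if i < n and pattern[i] == "^":
                i += 1  # negation marker
            if i < n and pattern[i] == "]":
                i += 1  # a leading ']' is a literal member of the set
            # The explicit bound check is what keeps the scan inside the pattern.
            while i < n and pattern[i] != "]":
                i += 1
            if i >= n:
                return start
        i += 1
    return None
```

A caller could reject the pattern, or treat the [ as a literal character, whenever the function returns an index.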

The original byte-oriented wildmat implementation cannot handle multibyte character sets, and it poses problems when the text being searched may contain multiple incompatible character sets. A simplified version of wildmat oriented toward UTF-8 encoding was developed by the IETF NNTP working group. It is part of RFC 3977 (section 4), the 2006 standard for NNTP.

Newer versions of INN, which support UTF-8, include a "uwildmat" that supports all the features of wildmat. This 2000 rewrite, performed by Russ Allbery, fixes the out-of-bounds access in the original implementation. Tightly wound C loops were written out into smaller statements. [3] [4]

Rsync includes a GPLv3-licensed wildmat descendant known as wildmatch, modified by Wayne Davison. The Git version control system imports and makes use of it. wildmatch does not support UTF-8, but it fixes the out-of-bounds access and adds support for character classes and star globs (** for arbitrary depth). [5]
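The difference between * and ** in path matching can be illustrated with a short Python sketch that translates a pattern into a regular expression. This is not how wildmatch itself works, since wildmatch matches directly in C. The rules assumed here follow the commonly documented gitignore behaviour: * and ? do not cross a /, a **/ at the start of a path component matches zero or more directories, and a trailing /** matches everything beneath a directory. Sets and character classes are left out, and the function names are hypothetical.

```python
import re


def glob_to_regex(pattern: str) -> str:
    """Translate a path glob with '*', '**' and '?' into a regular expression."""
    out = []
    i = 0
    while i < len(pattern):
        at_component_start = i == 0 or pattern[i - 1] == "/"
        if at_component_start and pattern.startswith("**/", i):
            out.append("(?:.*/)?")  # zero or more whole directories
            i += 3
        elif at_component_start and pattern[i:] == "**":
            out.append(".*")        # everything below this point
            i += 2
        elif pattern[i] == "*":
            out.append("[^/]*")     # stays within one path component
            i += 1
        elif pattern[i] == "?":
            out.append("[^/]")      # one character, but never a '/'
            i += 1
        else:
            out.append(re.escape(pattern[i]))
            i += 1
    return "".join(out)


def path_match(path: str, pattern: str) -> bool:
    """Return True if the whole path matches the glob pattern."""
    return re.fullmatch(glob_to_regex(pattern), path) is not None
```

Under these assumptions, "*.c" matches "main.c" but not "src/main.c", while "**/*.c" matches both.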

References

  1. Salz, Rich (25 June 2019). "wildmat: The hoary classic wildmat pattern matcher; public domain". Retrieved 25 November 2019.
  2. Salz, Rich (25 June 2019). "wildmat.c". Might not be robust in face of malformed patterns; e.g., "foo[a-" could cause a segmentation violation.
  3. uwildmat(3), Linux Library Functions Manual.
  4. "uwildmat.c in trunk/lib – INN". inn.eyrie.org. Retrieved 27 November 2019.
  5. "git/git: wildmatch.c". GitHub.