Matching wildcards

In computer science, an algorithm for matching wildcards (also known as globbing) is useful in comparing text strings that may contain wildcard syntax. [1] Common uses of these algorithms include command-line interfaces, e.g. the Bourne shell [2] or the Microsoft Windows command line, [3] as well as text editors, file managers, and the interfaces of some search engines [4] and databases. [5] Wildcard matching is a subset of the problem of matching regular expressions and string matching in general. [6]

The problem

A wildcard matcher tests a wildcard pattern p against an input string s. It performs an anchored match, returning true only when p matches the entirety of s.

The pattern can be based on any common syntax (see globbing), but on Windows programmers tend to discuss only a simplified syntax supported by the native C runtime, [7] [8] in which ? matches exactly one character and * matches zero or more characters (including none).

This article mainly discusses the Windows formulation of the problem, unless otherwise stated.
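To illustrate the anchored-match requirement, the POSIX fnmatch routine (one of the library implementations cited in the references) can be used; the following is a minimal usage sketch, assuming a POSIX system, and it exercises only the ? and * wildcards of the simplified syntax (fnmatch itself additionally supports bracket expressions):

    #include <stdio.h>
    #include <fnmatch.h>    /* POSIX wildcard matcher */

    int main(void)
    {
        /* fnmatch() performs an anchored match: 0 means the pattern covered the whole string. */
        printf("%d\n", fnmatch("*.txt", "notes.txt", 0) == 0);      /* 1: '*' matches "notes" */
        printf("%d\n", fnmatch("*.txt", "notes.txt.bak", 0) == 0);  /* 0: the trailing ".bak" is not covered */
        printf("%d\n", fnmatch("no?es.txt", "notes.txt", 0) == 0);  /* 1: '?' matches exactly one character */
        return 0;
    }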

Definition

Stated in zero-based indices, the wildcard-matching problem can be defined recursively as:

    m(0, 0) = true
    m(i, 0) = m(i − 1, 0) and (p[i − 1] = '*'),   for 1 ≤ i ≤ |p|
    m(0, j) = false,                              for 1 ≤ j ≤ |t|
    m(i, j) = m(i − 1, j − 1),                    if p[i − 1] = '?' or p[i − 1] = t[j − 1]
              m(i − 1, j) or m(i, j − 1),         if p[i − 1] = '*'
              false,                              otherwise

where m(i, j) is the result of matching the pattern p and the text t truncated at i and j characters respectively, so that the overall match succeeds exactly when m(|p|, |t|) is true. This is the formulation used by Richter's algorithm and the Snippets algorithm found in Cantatore's collection. [9] [10] This description is similar to the Levenshtein distance.
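A bottom-up evaluation of this recurrence can be written down directly; the following is only an illustrative sketch of the tabular formulation for the simplified ? and * syntax, not the code of any of the cited algorithms:

    #include <stdbool.h>
    #include <string.h>

    /* m[i][j] is true when the first i pattern characters match the first j
       text characters, exactly as in the recurrence above.
       A sketch: large inputs would need heap allocation instead of a VLA. */
    bool wildcard_match_dp(const char *p, const char *t)
    {
        size_t np = strlen(p), nt = strlen(t);
        bool m[np + 1][nt + 1];                 /* C99 variable-length array */

        m[0][0] = true;                         /* empty pattern matches empty text */
        for (size_t j = 1; j <= nt; j++)
            m[0][j] = false;                    /* empty pattern never matches nonempty text */
        for (size_t i = 1; i <= np; i++)
            m[i][0] = m[i - 1][0] && p[i - 1] == '*';   /* only a run of '*' matches empty text */

        for (size_t i = 1; i <= np; i++)
            for (size_t j = 1; j <= nt; j++) {
                if (p[i - 1] == '*')
                    m[i][j] = m[i - 1][j] || m[i][j - 1];   /* '*' matches empty or absorbs t[j-1] */
                else if (p[i - 1] == '?' || p[i - 1] == t[j - 1])
                    m[i][j] = m[i - 1][j - 1];              /* single-character match */
                else
                    m[i][j] = false;
            }
        return m[np][nt];
    }

The algorithms described in this article do not build this table explicitly; the sketch only makes the recurrence, and its similarity to the Levenshtein distance computation, concrete.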

Directly related problems in computer science include pattern matching with don't cares [11] [12] and pattern matching with flexible wildcards. [13]

History

Early algorithms for matching wildcards often relied on recursion, but the technique was criticized on grounds of performance [10] and reliability. [8] Non-recursive algorithms for matching wildcards have gained favor in light of these considerations.

Among both recursive and non-recursive algorithms, strategies for performing the pattern matching operation vary widely, as evidenced among the variety of example algorithms referenced below. Test case development and performance optimization techniques have been demonstrably brought to bear on certain algorithms, particularly those developed by critics of the recursive algorithms.

Recursive algorithms

The recursion generally happens on matching * when there is more suffix to match against. This is a form of backtracking, also done by some regular expression matchers.

The general form of these algorithms is the same. On recursion the algorithm slices the input into substrings, and considers a match to have happened when one of the substrings returns a positive match. For dowild("*X", "abcX"), it would greedily call dowild("X", "abcX"), dowild("X", "bcX"), dowild("X", "cX") and dowild("X", "X"). They usually differ in less important respects, such as feature support, and in more important factors, such as minor but highly effective optimizations. Some of them include:
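Independent of those optimizations, the greedy-call core that these algorithms share can be sketched as follows; the dowild name mirrors the example above, the sketch supports only ? and *, and it is not the code of any cited implementation:

    #include <stdbool.h>

    /* Greedy-call recursive matcher: on '*' it tries every possible suffix of
       the text, exactly as in the dowild("*X", "abcX") example above. */
    bool dowild(const char *p, const char *s)
    {
        for (; *p; p++, s++) {
            if (*p == '*') {
                while (*p == '*')        /* collapse consecutive stars */
                    p++;
                if (*p == '\0')
                    return true;         /* a trailing '*' matches any remaining text */
                for (; *s; s++)          /* greedily try each split point */
                    if (dowild(p, s))
                        return true;
                return false;            /* remaining pattern needs at least one more character */
            }
            if (*s == '\0')
                return false;            /* text exhausted while pattern still has characters */
            if (*p != '?' && *p != *s)
                return false;            /* literal mismatch */
        }
        return *s == '\0';               /* pattern exhausted: match only if text is too */
    }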

Martin Richter's algorithm is an exception to this pattern, although the overall operation is equivalent. On * it recurses by incrementing either of the two indices, following the dynamic programming formulation of the problem. The "ABORT" technique is applicable to it as well. [9] On typical patterns (as tested by Cantatore) it is slower than the greedy-call implementations. [10]

The recursive algorithms are in general easier to reason about, and with the ABORT modification they perform acceptably in terms of worst-case complexity. On strings without * they take time linear in the string size, since there is a fixed one-to-one correspondence between pattern and text characters.
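The ABORT modification can be grafted onto the greedy-call sketch above by distinguishing an ordinary mismatch from running out of text: once the text is exhausted, no enclosing * can recover by consuming more of it, so the star loops stop retrying. This is again only a hedged sketch of the idea, not the code of any cited matcher:

    /* Three-valued result for the ABORT modification. */
    enum { WM_FALSE = 0, WM_TRUE = 1, WM_ABORT = 2 };

    static int dowild_abort(const char *p, const char *s)
    {
        for (; *p; p++, s++) {
            if (*p == '*') {
                while (*p == '*')
                    p++;
                if (*p == '\0')
                    return WM_TRUE;
                for (; *s; s++) {
                    int r = dowild_abort(p, s);
                    if (r != WM_FALSE)
                        return r;        /* propagate WM_TRUE and WM_ABORT alike */
                }
                return WM_ABORT;         /* text exhausted: no outer '*' can help */
            }
            if (*s == '\0')
                return WM_ABORT;         /* text exhausted mid-pattern */
            if (*p != '?' && *p != *s)
                return WM_FALSE;         /* plain mismatch: an outer '*' may still retry */
        }
        return *s == '\0' ? WM_TRUE : WM_FALSE;
    }

A caller treats only WM_TRUE as a successful match, e.g. dowild_abort(p, s) == WM_TRUE.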

Non-recursive algorithms

The following are developed by critics of the recursive algorithms: Kirk Krauss's algorithm, [8] [23] Alessandro Cantatore's collection of matchers, [10] and Dogan Kurt's iterative matcher. [17]

The following is not: Jack Handy's algorithm. [25]

The iterative functions above implement backtracking by saving an old set of pattern/text pointers, and reverting to it should a match fail. According to Kurt, since only one successful match is required, only one such set needs to be saved. [17]
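A hedged sketch of that iterative, single-backtrack-point strategy for the ? and * syntax (an illustration of the idea, not the code of any cited matcher):

    #include <stdbool.h>

    /* Iterative matcher that keeps exactly one saved pattern/text position pair:
       on a mismatch it backs up to the most recent '*' and lets it consume one
       more text character. */
    bool wildcard_match_iter(const char *p, const char *s)
    {
        const char *star_p = NULL;       /* pattern position just after the last '*' */
        const char *star_s = NULL;       /* text position that '*' has consumed up to */

        while (*s) {
            if (*p == '*') {
                star_p = ++p;            /* save the single backtrack point */
                star_s = s;              /* '*' tentatively matches nothing */
            } else if (*p == '?' || *p == *s) {
                p++; s++;                /* ordinary one-character match */
            } else if (star_p) {
                p = star_p;              /* revert: let the last '*' eat one more character */
                s = ++star_s;
            } else {
                return false;            /* mismatch with no '*' to fall back on */
            }
        }
        while (*p == '*')                /* trailing stars match the empty remainder */
            p++;
        return *p == '\0';
    }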

In addition, the problem of wildcard matching can be converted into regular expression matching using a naive text-replacement approach. Although non-recursive regular expression matchers such as Thompson's construction are less used in practice due to lack of backreference support, wildcard matching in general does not come with a similarly rich set of features. (In fact, many of the algorithms above only have support for ? and *.) The Russ Cox implementation of the Thompson NFA can be trivially modified for this purpose. [26] Gonzalo Navarro's BDM-based nrgrep algorithm provides a more streamlined implementation with an emphasis on efficient suffixes. [27] See also regular expression § Implementations.
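The text-replacement step itself is mechanical: * becomes .*, ? becomes ., every other regular-expression metacharacter is escaped, and the result is anchored at both ends. The helper below is hypothetical and only illustrates the conversion; the caller is expected to provide a sufficiently large output buffer:

    #include <stdio.h>
    #include <string.h>

    /* Naive wildcard-to-regex translation: worst case needs 2*strlen(glob) + 3
       bytes in 'out' (every character escaped, plus '^', '$' and the terminator). */
    static void glob_to_regex(const char *glob, char *out)
    {
        *out++ = '^';                    /* wildcard matching is anchored */
        for (; *glob; glob++) {
            if (*glob == '*') {
                *out++ = '.'; *out++ = '*';
            } else if (*glob == '?') {
                *out++ = '.';
            } else {
                if (strchr(".^$+()[]{}|\\", *glob))
                    *out++ = '\\';       /* escape regex metacharacters */
                *out++ = *glob;
            }
        }
        *out++ = '$';
        *out = '\0';
    }

    int main(void)
    {
        char re[64];
        glob_to_regex("*.txt", re);
        puts(re);                        /* prints ^.*\.txt$ */
        return 0;
    }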

See also


References

  1. "Wildcard characters". ScienceDirect. 2018.
  2. Quigley, Ellie (2005). UNIX Shell Programming QuickStart. InformIT.com.
  3. "MS-DOS and Windows Wildcard Characters". Microsoft Developer Network Library. 31 May 2018.
  4. "Apache Lucene - Query Parser Syntax". Apache Lucene 2.9.4 Documentation. 2006.
  5. "SQL Wildcards". W3Schools. 2018.
  6. Goyvaerts, Jan (2018). "Welcome to Regular-Expressions.info". RegularExpressions.info.
  7. "Wildcard Expansion". docs.microsoft.com. 8 February 2022.
  8. Krauss, Kirk (2008). "Matching Wildcards: An Algorithm". Dr. Dobb's Journal.
  9. Deadlock (2015). "Wildcard Matching Recursive Algorithm C++". Stack Overflow.
  10. Cantatore, Alessandro (2003). "Wildcard matching algorithms".
  11. Iliopoulos, Costas S.; Rahman, M. Sohel (2007). "Pattern Matching Algorithms with Don't Cares" (PDF). SOFSEM 2007: Theory and Practice of Computer Science, 33rd Conference on Current Trends in Theory and Practice of Computer Science. Harrachov, Czech Republic. S2CID 14538871. Archived from the original (PDF) on 2019-12-17.
  12. Clifford, Peter; Clifford, Raphaël (January 2007). "Simple deterministic wildcard matching". Information Processing Letters. 101 (2): 53–54. doi:10.1016/j.ipl.2006.08.002.
  13. Wu, Xindong; Qiang, Ji-Peng; Xie, Fei (12 September 2014). "Pattern Matching with Flexible Wildcards". Journal of Computer Science and Technology. 29 (5): 740–750. doi:10.1007/s11390-014-1464-3. S2CID 16824910.
  14. Salz, Rich (1991). "wildmat.c". GitHub.
  15. Filip (2014). "Compare strings with wildcard". Stack Overflow.
  16. Murugesan, Vignesh (2014). "WildCard Matching algorithm".
  17. Kurt, Dogan. "Wildcard Matching Methods".
  18. van Rossum, Guido (20 November 2019). "freebsd/lib/libc/gen/fnmatch.c". GitHub. Retrieved 21 November 2019.
  19. "fnmatch.c". opensource.apple.com. 1999.
  20. "fnmatch_internal.c". Beren Minor's Mirrors. 21 November 2019.
  21. "git/git: wildmatch.c". GitHub. 2020-01-20.
  22. "uwildmat.c in trunk/lib – INN". inn.eyrie.org. Retrieved 27 November 2019.
  23. Krauss, Kirk (2018). "Matching Wildcards: An Improved Algorithm for Big Data". Develop for Performance.
  24. Siler (2013). "Recursive solutions for glob pattern matching". Stack Overflow.
  25. Handy, Jack (2005). "Wildcard string compare (globbing)". Code Project.
  26. Cox, Russ. "Regular Expression Matching Can Be Simple And Fast".
  27. Navarro, Gonzalo (10 November 2001). "NR-grep: a fast and flexible pattern-matching tool" (PDF). Software: Practice and Experience. 31 (13): 1265–1312. doi:10.1002/spe.411. S2CID 3175806.