Matching wildcards

Last updated

In computer science, an algorithm for matching wildcards (also known as globbing) is useful in comparing text strings that may contain wildcard syntax. [1] Common uses of these algorithms include command-line interfaces, e.g. the Bourne shell [2] or Microsoft Windows command-line [3] or text editor or file manager, as well as the interfaces for some search engines [4] and databases. [5] Wildcard matching is a subset of the problem of matching regular expressions and string matching in general. [6]

Contents

The problem

A wildcard matcher tests a wildcard pattern p against an input string s. It performs an anchored match, returns true only when p matches the entirety of s.

The pattern can be based on any common syntax (see globbing), but on Windows programmers tend to only discuss a simplified syntax supported by the native C runtime: [7] [8]

This article mainly discusses the Windows formulation of the problem, unless otherwise stated.

Definition

Stated in zero-based indices, the wildcard-matching problem can be defined recursively as:

where mij is the result of matching the pattern p against the text t truncated at i and j characters respectively. This is the formulation used by Richter's algorithm and the Snippets algorithm found in Cantatore's collection. [9] [10] This description is similar to the Levenshtein distance.

Directly related problems in computer science include:

History

Early algorithms for matching wildcards often relied on recursion, but the technique was criticized on grounds of performance [10] and reliability [8] considerations. Non-recursive algorithms for matching wildcards have gained favor in light of these considerations.

Among both recursive and non-recursive algorithms, strategies for performing the pattern matching operation vary widely, as evidenced among the variety of example algorithms referenced below. Test case development and performance optimization techniques have been demonstrably brought to bear on certain algorithms, particularly those developed by critics of the recursive algorithms.

Recursive algorithms

The recursion generally happens on matching * when there is more suffix to match against. This is a form of backtracking, also done by some regular expression matchers.

The general form of these algorithms are the same. On recursion the algorithm slices the input into substrings, and considers a match to have happened when ONE of the substrings return a positive match. For dowild("*X", "abcX"), it would greedily call dowild("X", "abcX"), dowild("X", "bcX"), dowild("X", "cX") and dowild("X", "X"). They usually differ by less-important things like support for features and by more important factors such as minor but highly effective optimizations. Some of them include:

Martin Richter's algorithm is an exception to this pattern, although the overall operation is equivalent. On * it recurses into increasing either of the indexes, following the dynamic programming formulation of the problem. The "ABORT" technique is applicable to it as well. [9] On typical patterns (as tested by Cantatore) it is slower than the greedy-call implementations. [10]

The recursive algorithms are in general easier to reason about, and with the ABORT modification they perform acceptably in terms of worst-case complexity. On strings without * they take linear-to-string-size time to match since there is a fixed one-to-one relation.

Non-recursive algorithms

The following are developed by critics of the recursive algorithms:

The following is not:

The iterative functions above implement backtracking by saving an old set of pattern/text pointers, and reverting to it should a match fails. According to Kurt, since only one successful match is required, only one such set needs to be saved. [17]

In addition, the problem of wildcard matching can be converted into regular expression matching using a naive text-replacement approach. Although non-recursive regular expression matchers such as Thompson's construction are less used in practice due to lack of backreference support, wildcard matching in general does not come with a similarly rich set of features. (In fact, many of the algorithms above only has support for ? and *.) The Russ Cox implementation of Thompson NFA can be trivially modified for such. [26] Gustavo Navarro's BDM-based nrgrep algorithm provides a more streamlined implementation with emphasis on efficient suffixes. [27] See also regular expression § Implementations.

See also

Related Research Articles

<span class="mw-page-title-main">Regular expression</span> Sequence of characters that forms a search pattern

A regular expression, sometimes referred to as rational expression, is a sequence of characters that specifies a match pattern in text. Usually such patterns are used by string-searching algorithms for "find" or "find and replace" operations on strings, or for input validation. Regular expression techniques are developed in theoretical computer science and formal language theory.

In computer science, string-searching algorithms, sometimes called string-matching algorithms, are an important class of string algorithms that try to find a place where one or several strings are found within a larger string or text.

In software, a wildcard character is a kind of placeholder represented by a single character, such as an asterisk, which can be interpreted as a number of literal characters or an empty string. It is often used in file searches so the full name need not be typed.

In computer science, the Knuth–Morris–Pratt algorithm is a string-searching algorithm that searches for occurrences of a "word" W within a main "text string" S by employing the observation that when a mismatch occurs, the word itself embodies sufficient information to determine where the next match could begin, thus bypassing re-examination of previously matched characters.

In computer science, pattern matching is the act of checking a given sequence of tokens for the presence of the constituents of some pattern. In contrast to pattern recognition, the match usually has to be exact: "either it will or will not be a match." The patterns generally have the form of either sequences or tree structures. Uses of pattern matching include outputting the locations of a pattern within a token sequence, to output some component of the matched pattern, and to substitute the matching pattern with some other token sequence.

In computational linguistics and computer science, edit distance is a string metric, i.e. a way of quantifying how dissimilar two strings are to one another, that is measured by counting the minimum number of operations required to transform one string into the other. Edit distances find applications in natural language processing, where automatic spelling correction can determine candidate corrections for a misspelled word by selecting words from a dictionary that have a low distance to the word in question. In bioinformatics, it can be used to quantify the similarity of DNA sequences, which can be viewed as strings of the letters A, C, G and T.

In computer programming, glob patterns specify sets of filenames with wildcard characters. For example, the Unix Bash shell command mv *.txttextfiles/ moves all files with names ending in .txt from the current directory to the directory textfiles. Here, * is a wildcard and *.txt is a glob pattern. The wildcard * stands for "any string of any length including empty, but excluding the path separator characters ".

In computer science, the Rabin–Karp algorithm or Karp–Rabin algorithm is a string-searching algorithm created by Richard M. Karp and Michael O. Rabin that uses hashing to find an exact match of a pattern string in a text. It uses a rolling hash to quickly filter out positions of the text that cannot match the pattern, and then checks for a match at the remaining positions. Generalizations of the same idea can be used to find more than one match of a single pattern, or to find matches for more than one pattern.

In computer science, the Boyer–Moore string-search algorithm is an efficient string-searching algorithm that is the standard benchmark for practical string-search literature. It was developed by Robert S. Boyer and J Strother Moore in 1977. The original paper contained static tables for computing the pattern shifts without an explanation of how to produce them. The algorithm for producing the tables was published in a follow-on paper; this paper contained errors which were later corrected by Wojciech Rytter in 1980.

In computer science, a parsing expression grammar (PEG) is a type of analytic formal grammar, i.e. it describes a formal language in terms of a set of rules for recognizing strings in the language. The formalism was introduced by Bryan Ford in 2004 and is closely related to the family of top-down parsing languages introduced in the early 1970s. Syntactically, PEGs also look similar to context-free grammars (CFGs), but they have a different interpretation: the choice operator selects the first match in PEG, while it is ambiguous in CFG. This is closer to how string recognition tends to be done in practice, e.g. by a recursive descent parser.

wildmat is a pattern matching library developed by Rich Salz. Based on the wildcard syntax already used in the Bourne shell, wildmat provides a uniform mechanism for matching patterns across applications with simpler syntax than that typically offered by regular expressions. Patterns are implicitly anchored at the beginning and end of each string when testing for a match.

Perl Compatible Regular Expressions (PCRE) is a library written in C, which implements a regular expression engine, inspired by the capabilities of the Perl programming language. Philip Hazel started writing PCRE in summer 1997. PCRE's syntax is much more powerful and flexible than either of the POSIX regular expression flavors and than that of many other regular-expression libraries.

The bitap algorithm is an approximate string matching algorithm. The algorithm tells whether a given text contains a substring which is "approximately equal" to a given pattern, where approximate equality is defined in terms of Levenshtein distance – if the substring and pattern are within a given distance k of each other, then the algorithm considers them equal. The algorithm begins by precomputing a set of bitmasks containing one bit for each element of the pattern. Then it is able to do most of the work with bitwise operations, which are extremely fast.

<span class="mw-page-title-main">Approximate string matching</span> Finding strings that approximately match a pattern

In computer science, approximate string matching is the technique of finding strings that match a pattern approximately. The problem of approximate string matching is typically divided into two sub-problems: finding approximate substring matches inside a given string and finding dictionary strings that match the pattern approximately.

expr is a command line utility on Unix and Unix-like operating systems which evaluates an expression and outputs the corresponding value. It first appeared in Unix v7. The command is available for Microsoft Windows as part of the UnxUtils collection of native Win32 ports of common GNU Unix-like utilities. The expr command has also been ported to the IBM i operating system.

<span class="mw-page-title-main">Recursion (computer science)</span> Use of functions that call themselves

In computer science, recursion is a method of solving a computational problem where the solution depends on solutions to smaller instances of the same problem. Recursion solves such recursive problems by using functions that call themselves from within their own code. The approach can be applied to many types of problems, and recursion is one of the central ideas of computer science.

The power of recursion evidently lies in the possibility of defining an infinite set of objects by a finite statement. In the same manner, an infinite number of computations can be described by a finite recursive program, even if this program contains no explicit repetitions.

This is a comparison of regular expression engines.

TRE is an open-source library for pattern matching in text, which works like a regular expression engine with the ability to do approximate string matching. It was developed by Ville Laurikari and is distributed under a 2-clause BSD-like license.

In computer science, the Krauss wildcard-matching algorithm is a pattern matching algorithm. Based on the wildcard syntax in common use, e.g. in the Microsoft Windows command-line interface, the algorithm provides a non-recursive mechanism for matching patterns in software applications, based on syntax simpler than that typically offered by regular expressions.

Gestalt pattern matching, also Ratcliff/Obershelp pattern recognition, is a string-matching algorithm for determining the similarity of two strings. It was developed in 1983 by John W. Ratcliff and John A. Obershelp and published in the Dr. Dobb's Journal in July 1988.

References

  1. "Wildcard characters". ScienceDirect. 2018.
  2. Quigley, Ellie (2005). UNIX Shell Programming QuickStart. InformIT.com.
  3. "MS-DOS and Windows Wildcard Characters". Microsoft Developer Network Library.
  4. "Apache Lucene - Query Parser Syntax". Apache Lucene 2.9.4 Documentation. 2006.
  5. "SQL Wildcards". W3Schools. 2018.
  6. Goyvaerts, Jan (2018). "Welcome to Regular-Expressions.info". RegularExpressions.info.
  7. "Wildcard Expansion". docs.microsoft.com.
  8. 1 2 3 Krauss, Kirk (2008). "Matching Wildcards: An Algorithm". Dr. Dobb's Journal .
  9. 1 2 3 Deadlock (2015). "Wildcard Matching Recursive Algorithm C++". Stack Overflow.
  10. 1 2 3 4 Cantatore, Alessandro (2003). "Wildcard matching algorithms".
  11. Iliopoulos, Costas S.; Rahman, M. Sohel (2007). "Pattern Matching Algorithms with Don't Cares" (PDF). SOFSEM 2007: Theory and Practice of Computer Science, 33rd Conference on Current Trends in Theory and Practice of Computer Science. Harrachov, Czech Republic. S2CID   14538871. Archived from the original (PDF) on 2019-12-17.
  12. Clifford, Peter; Clifford, Raphaël (January 2007). "Simple deterministic wildcard matching". Information Processing Letters. 101 (2): 53–54. doi:10.1016/j.ipl.2006.08.002.
  13. Wu, Xindong; Qiang, Ji-Peng; Xie, Fei (12 September 2014). "Pattern Matching with Flexible Wildcards". Journal of Computer Science and Technology. 29 (5): 740–750. doi:10.1007/s11390-014-1464-3. S2CID   16824910.
  14. 1 2 Salz, Rich (1991). "wildmat.c". GitHub.
  15. Filip (2014). "Compare strings with wildcard". Stack Overflow.
  16. Murugesan, Vignesh (2014). "WildCard Matching algorithm".
  17. 1 2 3 Kurt, Dogan. "Wildcard Matching Methods".
  18. van Rossum, Guido (20 November 2019). "freebsd/lib/libc/gen/fnmatch.c". GitHub . Retrieved 21 November 2019.
  19. "fnmatch.c". opensource.apple.com. 1999.
  20. "fnmatch_internal.c". Beren Minor's Mirrors. 21 November 2019.
  21. 1 2 "git/git: wildmatch.c". GitHub. 2020-01-20.
  22. 1 2 "uwildmat.c in trunk/lib – INN". inn.eyrie.org. Retrieved 27 November 2019.
  23. Krauss, Kirk (2018). "Matching Wildcards: An Improved Algorithm for Big Data". Develop for Performance.
  24. Siler (2013). "Recursive solutions for glob pattern matching". Stack Overflow.
  25. Handy, Jack (2005). "Wildcard string compare (globbing)". Code Project.
  26. Cox, Ross. "Regular Expression Matching Can Be Simple And Fast".
  27. Navarro, Gonzalo (10 November 2001). "NR-grep: a fast and flexible pattern-matching tool" (PDF). Software: Practice and Experience. 31 (13): 1265–1312. doi:10.1002/spe.411. S2CID   3175806.