Krauss wildcard-matching algorithm

Last updated

In computer science, the Krauss wildcard-matching algorithm is a pattern matching algorithm. Based on the wildcard syntax in common use, e.g. in the Microsoft Windows command-line interface, the algorithm provides a non-recursive mechanism for matching patterns in software applications, based on syntax simpler than that typically offered by regular expressions.

Contents

History

The algorithm is based on a history of development, correctness and performance testing, and programmer feedback that began with an unsuccessful search for a reliable non-recursive algorithm for matching wildcards. An initial algorithm, implemented in a single while loop, quickly prompted comments from software developers, leading to improvements. [1] Ongoing comments and suggestions [2] [3] culminated in a revised algorithm still implemented in a single while loop but refined based on a collection of test cases and a performance profiler. [4] The experience tuning the single while loop using the profiler prompted development of a two-loop strategy that achieved further performance gains, particularly in situations involving empty input strings or input containing no wildcard characters. [5] The two-loop algorithm is available for use by the open-source software development community, under the terms of the Apache License v. 2.0, and is accompanied by test case code.

Usage

The algorithm made available under the Apache license is implemented in both pointer-based C++ and portable C++ (implemented without pointers). The test case code, also available under the Apache license, can be applied to any algorithm that provides the pattern matching operations below. The implementation as coded is unable to handle multibyte character sets and poses problems when the text being searched may contain multiple incompatible character sets.

Pattern matching operations

The algorithm supports three pattern matching operations:

Examples

Applications

The original algorithm has been ported to the DataFlex programming language by Larry Heiges [6] for use with Data Access Worldwide code library. It has been posted on GitHub in modified form as part of a log file reader. [7] The 2014 algorithm is part of the Unreal Model Viewer built into the Epic Games Unreal Engine game engine. [8] [9]

See also

Related Research Articles

Regular expression Sequence of characters that forms a search pattern

A regular expression, regex or regexp is a sequence of characters that define a search pattern. Usually such patterns are used by string searching algorithms for "find" or "find and replace" operations on strings, or for input validation. It is a technique developed in theoretical computer science and formal language theory.

In computer science, string-searching algorithms, sometimes called string-matching algorithms, are an important class of string algorithms that try to find a place where one or several strings are found within a larger string or text.

SNOBOL is a series of programming languages developed between 1962 and 1967 at AT&T Bell Laboratories by David J. Farber, Ralph E. Griswold and Ivan P. Polonsky, culminating in SNOBOL4. It was one of a number of text-string-oriented languages developed during the 1950s and 1960s; others included COMIT and TRAC.

In software, a wildcard character is a kind of placeholder represented by a single character, such as an asterisk, which can be interpreted as a number of literal characters or an empty string. It is often used in file searches so the full name need not be typed.

Standard ML (SML) is a general-purpose, modular, functional programming language with compile-time type checking and type inference. It is popular among compiler writers and programming language researchers, as well as in the development of theorem provers.

F Sharp (programming language) Microsoft programming language

F# is a general purpose, strongly typed, multi-paradigm programming language that encompasses functional, imperative, and object-oriented programming methods. F# is most often used as a cross-platform Common Language Infrastructure (CLI) language, but it can also generate JavaScript and graphics processing unit (GPU) code.

In computer science, the Knuth–Morris–Pratt string-searching algorithm searches for occurrences of a "word" W within a main "text string" S by employing the observation that when a mismatch occurs, the word itself embodies sufficient information to determine where the next match could begin, thus bypassing re-examination of previously matched characters.

In computer science, pattern matching is the act of checking a given sequence of tokens for the presence of the constituents of some pattern. In contrast to pattern recognition, the match usually has to be exact: "either it will or will not be a match." The patterns generally have the form of either sequences or tree structures. Uses of pattern matching include outputting the locations of a pattern within a token sequence, to output some component of the matched pattern, and to substitute the matching pattern with some other token sequence.

In the C++ programming language, the C++ Standard Library is a collection of classes and functions, which are written in the core language and part of the C++ ISO Standard itself.

In computer programming, glob patterns specify sets of filenames with wildcard characters. For example, the Unix Bash shell command mv *.txt textfiles/ moves all files with names ending in .txt from the current directory to the directory textfiles. Here, * is a wildcard standing for "any string of characters" and *.txt is a glob pattern. The other common wildcard is the question mark (?), which stands for one character.

In computer science, the Boyer–Moore string-search algorithm is an efficient string-searching algorithm that is the standard benchmark for practical string-search literature. It was developed by Robert S. Boyer and J Strother Moore in 1977. The algorithm preprocesses the string being searched for, but not the string being searched in. It is thus well-suited for applications in which the pattern is much shorter than the text or where it persists across multiple searches. The Boyer–Moore algorithm uses information gathered during the preprocess step to skip sections of the text, resulting in a lower constant factor than many other string search algorithms. In general, the algorithm runs faster as the pattern length increases. The key features of the algorithm are to match on the tail of the pattern rather than the head, and to skip along the text in jumps of multiple characters rather than searching every single character in the text.

wildmat is a pattern matching library developed by Rich Salz. Based on the wildcard syntax already used in the Bourne shell, wildmat provides a uniform mechanism for matching patterns across applications with simpler syntax than that typically offered by regular expressions. Patterns are implicitly anchored at the beginning and end of each string when testing for a match.

In Unix-like and some other operating systems, find is a command-line utility that locates files based on some user-specified criteria and then applies some requested action on each matched object.

Perl Compatible Regular Expressions (PCRE) is a library written in C, which implements a regular expression engine, inspired by the capabilities of the Perl programming language. Philip Hazel started writing PCRE in summer 1997. PCRE's syntax is much more powerful and flexible than either of the POSIX regular expression flavors and than that of many other regular-expression libraries.

In computer science, the Boyer–Moore–Horspool algorithm or Horspool's algorithm is an algorithm for finding substrings in strings. It was published by Nigel Horspool in 1980 as SBM.

Recursion (computer science) Use of functions that call themselves (on smaller inputs)

Recursion in computer science is a method of solving a problem where the solution depends on solutions to smaller instances of the same problem. Such problems can generally be solved by iteration, but this needs to identify and index the smaller instances at programming time. At the opposite, recursion solves such recursive problems by using functions that call themselves from within their own code. The approach can be applied to many types of problems, and recursion is one of the central ideas of computer science.

The power of recursion evidently lies in the possibility of defining an infinite set of objects by a finite statement. In the same manner, an infinite number of computations can be described by a finite recursive program, even if this program contains no explicit repetitions.

In computer science, empirical algorithmics is the practice of using empirical methods to study the behavior of algorithms. The practice combines algorithm development and experimentation: algorithms are not just designed, but also implemented and tested in a variety of situations. In this process, an initial design of an algorithm is analyzed so that the algorithm may be developed in a stepwise manner.

In computer science, an algorithm for matching wildcards is useful in comparing text strings that may contain wildcard syntax. Common uses of these algorithms include command-line interfaces, e.g. the Bourne shell or Microsoft Windows command-line or text editor or file manager, as well as the interfaces for some search engines and databases. Wildcard matching is a subset of the problem of matching regular expressions and string matching in general.

Gestalt Pattern Matching, also Ratcliff/Obershelp Pattern Recognition, is a string-matching algorithm for determining the similarity of two strings. It was developed in 1983 by John W. Ratcliff and John A. Obershelp and published in the Dr. Dobb's Journal in July 1988.

RE/flex is a free and open source computer program written in C++ that generates fast lexical analyzers in C++. RE/flex offers full Unicode support, indentation anchors, word boundaries, lazy quantifiers, and performance tuning options. RE/flex accepts Flex lexer specifications and offers options to generate scanners for Bison parsers. RE/flex also includes a fast regular expression library to optimize pattern matching in any C++ application.

References

  1. Krauss, Kirk (2008). "Matching Wildcards: An Algorithm". Dr. Dobb's Journal .
  2. "wild card searching". alt.os.development. 2008.
  3. T.J. (2014). "wild card matching in text string". Stack Overflow.
  4. Krauss, Kirk (2014). "Matching Wildcards: An Empirical Way to Tame an Algorithm". Dr. Dobb's Journal .
  5. Krauss, Kirk (2018). "Matching Wildcards: An Improved Algorithm for Big Data". Develop for Performance.
  6. Heiges, Larry (2008). "Text compare function - generalTextCompare.txt". Data Access Worldwide Code Library.
  7. Deniskore (2013). "Deniskore/wildcard/CLogReader.cpp". Popular repositories. GitHub. Lines 173-279.
  8. gildor2 (2016). "UModel/Core/Core.cpp". Unreal Engine Model Viewer (UE Viewer). GitHub. Lines 334-435.
  9. gildor2 (2016). "History for UModel/Core/Core.cpp". Unreal Engine Model Viewer (UE Viewer).