Pattern matching

Last updated

In computer science, pattern matching is the act of checking a given sequence of tokens for the presence of the constituents of some pattern. In contrast to pattern recognition, the match usually has to be exact: "either it will or will not be a match." The patterns generally have the form of either sequences or tree structures. Uses of pattern matching include outputting the locations (if any) of a pattern within a token sequence, to output some component of the matched pattern, and to substitute the matching pattern with some other token sequence (i.e., search and replace).

Contents

Sequence patterns (e.g., a text string) are often described using regular expressions and matched using techniques such as backtracking.

Tree patterns are used in some programming languages as a general tool to process data based on its structure, e.g. C#, [1] F#, [2] Haskell, [3] ML, Python, [4] Ruby, [5] Rust, [6] Scala, [7] Swift [8] and the symbolic mathematics language Mathematica have special syntax for expressing tree patterns and a language construct for conditional execution and value retrieval based on it.

Often it is possible to give alternative patterns that are tried one by one, which yields a powerful conditional programming construct. Pattern matching sometimes includes support for guards.[ citation needed ]

History

Early programming languages with pattern matching constructs include COMIT (1957), SNOBOL (1962), Refal (1968) with tree-based pattern matching, Prolog (1972), SASL (1976), NPL (1977), and KRC (1981).

Many text editors support pattern matching of various kinds: the QED editor supports regular expression search, and some versions of TECO support the OR operator in searches.

Computer algebra systems generally support pattern matching on algebraic expressions. [9]

Primitive patterns

The simplest pattern in pattern matching is an explicit value or a variable. For an example, consider a simple function definition in Haskell syntax (function parameters are not in parentheses but are separated by spaces, = is not assignment but definition):

f0=1

Here, 0 is a single value pattern. Now, whenever f is given 0 as argument the pattern matches and the function returns 1. With any other argument, the matching and thus the function fail. As the syntax supports alternative patterns in function definitions, we can continue the definition extending it to take more generic arguments:

fn=n*f(n-1)

Here, the first n is a single variable pattern, which will match absolutely any argument and bind it to name n to be used in the rest of the definition. In Haskell (unlike at least Hope), patterns are tried in order so the first definition still applies in the very specific case of the input being 0, while for any other argument the function returns n * f (n-1) with n being the argument.

The wildcard pattern (often written as _) is also simple: like a variable name, it matches any value, but does not bind the value to any name. Algorithms for matching wildcards in simple string-matching situations have been developed in a number of recursive and non-recursive varieties. [10]

Tree patterns

More complex patterns can be built from the primitive ones of the previous section, usually in the same way as values are built by combining other values. The difference then is that with variable and wildcard parts, a pattern doesn't build into a single value, but matches a group of values that are the combination of the concrete elements and the elements that are allowed to vary within the structure of the pattern.

A tree pattern describes a part of a tree by starting with a node and specifying some branches and nodes and leaving some unspecified with a variable or wildcard pattern. It may help to think of the abstract syntax tree of a programming language and algebraic data types.

In Haskell, the following line defines an algebraic data type Color that has a single data constructor ColorConstructor that wraps an integer and a string.

dataColor=ColorConstructorIntegerString

The constructor is a node in a tree and the integer and string are leaves in branches.

When we want to write functions to make Color an abstract data type, we wish to write functions to interface with the data type, and thus we want to extract some data from the data type, for example, just the string or just the integer part of Color.

If we pass a variable that is of type Color, how can we get the data out of this variable? For example, for a function to get the integer part of Color, we can use a simple tree pattern and write:

integerPart(ColorConstructortheInteger_)=theInteger

As well:

stringPart(ColorConstructor_theString)=theString

The creations of these functions can be automated by Haskell's data record syntax.

This Ocaml example which defines a red-black tree and a function to re-balance it after element insertion shows how to match on a more complex structure generated by a recursive data type. The compiler verifies at compile-time that the list of cases is exhaustive and none are redundant.

typecolor=Red|Blacktype'atree=Empty|Treeofcolor*'atree*'a*'atreeletrebalancet=matchtwith|Tree(Black,Tree(Red,Tree(Red,a,x,b),y,c),z,d)|Tree(Black,Tree(Red,a,x,Tree(Red,b,y,c)),z,d)|Tree(Black,a,x,Tree(Red,Tree(Red,b,y,c),z,d))|Tree(Black,a,x,Tree(Red,b,y,Tree(Red,c,z,d)))->Tree(Red,Tree(Black,a,x,b),y,Tree(Black,c,z,d))|_->t(* the 'catch-all' case if no previous pattern matches *)

Filtering data with patterns

Pattern matching can be used to filter data of a certain structure. For instance, in Haskell a list comprehension could be used for this kind of filtering:

[Ax|Ax<-[A1,B1,A2,B2]]

evaluates to

[A 1, A 2]

Pattern matching in Mathematica

In Mathematica, the only structure that exists is the tree, which is populated by symbols. In the Haskell syntax used thus far, this could be defined as

dataSymbolTree=SymbolString[SymbolTree]

An example tree could then look like

Symbol"a"[Symbol"b"[],Symbol"c"[]]

In the traditional, more suitable syntax, the symbols are written as they are and the levels of the tree are represented using [], so that for instance a[b,c] is a tree with a as the parent, and b and c as the children.

A pattern in Mathematica involves putting "_" at positions in that tree. For instance, the pattern

A[_]

will match elements such as A[1], A[2], or more generally A[x] where x is any entity. In this case, A is the concrete element, while _ denotes the piece of tree that can be varied. A symbol prepended to _ binds the match to that variable name while a symbol appended to _ restricts the matches to nodes of that symbol. Note that even blanks themselves are internally represented as Blank[] for _ and Blank[x] for _x.

The Mathematica function Cases filters elements of the first argument that match the pattern in the second argument: [11]

Cases[{a[1],b[1],a[2],b[2]},a[_]]

evaluates to

{a[1],a[2]}

Pattern matching applies to the structure of expressions. In the example below,

Cases[{a[b],a[b,c],a[b[c],d],a[b[c],d[e]],a[b[c],d,e]},a[b[_],_]]

returns

{a[b[c],d],a[b[c],d[e]]}

because only these elements will match the pattern a[b[_],_] above.

In Mathematica, it is also possible to extract structures as they are created in the course of computation, regardless of how or where they appear. The function Trace can be used to monitor a computation, and return the elements that arise which match a pattern. For example, we can define the Fibonacci sequence as

fib[0|1]:=1fib[n_]:=fib[n-1]+fib[n-2]

Then, we can ask the question: Given fib[3], what is the sequence of recursive Fibonacci calls?

Trace[fib[3],fib[_]]

returns a structure that represents the occurrences of the pattern fib[_] in the computational structure:

{fib[3],{fib[2],{fib[1]},{fib[0]}},{fib[1]}}

Declarative programming

In symbolic programming languages, it is easy to have patterns as arguments to functions or as elements of data structures. A consequence of this is the ability to use patterns to declaratively make statements about pieces of data and to flexibly instruct functions how to operate.

For instance, the Mathematica function Compile can be used to make more efficient versions of the code. In the following example the details do not particularly matter; what matters is that the subexpression {{com[_], Integer}} instructs Compile that expressions of the form com[_] can be assumed to be integers for the purposes of compilation:

com[i_]:=Binomial[2i,i]Compile[{x,{i,_Integer}},x^com[i],{{com[_],Integer}}]

Mailboxes in Erlang also work this way.

The Curry–Howard correspondence between proofs and programs relates ML-style pattern matching to case analysis and proof by exhaustion.

Pattern matching and strings

By far the most common form of pattern matching involves strings of characters. In many programming languages, a particular syntax of strings is used to represent regular expressions, which are patterns describing string characters.

However, it is possible to perform some string pattern matching within the same framework that has been discussed throughout this article.

Tree patterns for strings

In Mathematica, strings are represented as trees of root StringExpression and all the characters in order as children of the root. Thus, to match "any amount of trailing characters", a new wildcard ___ is needed in contrast to _ that would match only a single character.

In Haskell and functional programming languages in general, strings are represented as functional lists of characters. A functional list is defined as an empty list, or an element constructed on an existing list. In Haskell syntax:

[]-- an empty listx:xs-- an element x constructed on a list xs

The structure for a list with some elements is thus element:list. When pattern matching, we assert that a certain piece of data is equal to a certain pattern. For example, in the function:

head(element:list)=element

We assert that the first element of head's argument is called element, and the function returns this. We know that this is the first element because of the way lists are defined, a single element constructed onto a list. This single element must be the first. The empty list would not match the pattern at all, as an empty list does not have a head (the first element that is constructed).

In the example, we have no use for list, so we can disregard it, and thus write the function:

head(element:_)=element

The equivalent Mathematica transformation is expressed as

head[element, ]:=element

Example string patterns

In Mathematica, for instance,

StringExpression["a",_]

will match a string that has two characters and begins with "a".

The same pattern in Haskell:

['a',_]

Symbolic entities can be introduced to represent many different classes of relevant features of a string. For instance,

StringExpression[LetterCharacter, DigitCharacter]

will match a string that consists of a letter first, and then a number.

In Haskell, guards could be used to achieve the same matches:

[letter,digit]|isAlphaletter&&isDigitdigit

The main advantage of symbolic string manipulation is that it can be completely integrated with the rest of the programming language, rather than being a separate, special purpose subunit. The entire power of the language can be leveraged to build up the patterns themselves or analyze and transform the programs that contain them.

SNOBOL

SNOBOL (StriNg Oriented and symBOlic Language) is a computer programming language developed between 1962 and 1967 at AT&T Bell Laboratories by David J. Farber, Ralph E. Griswold and Ivan P. Polonsky.

SNOBOL4 stands apart from most programming languages by having patterns as a first-class data type (i.e. a data type whose values can be manipulated in all ways permitted to any other data type in the programming language) and by providing operators for pattern concatenation and alternation. Strings generated during execution can be treated as programs and executed.

SNOBOL was quite widely taught in larger US universities in the late 1960s and early 1970s and was widely used in the 1970s and 1980s as a text manipulation language in the humanities.

Since SNOBOL's creation, newer languages such as Awk and Perl have made string manipulation by means of regular expressions fashionable. SNOBOL4 patterns, however, subsume BNF grammars, which are equivalent to context-free grammars and more powerful than regular expressions. [12]

See also

Related Research Articles

<span class="mw-page-title-main">Erlang (programming language)</span> Programming language

Erlang is a general-purpose, concurrent, functional high-level programming language, and a garbage-collected runtime system. The term Erlang is used interchangeably with Erlang/OTP, or Open Telecom Platform (OTP), which consists of the Erlang runtime system, several ready-to-use components (OTP) mainly written in Erlang, and a set of design principles for Erlang programs.

Icon is a very high-level programming language based on the concept of "goal-directed execution" in which code returns a "success" along with valid values, or a "failure", indicating that there is no valid data to return. The success and failure of a given block of code is used to direct further processing, whereas conventional languages would typically use boolean logic written by the programmer to achieve the same ends. Because the logic for basic control structures is often implicit in Icon, common tasks can be completed with less explicit code.

In programming language theory, lazy evaluation, or call-by-need, is an evaluation strategy which delays the evaluation of an expression until its value is needed and which also avoids repeated evaluations.

ML is a functional programming language. It is known for its use of the polymorphic Hindley–Milner type system, which automatically assigns the types of most expressions without requiring explicit type annotations, and ensures type safety – there is a formal proof that a well-typed ML program does not cause runtime type errors. ML provides pattern matching for function arguments, garbage collection, imperative programming, call-by-value and currying. While a general-purpose programming language, ML is used heavily in programming language research and is one of the few languages to be completely specified and verified using formal semantics. Its types and pattern matching make it well-suited and commonly used to operate on other formal languages, such as in compiler writing, automated theorem proving, and formal verification.

SNOBOL is a series of programming languages developed between 1962 and 1967 at AT&T Bell Laboratories by David J. Farber, Ralph E. Griswold and Ivan P. Polonsky, culminating in SNOBOL4. It was one of a number of text-string-oriented languages developed during the 1950s and 1960s; others included COMIT and TRAC.

OCaml is a general-purpose, high-level multi-paradigm programming language which extends the Caml dialect of ML with object-oriented features. OCaml was created in 1996 by Xavier Leroy, Jérôme Vouillon, Damien Doligez, Didier Rémy, Ascánder Suárez, and others.

Generic programming is a style of computer programming in which algorithms are written in terms of data types to-be-specified-later that are then instantiated when needed for specific types provided as parameters. This approach, pioneered by the ML programming language in 1973, permits writing common functions or types that differ only in the set of types on which they operate when used, thus reducing duplicate code.

<span class="mw-page-title-main">F Sharp (programming language)</span> Microsoft programming language

F# is a functional-first, general-purpose, strongly typed, multi-paradigm programming language that encompasses functional, imperative, and object-oriented programming methods. It is most often used as a cross-platform Common Language Infrastructure (CLI) language on .NET, but can also generate JavaScript and graphics processing unit (GPU) code.

In computer programming, especially functional programming and type theory, an algebraic data type (ADT) is a kind of composite type, i.e., a type formed by combining other types.

In computer science, a tagged union, also called a variant, variant record, choice type, discriminated union, disjoint union, sum type or coproduct, is a data structure used to hold a value that could take on several different, but fixed, types. Only one of the types can be in use at any one time, and a tag field explicitly indicates which one is in use. It can be thought of as a type that has several "cases", each of which should be handled correctly when that type is manipulated. This is critical in defining recursive datatypes, in which some component of a value may have the same type as that value, for example in defining a type for representing trees, where it is necessary to distinguish multi-node subtrees and leaves. Like ordinary unions, tagged unions can save storage by overlapping storage areas for each type, since only one is in use at a time.

<span class="mw-page-title-main">Conditional (computer programming)</span> Control flow statement that executes code according to some condition(s)

In computer science, conditionals are programming language commands for handling decisions. Specifically, conditionals perform different computations or actions depending on whether a programmer-defined Boolean condition evaluates to true or false. In terms of control flow, the decision is always achieved by selectively altering the control flow based on some condition . Although dynamic dispatch is not usually classified as a conditional construct, it is another way to select between alternatives at runtime. Conditional statements are the checkpoints in the programe that determines behaviour according to situation.

In computer programming, operators are constructs defined within programming languages which behave generally like functions, but which differ syntactically or semantically.

In computer programming, a guard is a boolean expression that must evaluate to true if the program execution is to continue in the branch in question. Regardless of which programming language is used, a guard clause, guard code, or guard statement, is a check of integrity preconditions used to avoid errors during execution.

In computer programming, homoiconicity is a property of some programming languages. A language is homoiconic if a program written in it can be manipulated as data using the language, and thus the program's internal representation can be inferred just by reading the program itself. This property is often summarized by saying that the language treats code as data.

In computer programming, an anonymous function is a function definition that is not bound to an identifier. Anonymous functions are often arguments being passed to higher-order functions or used for constructing the result of a higher-order function that needs to return a function. If the function is only used once, or a limited number of times, an anonymous function may be syntactically lighter than using a named function. Anonymous functions are ubiquitous in functional programming languages and other languages with first-class functions, where they fulfil the same role for the function type as literals do for other data types.

Refal "is a functional programming language oriented toward symbolic computations", including "string processing, language translation, [and] artificial intelligence". It is one of the oldest members of this family, first conceived of in 1966 as a theoretical tool, with the first implementation appearing in 1968. Refal was intended to combine mathematical simplicity with practicality for writing large and sophisticated programs.

This article describes the features in the programming language Haskell.

Nemerle is a general-purpose, high-level, statically typed programming language designed for platforms using the Common Language Infrastructure (.NET/Mono). It offers functional, object-oriented, aspect-oriented, reflective and imperative features. It has a simple C#-like syntax and a powerful metaprogramming system.

Racket has been under active development as a vehicle for programming language research since the mid-1990s, and has accumulated many features over the years. This article describes and demonstrates some of these features. Note that one of Racket's main design goals is to accommodate creating new languages, both domain-specific languages and completely new languages. Therefore, some of the following examples are in different languages, but they are all implemented in Racket. Please refer to the main article for more information.

<span class="mw-page-title-main">Elm (programming language)</span> Functional programming language

Elm is a domain-specific programming language for declaratively creating web browser-based graphical user interfaces. Elm is purely functional, and is developed with emphasis on usability, performance, and robustness. It advertises "no runtime exceptions in practice", made possible by the Elm compiler's static type checking.

References

  1. "Pattern Matching - C# Guide".
  2. "Pattern Matching - F# Guide".
  3. A Gentle Introduction to Haskell: Patterns
  4. "What's New In Python 3.10 — Python 3.10.0b3 documentation". docs.python.org. Retrieved 2021-07-06.
  5. "pattern_matching - Documentation for Ruby 3.0.0". docs.ruby-lang.org. Retrieved 2021-07-06.
  6. "Pattern Syntax - The Rust Programming language".
  7. "Pattern Matching". Scala Documentation. Retrieved 2021-01-17.
  8. "Patterns — The Swift Programming Language (Swift 5.1)".
  9. Joel Moses, "Symbolic Integration", MIT Project MAC MAC-TR-47, December 1967
  10. Cantatore, Alessandro (2003). "Wildcard matching algorithms".
  11. "Cases—Wolfram Language Documentation". reference.wolfram.com. Retrieved 2020-11-17.
  12. Gimpel, J. F. 1973. A theory of discrete patterns and their implementation in SNOBOL4. Commun. ACM 16, 2 (Feb. 1973), 91–100. DOI=http://doi.acm.org/10.1145/361952.361960.