Pattern language (formal languages)

In theoretical computer science, a pattern language is a formal language that can be defined as the set of all instances, built from constants only, of a pattern, i.e. a string of constants and variables. Pattern languages were introduced by Dana Angluin in the context of machine learning. [1]

Definition

Given a finite set Σ of constant symbols and a countable set X of variable symbols disjoint from Σ, a pattern is a finite non-empty string of symbols from Σ∪X. The length of a pattern p, denoted by |p|, is just the number of its symbols. The set of all patterns containing exactly n distinct variables (each of which may occur several times) is denoted by Pn, and the set of all patterns by P*. A substitution is a mapping f: P* → P* that respects concatenation, i.e. f(st) = f(s)f(t) for all patterns s, t, maps each constant symbol to itself, and maps each variable to some pattern. [note 1]
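The following minimal Python sketch illustrates this definition; the tuple-of-symbols representation, the name SIGMA for Σ, and the function apply_substitution are illustrative choices, not part of the standard presentation.

SIGMA = {"0", "1"}   # constant alphabet used in the examples below

def apply_substitution(pattern, mapping):
    """Apply a substitution: constants are mapped to themselves, each variable
    is replaced by the non-empty pattern assigned to it in `mapping`
    (variables without an assignment stay as they are)."""
    out = []
    for sym in pattern:
        if sym in SIGMA:
            out.append(sym)                       # f(a) = a for every constant a
        else:
            out.extend(mapping.get(sym, (sym,)))  # f(x) = the pattern chosen for x
    return tuple(out)

# Substituting 11 for x in the (hypothetical) pattern x0x yields the string 11011:
print("".join(apply_substitution(tuple("x0x"), {"x": tuple("11")})))   # 11011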

If p = f(q) for some patterns p, q ∈ P* and some substitution f, then p is said to be less general than q, written p ≤ q; in that case, necessarily |p| ≥ |q| holds. For a pattern p, its language is defined as the set of all less general patterns that are built from constants only, formally: L(p) = { s ∈ Σ+ : s ≤ p }, where Σ+ denotes the set of all finite non-empty strings of symbols from Σ.
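The relation p ≤ q can be decided by searching for a suitable substitution, as in the following sketch; the backtracking search and the name less_general are illustrative choices, and the procedure is exponential in general, in line with the NP-completeness result stated under Properties.

SIGMA = {"0", "1"}

def less_general(p, q):
    """Return True iff p = f(q) for some substitution f, i.e. p ≤ q.
    Each variable of q is bound to a non-empty factor of p."""
    p, q = tuple(p), tuple(q)

    def match(i, j, binding):
        # i: position in p, j: position in q, binding: variable of q -> factor of p
        if j == len(q):
            return i == len(p)
        sym = q[j]
        if sym in SIGMA:                      # constants of q must occur literally in p
            return i < len(p) and p[i] == sym and match(i + 1, j + 1, binding)
        if sym in binding:                    # variable already bound: factor must repeat
            w = binding[sym]
            return p[i:i + len(w)] == w and match(i + len(w), j + 1, binding)
        return any(                           # bind sym to a non-empty factor p[i:k]
            match(k, j + 1, {**binding, sym: p[i:k]})
            for k in range(i + 1, len(p) + 1)
        )

    return match(0, 0, {})

# For a constant string s, s ≤ p is exactly membership s ∈ L(p):
print(less_general("0101", "xx"))        # True:  0101 = f(x)f(x) with f(x) = 01
print(less_general("0x10xx1", "xxy"))    # False: no substitution maps xxy onto 0x10xx1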

For example, using the constants Σ = { 0, 1 } and the variables X = { x, y, z, ... }, the patterns 0x10xx1 ∈ P1 and xxy ∈ P2 have length 7 and 3, respectively. Instances of the former pattern are 00z100z0z1 and 01z101z1z1, obtained by the substitution that maps x to 0z and to 1z, respectively, and every other symbol to itself. Both 00z100z0z1 and 01z101z1z1 are also instances of xxy; in fact, L(0x10xx1) is a subset of L(xxy). The languages of the patterns x0 and x1 are the sets of all bit strings denoting an even and an odd binary number, respectively. The language of xx is the set of all strings obtainable by concatenating a bit string with itself, e.g. 00, 11, 0101, 1010, 11101110 ∈ L(xx).
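The claim that both strings are also instances of xxy can be checked mechanically; the following sketch, using the same tuple representation as in the sketches above, exhibits a splitting f(x) f(x) f(y) for each of them.

inst1 = tuple("00z100z0z1")   # 0x10xx1 with x mapped to 0z
inst2 = tuple("01z101z1z1")   # 0x10xx1 with x mapped to 1z
for inst in (inst1, inst2):
    # find a non-empty prefix f(x) that repeats, leaving a non-empty rest f(y)
    k = next(k for k in range(1, len(inst) // 2 + 1)
             if inst[:k] == inst[k:2 * k] and 2 * k < len(inst))
    print("f(x) =", "".join(inst[:k]), " f(y) =", "".join(inst[2 * k:]))
# prints f(x) = 0 with f(y) = z100z0z1, then f(x) = 01z1 with f(y) = z1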

Properties

Figure: NP-hardness of pattern language membership, by reduction from the NP-complete 1-in-3-SAT problem. Given a CNF of m clauses over n variables, a pattern of length 3n+4m+1 with 2n variables and a string of length 4n+5m+1 can be constructed as shown (m=3 and n=4 in the example). Upper-case variables in the pattern correspond to negated variables in the CNF. The string matches the pattern if and only if an assignment exists such that in each clause exactly one literal is 1 (meaning "true" in the CNF). In the left part, e.g. "0wW0" is matched by "01110" just if one of w, W is matched by "1" (corresponding to "false") and the other by "11" (corresponding to "true"), i.e. if w corresponds to the negation of W. In the right part, e.g. "0xYZ0" is matched by "011110" just if exactly one of x, Y, Z is matched by "11" and the others by "1", i.e. if exactly one literal corresponds to "true".

The problem of deciding whether s ∈ L(p) for an arbitrary string s ∈ Σ+ and pattern p is NP-complete (see the figure above), and hence so is the problem of deciding p ≤ q for arbitrary patterns p, q. [2]
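The following sketch reproduces the reduction as described in the figure caption; the exact placement of the separating 0s and the helper names build_instance and substitute are assumptions made for illustration, and only the easy direction (a 1-in-3 satisfying assignment yields a match) is exercised.

def build_instance(n, clauses):
    """clauses: 3-tuples of non-zero ints, i for literal v_i and -i for its negation.
    Pattern variable x{i} stands for v_i, X{i} for the negated variable."""
    pattern = ["0"]
    for i in range(1, n + 1):                       # left part: 0 x1 X1 0 x2 X2 0 ...
        pattern += [f"x{i}", f"X{i}", "0"]
    for clause in clauses:                          # right part: l1 l2 l3 0 per clause
        pattern += [f"x{l}" if l > 0 else f"X{-l}" for l in clause] + ["0"]
    string = "0" + "1110" * n + "11110" * len(clauses)
    return pattern, string

def substitute(pattern, assignment):
    """Intended substitution: a true variable becomes 11, a false one 1, and the
    upper-case (negated) copy always gets the complementary value."""
    val = {}
    for i, b in assignment.items():
        val[f"x{i}"] = "11" if b else "1"
        val[f"X{i}"] = "1" if b else "11"
    return "".join(val.get(sym, sym) for sym in pattern)

n, clauses = 3, [(1, 2, 3)]                         # a single clause: v1 or v2 or v3
pattern, string = build_instance(n, clauses)
assert len(pattern) == 3 * n + 4 * len(clauses) + 1   # pattern length as in the caption
assert len(string) == 4 * n + 5 * len(clauses) + 1    # string length as in the caption
# With exactly one true literal per clause the substituted pattern equals the string:
print(substitute(pattern, {1: True, 2: False, 3: False}) == string)   # True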

The class of pattern languages is not closed under ...

The class of pattern languages is closed under ...

If p, q ∈ P1 are patterns containing exactly one variable, then p ≤ q if and only if L(p) ⊆ L(q); the same equivalence holds for patterns of equal length. [4] For patterns of different length, the above example p = 0x10xx1 and q = xxy shows that L(p) ⊆ L(q) may hold without implying p ≤ q. However, any two patterns p and q, of arbitrary lengths, generate the same language if and only if they are equal up to consistent variable renaming. [5] Each pattern p is a common generalization of all strings in its generated language L(p), modulo associativity of (⋅).
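Assuming the characterization just quoted, equality of pattern languages can be tested by renaming variables in order of first occurrence, as in the following sketch; the name canonical and the renaming scheme are illustrative choices.

SIGMA = {"0", "1"}

def canonical(pattern):
    """Rename variables to x1, x2, ... in order of first occurrence."""
    names, out = {}, []
    for sym in pattern:
        if sym in SIGMA:
            out.append(sym)
        else:
            names.setdefault(sym, f"x{len(names) + 1}")
            out.append(names[sym])
    return tuple(out)

def same_language(p, q):
    # equal up to consistent variable renaming  <=>  equal canonical forms
    return canonical(p) == canonical(q)

print(same_language(tuple("x0xy"), tuple("z0zw")))   # True: consistent renaming exists
print(same_language(tuple("x0xy"), tuple("x0yy")))   # False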

Location in the Chomsky hierarchy

In a refined Chomsky hierarchy, the class of pattern languages is a proper superclass of the singleton languages [note 2] and a proper subclass of the indexed languages, but incomparable to the language classes in between; for the latter reason, the pattern language class is usually not shown explicitly in presentations of the hierarchy.

The class of pattern languages is incomparable with the class of finite languages, with the class of regular languages, and with the class of context-free languages: for example, the copy language L(xx) = { ww : w ∈ {0,1}+ } is a pattern language that is not context-free (hence neither regular nor finite), while the finite language { 0, 00 } is not a pattern language, since a pattern containing a variable generates infinitely many strings and a variable-free pattern generates exactly one.

Each singleton language is trivially a pattern language, generated by a pattern without variables.

Each pattern language can be produced by an indexed grammar: For example, using Σ = { a, b, c } and X = { x, y }, the language of the pattern axbycxayb is generated by a grammar with nonterminal symbols N = { Sx, Sy, S } ∪ X, terminal symbols T = Σ, index symbols F = { ax, bx, cx, ay, by, cy }, start symbol Sx, and the following production rules:

Sx[σ] → Sx[ax σ] | Sx[bx σ] | Sx[cx σ] | Sy[σ]
Sy[σ] → Sy[ay σ] | Sy[by σ] | Sy[cy σ] | S[σ]
S[σ]  → a x[σ] b y[σ] c x[σ] a y[σ] b
x[ax σ] → a x[σ]        y[ax σ] → y[σ]
x[bx σ] → b x[σ]        y[bx σ] → y[σ]
x[cx σ] → c x[σ]        y[cx σ] → y[σ]
x[ay σ] → x[σ]          y[ay σ] → a y[σ]
x[by σ] → x[σ]          y[by σ] → b y[σ]
x[cy σ] → x[σ]          y[cy σ] → c y[σ]
x[] → ε                 y[] → ε

An example derivation is:

Sx[] ⇒ Sx[bx] ⇒ Sx[ax bx] ⇒ Sy[ax bx] ⇒ Sy[cy ax bx] ⇒ S[cy ax bx]
⇒ a x[cy ax bx] b y[cy ax bx] c x[cy ax bx] a y[cy ax bx] b
⇒ a x[ax bx] b y[cy ax bx] c x[cy ax bx] a y[cy ax bx] b
⇒ aa x[bx] b y[cy ax bx] c x[cy ax bx] a y[cy ax bx] b
⇒ aab x[] b y[cy ax bx] c x[cy ax bx] a y[cy ax bx] b
⇒ aabb y[cy ax bx] c x[cy ax bx] a y[cy ax bx] b
⇒ ... ⇒ aabbc y[] c x[cy ax bx] a y[cy ax bx] b
⇒ aabbcc x[cy ax bx] a y[cy ax bx] b
⇒ ... ⇒ aabbccab x[] a y[cy ax bx] b
⇒ aabbccaba y[cy ax bx] b
⇒ ... ⇒ aabbccabac y[] b
⇒ aabbccabacb
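The expansion of x and y in this derivation depends only on the index word cy ax bx pushed while guessing the substitution; the following sketch replays that expansion (the helper expand is an illustrative name, and the terminals are assembled according to the pattern axbycxayb).

def expand(var, stack):
    """Expand nonterminal `var` ('x' or 'y') with the given index stack:
    a flag like 'ax' makes x emit 'a' and is skipped by y, and vice versa."""
    out = []
    for flag in stack:
        if flag[1] == var:          # flag addressed to this variable
            out.append(flag[0])     # emit its terminal symbol
        # flags addressed to the other variable are consumed silently (x[ay σ] → x[σ] etc.)
    return "".join(out)

stack = ["cy", "ax", "bx"]          # the index word built in the derivation above
x, y = expand("x", stack), expand("y", stack)
print(x, y)                                            # ab c
print("a" + x + "b" + y + "c" + x + "a" + y + "b")     # aabbccabacb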

In a similar way, an indexed grammar can be constructed from any pattern.
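One way to spell this out is the following sketch, which prints production rules following the same scheme as the example above (S_x stands for Sx, s for the index word σ, and eps for ε; the function name grammar_from_pattern is an illustrative choice).

def grammar_from_pattern(pattern, sigma, variables):
    """Print indexed-grammar productions for the language of `pattern`."""
    vs = list(variables)
    # Chain of start symbols: S_v guesses, constant by constant, the string
    # substituted for variable v and records it in the index word.
    for i, v in enumerate(vs):
        nxt = f"S_{vs[i + 1]}" if i + 1 < len(vs) else "S"
        rhs = " | ".join(f"S_{v}[{a}{v} s]" for a in sigma) + f" | {nxt}[s]"
        print(f"S_{v}[s] -> {rhs}")
    # The pattern itself, with every variable occurrence carrying the full index word.
    print("S[s] -> " + " ".join(sym if sym in sigma else f"{sym}[s]" for sym in pattern))
    # Each variable replays exactly the flags addressed to it and skips the rest.
    for v in vs:
        for w in vs:
            for a in sigma:
                out = f"{a} {v}[s]" if v == w else f"{v}[s]"
                print(f"{v}[{a}{w} s] -> {out}")
        print(f"{v}[] -> eps")

grammar_from_pattern("axbycxayb", sigma="abc", variables="xy")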

Learning patterns

Given a sample set S of strings, a pattern p is called descriptive of S if S ⊆ L(p), but S ⊆ L(q) ⊂ L(p) does not hold for any other pattern q.
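Since S ⊆ L(p) forces |p| to be at most the length of a shortest string in S, the patterns consistent with S form a finite candidate set (up to variable renaming) that contains every descriptive pattern of S. The following brute-force sketch enumerates that candidate set; it is an illustration only, not Angluin's algorithm, and selecting a descriptive pattern from the candidates still requires comparing their languages.

from itertools import product

SIGMA = ("0", "1")

def member(s, pattern):
    """Backtracking test for s ∈ L(pattern); variables bind to non-empty strings."""
    def go(i, j, env):
        if j == len(pattern):
            return i == len(s)
        sym = pattern[j]
        if sym in SIGMA:
            return i < len(s) and s[i] == sym and go(i + 1, j + 1, env)
        if sym in env:
            return s.startswith(env[sym], i) and go(i + len(env[sym]), j + 1, env)
        return any(go(k, j + 1, {**env, sym: s[i:k]}) for k in range(i + 1, len(s) + 1))
    return go(0, 0, {})

def consistent_candidates(S):
    """All canonically named patterns p (variables x1, x2, ... in order of first
    occurrence) with |p| <= min |s| over S and S ⊆ L(p)."""
    bound = min(len(s) for s in S)
    found = []
    for length in range(1, bound + 1):
        alphabet = SIGMA + tuple(f"x{i}" for i in range(1, length + 1))
        for p in product(alphabet, repeat=length):
            # keep only canonically named patterns, to skip renamed duplicates
            seen = [v for v in dict.fromkeys(p) if v not in SIGMA]
            if seen != [f"x{i}" for i in range(1, len(seen) + 1)]:
                continue
            if all(member(s, p) for s in S):
                found.append("".join(p))
    return found

print(consistent_candidates({"00", "0000"}))   # ['x1', '0x1', 'x10', 'x1x1', 'x1x2']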

Given any sample set S, a descriptive pattern for S can be computed by an algorithm of Angluin. [6]

Based on this algorithm, the class of pattern languages can be identified in the limit from positive examples. [7]

Notes

  1. Angluin's notion of substitution differs from the usual notion of string substitution.
  2. i.e. languages consisting of a single string; they correspond to straight-line grammars

References

  1. Dana Angluin (1980). "Finding Patterns Common to a Set of Strings". Journal of Computer and System Sciences. 21: 46–62. doi:10.1016/0022-0000(80)90041-0.
  2. Angluin (1980), Theorem 3.6, p. 50; Corollary 3.7, p. 52
  3. Angluin (1980), Theorem 3.10, p. 53
  4. Angluin (1980), Lemma 3.9, p. 52; Corollary 3.4, p. 50
  5. Angluin (1980), Theorem 3.5, p. 50
  6. Angluin (1980), Theorem 4.1, p. 53
  7. Dana Angluin (1980). "Inductive Inference of Formal Languages from Positive Data" (PDF). Information and Control. 45 (2): 117–135. doi:10.1016/s0019-9958(80)90285-5; here: Example 1, p. 125