Grammar induction

Grammar induction (or grammatical inference) [1] is the process in machine learning of learning a formal grammar (usually as a collection of re-write rules or productions or alternatively as a finite state machine or automaton of some kind) from a set of observations, thus constructing a model which accounts for the characteristics of the observed objects. More generally, grammatical inference is that branch of machine learning where the instance space consists of discrete combinatorial objects such as strings, trees and graphs.

Grammar classes

Grammatical inference has often been very focused on the problem of learning finite state machines of various types (see the article Induction of regular languages for details on these approaches), since there have been efficient algorithms for this problem since the 1980s.

Since the beginning of the century, these approaches have been extended to the problem of inference of context-free grammars and richer formalisms, such as multiple context-free grammars and parallel multiple context-free grammars. Other classes of grammars for which grammatical inference has been studied are combinatory categorial grammars, [2] stochastic context-free grammars, [3] contextual grammars and pattern languages.

Learning models

The simplest form of learning is where the learning algorithm merely receives a set of examples drawn from the language in question: the aim is to learn the language from examples of it (and, rarely, from counter-examples, that is, examples that do not belong to the language). However, other learning models have been studied. One frequently studied alternative is the case where the learner can ask membership queries, as in the exact query learning model or minimally adequate teacher model introduced by Angluin. [4]
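
The interaction in the membership-query setting can be illustrated with a small, self-contained Python sketch. It is not Angluin's L* algorithm; it merely shows a learner narrowing down a finite set of candidate languages by querying a membership oracle, and all function names and candidate languages below are illustrative assumptions rather than a standard API.

    # Toy illustration of membership queries (not Angluin's L* algorithm):
    # the learner keeps only the candidate languages consistent with the
    # teacher's answers on strings where the candidates disagree.
    from itertools import product

    def strings_up_to(alphabet, max_len):
        """Enumerate all strings over `alphabet` of length 0..max_len."""
        for n in range(max_len + 1):
            for tup in product(alphabet, repeat=n):
                yield "".join(tup)

    def learn_by_membership_queries(candidates, oracle, alphabet, max_len=6):
        """`candidates` maps a name to a membership predicate; `oracle(w)`
        answers whether w belongs to the (unknown) target language."""
        alive = dict(candidates)
        for w in strings_up_to(alphabet, max_len):
            if len(alive) <= 1:
                break
            answers = {name: pred(w) for name, pred in alive.items()}
            if len(set(answers.values())) > 1:        # candidates disagree on w
                truth = oracle(w)                     # ask a membership query
                alive = {n: p for n, p in alive.items() if answers[n] == truth}
        return alive

    # Example: distinguish "even number of a's" from "contains ab" over {a, b}.
    candidates = {
        "even_a": lambda w: w.count("a") % 2 == 0,
        "has_ab": lambda w: "ab" in w,
    }
    target = candidates["even_a"]                     # hidden from the learner
    print(learn_by_membership_queries(candidates, target, "ab"))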

Methodologies

There is a wide variety of methods for grammatical inference. Two of the classic sources are Fu (1977) and Fu (1982). Duda, Hart & Stork (2001) also devote a brief section to the problem, and cite a number of references. The basic trial-and-error method they present is discussed below. For approaches to infer subclasses of regular languages in particular, see Induction of regular languages. A more recent textbook is de la Higuera (2010), [1] which covers the theory of grammatical inference of regular languages and finite state automata. D'Ulizia, Ferri and Grifoni [5] provide a survey that explores grammatical inference methods for natural languages.

Induction of probabilistic grammars

There are several methods for induction of probabilistic context-free grammars. [6] [7] [ further explanation needed ]

Grammatical inference by trial-and-error

The method proposed in Section 8.7 of Duda, Hart & Stork (2001) suggests successively guessing grammar rules (productions) and testing them against positive and negative observations. The rule set is expanded so as to be able to generate each positive example, but if a given rule set also generates a negative example, it must be discarded. This particular approach can be characterized as "hypothesis testing" and bears some similarity to Mitchell's version space algorithm. The Duda, Hart & Stork (2001) text provides a simple example which nicely illustrates the process, but the feasibility of such an unguided trial-and-error approach for more substantial problems is dubious.
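
A minimal sketch of this trial-and-error idea is given below. The grammar encoding, the pool of guessed productions and the naive bounded-length derivability check are illustrative assumptions, not taken from Duda, Hart & Stork.

    # Hypothesis testing over guessed productions: keep a rule set only if it
    # derives every positive example and no negative example (up to a length
    # bound, assuming no epsilon rules).
    from itertools import combinations

    def derivable(rules, start, max_len):
        """All terminal strings of length <= max_len derivable from `start`."""
        seen, strings, frontier = set(), set(), [(start,)]
        while frontier:
            form = frontier.pop()
            if form in seen or len(form) > max_len:
                continue
            seen.add(form)
            nts = [i for i, s in enumerate(form) if s in rules]
            if not nts:
                strings.add("".join(form))
                continue
            i = nts[0]                      # expand the leftmost nonterminal
            for rhs in rules[form[i]]:
                frontier.append(form[:i] + rhs + form[i + 1:])
        return strings

    def search(rule_pool, positives, negatives, start="S", max_rules=3):
        max_len = max(map(len, positives))
        for k in range(1, max_rules + 1):
            for subset in combinations(rule_pool, k):
                rules = {}
                for lhs, rhs in subset:
                    rules.setdefault(lhs, []).append(rhs)
                lang = derivable(rules, start, max_len)
                if set(positives) <= lang and not lang & set(negatives):
                    return subset
        return None

    # Guessed productions for strings of the form a^n b^n.
    pool = [("S", ("a", "b")), ("S", ("a", "S", "b")), ("S", ("a",)), ("S", ("b",))]
    print(search(pool, positives=["ab", "aabb"], negatives=["a", "b", "abab"]))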

Grammatical inference by genetic algorithms

Grammatical induction using evolutionary algorithms is the process of evolving a representation of the grammar of a target language through some evolutionary process. Formal grammars can easily be represented as tree structures of production rules that can be subjected to evolutionary operators. Algorithms of this sort stem from the genetic programming paradigm pioneered by John Koza.[ citation needed ] Other early work on simple formal languages used the binary string representation of genetic algorithms, but the inherently hierarchical structure of grammars couched in the EBNF language made trees a more flexible approach.

Koza represented Lisp programs as trees. He was able to find analogues to the genetic operators within the standard set of tree operators. For example, swapping sub-trees is equivalent to the corresponding process of genetic crossover, where sub-strings of a genetic code are transplanted into an individual of the next generation. Fitness is measured by scoring the output from the functions of the Lisp code. Similar analogues between the tree-structured Lisp representation and the representation of grammars as trees made the application of genetic programming techniques possible for grammar induction.

In the case of grammar induction, the transplantation of sub-trees corresponds to the swapping of production rules that enable the parsing of phrases from some language. The fitness operator for the grammar is based upon some measure of how well it performed in parsing some group of sentences from the target language. In a tree representation of a grammar, a terminal symbol of a production rule corresponds to a leaf node of the tree. Its parent node corresponds to a non-terminal symbol (e.g. a noun phrase or a verb phrase) in the rule set. Ultimately, the root node might correspond to a sentence non-terminal.
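
The subtree-swapping crossover operator can be sketched as follows. The nested-tuple encoding of rules is an assumption made for illustration, and fitness evaluation against a sample of sentences is omitted.

    # Subtree crossover on tree-encoded grammar rules (a GP-style sketch).
    import random

    def nodes(tree, path=()):
        """Yield (path, subtree) pairs for every subtree of a nested-tuple tree."""
        yield path, tree
        if isinstance(tree, tuple):
            for i, child in enumerate(tree[1:], start=1):
                yield from nodes(child, path + (i,))

    def replace(tree, path, subtree):
        """Return a copy of `tree` with the node at `path` replaced by `subtree`."""
        if not path:
            return subtree
        i = path[0]
        return tree[:i] + (replace(tree[i], path[1:], subtree),) + tree[i + 1:]

    def crossover(parent_a, parent_b, rng=random):
        """Swap one randomly chosen subtree between the two parents.
        (Real GP systems usually restrict swaps to type-compatible nodes.)"""
        path_a, sub_a = rng.choice(list(nodes(parent_a)))
        path_b, sub_b = rng.choice(list(nodes(parent_b)))
        return replace(parent_a, path_a, sub_b), replace(parent_b, path_b, sub_a)

    # Two toy rule trees: ("NP", ...) reads as NP -> children.
    rule_a = ("NP", ("Det", "the"), ("N", "dog"))
    rule_b = ("NP", ("Det", "a"), ("N", "cat"), ("PP", ("P", "on"), ("NP", ("N", "mats"))))
    child_a, child_b = crossover(rule_a, rule_b, random.Random(0))
    print(child_a)
    print(child_b)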

Grammatical inference by greedy algorithms

Like all greedy algorithms, greedy grammar inference algorithms make, in an iterative manner, decisions that seem to be the best at that stage. The decisions made usually deal with things like the creation of new rules, the removal of existing rules, the choice of a rule to be applied or the merging of some existing rules. Because there are several ways to define 'the stage' and 'the best', there are also several greedy grammar inference algorithms.

Some of these context-free grammar generating algorithms make their decision after every symbol read, while others first read the whole given symbol sequence and only then start to make decisions; a sketch of the latter, offline style of construction is given below.
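
The following is a minimal sketch of the offline, pair-replacement style of greedy construction, in the spirit of byte-pair encoding and Re-Pair rather than a faithful implementation of either; the grammar encoding and nonterminal naming are assumptions.

    # Offline greedy grammar construction: read the whole sequence, then
    # repeatedly replace the most frequent adjacent pair of symbols with a
    # fresh nonterminal until no pair occurs at least twice.
    from collections import Counter

    def greedy_grammar(text):
        seq = list(text)
        rules = {}                                   # nonterminal -> (left, right)
        while True:
            pairs = Counter(zip(seq, seq[1:]))
            if not pairs:
                break
            pair, count = pairs.most_common(1)[0]
            if count < 2:
                break
            nt = f"R{len(rules)}"                    # fresh nonterminal name
            rules[nt] = pair
            out, i = [], 0
            while i < len(seq):                      # left-to-right replacement
                if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                    out.append(nt)
                    i += 2
                else:
                    out.append(seq[i])
                    i += 1
            seq = out
        rules["S"] = tuple(seq)                      # start rule
        return rules

    # Produces R0 -> ab, R1 -> R0 c, R2 -> R1 R1, S -> R2 R1.
    print(greedy_grammar("abcabcabc"))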

Distributional learning

A more recent approach is based on distributional learning. Algorithms using these approaches have been applied to learning context-free grammars and mildly context-sensitive languages and have been proven to be correct and efficient for large subclasses of these grammars. [8]
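
The distributional signal these algorithms exploit can be illustrated with a short sketch that groups spans occurring in identical sets of contexts; this shows only the signal, not the full learner of Clark and Eyraud, and the word-level tokenization and grouping criterion are illustrative assumptions.

    # Group contiguous spans of words that occur in exactly the same set of
    # (left, right) contexts in the sample; such spans are candidates for the
    # same nonterminal in a distributionally learned grammar.
    from collections import defaultdict

    def context_sets(sample):
        contexts = defaultdict(set)
        for sent in sample:
            w = tuple(sent.split())
            for i in range(len(w)):
                for j in range(i + 1, len(w) + 1):
                    contexts[w[i:j]].add((w[:i], w[j:]))   # (left, right) context
        return contexts

    def congruence_classes(sample):
        classes = defaultdict(list)
        for span, ctxs in context_sets(sample).items():
            classes[frozenset(ctxs)].append(span)
        return [c for c in classes.values() if len(c) > 1]

    # "dog" and "cat" (and the spans containing them) share all observed contexts.
    print(congruence_classes(["the dog sleeps", "the cat sleeps"]))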

Learning of pattern languages

Angluin defines a pattern to be "a string of constant symbols from Σ and variable symbols from a disjoint set". The language of such a pattern is the set of all its nonempty ground instances, i.e. all strings resulting from consistent replacement of its variable symbols by nonempty strings of constant symbols. [note 1] A pattern is called descriptive for a finite input set of strings if its language is minimal (with respect to set inclusion) among all pattern languages subsuming the input set.

Angluin gives a polynomial algorithm to compute, for a given input string set, all descriptive patterns in one variable x. [note 2] To this end, she builds an automaton representing all possibly relevant patterns; using sophisticated arguments about word lengths, which rely on x being the only variable, the state count can be drastically reduced. [9]
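
For a one-variable pattern, the membership problem itself is straightforward and can be sketched as follows; this is only the membership check, not Angluin's automaton-based algorithm for finding descriptive patterns, and the string encoding of patterns is an assumption.

    # Membership test for a one-variable pattern: the pattern is a string over
    # constant symbols plus the variable "x"; w belongs to its language iff some
    # nonempty constant string can be substituted consistently for every "x".
    def matches(pattern, w):
        k = pattern.count("x")                 # occurrences of the variable
        c = len(pattern) - k                   # number of constant symbols
        if k == 0:
            return pattern == w
        if (len(w) - c) % k or (len(w) - c) // k < 1:
            return False
        n = (len(w) - c) // k                  # forced length of the substitution
        sub, i = None, 0
        for symbol in pattern:
            if symbol == "x":
                piece = w[i:i + n]
                if sub is None:
                    sub = piece
                elif piece != sub:
                    return False
                i += n
            else:
                if i >= len(w) or w[i] != symbol:
                    return False
                i += 1
        return i == len(w)

    print(matches("axxba", "aolyolyba"))       # True: x -> "oly"
    print(matches("axxba", "aolyxyzba"))       # False: the two x's would differ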

Erlebach et al. give a more efficient version of Angluin's pattern learning algorithm, as well as a parallelized version. [10]

Arimura et al. show that a language class obtained from limited unions of patterns can be learned in polynomial time. [11]

Pattern theory

Pattern theory, formulated by Ulf Grenander, [12] is a mathematical formalism to describe knowledge of the world as patterns. It differs from other approaches to artificial intelligence in that it does not begin by prescribing algorithms and machinery to recognize and classify patterns; rather, it prescribes a vocabulary to articulate and recast the pattern concepts in precise language.

In addition to the new algebraic vocabulary, its statistical approach was novel in its aims.

Broad in its mathematical coverage, pattern theory spans algebra and statistics, as well as local topological and global entropic properties.

Applications

The principle of grammar induction has been applied to other aspects of natural language processing, and has been applied (among many other problems) to semantic parsing, [2] natural language understanding, [13] example-based translation, [14] language acquisition, [15] grammar-based compression, [16] and anomaly detection. [17]

Compression algorithms

Figure: straight-line grammar (with start symbol ß) for the second sentence of the United States Declaration of Independence. Each blue character denotes a nonterminal symbol; they were obtained from a gzip-compression of the sentence.

Grammar-based codes or grammar-based compression are compression algorithms based on the idea of constructing a context-free grammar (CFG) for the string to be compressed. Examples include universal lossless data compression algorithms. [18] To compress a data sequence x, a grammar-based code transforms x into a context-free grammar G. The problem of finding a smallest grammar for an input sequence (the smallest grammar problem) is known to be NP-hard, [19] so many grammar-transform algorithms have been proposed from theoretical and practical viewpoints.

Generally, the produced grammar is further compressed by statistical encoders like arithmetic coding.
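
The objective in the smallest grammar problem is commonly measured as the total length of all right-hand sides of the grammar; the following small illustration, which assumes the grammar a pair-replacement compressor could produce for "abcabcabc", compares that size with the length of the original string.

    # Size of a straight-line grammar (total right-hand-side length) versus the
    # length of the string it derives.
    grammar = {
        "S":  ("R2", "R1"),
        "R2": ("R1", "R1"),
        "R1": ("R0", "c"),
        "R0": ("a", "b"),
    }

    def expand(grammar, symbol="S"):
        """Expand a nonterminal back into the string it derives."""
        if symbol not in grammar:
            return symbol
        return "".join(expand(grammar, s) for s in grammar[symbol])

    s = expand(grammar)
    size = sum(len(rhs) for rhs in grammar.values())
    print(f"{s} has length {len(s)}; grammar size is {size}")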

Notes

  1. The language of a pattern with at least two occurrences of the same variable is not regular due to the pumping lemma.
  2. x may occur several times, but no other variable y may occur.

References

  1. de la Higuera, Colin (2010). Grammatical Inference: Learning Automata and Grammars (PDF). Cambridge: Cambridge University Press. Archived from the original (PDF) on 2019-02-14. Retrieved 2017-08-16.
  2. Kwiatkowski, Tom, et al. "Lexical generalization in CCG grammar induction for semantic parsing." Proceedings of the conference on empirical methods in natural language processing. Association for Computational Linguistics, 2011.
  3. Clark, Alexander. "Unsupervised induction of stochastic context-free grammars using distributional clustering." Proceedings of the 2001 workshop on Computational Natural Language Learning-Volume 7. Association for Computational Linguistics, 2001.
  4. Dana Angluin (1987). "Learning Regular Sets from Queries and Counter-Examples" (PDF). Information and Control . 75 (2): 87–106. CiteSeerX   10.1.1.187.9414 . doi:10.1016/0890-5401(87)90052-6. S2CID   11873053. Archived from the original (PDF) on 2013-12-02.
  5. D’Ulizia, A., Ferri, F., Grifoni, P. (2011) "A Survey of Grammatical Inference Methods for Natural Language Learning [ dead link ]", Artificial Intelligence Review, Vol. 36, No. 1, pp. 1–27.
  6. Talton, Jerry, et al. "Learning design patterns with bayesian grammar induction." Proceedings of the 25th annual ACM symposium on User interface software and technology. 2012.
  7. Kim, Yoon, Chris Dyer, and Alexander M. Rush. "Compound probabilistic context-free grammars for grammar induction." arXiv preprint arXiv:1906.10225 (2019).
  8. Clark and Eyraud (2007) Journal of Machine Learning Research; Ryo Yoshinaka (2011) Theoretical Computer Science
  9. Dana Angluin (1980). "Finding Patterns Common to a Set of Strings". Journal of Computer and System Sciences. 21: 46–62. doi: 10.1016/0022-0000(80)90041-0 .
  10. T. Erlebach; P. Rossmanith; H. Stadtherr; A. Steger; T. Zeugmann (1997). "Learning One-Variable Pattern Languages Very Efficiently on Average, in Parallel, and by Asking Queries". In M. Li; A. Maruoka (eds.). Proc. 8th International Workshop on Algorithmic Learning Theory — ALT'97. LNAI. Vol. 1316. Springer. pp. 260–276.
  11. Hiroki Arimura; Takeshi Shinohara; Setsuko Otsuki (1994). "Finding Minimal Generalizations for Unions of Pattern Languages and Its Application to Inductive Inference from Positive Data" (PDF). Proc. STACS 11. LNCS. Vol. 775. Springer. pp. 649–660.[ dead link ]
  12. Grenander, Ulf, and Michael I. Miller. Pattern theory: from representation to inference.[ dead link ] Vol. 1. Oxford: Oxford University Press, 2007.
  13. Miller, Scott, et al. "Hidden understanding models of natural language." Proceedings of the 32nd annual meeting on Association for Computational Linguistics. Association for Computational Linguistics, 1994.
  14. Brown, Ralf D. "Transfer-rule induction for example-based translation." Proceedings of the MT Summit VIII Workshop on Example-Based Machine Translation. 2001.
  15. Chater, Nick, and Christopher D. Manning. "Probabilistic models of language processing and acquisition." Trends in cognitive sciences 10.7 (2006): 335-344.
  16. Cherniavsky, Neva, and Richard Ladner. "Grammar-based compression of DNA sequences." DIMACS Working Group on The Burrows–Wheeler Transform 21 (2004).
  17. Senin, Pavel, et al. "Time series anomaly discovery with grammar-based compression." Edbt. 2015.
  18. Kieffer, J. C.; Yang, E.-H. (2000), "Grammar-based codes: A new class of universal lossless source codes", IEEE Trans. Inf. Theory, 46 (3): 737–754, doi:10.1109/18.841160
  19. Charikar, M.; Lehman, E.; Liu, D.; Panigrahy, R.; Prabhakaran, M.; Sahai, A.; Shelat, A. (2005), "The Smallest Grammar Problem", IEEE Trans. Inf. Theory, 51 (7): 2554–2576, doi:10.1109/tit.2005.850116, S2CID   6900082

Sources