Range concatenation grammar

Last updated January 26, 2024

Range concatenation grammar (RCG) is a grammar formalism developed by Pierre Boullier^[1] in 1998 as an attempt to characterize a number of phenomena of natural language, such as Chinese numbers and German word order scrambling, which are outside the bounds of the mildly context-sensitive languages.^[2]

Though intended as a variant on Groenink's literal movement grammars (LMGs), RCGs treat the grammatical process more as a proof than as a production. Whereas LMGs produce a terminal string from a start predicate, RCGs aim to reduce a start predicate (applied to a terminal string) to the empty string, which constitutes a proof of the terminal string's membership in the language.

Description

Formal definition

A Positive Range Concatenation Grammar (PRCG) is a tuple $G=(N,~T,~V,~S,~P)$ , where:

$N$ , $T$ and $V$ are disjoint finite sets of (respectively) predicate names, terminal symbols and variable names. Each predicate name has an associated arity given by the function $\dim :N\rightarrow \mathbb {N} \setminus \{0\}$ .
$S\in N$ is the start predicate name and satisfies $\dim(S)=1$ .
$P$ is a finite set of clauses of the form $\psi _{0}\rightarrow \psi _{1}\ldots \psi _{m}$ , where the $\psi _{i}$ are predicates of the form $A_{i}(\alpha _{1},\ldots ,\alpha _{\dim(A_{i})})$ with $A_{i}\in N$ and each argument $\alpha _{j}\in (T\cup V)^{\star }$ .

A Negative Range Concatenation Grammar (NRCG) is defined like a PRCG, but with the addition that some predicates occurring in the right-hand side of a clause can have the form ${\overline {A_{i}(\alpha _{1},\ldots ,\alpha _{\dim(A_{i})})}}$ . Such predicates are called negative predicates.

A Range Concatenation Grammar is a positive or a negative one. Although PRCGs are technically NRCGs, the terms are used to highlight the absence (PRCG) or presence (NRCG) of negative predicates.

A range in a word $w\in T^{\star }$ is a pair $\langle l,r\rangle _{w}$ , with $0\leq l\leq r\leq n$ , where $n$ is the length of $w$ . Variables bind to ranges of the input word, not to arbitrary strings of terminals. Two ranges $\langle l_{1},r_{1}\rangle _{w}$ and $\langle l_{2},r_{2}\rangle _{w}$ can be concatenated iff $r_{1}=l_{2}$ , and we then have: $\langle l_{1},r_{1}\rangle _{w}\cdot \langle l_{2},r_{2}\rangle _{w}=\langle l_{1},r_{2}\rangle _{w}$ . When a clause is instantiated and an argument consists of multiple elements from $T\cup V$ , the ranges of those elements must concatenate.
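The concatenation condition on ranges can be sketched in a few lines of Python. The names `Range`, `concat` and `text` are illustrative choices for this sketch and do not come from the RCG literature.

```python
from typing import NamedTuple, Optional


class Range(NamedTuple):
    """A range <l, r> over a fixed word, with 0 <= l <= r <= len(word)."""
    l: int
    r: int


def concat(a: Range, b: Range) -> Optional[Range]:
    """Return <a.l, b.r> when a.r == b.l, otherwise None (not concatenable)."""
    if a.r != b.l:
        return None
    return Range(a.l, b.r)


def text(w: str, rg: Range) -> str:
    """The substring of w covered by the range."""
    return w[rg.l:rg.r]


w = "abbabb"
print(concat(Range(0, 3), Range(3, 6)))  # adjacent ranges: Range(l=0, r=6)
print(concat(Range(0, 3), Range(4, 6)))  # gap between 3 and 4: None
print(text(w, Range(3, 6)))              # abb
```

Note that two ranges covering equal substrings, such as `Range(0, 3)` and `Range(3, 6)` above, are still distinct ranges.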

For a word $w=w_{1}w_{2}\ldots w_{n}$ , with $w_{i}\in T$ , the dotted notation for ranges is: $\langle l,r\rangle _{w}=w_{1}\ldots w_{l}\bullet w_{l+1}\ldots w_{r}\bullet w_{r+1}\ldots w_{n}$ , so that the range covers the substring $w_{l+1}\ldots w_{r}$ .
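As a small illustration, the dotted notation can be produced by a helper that treats $l$ and $r$ as positions between symbols, so that the range covers $w_{l+1}\ldots w_{r}$. The function name `dotted` is an assumption of this sketch.

```python
def dotted(w: str, l: int, r: int) -> str:
    """Render the range <l, r> of w with bullets around the covered substring."""
    if not 0 <= l <= r <= len(w):
        raise ValueError("a range requires 0 <= l <= r <= len(w)")
    return w[:l] + "•" + w[l:r] + "•" + w[r:]


print(dotted("abbabbabb", 0, 3))  # •abb•abbabb
print(dotted("abbabbabb", 3, 6))  # abb•abb•abb
print(dotted("abbabbabb", 6, 9))  # abbabb•abb•
```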

Recognition of strings

The strings of predicates being rewritten represent constraints on the string being tested: positive predicates must be satisfied, and negative predicates must not be. The order of predicates is irrelevant. Rewrite steps amount to replacing one constraint by zero or more simpler constraints.

Like LMGs, RCG clauses have the general schema $A(x_{1},...,x_{n})\to \alpha$ , where in an RCG, $\alpha$ is either the empty string or a string of predicates. The arguments $x_{i}$ consist of strings of terminal symbols and/or variable symbols, which pattern match against actual argument values like in LMG. Adjacent variables constitute a family of matches against partitions, so that the argument $xy$ , with two variables, matches the literal string $ab$ in three different ways: $x=\epsilon ,\ y=ab;\ x=a,\ y=b;\ x=ab,\ y=\epsilon$ . These would give rise to three different instantiations of the clause containing that argument $xy$ .
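The family of matches produced by adjacent variables can be enumerated mechanically. The following is a minimal sketch, with the hypothetical helper `splits` yielding every way to cut a string into $k$ adjacent, possibly empty pieces.

```python
from itertools import combinations_with_replacement


def splits(s: str, k: int):
    """Yield every way to cut s into k adjacent (possibly empty) pieces."""
    n = len(s)
    # Choose k - 1 cut positions in non-decreasing order; repeats give empty pieces.
    for cuts in combinations_with_replacement(range(n + 1), k - 1):
        bounds = (0, *cuts, n)
        yield tuple(s[bounds[i]:bounds[i + 1]] for i in range(k))


# The argument xy matched against the literal string ab:
print(list(splits("ab", 2)))  # [('', 'ab'), ('a', 'b'), ('ab', '')]
```

Each tuple corresponds to one instantiation of the clause containing the argument $xy$.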

Predicate terms come in two forms: positive terms, which produce the empty string on success, and negative terms, which produce the empty string exactly when the corresponding positive term fails to do so. Negative terms are written like positive terms but with an overbar, as in ${\overline {A(x_{1},...,x_{n})}}$ .

The rewrite semantics for RCGs is rather simple, identical to the corresponding semantics of LMGs. Given a predicate string $A(\alpha _{1},...,\alpha _{n})$ , where the symbols $\alpha _{i}$ are terminal strings, if there is a rule $A(x_{1},...,x_{n})\to \beta$ in the grammar that the predicate string matches, the predicate string is replaced by $\beta$ , substituting for the matched variables in each $x_{i}$ .

For example, given the rule $A(x,ayb)\to B(axb,y)$ , where $x$ and $y$ are variable symbols and $a$ and $b$ are terminal symbols, the predicate string $A(a,abb)$ can be rewritten as $B(aab,b)$ , because $A(a,abb)$ matches $A(x,ayb)$ when $x=a,\ y=b$ . Similarly, if there were a rule $A(x,ayb)\to A(x,x)\ A(y,y)$ , $A(a,abb)$ could be rewritten as $A(a,a)\ A(b,b)$ .
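The matching step in this example can be sketched in Python. This is a simplified, string-based illustration (variables are bound to substrings rather than to ranges), and the helpers `match`, `match_args` and `substitute` are names invented for the sketch. Symbols are single characters, and any character not listed in `variables` is treated as a terminal.

```python
def match(pattern, s, variables, env=None):
    """Yield every variable binding under which `pattern` spells out exactly `s`."""
    env = dict(env or {})
    if not pattern:
        if not s:
            yield env
        return
    head, rest = pattern[0], pattern[1:]
    if head not in variables:                     # terminal symbol
        if s.startswith(head):
            yield from match(rest, s[1:], variables, env)
    elif head in env:                             # variable bound earlier
        if s.startswith(env[head]):
            yield from match(rest, s[len(env[head]):], variables, env)
    else:                                         # unbound variable: try every prefix
        for i in range(len(s) + 1):
            yield from match(rest, s[i:], variables, {**env, head: s[:i]})


def match_args(patterns, args, variables):
    """Match a clause's left-hand-side arguments against actual arguments."""
    envs = [{}]
    for p, a in zip(patterns, args):
        envs = [e2 for e in envs for e2 in match(p, a, variables, e)]
    return envs


def substitute(pattern, env):
    """Replace each bound variable in `pattern` by its value."""
    return "".join(env.get(c, c) for c in pattern)


VARS = {"x", "y"}
envs = match_args(["x", "ayb"], ["a", "abb"], VARS)
print(envs)                                            # [{'x': 'a', 'y': 'b'}]
print([substitute(p, envs[0]) for p in ["axb", "y"]])  # ['aab', 'b']
```

The single binding found corresponds to rewriting $A(a,abb)$ as $B(aab,b)$.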

A proof/recognition of a string $\alpha$ is done by showing that $S(\alpha )$ produces the empty string. For the individual rewrite steps, when multiple alternative variable matches are possible, any rewrite which could lead the whole proof to succeed is considered. Thus, if there is at least one way to produce the empty string from the initial string $S(\alpha )$ , the proof is considered a success, regardless of how many other ways to fail exist.

Example

RCGs are capable of recognizing the non-linear index language $\{www:w\in \{a,b\}^{*}\}$ as follows:

Letting x, y, and z be variable symbols:

${\begin{aligned}S(xyz)&\to A(x,y,z)\\A(ax,ay,az)&\to A(x,y,z)\\A(bx,by,bz)&\to A(x,y,z)\\A(\epsilon ,\epsilon ,\epsilon )&\to \epsilon \end{aligned}}$

The proof for abbabbabb is then

$S(abbabbabb)\Rightarrow A(abb,abb,abb)\Rightarrow A(bb,bb,bb)\Rightarrow A(b,b,b)\Rightarrow A(\epsilon ,\epsilon ,\epsilon )\Rightarrow \epsilon$

Or, using the more correct dotted notation for ranges:

$S(\bullet {}abbabbabb\bullet {})\Rightarrow A(\bullet {}abb\bullet {}abbabb,abb\bullet {}abb\bullet {}abb,abbabb\bullet {}abb\bullet {})\Rightarrow A(a\bullet {}bb\bullet {}abbabb,abba\bullet {}bb\bullet {}abb,abbabba\bullet {}bb\bullet {})$ $\Rightarrow A(ab\bullet {}b\bullet {}abbabb,abbab\bullet {}b\bullet {}abb,abbabbab\bullet {}b\bullet {})\Rightarrow A(\epsilon ,\epsilon ,\epsilon )\Rightarrow \epsilon$

For a string of $3n$ letters, there are ${\binom {3n+2}{2}}={\frac {(3n+2)(3n+1)}{2}}$ different instantiations of that first clause, but only the one which makes $x,y,z$ all $n$ letters each allows the derivation to reach $\epsilon$ .
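A brute-force recognizer for this grammar makes the count concrete. The sketch below hard-codes the four clauses as Python functions, and the function names are illustrative. It tries every instantiation of the first clause and records which ones lead to $\epsilon$.

```python
from math import comb


def A(x: str, y: str, z: str) -> bool:
    """Clauses A(ax,ay,az) -> A(x,y,z), A(bx,by,bz) -> A(x,y,z), A(eps,eps,eps) -> eps."""
    if x == y == z == "":
        return True
    for t in "ab":
        if x.startswith(t) and y.startswith(t) and z.startswith(t):
            return A(x[1:], y[1:], z[1:])
    return False


def successful_splits(w: str):
    """Try every instantiation of S(xyz) -> A(x,y,z); return (number tried, successes)."""
    n = len(w)
    tried, hits = 0, []
    for i in range(n + 1):
        for j in range(i, n + 1):
            tried += 1
            parts = (w[:i], w[i:j], w[j:])
            if A(*parts):
                hits.append(parts)
    return tried, hits


def recognizes(w: str) -> bool:
    """True iff S(w) can be rewritten to the empty string."""
    return bool(successful_splits(w)[1])


tried, hits = successful_splits("abbabbabb")
print(tried, comb(len("abbabbabb") + 2, 2))  # 55 55
print(hits)                                  # [('abb', 'abb', 'abb')]
```

Enumerating all splits is only for illustration; a practical parser would not try every instantiation blindly.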

Properties

Every context-free grammar (CFG) can be converted into a range concatenation grammar:

For every nonterminal $A$ of the CFG, the RCG has an arity $1$ predicate $A(x)$ .
For every CFG rule $A\to BC$ , the RCG has $A(xy)\to B(x)C(y)$ .
For every CFG rule $A\to a$ (where $a$ is a terminal), the RCG has $A(a)\to \epsilon$ .
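The conversion above can be sketched as a function that emits RCG clauses as text. It covers only the two rule shapes listed (a pair of nonterminals, or a single terminal). The rule encoding and the name `cfg_to_rcg` are assumptions of this sketch.

```python
def cfg_to_rcg(rules):
    """Translate CFG rules (lhs, rhs-tuple) into RCG clauses written as strings.

    A rhs of length 2 is read as two nonterminals (A -> B C);
    a rhs of length 1 is read as a single terminal (A -> a).
    """
    clauses = []
    for lhs, rhs in rules:
        if len(rhs) == 2:
            b, c = rhs
            clauses.append(f"{lhs}(xy) -> {b}(x) {c}(y)")
        elif len(rhs) == 1:
            clauses.append(f"{lhs}({rhs[0]}) -> eps")
        else:
            raise ValueError(f"unsupported rule shape: {lhs} -> {rhs}")
    return clauses


# A grammar for { a^n b^n : n >= 1 } using only the two rule shapes above.
rules = [
    ("S", ("A", "B")),
    ("S", ("A", "T")),
    ("T", ("S", "B")),
    ("A", ("a",)),
    ("B", ("b",)),
]
for clause in cfg_to_rcg(rules):
    print(clause)
```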

The intersection and union of two range concatenation languages are trivially range concatenation languages:

For $S$ the intersection of $A$ and $B$ , you have $S(x)\to A(x)B(x)$ .
For $S$ the union of $A$ and $B$ , you have $S(x)\to A(x)$ and $S(x)\to B(x)$ .
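To see why these closure clauses are immediate, predicates can be modelled as Boolean functions on strings: placing two predicates on the same right-hand side acts as conjunction, and giving two clauses for the same predicate acts as disjunction. The sketch below uses two hand-written recognizers for context-free languages; all names are illustrative.

```python
import re


def anbn_cstar(s: str) -> bool:
    """Recognize { a^n b^n c^m } (context-free)."""
    m = re.fullmatch(r"(a*)(b*)(c*)", s)
    return bool(m) and len(m.group(1)) == len(m.group(2))


def astar_bncn(s: str) -> bool:
    """Recognize { a^m b^n c^n } (context-free)."""
    m = re.fullmatch(r"(a*)(b*)(c*)", s)
    return bool(m) and len(m.group(2)) == len(m.group(3))


def intersection(p, q):
    """S(x) -> A(x) B(x): both constraints must hold of the same x."""
    return lambda x: p(x) and q(x)


def union(p, q):
    """S(x) -> A(x) and S(x) -> B(x): either clause may succeed."""
    return lambda x: p(x) or q(x)


both = intersection(anbn_cstar, astar_bncn)   # { a^n b^n c^n }, not context-free
either = union(anbn_cstar, astar_bncn)
print(both("aabbcc"), both("aabbc"))      # True False
print(either("aabbc"), either("abbc"))    # True False
```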

The languages of (possibly negative) range concatenation grammars are also closed under set complement.

A consequence of the above is that it is undecidable whether a (positive) range concatenation language is nonempty, because it is undecidable whether the intersection of two context-free languages is nonempty. Hence range concatenation grammars are not generative.

References

[1] Pierre Boullier (Jan 1998). Proposal for a Natural Language Processing Syntactic Backbone (PDF) (Technical report). Vol. 3342. INRIA Rocquencourt (France).
[2] Pierre Boullier (1999). "Chinese Numbers, MIX, Scrambling, and Range Concatenation Grammars" (PDF). Proc. EACL. pp. 53–60. Archived from the original (PDF) on 2003-05-15.
[3] Eberhard Bertsch and Mark-Jan Nederhof (Oct 2001). "On the complexity of some extensions of RCG parsing" (PDF). Proceedings of the Seventh International Workshop on Parsing Technologies (Beijing). pp. 66–77.
[4] Laura Kallmeyer (2010). Parsing Beyond Context-Free Grammars. Springer Science & Business Media. p. 37. ISBN 978-3-642-14846-0. Citing Bertsch and Nederhof (2001).^[3]

