Parikh's theorem

Last updated

Parikh's theorem in theoretical computer science says that if one looks only at the number of occurrences of each terminal symbol in a context-free language, without regard to their order, then the language is indistinguishable from a regular language. [1] It is useful for deciding that strings with a given number of terminals are not accepted by a context-free grammar. [2] It was first proved by Rohit Parikh in 1961 [3] and republished in 1966. [4]

Contents

Definitions and formal statement

Let be an alphabet. The Parikh vector of a word is defined as the function , given by [1]

where denotes the number of occurrences of the letter in the word .

A subset of is said to be linear if it is of the form

for some vectors . A subset of is said to be semi-linear if it is a union of finitely many linear subsets.

Theorem  Let be a context-free language or a regular language, let be the set of Parikh vectors of words in , that is, . Then is a semi-linear set.

If is any semi-linear set, then there exists a regular language (which a fortiori is context-free) whose Parikh vectors is .

In short, the image under of context-free languages and of regular languages is the same, and it is equal to the set of semilinear sets.

Two languages are said to be commutatively equivalent if they have the same set of Parikh vectors. Thus, every context-free language is commutatively equivalent to some regular language.

Proof

The second part is easy to prove.

Proof

Given semi-linear set , to construct a regular language whose set of Parikh vectors is .

is a union of 0 or more linear sets. Since the empty language is regular, and union of regular languages is regular, it suffices to prove that any linear set is the set of Parikh vectors of a regular language.

Let , then it is the set of Parikh vectors of , where each has Parikh vector .

The first part is less easy. The following proof is credited to Goldstine. [5]

First we need a small strengthening of the pumping lemma for context-free languages:

Lemma  If is generated by a Chomsky normal form grammar, then , such that

For any , and for any with , there exists a way to split into segments , and a nonterminal symbol , such that

for all , and

The proof is essentially the same as the standard pumping lemma: use the pigeonhole principle to find copies of some nonterminal symbol in the longest path in the shortest derivation tree.

Now we prove the first part of Parikh's theorem, making use of the above lemma.

Proof

First, construct a Chomsky normal form grammar for .

For each finite nonempty subset of nonterminals , define to be the set of sentences in such that there exists a derivation that uses every nonterminal in , no more and no less. It is clear that , so it suffices to prove that each is a semilinear set.

Now fix some , and let . We construct two finite sets , such that , which is obviously semilinear.

For notational clarity, write to mean "there exists a derivation using no more (but possibly less) than nonterminals in . With that, we define as follows:

To prove , we induct on the length of .

If , then , so . Otherwise, by the strengthened pumping lemma, there exists a derivation of using precisely the elements of , and is of the form
where , , and .
Since there are only elements in , but there are sub-derivations in the middle, we may delete one sub-derivation , and obtain a shorter that is still in , with
By induction, , and by construction, , so .

To prove , we induct on the length of .

If then by construction, .
Otherwise, for some and .
By induction, for some . By construction of , there exists some such that .
By construction of , the symbol appears in a derivation of using exactly all of . Then we can interpolate into that derivation to obtain some such that

Strengthening for bounded languages

A language is bounded if for some fixed words . Ginsburg and Spanier [6] gave a necessary and sufficient condition, similar to Parikh's theorem, for bounded languages.

Call a linear set stratified, if in its definition for each the vector has the property that it has at most two non-zero coordinates, and for each if each of the vectors has two non-zero coordinates, and , respectively, then their order is not. A semi-linear set is stratified if it is a union of finitely many stratified linear subsets.

Ginsburg-Spanier  A bounded language is context-free if and only if is a stratified semi-linear set.

Significance

The theorem has multiple interpretations. It shows that a context-free language over a singleton alphabet must be a regular language and that some context-free languages can only have ambiguous grammars [ further explanation needed ]. Such languages are called inherently ambiguous languages . From a formal grammar perspective, this means that some ambiguous context-free grammars cannot be converted to equivalent unambiguous context-free grammars.

Related Research Articles

<span class="mw-page-title-main">Context-free grammar</span> Type of formal grammar

In formal language theory, a context-free grammar (CFG) is a formal grammar whose production rules can be applied to a nonterminal symbol regardless of its context. In particular, in a context-free grammar, each production rule is of the form

In mathematics, and more specifically in linear algebra, a linear map is a mapping between two vector spaces that preserves the operations of vector addition and scalar multiplication. The same names and the same definition are also used for the more general case of modules over a ring; see Module homomorphism.

In mathematics, a product is the result of multiplication, or an expression that identifies objects to be multiplied, called factors. For example, 21 is the product of 3 and 7, and is the product of and . When one factor is an integer, the product is called a multiple.

In mathematics, the tensor product of two vector spaces V and W is a vector space to which is associated a bilinear map that maps a pair to an element of denoted .

<span class="mw-page-title-main">Vector space</span> Algebraic structure in linear algebra

In mathematics and physics, a vector space is a set whose elements, often called vectors, may be added together and multiplied ("scaled") by numbers called scalars. Scalars are often real numbers, but can be complex numbers or, more generally, elements of any field. The operations of vector addition and scalar multiplication must satisfy certain requirements, called vector axioms. Real vector space and complex vector space are kinds of vector spaces based on different kinds of scalars: real coordinate space or complex coordinate space.

In computer science, an LL parser is a top-down parser for a restricted context-free language. It parses the input from Left to right, performing Leftmost derivation of the sentence.

<span class="mw-page-title-main">Normal (geometry)</span> Line or vector perpendicular to a curve or a surface

In geometry, a normal is an object that is perpendicular to a given object. For example, the normal line to a plane curve at a given point is the line perpendicular to the tangent line to the curve at the point.

In mathematics, a linear form is a linear map from a vector space to its field of scalars.

In mathematics, differential forms provide a unified approach to define integrands over curves, surfaces, solids, and higher-dimensional manifolds. The modern notion of differential forms was pioneered by Élie Cartan. It has many applications, especially in geometry, topology and physics.

In mathematics, the adele ring of a global field is a central object of class field theory, a branch of algebraic number theory. It is the restricted product of all the completions of the global field and is an example of a self-dual topological ring.

In mathematics, in particular in algebraic topology, differential geometry and algebraic geometry, the Chern classes are characteristic classes associated with complex vector bundles. They have since become fundamental concepts in many branches of mathematics and physics, such as string theory, Chern–Simons theory, knot theory, Gromov–Witten invariants. Chern classes were introduced by Shiing-Shen Chern.

In mathematics, the Hodge star operator or Hodge star is a linear map defined on the exterior algebra of a finite-dimensional oriented vector space endowed with a nondegenerate symmetric bilinear form. Applying the operator to an element of the algebra produces the Hodge dual of the element. This map was introduced by W. V. D. Hodge.

In mathematics, differential algebra is, broadly speaking, the area of mathematics consisting in the study of differential equations and differential operators as algebraic objects in view of deriving properties of differential equations and operators without computing the solutions, similarly as polynomial algebras are used for the study of algebraic varieties, which are solution sets of systems of polynomial equations. Weyl algebras and Lie algebras may be considered as belonging to differential algebra.

Conjunctive grammars are a class of formal grammars studied in formal language theory. They extend the basic type of grammars, the context-free grammars, with a conjunction operation. Besides explicit conjunction, conjunctive grammars allow implicit disjunction represented by multiple rules for a single nonterminal symbol, which is the only logical connective expressible in context-free grammars. Conjunction can be used, in particular, to specify intersection of languages. A further extension of conjunctive grammars known as Boolean grammars additionally allows explicit negation.

In the theory of formal languages, Ogden's lemma is a generalization of the pumping lemma for context-free languages.

In mathematics and mathematical physics, raising and lowering indices are operations on tensors which change their type. Raising and lowering indices are a form of index manipulation in tensor expressions.

In computer science, partial application refers to the process of fixing a number of arguments of a function, producing another function of smaller arity. Given a function , we might fix the first argument, producing a function of type . Evaluation of this function might be represented as . Note that the result of partial function application in this case is a function that takes two arguments. Partial application is sometimes incorrectly called currying, which is a related, but distinct concept.

Range concatenation grammar (RCG) is a grammar formalism developed by Pierre Boullier in 1998 as an attempt to characterize a number of phenomena of natural language, such as Chinese numbers and German word order scrambling, which are outside the bounds of the mildly context-sensitive languages.

Controlled grammars are a class of grammars that extend, usually, the context-free grammars with additional controls on the derivations of a sentence in the language. A number of different kinds of controlled grammars exist, the four main divisions being Indexed grammars, grammars with prescribed derivation sequences, grammars with contextual conditions on rule application, and grammars with parallelism in rule application. Because indexed grammars are so well established in the field, this article will address only the latter three kinds of controlled grammars.

In formal language theory, the Chomsky–Schützenberger enumeration theorem is a theorem derived by Noam Chomsky and Marcel-Paul Schützenberger about the number of words of a given length generated by an unambiguous context-free grammar. The theorem provides an unexpected link between the theory of formal languages and abstract algebra.

References

  1. 1 2 Kozen, Dexter (1997). Automata and Computability. New York: Springer-Verlag. ISBN   3-540-78105-6.
  2. Håkan Lindqvist. "Parikh's theorem" (PDF). Umeå Universitet.
  3. Parikh, Rohit (1961). "Language Generating Devices". Quarterly Progress Report, Research Laboratory of Electronics, MIT.
  4. Parikh, Rohit (1966). "On Context-Free Languages". Journal of the Association for Computing Machinery . 13 (4): 570–581. doi: 10.1145/321356.321364 . S2CID   12263468.
  5. Goldstine, J. (1977-01-01). "A simplified proof of parikh's theorem". Discrete Mathematics. 19 (3): 235–239. doi:10.1016/0012-365X(77)90103-0. ISSN   0012-365X.
  6. Ginsburg, Seymour; Spanier, Edwin H. (1966). "Presburger formulas, and languages". Pacific Journal of Mathematics. 16 (2): 285–296. doi: 10.2140/pjm.1966.16.285 .