Parikh's theorem

Last updated September 09, 2024

Parikh's theorem in theoretical computer science says that if one looks only at the number of occurrences of each terminal symbol in a context-free language, without regard to their order, then the language is indistinguishable from a regular language.^[1] It is useful for deciding that strings with a given number of terminals are not accepted by a context-free grammar.^[2] It was first proved by Rohit Parikh in 1961^[3] and republished in 1966.^[4]

Definitions and formal statement

Let $\Sigma =\{a_{1},a_{2},\ldots ,a_{k}\}$ be an alphabet. The Parikh vector of a word is defined as the function ${\textstyle p:\Sigma ^{*}\to \mathbb {N} ^{k}}$ , given by^[1] $p(w)=(|w|_{a_{1}},|w|_{a_{2}},\ldots ,|w|_{a_{k}})$ where $|w|_{a_{i}}$ denotes the number of occurrences of the symbol $a_{i}$ in the word $w$ .

A subset of $\mathbb {N} ^{k}$ is said to be linear if it is of the form $u_{0}+\mathbb {N} u_{1}+\dots +\mathbb {N} u_{m}=\{u_{0}+t_{1}u_{1}+\dots +t_{m}u_{m}\mid t_{1},\ldots ,t_{m}\in \mathbb {N} \}$ for some vectors ${\textstyle u_{0},\ldots ,u_{m}}$ . A subset of $\mathbb {N} ^{k}$ is said to be semi-linear if it is a union of finitely many linear subsets.

Theorem — Let $L$ be a context-free language or a regular language, let $P(L)$ be the set of Parikh vectors of words in $L$ , that is, ${\textstyle P(L)=\{p(w)\mid w\in L\}}$ . Then $P(L)$ is a semi-linear set.

If $S$ is any semi-linear set, then there exists a regular language (which a fortiori is context-free) whose Parikh vectors is $S$ .

In short, the image under $p$ of context-free languages and of regular languages is the same, and it is equal to the set of semilinear sets.

Two languages are said to be commutatively equivalent if they have the same set of Parikh vectors. Thus, every context-free language is commutatively equivalent to some regular language.

Proof

The second part is easy to prove.

Proof

Given semi-linear set $S$ , to construct a regular language whose set of Parikh vectors is $S$ .

$S$ is a union of 0 or more linear sets. Since the empty language is regular, and union of regular languages is regular, it suffices to prove that any linear set is the set of Parikh vectors of a regular language.

Let $S=\{u_{0}+t_{1}u_{1}+\dots +t_{m}u_{m}\mid t_{1},\ldots ,t_{m}\in \mathbb {N} \}$ , then it is the set of Parikh vectors of $\{z_{0}\}\cdot (\cup _{i=1}^{m}\{z_{i}\})^{*}$ , where each $z_{i}$ has Parikh vector $u_{i}$ .

The first part is less easy. The following proof is credited to Goldstine.^[5]

First we need a small strengthening of the pumping lemma for context-free languages:

Lemma — If $L$ is generated by a Chomsky normal form grammar, then $\exists N\geq 1$ , such that

For any $k\geq 1$ , and for any $w\in L$ with $|w|\geq N^{k}$ , there exists a way to split $w$ into segments $ux_{1}\cdots x_{k}zy_{k}\cdots y_{1}v$ , and a nonterminal symbol $A$ , such that

$|x_{i}y_{i}|\geq 1$ for all $i$ , and $|x_{1}\cdots x_{k}zy_{k}\cdots y_{1}|\leq N^{k}$

$S\Rightarrow ^{*}uAv\quad A\Rightarrow ^{*}z\quad \forall i,A\Rightarrow ^{*}x_{i}Ay_{i}$

The proof is essentially the same as the standard pumping lemma: use the pigeonhole principle to find $k$ copies of some nonterminal symbol $A$ in the longest path in the shortest derivation tree.

Now we prove the first part of Parikh's theorem, making use of the above lemma.

Proof

First, construct a Chomsky normal form grammar for $L$ .

For each finite nonempty subset of nonterminals $U$ , define $L_{U}$ to be the set of sentences in $L$ such that there exists a derivation that uses every nonterminal in $U$ , no more and no less. It is clear that $L=\cup _{U}L_{U}$ , so it suffices to prove that each $p(L_{U})$ is a semilinear set.

Now fix some $U$ , and let $k=|U|$ . We construct two finite sets $F,G$ , such that $p(L_{U})=p(F\cdot G^{*})$ , which is obviously semilinear.

For notational clarity, write

\Rightarrow _{U}^{*}

to mean "there exists a derivation using no more (but possibly less) than nonterminals in

U

. With that, we define

F,G

as follows:

$F=\{w\in L_{U}:|w|<N^{k}\}$ $G=\{xy:1\leq |xy|\leq N^{k}{\text{ and there exists }}A\in U{\text{ such that }}A\Rightarrow _{U}^{*}xAy\}$

To prove $p(L_{U})\subset p(F\cdot G^{*})$ , we induct on the length of $w\in L_{U}$ .

If

|w|<N^{k}

, then

w\in F

, so

p(w)\in p(F\cdot G^{*})

. Otherwise, by the strengthened pumping lemma, there exists a derivation of

w

using precisely the elements of

U

, and is of the form

$S{\underset {d_{0}}{\stackrel {*}{\Rightarrow }}}uAv{\underset {d_{1}}{\stackrel {*}{\Rightarrow }}}ux_{1}Ay_{1}v{\underset {d_{2}}{\stackrel {*}{\Rightarrow }}}\cdots {\underset {d_{k}}{\stackrel {*}{\Rightarrow }}}ux_{1}\cdots x_{k}Ay_{k}\cdots y_{1}v{\underset {d_{k+1}}{\stackrel {*}{\Rightarrow }}}ux_{1}\cdots x_{k}zy_{k}\cdots y_{1}v$

where

A\in U

,

1\leq |x_{i}y_{i}|

, and

|x_{1}\cdots x_{k}zy_{k}\cdots y_{1}|\leq N^{k}

.

Since there are only

k-1

elements in

U\setminus \{A\}

, but there are

k

sub-derivations

d_{1},...,d_{k}

in the middle, we may delete one sub-derivation

d_{i}

, and obtain a shorter

w'

that is still in

L_{U}

, with

$p(w)=p(uzv)+p(x_{1}y_{1})+\cdots +p(x_{k}y_{k})=p(w')+p(x_{i}y_{i})$

By induction,

p(w')\in p(F\cdot G^{*})

, and by construction,

x_{i}y_{i}\in G

, so

p(w)\in p(F\cdot G^{*})

.

To prove $p(L_{U})\supset p(F\cdot G^{*})$ , consider an element $w\in F\cdot G^{*}$ . We need to show that $p(w)\in p(L_{U})$ . We induct on the minimal number of factors from $G$ that is needed to identify $w$ as an element of $F\cdot G^{*}$ .

If no factor from

G

is needed, then

w\in F\subset L_{U}

.

Otherwise,

w=w'xy

for some

w'\in F\cdot G^{*}

and

xy\in G

, where

w'

requires less factors from

G

than

w

.

By induction,

p(w')=p(w'')

for some

w''\in L_{U}

. By construction of

G

, there exists some

A\in U

such that

A\Rightarrow _{U}^{*}xAy

.

By construction of

L_{U}

, the symbol

A

appears in a derivation of

w''

using exactly all of

U

. Then we can interpolate

A\Rightarrow _{U}^{*}xAy

into that derivation to obtain some

w'''\in L_{U}

such that

$p(w''')=p(w'')+p(xy)=p(w')+p(xy)=p(w)$

Strengthening for bounded languages

A language $L$ is bounded if $L\subset w_{1}^{*}\ldots w_{k}^{*}$ for some fixed words $w_{1},\ldots ,w_{k}$ . Ginsburg and Spanier ^[6] gave a necessary and sufficient condition, similar to Parikh's theorem, for bounded languages.

Call a linear set stratified, if in its definition for each $i\geq 1$ the vector $u_{i}$ has the property that it has at most two non-zero coordinates, and for each $i,j\geq 1$ if each of the vectors $u_{i},u_{j}$ has two non-zero coordinates, $i_{1}<i_{2}$ and $j_{1}<j_{2}$ , respectively, then their order is not $i_{1}<j_{1}<i_{2}<j_{2}$ . A semi-linear set is stratified if it is a union of finitely many stratified linear subsets.

Ginsburg-Spanier — A bounded language $L$ is context-free if and only if $\{(n_{1},\ldots ,n_{k})\mid w_{1}^{n_{1}}\ldots w_{k}^{n_{k}}\in L\}$ is a stratified semi-linear set.

Significance

The theorem has multiple interpretations. It shows that a context-free language over a singleton alphabet must be a regular language and that some context-free languages can only have ambiguous grammars ^{[ further explanation needed ]}. Such languages are called inherently ambiguous languages . From a formal grammar perspective, this means that some ambiguous context-free grammars cannot be converted to equivalent unambiguous context-free grammars.

Related Research Articles

In formal language theory, a context-free grammar (CFG) is a formal grammar whose production rules can be applied to a nonterminal symbol regardless of its context. In particular, in a context-free grammar, each production rule is of the form

The Hahn–Banach theorem is a central tool in functional analysis. It allows the extension of bounded linear functionals defined on a vector subspace of some vector space to the whole space, and it also shows that there are "enough" continuous linear functionals defined on every normed vector space to make the study of the dual space "interesting". Another version of the Hahn–Banach theorem is known as the Hahn–Banach separation theorem or the hyperplane separation theorem, and has numerous uses in convex geometry.

In mathematics, and more specifically in linear algebra, a linear map is a mapping $between two vector spaces that preserves the operations of vector addition and scalar multiplication. The same names and the same definition are also used for the more general case of modules over a ring; see Module homomorphism.$

In mathematics, a normed vector space or normed space is a vector space over the real or complex numbers on which a norm is defined. A norm is a generalization of the intuitive notion of "length" in the physical world. If $is a vector space over, where is a field equal to or to, then a norm on is a map, typically denoted by, satisfying the following four axioms:$

Non-negativity: for every $, .$
Positive definiteness: for every $, if and only if is the zero vector.$
Absolute homogeneity: for every $and,$
Triangle inequality: for every $and,$

In mathematics, a product is the result of multiplication, or an expression that identifies objects to be multiplied, called factors. For example, 21 is the product of 3 and 7, and $is the product of and . When one factor is an integer, the product is called a multiple .$

In mathematics, the tensor product $of two vector spaces V and W is a vector space to which is associated a bilinear map that maps a pair to an element of denoted ⁠ ⁠ .$

In computer science, an LL parser is a top-down parser for a restricted context-free language. It parses the input from Left to right, performing Leftmost derivation of the sentence.

A mathematical symbol is a figure or a combination of figures that is used to represent a mathematical object, an action on mathematical objects, a relation between mathematical objects, or for structuring the other symbols that occur in a formula. As formulas are entirely constituted with symbols of various types, many symbols are needed for expressing all mathematics.

In vector calculus, the divergence theorem, also known as Gauss's theorem or Ostrogradsky's theorem, is a theorem relating the flux of a vector field through a closed surface to the divergence of the field in the volume enclosed.

In mathematics, a linear form is a linear map from a vector space to its field of scalars.

In functional analysis and related areas of mathematics, locally convex topological vector spaces (LCTVS) or locally convex spaces are examples of topological vector spaces (TVS) that generalize normed spaces. They can be defined as topological vector spaces whose topology is generated by translations of balanced, absorbent, convex sets. Alternatively they can be defined as a vector space with a family of seminorms, and a topology can be defined in terms of that family. Although in general such spaces are not necessarily normable, the existence of a convex local base for the zero vector is strong enough for the Hahn–Banach theorem to hold, yielding a sufficiently rich theory of continuous linear functionals.

Verma modules, named after Daya-Nand Verma, are objects in the representation theory of Lie algebras, a branch of mathematics.

In geometry, a three-dimensional space is a mathematical space in which three values (coordinates) are required to determine the position of a point. Most commonly, it is the three-dimensional Euclidean space, that is, the Euclidean space of dimension three, which models physical space. More general three-dimensional spaces are called 3-manifolds. The term may also refer colloquially to a subset of space, a three-dimensional region, a solid figure.

Conjunctive grammars are a class of formal grammars studied in formal language theory. They extend the basic type of grammars, the context-free grammars, with a conjunction operation. Besides explicit conjunction, conjunctive grammars allow implicit disjunction represented by multiple rules for a single nonterminal symbol, which is the only logical connective expressible in context-free grammars. Conjunction can be used, in particular, to specify intersection of languages. A further extension of conjunctive grammars known as Boolean grammars additionally allows explicit negation.

In algebraic geometry, the Chow groups of an algebraic variety over any field are algebro-geometric analogs of the homology of a topological space. The elements of the Chow group are formed out of subvarieties in a similar way to how simplicial or cellular homology groups are formed out of subcomplexes. When the variety is smooth, the Chow groups can be interpreted as cohomology groups and have a multiplication called the intersection product. The Chow groups carry rich information about an algebraic variety, and they are correspondingly hard to compute in general.

In computer science, partial application refers to the process of fixing a number of arguments of a function, producing another function of smaller arity. Given a function $, we might fix the first argument, producing a function of type . Evaluation of this function might be represented as . Note that the result of partial function application in this case is a function that takes two arguments. Partial application is sometimes incorrectly called currying, which is a related, but distinct concept.$

Range concatenation grammar (RCG) is a grammar formalism developed by Pierre Boullier in 1998 as an attempt to characterize a number of phenomena of natural language, such as Chinese numbers and German word order scrambling, which are outside the bounds of the mildly context-sensitive languages.

Controlled grammars are a class of grammars that extend, usually, the context-free grammars with additional controls on the derivations of a sentence in the language. A number of different kinds of controlled grammars exist, the four main divisions being Indexed grammars, grammars with prescribed derivation sequences, grammars with contextual conditions on rule application, and grammars with parallelism in rule application. Because indexed grammars are so well established in the field, this article will address only the latter three kinds of controlled grammars.

In formal language theory, the Chomsky–Schützenberger enumeration theorem is a theorem derived by Noam Chomsky and Marcel-Paul Schützenberger about the number of words of a given length generated by an unambiguous context-free grammar. The theorem provides an unexpected link between the theory of formal languages and abstract algebra.

In mathematics, the injective tensor product of two topological vector spaces (TVSs) was introduced by Alexander Grothendieck and was used by him to define nuclear spaces. An injective tensor product is in general not necessarily complete, so its completion is called the completed injective tensor products. Injective tensor products have applications outside of nuclear spaces. In particular, as described below, up to TVS-isomorphism, many TVSs that are defined for real or complex valued functions, for instance, the Schwartz space or the space of continuously differentiable functions, can be immediately extended to functions valued in a Hausdorff locally convex TVS $with out any need to extend definitions from real/complex-valued functions to -valued functions.$

References

1 2 Kozen, Dexter (1997). Automata and Computability. New York: Springer-Verlag. ISBN 3-540-78105-6.
↑ Håkan Lindqvist. "Parikh's theorem" (PDF). Umeå Universitet.
↑ Parikh, Rohit (1961). "Language Generating Devices". Quarterly Progress Report, Research Laboratory of Electronics, MIT.
↑ Parikh, Rohit (1966). "On Context-Free Languages". Journal of the Association for Computing Machinery . 13 (4): 570–581. doi: 10.1145/321356.321364 . S2CID 12263468.
↑ Goldstine, J. (1977-01-01). "A simplified proof of parikh's theorem". Discrete Mathematics. 19 (3): 235–239. doi:10.1016/0012-365X(77)90103-0. ISSN 0012-365X.
↑ Ginsburg, Seymour; Spanier, Edwin H. (1966). "Presburger formulas, and languages". Pacific Journal of Mathematics. 16 (2): 285–296. doi: 10.2140/pjm.1966.16.285 .

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.

[kozen-1] 1 2 Kozen, Dexter (1997). Automata and Computability. New York: Springer-Verlag. ISBN 3-540-78105-6.

[2] Håkan Lindqvist. "Parikh's theorem" (PDF). Umeå Universitet.

[3] Parikh, Rohit (1961). "Language Generating Devices". Quarterly Progress Report, Research Laboratory of Electronics, MIT.

[4] Parikh, Rohit (1966). "On Context-Free Languages". Journal of the Association for Computing Machinery . 13 (4): 570–581. doi: 10.1145/321356.321364 . S2CID 12263468.

[5] Goldstine, J. (1977-01-01). "A simplified proof of parikh's theorem". Discrete Mathematics. 19 (3): 235–239. doi:10.1016/0012-365X(77)90103-0. ISSN 0012-365X.

[6] Ginsburg, Seymour; Spanier, Edwin H. (1966). "Presburger formulas, and languages". Pacific Journal of Mathematics. 16 (2): 285–296. doi: 10.2140/pjm.1966.16.285 .

[1]

[2]

[3]

[4]

[5]

[6]