Pumping lemma for context-free languages

Last updated November 05, 2023

In computer science, in particular in formal language theory, the pumping lemma for context-free languages, also known as the Bar-Hillel lemma,^[1] is a lemma that gives a property shared by all context-free languages and generalizes the pumping lemma for regular languages.

Formal statement

If a language $L$ is context-free, then there exists some integer $p\geq 1$ (called a "pumping length")^[2] such that every string $s$ in $L$ that has a length of $p$ or more symbols (i.e. with $|s|\geq p$ ) can be written as

$s=uvwxy$

with substrings $u,v,w,x$ and $y$, such that

1. $|vx|\geq 1$,
2. $|vwx|\leq p$, and
3. $uv^{n}wx^{n}y\in L$ for all $n\geq 0$.

Below is a formal expression of the Pumping Lemma.

${\begin{array}{l}(\forall L\subseteq \Sigma ^{*})\\\quad ({\mbox{context free}}(L)\Rightarrow \\\quad ((\exists p\geq 1)((\forall s\in L)((|s|\geq p)\Rightarrow \\\quad ((\exists u,v,w,x,y\in \Sigma ^{*})(s=uvwxy\land |vx|\geq 1\land |vwx|\leq p\land (\forall n\geq 0)(uv^{n}wx^{n}y\in L)))))))\end{array}}$
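The quantifier structure of the statement can be explored mechanically on small cases. The following Python sketch enumerates every split of a string that satisfies conditions 1 and 2 and tests condition 3 only up to a bound. The names `decompositions`, `pump`, `pumpable` and `is_anbn`, and the cutoff `max_n`, are illustrative assumptions and not part of the lemma. Because of the cutoff, a `True` result is only evidence, whereas a `False` result does show that no valid split exists for that particular $s$ and $p$.

```python
from typing import Callable, Iterator, Tuple

Split = Tuple[str, str, str, str, str]


def decompositions(s: str, p: int) -> Iterator[Split]:
    """Yield every split s = u v w x y with |vx| >= 1 and |vwx| <= p."""
    size = len(s)
    for i in range(size + 1):                       # u = s[:i]
        for l in range(i, min(size, i + p) + 1):    # vwx = s[i:l], so |vwx| <= p
            for j in range(i, l + 1):               # v = s[i:j]
                for k in range(j, l + 1):           # w = s[j:k], x = s[k:l]
                    if (j - i) + (l - k) >= 1:      # condition 1: |vx| >= 1
                        yield s[:i], s[i:j], s[j:k], s[k:l], s[l:]


def pump(split: Split, n: int) -> str:
    """Return u v^n w x^n y for the given split."""
    u, v, w, x, y = split
    return u + v * n + w + x * n + y


def pumpable(s: str, p: int, in_language: Callable[[str], bool],
             max_n: int = 4) -> bool:
    """True if some split keeps u v^n w x^n y in the language for n = 0..max_n.

    Only a bounded check: the lemma itself quantifies over all n >= 0.
    """
    return any(
        all(in_language(pump(split, n)) for n in range(max_n + 1))
        for split in decompositions(s, p)
    )


def is_anbn(t: str) -> bool:
    """Membership test for the context-free language { a^n b^n : n >= 0 }."""
    half, odd = divmod(len(t), 2)
    return odd == 0 and t == "a" * half + "b" * half


print(pumpable("aaabbb", 2, is_anbn))  # the split aa|a||b|bb passes the bounded check
print(pumpable("aaabbb", 1, is_anbn))  # with p = 1 no split works for this string
```

For this example language, the second call indicates that 1 cannot serve as a pumping length, while the first is consistent with 2 being one.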

Informal statement and explanation

The pumping lemma for context-free languages (called just "the pumping lemma" for the rest of this article) describes a property that all context-free languages are guaranteed to have.

The property applies to all strings in the language that are of length at least $p$, where $p$ is a constant, called the pumping length, that varies between context-free languages.

Say $s$ is a string of length at least $p$ that is in the language.

The pumping lemma states that $s$ can be split into five substrings, $s=uvwxy$ , where $vx$ is non-empty and the length of $vwx$ is at most $p$ , such that repeating $v$ and $x$ the same number of times ( $n$ ) in $s$ produces a string that is still in the language. It is often useful to repeat zero times, which removes $v$ and $x$ from the string. This process of "pumping up" $s$ with additional copies of $v$ and $x$ is what gives the pumping lemma its name.

Finite languages (which are regular and hence context-free) obey the pumping lemma trivially by having $p$ equal to the maximum string length in $L$ plus one. As $L$ contains no strings of this length or longer, the condition holds vacuously and the pumping lemma is not violated.

Usage of the lemma

The pumping lemma is often used to prove that a given language $L$ is non-context-free, by showing that arbitrarily long strings $s$ are in $L$ that cannot be "pumped" without producing strings outside $L$ .

For example, if $S\subset \mathbb {N}$ is infinite but does not contain an (infinite) arithmetic progression, then $L=\{a^{n}:n\in S\}$ is not context-free. In particular, neither $\{a^{n}:n{\text{ is prime}}\}$ nor $\{a^{n}:n{\text{ is a square}}\}$ is context-free.
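The reason arithmetic progressions matter is that, over a one-letter alphabet, pumping $a^{m}$ with $|vx|=k$ produces exactly the strings of length $m+(n-1)k$ for $n\geq 0$. The Python sketch below illustrates this for the square numbers; the helper names `is_square` and `first_non_member` and the search bound `max_n` are assumptions made for the illustration, and the check covers a single starting length rather than constituting a proof.

```python
from math import isqrt
from typing import Callable, Optional


def is_square(m: int) -> bool:
    """True if m is a perfect square (0 counts as a square here)."""
    return isqrt(m) ** 2 == m


def first_non_member(m: int, k: int, member: Callable[[int], bool],
                     max_n: int = 1000) -> Optional[int]:
    """Smallest n in 0..max_n with m + (n - 1) * k outside the set, else None.

    m is the length of the original string and k = |vx|, so 1 <= k <= m.
    """
    for n in range(max_n + 1):
        if not member(m + (n - 1) * k):
            return n
    return None


# Every possible value of |vx| for the string a^16 leaves the squares quickly.
for k in range(1, 17):
    print(k, first_non_member(16, k, is_square))
```

For instance, with $m=16$ and $k=7$ the pumped lengths are 9, 16, 23, and so on, and 23 is not a square.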

As a second example, the language $L=\{a^{n}b^{n}c^{n}|n>0\}$ can be shown to be non-context-free by using the pumping lemma in a proof by contradiction. First, assume that $L$ is context-free. By the pumping lemma, there exists an integer $p$ that is a pumping length for $L$. Consider the string $s=a^{p}b^{p}c^{p}$ in $L$. The pumping lemma tells us that $s$ can be written in the form $s=uvwxy$, where $u,v,w,x$ and $y$ are substrings, such that $|vx|\geq 1$, $|vwx|\leq p$, and $uv^{i}wx^{i}y\in L$ for every integer $i\geq 0$. Because $|vwx|\leq p$ and the $a$'s and $c$'s in $s$ are separated by $p$ $b$'s, the substring $vwx$ can contain no more than two distinct symbols. That is, we have one of five possibilities for $vwx$:

$vwx=a^{j}$ for some $j\leq p$ .
$vwx=a^{j}b^{k}$ for some $j$ and $k$ with $j+k\leq p$ .
$vwx=b^{j}$ for some $j\leq p$ .
$vwx=b^{j}c^{k}$ for some $j$ and $k$ with $j+k\leq p$ .
$vwx=c^{j}$ for some $j\leq p$ .

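In each case, $vx$ is non-empty and misses at least one of the three letters, so pumping changes the count of some letter while leaving the count of another unchanged. Hence $uv^{i}wx^{i}y$ does not contain equal numbers of each letter for any $i\neq 1$. In particular, $uv^{2}wx^{2}y$ does not have the form $a^{n}b^{n}c^{n}$ and so is not in $L$, which contradicts the pumping lemma. Therefore, our initial assumption that $L$ is context-free must be false.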
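The case analysis can be confirmed by brute force for small values of $p$. The Python sketch below tries every split of $a^{p}b^{p}c^{p}$ that obeys $|vx|\geq 1$ and $|vwx|\leq p$ and keeps those whose pumped strings for $i=0$ and $i=2$ remain in $L$. The function names `in_L` and `surviving_splits` are illustrative, and a finite check of this kind supports the argument without replacing it, since the proof must hold for every $p$.

```python
from typing import List, Tuple


def in_L(t: str) -> bool:
    """Membership test for L = { a^n b^n c^n : n > 0 }."""
    n = len(t) // 3
    return n > 0 and t == "a" * n + "b" * n + "c" * n


def surviving_splits(p: int) -> List[Tuple[str, str, str, str, str]]:
    """Splits of s = a^p b^p c^p with |vx| >= 1 and |vwx| <= p
    whose pumped strings for i = 0 and i = 2 both stay in L."""
    s = "a" * p + "b" * p + "c" * p
    found = []
    for start in range(len(s) + 1):                           # u = s[:start]
        for end in range(start, min(len(s), start + p) + 1):  # vwx = s[start:end]
            for j in range(start, end + 1):                   # v = s[start:j]
                for k in range(j, end + 1):                   # w = s[j:k], x = s[k:end]
                    u, v, w, x, y = s[:start], s[start:j], s[j:k], s[k:end], s[end:]
                    if not v + x:                             # enforce |vx| >= 1
                        continue
                    if all(in_L(u + v * i + w + x * i + y) for i in (0, 2)):
                        found.append((u, v, w, x, y))
    return found


for p in range(1, 5):
    print(p, surviving_splits(p))  # an empty list means no split can be pumped
```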

While the pumping lemma is often a useful tool to prove that a given language is not context-free, it does not give a complete characterization of the context-free languages. If a language does not satisfy the condition given by the pumping lemma, we have established that it is not context-free. On the other hand, there are languages that are not context-free, but still satisfy the condition given by the pumping lemma, for example

$L=\{b^{j}c^{k}d^{l}|j,k,l\in \mathbb {N} \}\cup \{a^{i}b^{j}c^{j}d^{j}|i,j\in \mathbb {N} ,i\geq 1\}$

for $s=b^{j}c^{k}d^{l}$ with e.g. $j\geq 1$ choose $vwx$ to consist only of $b$'s, for $s=a^{i}b^{j}c^{j}d^{j}$ choose $vwx$ to consist only of $a$'s; in both cases all pumped strings are still in $L$.^[3]
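The choice of $vwx$ described above can be tried out directly. The Python sketch below assumes that $\mathbb {N}$ includes 0 (as the remark "$j\geq 1$" suggests) and always pumps the first symbol of the string, i.e. it uses the split $u=\varepsilon$, $v=s_{1}$, $w=x=\varepsilon$, which is one concrete instance of the choices named in the text. The names `in_L`, `pump_first_symbol` and `samples` are illustrative, and the loop checks only a few sample strings and exponents.

```python
import re


def in_L(t: str) -> bool:
    """Membership in { b^j c^k d^l } union { a^i b^j c^j d^j : i >= 1 }."""
    m = re.fullmatch(r"(a*)(b*)(c*)(d*)", t)
    if m is None:
        return False
    a, b, c, d = (len(group) for group in m.groups())
    return a == 0 or b == c == d


def pump_first_symbol(s: str, n: int) -> str:
    """Pump the split u = '', v = s[0], w = x = '', y = s[1:] of a non-empty s."""
    return s[0] * n + s[1:]


samples = ["b", "bbccd", "cd", "a", "abcd", "aabbccdd"]
for s in samples:
    # Every pumped string, including the one for n = 0, should stay in L.
    print(s, all(in_L(pump_first_symbol(s, n)) for n in range(6)))

# Pumping elsewhere can leave the language, e.g. repeating the b in "abcd":
print(in_L("abbcd"))
```

Pumping the leading symbol never disturbs the order of the letters, and removing the last $a$ from a string of the second kind yields a string of the first kind, which is why every sample stays in $L$.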

A precursor of the pumping lemma was used in 1960 by Scheinberg to prove that $L=\{a^{n}b^{n}a^{n}|n>0\}$ is not context-free.^[4]

References

[1] Kreowski, Hans-Jörg (1979). "A pumping lemma for context-free graph languages". In Claus, Volker; Ehrig, Hartmut; Rozenberg, Grzegorz (eds.). Graph-Grammars and Their Application to Computer Science and Biology. Lecture Notes in Computer Science. Vol. 73. Berlin, Heidelberg: Springer. pp. 270–283. doi:10.1007/BFb0025726. ISBN 978-3-540-35091-0.
[2] Berstel, Jean; Lauve, Aaron; Reutenauer, Christophe; Saliola, Franco V. (2009). Combinatorics on words. Christoffel words and repetitions in words (PDF). CRM Monograph Series. Vol. 27. Providence, RI: American Mathematical Society. p. 90. ISBN 978-0-8218-4480-9. Zbl 1161.68043. (Also see Berstel's website: www-igm.univ-mlv.fr/~berstel/)
[3] John E. Hopcroft, Jeffrey D. Ullman (1979). Introduction to Automata Theory, Languages, and Computation. Addison-Wesley. ISBN 0-201-02988-X. Here: sect. 6.1, p. 129.
[4] Stephen Scheinberg (1960). "Note on the Boolean Properties of Context Free Languages" (PDF). Information and Control. 3 (4): 372–375. doi:10.1016/s0019-9958(60)90965-7. Here: Lemma 3, and its use on pp. 374–375.

Bar-Hillel, Y.; Micha Perles; Eli Shamir (1961). "On formal properties of simple phrase-structure grammars". Zeitschrift für Phonetik, Sprachwissenschaft, und Kommunikationsforschung. 14 (2): 143–172.— Reprinted in: Y. Bar-Hillel (1964). Language and Information: Selected Essays on their Theory and Application. Addison-Wesley series in logic. Addison-Wesley. pp. 116–150. ISBN 0201003732. OCLC 783543642.
Michael Sipser (1997). Introduction to the Theory of Computation. PWS Publishing. ISBN 0-534-94728-X. Section 1.4: Nonregular Languages, pp. 77–83. Section 2.3: Non-context-free Languages, pp. 115–119.

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.
