Straight-line grammar

Last updated January 27, 2025 • 2 min readFrom Wikipedia, The Free Encyclopedia

A straight-line grammar (sometimes abbreviated as SLG) is a formal grammar that generates exactly one string.^[1] Consequently, it does not branch (every non-terminal has only one associated production rule) nor loop (if non-terminal A appears in a derivation of B, then B does not appear in a derivation of A).^[1]

Areas of usefulness

Straight-line grammars are widely used in the development of algorithms that execute directly on compressed structures (without prior decompression).^[2]^: 212

SLGs are of interest in fields like Kolmogorov complexity, Lossless data compression, Structure discovery and Compressed data structures.^{[ clarification needed ]}

The problem of finding a context-free grammar (equivalently: an SLG) of minimal size that generates a given string is called the smallest grammar problem.^{[ citation needed ]}

Straight-line grammars (more precisely: straight-line context-free string grammars) can be generalized to Straight-line context-free tree grammars. The latter can be used conveniently to compress trees.^[2]^: 212

Formal definition

A context-free grammar G is an SLG if:

1. for every non-terminal N, there is at most one production rule that has N as its left-hand side, and

2. the directed graph G=<V,E>, defined by V being the set of non-terminals and (A,B) ∈ E whenever B appears at the right-hand side of a production rule for A, is acyclic.

A mathematical definition of the more general formalism of straight-line context-free tree grammars can be found in Lohrey et al.^[2]^: 215

An SLG in Chomsky normal form is equivalent to a straight-line program.^{[ citation needed ]}

A list of algorithms using SLGs

The Sequitur algorithm constructs a straight-line grammar for a given string.
The Lempel-Ziv-Welch algorithm creates a context-free grammar in such a deterministic way that it is necessary to store only the start rule of the generated grammar.
Byte pair encoding

Related Research Articles

A context-sensitive grammar (CSG) is a formal grammar in which the left-hand sides and right-hand sides of any production rules may be surrounded by a context of terminal and nonterminal symbols. Context-sensitive grammars are more general than context-free grammars, in the sense that there are languages that can be described by a CSG but not by a context-free grammar. Context-sensitive grammars are less general than unrestricted grammars. Thus, CSGs are positioned between context-free and unrestricted grammars in the Chomsky hierarchy.

In formal language theory, a context-free grammar (CFG) is a formal grammar whose production rules can be applied to a nonterminal symbol regardless of its context. In particular, in a context-free grammar, each production rule is of the form

In information theory, data compression, source coding, or bit-rate reduction is the process of encoding information using fewer bits than the original representation. Any particular compression is either lossy or lossless. Lossless compression reduces bits by identifying and eliminating statistical redundancy. No information is lost in lossless compression. Lossy compression reduces bits by removing unnecessary or less important information. Typically, a device that performs data compression is referred to as an encoder, and one that performs the reversal of the process (decompression) as a decoder.

In computer science, the Earley parser is an algorithm for parsing strings that belong to a given context-free language, though it may suffer problems with certain nullable grammars. The algorithm, named after its inventor, Jay Earley, is a chart parser that uses dynamic programming; it is mainly used for parsing in computational linguistics. It was first introduced in his dissertation in 1968.

Lossless compression is a class of data compression that allows the original data to be perfectly reconstructed from the compressed data with no loss of information. Lossless compression is possible because most real-world data exhibits statistical redundancy. By contrast, lossy compression permits reconstruction only of an approximation of the original data, though usually with greatly improved compression rates.

An L-system or Lindenmayer system is a parallel rewriting system and a type of formal grammar. An L-system consists of an alphabet of symbols that can be used to make strings, a collection of production rules that expand each symbol into some larger string of symbols, an initial "axiom" string from which to begin construction, and a mechanism for translating the generated strings into geometric structures. L-systems were introduced and developed in 1968 by Aristid Lindenmayer, a Hungarian theoretical biologist and botanist at the University of Utrecht. Lindenmayer used L-systems to describe the behaviour of plant cells and to model the growth processes of plant development. L-systems have also been used to model the morphology of a variety of organisms and can be used to generate self-similar fractals.

In computer science, the Cocke–Younger–Kasami algorithm is a parsing algorithm for context-free grammars published by Itiroo Sakai in 1961. The algorithm is named after some of its rediscoverers: John Cocke, Daniel Younger, Tadao Kasami, and Jacob T. Schwartz. It employs bottom-up parsing and dynamic programming.

In theoretical linguistics and computational linguistics, probabilistic context free grammars (PCFGs) extend context-free grammars, similar to how hidden Markov models extend regular grammars. Each production is assigned a probability. The probability of a derivation (parse) is the product of the probabilities of the productions used in that derivation. These probabilities can be viewed as parameters of the model, and for large problems it is convenient to learn these parameters via machine learning. A probabilistic grammar's validity is constrained by context of its training dataset.

Top-down parsing in computer science is a parsing strategy where one first looks at the highest level of the parse tree and works down the parse tree by using the rewriting rules of a formal grammar. LL parsers are a type of parser that uses a top-down parsing strategy.

In computer science, a parsing expression grammar (PEG) is a type of analytic formal grammar, i.e. it describes a formal language in terms of a set of rules for recognizing strings in the language. The formalism was introduced by Bryan Ford in 2004 and is closely related to the family of top-down parsing languages introduced in the early 1970s. Syntactically, PEGs also look similar to context-free grammars (CFGs), but they have a different interpretation: the choice operator selects the first match in PEG, while it is ambiguous in CFG. This is closer to how string recognition tends to be done in practice, e.g. by a recursive descent parser.

Sequitur is a recursive algorithm developed by Craig Nevill-Manning and Ian H. Witten in 1997 that infers a hierarchical structure from a sequence of discrete symbols. The algorithm operates in linear space and time. It can be used in data compression software applications.

An image file format is a file format for a digital image. There are many formats that can be used, such as JPEG, PNG, and GIF. Most formats up until 2022 were for storing 2D images, not 3D ones. The data stored in an image file format may be compressed or uncompressed. If the data is compressed, it may be done so using lossy compression or lossless compression. For graphic design applications, vector formats are often used. Some image file formats support transparency.

Grammar-based codes or Grammar-based compression are compression algorithms based on the idea of constructing a context-free grammar (CFG) for the string to be compressed. Examples include universal lossless data compression algorithms. To compress a data sequence $, a grammar-based code transforms into a context-free grammar . The problem of finding a smallest grammar for an input sequence is known to be NP-hard, so many grammar-transform algorithms are proposed from theoretical and practical viewpoints. Generally, the produced grammar is further compressed by statistical encoders like arithmetic coding.$

Grammar induction is the process in machine learning of learning a formal grammar from a set of observations, thus constructing a model which accounts for the characteristics of the observed objects. More generally, grammatical inference is that branch of machine learning where the instance space consists of discrete combinatorial objects such as strings, trees and graphs.

In data compression and the theory of formal languages, the smallest grammar problem is the problem of finding the smallest context-free grammar that generates a given string of characters. The size of a grammar is defined by some authors as the number of symbols on the right side of the production rules. Others also add the number of rules to that. A grammar that generates only a single string, as required for the solution to this problem, is called a straight-line grammar.

In computer science, a grammar is informally called a recursive grammar if it contains production rules that are recursive, meaning that expanding a non-terminal according to these rules can eventually lead to a string that includes the same non-terminal again. Otherwise it is called a non-recursive grammar.

In coding theory, a variable-length code is a code which maps source symbols to a variable number of bits. The equivalent concept in computer science is bit string.

A formal grammar describes which strings from an alphabet of a formal language are valid according to the language's syntax. A grammar does not describe the meaning of the strings or what can be done with them in whatever context—only their form. A formal grammar is defined as a set of production rules for such strings in a formal language.

Re-Pair is a grammar-based compression algorithm that, given an input text, builds a straight-line program, i.e. a context-free grammar generating a single string: the input text. In order to perform the compression in linear time, it consumes the amount of memory that is approximately five times the size of its input.

References

1 2 Florian Benz and Timo Kötzing, “An effective heuristic for the smallest grammar problem,” Proceedings of the fifteenth annual conference on Genetic and evolutionary computation conference - GECCO ’13, 2013. ISBN 978-1-4503-1963-8 doi : 10.1145/2463372.2463441, p. 488
1 2 3 Markus Lohrey; Sebastian Maneth; Manfred Schmidt-Schauß (2009). "Parameter Reduction in Grammar-Compressed Trees". Proc. FOSSACS (PDF). LNCS. Vol. 5504. Springer. pp. 212–226.

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.

[GECCO-1] 1 2 Florian Benz and Timo Kötzing, “An effective heuristic for the smallest grammar problem,” Proceedings of the fifteenth annual conference on Genetic and evolutionary computation conference - GECCO ’13, 2013. ISBN 978-1-4503-1963-8 doi : 10.1145/2463372.2463441, p. 488

[Lohrey.Maneth.Schmidt.2009-2] 1 2 3 Markus Lohrey; Sebastian Maneth; Manfred Schmidt-Schauß (2009). "Parameter Reduction in Grammar-Compressed Trees". Proc. FOSSACS (PDF). LNCS. Vol. 5504. Springer. pp. 212–226.

[1]

[2]