Smallest grammar problem

Last updated

In data compression and the theory of formal languages, the smallest grammar problem is the problem of finding the smallest context-free grammar that generates a given string of characters (but no other string). The size of a grammar is defined by some authors as the number of symbols on the right side of the production rules. [1] Others also add the number of rules to that. [2] A grammar that generates only a single string, as required for the solution to this problem, is called a straight-line grammar. [3]

Contents

Every binary string of length has a grammar of length , as expressed using big O notation. [3] For binary de Bruijn sequences, no better length is possible. [4]

The (decision version of the) smallest grammar problem is NP-complete. [1] It can be approximated in polynomial time to within a logarithmic approximation ratio; more precisely, the ratio is where is the length of the given string and is the size of its smallest grammar. It is hard to approximate to within a constant approximation ratio. An improvement of the approximation ratio to would also improve certain algorithms for approximate addition chains. [5]

See also

Related Research Articles

<span class="mw-page-title-main">Kolmogorov complexity</span> Measure of algorithmic complexity

In algorithmic information theory, the Kolmogorov complexity of an object, such as a piece of text, is the length of a shortest computer program that produces the object as output. It is a measure of the computational resources needed to specify the object, and is also known as algorithmic complexity, Solomonoff–Kolmogorov–Chaitin complexity, program-size complexity, descriptive complexity, or algorithmic entropy. It is named after Andrey Kolmogorov, who first published on the subject in 1963 and is a generalization of classical information theory.

In theoretical computer science and formal language theory, a regular language is a formal language that can be defined by a regular expression, in the strict sense in theoretical computer science.

Minimum message length (MML) is a Bayesian information-theoretic method for statistical model comparison and selection. It provides a formal information theory restatement of Occam's Razor: even when models are equal in their measure of fit-accuracy to the observed data, the one generating the most concise explanation of data is more likely to be correct. MML was invented by Chris Wallace, first appearing in the seminal paper "An information measure for classification". MML is intended not just as a theoretical construct, but as a technique that may be deployed in practice. It differs from the related concept of Kolmogorov complexity in that it does not require use of a Turing-complete language to model data.

<span class="mw-page-title-main">Time complexity</span> Estimate of time taken for running an algorithm

In theoretical computer science, the time complexity is the computational complexity that describes the amount of computer time it takes to run an algorithm. Time complexity is commonly estimated by counting the number of elementary operations performed by the algorithm, supposing that each elementary operation takes a fixed amount of time to perform. Thus, the amount of time taken and the number of elementary operations performed by the algorithm are taken to be related by a constant factor.

<span class="mw-page-title-main">Vertex cover</span> Subset of a graphs vertices, including at least one endpoint of every edge

In graph theory, a vertex cover of a graph is a set of vertices that includes at least one endpoint of every edge of the graph.

In computer science and operations research, approximation algorithms are efficient algorithms that find approximate solutions to optimization problems with provable guarantees on the distance of the returned solution to the optimal one. Approximation algorithms naturally arise in the field of theoretical computer science as a consequence of the widely believed P ≠ NP conjecture. Under this conjecture, a wide class of optimization problems cannot be solved exactly in polynomial time. The field of approximation algorithms, therefore, tries to understand how closely it is possible to approximate optimal solutions to such problems in polynomial time. In an overwhelming majority of the cases, the guarantee of such algorithms is a multiplicative one expressed as an approximation ratio or approximation factor i.e., the optimal solution is always guaranteed to be within a (predetermined) multiplicative factor of the returned solution. However, there are also many approximation algorithms that provide an additive guarantee on the quality of the returned solution. A notable example of an approximation algorithm that provides both is the classic approximation algorithm of Lenstra, Shmoys and Tardos for scheduling on unrelated parallel machines.

<span class="mw-page-title-main">Set cover problem</span> Classical problem in combinatorics

The set cover problem is a classical question in combinatorics, computer science, operations research, and complexity theory.

In computational complexity theory, the class APX is the set of NP optimization problems that allow polynomial-time approximation algorithms with approximation ratio bounded by a constant. In simple terms, problems in this class have efficient algorithms that can find an answer within some fixed multiplicative factor of the optimal answer.

In computer science, the shortest common supersequence of two sequences X and Y is the shortest sequence which has X and Y as subsequences. This is a problem closely related to the longest common subsequence problem. Given two sequences X = < x1,...,xm > and Y = < y1,...,yn >, a sequence U = < u1,...,uk > is a common supersequence of X and Y if items can be removed from U to produce X and Y.

Grammar-based codes or Grammar-based compression are compression algorithms based on the idea of constructing a context-free grammar (CFG) for the string to be compressed. Examples include universal lossless data compression algorithms. To compress a data sequence , a grammar-based code transforms into a context-free grammar . The problem of finding a smallest grammar for an input sequence is known to be NP-hard, so many grammar-transform algorithms are proposed from theoretical and practical viewpoints. Generally, the produced grammar is further compressed by statistical encoders like arithmetic coding.

A height function is a function that quantifies the complexity of mathematical objects. In Diophantine geometry, height functions quantify the size of solutions to Diophantine equations and are typically functions from a set of points on algebraic varieties to the real numbers.

MAX-3SAT is a problem in the computational complexity subfield of computer science. It generalises the Boolean satisfiability problem (SAT) which is a decision problem considered in complexity theory. It is defined as:

Grammar induction is the process in machine learning of learning a formal grammar from a set of observations, thus constructing a model which accounts for the characteristics of the observed objects. More generally, grammatical inference is that branch of machine learning where the instance space consists of discrete combinatorial objects such as strings, trees and graphs.

<span class="mw-page-title-main">Closest pair of points problem</span> Computational geometry problem

The closest pair of points problem or closest pair problem is a problem of computational geometry: given points in metric space, find a pair of points with the smallest distance between them. The closest pair problem for points in the Euclidean plane was among the first geometric problems that were treated at the origins of the systematic study of the computational complexity of geometric algorithms.

In 1973, Andrey Kolmogorov proposed a non-probabilistic approach to statistics and model selection. Let each datum be a finite binary string and a model be a finite set of binary strings. Consider model classes consisting of models of given maximal Kolmogorov complexity. The Kolmogorov structure function of an individual data string expresses the relation between the complexity level constraint on a model class and the least log-cardinality of a model in the class containing the data. The structure function determines all stochastic properties of the individual data string: for every constrained model class it determines the individual best-fitting model in the class irrespective of whether the true model is in the model class considered or not. In the classical case we talk about a set of data with a probability distribution, and the properties are those of the expectations. In contrast, here we deal with individual data strings and the properties of the individual string focused on. In this setting, a property holds with certainty rather than with high probability as in the classical case. The Kolmogorov structure function precisely quantifies the goodness-of-fit of an individual model with respect to individual data.

<span class="mw-page-title-main">Dense subgraph</span> Highly connected subgraph

In graph theory and computer science, a dense subgraph is a subgraph with many edges per vertex. This is formalized as follows: let G = (V, E) be an undirected graph and let S = (VS, ES) be a subgraph of G. Then the density of S is defined to be:

In computational complexity theory, a branch of computer science, the Max/min CSP/Ones classification theorems state necessary and sufficient conditions that determine the complexity classes of problems about satisfying a subset S of boolean relations. They are similar to Schaefer's dichotomy theorem, which classifies the complexity of satisfying finite sets of relations; however, the Max/min CSP/Ones classification theorems give information about the complexity of approximating an optimal solution to a problem defined by S.

In geometry, a partition of a polygon is a set of primitive units, which do not overlap and whose union equals the polygon. A polygon partition problem is a problem of finding a partition which is minimal in some sense, for example a partition with a smallest number of units or with units of smallest total side-length.

In mathematics, the incompressibility method is a proof method like the probabilistic method, the counting method or the pigeonhole principle. To prove that an object in a certain class satisfies a certain property, select an object of that class that is incompressible. If it does not satisfy the property, it can be compressed by computable coding. Since it can be generally proven that almost all objects in a given class are incompressible, the argument demonstrates that almost all objects in the class have the property involved. To select an incompressible object is ineffective, and cannot be done by a computer program. However, a simple counting argument usually shows that almost all objects of a given class can be compressed by only a few bits.

Identical-machines scheduling is an optimization problem in computer science and operations research. We are given n jobs J1, J2, ..., Jn of varying processing times, which need to be scheduled on m identical machines, such that a certain objective function is optimized, for example, the makespan is minimized.

References

  1. 1 2 Charikar, Moses; Lehman, Eric; Liu, Ding; Panigrahy, Rina; Prabhakaran, Manoj; Sahai, Amit; Shelat, Abhi (2005). "The Smallest Grammar Problem". IEEE Transactions on Information Theory. 51 (7): 2554–2576. CiteSeerX   10.1.1.185.2130 . doi:10.1109/TIT.2005.850116. S2CID   6900082. Zbl   1296.68086.
  2. Florian Benz and Timo Kötzing, “An effective heuristic for the smallest grammar problem,” Proceedings of the fifteenth annual conference on Genetic and evolutionary computation conference - GECCO ’13, 2013. ISBN   978-1-4503-1963-8 doi : 10.1145/2463372.2463441
  3. 1 2 Lohrey, Markus (2012). "Algorithmics on SLP-compressed strings: A survey" (PDF). Groups Complexity Cryptology. 4 (2): 241–299. doi:10.1515/GCC-2012-0016.
  4. Domaratzki, Michael; Pighizzini, Giovanni; Shallit, Jeffrey (2002). "Simulating finite automata with context-free grammars". Information Processing Letters. 84 (6): 339–344. doi:10.1016/S0020-0190(02)00316-2. MR   1937222.
  5. Charikar, Moses; Lehman, Eric; Liu, Ding; Panigrahy, Rina; Prabhakaran, Manoj; Rasala, April; Sahai, Amit; Shelat, Abhi (2002). "Approximating the Smallest Grammar: Kolmogorov Complexity in Natural Models" (PDF). Proceedings of the thirty-fourth annual ACM symposium on theory of computing (STOC 2002), Montreal, Quebec, Canada, May 19–21, 2002. New York, NY: ACM Press. pp. 792–801. doi:10.1145/509907.510021. ISBN   978-1-581-13495-7. S2CID   282489. Zbl   1192.68397.