DisCoCat

Last updated

DisCoCat (Categorical Compositional Distributional) is a mathematical framework for natural language processing which uses category theory to unify distributional semantics with the principle of compositionality. The grammatical derivations in a categorial grammar (usually a pregroup grammar) are interpreted as linear maps acting on the tensor product of word vectors to produce the meaning of a sentence or a piece of text. String diagrams are used to visualise information flow and reason about natural language semantics.

Contents

History

The framework was first introduced by Bob Coecke, Mehrnoosh Sadrzadeh, and Stephen Clark [1] as an application of categorical quantum mechanics to natural language processing. It started with the observation that pregroup grammars and quantum processes shared a common mathematical structure: they both form a rigid category (also known as a non-symmetric compact closed category). As such, they both benefit from a graphical calculus, which allows a purely diagrammatic reasoning. Although the analogy with quantum mechanics was kept informal at first, it eventually led to the development of quantum natural language processing. [2] [3]

Definition

There are multiple definitions of DisCoCat in the literature, depending on the choice made for the compositional aspect of the model. The common denominator between all the existent versions, however, always involves a categorical definition of DisCoCat as a structure-preserving functor from a category of grammar to a category of semantics, which usually encodes the distributional hypothesis.

The original paper [1] used the categorical product of FinVect with a pregroup seen as a posetal category. This approach has some shortcomings: all parallel arrows of a posetal category are equal, which means that pregroups cannot distinguish between different grammatical derivations for the same syntactically ambiguous sentence. [4] A more intuitive manner of saying the same is that one works with diagrams rather than with partial orders when describing grammar.

This problem is overcome when one considers the free rigid category generated by the pregroup grammar. [5] That is, has generating objects for the words and the basic types of the grammar, and generating arrows for the dictionary entries which assign a pregroup type to a word . The arrows are grammatical derivations for the sentence which can be represented as string diagrams with cups and caps, i.e. adjunction units and counits. [6]

With this definition of pregroup grammars as free rigid categories, DisCoCat models can be defined as strong monoidal functors . Spelling things out in detail, they assign a finite dimensional vector space to each basic type and a vector in the appropriate tensor product space to each dictionary entry where (objects for words are sent to the monoidal unit, i.e. ). The meaning of a sentence is then given by a vector which can be computed as the contraction of a tensor network. [7]

The reason behind the choice of as the category of semantics is that vector spaces are the usual setting of distributional reading in computational linguistics and natural language processing. The underlying idea of distributional hypothesis "A word is characterized by the company it keeps" is particularly relevant when assigning meaning to words like adjectives or verbs, whose semantic connotation is strongly dependent on context.

Variations

Variations of DisCoCat have been proposed with a different choice for the grammar category. The main motivation behind this lies in the fact that pregroup grammars have been proved to be weakly equivalent to context-free grammars. [8] One example of variation [9] chooses Combinatory categorial grammar as the grammar category.

List of linguistic phenomena

The DisCoCat framework has been used to study the following phenomena from linguistics.

Applications in NLP

The DisCoCat framework has been applied to solve the following tasks in natural language processing.

See also

Related Research Articles

In mathematics and computer science, currying is the technique of translating a function that takes multiple arguments into a sequence of families of functions, each taking a single argument.

In physics, the no-cloning theorem states that it is impossible to create an independent and identical copy of an arbitrary unknown quantum state, a statement which has profound implications in the field of quantum computing among others. The theorem is an evolution of the 1970 no-go theorem authored by James Park, in which he demonstrates that a non-disturbing measurement scheme which is both simple and perfect cannot exist. The aforementioned theorems do not preclude the state of one system becoming entangled with the state of another as cloning specifically refers to the creation of a separable state with identical factors. For example, one might use the controlled NOT gate and the Walsh–Hadamard gate to entangle two qubits without violating the no-cloning theorem as no well-defined state may be defined in terms of a subsystem of an entangled state. The no-cloning theorem concerns only pure states whereas the generalized statement regarding mixed states is known as the no-broadcast theorem.

In computer science, formal methods are mathematically rigorous techniques for the specification, development, analysis, and verification of software and hardware systems. The use of formal methods for software and hardware design is motivated by the expectation that, as in other engineering disciplines, performing appropriate mathematical analysis can contribute to the reliability and robustness of a design.

In programming language theory, semantics is the rigorous mathematical study of the meaning of programming languages. Semantics assigns computational meaning to valid strings in a programming language syntax. It is closely related to, and often crosses over with, the semantics of mathematical proofs.

<span class="mw-page-title-main">Samson Abramsky</span> British computer scientist

Samson Abramsky is Professor of Computer Science at University College London. He was previously the Christopher Strachey Professor of Computing at Wolfson College, Oxford, from 2000 to 2021.

Bunched logic is a variety of substructural logic proposed by Peter O'Hearn and David Pym. Bunched logic provides primitives for reasoning about resource composition, which aid in the compositional analysis of computer and other systems. It has category-theoretic and truth-functional semantics, which can be understood in terms of an abstract concept of resource, and a proof theory in which the contexts Γ in an entailment judgement Γ ⊢ A are tree-like structures (bunches) rather than lists or (multi)sets as in most proof calculi. Bunched logic has an associated type theory, and its first application was in providing a way to control the aliasing and other forms of interference in imperative programs. The logic has seen further applications in program verification, where it is the basis of the assertion language of separation logic, and in systems modelling, where it provides a way to decompose the resources used by components of a system.

String diagrams are a formal graphical language for representing morphisms in monoidal categories, or more generally 2-cells in 2-categories. They are a prominent tool in applied category theory. When interpreted in the monoidal category of vector spaces and linear maps with the tensor product, string diagrams are called tensor networks or Penrose graphical notation. This has led to the development of categorical quantum mechanics where the axioms of quantum theory are expressed in the language of monoidal categories.

In category theory, a branch of mathematics, dagger compact categories first appeared in 1989 in the work of Sergio Doplicher and John E. Roberts on the reconstruction of compact topological groups from their category of finite-dimensional continuous unitary representations. They also appeared in the work of John Baez and James Dolan as an instance of semistrict k-tuply monoidal n-categories, which describe general topological quantum field theories, for n = 1 and k = 3. They are a fundamental structure in Samson Abramsky and Bob Coecke's categorical quantum mechanics.

<span class="mw-page-title-main">Distributional semantics</span> Field of linguistics

Distributional semantics is a research area that develops and studies theories and methods for quantifying and categorizing semantic similarities between linguistic items based on their distributional properties in large samples of language data. The basic idea of distributional semantics can be summed up in the so-called distributional hypothesis: linguistic items with similar distributions have similar meanings.

Categorical quantum mechanics is the study of quantum foundations and quantum information using paradigms from mathematics and computer science, notably monoidal category theory. The primitive objects of study are physical processes, and the different ways that these can be composed. It was pioneered in 2004 by Samson Abramsky and Bob Coecke. Categorical quantum mechanics is entry 18M40 in MSC2020.

<span class="mw-page-title-main">Bob Coecke</span>

Bob Coecke is a Belgian theoretical physicist and logician who was professor of Quantum foundations, Logics and Structures at Oxford University until 2020, when he became Chief Scientist of Cambridge Quantum Computing, and after the merger with Honeywell Quantum Systems, Chief Scientist of Quantinuum. In January 2023 he also became Distinguished Visiting Research Chair at the Perimeter Institute for Theoretical Physics. He pioneered categorical quantum mechanics, Quantum Picturalism, ZX-calculus, DisCoCat model for natural language, and quantum natural language processing (QNLP). He is a founder of the Quantum Physics and Logic community and conference series, and of the applied category theory community, conference series, and diamond-open-access journal Compositionality.

Pregroup grammar (PG) is a grammar formalism intimately related to categorial grammars. Much like categorial grammar (CG), PG is a kind of type logical grammar. Unlike CG, however, PG does not have a distinguished function type. Rather, PG uses inverse types combined with its monoidal operation.

Quantum contextuality is a feature of the phenomenology of quantum mechanics whereby measurements of quantum observables cannot simply be thought of as revealing pre-existing values. Any attempt to do so in a realistic hidden-variable theory leads to values that are dependent upon the choice of the other (compatible) observables which are simultaneously measured. More formally, the measurement result of a quantum observable is dependent upon which other commuting observables are within the same measurement set.

Quantum foundations is a discipline of science that seeks to understand the most counter-intuitive aspects of quantum theory, reformulate it and even propose new generalizations thereof. Contrary to other physical theories, such as general relativity, the defining axioms of quantum theory are quite ad hoc, with no obvious physical intuition. While they lead to the right experimental predictions, they do not come with a mental picture of the world where they fit.

The ZX-calculus is a rigorous graphical language for reasoning about linear maps between qubits, which are represented as string diagrams called ZX-diagrams. A ZX-diagram consists of a set of generators called spiders that represent specific tensors. These are connected together to form a tensor network similar to Penrose graphical notation. Due to the symmetries of the spiders and the properties of the underlying category, topologically deforming a ZX-diagram does not affect the linear map it represents. In addition to the equalities between ZX-diagrams that are generated by topological deformations, the calculus also has a set of graphical rewrite rules for transforming diagrams into one another. The ZX-calculus is universal in the sense that any linear map between qubits can be represented as a diagram, and different sets of graphical rewrite rules are complete for different families of linear maps. ZX-diagrams can be seen as a generalisation of quantum circuit notation.

Applied category theory is an academic discipline in which methods from category theory are used to study other fields including but not limited to computer science, physics, natural language processing, control theory, probability theory and causality. The application of category theory in these domains can take different forms. In some cases the formalization of the domain into the language of category theory is the goal, the idea here being that this would elucidate the important structure and properties of the domain. In other cases the formalization is used to leverage the power of abstraction in order to prove new results about the field.

In type theory, session types are used to ensure correctness in concurrent programs. They guarantee that messages sent and received between concurrent programs are in the expected order and of the expected type. Session type systems have been adapted for both channel and actor systems.

Mehrnoosh Sadrzadeh is an Iranian British academic who is a professor at University College London. She was awarded a senior research fellowship at the Royal Academy of Engineering in 2022.

Quantum natural language processing (QNLP) is the application of quantum computing to natural language processing (NLP). It computes word embeddings as parameterised quantum circuits that can solve NLP tasks faster than any classical computer. It is inspired by categorical quantum mechanics and the DisCoCat framework, making use of string diagrams to translate from grammatical structure to quantum processes.

In the mathematical field of category theory, FinVect is the category whose objects are all finite-dimensional vector spaces and whose morphisms are all linear maps between them.

References

  1. 1 2 Coecke, Bob; Sadrzadeh, Mehrnoosh; Clark, Stephen (2010-03-23). "Mathematical Foundations for a Compositional Distributional Model of Meaning". arXiv: 1003.4394 [cs.CL].
  2. Zeng, William; Coecke, Bob (2016-08-02). "Quantum Algorithms for Compositional Natural Language Processing". Electronic Proceedings in Theoretical Computer Science. 221: 67–75. arXiv: 1608.01406 . doi:10.4204/EPTCS.221.8. ISSN   2075-2180. S2CID   14897915.
  3. Coecke, Bob; de Felice, Giovanni; Meichanetzidis, Konstantinos; Toumi, Alexis (2020-12-07). "Foundations for Near-Term Quantum Natural Language Processing". arXiv: 2012.03755 [quant-ph].
  4. Preller, Anne (2014-12-27). "From Logical to Distributional Models". Electronic Proceedings in Theoretical Computer Science. 171: 113–131. arXiv: 1412.8527 . doi:10.4204/EPTCS.171.11. ISSN   2075-2180. S2CID   18631267.
  5. Preller, Anne; Lambek, Joachim (2007-01-18). "Free Compact 2-Categories". Mathematical Structures in Computer Science. 17 (doi: 10.1017/S0960129506005901): 309. doi:10.1017/S0960129506005901. S2CID   10763735.
  6. Selinger, Peter (2010). "A survey of graphical languages for monoidal categories". New Structures for Physics. Lecture Notes in Physics. Vol. 813. pp. 289–355. arXiv: 0908.3347 . doi:10.1007/978-3-642-12821-9_4. ISBN   978-3-642-12820-2. S2CID   8477212.
  7. de Felice, Giovanni; Meichanetzidis, Konstantinos; Toumi, Alexis (2020-09-15). "Functorial Question Answering". Electronic Proceedings in Theoretical Computer Science. 323: 84–94. arXiv: 1905.07408 . doi:10.4204/EPTCS.323.6. ISSN   2075-2180. S2CID   195874109.
  8. Buszkowski, Wojciech (2001). "Lambek grammars based on pregroups". In International Conference on Logical Aspects of Computational Linguistics.
  9. Yeung, Richie; Kartsaklis, Dimitri (2021). "A CCG-based version of the DisCoCat framework". arXiv: 2105.07720 [cs.CL].
  10. Sadrzadeh, Mehrnoosh; Kartsaklis, Dimitri; Balkır, Esma (2018). "Sentence entailment in compositional distributional semantics". Annals of Mathematics and Artificial Intelligence. 82 (4): 189–218. arXiv: 1512.04419 . doi: 10.1007/s10472-017-9570-x . S2CID   5038840.
  11. Kartsaklis, Dimitri (2016). "Coordination in Categorical Compositional Distributional Semantics". Electronic Proceedings in Theoretical Computer Science. 221: 29–38. arXiv: 1606.01515 . doi:10.4204/EPTCS.221.4. S2CID   10842035.
  12. Bankova, Dea; Coecke, Bob; Lewis, Martha; Marsden, Dan (2018). "Graded hyponymy for compositional distributional semantics". Journal of Language Modelling. 6 (2): 225–260.
  13. Meyer, Francois; Lewis, Martha (2020-10-12). "Modelling Lexical Ambiguity with Density Matrices". arXiv: 2010.05670 [cs.CL].
  14. Coecke, Bob; de Felice, Giovanni; Marsden, Dan; Toumi, Alexis (2018-11-08). "Towards Compositional Distributional Discourse Analysis". Electronic Proceedings in Theoretical Computer Science. 283: 1–12. arXiv: 1811.03277 . doi: 10.4204/EPTCS.283.1 . ISSN   2075-2180.
  15. Wijnholds, Gijs; Sadrzadeh, Mehrnoosh (2019). "A type-driven vector semantics for ellipsis with anaphora using lambek calculus with limited contraction". Journal of Logic, Language and Information. 28 (2): 331–358. arXiv: 1905.01647 . doi: 10.1007/s10849-019-09293-4 . S2CID   146120631.
  16. Bradley, Tai-Danae; Lewis, Martha; Master, Jade; Theilman, Brad (2018). "Translating and Evolving: Towards a Model of Language Change in DisCoCat". Electronic Proceedings in Theoretical Computer Science. 283: 50–61. arXiv: 1811.11041 . doi:10.4204/EPTCS.283.4. S2CID   53775637.
  17. Grefenstette, Edward; Sadrzadeh, Mehrnoosh (2011-06-20). "Experimental Support for a Categorical Compositional Distributional Model of Meaning". arXiv: 1106.4058 [cs.CL].
  18. Kartsaklis, Dimitri; Sadrzadeh, Mehrnoosh (2013). "Prior disambiguation of word tensors for constructing sentence vectors".{{cite journal}}: Cite journal requires |journal= (help)
  19. Grefenstette, Edward; Dinu, Georgiana; Zhang, Yao-Zhong; Sadrzadeh, Mehrnoosh; Baroni, Marco (2013-01-30). "Multi-Step Regression Learning for Compositional Distributional Semantics". arXiv: 1301.6939 [cs.CL].
  20. de Felice, Giovanni; Meichanetzidis, Konstantinos; Toumi, Alexis (2019). "Functorial Question Answering". Electronic Proceedings in Theoretical Computer Science. 323: 84–94. arXiv: 1905.07408 . doi:10.4204/EPTCS.323.6. S2CID   195874109.
  21. Tyrrell, Brian (2018-11-08). "Applying Distributional Compositional Categorical Models of Meaning to Language Translation". Electronic Proceedings in Theoretical Computer Science. 283: 28–49. arXiv: 1811.03274 . doi: 10.4204/EPTCS.283.3 . ISSN   2075-2180.
  22. Coecke, Bob; de Felice, Giovanni; Marsden, Dan; Toumi, Alexis (2018-11-08). "Towards Compositional Distributional Discourse Analysis". Electronic Proceedings in Theoretical Computer Science. 283: 1–12. arXiv: 1811.03277 . doi: 10.4204/EPTCS.283.1 . ISSN   2075-2180.