Tree kernel

Last updated February 12, 2024

In machine learning, tree kernels are the application of the more general concept of positive-definite kernel to tree structures. They find applications in natural language processing, where they can be used for machine-learned parsing or classification of sentences.

Motivation

In natural language processing, it is often necessary to compare tree structures (e.g. parse trees) for similarity. Such comparisons can be performed by computing dot products of vectors of features of the trees, but these vectors tend to be very large: NLP techniques have come to a point where a simple dependency relation over two words is encoded with a vector of several millions of features.^[1] It can be impractical to represent complex structures such as trees with features vectors. Well-designed kernels allow computing similarity over trees without explicitly computing the feature vectors of these trees. Moreover, kernel methods have been widely used in machine learning tasks (e.g. SVM), and thus plenty of algorithms are working natively with kernels, or have an extension that handles kernelization.

An example application is classification of sentences, such as different types of questions.^[2]

Examples

Constituency parse tree for the sentence : "A cat eats a mouse." Acat.svg — Constituency parse tree for the sentence : "A cat eats a mouse."

Same as above, for the sentence : "A mouse eats a cat." Amouse.svg — Same as above, for the sentence : "A mouse eats a cat."

Here are presented two examples of tree kernel applied to the constituency trees of the sentences "A cat eats a mouse." and "A mouse eats a cat.". In this example "A" and "a" are the same words, and in most of the NLP applications they would be represented with the same token.

The interest of these two kernels is that they show very different granularity (the subset tree kernel being far more fine-grained than the subtree kernel), for the same computation complexity. Both can be computed recursively in time O(|T₁|.|T₂|).^[3]

Subtree kernel

In the case of constituency tree, a subtree is defined as a node and all its children (e.g., [NP [D [A]] [N [mouse]]] is a subtree of the two trees). Terminals are not considered subtree (e.g. [a] is not a subtree). The subtree kernel count the number of common subtrees between two given trees.

In this example, there are seven common subtrees:

[NP [D [a]] [N [cat]]],

[NP [D [a]] [N [mouse]]],

[N [mouse]],

[N [cat]],

[V [eats]],

[D [a]] (counted twice as it appears twice).

Subset tree kernel

A subset tree is a more general structure than a subtree. The basic definition is the same, but in the case of subset trees, leaves need not be terminals (e.g., [VP [V] [NP]] is a subset tree of both trees), but here two single nodes are not considered as trees. Because of this more general definition, there are more subset trees than subtrees, and more common subset trees than common subtrees.

In this example, there are 54 common subset trees. The seven common subtrees plus among others:

[NP [D] [N]] (counted twice),

[VP [V [eats]] [NP]]...

Notes

↑ McDonald, Ryan; Pereira, Fernando; Ribarov, Kiril; Hajič, Jan (2005). Non-projective Dependency Parsing using Spanning Tree Algorithms. HLT–EMNLP.
↑ Zhang, Dell; Lee, Wee Sun (2003). Question classification using support vector machines. SIGIR.
↑ Collins, Michael; Duffy, Nigel (2001). Convolution kernels for natural language. Advances in Neural Information Processing Systems.

Related Research Articles

A computer program is a sequence or set of instructions in a programming language for a computer to execute. It is one component of software, which also includes documentation and other intangible components.

Natural language processing (NLP) is an interdisciplinary subfield of computer science and linguistics. It is primarily concerned with giving computers the ability to support and manipulate human language. It involves processing natural language datasets, such as text corpora or speech corpora, using either rule-based or probabilistic machine learning approaches. The goal is a computer capable of "understanding" the contents of documents, including the contextual nuances of the language within them. The technology can then accurately extract information and insights contained in the documents as well as categorize and organize the documents themselves.

In computer science, the Cocke–Younger–Kasami algorithm is a parsing algorithm for context-free grammars published by Itiroo Sakai in 1961. The algorithm is named after some of its rediscoverers: John Cocke, Daniel Younger, Tadao Kasami, and Jacob T. Schwartz. It employs bottom-up parsing and dynamic programming.

In computer programming, an S-expression is an expression in a like-named notation for nested list (tree-structured) data. S-expressions were invented for and popularized by the programming language Lisp, which uses them for source code as well as data.

In machine learning, support vector machines are supervised max-margin models with associated learning algorithms that analyze data for classification and regression analysis. Developed at AT&T Bell Laboratories by Vladimir Vapnik with colleagues SVMs are one of the most studied models, being based on statistical learning frameworks or VC theory proposed by Vapnik and Chervonenkis (1974).

<span class="mw-page-title-main">Parse tree</span> Tree in formal language theory

A parse tree or parsing tree or derivation tree or concrete syntax tree is an ordered, rooted tree that represents the syntactic structure of a string according to some context-free grammar. The term parse tree itself is used primarily in computational linguistics; in theoretical syntax, the term syntax tree is more common.

In computer science, declarative programming is a programming paradigm—a style of building the structure and elements of computer programs—that expresses the logic of a computation without describing its control flow.

Parsing, syntax analysis, or syntactic analysis is the process of analyzing a string of symbols, either in natural language, computer languages or data structures, conforming to the rules of a formal grammar. The term parsing comes from Latin pars (orationis), meaning part.

Link grammar (LG) is a theory of syntax by Davy Temperley and Daniel Sleator which builds relations between pairs of words, rather than constructing constituents in a phrase structure hierarchy. Link grammar is similar to dependency grammar, but dependency grammar includes a head-dependent relationship, whereas link grammar makes the head-dependent relationship optional. Colored Multiplanar Link Grammar (CMLG) is an extension of LG allowing crossing relations between pairs of words. The relationship between words is indicated with link types, thus making the Link grammar closely related to certain categorial grammars.

Shallow parsing is an analysis of a sentence which first identifies constituent parts of sentences and then links them to higher order units that have discrete grammatical meanings. While the most elementary chunking algorithms simply link constituent parts on the basis of elementary search patterns, approaches that use machine learning techniques can take contextual information into account and thus compose chunks in such a way that they better reflect the semantic relations between the basic constituents. That is, these more advanced methods get around the problem that combinations of elementary constituents can have different higher level meanings depending on the context of the sentence.

A definite clause grammar (DCG) is a way of expressing grammar, either for natural or formal languages, in a logic programming language such as Prolog. It is closely related to the concept of attribute grammars / affix grammars. DCGs are usually associated with Prolog, but similar languages such as Mercury also include DCGs. They are called definite clause grammars because they represent a grammar as a set of definite clauses in first-order logic.

In linguistics, a treebank is a parsed text corpus that annotates syntactic or semantic sentence structure. The construction of parsed corpora in the early 1990s revolutionized computational linguistics, which benefitted from large-scale empirical data.

Document clustering is the application of cluster analysis to textual documents. It has applications in automatic document organization, topic extraction and fast information retrieval or filtering.

Natural-language programming (NLP) is an ontology-assisted way of programming in terms of natural-language sentences, e.g. English. A structured document with Content, sections and subsections for explanations of sentences forms a NLP document, which is actually a computer program. Natural language programming is not to be mixed up with natural language interfacing or voice control where a program is first written and then communicated with through natural language using an interface added on. In NLP the functionality of a program is organised only for the definition of the meaning of sentences. For instance, NLP can be used to represent all the knowledge of an autonomous robot. Having done so, its tasks can be scripted by its users so that the robot can execute them autonomously while keeping to prescribed rules of behaviour as determined by the robot's user. Such robots are called transparent robots as their reasoning is transparent to users and this develops trust in robots. Natural language use and natural-language user interfaces include Inform 7, a natural programming language for making interactive fiction, Shakespeare, an esoteric natural programming language in the style of the plays of William Shakespeare, and Wolfram Alpha, a computational knowledge engine, using natural-language input. Some methods for program synthesis are based on natural-language programming.

The structured support-vector machine is a machine learning algorithm that generalizes the Support-Vector Machine (SVM) classifier. Whereas the SVM classifier supports binary classification, multiclass classification and regression, the structured SVM allows training of a classifier for general structured output labels.

Structured prediction or structured (output) learning is an umbrella term for supervised machine learning techniques that involves predicting structured objects, rather than scalar discrete or real values.

In machine learning, the polynomial kernel is a kernel function commonly used with support vector machines (SVMs) and other kernelized models, that represents the similarity of vectors in a feature space over polynomials of the original variables, allowing learning of non-linear models.

The following outline is provided as an overview of and topical guide to natural-language processing:

In natural language processing (NLP), a word embedding is a representation of a word. The embedding is used in text analysis. Typically, the representation is a real-valued vector that encodes the meaning of the word in such a way that the words that are closer in the vector space are expected to be similar in meaning. Word embeddings can be obtained using language modeling and feature learning techniques, where words or phrases from the vocabulary are mapped to vectors of real numbers.

Native-language identification (NLI) is the task of determining an author's native language (L1) based only on their writings in a second language (L2). NLI works through identifying language-usage patterns that are common to specific L1 groups and then applying this knowledge to predict the native language of previously unseen texts. This is motivated in part by applications in second-language acquisition, language teaching and forensic linguistics, amongst others.

References

Jun Sun, Min Zhang and Chew Lim Tan. Tree Sequence Kernel for Natural Language
Alessandro Moschitti. Making Tree Kernels practical for Natural Language Learning

External links

http://disi.unitn.it/moschitti/Tree-Kernel.htm -- Application of tree kernel to SVM, on Alessandro Moschitti web-page.

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.

[1] McDonald, Ryan; Pereira, Fernando; Ribarov, Kiril; Hajič, Jan (2005). Non-projective Dependency Parsing using Spanning Tree Algorithms. HLT–EMNLP.

[zhang-2] Zhang, Dell; Lee, Wee Sun (2003). Question classification using support vector machines. SIGIR.

[collins-3] Collins, Michael; Duffy, Nigel (2001). Convolution kernels for natural language. Advances in Neural Information Processing Systems.

[1]

[2]

[3]