JAPE (linguistics)


In computational linguistics, JAPE is the Java Annotation Patterns Engine, a component of the open-source General Architecture for Text Engineering (GATE) platform. JAPE is a finite-state transducer that matches regular-expression patterns over annotations rather than over raw characters. It is therefore useful for pattern matching, semantic extraction, and many other operations over syntactic trees such as those produced by natural language parsers.

JAPE is a version of the Common Pattern Specification Language (CPSL).

A JAPE grammar consists of a set of phases, each of which consists of a set of pattern/action rules. The phases run sequentially and constitute a cascade of finite-state transducers over annotations. The left-hand side (LHS) of a rule consists of an annotation pattern description. The right-hand side (RHS) consists of annotation manipulation statements. Annotations matched on the LHS of a rule may be referred to on the RHS by means of labels that are attached to pattern elements.
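The cascade described above can be illustrated with a minimal Python sketch. This is not GATE's API and not JAPE syntax: the names `Annotation`, `Rule`, `Phase`, `make_span`, `run_phase` and `run_cascade` are hypothetical, the pattern language is reduced to a fixed sequence of annotation types, and rule priorities and matching styles are ignored. It shows only the core idea that each phase matches an LHS pattern over annotations, runs an RHS action on the match, and passes its output to the next phase.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set


@dataclass
class Annotation:
    type: str
    start: int
    end: int
    features: Dict[str, str] = field(default_factory=dict)


@dataclass
class Rule:
    name: str
    pattern: List[str]                                # LHS: annotation types, in order
    action: Callable[[List[Annotation]], Annotation]  # RHS: builds a new annotation


@dataclass
class Phase:
    inputs: Set[str]   # annotation types this phase can see
    rules: List[Rule]


def make_span(new_type: str, rule_name: str):
    """Return an RHS action that covers the matched span with a new annotation."""
    def action(matched: List[Annotation]) -> Annotation:
        return Annotation(new_type, matched[0].start, matched[-1].end,
                          {"rule": rule_name})
    return action


def run_phase(phase: Phase, annotations: List[Annotation]) -> List[Annotation]:
    """Try every rule at every position of the visible annotation sequence."""
    visible = sorted((a for a in annotations if a.type in phase.inputs),
                     key=lambda a: (a.start, a.end))
    created = []
    for i in range(len(visible)):
        for rule in phase.rules:
            window = visible[i:i + len(rule.pattern)]
            if [a.type for a in window] == rule.pattern:
                created.append(rule.action(window))
    return annotations + created


def run_cascade(phases: List[Phase], annotations: List[Annotation]) -> List[Annotation]:
    """Run phases in order; each phase sees what earlier phases produced."""
    for phase in phases:
        annotations = run_phase(phase, annotations)
    return annotations


# Hand-made annotations over the text "Dr Smith arrived" (illustrative only).
initial = [Annotation("Title", 0, 2), Annotation("UpperWord", 3, 8),
           Annotation("Verb", 9, 16)]
phases = [
    Phase({"Title", "UpperWord"},
          [Rule("PersonName", ["Title", "UpperWord"], make_span("Person", "PersonName"))]),
    Phase({"Person", "Verb"},
          [Rule("PersonActs", ["Person", "Verb"], make_span("Clause", "PersonActs"))]),
]
result = run_cascade(phases, initial)
```

The second phase can only fire because the first one has already added a `Person` annotation, which is what makes the phases a cascade rather than independent passes.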

Related Research Articles

In formal language theory, computer science and linguistics, the Chomsky hierarchy is a containment hierarchy of classes of formal grammars.


In logic, mathematics, computer science, and linguistics, a formal language consists of words whose letters are taken from an alphabet and are well-formed according to a specific set of rules.

In computer science, LR parsers are a type of bottom-up parser that analyse deterministic context-free languages in linear time. There are several variants of LR parsers: SLR parsers, LALR parsers, Canonical LR(1) parsers, Minimal LR(1) parsers, and GLR parsers. LR parsers can be generated by a parser generator from a formal grammar defining the syntax of the language to be parsed. They are widely used for the processing of computer languages.

In computer science, extended Backus–Naur form (EBNF) is a family of metasyntax notations, any of which can be used to express a context-free grammar. EBNF is used to make a formal description of a formal language such as a computer programming language. These notations are extensions of the basic Backus–Naur form (BNF) metasyntax notation.

In computer science, lexical analysis, lexing or tokenization is the process of converting a sequence of characters into a sequence of lexical tokens. A program that performs lexical analysis may be termed a lexer, tokenizer, or scanner, although scanner is also a term for the first stage of a lexer. A lexer is generally combined with a parser, which together analyze the syntax of programming languages, web pages, and so forth.
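As a minimal sketch of that conversion, the following Python function splits a small arithmetic expression into tokens using the standard `re` module. The token categories (`NUMBER`, `IDENT`, `OP`, `SKIP`) and the names `TOKEN_SPEC`, `MASTER` and `tokenize` are assumptions chosen for illustration, not part of any particular lexer.

```python
import re
from typing import Iterator, Tuple

# Illustrative token categories; order matters when patterns could overlap.
TOKEN_SPEC = [
    ("NUMBER", r"\d+(?:\.\d+)?"),
    ("IDENT", r"[A-Za-z_]\w*"),
    ("OP", r"[+\-*/=()]"),
    ("SKIP", r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))


def tokenize(text: str) -> Iterator[Tuple[str, str]]:
    """Yield (kind, lexeme) pairs, dropping whitespace."""
    pos = 0
    while pos < len(text):
        match = MASTER.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected character {text[pos]!r} at offset {pos}")
        if match.lastgroup != "SKIP":
            yield match.lastgroup, match.group()
        pos = match.end()
```

For example, `list(tokenize("x = 3.14"))` yields the pairs `("IDENT", "x")`, `("OP", "=")` and `("NUMBER", "3.14")`.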

A tree automaton is a type of state machine. Tree automata deal with tree structures, rather than the strings of more conventional state machines.

Thue is an esoteric programming language invented by John Colagioia in early 2000. It is a meta-language that can be used to define or recognize Type-0 languages from the Chomsky hierarchy. Because it is able to define languages of such complexity, it is also Turing-complete itself. Thue is based on a nondeterministic string rewriting system called semi-Thue grammar, which itself is named after the Norwegian mathematician Axel Thue. The author describes it as follows: "Thue represents one of the simplest possible ways to construe constraint-based programming. It is to the constraint-based paradigm what languages like OISC are to the imperative paradigm; in other words, it's a tar pit."

The cut-elimination theorem is the central result establishing the significance of the sequent calculus. It was originally proved by Gerhard Gentzen in his landmark 1934 paper "Investigations in Logical Deduction" for the systems LJ and LK formalising intuitionistic and classical logic respectively. The cut-elimination theorem states that any judgement that possesses a proof in the sequent calculus making use of the cut rule also possesses a cut-free proof, that is, a proof that does not make use of the cut rule.

A finite-state transducer (FST) is a finite-state machine with two memory tapes, following the terminology for Turing machines: an input tape and an output tape. This contrasts with an ordinary finite-state automaton, which has a single tape. An FST is a type of finite-state automaton (FSA) that maps between two sets of symbols. An FST is more general than an FSA. An FSA defines a formal language by defining a set of accepted strings, while an FST defines relations between sets of strings.
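The difference between accepting a string and relating it to an output can be seen in a small deterministic example. The Python sketch below is an illustration under stated assumptions: the transition-table layout and the names `transduce` and `INCREMENT` are hypothetical. The transducer reads a binary number least-significant bit first and writes the number plus one, rejecting inputs whose increment would need an extra digit.

```python
from typing import Dict, Optional, Set, Tuple

# (state, input symbol) -> (next state, output symbol)
Transitions = Dict[Tuple[str, str], Tuple[str, str]]

# Binary increment, least-significant bit first.
INCREMENT: Transitions = {
    ("carry", "0"): ("done", "1"),   # absorb the carry
    ("carry", "1"): ("carry", "0"),  # propagate the carry
    ("done", "0"): ("done", "0"),    # copy the remaining bits
    ("done", "1"): ("done", "1"),
}


def transduce(transitions: Transitions, start: str, finals: Set[str],
              tape: str) -> Optional[str]:
    """Return the output tape, or None if the input is not accepted."""
    state, output = start, []
    for symbol in tape:
        key = (state, symbol)
        if key not in transitions:
            return None
        state, emitted = transitions[key]
        output.append(emitted)
    return "".join(output) if state in finals else None
```

With `"carry"` as the start state and `{"done"}` as the final states, the input `"110"` (three, written least-significant bit first) is mapped to `"001"` (four), while `"11"` is rejected because the carry is never absorbed.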

In computer science, an operator precedence parser is a bottom-up parser that interprets an operator-precedence grammar. For example, most calculators use operator precedence parsers to convert from the human-readable infix notation relying on order of operations to a format that is optimized for evaluation such as Reverse Polish notation (RPN).
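The infix-to-RPN conversion mentioned above can be sketched with Dijkstra's shunting-yard algorithm, one well-known operator-precedence method. The Python version below is a minimal illustration: the function name `to_rpn`, the `PRECEDENCE` table, and the restriction to four left-associative binary operators over pre-split tokens are all assumptions made for brevity.

```python
from typing import Dict, List

# Higher number binds tighter; all four operators are treated as left-associative.
PRECEDENCE: Dict[str, int] = {"+": 1, "-": 1, "*": 2, "/": 2}


def to_rpn(tokens: List[str]) -> List[str]:
    """Convert infix tokens to Reverse Polish notation."""
    output: List[str] = []
    stack: List[str] = []
    for tok in tokens:
        if tok in PRECEDENCE:
            # Pop operators that bind at least as tightly as the incoming one.
            while (stack and stack[-1] in PRECEDENCE
                   and PRECEDENCE[stack[-1]] >= PRECEDENCE[tok]):
                output.append(stack.pop())
            stack.append(tok)
        elif tok == "(":
            stack.append(tok)
        elif tok == ")":
            while stack and stack[-1] != "(":
                output.append(stack.pop())
            if not stack:
                raise ValueError("unbalanced ')'")
            stack.pop()  # discard the matching "("
        else:
            output.append(tok)  # operand
    while stack:
        op = stack.pop()
        if op == "(":
            raise ValueError("unbalanced '('")
        output.append(op)
    return output
```

For example, `to_rpn("3 + 4 * 2".split())` returns `["3", "4", "2", "*", "+"]`, because `*` binds more tightly than `+`.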

A production system is a computer program typically used to provide some form of artificial intelligence. It consists primarily of a set of rules about behavior, together with the mechanism necessary to follow those rules as the system responds to states of the world. Those rules, termed productions, are a basic representation found useful in automated planning, expert systems and action selection.


Grammar induction is the process in machine learning of learning a formal grammar from a set of observations, thus constructing a model which accounts for the characteristics of the observed objects. More generally, grammatical inference is that branch of machine learning where the instance space consists of discrete combinatorial objects such as strings, trees and graphs.

Constraint grammar (CG) is a methodological paradigm for natural language processing (NLP). Linguist-written, context-dependent rules are compiled into a grammar that assigns grammatical tags ("readings") to words or other tokens in running text. Typical tags address lemmatisation, inflexion, derivation, syntactic function, dependency, valency, case roles, semantic type, etc. Each rule adds, removes, selects or replaces a tag or a set of grammatical tags in a given sentence context. Context conditions can be linked to any tag or tag set of any word anywhere in the sentence, either locally or globally. Context conditions in the same rule may be linked, i.e. conditioned upon each other, negated, or blocked by interfering words or tags. Typical CGs consist of thousands of rules that are applied set-wise in progressive steps, covering ever more advanced levels of analysis. Within each level, safe rules are used before heuristic rules, and no rule is allowed to remove the last reading of a given kind, thus providing a high degree of robustness.

Shape grammars in computation are a specific class of production systems that generate geometric shapes. Typically, shapes are 2- or 3-dimensional, thus shape grammars are a way to study 2- and 3-dimensional languages. Shape grammars were first introduced in a seminal article by George Stiny and James Gips in 1971. The mathematical and algorithmic foundations of shape grammars were developed in "Pictorial and Formal Aspects of Shapes and Shape Grammars" by George Stiny. Applications of shape grammars were first considered in "Shape Grammars and their Uses" by James Gips. These publications also contain two independent, though equivalent, constructions showing that shape grammars can simulate Turing machines.

In computer science, terminal and nonterminal symbols are the lexical elements used in specifying the production rules constituting a formal grammar. Terminal symbols are the elementary symbols of the language defined by a formal grammar. Nonterminal symbols are replaced by groups of terminal symbols according to the production rules.


General Architecture for Text Engineering or GATE is a Java suite of tools originally developed at the University of Sheffield beginning in 1995 and now used worldwide by a wide community of scientists, companies, teachers and students for many natural language processing tasks, including information extraction in many languages.

A production or production rule in computer science is a rewrite rule specifying a symbol substitution that can be recursively performed to generate new symbol sequences. A finite set of productions is the main component in the specification of a formal grammar. The other components are a finite set of nonterminal symbols, a finite set of terminal symbols that is disjoint from the set of nonterminal symbols, and a distinguished nonterminal symbol that is the start symbol.
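How repeated substitution generates a symbol sequence can be shown with a small Python sketch. The grammar here (S → a S b, S → empty), the dictionary layout of `PRODUCTIONS`, and the function name `derive` are assumptions chosen for illustration. At each step the function rewrites the leftmost nonterminal using the production selected by the caller.

```python
from typing import Dict, List

# Nonterminal -> list of alternative right-hand sides.
# Illustrative grammar: S -> a S b | (empty), which generates a^n b^n.
PRODUCTIONS: Dict[str, List[List[str]]] = {
    "S": [["a", "S", "b"], []],
}


def derive(productions: Dict[str, List[List[str]]], start: str,
           choices: List[int]) -> List[str]:
    """Apply one chosen production per step to the leftmost nonterminal."""
    form = [start]
    for choice in choices:
        index = next((i for i, sym in enumerate(form) if sym in productions), None)
        if index is None:
            raise ValueError("no nonterminal left to rewrite")
        form[index:index + 1] = productions[form[index]][choice]
    return form
```

Calling `derive(PRODUCTIONS, "S", [0, 0, 1])` performs the derivation S ⇒ aSb ⇒ aaSbb ⇒ aabb and returns `["a", "a", "b", "b"]`.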

In the fields of computational linguistics and applied linguistics, a morphological dictionary is a linguistic resource that contains correspondences between surface form and lexical forms of words. Surface forms of words are those found in natural language text. The corresponding lexical form of a surface form is the lemma followed by grammatical information. In English give, gives, giving, gave and given are surface forms of the verb give. The lexical form would be "give", verb. There are two kinds of morphological dictionaries: morpheme-aligned dictionaries and full-form (non-aligned) dictionaries.

In formal language theory, a grammar describes how to form strings from a language's alphabet that are valid according to the language's syntax. A grammar does not describe the meaning of the strings or what can be done with them in whatever context—only their form. A formal grammar is defined as a set of production rules for such strings in a formal language.

NooJ is a linguistic development environment software as well as a corpus processor constructed by Max Silberztein. NooJ allows linguists to construct the four classes of the Chomsky-Schützenberger hierarchy of generative grammars: Finite-State Grammars, Context-Free Grammars, Context-Sensitive Grammars as well as Unrestricted Grammars, using either a text editor, or a Graph editor.

References

Dhaval Thakker, Taha Osman, Phil Lakin, JAPE Grammar Tutorial, https://gate.ac.uk/sale/thakker-jape-tutorial/GATE%20JAPE%20manual.pdf