A syntactic predicate specifies the syntactic validity of applying a production in a formal grammar and is analogous to a semantic predicate that specifies the semantic validity of applying a production. It is a simple and effective means of dramatically improving the recognition strength of an LL parser by providing arbitrary lookahead. In their original implementation, syntactic predicates had the form “( α )?” and could only appear on the left edge of a production. The required syntactic condition α could be any valid context-free grammar fragment.
More formally, a syntactic predicate is a form of production intersection, used in parser specifications or in formal grammars. In this sense, the term predicate has the meaning of a mathematical indicator function. If p₁ and p₂ are production rules, the language generated by both p₁ and p₂ is their set intersection.
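Spelling this out (with L(p) denoting the language generated by a production p and 1_L the indicator function of a language L; this notation is ours, for exposition only):

\[ \{\, w : w \in L(p_1) \ \text{and}\ \mathbf{1}_{L(p_2)}(w) = 1 \,\} \;=\; L(p_1) \cap L(p_2) \]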
As typically defined or implemented, syntactic predicates implicitly order the productions so that predicated productions specified earlier have higher precedence than predicated productions specified later within the same decision. This conveys an ability to disambiguate ambiguous productions because the programmer can simply specify which production should match.
Parsing expression grammars (PEGs), invented by Bryan Ford, extend these simple predicates by allowing "not predicates" and permitting a predicate to appear anywhere within a production. Moreover, Ford invented packrat parsing to handle these grammars in linear time by employing memoization, at the cost of heap space.
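To illustrate the memoization idea, here is a minimal sketch (ours, not Ford's implementation) of a packrat-style recognizer for the PEG rule A ← 'a' A? 'b': each (rule, position) result is computed at most once, so re-examining the same input, as a predicate does, costs a table lookup rather than a reparse.

from functools import lru_cache

class PackratParser:
    """Minimal packrat-style recognizer for the PEG rule A <- 'a' A? 'b'."""
    def __init__(self, text):
        self.text = text

    @lru_cache(maxsize=None)             # memo table: one entry per (instance, position)
    def A(self, pos):
        """Return the end position if A matches at pos, else None."""
        if pos < len(self.text) and self.text[pos] == 'a':
            inner = self.A(pos + 1)      # optional recursive A; if it succeeds we commit
            mid = inner if inner is not None else pos + 1   # to it, as PEG semantics require
            if mid < len(self.text) and self.text[mid] == 'b':
                return mid + 1
        return None

p = PackratParser("aaabbb")
print(p.A(0))                            # 6: whole string matched, each position parsed once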
It is possible to support linear-time parsing of predicates as general as those allowed by PEGs, but reduce the memory cost associated with memoization by avoiding backtracking where some more efficient implementation of lookahead suffices. This approach is implemented by ANTLR version 3, which uses deterministic finite automata (DFAs) for lookahead; this may require testing a predicate in order to choose between transitions of the DFA (called "pred-LL(*)" parsing). [1]
The term syntactic predicate was coined by Parr & Quong and differentiates this form of predicate from semantic predicates (also discussed). [2]
Syntactic predicates have been called multi-step matching, parse constraints, and simply predicates in various literature. (See References section below.) This article uses the term syntactic predicate throughout for consistency and to distinguish them from semantic predicates.
Bar-Hillel et al. [3] show that the intersection of two regular languages is also a regular language, which is to say that the regular languages are closed under intersection.
The class of context-free languages is also closed under intersection with regular languages, but it has been known at least since Hartmanis [4] that the intersection of two context-free languages is not necessarily a context-free language (the context-free languages are thus not closed under intersection). This can be demonstrated easily using the canonical Type 1 language, L₃ = { aⁿbⁿcⁿ : n ≥ 1 }:

Let L₁ = { aᵐbⁿcⁿ : m, n ≥ 1 } (Type 2)
Let L₂ = { aⁿbⁿcᵐ : m, n ≥ 1 } (Type 2)
Let L₃ = L₁ ∩ L₂
Given the strings abcc, aabbc, and aaabbbccc, it is clear that the only string that belongs to both L₁ and L₂ (that is, the only one that produces a non-empty intersection) is aaabbbccc.
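A throwaway membership check (the helper names are ours) confirms this for the three strings:

import re

def in_L1(s):   # L1 = { a^m b^n c^n : m, n >= 1 }: equal numbers of b's and c's
    m = re.fullmatch(r'(a+)(b+)(c+)', s)
    return bool(m) and len(m.group(2)) == len(m.group(3))

def in_L2(s):   # L2 = { a^n b^n c^m : m, n >= 1 }: equal numbers of a's and b's
    m = re.fullmatch(r'(a+)(b+)(c+)', s)
    return bool(m) and len(m.group(1)) == len(m.group(2))

for s in ("abcc", "aabbc", "aaabbbccc"):
    print(s, in_L1(s) and in_L2(s))      # True only for aaabbbccc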
In most formalisms that use syntactic predicates, the syntax of the predicate is noncommutative, which is to say that the operation of predication is ordered. For instance, using the above example, consider the following pseudo-grammar, where X ::= Y PRED Z is understood to mean: "Y produces X if and only if Y also satisfies predicate Z":
S ::= a X
X ::= Y PRED Z
Y ::= a+ BNCN
Z ::= ANBN c+
BNCN ::= b [BNCN] c
ANBN ::= a [ANBN] b
Given the string aaaabbbccc, in the case where Y must be satisfied first (and assuming a greedy implementation), S will generate aX and X in turn will generate aaabbbccc, thereby generating aaaabbbccc. In the case where Z must be satisfied first, ANBN will fail to generate aaaabbb, and thus aaaabbbccc is not generated by the grammar. Moreover, if either Y or Z (or both) specify any action to be taken upon reduction (as would be the case in many parsers), the order in which these productions match determines the order in which those side effects occur. Formalisms that vary over time (such as adaptive grammars) may rely on these side effects.
Parr & Quong [5] give this example of a syntactic predicate:
stat : (declaration)? declaration
     | expression
     ;
which is intended to satisfy the following informally stated [6] constraints of C++:

1. If it looks like a declaration, it is;
2. otherwise, if it looks like an expression, it is;
3. otherwise, it is a syntax error.
In the first production of rule stat, the syntactic predicate (declaration)? indicates that declaration is the syntactic context that must be present for the rest of that production to succeed. We can interpret the use of (declaration)? as "I am not sure if declaration will match; let me try it out and, if it does not match, I shall try the next alternative." Thus, when encountering a valid declaration, the rule declaration will be recognized twice: once as a syntactic predicate and once during the actual parse to execute semantic actions.
Of note in the above example is that any code triggered by the acceptance of the declaration production occurs only if the predicate is satisfied.
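A recursive-descent rendering of this two-pass behavior might look as follows; this is a sketch under our own assumptions (the token shapes, the Parser class, and the speculate helper are invented for illustration, and this is not ANTLR's generated code). The predicate is an attempted parse that restores the input position and suppresses semantic actions; only the committed second parse executes them:

class Parser:
    def __init__(self, tokens):
        self.tokens, self.pos, self.speculating = tokens, 0, 0

    def speculate(self, rule):
        """Syntactic predicate: try `rule`, then restore the input position."""
        mark = self.pos
        self.speculating += 1
        try:
            rule()
            return True
        except SyntaxError:
            return False
        finally:
            self.speculating -= 1
            self.pos = mark                  # the predicate consumes no input

    def stat(self):
        # stat : (declaration)? declaration | expression ;
        if self.speculate(self.declaration): # first pass: recognition only
            self.declaration()               # second pass: actions execute
        else:
            self.expression()

    def declaration(self):
        # toy stand-in: a declaration is TYPE NAME ';'
        self.expect("TYPE"); self.expect("NAME"); self.expect(";")
        if not self.speculating:             # semantic action gated on committed parse
            print("declaration recognized")

    def expression(self):
        self.expect("NAME"); self.expect(";")
        if not self.speculating:
            print("expression recognized")

    def expect(self, kind):
        if self.pos >= len(self.tokens) or self.tokens[self.pos] != kind:
            raise SyntaxError(kind)
        self.pos += 1

Parser(["TYPE", "NAME", ";"]).stat()   # prints "declaration recognized" exactly once
Parser(["NAME", ";"]).stat()           # prints "expression recognized"

Running this on a declaration-shaped token stream prints the action exactly once, even though declaration was parsed twice.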
The language { aⁿbⁿcⁿ : n ≥ 1 } can be represented in various grammars and formalisms as follows:
Using a parsing expression grammar:

S ← &(A !b) a+ B !c
A ← a A? b
B ← b B? c
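For concreteness, the and-predicate (&) and not-predicate (!) can be rendered as zero-width tests in a handful of recognizer combinators. The sketch below (our own, with invented helper names) accepts exactly the strings aⁿbⁿcⁿ:

def lit(ch):                  # match a single literal character
    def parse(s, i):
        return i + 1 if i < len(s) and s[i] == ch else None
    return parse

def seq(*ps):                 # match each parser in sequence
    def parse(s, i):
        for p in ps:
            i = p(s, i)
            if i is None:
                return None
        return i
    return parse

def opt(p):                   # p? : match p if possible, else succeed in place
    def parse(s, i):
        j = p(s, i)
        return i if j is None else j
    return parse

def plus(p):                  # p+ : one or more repetitions, greedy
    def parse(s, i):
        j = p(s, i)
        if j is None:
            return None
        while True:
            k = p(s, j)
            if k is None:
                return j
            j = k
    return parse

def and_(p):                  # &p : succeeds iff p matches, consumes nothing
    return lambda s, i: i if p(s, i) is not None else None

def not_(p):                  # !p : succeeds iff p fails, consumes nothing
    return lambda s, i: i if p(s, i) is None else None

# A <- 'a' A? 'b'    B <- 'b' B? 'c'
def A(s, i): return seq(lit('a'), opt(A), lit('b'))(s, i)
def B(s, i): return seq(lit('b'), opt(B), lit('c'))(s, i)

# S <- &(A !'b') 'a'+ B !'c'
S = seq(and_(seq(A, not_(lit('b')))), plus(lit('a')), B, not_(lit('c')))

for w in ("aabbcc", "aabbc", "aabbbcc"):
    print(w, S(w, 0) == len(w))          # True only for aabbcc (a^n b^n c^n)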
Using a bound predicate (in the §-calculus):
S → {A} B
A → X 'c+'
X → 'a' [X] 'b'
B → 'a+' Y
Y → 'b' [Y] 'c'
Using two free predicates:
A → <'a+'>a <'b+'>b Ψ(ab)X <'c+'>c Ψ(bc)Y
X → 'a' [X] 'b'
Y → 'b' [Y] 'c'
Using a conjunctive grammar (note: the following example actually generates { aⁿbⁿcⁿ : n ≥ 0 }, but is included here because it is the example given by the inventor of conjunctive grammars [7]):
S → AB&DC
A → aA | ε
B → bBc | ε
C → cC | ε
D → aDb | ε
Using Raku rules:

rule S { <before <A> <!before b>> a+ <B> <!before c> }
rule A { a <A>? b }
rule B { b <B>? c }
Although by no means an exhaustive list, the following parsers and grammar formalisms employ syntactic predicates:
ANTLR
Parsing expression grammars
§-calculus
Raku rules, which use "<before ...>" or "<!before ...>" (that is: "not before"). Perl 5 also has such lookahead, but it can only encapsulate Perl 5's more limited regexp features.