In computer science, a Van Wijngaarden grammar (also vW-grammar or W-grammar)[1] is a formalism for defining formal languages. The name derives from the formalism invented by Adriaan van Wijngaarden [2] for the purpose of defining the ALGOL 68 programming language. The resulting specification [3] remains its most notable application.
Van Wijngaarden grammars address the problem that context-free grammars cannot express agreement or reference, where two different parts of the sentence must agree with each other in some way. For example, the sentence "The birds was eating" is not Standard English because it fails to agree on number. A context-free grammar would parse "The birds was eating" and "The birds were eating" and "The bird was eating" in the same way. However, context-free grammars have the benefit of simplicity whereas van Wijngaarden grammars are considered highly complex. [4]
W-grammars are two-level grammars: they are defined by a pair of grammars that operate on different levels: a metagrammar, a context-free grammar whose rules define the sets of strings that the metanotions can stand for, and a grammar schema of hyperrules, in which those metanotions appear as parameters.
The set of strings generated by a W-grammar is defined by a two-stage process: first, every metanotion in the hyperrules is consistently replaced by one of its values, yielding a (possibly infinite) set of ordinary context-free production rules; second, those rules generate strings in the usual context-free way.
The consistent substitution used in the first step is the same as substitution in predicate logic, and actually supports logic programming; it corresponds to unification in Prolog, as noted by Alain Colmerauer.
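Consistent substitution can be illustrated with a small sketch (in Python, not part of any W-grammar tool; the rule and metanotion names are invented for illustration). Every occurrence of a metanotion within one hyperrule must be replaced by the same value, just as a logic variable in Prolog is bound to a single term:

```python
from itertools import product

# Metarules: each metanotion derives a finite set of strings here.
metarules = {"NUMBER": ["singular", "plural"]}

# A hyperrule schema: "sentence NUMBER" -> "subject NUMBER", "verb NUMBER".
hyperrule = ("sentence NUMBER", ["subject NUMBER", "verb NUMBER"])

def instantiate(rule, metarules):
    """Generate all strict rules by substituting metanotions consistently:
    one binding is chosen per rule instance and applied everywhere."""
    head, body = rule
    names = sorted(metarules)
    for values in product(*(metarules[n] for n in names)):
        binding = dict(zip(names, values))
        sub = lambda s: " ".join(binding.get(w, w) for w in s.split())
        yield (sub(head), [sub(b) for b in body])

rules = list(instantiate(hyperrule, metarules))
# Both body nonterminals receive the SAME value of NUMBER:
#   ('sentence singular', ['subject singular', 'verb singular'])
#   ('sentence plural',   ['subject plural',  'verb plural'])
```

Note that a mixed instance such as ('sentence singular', ['subject plural', ...]) is never produced, which is exactly the agreement effect described above.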
W-grammars are Turing complete; [5] hence, all decision problems regarding the languages they generate, such as whether a given string is in the language (membership) or whether the language is empty, are undecidable.
Curtailed variants, known as affix grammars, were developed and applied in compiler construction and to the description of natural languages.
Definite logic programs, that is, logic programs that make no use of negation, can be viewed as a subclass of W-grammars. [6]
In the 1950s, attempts started to apply computers to the recognition, interpretation and translation of natural languages, such as English and Russian. This requires a machine-readable description of the phrase structure of sentences that can be used to parse, interpret, and generate them. Context-free grammars, a concept from structural linguistics, were adopted for this purpose; their rules can express how sentences are recursively built out of parts of speech, such as noun phrases and verb phrases, and ultimately words, such as nouns, verbs, and pronouns.
This work influenced the design and implementation of programming languages, most notably, of ALGOL 60, which introduced a syntax description in Backus–Naur form.
However, context-free rules cannot express agreement or reference (anaphora), where two different parts of the sentence must agree with each other in some way.
These can be readily expressed in W-grammars. (See example below.)
Programming languages have the analogous notions of typing and scoping. A compiler or interpreter for the language must recognize which uses of a variable belong together (refer to the same variable). This is typically subject to constraints such as requiring that every variable is declared before use, that no variable is declared twice in the same scope, and that every use of a variable matches its declared type.
W-grammars are based on the idea of providing the nonterminal symbols of context-free grammars with attributes (or affixes) that pass information between the nodes of the parse tree, used to constrain the syntax and to specify the semantics.
This idea was well known at the time; e.g. Donald Knuth visited the ALGOL 68 design committee while developing his own version of it, attribute grammars. [7]
By augmenting the syntax description with attributes, constraints like the above can be checked, ruling many invalid programs out at compile time. As Van Wijngaarden wrote in his preface: [2]
My main objections were certain to me unnecessary restrictions and the definition of the syntax and semantics. Actually the syntax viewed in MR 75 produces a large number of programs, whereas I should prefer to have the subset of meaningful programs as large as possible, which requires a stricter syntax. [...] it soon became clear that some better tools than the Backus notation might be advantageous [...]. I developed a scheme [...] which enables the design of a language to carry much more information in the syntax than is normally carried.
Quite peculiar to W-grammars was their strict treatment of attributes as strings, defined by a context-free grammar, on which concatenation is the only possible operation; complex data structures and operations can be defined by pattern matching. (See example below.)
After their introduction in the 1968 ALGOL 68 "Final Report", W-grammars were widely considered as too powerful and unconstrained to be practical.
This was partly a consequence of the way in which they had been applied; the 1973 ALGOL 68 "Revised Report" contains a much more readable grammar, without modifying the W-grammar formalism itself.
Meanwhile, it became clear that W-grammars, when used in their full generality, are indeed too powerful for such practical purposes as serving as the input for a parser generator. They describe precisely all recursively enumerable languages, [8] which makes parsing impossible in general: it is an undecidable problem to decide whether a given string can be generated by a given W-grammar.
Hence, their use must be seriously constrained when used for automatic parsing or translation. Restricted and modified variants of W-grammars were developed to address this, e.g. affix grammars and extended affix grammars (EAGs).
After the 1970s, interest in the approach waned; occasionally, new studies are published. [9]
In English, nouns, pronouns and verbs have attributes such as grammatical number, gender, and person, which must agree between subject, main verb, and pronouns referring to the subject:
For example, "I wash myself" and "They wash themselves" are valid sentences; invalid are, for instance, "I wash himself" and "They washes themselves".
Here, agreement serves to stress that both pronouns (e.g. I and myself) refer to the same person.
A context-free grammar to generate all such sentences:
<sentence> ::= <subject> <verb> <object>
<subject> ::= I | You | He | She | We | They
<verb> ::= wash | washes
<object> ::= myself | yourself | himself | herself | ourselves | yourselves | themselves
From <sentence>, we can generate all combinations:
I wash myself
I wash yourself
I wash himself
[...]
They wash yourselves
They wash themselves
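The overgeneration is easy to check mechanically. The following Python sketch (illustrative, not part of the article) enumerates every combination the context-free grammar admits, 6 × 2 × 7 = 84 sentences, most of which fail agreement:

```python
# The three context-free alternatives, transcribed as Python lists.
subjects = ["I", "You", "He", "She", "We", "They"]
verbs = ["wash", "washes"]
objects = ["myself", "yourself", "himself", "herself",
           "ourselves", "yourselves", "themselves"]

# <sentence> ::= <subject> <verb> <object> generates the full cross product.
sentences = [f"{s} {v} {o}" for s in subjects for v in verbs for o in objects]
```

Among the 84 sentences generated, "I wash yourself" appears alongside "I wash myself"; the grammar cannot tell them apart.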
A W-grammar to generate only the valid sentences:
<sentence <NUMBER> <GENDER> <PERSON>> ::=
    <subject <NUMBER> <GENDER> <PERSON>> <verb <NUMBER> <PERSON>> <object <NUMBER> <GENDER> <PERSON>>
<subject singular <GENDER> 1st> ::= I
<subject <NUMBER> <GENDER> 2nd> ::= You
<subject singular male 3rd> ::= He
<subject singular female 3rd> ::= She
<subject plural <GENDER> 1st> ::= We
<subject plural <GENDER> 3rd> ::= They
<verb singular 1st> ::= wash
<verb singular 2nd> ::= wash
<verb singular 3rd> ::= washes
<verb plural <PERSON>> ::= wash
<object singular <GENDER> 1st> ::= myself
<object singular <GENDER> 2nd> ::= yourself
<object singular male 3rd> ::= himself
<object singular female 3rd> ::= herself
<object plural <GENDER> 1st> ::= ourselves
<object plural <GENDER> 2nd> ::= yourselves
<object plural <GENDER> 3rd> ::= themselves
<NUMBER> ::== singular | plural
<GENDER> ::== male | female
<PERSON> ::== 1st | 2nd | 3rd
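The effect of these hyperrules can be simulated by instantiating the metanotions consistently and keeping only the rule instances with matching alternatives. A Python sketch (the dictionaries transcribe the hyperrules above; the encoding itself is illustrative):

```python
NUMBER = ["singular", "plural"]
GENDER = ["male", "female"]
PERSON = ["1st", "2nd", "3rd"]

# Each dictionary maps an attribute tuple to the word the hyperrule produces.
subject = {("singular", g, "1st"): "I" for g in GENDER}
subject.update({(n, g, "2nd"): "You" for n in NUMBER for g in GENDER})
subject[("singular", "male", "3rd")] = "He"
subject[("singular", "female", "3rd")] = "She"
subject.update({("plural", g, "1st"): "We" for g in GENDER})
subject.update({("plural", g, "3rd"): "They" for g in GENDER})

verb = {("singular", "1st"): "wash", ("singular", "2nd"): "wash",
        ("singular", "3rd"): "washes"}
verb.update({("plural", p): "wash" for p in PERSON})

obj = {("singular", g, "1st"): "myself" for g in GENDER}
obj.update({("singular", g, "2nd"): "yourself" for g in GENDER})
obj[("singular", "male", "3rd")] = "himself"
obj[("singular", "female", "3rd")] = "herself"
obj.update({("plural", g, "1st"): "ourselves" for g in GENDER})
obj.update({("plural", g, "2nd"): "yourselves" for g in GENDER})
obj.update({("plural", g, "3rd"): "themselves" for g in GENDER})

# Consistent substitution: the same (number, gender, person) triple is
# used for subject, verb, and object within one sentence.
valid = sorted({f"{subject[n, g, p]} {verb[n, p]} {obj[n, g, p]}"
                for n in NUMBER for g in GENDER for p in PERSON
                if (n, g, p) in subject and (n, g, p) in obj})
```

After duplicates collapse (gender is irrelevant outside the 3rd person singular), exactly the seven agreeing sentences remain.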
A well-known non-context-free language is the set of strings a^n b^n c^n (n ≥ 1). A two-level grammar for this language is the metagrammar

N ::== 1 | N1
X ::== a | b | c

together with grammar schema

start ::= <a N> <b N> <c N>
<X N1> ::= <X N> X
<X 1> ::= X

Here X by itself denotes a terminal symbol, whereas <X N> is a hypernotion; consistent substitution of the same value for N throughout a rule forces the three runs of a's, b's and c's to have equal length.
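A short Python sketch (assuming the language in question is a^n b^n c^n) shows the mechanism: the metanotion N stands for a unary numeral, and substituting the same numeral in all three hypernotions makes the three runs equally long:

```python
def expand(x, numeral):
    """Expand the hypernotion <X N>:
    <X 1> ::= X  ;  <X N1> ::= <X N> X."""
    if numeral == "1":
        return x
    return expand(x, numeral[:-1]) + x

def generate(n):
    """Expand start ::= <a N> <b N> <c N> with N = '1' * n."""
    numeral = "1" * n  # unary numeral produced by N ::== 1 | N1
    return expand("a", numeral) + expand("b", numeral) + expand("c", numeral)

words = [generate(n) for n in range(1, 4)]
# words == ["abc", "aabbcc", "aaabbbccc"]
```

A string such as "aabbc" can never arise, because no single value of N yields unequal run lengths.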
The Revised Report on the Algorithmic Language Algol 60 [10] defines a full context-free syntax for the language.
Assignments are defined as follows (section 4.2.1):
<left part> ::= <variable> := | <procedure identifier> :=
<left part list> ::= <left part> | <left part list> <left part>
<assignment statement> ::= <left part list> <arithmetic expression> | <left part list> <Boolean expression>
A <variable> can be (amongst other things) an <identifier>, which in turn is defined as:
<identifier> ::= <letter> | <identifier> <letter> | <identifier> <digit>
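The <identifier> rule is regular: an identifier is a letter followed by any mixture of letters and digits. As a quick sketch (restricted to lower-case ASCII letters for brevity), it corresponds to the regular expression letter (letter | digit)*:

```python
import re

# <identifier> ::= <letter> | <identifier> <letter> | <identifier> <digit>
# i.e. a letter followed by zero or more letters or digits.
identifier = re.compile(r"^[a-z][a-z0-9]*$")
```

So "n1" is an identifier, while "1n" is not, since the first symbol must be a letter.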
Examples (section 4.2.2):
s:=p[0]:=n:=n+1+s
n:=n+1
A:=B/C-v-q×S
S[v,k+2]:=3-arctan(s×zeta)
V:=Q>Y^Z
Expressions and assignments must be type checked: for instance, a Boolean expression may not be assigned to a variable declared as arithmetic, and vice versa.
The rules above distinguish between <arithmetic expression> and <Boolean expression>, but they cannot verify that the same variable always has the same type.
This (non-context-free) requirement can be expressed in a W-grammar by annotating the rules with attributes that record, for each variable used or assigned to, its name and type.
This record can then be carried along to all places in the grammar where types need to be matched, implementing type checking. Similarly, it can be used to check that variables are initialized before use, and so on.
One may wonder how to create and manipulate such a data structure without explicit support in the formalism for data structures and operations on them. It can be done by using the metagrammar to define a string representation for the data structure and using pattern matching to define operations:
<left part with <TYPED> <NAME>> ::=
    <variable with <TYPED> <NAME>> := |
    <procedure identifier with <TYPED> <NAME>> :=

<left part list <TYPEMAP1>> ::=
    <left part with <TYPED> <NAME>>
        <where <TYPEMAP1> is <TYPED> <NAME> added to sorted <EMPTY>> |
    <left part list <TYPEMAP2>> <left part with <TYPED> <NAME>>
        <where <TYPEMAP1> is <TYPED> <NAME> added to sorted <TYPEMAP2>>

<assignment statement <ASSIGNED TO> <USED>> ::=
    <left part list <ASSIGNED TO>> <arithmetic expression <USED>> |
    <left part list <ASSIGNED TO>> <Boolean expression <USED>>

<where <TYPED> <NAME> is <TYPED> <NAME> added to sorted <EMPTY>> ::=

<where <TYPEMAP1> is <TYPED1> <NAME1> added to sorted <TYPEMAP2>> ::=
    <where <TYPEMAP2> is <TYPED2> <NAME2> added to sorted <TYPEMAP3>>
    <where <NAME1> is lexicographically before <NAME2>>

<where <TYPEMAP1> is <TYPED1> <NAME1> added to sorted <TYPEMAP2>> ::=
    <where <TYPEMAP2> is <TYPED2> <NAME2> added to sorted <TYPEMAP3>>
    <where <NAME2> is lexicographically before <NAME1>>
    <where <TYPEMAP3> is <TYPED1> <NAME1> added to sorted <TYPEMAP4>>

<where <EMPTY> is lexicographically before <NAME1>> ::=
    <where <NAME1> is <LETTER OR DIGIT> followed by <NAME2>>

<where <NAME1> is lexicographically before <NAME2>> ::=
    <where <NAME1> is <LETTER OR DIGIT> followed by <NAME3>>
    <where <NAME2> is <LETTER OR DIGIT> followed by <NAME4>>
    <where <NAME3> is lexicographically before <NAME4>>

<where <NAME1> is lexicographically before <NAME2>> ::=
    <where <NAME1> is <LETTER OR DIGIT 1> followed by <NAME3>>
    <where <NAME2> is <LETTER OR DIGIT 2> followed by <NAME4>>
    <where <LETTER OR DIGIT 1> precedes+ <LETTER OR DIGIT 2>>

<where <LETTER OR DIGIT 1> precedes+ <LETTER OR DIGIT 2>> ::=
    <where <LETTER OR DIGIT 1> precedes <LETTER OR DIGIT 2>>

<where <LETTER OR DIGIT 1> precedes+ <LETTER OR DIGIT 2>> ::=
    <where <LETTER OR DIGIT 1> precedes+ <LETTER OR DIGIT 3>>
    <where <LETTER OR DIGIT 3> precedes+ <LETTER OR DIGIT 2>>

<where a precedes b> ::=
<where b precedes c> ::=
[...]
<TYPED> ::== real | integer | Boolean
<NAME> ::== <LETTER> | <NAME> <LETTER> | <NAME> <DIGIT>
<LETTER OR DIGIT> ::== <LETTER> | <DIGIT>
<LETTER OR DIGIT 1> ::== <LETTER OR DIGIT>
<LETTER OR DIGIT 2> ::== <LETTER OR DIGIT>
<LETTER OR DIGIT 3> ::== <LETTER OR DIGIT>
<LETTER> ::== a | b | c | [...]
<DIGIT> ::== 0 | 1 | 2 | [...]
<NAMES1> ::== <NAMES>
<NAMES2> ::== <NAMES>
<ASSIGNED TO> ::== <NAMES>
<USED> ::== <NAMES>
<NAMES> ::== <NAME> | <NAME> <NAMES>
<EMPTY> ::==
<TYPEMAP> ::== (<TYPED> <NAME>) <TYPEMAP> | <EMPTY>
<TYPEMAP1> ::== <TYPEMAP>
<TYPEMAP2> ::== <TYPEMAP>
<TYPEMAP3> ::== <TYPEMAP>
<TYPEMAP4> ::== <TYPEMAP>
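As an illustration (the Python names are hypothetical, not from any report), the string-encoded "type map" maintained by the <where ... added to sorted ...> hyperrules can be sketched directly: the map is a string of (type name) pairs kept sorted by variable name, and each hyperrule corresponds to one sorted-insertion step performed by pattern matching:

```python
import re

def add_to_sorted(typemap, typed, name):
    """Insert '(typed name)' into a typemap string, keeping entries
    sorted lexicographically by variable name, mirroring the
    <where ... added to sorted ...> hyperrules."""
    entries = re.findall(r"\((\w+) (\w+)\)", typemap)
    entries.append((typed, name))
    entries.sort(key=lambda e: e[1])  # lexicographic order on the name
    return "".join(f"({t} {n})" for t, n in entries)

tm = add_to_sorted("", "integer", "n")
tm = add_to_sorted(tm, "real", "a")
# tm is now "(real a)(integer n)" -- sorted by variable name
```

The point of the exercise is that the W-grammar achieves the same effect with nothing but string concatenation and pattern matching: sorting, comparison, and insertion are all spelled out as hyperrules over the string representation.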
When compared to the original grammar, three new elements have been added: attributes on the nonterminals, recording the names and types of variables; hyperrules of the form <where ...> that enforce the constraints; and metarules defining the attribute values.
The new hyperrules are ε-rules: they only generate the empty string.
The ALGOL 68 reports use a slightly different notation without <angled brackets>.
a) program : open symbol, standard prelude, library prelude option, particular program, exit, library postlude option, standard postlude, close symbol.
b) standard prelude : declaration prelude sequence.
c) library prelude : declaration prelude sequence.
d) particular program : label sequence option, strong CLOSED void clause.
e) exit : go on symbol, letter e letter x letter i letter t, label symbol.
f) library postlude : statement interlude.
g) standard postlude : strong void clause train
program : strong void new closed clause

A) EXTERNAL :: standard ; library ; system ; particular.
B) STOP :: label letter s letter t letter o letter p.
a) program text : STYLE begin token, new LAYER1 preludes, parallel token, new LAYER1 tasks PACK, STYLE end token.
b) NEST1 preludes : NEST1 standard prelude with DECS1, NEST1 library prelude with DECSETY2, NEST1 system prelude with DECSETY3, where (NEST1) is (new EMPTY new DECS1 DECSETY2 DECSETY3).
c) NEST1 EXTERNAL prelude with DECSETY1 : strong void NEST1 series with DECSETY1, go on token ; where (DECSETY1) is (EMPTY), EMPTY.
d) NEST1 tasks : NEST1 system task list, and also token, NEST1 user task PACK list.
e) NEST1 system task : strong void NEST1 unit.
f) NEST1 user task : NEST2 particular prelude with DECS, NEST2 particular program PACK, go on token, NEST2 particular postlude, where (NEST2) is (NEST1 new DECS STOP).
g) NEST2 particular program : NEST2 new LABSETY3 joined label definition of LABSETY3, strong void NEST2 new LABSETY3 ENCLOSED clause.
h) NEST joined label definition of LABSETY : where (LABSETY) is (EMPTY), EMPTY ; where (LABSETY) is (LAB1 LABSETY1), NEST label definition of LAB1, NEST joined label definition of LABSETY1.
i) NEST2 particular postlude : strong void NEST2 series with STOP.
A simple example of the power of W-grammars is clause a) of the grammar above:
a) program text : STYLE begin token, new LAYER1 preludes, parallel token, new LAYER1 tasks PACK, STYLE end token.
This allows BEGIN ... END and {} as block delimiters, while ruling out BEGIN ... } and { ... END.
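A minimal sketch of this agreement effect (the delimiter table is assumed for illustration, following the example above): because STYLE must be substituted consistently within rule a), the begin token and end token of one clause always come from the same style.

```python
# Each STYLE value determines one (begin token, end token) pair.
styles = {"bold": ("BEGIN", "END"), "brief": ("{", "}")}

def matched(open_tok, close_tok):
    """True iff both delimiters arise from the same STYLE substitution."""
    return any((open_tok, close_tok) == pair for pair in styles.values())
```

So BEGIN ... END and { ... } are accepted, while BEGIN ... } is rejected, all without any extra checking rule: the constraint falls out of consistent substitution.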
One may wish to compare the grammar in the report with the Yacc parser for a subset of ALGOL 68 by Marc van Leeuwen. [11]
Anthony Fisher wrote yo-yo, [12] a parser for a large class of W-grammars, with example grammars for expressions, eva, sal and Pascal (the actual ISO 7185 standard for Pascal uses extended Backus–Naur form).
Dick Grune created a C program that would generate all possible productions of a W-grammar. [13]
The applications of Extended Affix Grammars (EAGs) mentioned above can effectively be regarded as applications of W-grammars, since EAGs are so close to W-grammars. [14]
W-grammars have also been proposed for the description of complex human actions in ergonomics.
A W-grammar description has also been given for Ada. [15]