Van Wijngaarden grammar

Last updated

In computer science, a Van Wijngaarden grammar (also vW-grammar or W-grammar [1] ) is a formalism for defining formal languages. The name derives from the formalism invented by Adriaan van Wijngaarden [2] for the purpose of defining the ALGOL 68 programming language. The resulting specification [3] remains its most notable application.

Contents

Van Wijngaarden grammars address the problem that context-free grammars cannot express agreement or reference, where two different parts of the sentence must agree with each other in some way. For example, the sentence "The birds was eating" is not Standard English because it fails to agree on number. A context-free grammar would parse "The birds was eating" and "The birds were eating" and "The bird was eating" in the same way. However, context-free grammars have the benefit of simplicity whereas van Wijngaarden grammars are considered highly complex. [4]

Two levels

W-grammars are two-level grammars: they are defined by a pair of grammars, that operate on different levels:

The set of strings generated by a W-grammar is defined by a two-stage process:

  1. within each hyperrule, for each attribute that occurs in it, pick a value for it generated by the metagrammar; the result is a normal context-free grammar rule; do this in every possible way;
  2. use the resulting (possibly infinite) context-free grammar to generate strings in the normal way.

The consistent substitution used in the first step is the same as substitution in predicate logic, and actually supports logic programming; it corresponds to unification in Prolog, as noted by Alain Colmerauer [ where? ].

W-grammars are Turing complete; [5] hence, all decision problems regarding the languages they generate, such as

are undecidable.

Curtailed variants, known as affix grammars, were developed, and applied in compiler construction and to the description of natural languages.

Definite logic programs, that is, logic programs that make no use of negation, can be viewed as a subclass of W-grammars. [6]

Motivation and history

In the 1950s, attempts started to apply computers to the recognition, interpretation and translation of natural languages, such as English and Russian. This requires a machine-readable description of the phrase structure of sentences, that can be used to parse and interpret them, and to generate them. Context-free grammars, a concept from structural linguistics, were adopted for this purpose; their rules can express how sentences are recursively built out of parts of speech, such as noun phrases and verb phrases, and ultimately, words, such as nouns, verbs, and pronouns.

This work influenced the design and implementation of programming languages, most notably, of ALGOL 60, which introduced a syntax description in Backus–Naur form.

However, context-free rules cannot express agreement or reference (anaphora), where two different parts of the sentence must agree with each other in some way.

These can be readily expressed in W-grammars. (See example below.)

Programming languages have the analogous notions of typing and scoping. A compiler or interpreter for the language must recognize which uses of a variable belong together (refer to the same variable). This is typically subject to constraints such as:

W-grammars are based on the idea of providing the nonterminal symbols of context-free grammars with attributes (or affixes) that pass information between the nodes of the parse tree, used to constrain the syntax and to specify the semantics.

This idea was well known at the time; e.g. Donald Knuth visited the ALGOL 68 design committee while developing his own version of it, attribute grammars. [7]

By augmenting the syntax description with attributes, constraints like the above can be checked, ruling many invalid programs out at compile time. As Van Wijngaarden wrote in his preface: [2]

My main objections were certain to me unnecessary restrictions and the definition of the syntax and semantics. Actually the syntax viewed in MR 75 produces a large number of programs, whereas I should prefer to have the subset of meaningful programs as large as possible, which requires a stricter syntax. [...] it soon became clear that some better tools than the Backus notation might be advantageous [...]. I developed a scheme [...] which enables the design of a language to carry much more information in the syntax than is normally carried.

Quite peculiar to W-grammars was their strict treatment of attributes as strings, defined by a context-free grammar, on which concatenation is the only possible operation; complex data structures and operations can be defined by pattern matching. (See example below.)

After their introduction in the 1968 ALGOL 68 "Final Report", W-grammars were widely considered as too powerful and unconstrained to be practical.[ citation needed ]

This was partly a consequence of the way in which they had been applied; the 1973 ALGOL 68 "Revised Report" contains a much more readable grammar, without modifying the W-grammar formalism itself.

Meanwhile, it became clear that W-grammars, when used in their full generality, are indeed too powerful for such practical purposes as serving as the input for a parser generator. They describe precisely all recursively enumerable languages, [8] which makes parsing impossible in general: it is an undecidable problem to decide whether a given string can be generated by a given W-grammar.

Hence, their use must be seriously constrained when used for automatic parsing or translation. Restricted and modified variants of W-grammars were developed to address this, e.g.

After the 1970s, interest in the approach waned; occasionally, new studies are published. [9]

Examples

Agreement in English grammar

In English, nouns, pronouns and verbs have attributes such as grammatical number, gender, and person, which must agree between subject, main verb, and pronouns referring to the subject:

are valid sentences; invalid are, for instance:

Here, agreement serves to stress that both pronouns (e.g. I and myself) refer to the same person.

A context-free grammar to generate all such sentences:

<sentence> ::= <subject> <verb> <object> <subject>  ::= I | You | He | She | We | They <verb>     ::= wash | washes <object>   ::= myself | yourself | himself | herself | ourselves | yourselves | themselves

From sentence, we can generate all combinations:

I wash myself I wash yourself I wash himself [...] They wash yourselves They wash themselves

A W-grammar to generate only the valid sentences:

<sentence <NUMBER> <GENDER> <PERSON>>   ::= <subject <NUMBER> <GENDER> <PERSON>>       <verb <NUMBER> <PERSON>>       <object <NUMBER> <GENDER> <PERSON>>
<subject singular <GENDER> 1st> ::= I <subject <NUMBER> <GENDER> 2nd> ::= You <subject singular male 3rd>     ::= He <subject singular female 3rd>   ::= She <subject plural <GENDER> 1st>   ::= We <subject singular <GENDER> 3rd> ::= They
<verb singular 1st>    ::= wash <verb singular 2nd>    ::= wash <verb singular 3rd>    ::= washes <verb plural <PERSON>> ::= wash
<object singular <GENDER> 1st> ::= myself <object singular <GENDER> 2nd> ::= yourself <object singular male 3rd>     ::= himself <object singular female 3rd>   ::= herself <object plural <GENDER> 1st>   ::= ourselves <object plural <GENDER> 2nd>   ::= yourselves <object plural <GENDER> 3rd>   ::= themselves
<NUMBER> ::== singular | plural <GENDER> ::== male | female <PERSON> ::== 1st | 2nd | 3rd

A standard non-context-free language

A well-known non-context-free language is

A two-level grammar for this language is the metagrammar

N ::= 1 | N1
X ::= a | b

together with grammar schema

Start ::=
 ::=
 ::= X

Requiring valid use of variables in ALGOL

The Revised Report on the Algorithmic Language Algol 60 [10] defines a full context-free syntax for the language.

Assignments are defined as follows (section 4.2.1):

<left part>   ::= <variable> :=     | <procedure identifier> :=  <left part list>   ::= <left part>     | <left part list> <left part>  <assignment statement>   ::= <left part list> <arithmetic expression>     | <left part list> <Boolean expression>

A <variable> can be (amongst other things) an <identifier>, which in turn is defined as:

<identifier> ::= <letter> | <identifier> <letter> | <identifier> <digit>

Examples (section 4.2.2):

s:=p[0]:=n:=n+1+s n:=n+1 A:=B/C-v-q×S S[v,k+2]:=3-arctan(sTIMESzeta) V:=Q>Y^Z

Expressions and assignments must be type checked: for instance,

The rules above distinguish between <arithmetic expression> and <Boolean expression>, but they cannot verify that the same variable always has the same type.

This (non-context-free) requirement can be expressed in a W-grammar by annotating the rules with attributes that record, for each variable used or assigned to, its name and type.

This record can then be carried along to all places in the grammar where types need to be matched, and implement type checking.

Similarly, it can be used to checking initialization of variables before use, etcetera.

One may wonder how to create and manipulate such a data structure without explicit support in the formalism for data structures and operations on them. It can be done by using the metagrammar to define a string representation for the data structure and using pattern matching to define operations:

<left part with <TYPED> <NAME>>   ::= <variable with <TYPED> <NAME>> :=     | <procedure identifier with <TYPED> <NAME>> :=  <left part list <TYPEMAP1>>   ::= <left part with <TYPED> <NAME>>       <where <TYPEMAP1> is <TYPED> <NAME> added to sorted <EMPTY>>     | <left part list <TYPEMAP2>>       <left part with <TYPED> <NAME>>       <where <TYPEMAP1> is <TYPED> <NAME> added to sorted <TYPEMAP2>>  <assignment statement <ASSIGNED TO> <USED>>   ::= <left part list <ASSIGNED TO>> <arithmetic expression <USED>>     | <left part list <ASSIGNED TO>> <Boolean expression <USED>>  <where <TYPED> <NAME> is <TYPED> <NAME> added to sorted <EMPTY>>   ::=  <where <TYPEMAP1> is <TYPED1> <NAME1> added to sorted <TYPEMAP2>>   ::= <where <TYPEMAP2> is <TYPED2> <NAME2> added to sorted <TYPEMAP3>>       <where <NAME1> is lexicographically before <NAME2>>  <where <TYPEMAP1> is <TYPED1> <NAME1> added to sorted <TYPEMAP2>>   ::= <where <TYPEMAP2> is <TYPED2> <NAME2> added to sorted <TYPEMAP3>>       <where <NAME2> is lexicographically before <NAME1>>       <where <TYPEMAP3> is <TYPED1> <NAME1> added to sorted <TYPEMAP4>>  <where <EMPTY> is lexicographically before <NAME1>>   ::= <where <NAME1> is <LETTER OR DIGIT> followed by <NAME2>>   <where <NAME1> is lexicographically before <NAME2>>   ::= <where <NAME1> is <LETTER OR DIGIT> followed by <NAME3>>       <where <NAME2> is <LETTER OR DIGIT> followed by <NAME4>>       <where <NAME3> is lexicographically before <NAME4>>  <where <NAME1> is lexicographically before <NAME2>>   ::= <where <NAME1> is <LETTER OR DIGIT 1> followed by <NAME3>>       <where <NAME2> is <LETTER OR DIGIT 2> followed by <NAME4>>       <where <LETTER OR DIGIT 1> precedes+ <LETTER OR DIGIT 2>  <where <LETTER OR DIGIT 1> precedes+ <LETTER OR DIGIT 2>   ::= <where <LETTER OR DIGIT 1> precedes <LETTER OR DIGIT 2>  <where <LETTER OR DIGIT 1> precedes+ <LETTER OR DIGIT 2>   ::= <where <LETTER OR DIGIT 1> precedes+ <LETTER OR DIGIT 3>       <where <LETTER OR DIGIT 3> precedes+ <LETTER OR DIGIT 2>  <where a precedes b> :== <where b precedes c> :== [...]  <TYPED>       ::== real | integer | Boolean  <NAME>        ::== <LETTER> | <NAME> <LETTER> | <NAME> <DIGIT> <LETTER OR DIGIT> ::== <LETTER> | <DIGIT> <LETTER OR DIGIT 1> ::= <LETTER OR DIGIT> <LETTER OR DIGIT 2> ::= <LETTER OR DIGIT> <LETTER OR DIGIT 3> ::= <LETTER OR DIGIT> <LETTER>      ::== a | b | c | [...] <DIGIT>       ::== 0 | 1 | 2 | [...]  <NAMES1>      ::== <NAMES> <NAMES2>      ::== <NAMES> <ASSIGNED TO> ::== <NAMES> <USED>        ::== <NAMES> <NAMES>       ::== <NAME> | <NAME> <NAMES>  <EMPTY>       ::== <TYPEMAP>     ::== (<TYPED> <NAME>) <TYPEMAP> <TYPEMAP1>    ::== <TYPEMAP> <TYPEMAP2>    ::== <TYPEMAP> <TYPEMAP3>    ::== <TYPEMAP>

When compared to the original grammar, three new elements have been added:

The new hyperrules are -rules: they only generate the empty string.

ALGOL 68 examples

The ALGOL 68 reports use a slightly different notation without <angled brackets>.

ALGOL 68 as in the 1968 Final Report §2.1

a) program : open symbol, standard prelude,       library prelude option, particular program, exit,       library postlude option, standard postlude, close symbol.  b) standard prelude : declaration prelude sequence.  c) library prelude : declaration prelude sequence.  d) particular program :       label sequence option, strong CLOSED void clause.  e) exit : go on symbol, letter e letter x letter i letter t, label symbol.  f) library postlude : statement interlude.  g) standard postlude : strong void clause train

ALGOL 68 as in the 1973 Revised Report §2.2.1, §10.1.1

  program : strong void new closed clause   A) EXTERNAL :: standard ; library ; system ; particular.  B) STOP :: label letter s letter t letter o letter p.   a) program text : STYLE begin token, new LAYER1 preludes,          parallel token, new LAYER1 tasks PACK,          STYLE end token.  b) NEST1 preludes : NEST1 standard prelude with DECS1,          NEST1 library prelude with DECSETY2,          NEST1 system prelude with DECSETY3, where (NEST1) is         (new EMPTY new DECS1 DECSETY2 DECSETY3).  c) NEST1 EXTERNAL prelude with DECSETY1 :          strong void NEST1 series with DECSETY1, go on token ;          where (DECSETY1) is (EMPTY), EMPTY.  d) NEST1 tasks : NEST1 system task list, and also token,          NEST1 user task PACK list.  e) NEST1 system task : strong void NEST1 unit.  f) NEST1 user task : NEST2 particular prelude with DECS,          NEST2 particular program PACK, go on token,          NEST2 particular postlude,          where (NEST2) is (NEST1 new DECS STOP).  g) NEST2 particular program :          NEST2 new LABSETY3 joined label definition         of LABSETY3, strong void NEST2 new LABSETY3         ENCLOSED clause.  h) NEST joined label definition of LABSETY :          where (LABSETY) is (EMPTY), EMPTY ;          where (LABSETY) is (LAB1 LABSETY1),             NEST label definition of LAB1,             NEST joined label definition of$ LABSETY1.   i) NEST2 particular postlude :         strong void NEST2 series with STOP.

A simple example of the power of W-grammars is clause

a) program text : STYLE begin token, new LAYER1 preludes,         parallel token, new LAYER1 tasks PACK,         STYLE end token.

This allows BEGIN ... END and {} as block delimiters, while ruling out BEGIN ... } and { ... END.

One may wish to compare the grammar in the report with the Yacc parser for a subset of ALGOL 68 by Marc van Leeuwen. [11]

Implementations

Anthony Fisher wrote yo-yo, [12] a parser for a large class of W-grammars, with example grammars for expressions, eva, sal and Pascal (the actual ISO 7185 standard for Pascal uses extended Backus–Naur form).

Dick Grune created a C program that would generate all possible productions of a W-grammar. [13]

Applications outside of ALGOL 68

The applications of Extended Affix Grammars (EAG)s mentioned above can effectively be regarded as applications of W-grammars, since EAGs are so close to W-grammars. [14]

W-grammars have also been proposed for the description of complex human actions in ergonomics.[ citation needed ]

A W-Grammar Description has also been supplied for Ada. [15]

See also

Related Research Articles

Atlas Autocode (AA) is a programming language developed around 1963 at the University of Manchester. A variant of the language ALGOL, it was developed by Tony Brooker and Derrick Morris for the Atlas computer. The initial AA and AB compilers were written by Jeff Rohl and Tony Brooker using the Brooker-Morris Compiler-compiler, with a later hand-coded non-CC implementation (ABC) by Jeff Rohl.

<span class="mw-page-title-main">Context-free grammar</span> Type of formal grammar

In formal language theory, a context-free grammar (CFG) is a formal grammar whose production rules can be applied to a nonterminal symbol regardless of its context. In particular, in a context-free grammar, each production rule is of the form

In a computer language, a reserved word is a word that cannot be used as an identifier, such as the name of a variable, function, or label – it is "reserved from use". This is a syntactic definition, and a reserved word may have no user-defined meaning.

In computer science, Backus–Naur form or Backus normal form (BNF) is a metasyntax notation for context-free grammars, often used to describe the syntax of languages used in computing, such as computer programming languages, document formats, instruction sets and communication protocols. It is applied wherever exact descriptions of languages are needed: for instance, in official language specifications, in manuals, and in textbooks on programming language theory.

In computer science, a compiler-compiler or compiler generator is a programming tool that creates a parser, interpreter, or compiler from some form of formal description of a programming language and machine.

<span class="mw-page-title-main">Parse tree</span> Tree in formal language theory

A parse tree or parsing tree or derivation tree or concrete syntax tree is an ordered, rooted tree that represents the syntactic structure of a string according to some context-free grammar. The term parse tree itself is used primarily in computational linguistics; in theoretical syntax, the term syntax tree is more common.

ALGOL W is a programming language. It is based on a proposal for ALGOL X by Niklaus Wirth and Tony Hoare as a successor to ALGOL 60. ALGOL W is a relatively simple upgrade of the original ALGOL 60, adding string, bitstring, complex number and reference to record data types and call-by-result passing of parameters, introducing the while statement, replacing switch with the case statement, and generally tightening up the language.

Head-driven phrase structure grammar (HPSG) is a highly lexicalized, constraint-based grammar developed by Carl Pollard and Ivan Sag. It is a type of phrase structure grammar, as opposed to a dependency grammar, and it is the immediate successor to generalized phrase structure grammar. HPSG draws from other fields such as computer science and uses Ferdinand de Saussure's notion of the sign. It uses a uniform formalism and is organized in a modular way which makes it attractive for natural language processing.

Parsing, syntax analysis, or syntactic analysis is the process of analyzing a string of symbols, either in natural language, computer languages or data structures, conforming to the rules of a formal grammar. The term parsing comes from Latin pars (orationis), meaning part.

An attribute grammar is a formal way to supplement a formal grammar with semantic information processing. Semantic information is stored in attributes associated with terminal and nonterminal symbols of the grammar. The values of attributes are result of attribute evaluation rules associated with productions of the grammar. Attributes allow to transfer information from anywhere in the abstract syntax tree to anywhere else, in a controlled and formal way.

In computer science, a union is a value that may have any of several representations or formats within the same position in memory; that consists of a variable that may hold such a data structure. Some programming languages support special data types, called union types, to describe such values and variables. In other words, a union type definition will specify which of a number of permitted primitive types may be stored in its instances, e.g., "float or long integer". In contrast with a record, which could be defined to contain both a float and an integer; in a union, there is only one value at any given time.

<span class="mw-page-title-main">ALGOL 68</span> Programming language

ALGOL 68 is an imperative programming language that was conceived as a successor to the ALGOL 60 programming language, designed with the goal of a much wider scope of application and more rigorously defined syntax and semantics.

In computer science, extended affix grammars (EAGs) are a formal grammar formalism for describing the context free and context sensitive syntax of language, both natural language and programming languages.

<span class="mw-page-title-main">Syntax (programming languages)</span> Set of rules defining correctly structured programs

In computer science, the syntax of a computer language is the rules that define the combinations of symbols that are considered to be correctly structured statements or expressions in that language. This applies both to programming languages, where the document represents source code, and to markup languages, where the document represents data.

In computer programming, the lexer hack is a common solution to the problems in parsing ANSI C, due to the reference grammar being context-sensitive. In C, classifying a sequence of characters as a variable name or a type name requires contextual information of the phrase structure, which prevents one from having a context-free lexer.

In computer language design, stropping is a method of explicitly marking letter sequences as having a special property, such as being a keyword, or a certain type of variable or storage location, and thus inhabiting a different namespace from ordinary names ("identifiers"), in order to avoid clashes. Stropping is not used in most modern languages – instead, keywords are reserved words and cannot be used as identifiers. Stropping allows the same letter sequence to be used both as a keyword and as an identifier, and simplifies parsing in that case – for example allowing a variable named if without clashing with the keyword if.

An affix grammar is a kind of formal grammar; it is used to describe the syntax of languages, mainly computer languages, using an approach based on how natural language is typically described.

<span class="mw-page-title-main">Formal grammar</span> Structure of a formal language

A formal grammar describes how to form strings from an alphabet of a formal language that are valid according to the language's syntax. A grammar does not describe the meaning of the strings or what can be done with them in whatever context—only their form. A formal grammar is defined as a set of production rules for such strings in a formal language.

<span class="mw-page-title-main">History of compiler construction</span>

In computing, a compiler is a computer program that transforms source code written in a programming language or computer language, into another computer language. The most common reason for transforming source code is to create an executable program.

In computer programming languages, an identifier is a lexical token that names the language's entities. Some of the kinds of entities an identifier might denote include variables, data types, labels, subroutines, and modules.

References

  1. Cleaveland, J. Craig; Uzgalis, Robert C. (1977). Grammars for Programming Languages. Elsevier. ISBN   978-0-444-00199-3.
  2. 1 2 van Wijngaarden, Adriaan (1972-04-04) [Premature and preliminary edition 1965-10-22]. MR 76: Orthogonal design and description of a formal language (PDF) (Technical report). Amsterdam: CWI. Archived from the original (PDF) on 2017-10-02.
  3. van Wijngaarden, A.; et al. (eds.). "Revised Report on the Algorithmic Language ALGOL 68". Archived from the original on 24 January 2002.
  4. Koster, C.H.A (1996). "The making of Algol 68". In Bjørner, D; Broy, M.; Pottosin, I.V. (eds.). Perspectives of System Informatics. Lecture Notes in Computer Science. Vol. 1181. Berlin: Springer. pp. 55–67. doi:10.1007/3-540-62064-8_6. ISBN   978-3-540-62064-8.
  5. Sintzoff, M. (1967). "Existence of van Wijngaarden syntax for every recursively enumerable set". Annales de la Société Scientifique de Bruxelles. 2: 115–118.
  6. Deransart, Pierre; Maluszynski, Jan (1993), "Grammatical Extensions of Logic Programs", A Grammatical View of Logic Programming, The MIT Press, pp. 109–140, doi:10.7551/mitpress/3345.003.0008, ISBN   9780262290845 , retrieved 2023-06-14
  7. Knuth, Donald E (1990), "The genesis of attribute grammars" (Plain TeX, gZiped), Proceedings of the International Conference on Attribute Grammars and Their Applications, Springer Verlag: 1–12.
  8. Sintzoff, M. (1967). "Existence of a van Wijngaarden syntax for every recursively enumerable set". Annales de la Société scientifique de Bruxelles. 81: 115–118.
  9. Augusto, L. M. (2023). "Two-level grammars: Some interesting properties of van Wijngaarden grammars" (PDF). Omega - Journal of Formal Languages. 1: 3–34.
  10. Backus, J.W.; et al. (1963). "Revised report on the algorithmic language ALGOL 60". The Computer Journal. 5 (4): 349–367. doi: 10.1093/comjnl/5.4.349 .
  11. "Syntax", Algol 68, FR: Univ Poitiers
  12. Fisher, Anthony, "yo-yo", Software, UK: York.
  13. Grune, Dick, A Two-Level Sentence Generator, NL: VU.
  14. Alblas, Henk; Melichar, Borivoj (1991). Attribute Grammars, Applications and Systems. Lecture Notes in Computer Science. Vol. 545. Springer. p. 371. ISBN   978-3540545729.
  15. Flowers, Roy, A W-grammar description for Ada (PDF) (Master thesis), Air Force Institute of Technology, Air University

Further reading