Higher-order abstract syntax

Last updated January 05, 2024

In computer science, higher-order abstract syntax (abbreviated HOAS) is a technique for the representation of abstract syntax trees for languages with variable binders.

Relation to first-order abstract syntax

An abstract syntax is abstract because it is represented by mathematical objects that have a certain structure by their very nature. For instance, in first-order abstract syntax (FOAS) trees, as commonly used in compilers, the tree structure implies the subexpression relation, meaning that no parentheses are required to disambiguate programs (as they are in the concrete syntax). HOAS exposes additional structure: the relationship between variables and their binding sites. In FOAS representations, a variable is typically represented with an identifier, with the relation between binding site and use being indicated by using the same identifier. With HOAS, there is no name for the variable; each use of the variable refers directly to the binding site.

There are a number of reasons why this technique is useful. First, it makes the binding structure of a program explicit: just as there is no need to explain operator precedence in a FOAS representation, there is no need to have the rules of binding and scope at hand to interpret a HOAS representation. Second, programs that are alpha-equivalent (differing only in the names of bound variables) have identical representations in HOAS, which can make equivalence checking more efficient.
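The cost of checking equivalence on a named first-order representation can be illustrated with a minimal Python sketch. The classes `Var`, `Lam`, `App` and the helper `alpha_eq` are illustrative assumptions, not part of any library. The terms λx.x and λy.y are distinct values here, so a dedicated comparison has to track which binder each variable refers to:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Var:
    name: str


@dataclass(frozen=True)
class Lam:
    param: str
    body: object


@dataclass(frozen=True)
class App:
    fn: object
    arg: object


def alpha_eq(a, b, env_a=None, env_b=None, depth=0):
    """Compare two named terms up to renaming of bound variables."""
    env_a = {} if env_a is None else env_a
    env_b = {} if env_b is None else env_b
    if isinstance(a, Var) and isinstance(b, Var):
        # Bound variables must refer to the same binder (recorded by depth);
        # free variables must carry the same name.
        if a.name in env_a or b.name in env_b:
            return env_a.get(a.name) == env_b.get(b.name)
        return a.name == b.name
    if isinstance(a, Lam) and isinstance(b, Lam):
        # Record the depth at which each parameter was bound.
        return alpha_eq(a.body, b.body,
                        {**env_a, a.param: depth},
                        {**env_b, b.param: depth},
                        depth + 1)
    if isinstance(a, App) and isinstance(b, App):
        return (alpha_eq(a.fn, b.fn, env_a, env_b, depth)
                and alpha_eq(a.arg, b.arg, env_a, env_b, depth))
    return False


identity_x = Lam("x", Var("x"))
identity_y = Lam("y", Var("y"))
```

Here `identity_x == identity_y` is false because the names differ, while `alpha_eq(identity_x, identity_y)` is true. A representation in which alpha-equivalent terms coincide would make the extra traversal logic unnecessary.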

Implementation

One mathematical object that could be used to implement HOAS is a graph where variables are associated with their binding sites via edges. Another popular way to implement HOAS (in, for example, compilers) is with de Bruijn indices.
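As a rough sketch of the de Bruijn approach, the following Python function converts named terms into a nameless form. The tuple encoding and the function name `to_de_bruijn` are assumptions made for illustration only. Each bound variable is replaced by the number of binders between its use and its binding site, so alpha-equivalent terms become equal values:

```python
def to_de_bruijn(term, binders=()):
    """Convert a named term to a nameless one.

    Input terms (hypothetical encoding):
        ("var", name) | ("lam", name, body) | ("app", fn, arg)
    `binders` lists the enclosing parameter names, innermost first.
    """
    tag = term[0]
    if tag == "var":
        name = term[1]
        if name in binders:
            # index() finds the innermost binder with this name.
            return ("bound", binders.index(name))
        return ("free", name)
    if tag == "lam":
        return ("lam", to_de_bruijn(term[2], (term[1],) + binders))
    if tag == "app":
        return ("app",
                to_de_bruijn(term[1], binders),
                to_de_bruijn(term[2], binders))
    raise ValueError(f"unknown term tag: {tag!r}")


# λx. λy. x and λa. λb. a differ only in the names of bound variables.
k_xy = ("lam", "x", ("lam", "y", ("var", "x")))
k_ab = ("lam", "a", ("lam", "b", ("var", "a")))
```

After conversion, both terms are `("lam", ("lam", ("bound", 1)))`, so ordinary equality suffices to compare them.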

Use in logic programming

The first programming language that directly supported λ-bindings in syntax was the higher-order logic programming language λProlog.^[1] The paper that introduced the term HOAS^[2] used λProlog code to illustrate it. Unfortunately, when one transfers the term HOAS from the logic programming setting to the functional programming setting, the term implies the identification of bindings in syntax with functions over expressions. In this latter setting, HOAS has a different and problematic sense. The term λ-tree syntax has been introduced to refer specifically to the style of representation available in the logic programming setting.^[3]^[4] While different in detail, the treatment of bindings in λProlog is similar to their treatment in logical frameworks, elaborated in the next section.

Use in logical frameworks

In the domain of logical frameworks, the term higher-order abstract syntax is usually used to refer to a specific representation that uses the binders of the meta-language to encode the binding structure of the object language.

For instance, the logical framework LF has a λ-construct, which has arrow (→) type. As an example, suppose we want to formalize a very primitive language with untyped expressions, a built-in set of variables, and a let construct (let <var> = <exp> in <exp'>), which binds the variable var to the definition exp within the expression exp'. In Twelf syntax, this could be written as follows:

    exp : type.
    var : type.
    v   : var -> exp.
    let : var -> exp -> exp -> exp.

Here, exp is the type of all expressions and var the type of all built-in variables (implemented perhaps as natural numbers, which is not shown). The constant v acts as a casting function and witnesses the fact that variables are expressions. Finally, the constant let represents let constructs of the form let <var> = <exp> in <exp>: it accepts a variable, an expression (being bound by the variable), and another expression (that the variable is bound within).
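For comparison, the first-order signature can be mirrored in Python as a minimal sketch. The class names, the use of strings for variables, and the `Num` and `Plus` constructors are assumptions added for illustration. Because variables are plain identifiers, the evaluator has to maintain its own environment to connect each use to its binding:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Num:
    value: int


@dataclass(frozen=True)
class Plus:
    left: object
    right: object


@dataclass(frozen=True)
class V:
    # Mirrors v : var -> exp; variables are strings here (an assumption).
    var: str


@dataclass(frozen=True)
class Let:
    # Mirrors let : var -> exp -> exp -> exp.
    var: str
    bound: object
    body: object


def evaluate(e, env=None):
    """Evaluate an expression, resolving variables through an explicit environment."""
    env = {} if env is None else env
    if isinstance(e, Num):
        return e.value
    if isinstance(e, Plus):
        return evaluate(e.left, env) + evaluate(e.right, env)
    if isinstance(e, V):
        return env[e.var]  # raises KeyError for an unbound variable
    if isinstance(e, Let):
        # Scoping is implemented by hand: extend the environment for the body.
        return evaluate(e.body, {**env, e.var: evaluate(e.bound, env)})
    raise TypeError(f"not an expression: {e!r}")


# let x = 1 + 2 in x + 3
program = Let("x", Plus(Num(1), Num(2)), Plus(V("x"), Num(3)))
```

Evaluating `program` yields 6. The environment handling and the possibility of unbound names are the kind of bookkeeping that a first-order encoding leaves to the implementer.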

The canonical HOAS representation of the same object language would be:

    exp : type.
    let : exp -> (exp -> exp) -> exp.

In this representation, object-level variables do not appear explicitly. The constant let takes an expression (that is being bound) and a meta-level function exp → exp (the body of the let). This function is the higher-order part: an expression with a free variable is represented as an expression with holes that are filled in by the meta-level function when applied. As a concrete example, we would construct the object-level expression

    let x = 1 + 2 in x + 3

(assuming the natural constructors for numbers and addition) using the HOAS signature above as

    let (plus 1 2) ([y] plus y 3)

where [y] e is Twelf's syntax for the function $\lambda y.e$.
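The same construction can be imitated in a general-purpose language by storing a host-language function as the body of the let. The following Python sketch is illustrative only: the class names and `evaluate` are assumptions, and Python functions are not restricted in the way LF functions are, so this corresponds to the functional-programming reading of HOAS rather than to the LF encoding itself.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Num:
    value: int


@dataclass(frozen=True)
class Plus:
    left: object
    right: object


@dataclass(frozen=True)
class Let:
    # Mirrors let : exp -> (exp -> exp) -> exp.
    bound: object
    body: Callable[[object], object]


def evaluate(e):
    """Evaluate an expression; no environment or variable names are needed."""
    if isinstance(e, Num):
        return e.value
    if isinstance(e, Plus):
        return evaluate(e.left) + evaluate(e.right)
    if isinstance(e, Let):
        # Substitution is just application of the host-language function.
        return evaluate(e.body(e.bound))
    raise TypeError(f"not an expression: {e!r}")


# let x = 1 + 2 in x + 3, written as let (plus 1 2) ([y] plus y 3)
program = Let(Plus(Num(1), Num(2)), lambda y: Plus(y, Num(3)))
```

Evaluating `program` yields 6, and `program.body(Num(10))` produces `Plus(Num(10), Num(3))`: filling the hole in the body is ordinary function application, with no renaming or capture-avoidance code.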

This specific representation has advantages beyond the ones above: for one, by reusing the meta-level notion of binding, the encoding enjoys properties such as type-preserving substitution without the need to define/prove them. In this way using HOAS can drastically reduce the amount of boilerplate code having to do with binding in an encoding.

Higher-order abstract syntax is generally only applicable when object language variables can be understood as variables in the mathematical sense (that is, as stand-ins for arbitrary members of some domain). This is often, but not always, the case: for instance, there are no advantages to be gained from a HOAS encoding of dynamic scope as it appears in some dialects of Lisp because dynamically scoped variables do not act like mathematical variables.

References

[1] Dale Miller; Gopalan Nadathur (1987). A Logic Programming Approach to Manipulating Formulas and Programs. IEEE Symposium on Logic Programming. pp. 379–388.
[2] Frank Pfenning; Conal Elliott (1988). Higher-order abstract syntax. Proceedings of the ACM SIGPLAN PLDI '88. pp. 199–208. doi:10.1145/53990.54010. ISBN 0-89791-269-1.
[3] Dale Miller (2000). Abstract Syntax for Variable Binders: An Overview. Computational Logic - CL 2000. pp. 239–253.
[4] Dale Miller (October 2019). "Mechanized metatheory revisited". Journal of Automated Reasoning. 63 (3): 625–665. doi:10.1007/s10817-018-9483-3. S2CID 254605065.