Camlp4

Last updated

Camlp4 is a software system for writing extensible parsers for programming languages. It provides a set of OCaml libraries that are used to define grammars as well as loadable syntax extensions of such grammars. Camlp4 stands for Caml Preprocessor and Pretty-Printer and one of its most important applications was the definition of domain-specific extensions of the syntax of OCaml.

Contents

Camlp4 was part of the official OCaml distribution which is developed at the INRIA. Its original author is Daniel de Rauglaudre. OCaml version 3.10.0, released in May 2007, introduced a significantly modified and backward-incompatible version of Camlp4. De Rauglaudre maintains a separate backward-compatible version, which has been renamed Camlp5. All of the examples below are for Camlp5 or the previous version of Camlp4 (versions 3.09 and prior).

Version 4.08, released in the summer of 2019, [1] was the last official version of this library. It is currently deprecated; [2] instead, it is recommended to use the PPX (PreProcessor eXtensions) [3] [4] libraries. [5]

Concrete and abstract syntax

A Camlp4 preprocessor operates by loading a collection of compiled modules which define a parser as well as a pretty-printer: the parser converts an input program into an internal representation. This internal representation constitutes the abstract syntax tree (AST). It can be output in a binary form, e.g. it can be passed directly to one of the OCaml compilers, or it can be converted back into a clear text program. The notion of concrete syntax refers to the format in which the abstract syntax is represented.

For instance, the OCaml expression (1 + 2) can also be written ((+) 1 2) or (((+) 1) 2). The difference is only at the level of the concrete syntax, since these three versions are equivalent representations of the same abstract syntax tree. As demonstrated by the definition of a revised syntax for OCaml, the same programming language can use different concrete syntaxes. They would all converge to an abstract syntax tree in a unique format that a compiler can handle.

The abstract syntax tree is at the center of the syntax extensions, which are in fact OCaml programs. Although the definition of grammars must be done in OCaml, the parser that is being defined or extended is not necessarily related to OCaml, in which case the syntax tree that is being manipulated is not the one of OCaml. Several libraries are provided which facilitate the specific manipulation of OCaml syntax trees.

Fields of application

Domain-specific languages are a major application of Camlp4. Since OCaml is a multi-paradigm language, with an interactive toplevel and a native code compiler, it can be used as a backend for any kind of original language. The only thing that the developer has to do is write a Camlp4 grammar which converts the domain-specific language in question into a regular OCaml program. Other target languages can also be used, such as C.

If the target language is OCaml, simple syntax add-ons or syntactic sugar can be defined, in order to provide an expressivity which is not easy to achieve using the standard features of the OCaml language. A syntax extension is defined by a compiled OCaml module, which is passed to the camlp4o executable along with the program to process.

Camlp4 includes a domain-specific language as it provides syntax extensions which ease the development of syntax extensions. These extensions allow a compact definition of grammars (EXTEND statements) and quotations such as <:expr< 1 + 1 >>, i.e. deconstructing and constructing abstract syntax trees in concrete syntax.

Example

The following example defines a syntax extension of OCaml. It provides a new keyword, memo, which can be used as a replacement for function and provides automatic memoization of functions with pattern matching. Memoization consists in storing the results of previous computations in a table so that the actual computation of the function for each possible argument occurs at most once.

This is pa_memo.ml, the file which defines the syntax extension:

letunique=letn=ref0infun()->incrn;"__pa_memo"^string_of_int!nEXTENDGLOBAL:Pcaml.expr;Pcaml.expr:LEVEL"expr1"[["memo";OPT"|";pel=LIST1match_caseSEP"|"->lettbl=unique()inletx=unique()inletresult=unique()in<:expr<let$lid:tbl$=Hashtbl.create100infun$lid:x$->tryHashtbl.find$lid:tbl$$lid:x$with[Not_found->let$lid:result$=match$lid:x$with[$list:pel$]indo{Hashtbl.replace$lid:tbl$$lid:x$$lid:result$;$lid:result$}]>>]];match_case:[[p=Pcaml.patt;w=OPT["when";e=Pcaml.expr->e];"->";e=Pcaml.expr->(p,w,e)]];END

Example of program using this syntax extension:

letcounter=ref0(* global counter of multiplications *)(* factorial with memoization *)letrecfac=memo0->1|nwhenn>0->(incrcounter;n*fac(n-1))|_->invalid_arg"fac"letrunn=letresult=facninletcount=!counterinPrintf.printf"%i! = %i     number of multiplications so far = %i\n"nresultcountlet_=List.iterrun[5;4;6]

The output of the program is as follows, showing that the fac function (factorial) only computes products that were not computed previously:

5! = 120     number of multiplications so far = 5 4! = 24     number of multiplications so far = 5 6! = 720     number of multiplications so far = 6

Related Research Articles

<span class="mw-page-title-main">Context-free grammar</span> Type of formal grammar

In formal language theory, a context-free grammar (CFG) is a formal grammar whose production rules can be applied to a nonterminal symbol regardless of its context. In particular, in a context-free grammar, each production rule is of the form

ML is a functional programming language. It is known for its use of the polymorphic Hindley–Milner type system, which automatically assigns the data types of most expressions without requiring explicit type annotations, and ensures type safety; there is a formal proof that a well-typed ML program does not cause runtime type errors. ML provides pattern matching for function arguments, garbage collection, imperative programming, call-by-value and currying. While a general-purpose programming language, ML is used heavily in programming language research and is one of the few languages to be completely specified and verified using formal semantics. Its types and pattern matching make it well-suited and commonly used to operate on other formal languages, such as in compiler writing, automated theorem proving, and formal verification.

Yacc is a computer program for the Unix operating system developed by Stephen C. Johnson. It is a lookahead left-to-right rightmost derivation (LALR) parser generator, generating a LALR parser based on a formal grammar, written in a notation similar to Backus–Naur form (BNF). Yacc is supplied as a standard utility on BSD and AT&T Unix. GNU-based Linux distributions include Bison, a forward-compatible Yacc replacement.

GNU Bison, commonly known as Bison, is a parser generator that is part of the GNU Project. Bison reads a specification in Bison syntax, warns about any parsing ambiguities, and generates a parser that reads sequences of tokens and decides whether the sequence conforms to the syntax specified by the grammar.

In computer science, the Cocke–Younger–Kasami algorithm is a parsing algorithm for context-free grammars published by Itiroo Sakai in 1961. The algorithm is named after some of its rediscoverers: John Cocke, Daniel Younger, Tadao Kasami, and Jacob T. Schwartz. It employs bottom-up parsing and dynamic programming.

In computer science, Backus–Naur form is a notation used to describe the syntax of programming languages or other formal languages. It was developed by John Backus and Peter Naur. BNF can be described as a metasyntax notation for context-free grammars. Backus–Naur form is applied wherever exact descriptions of languages are needed, such as in official language specifications, in manuals, and in textbooks on programming language theory. BNF can be used to describe document formats, instruction sets, and communication protocols.

In computer science, a compiler-compiler or compiler generator is a programming tool that creates a parser, interpreter, or compiler from some form of formal description of a programming language and machine.

In computer science, a preprocessor is a program that processes its input data to produce output that is used as input in another program. The output is said to be a preprocessed form of the input data, which is often used by some subsequent programs like compilers. The amount and kind of processing done depends on the nature of the preprocessor; some preprocessors are only capable of performing relatively simple textual substitutions and macro expansions, while others have the power of full-fledged programming languages.

An attribute grammar is a formal way to supplement a formal grammar with semantic information processing. Semantic information is stored in attributes associated with terminal and nonterminal symbols of the grammar. The values of attributes are result of attribute evaluation rules associated with productions of the grammar. Attributes allow to transfer information from anywhere in the abstract syntax tree to anywhere else, in a controlled and formal way.

In computing, memoization or memoisation is an optimization technique used primarily to speed up computer programs by storing the results of expensive function calls to pure functions and returning the cached result when the same inputs occur again. Memoization has also been used in other contexts, such as in simple mutually recursive descent parsing. It is a type of caching, distinct from other forms of caching such as buffering and page replacement. In the context of some logic programming languages, memoization is also known as tabling.

In computer-based language recognition, ANTLR, or ANother Tool for Language Recognition, is a parser generator that uses a LL(*) algorithm for parsing. ANTLR is the successor to the Purdue Compiler Construction Tool Set (PCCTS), first developed in 1989, and is under active development. Its maintainer is Professor Terence Parr of the University of San Francisco.

In computer science, a parsing expression grammar (PEG) is a type of analytic formal grammar, i.e. it describes a formal language in terms of a set of rules for recognizing strings in the language. The formalism was introduced by Bryan Ford in 2004 and is closely related to the family of top-down parsing languages introduced in the early 1970s. Syntactically, PEGs also look similar to context-free grammars (CFGs), but they have a different interpretation: the choice operator selects the first match in PEG, while it is ambiguous in CFG. This is closer to how string recognition tends to be done in practice, e.g. by a recursive descent parser.

<span class="mw-page-title-main">Syntax (programming languages)</span> Set of rules defining correctly structured programs

In computer science, the syntax of a computer language is the rules that define the combinations of symbols that are considered to be correctly structured statements or expressions in that language. This applies both to programming languages, where the document represents source code, and to markup languages, where the document represents data.

META II is a domain-specific programming language for writing compilers. It was created in 1963–1964 by Dewey Val Schorre at UCLA. META II uses what Schorre called syntax equations. Its operation is simply explained as:

Each syntax equation is translated into a recursive subroutine which tests the input string for a particular phrase structure, and deletes it if found.

The Parser Grammar Engine is a compiler and runtime for Raku rules for the Parrot virtual machine. PGE uses these rules to convert a parsing expression grammar into Parrot bytecode. It is therefore compiling rules into a program, unlike most virtual machines and runtimes, which store regular expressions in a secondary internal format that is then interpreted at runtime by a regular expression engine. The rules format used by PGE can express any regular expression and most formal grammars, and as such it forms the first link in the compiler chain for all of Parrot's front-end languages.

Compiler Description Language (CDL) is a programming language based on affix grammars. It is very similar to Backus–Naur form (BNF) notation. It was designed for the development of compilers. It is very limited in its capabilities and control flow, and intentionally so. The benefits of these limitations are twofold.

Seed7 is an extensible general-purpose programming language designed by Thomas Mertes. It is syntactically similar to Pascal and Ada. Along with many other features, it provides an extension mechanism. Seed7 supports introducing new syntax elements and their semantics into the language, and allows new language constructs to be defined and written in Seed7. For example, programmers can introduce syntax and semantics of new statements and user defined operator symbols. The implementation of Seed7 differs significantly from that of languages with hard-coded syntax and semantics.

Tcl is a high-level, general-purpose, interpreted, dynamic programming language. It was designed with the goal of being very simple but powerful. Tcl casts everything into the mold of a command, even programming constructs like variable assignment and procedure definition. Tcl supports multiple programming paradigms, including object-oriented, imperative, functional, and procedural styles.

OMeta is a specialized object-oriented programming language for pattern matching, developed by Alessandro Warth and Ian Piumarta in 2007 under the Viewpoints Research Institute. The language is based on parsing expression grammar (PEGs) rather than context-free grammar with the intent of providing "a natural and convenient way for programmers to implement tokenizers, parsers, visitors, and tree-transformers".

<span class="mw-page-title-main">Nim (programming language)</span> Programming language

Nim is a general-purpose, multi-paradigm, statically typed, compiled high-level systems programming language, designed and developed by a team around Andreas Rumpf. Nim is designed to be "efficient, expressive, and elegant", supporting metaprogramming, functional, message passing, procedural, and object-oriented programming styles by providing several features such as compile time code generation, algebraic data types, a foreign function interface (FFI) with C, C++, Objective-C, and JavaScript, and supporting compiling to those same languages as intermediate representations.

References

  1. "ocaml/camlp4". GitHub. Retrieved 2020-02-04.
  2. Dimino, Jérémie (2019-08-07). "The end of Camlp4". OCaml. Archived from the original on 2020-02-04. Retrieved 2020-02-04.
  3. "PPX". ocamllabs.io. Retrieved 2020-02-04.
  4. Metzger, Perry. "A Guide to PreProcessor eXtensions". OCamlverse. Archived from the original on 2020-02-05. Retrieved 2020-02-05.
  5. Dimino, Jeremie. "Converting a code base from camlp4 to ppx". Jane Street Tech Blog. Archived from the original on 2020-02-04. Retrieved 2020-02-04.