S-expression

Last updated
Tree data structure representing the S-expression (* 2 (+ 3 4)) Corrected S-expression tree 2.svg
Tree data structure representing the S-expression (* 2 (+ 3 4))

In computer programming, an S-expression (or symbolic expression, abbreviated as sexpr or sexp) is an expression in a like-named notation for nested list (tree-structured) data. S-expressions were invented for and popularized by the programming language Lisp, which uses them for source code as well as data.

Contents

Characteristics

In the usual parenthesized syntax of Lisp, an S-expression is classically defined [1] as

  1. an atom of the form x, or
  2. an expression of the form (x . y) where x and y are S-expressions.

This definition reflects LISP's representation of a list as a series of "cells", each one an ordered pair. In plain lists, y points to the next cell (if any), thus forming a list. The recursive clause of the definition means that both this representation and the S-expression notation can represent any binary tree. However, the representation can in principle allow circular references, in which case the structure is not a tree at all, but a cyclic graph, and cannot be represented in classical S-expression notation unless a convention for cross-reference is provided (analogous to SQL foreign keys, SGML/XML IDREFs, etc.). Modern Lisp dialects such as Common Lisp [2] and Scheme [3] provide such syntax via datum labels, with which objects can be marked, which can then recur elsewhere, indicating shared rather than duplicated structure, enabling the reader or printer to detect and thus trigger evaluation or display of cycles without infinitely recursing

#n=(xy . #n#)

The definition of an atom varies per context; in the original definition by John McCarthy, [1] it was assumed that there existed "an infinite set of distinguishable atomic symbols" represented as "strings of capital Latin letters and digits with single embedded blanks" (a subset of character string and numeric literals).

Most modern sexpr notations allow more general quoted strings (for example including punctuation or full Unicode), and use an abbreviated notation to represent lists with more than 2 members, so that

(xyz)

stands for

(x . (y . (z . NIL)))

NIL is the special end-of-list object (alternatively written (), which is the only representation in Scheme [4] ).

In the Lisp family of programming languages, S-expressions are used to represent both source code and data. Other uses of S-expressions are in Lisp-derived languages such as DSSSL, and as mark-up in communication protocols like IMAP and John McCarthy's CBCL. It's also used as text representation of WebAssembly. The details of the syntax and supported data types vary in the different languages, but the most common feature among these languages is the use of S-expressions and prefix notation.

Datatypes and syntax

There are many variants of the S-expression format, supporting a variety of different syntaxes for different datatypes. The most widely supported are:

The character # is often used to prefix extensions to the syntax, e.g. #x10 for hexadecimal integers, or #\C for characters.

Use in Lisp

When representing source code in Lisp, the first element of an S-expression is commonly an operator or function name and any remaining elements are treated as arguments. This is called "prefix notation" or "Polish notation". As an example, the Boolean expression written 4 == (2 + 2) in C, is represented as (= 4 (+ 2 2)) in Lisp's s-expr-based prefix notation.

As noted above, the precise definition of "atom" varies across LISP-like languages. A quoted string can typically contain anything but a quote, while an unquoted identifier atom can typically contain anything but quotes, whitespace characters, parentheses, brackets, braces, backslashes, and semicolons. In either case, a prohibited character can typically be included by escaping it with a preceding backslash. Unicode support varies.

The recursive case of the s-expr definition is traditionally implemented using cons cells.

S-expressions were originally intended only for data to be manipulated by M-expressions, but the first implementation of Lisp was an interpreter of S-expression encodings of M-expressions, and Lisp programmers soon became accustomed to using S-expressions for both code and data. This means that Lisp is homoiconic; that is, the primary representation of programs is also a data structure in a primitive type of the language itself.

Nested lists can be written as S-expressions: ((milk juice) (honey marmalade)) is a two-element S-expression whose elements are also two-element S-expressions. The whitespace-separated notation used in Lisp (and this article) is typical. Line breaks (newline characters) usually qualify as separators. This is a simple context-free grammar for a tiny subset of English written as an S-expression (Gazdar/Melish, Natural Language Processing in Lisp), where S=sentence, NP=Noun Phrase, VP=Verb Phrase, V=Verb:

(((S)(NPVP))((VP)(V))((VP)(VNP))((V)died)((V)employed)((NP)nurses)((NP)patients)((NP)Medicenter)((NP)"Dr Chan"))

Program code can be written in S-expressions, usually using prefix notation. Example in Common Lisp:

(defunfactorial(x)(if(zeropx)1(*x(factorial(-x1)))))

S-expressions can be read in Lisp using the function READ. READ reads the textual representation of an S-expression and returns Lisp data. The function PRINT can be used to output an S-expression. The output then can be read with the function READ, when all printed data objects have a readable representation. Lisp has readable representations for numbers, strings, symbols, lists and many other data types. Program code can be formatted as pretty printed S-expressions using the function PPRINT (note: with two Ps, short for pretty-print).

Lisp programs are valid S-expressions, but not all S-expressions are valid Lisp programs. (1.0 + 3.1) is a valid S-expression, but not a valid Lisp program, since Lisp uses prefix notation and a floating point number (here 1.0) is not valid as an operation (the first element of the expression).

An S-expression preceded by a single quotation mark, as in 'x, is syntactic sugar for a quoted S-expression, in this case (quote x).

Parsing

S-expressions are often compared to XML: one key difference is that S-expressions have just one form of containment, the dotted pair, while XML tags can contain simple attributes, other tags, or CDATA, each using different syntax. Another is that S-expressions do not define a reference mechanism, where XML provides a notion of unique identifiers and references to them. For simple use cases, S-expressions are simpler than XML, but for more advanced use cases, XML has a query language so called XPath, many tools and third party libraries to simplify the handling of XML data.

Standardization

Standards for some Lisp-derived programming languages include a specification for their S-expression syntax. These include Common Lisp (ANSI standard document ANSI INCITS 226-1994 (R2004)), Scheme (R5RS and R6RS [5] ), and ISLISP.

In May 1997, Ron Rivest submitted an Internet Draft [6] to be considered for publication as an RFC. The draft defined a syntax based on Lisp S-expressions but intended for general-purpose data storage and exchange (similar to XML) rather than specifically for programming. It was never approved as an RFC, but it has since been cited and used by other RFCs (e.g. RFC 2693) and several other publications. [7] It was originally intended for use in SPKI.

Rivest's format defines an S-expression as being either an octet-string (a series of bytes) or a finite list of other S-expressions. It describes three interchange formats for expressing this structure. One is the "advanced transport", which is very flexible in terms of formatting, and is syntactically similar to Lisp-style expressions, but they are not identical. The advanced transport, for example, allows octet-strings to be represented verbatim (the string's length followed by a colon and the entire raw string), a quoted form allowing escape characters, hexadecimal, Base64, or placed directly as a "token" if it meets certain conditions. (Rivest's tokens differ from Lisp tokens in that the former are just for convenience and aesthetics, and treated exactly like other strings, while the latter have specific syntactical meaning.)

Rivest's draft defines a canonical representation "for digital signature purposes". It's intended to be compact, easier to parse, and unique for any abstract S-expression. It only allows verbatim strings, and prohibits whitespace as formatting outside strings. Finally there is the "basic transport representation", which is either the canonical form or the same encoded as Base64 and surrounded by braces, the latter intended to safely transport a canonically encoded S-expression in a system which might change spacing (e.g. an email system which has 80-character-wide lines and wraps anything longer than that).

This format has not been widely adapted for use outside of SPKI (some of the users being GnuPG, libgcrypt, Nettle, and GNU lsh). Rivest's S-expressions web page provides C source code for a parser and generator (available under the MIT license), which could be adapted and embedded into other programs. [8] In addition, there are no restrictions on independently implementing the format.

See also

Related Research Articles

<span class="mw-page-title-main">Dylan (programming language)</span> Multi-paradigm programming language

Dylan is a multi-paradigm programming language that includes support for functional and object-oriented programming (OOP), and is dynamic and reflective while providing a programming model designed to support generating efficient machine code, including fine-grained control over dynamic and static behaviors. It was created in the early 1990s by a group led by Apple Computer.

<span class="mw-page-title-main">Lisp (programming language)</span> Programming language family

Lisp is a family of programming languages with a long history and a distinctive, fully parenthesized prefix notation. Originally specified in 1960, it is the third-oldest high-level programming language still in common use, after Fortran and COBOL. Lisp has changed since its early days, and many dialects have existed over its history. Today, the best-known general-purpose Lisp dialects are Common Lisp, Scheme, Racket, and Clojure.

Rebol is a cross-platform data exchange language and a multi-paradigm dynamic programming language designed by Carl Sassenrath for network communications and distributed computing. It introduces the concept of dialecting: small, optimized, domain-specific languages for code and data, which is also the most notable property of the language according to its designer Carl Sassenrath:

Although it can be used for programming, writing functions, and performing processes, its greatest strength is the ability to easily create domain-specific languages or dialects

<span class="mw-page-title-main">Scheme (programming language)</span> Dialect of Lisp

Scheme is a dialect of the Lisp family of programming languages. Scheme was created during the 1970s at the MIT Computer Science and Artificial Intelligence Laboratory and released by its developers, Guy L. Steele and Gerald Jay Sussman, via a series of memos now known as the Lambda Papers. It was the first dialect of Lisp to choose lexical scope and the first to require implementations to perform tail-call optimization, giving stronger support for functional programming and associated techniques such as recursive algorithms. It was also one of the first programming languages to support first-class continuations. It had a significant influence on the effort that led to the development of Common Lisp.

In computing, serialization is the process of translating a data structure or object state into a format that can be stored or transmitted and reconstructed later. When the resulting series of bits is reread according to the serialization format, it can be used to create a semantically identical clone of the original object. For many complex objects, such as those that make extensive use of references, this process is not straightforward. Serialization of object-oriented objects does not include any of their associated methods with which they were previously linked.

<span class="mw-page-title-main">Standard Generalized Markup Language</span> Markup language

The Standard Generalized Markup Language is a standard for defining generalized markup languages for documents. ISO 8879 Annex A.1 states that generalized markup is "based on two postulates":

<span class="mw-page-title-main">XML</span> Markup language by the W3C for encoding of data

Extensible Markup Language (XML) is a markup language and file format for storing, transmitting, and reconstructing arbitrary data. It defines a set of rules for encoding documents in a format that is both human-readable and machine-readable. The World Wide Web Consortium's XML 1.0 Specification of 1998 and several other related specifications—all of them free open standards—define XML.

In computer science, Backus–Naur form is a notation used to describe the syntax of programming languages or other formal languages. It was developed by John Backus and Peter Naur. BNF can be described as a metasyntax notation for context-free grammars. Backus–Naur form is applied wherever exact descriptions of languages are needed, such as in official language specifications, in manuals, and in textbooks on programming language theory. BNF can be used to describe document formats, instruction sets, and communication protocols.

In computer science, extended Backus–Naur form (EBNF) is a family of metasyntax notations, any of which can be used to express a context-free grammar. EBNF is used to make a formal description of a formal language such as a computer programming language. They are extensions of the basic Backus–Naur form (BNF) metasyntax notation.

A string literal or anonymous string is a literal for a string value in the source code of a computer program. Modern programming languages commonly use a quoted sequence of characters, formally "bracketed delimiters", as in x = "foo", where "foo" is a string literal with value foo. Methods such as escape sequences can be used to avoid the problem of delimiter collision and allow the delimiters to be embedded in a string. There are many alternate notations for specifying string literals especially in complicated cases. The exact notation depends on the programming language in question. Nevertheless, there are general guidelines that most modern programming languages follow.

YAML(see § History and name) is a human-readable data serialization language. It is commonly used for configuration files and in applications where data is being stored or transmitted. YAML targets many of the same communications applications as Extensible Markup Language (XML) but has a minimal syntax that intentionally differs from Standard Generalized Markup Language (SGML). It uses Python-style indentation to indicate nesting and does not require quotes around most string values.

Pretty-printing is the application of any of various stylistic formatting conventions to text files, such as source code, markup, and similar kinds of content. These formatting conventions may entail adhering to an indentation style, using different color and typeface to highlight syntactic elements of source code, or adjusting size, to make the content easier for people to read, and understand. Pretty-printers for source code are sometimes called code formatters or beautifiers.

<span class="mw-page-title-main">Human-readable medium and data</span> Presentation of data for humans to read

In computing, a human-readable medium or human-readable format is any encoding of data or information that can be naturally read by humans, resulting in human-readable data. It is often encoded as ASCII or Unicode text, rather than as binary data.

In some programming languages, eval, short for the English evaluate, is a function which evaluates a string as though it were an expression in the language, and returns a result; in others, it executes multiple lines of code as though they had been included instead of the line including the eval. The input to eval is not necessarily a string; it may be structured representation of code, such as an abstract syntax tree, or of special type such as code. The analog for a statement is exec, which executes a string as if it were a statement; in some languages, such as Python, both are present, while in other languages only one of either eval or exec is.

<span class="mw-page-title-main">JSON</span> Open standard file format and data interchange

JSON is an open standard file format and data interchange format that uses human-readable text to store and transmit data objects consisting of attribute–value pairs and arrays. It is a commonly used data format with diverse uses in electronic data interchange, including that of web applications with servers.

<span class="mw-page-title-main">Syntax (programming languages)</span> Set of rules defining correctly structured programs

In computer science, the syntax of a computer language is the rules that define the combinations of symbols that are considered to be correctly structured statements or expressions in that language. This applies both to programming languages, where the document represents source code, and to markup languages, where the document represents data.

In computer programming, homoiconicity is a property of some programming languages. A language is homoiconic if a program written in it can be manipulated as data using the language. The program's internal representation can thus be inferred just by reading the program itself. This property is often summarized by saying that the language treats code as data.

This comparison of programming languages compares the features of language syntax (format) for over 50 computer programming languages.

A Canonical S-expression is a binary encoding form of a subset of general S-expression. It was designed for use in SPKI to retain the power of S-expressions and ensure canonical form for applications such as digital signatures while achieving the compactness of a binary form and maximizing the speed of parsing.

This is a comparison of data serialization formats, various ways to convert complex objects to sequences of bits. It does not include markup languages used exclusively as document file formats.

References

  1. 1 2 John McCarthy (1960/2006). Recursive functions of symbolic expressions Archived 2004-02-02 at the Wayback Machine . Originally published in Communications of the ACM.
  2. "Common Lisp HyperSpec: 22.4 - The Printer Dictionary: *PRINT-CIRCLE*". 2018-12-28.
  3. "Revised7 Report on the Algorithmic Language‌ Scheme: Section 2.4: Datum Labels" (PDF). 2013-07-06.
  4. "Revised^5 Report on the Algorithmic Language Scheme". schemers.org.
  5. Sperber, Michael; Dybvig, R. Kent; Flatt, Matthew; Van Straaten, Anton; Findler, Robby; Matthews, Jacob (Aug 12, 2009). "Revised6 Report on the Algorithmic Language Scheme". Journal of Functional Programming. 19 (S1): 1–301. CiteSeerX   10.1.1.372.373 . doi:10.1017/S0956796809990074. S2CID   267822156.
  6. S-expressions, Network Working Group, Internet Draft, Expires November 4, 1997 - R. Rivest, May 4, 1997 draft-rivest-sexp-00.txt, Ronald L. Rivest, CSAIL MIT website
  7. rivest sexp, Google Scholar (search)
  8. "SEXP (S-expressions)". people.csail.mit.edu. Archived from the original on 2023-02-23. Retrieved 2023-05-05.