Canonical S-expressions

Last updated

A Canonical S-expression (or csexp) is a binary encoding form of a subset of general S-expression (or sexp). It was designed for use in SPKI to retain the power of S-expressions and ensure canonical form for applications such as digital signatures while achieving the compactness of a binary form and maximizing the speed of parsing.

Contents

The particular subset of general S-expressions applicable here is composed of atoms, which are byte strings, and parentheses used to delimit lists or sub-lists. These S-expressions are fully recursive.

While S-expressions are typically encoded as text, with spaces delimiting atoms and quotation marks used to surround atoms that contain spaces, when using the canonical encoding each atom is encoded as a length-prefixed byte string. No whitespace separating adjacent elements in a list is permitted. The length of an atom is expressed as an ASCII decimal number followed by a ":".

Example

The sexp

(this "Canonical S-expression" has 5 atoms)

becomes the csexp

(4:this22:Canonical S-expression3:has1:55:atoms)

No quotation marks are required to escape the space character internal to the atom "Canonical S-expression", because the length prefix clearly points to the end of the atom. There is no whitespace separating an atom from the next element in the list.

Properties

Interpretation and restrictions

While csexps generally permit empty lists, empty atoms, and so forth, certain uses of csexps impose additional restrictions. For example, csexps as used in SPKI have one limitation compared to csexps in general: every list must start with an atom, and therefore there can be no empty lists.

Typically, a list's first atom is treated as one treats an element name in XML.

Comparisons to other encodings

There are other encodings in common use:

  1. XML
  2. ASN.1
  3. JSON (and YAML that includes "JSON as an official subset", with the superset, meant to be more human-readable.)

Generally, csexp has a parser one or two decimal orders of magnitude smaller than that of either XML or ASN.1.[ citation needed ] This small size and corresponding speed[ citation needed ] give csexp its main advantage. In addition to the parsing advantage, there are other differences.

csexp vs. XML

csexp and XML differ in that csexp is a data-representation format, while XML includes a data-representation format and also a schema mechanism. Thus, XML can be "configured" for particular kinds of data, which conform to some grammar (say, HTML, ATOM, SVG, MathML, or new ones as needed). It has languages for defining document grammars: DTD is defined by the XML standard itself, while XSD, RelaxNG, and Schematron are commonly used with XML for additional features, and XML can also work with no schema. csexp data can of course be operated on by schemas implemented at a higher level, but provides no such mechanism itself.

In terms of characters and bytes, a csexp "string" may have any byte sequence whatsoever (because of the length prefix on each atom), while XML (like regular Lisp S-expressions, JSON, and literals in programming languages), requires alternate representations for a few characters (such as "<" and most control characters). This, however, has no effect on the range of structures and semantics that can be represented. XML also provides mechanisms to specify how a given byte sequence is intended to be interpreted: Say, as a Unicode UTF-8 string, a JPEG file, or an integer; csexp leaves such distinctions to external mechanisms.

At the most basic level, both csexp and XML represent trees (as do most other external representations). This is not surprising, since XML can be described as a differently-punctuated form for LISP-like S-expressions, or vice versa. [1]

However, XML includes additional semantics, which are commonly achieved in csexp by various conventions rather than as part of the language. First, every XML element has a name (csexp applications commonly use the first child of each expression for this). Second, XML provide datatyping, firstly via the schema grammar. A schema can also, however, distinguish integers, strings, data objects with types (e.g. JPEG) and (especially with XSD) other types).

An XML element may also have attributes, a construct that csexp does not share. To represent XML data in csexp, one must choose a representation for such attributes; an obvious one is to reserve the second item in each S-expression for a list of (name value) pairs, analogous to the LISP association list. The XML ID and IDREF attributes have no equivalent in csexp, but can be easily implemented by a csexp application program.

Finally, an XML element may contain comments and/or processing instructions. csexp has no specific equivalents, but they are trivial to represent, merely by reserving a name for each. For example, naming them "*COM" and "*PI" (the "*" prevents ever colliding with XML element type names):

(4:*COM15:Text of comment) (3:*PI6:target11:font="helv")

Both csexp and XML are fully recursive.

The first atom in a csexp list, by convention roughly corresponds to an XML element type name in identifying the "type" of the list. However, in csexp this can be any atom in any encoding (e.g., a JPEG, a Unicode string, a WAV file, …), while XML element names are identifiers, constrained to certain characters, like programming-language identifiers. csexp's method is obviously more general; on the other hand, Identifying what encoding such an item is in, and thus how to interpret it, is determined only by a particular user's conventions, meaning that a csexp application must build such conventions for itself, in code, documentation, and so forth.

Similarly, csexp atoms are binary (consisting of a length prefix followed by totally arbitrary bytes), while XML is designed to be human-readable (while arguably less so than JSON or YAML)  so arbitrary bytes in XML must be encoded somehow (for example, a bitmapped image can be included using base64). This means that storing large amounts of non-readable information in uncompressed XML takes more space; on the other hand, it will survive translation between alternate character sets (including transmission through network hosts that may apply differing character sets, line-end conventions, etc.).

It has been suggested that XML "merges" a sequence of strings within one element into a single string, while csexp allows a sequence of atoms within a list and those atoms remain separate from one another; but this is incorrect. [2] Exactly like S-expressions and csexp, XML has a notion of a "sequence of strings" only if the "strings" are separated somehow:

<s>StringA&lt;/s><s>StringB</s>versus <s>StringAStringB</s>
("String A" "String B")  versus ("String AString B")
(8:String A8:String B)  versus (16:String AString B)

csexp vs. ASN.1

ASN.1 is a popular binary encoding form. However, it expresses only syntax (data types), not semantics. Two different structures  each a SEQUENCE of two INTEGERS  have identical representations on the wire (barring special tag choices to distinguish them). To parse an ASN.1 structure, one must tell the parser what set of structures one is expecting and the parser must match the data type being parsed against the structure options. This adds to the complexity of an ASN.1 parser.

A csexp structure carries some indication of its own semantics (encoded in element names), and the parser for a csexp structure does not care what structure is being parsed. Once a wire-format expression has been parsed into an internal tree form (similar to XML's DOM), the consumer of that structure can examine it for conformance to what was expected. An XML document without a schema works just like csexp in this respect, while an XML document with them can work more like ASN.1.


Notes and references

  1. This similarity was known to the framers of XML. For example, Steven DeRose discussed it in The SGML FAQ Book: Understanding the Relationship of SGML and XML, Kluwer Academic Publishers, 1997. ISBN   978-0-7923-9943-8.
  2. The SAX interface to XML allows an XML parser to break up (single) text strings any way it likes. Some implementations[ citation needed ] (incorrectly) return multiple lines as single text nodes, which may have led to this common misunderstanding.

Related Research Articles

<span class="mw-page-title-main">String (computer science)</span> Sequence of characters, data type

In computer programming, a string is traditionally a sequence of characters, either as a literal constant or as some kind of variable. The latter may allow its elements to be mutated and the length changed, or it may be fixed. A string is generally considered as a data type and is often implemented as an array data structure of bytes that stores a sequence of elements, typically characters, using some character encoding. String may also denote more general arrays or other sequence data types and structures.

In computing, serialization is the process of translating a data structure or object state into a format that can be stored or transmitted and reconstructed later. When the resulting series of bits is reread according to the serialization format, it can be used to create a semantically identical clone of the original object. For many complex objects, such as those that make extensive use of references, this process is not straightforward. Serialization of objects does not include any of their associated methods with which they were previously linked.

<span class="mw-page-title-main">XML</span> Markup language by the W3C for encoding of data

Extensible Markup Language (XML) is a markup language and file format for storing, transmitting, and reconstructing arbitrary data. It defines a set of rules for encoding documents in a format that is both human-readable and machine-readable. The World Wide Web Consortium's XML 1.0 Specification of 1998 and several other related specifications—all of them free open standards—define XML.

<span class="mw-page-title-main">S-expression</span> Data serialization format

In computer programming, an S-expression is an expression in a like-named notation for nested list (tree-structured) data. S-expressions were invented for and popularized by the programming language Lisp, which uses them for source code as well as data.

Abstract Syntax Notation One (ASN.1) is a standard interface description language (IDL) for defining data structures that can be serialized and deserialized in a cross-platform way. It is broadly used in telecommunications and computer networking, and especially in cryptography.

YAML is a human-readable data serialization language. It is commonly used for configuration files and in applications where data are being stored or transmitted. YAML targets many of the same communications applications as Extensible Markup Language (XML) but has a minimal syntax that intentionally differs from Standard Generalized Markup Language (SGML). It uses Python-style indentation to indicate nesting and does not require quotes around most string values.

In computer science, canonicalization is a process for converting data that has more than one possible representation into a "standard", "normal", or canonical form. This can be done to compare different representations for equivalence, to count the number of distinct data structures, to improve the efficiency of various algorithms by eliminating repeated calculations, or to make it possible to impose a meaningful sorting order.

Bencode is the encoding used by the peer-to-peer file sharing system BitTorrent for storing and transmitting loosely structured data.

<span class="mw-page-title-main">JSON</span> Open standard file format and data interchange

JSON is an open standard file format and data interchange format that uses human-readable text to store and transmit data objects consisting of attribute–value pairs and arrays. It is a commonly used data format with diverse uses in electronic data interchange, including that of web applications with servers.

Fast Infoset is an international standard that specifies a binary encoding format for the XML Information Set as an alternative to the XML document format. It aims to provide more efficient serialization than the text-based XML format.

In the macOS, iOS, NeXTSTEP, and GNUstep programming frameworks, property list files are files that store serialized objects. Property list files use the filename extension .plist, and thus are often referred to as p-list files.

Within communication protocols, TLV is an encoding scheme used for informational elements. A TLV-encoded data stream contains code related to the record type, the record value's length, and finally the value itself.

Data exchange is the process of taking data structured under a source schema and transforming it into a target schema, so that the target data is an accurate representation of the source data. Data exchange allows data to be shared between different computer programs.

BSON is a computer data interchange format. The name "BSON" is based on the term JSON and stands for "Binary JSON". It is a binary form for representing simple or complex data structures including associative arrays, integer indexed arrays, and a suite of fundamental scalar types. BSON originated in 2009 at MongoDB. Several scalar data types are of specific interest to MongoDB and the format is used both as a data storage and network transfer format for the MongoDB database, but it can be used independently outside of MongoDB. Implementations are available in a variety of languages such as C, C++, C#, D, Delphi, Erlang, Go, Haskell, Java, JavaScript, Julia, Lua, OCaml, Perl, PHP, Python, Ruby, Rust, Scala, Smalltalk, and Swift.

This is a comparison of data serialization formats, various ways to convert complex objects to sequences of bits. It does not include markup languages used exclusively as document file formats.

X.690 is an ITU-T standard specifying several ASN.1 encoding formats:

Data Format Description Language is a modeling language for describing general text and binary data in a standard way. It was published as an Open Grid Forum Recommendation in February 2021, and in April 2024 was published as an ISO standard.

Universal Binary JSON (UBJSON) is a computer data interchange format. It is a binary form directly imitating JSON, but requiring fewer bytes of data. It aims to achieve the generality of JSON, combined with being much easier to process than JSON.

JSONiq is a query and functional programming language that is designed to declaratively query and transform collections of hierarchical and heterogeneous data in format of JSON, XML, as well as unstructured, textual data.

Ion is a data serialization language developed by Amazon. It may be represented by either a human-readable text form or a compact binary form. The text form is a superset of JSON; thus, any valid JSON document is also a valid Ion document.