Netstring

In computer programming, a netstring is a formatting method for byte strings that uses a declarative notation to indicate the size of the string. [1] [2]

Netstrings store the byte length of the data that follows, which makes it easier to pass text and byte data unambiguously between programs that might otherwise interpret certain values as delimiters or terminators (such as a null character).

The format consists of the string's length written using ASCII digits, followed by a colon, the byte data, and a comma. "Length" in this context means "number of 8-bit units", so if the string is, for example, encoded using UTF-8, this may or may not be identical to the number of textual characters that are present in the string.

For example, the text "hello world!" encodes as:

<31 32 3a 68 65 6c 6c 6f 20 77 6f 72 6c 64 21 2c>

i.e.

12:hello world!,

And an empty string as:

<30 3a 2c>

i.e.

0:,
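
The encoding described above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `encode_netstring` is hypothetical and does not come from any particular library. It also shows that the length counts bytes, not characters, when the text is UTF-8.

```python
def encode_netstring(data: bytes) -> bytes:
    """Return the netstring encoding of a byte string (hypothetical helper)."""
    # The length is the number of bytes, written as ASCII decimal digits,
    # followed by a colon, the raw bytes, and a trailing comma.
    return str(len(data)).encode("ascii") + b":" + data + b","


print(encode_netstring(b"hello world!"))  # b'12:hello world!,'
print(encode_netstring(b""))              # b'0:,'

# Byte length, not character count: "café" is 4 characters but 5 bytes in UTF-8.
print(encode_netstring("café".encode("utf-8")))  # b'5:caf\xc3\xa9,'
```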

The comma makes it slightly simpler for humans to read netstrings that are used as adjacent records, and provides weak verification of correct parsing. Note that without the comma, the format mirrors how Bencode encodes strings.

The length is written without leading zeroes, so the empty string is the only netstring that begins with zero. As a result, there is exactly one legal netstring encoding for any byte string.
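
A strict decoder can enforce these rules. The following Python sketch is one possible approach, not a reference implementation; `decode_netstring` is a hypothetical name, and the choice to raise `ValueError` on malformed input is an assumption made for illustration.

```python
def decode_netstring(buf: bytes) -> tuple[bytes, bytes]:
    """Parse one netstring from the front of buf; return (payload, rest)."""
    colon = buf.find(b":")
    if colon == -1:
        raise ValueError("missing ':' after length")
    digits = buf[:colon]
    # bytes.isdigit() is True only for a non-empty run of ASCII digits.
    if not digits.isdigit():
        raise ValueError("length must consist of ASCII digits")
    # Reject leading zeroes so that every byte string has a single encoding.
    if len(digits) > 1 and digits.startswith(b"0"):
        raise ValueError("leading zero in length")
    length = int(digits)
    end = colon + 1 + length
    if len(buf) <= end:
        raise ValueError("truncated netstring")
    # The trailing comma is a weak check that the length was honoured.
    if buf[end:end + 1] != b",":
        raise ValueError("missing trailing comma")
    return buf[colon + 1:end], buf[end + 1:]


print(decode_netstring(b"12:hello world!,"))  # (b'hello world!', b'')
print(decode_netstring(b"0:,rest"))           # (b'', b'rest')
```

With this sketch, an input such as `05:hello,` is rejected because of the leading zero, even though the length and comma would otherwise match.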

Because the format is easy to generate and to parse, it is easy to support in programs written in different programming languages. In practice, netstrings are often used to simplify the exchange of byte strings, or lists of byte strings. For example, see its use in the Simple Common Gateway Interface (SCGI) and the Quick Mail Queuing Protocol (QMQP).

Netstrings avoid complications that arise in trying to embed arbitrary data in delimited formats. For example, XML may not contain certain byte values and requires a nontrivial combination of escaping and delimiting, while generating multipart MIME messages involves choosing a delimiter that must not clash with the content of the data.

Netstrings can be stored recursively. The result of encoding a sequence of strings is a single string. Rewriting the above "hello world!" example to instead be a sequence of two netstrings, itself encoded as a single netstring, gives the following:

17:5:hello,6:world!,,
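
The nested example above can be reproduced with a short Python sketch. All function names here are hypothetical, and the decoder is deliberately simplified (for instance, it does not reject leading zeroes in the length).

```python
def encode_netstring(data: bytes) -> bytes:
    return str(len(data)).encode("ascii") + b":" + data + b","


def decode_netstring(buf: bytes) -> tuple[bytes, bytes]:
    """Parse one netstring from the front of buf; return (payload, rest)."""
    digits, sep, tail = buf.partition(b":")
    if not sep or not digits.isdigit():
        raise ValueError("malformed length prefix")
    length = int(digits)
    if tail[length:length + 1] != b",":
        raise ValueError("missing trailing comma")
    return tail[:length], tail[length + 1:]


def decode_sequence(buf: bytes) -> list[bytes]:
    """Interpret buf as zero or more adjacent netstrings."""
    items = []
    while buf:
        item, buf = decode_netstring(buf)
        items.append(item)
    return items


# Concatenate two netstrings, then wrap the result in an outer netstring.
inner = encode_netstring(b"hello") + encode_netstring(b"world!")
outer = encode_netstring(inner)
print(outer)  # b'17:5:hello,6:world!,,'

payload, rest = decode_netstring(outer)
print(payload)                   # b'5:hello,6:world!,'
print(decode_sequence(payload))  # [b'hello', b'world!']
```

Whether `payload` is treated as an opaque string or passed on to `decode_sequence` is entirely the caller's decision; nothing in the encoding itself distinguishes the two cases.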

Parsing such a nested netstring is an example of duck typing, since the contained string ("5:hello,6:world!,") is both a string and a sequence of netstrings. Its effective type is determined by how the application chooses to interpret it, not by any explicit type declaration required by the netstring specification. In general, there are 3 ways that a program expecting a netstring may choose to interpret its contents:

Note that because netstrings place no restrictions on the contents of the data they store, they cannot be embedded verbatim in most delimited formats without the possibility of interfering with the delimiting of the containing format.

In the context of network programming it is potentially useful that the receiving program is informed of the size of the data that follows, as it can allocate exactly enough memory, avoid the need for reallocation to accommodate more data, and preemptively reject data that would exceed size limits.
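
As a rough illustration of this point, the Python sketch below reads one netstring from a byte stream and rejects an oversized payload before reading any of it. The function name `read_netstring` and the `MAX_LEN` limit are assumptions chosen for the example, and `io.BytesIO` stands in for a buffered network stream whose `read(n)` returns fewer than `n` bytes only at end of input.

```python
import io

MAX_LEN = 1024  # hypothetical application-level size limit, in bytes


def read_netstring(stream, max_len: int = MAX_LEN) -> bytes:
    """Read exactly one netstring from a binary stream and return its payload."""
    digits = b""
    while True:
        ch = stream.read(1)
        if ch == b":":
            break
        # b"".isdigit() is False, so end of input is rejected here as well.
        if not ch.isdigit():
            raise ValueError("expected an ASCII digit or ':'")
        digits += ch
        # Stop early if the prefix alone already implies an oversized payload.
        if len(digits) > len(str(max_len)):
            raise ValueError("length prefix too long")
    if not digits or (len(digits) > 1 and digits.startswith(b"0")):
        raise ValueError("invalid length prefix")
    length = int(digits)
    if length > max_len:
        raise ValueError("payload exceeds size limit")
    # The size is known up front, so a single exact-size read is enough.
    payload = stream.read(length)
    if len(payload) != length:
        raise ValueError("truncated payload")
    if stream.read(1) != b",":
        raise ValueError("missing trailing comma")
    return payload


stream = io.BytesIO(b"5:hello,6:world!,")
print(read_netstring(stream))  # b'hello'
print(read_netstring(stream))  # b'world!'
```

A raw socket may return fewer bytes than requested from a single receive call, so a real implementation would either wrap the socket in a buffered reader or loop until the full payload has arrived.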

Notes and references

  1. Defined in a document by D. J. Bernstein.
  2. See e.g. Steve Holden and David M. Beazley, Python Web Programming, Sams Publishing, 2002, ISBN 0-7357-1090-2, 978-0-7357-1090-0, 691 pages, p. 202.
  3. Caolan McMahon. "Bencoding".
  4. "TNetstrings Specification". Archived from the original on 2014-02-10.
  5. tnetstring-rb
  6. "tnetstring: the tagged netstring specification"
  7. "tnetstring: data serialization using typed netstrings"
