Netstring

Last updated

In computer programming, a netstring is a formatting method for byte strings that uses a declarative notation to indicate the size of the string. [1] [2]

Contents

Netstrings store the byte length of the data that follows, making it easier to unambiguously pass text and byte data between programs that could be sensitive to values that could be interpreted as delimiters or terminators (such as a null character).

The format consists of the string's length written using ASCII digits, followed by a colon, the byte data, and a comma. "Length" in this context means "number of 8-bit units", so if the string is, for example, encoded using UTF-8, this may or may not be identical to the number of textual characters that are present in the string.

For example, the text "hello world!" encodes as:

<31 323a68 65 6c 6c 6f 20 77 6f 72 6c 64 212c>

i.e.

12:hello world!,

And an empty string as:

<303a2c>

i.e.

0:,

The comma makes it slightly simpler for humans to read netstrings that are used as adjacent records, and provides weak verification of correct parsing. Note that without the comma, the format mirrors how Bencode encodes strings.

The length is written without leading zeroes. Empty string is the only netstring that begins with zero. There is exactly one legal netstring encoding for any byte string.

Since the format is easy to generate and to parse, it is easy to support by programs written in different programming languages. In practice, netstrings are often used to simplify exchange of bytestrings, or lists of bytestrings. For example, see its use in the Simple Common Gateway Interface (SCGI) and the Quick Mail Queuing Protocol (QMQP) .

Netstrings avoid complications that arise in trying to embed arbitrary data in delimited formats. For example, XML may not contain certain byte values and requires a nontrivial combination of escaping and delimiting, while generating multipart MIME messages involves choosing a delimiter that must not clash with the content of the data.

Netstrings can be stored recursively. The result of encoding a sequence of strings is a single string. Rewriting the above "hello world!" example to instead be a sequence of two netstrings, itself encoded as a single netstring, gives the following:

17:5:hello,6:world!,,

Parsing such a nested netstring is an example of duck typing, since the contained string ("5:hello,6:world!,") is both a string and a sequence of netstrings. Its effective type is determined by how the application chooses to interpret it, not by any explicit type declaration required by the netstring specification. In general, there are 3 ways that a program expecting a netstring may choose to interpret its contents:

Note that since netstrings pose no limitations on the contents of the data they store, netstrings can not be embedded verbatim in most delimited formats without the possibility of interfering with the delimiting of the containing format.

In the context of network programming it is potentially useful that the receiving program is informed of the size of the data that follows, as it can allocate exactly enough memory, avoid the need for reallocation to accommodate more data, and preemptively reject data that would exceed size limits.

See also

Notes and references

  1. defined in a document by D. J. Bernstein.
  2. See e.g. Python Web Programming By Steve Holden, David M. Beazley Published by Sams Publishing, 2002 ISBN   0-7357-1090-2, 978-0-7357-1090-0 691 pages, page 202.
  3. Caolan McMahon. "Bencoding".
  4. "TNetstrings Specification". Archived from the original on 2024-05-11.
  5. tnetstring-rb
  6. "tnetstring: the tagged netstring specification"
  7. "tnetstring: data serialization using typed netstrings"

Related Research Articles

<span class="mw-page-title-main">String (computer science)</span> Sequence of characters, data type

In computer programming, a string is traditionally a sequence of characters, either as a literal constant or as some kind of variable. The latter may allow its elements to be mutated and the length changed, or it may be fixed. A string is generally considered as a data type and is often implemented as an array data structure of bytes that stores a sequence of elements, typically characters, using some character encoding. String may also denote more general arrays or other sequence data types and structures.

<span class="mw-page-title-main">XML</span> Markup language by the W3C for encoding of data

Extensible Markup Language (XML) is a markup language and file format for storing, transmitting, and reconstructing arbitrary data. It defines a set of rules for encoding documents in a format that is both human-readable and machine-readable. The World Wide Web Consortium's XML 1.0 Specification of 1998 and several other related specifications—all of them free open standards—define XML.

Abstract Syntax Notation One (ASN.1) is a standard interface description language (IDL) for defining data structures that can be serialized and deserialized in a cross-platform way. It is broadly used in telecommunications and computer networking, and especially in cryptography.

YAML is a human-readable data serialization language. It is commonly used for configuration files and in applications where data are being stored or transmitted. YAML targets many of the same communications applications as Extensible Markup Language (XML) but has a minimal syntax that intentionally differs from Standard Generalized Markup Language (SGML). It uses Python-style indentation to indicate nesting and does not require quotes around most string values.

<span class="mw-page-title-main">Comma-separated values</span> File format used to store data

Comma-separated values (CSV) is a text file format that uses commas to separate values, and newlines to separate records. A CSV file stores tabular data in plain text, where each line of the file typically represents one data record. Each record consists of the same number of fields, and these are separated by commas in the CSV file. If the field delimiter itself may appear within a field, fields can be surrounded with quotation marks.

<span class="mw-page-title-main">Delimiter</span> Characters that specify the boundary between regions in a data stream

A delimiter is a sequence of one or more characters for specifying the boundary between separate, independent regions in plain text, mathematical expressions or other data streams. An example of a delimiter is the comma character, which acts as a field delimiter in a sequence of comma-separated values. Another example of a delimiter is the time gap used to separate letters and words in the transmission of Morse code.

<span class="mw-page-title-main">Flat-file database</span> Database stored as an ordinary unstructured file

A flat-file database is a database stored in a file called a flat file. Records follow a uniform format, and there are no structures for indexing or recognizing relationships between records. The file is simple. A flat file can be a plain text file, or a binary file. Relationships can be inferred from the data in the database, but the database format itself does not make those relationships explicit.

Bencode is the encoding used by the peer-to-peer file sharing system BitTorrent for storing and transmitting loosely structured data.

<span class="mw-page-title-main">JSON</span> Open standard file format and data interchange

JSON is an open standard file format and data interchange format that uses human-readable text to store and transmit data objects consisting of attribute–value pairs and arrays. It is a commonly used data format with diverse uses in electronic data interchange, including that of web applications with servers.

Snappy is a fast data compression and decompression library written in C++ by Google based on ideas from LZ77 and open-sourced in 2011. It does not aim for maximum compression, or compatibility with any other compression library; instead, it aims for very high speeds and reasonable compression. Compression speed is 250 MB/s and decompression speed is 500 MB/s using a single core of a circa 2011 "Westmere" 2.26 GHz Core i7 processor running in 64-bit mode. The compression ratio is 20–100% lower than gzip.

In the macOS, iOS, NeXTSTEP, and GNUstep programming frameworks, property list files are files that store serialized objects. Property list files use the filename extension .plist, and thus are often referred to as p-list files.

A Canonical S-expression is a binary encoding form of a subset of general S-expression. It was designed for use in SPKI to retain the power of S-expressions and ensure canonical form for applications such as digital signatures while achieving the compactness of a binary form and maximizing the speed of parsing.

Action Message Format (AMF) is a binary format used to serialize object graphs such as ActionScript objects and XML, or send messages between an Adobe Flash client and a remote service, usually a Flash Media Server or third party alternatives. The Actionscript 3 language provides classes for encoding and decoding from the AMF format.

<span class="mw-page-title-main">Apache Avro</span> Open-source remote procedure call framework

Avro is a row-oriented remote procedure call and data serialization framework developed within Apache's Hadoop project. It uses JSON for defining data types and protocols, and serializes data in a compact binary format. Its primary use is in Apache Hadoop, where it can provide both a serialization format for persistent data, and a wire format for communication between Hadoop nodes, and from client programs to the Hadoop services. Avro uses a schema to structure the data that is being encoded. It has two different types of schema languages: one for human editing and another which is more machine-readable based on JSON.

X.690 is an ITU-T standard specifying several ASN.1 encoding formats:

Data Format Description Language is a modeling language for describing general text and binary data in a standard way. It was published as an Open Grid Forum Recommendation in February 2021, and in April 2024 was published as an ISO standard.

Universal Binary JSON (UBJSON) is a computer data interchange format. It is a binary form directly imitating JSON, but requiring fewer bytes of data. It aims to achieve the generality of JSON, combined with being much easier to process than JSON.

JSON streaming comprises communications protocols to delimit JSON objects built upon lower-level stream-oriented protocols, that ensures individual JSON objects are recognized, when the server and clients use the same one. This is necessary as JSON is a non-concatenative protocol.

Concise Binary Object Representation (CBOR) is a binary data serialization format loosely based on JSON authored by Carsten Bormann and Paul Hoffman. Like JSON it allows the transmission of data objects that contain name–value pairs, but in a more concise manner. This increases processing and transfer speeds at the cost of human readability. It is defined in IETF RFC 8949.

JData is a light-weight data annotation and exchange open-standard designed to represent general-purpose and scientific data structures using human-readable (text-based) JSON and (binary) UBJSON formats. JData specification specifically aims at simplifying exchange of hierarchical and complex data between programming languages, such as MATLAB, Python, JavaScript etc. It defines a comprehensive list of JSON-compatible "name":value constructs to store a wide range of data structures, including scalars, N-dimensional arrays, sparse/complex-valued arrays, maps, tables, hashes, linked lists, trees and graphs, and support optional data grouping and metadata for each data element. The generated data files are compatible with JSON/UBJSON specifications and can be readily processed by most existing parsers. JData-defined annotation keywords also permit storage of strongly-typed binary data streams in JSON, data compression, linking and referencing.