JSON streaming

Last updated

JSON streaming comprises communications protocols to delimit JSON objects built upon lower-level stream-oriented protocols (such as TCP), that ensures individual JSON objects are recognized, when the server and clients use the same one (e.g. implicitly coded in). This is necessary as JSON is a non-concatenative protocol (the concatenation of two JSON objects does not produce a valid JSON object).

Contents

Introduction

JSON is a popular format for exchanging object data between systems. Frequently there's a need for a stream of objects to be sent over a single connection, such as a stock ticker or application log records. [1] In these cases there's a need to identify where one JSON encoded object ends and the next begins. Technically this is known as framing.

There are four common ways to achieve this:

Comparison

Line-delimited JSON works very well with traditional line-oriented tools.

Concatenated JSON works with pretty-printed JSON but requires more effort and complexity to parse. It doesn't work well with traditional line-oriented tools. Concatenated JSON streaming is a superset of line-delimited JSON streaming.

Length-prefixed JSON works with pretty-printed JSON. It doesn't work well with traditional line-oriented tools, but may offer performance advantages over line-delimited or concatenated streaming. It can also be simpler to parse.

Approaches

Newline-delimited JSON

Two terms for equivalent formats of line-delimited JSON are:

Streaming makes use of the fact that the JSON format does not allow return and newline characters within primitive values (in strings those must be escaped as \r and \n, respectively) and that most JSON formatters default to not including any whitespace, including returns and newlines. These features allow the newline character or return and newline character sequence to be used as a delimiter.

This example shows two JSON objects (the implicit newline characters at the end of each line are not shown):

{"some":"thing\n"}{"may":{"include":"nested","objects":["and","arrays"]}}

The use of a newline as a delimiter enables this format to work very well with traditional line-oriented Unix tools.

A log file, for example, might look like:

{"ts":"2020-06-18T10:44:12","started":{"pid":45678}}{"ts":"2020-06-18T10:44:13","logged_in":{"username":"foo"},"connection":{"addr":"1.2.3.4","port":5678}}{"ts":"2020-06-18T10:44:15","registered":{"username":"bar","email":"bar@example.com"},"connection":{"addr":"2.3.4.5","port":6789}}{"ts":"2020-06-18T10:44:16","logged_out":{"username":"foo"},"connection":{"addr":"1.2.3.4","port":5678}}

which is very easy to sort by date, grep for usernames, actions, IP addresses, etc.

Compatibility

Line-delimited JSON can be read by a parser that can handle concatenated JSON. Concatenated JSON that contains newlines within a JSON object can't be read by a line-delimited JSON parser.

The terms "line-delimited JSON" and "newline-delimited JSON" are often used without clarifying if embedded newlines are supported.

In the past the NDJ specification ("newline-delimited JSON") [7] allowed comments to be embedded if the first two characters of a given line were "//". This could not be used with standard JSON parsers if comments were included. Current version of the specification ("NDJSON - Newline delimited JSON ") [8] no longer includes comments.

Concatenated JSON can be converted into line-delimited JSON by a suitable JSON utility such as jq. For example

jq--compact-output.<concatenated.json>lines.json 

Record separator-delimited JSON

Record separator-delimited JSON streaming allows JSON text sequences to be delimited without the requirement that the JSON formatter exclude whitespace. Since JSON text sequences cannot contain control characters, a record separator character can be used to delimit the sequences. In addition, it is suggested that each JSON text sequence be followed by a line feed character to allow proper handling of top-level JSON objects that are not self delimiting (numbers, true, false, and null).

This format is also known as JSON Text Sequences or MIME type application/json-seq, and is formally described in IETF RFC 7464.

The example below shows two JSON objects with ␞ representing the record separator control character and ␊ representing the line feed character:

{"some":"thing\n"}{"may":{"include":"nested","objects":["and","arrays"]}}

Concatenated JSON

Concatenated JSON streaming allows the sender to simply write each JSON object into the stream with no delimiters. It relies on the receiver using a parser that can recognize and emit each JSON object as the terminating character is parsed. Concatenated JSON isn't a new format, it's simply a name for streaming multiple JSON objects without any delimiters.

The advantage of this format is that it can handle JSON objects that have been formatted with embedded newline characters, e.g., pretty-printed for human readability. For example, these two inputs are both valid and produce the same output:

{"some":"thing\n"}{"may":{"include":"nested","objects":["and","arrays"]}}
{"some":"thing\n"}{"may":{"include":"nested","objects":["and","arrays"]}}

Implementations that rely on line-based input may require a newline character after each JSON object in order for the object to be emitted by the parser in a timely manner. (Otherwise the line may remain in the input buffer without being passed to the parser.) This is rarely recognised as an issue because terminating JSON objects with a newline character is very common.

Length-prefixed JSON

Length-prefixed or framed JSON streaming allows the sender to explicitly state the length of each message. It relies on the receiver using a parser that can recognize each length n and then read the following n bytes to parse as JSON.

The advantage of this format is that it can speed up parsing due to the fact that the exact length of each message is explicitly stated, rather than forcing the parser to search for delimiters. Length-prefixed JSON is also well-suited for TCP applications, where a single "message" may be divided into arbitrary chunks, because the prefixed length tells the parser exactly how many bytes to expect before attempting to parse a JSON string.

This example shows two length-prefixed JSON objects (with each length being the byte-length of the following JSON string):

18{"some":"thing\n"}55{"may":{"include":"nested","objects":["and","arrays"]}}

Applications and tools

Newline-delimited JSON

Record separator-delimited JSON

Concatenated JSON

Length-prefixed JSON

Related Research Articles

<span class="mw-page-title-main">AWK</span> Programming language

AWK is a domain-specific language designed for text processing and typically used as a data extraction and reporting tool. Like sed and grep, it is a filter, and is a standard feature of most Unix-like operating systems.

<span class="mw-page-title-main">String (computer science)</span> Sequence of characters, data type

In computer programming, a string is traditionally a sequence of characters, either as a literal constant or as some kind of variable. The latter may allow its elements to be mutated and the length changed, or it may be fixed. A string is generally considered as a data type and is often implemented as an array data structure of bytes that stores a sequence of elements, typically characters, using some character encoding. String may also denote more general arrays or other sequence data types and structures.

Lexical tokenization is conversion of a text into meaningful lexical tokens belonging to categories defined by a "lexer" program. In case of a natural language, those categories include nouns, verbs, adjectives, punctuations etc. In case of a programming language, the categories include identifiers, operators, grouping symbols and data types. Lexical tokenization is not the same process as the probabilistic tokenization, used for a large language model's data preprocessing, that encodes text into numerical tokens, using byte pair encoding.

A string literal or anonymous string is a literal for a string value in the source code of a computer program. Modern programming languages commonly use a quoted sequence of characters, formally "bracketed delimiters", as in x = "foo", where "foo" is a string literal with value foo. Methods such as escape sequences can be used to avoid the problem of delimiter collision and allow the delimiters to be embedded in a string. There are many alternate notations for specifying string literals especially in complicated cases. The exact notation depends on the programming language in question. Nevertheless, there are general guidelines that most modern programming languages follow.

<span class="mw-page-title-main">Newline</span> Special characters in computing signifying the end of a line of text

A newline is a control character or sequence of control characters in character encoding specifications such as ASCII, EBCDIC, Unicode, etc. This character, or a sequence of characters, is used to signify the end of a line of text and the start of a new one.

YAML(see § History and name) is a human-readable data serialization language. It is commonly used for configuration files and in applications where data is being stored or transmitted. YAML targets many of the same communications applications as Extensible Markup Language (XML) but has a minimal syntax that intentionally differs from Standard Generalized Markup Language (SGML). It uses Python-style indentation to indicate nesting and does not require quotes around most string values.

uuencoding is a form of binary-to-text encoding that originated in the Unix programs uuencode and uudecode written by Mary Ann Horton at the University of California, Berkeley in 1980, for encoding binary data for transmission in email systems.

<span class="mw-page-title-main">Comma-separated values</span> File format used to store data

Comma-separated values (CSV) is a text file format that uses commas to separate values, and newlines to separate records. A CSV file stores tabular data in plain text, where each line of the file typically represents one data record. Each record consists of the same number of fields, and these are separated by commas in the CSV file. If the field delimiter itself may appear within a field, fields can be surrounded with quotation marks.

United Nations/Electronic Data Interchange for Administration, Commerce and Transport (UN/EDIFACT) is an international standard for electronic data interchange (EDI) developed for the United Nations and approved and published by UNECE, the UN Economic Commission for Europe.

Formats that use delimiter-separated values store two-dimensional arrays of data by separating the values in each row with specific delimiter characters. Most database and spreadsheet programs are able to read or save data in a delimited format. Due to their wide support, DSV files can be used in data exchange among many applications.

Bencode is the encoding used by the peer-to-peer file sharing system BitTorrent for storing and transmitting loosely structured data.

<span class="mw-page-title-main">JSON</span> Open standard file format and data interchange

JSON is an open standard file format and data interchange format that uses human-readable text to store and transmit data objects consisting of attribute–value pairs and arrays. It is a commonly used data format with diverse uses in electronic data interchange, including that of web applications with servers.

An INI file is a configuration file for computer software that consists of a text-based content with a structure and syntax comprising key–value pairs for properties, and sections that organize the properties. The name of these configuration files comes from the filename extension INI, for initialization, used in the MS-DOS operating system which popularized this method of software configuration. The format has become an informal standard in many contexts of configuration, but many applications on other operating systems use different file name extensions, such as conf and cfg.

.properties is a file extension for files mainly used in Java-related technologies to store the configurable parameters of an application. They can also be used for storing strings for Internationalization and localization; these are known as Property Resource Bundles.

JSONP, or JSON-P, is a historical JavaScript technique for requesting data by loading a <script> element, which is an element intended to load ordinary JavaScript. It was proposed by Bob Ippolito in 2005. JSONP enables sharing of data bypassing same-origin policy, which disallows running JavaScript code to read media DOM elements or XMLHttpRequest data fetched from outside the page's originating site. The originating site is indicated by a combination of URI scheme, hostname, and port number.

InfinityDB is an all-Java embedded database engine and client/server DBMS with an extended java.util.concurrent.ConcurrentNavigableMap interface that is deployed in handheld devices, on servers, on workstations, and in distributed settings. The design is based on a proprietary lockless, concurrent, B-tree architecture that enables client programmers to reach high levels of performance without risk of failures.

Data Format Description Language, published as an Open Grid Forum Recommendation in February 2021, is a modeling language for describing general text and binary data in a standard way. A DFDL model or schema allows any text or binary data to be read from its native format and to be presented as an instance of an information set.. The same DFDL schema also allows data to be taken from an instance of an information set and written out to its native format.

Universal Binary JSON (UBJSON) is a computer data interchange format. It is a binary form directly imitating JSON, but requiring fewer bytes of data. It aims to achieve the generality of JSON, combined with being much easier to process than JSON.

JData is a light-weight data annotation and exchange open-standard designed to represent general-purpose and scientific data structures using human-readable (text-based) JSON and (binary) UBJSON formats. JData specification specifically aims at simplifying exchange of hierarchical and complex data between programming languages, such as MATLAB, Python, JavaScript etc. It defines a comprehensive list of JSON-compatible "name":value constructs to store a wide range of data structures, including scalars, N-dimensional arrays, sparse/complex-valued arrays, maps, tables, hashes, linked lists, trees and graphs, and support optional data grouping and metadata for each data element. The generated data files are compatible with JSON/UBJSON specifications and can be readily processed by most existing parsers. JData-defined annotation keywords also permit storage of strongly-typed binary data streams in JSON, data compression, linking and referencing.

jq (programming language) Programming language for JSON

jq is a very high-level lexically scoped functional programming language in which every JSON value is a constant. jq supports backtracking and managing indefinitely long streams of JSON data. It is related to the Icon and Haskell programming languages. The language supports a namespace-based module system and has some support for closures. In particular, functions and functional expressions can be used as parameters of other functions.

References

  1. Ryan, Film Grain. "How We Built Filmgrain, Part 2 of 2". filmgrainapp.com. Retrieved 4 July 2013.
  2. "JSON Lines".
  3. Williams, N. (2015). "RFC 7464". Request for Comments . doi: 10.17487/RFC7464 .
  4. ndjson
  5. Update specification_draft2.md · ndjson/ndjson-spec@c658c26
  6. JSON Lines
  7. "Newline Delimited JSON". Jimbo JW. Archived from the original on 2015-12-22.
  8. "NDJSON - Newline delimited JSON". GitHub . 2 June 2021.
  9. "Centralized Logging with Monolog, Logstash, and Elasticsearch".
  10. "Package org.eclipse.rdf4j.rio.ndjsonld". Eclipse Foundation. Retrieved 1 May 2023.
  11. "Introduce RDFParser and RDFWriter implementation for Newline Delimited JSON-LD format". rdf4j Github repository. February 2021. Retrieved 1 May 2023.