Tab-separated values

Last updated
Tab-separated values
TsvDelimited001.svg
Filename extension .tsv, .tab [1]
Internet media type
text/tab-separated-values
Uniform Type Identifier (UTI) public.tab-separated-values-text [2]
UTI conformation public.delimited-values-text [2]
Developed by University of Minnesota Internet Gopher Team

Internet Assigned Numbers Authority
Initial releasec.June 1993;30 years ago (1993-06)
Type of format Delimiter-separated values format
Container for database information organized as field separated lists
Standard IANA MIME type

Tab-separated values (TSV) is a simple, text-based file format for storing tabular data. [3] Records are separated by newlines, and values within a record are separated by tab characters. The TSV format is thus a delimiter-separated values format, similar to comma-separated values.

Contents

TSV is a simple file format that is widely supported, so it is often used in data exchange to move tabular data between different computer programs that support the format. For example, a TSV file might be used to transfer information from a database to a spreadsheet.

Example

The head of the Iris flower data set can be stored as a TSV using the following plain text (note that the HTML rendering may convert tabs to spaces):

Sepal length Sepal width Petal length Petal width Species 5.1 3.5 1.4 0.2 I. setosa 4.9 3.0 1.4 0.2 I. setosa 4.7 3.2 1.3 0.2 I. setosa 4.6 3.1 1.5 0.2 I. setosa 5.0 3.6 1.4 0.2 I. setosa 

The TSV plain text above corresponds to the following tabular data:

Sepal lengthSepal widthPetal lengthPetal widthSpecies
5.13.51.40.2I. setosa
4.93.01.40.2I. setosa
4.73.21.30.2I. setosa
4.63.11.50.2I. setosa
5.03.61.40.2I. setosa

Character escaping

The IANA media type standard for TSV achieves simplicity by simply disallowing tabs within fields. [4]

Since the values in the TSV format cannot contain literal tabs or newline characters, a convention is necessary for lossless conversion of text values with these characters. A common convention is to perform the following escapes: [5] [6]

escape sequencemeaning
\n line feed
\t tab
\r carriage return
\\ backslash

Another common convention is to use the CSV convention from RFC   4180 and enclose values containing tabs or newlines in double quotes. This can lead to ambiguities. [7] [8]

Another ambiguity is whether records are separated by a line feed, as is typical for Unix platforms, or a carriage return and line feeds, as is typical for Microsoft platforms. [9] Many programs such as LibreOffice expect a carriage return followed by a newline.

See also

Related Research Articles

A text file is a kind of computer file that is structured as a sequence of lines of electronic text. A text file exists stored as data within a computer file system. In operating systems such as CP/M and MS-DOS, where the operating system does not keep track of the file size in bytes, the end of a text file is denoted by placing one or more special characters, known as an end-of-file (EOF) marker, as padding after the last line in a text file. On modern operating systems such as Microsoft Windows and Unix-like systems, text files do not contain any special EOF character, because file systems on those operating systems keep track of the file size in bytes. Most text files need to have end-of-line delimiters, which are done in a few different ways depending on operating system. Some operating systems with record-orientated file systems may not use new line delimiters and will primarily store text files with lines separated as fixed or variable length records.

<span class="mw-page-title-main">Newline</span> Special characters in computing signifying the end of a line of text

Newline is a control character or sequence of control characters in character encoding specifications such as ASCII, EBCDIC, Unicode, etc. This character, or a sequence of characters, is used to signify the end of a line of text and the start of a new one.

YAML is a human-readable data-serialization language. It is commonly used for configuration files and in applications where data is being stored or transmitted. YAML targets many of the same communications applications as Extensible Markup Language (XML) but has a minimal syntax which intentionally differs from Standard Generalized Markup Language (SGML). It uses both Python-style indentation to indicate nesting, and a more compact format that uses [...] for lists and {...} for maps but forbids tab characters to use as indentation thus only some JSON files are valid YAML 1.2.

<span class="mw-page-title-main">Tab key</span> Key on a keyboard for tabulation

The tab keyTab ↹ on a keyboard is used to advance the cursor to the next tab stop.

In computer programming, a magic number is any of the following:

<span class="mw-page-title-main">Comma-separated values</span> File format used to store data

A comma-separated values (CSV) file is a delimited text file that uses a comma to separate values. Each line of the file is a data record. Each record consists of one or more fields, separated by commas. The use of the comma as a field separator is the source of the name for this file format. A CSV file typically stores tabular data in plain text, in which case each line will have the same number of fields.

<span class="mw-page-title-main">Delimiter</span> Characters that specify the boundary between regions in a data stream

A delimiter is a sequence of one or more characters for specifying the boundary between separate, independent regions in plain text, mathematical expressions or other data streams. An example of a delimiter is the comma character, which acts as a field delimiter in a sequence of comma-separated values. Another example of a delimiter is the time gap used to separate letters and words in the transmission of Morse code.

<span class="mw-page-title-main">Flat-file database</span> Database stored as an ordinary unstructured file

A flat-file database is a database stored in a file called a flat file. Records follow a uniform format, and there are no structures for indexing or recognizing relationships between records. The file is simple. A flat file can be a plain text file, or a binary file. Relationships can be inferred from the data in the database, but the database format itself does not make those relationships explicit.

paste is a Unix command line utility which is used to join files horizontally by outputting lines consisting of the sequentially corresponding lines of each file specified, separated by tabs, to the standard output.

Formats that use delimiter-separated values store two-dimensional arrays of data by separating the values in each row with specific delimiter characters. Most database and spreadsheet programs are able to read or save data in a delimited format. Due to their wide support, DSV files can be used in data exchange among many applications.

<span class="mw-page-title-main">Windows Registry</span> Database for Microsoft Windows

The Windows Registry is a hierarchical database that stores low-level settings for the Microsoft Windows operating system and for applications that opt to use the registry. The kernel, device drivers, services, Security Accounts Manager, and user interfaces can all use the registry. The registry also allows access to counters for profiling system performance.

This article provides basic comparisons for notable text editors. More feature details for text editors are available from the Category of text editor features and from the individual products' articles. This article may not be up-to-date or necessarily all-inclusive.

<span class="mw-page-title-main">JSON</span> Open standard file format and data interchange

JSON is an open standard file format and data interchange format that uses human-readable text to store and transmit data objects consisting of attribute–value pairs and arrays. It is a common data format with diverse uses in electronic data interchange, including that of web applications with servers.

An INI file is a configuration file for computer software that consists of a text-based content with a structure and syntax comprising key–value pairs for properties, and sections that organize the properties. The name of these configuration files comes from the filename extension INI, for initialization, used in the MS-DOS operating system which popularized this method of software configuration. The format has become an informal standard in many contexts of configuration, but many applications on other operating systems use different file name extensions, such as conf and cfg.

In mathematics, Newick tree format is a way of representing graph-theoretical trees with edge lengths using parentheses and commas. It was adopted by James Archie, William H. E. Day, Joseph Felsenstein, Wayne Maddison, Christopher Meacham, F. James Rohlf, and David Swofford, at two meetings in 1986, the second of which was at Newick's restaurant in Dover, New Hampshire, US. The adopted format is a generalization of the format developed by Meacham in 1984 for the first tree-drawing programs in Felsenstein's PHYLIP package.

SHAZAM is a comprehensive econometrics and statistics package for estimating, testing, simulating and forecasting many types of econometrics and statistical models. SHAZAM was originally created in 1977 by Kenneth White.

<span class="mw-page-title-main">Robot Framework</span>

Robot Framework is a generic test automation framework for acceptance testing and acceptance test-driven development (ATDD). It is a keyword-driven testing framework that uses tabular test data syntax.

pandas (software) Python library for data analysis

pandas is a software library written for the Python programming language for data manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables and time series. It is free software released under the three-clause BSD license. The name is derived from the term "panel data", an econometrics term for data sets that include observations over multiple time periods for the same individuals. Its name is a play on the phrase "Python data analysis" itself. Wes McKinney started building what would become pandas at AQR Capital while he was a researcher there from 2007 to 2010.

JSON streaming comprises communications protocols to delimit JSON objects built upon lower-level stream-oriented protocols, that ensures individual JSON objects are recognized, when the server and clients use the same one. This is necessary as JSON is a non-concatenative protocol.

References

  1. U of Edin. Research Data Support Team. "Choose the best file formats". University of Edinburgh. § Formats we recommend. Retrieved 23 May 2023.
  2. 1 2 "tabSeparatedText". Apple Developer Documentation: Uniform Type Identifiers. Apple Inc . Retrieved 23 May 2023.
  3. "How To Use Tab Separated Value (TSV) files". International Monetary Fund. Retrieved 1 February 2023.
  4. Lindner 1993.
  5. Dusek, Jason (6 May 2014). "Linear TSV: simple, line-oriented, tabular data". Data Protocols - Open Knowledge Foundation (v1.0β ed.).
  6. Dolan, Stephen (1 November 2018). "jq Manual". jq. Retrieved 23 May 2023.
  7. Miller, Rob (22 September 2015). Text Processing with Ruby: Extract Value from the Data That Surrounds You. Pragmatic Bookshelf. p. 94. ISBN   978-1-68050-492-7.
  8. Giuseppini, Gabriele; Burnett, Mark (10 February 2005). Microsoft Log Parser Toolkit: A Complete Toolkit for Microsoft's Undocumented Log Analysis Tool. Elsevier. p. 311. ISBN   978-0-08-048939-1.
  9. Library of Congress, ID № fdd000533.

Further reading

Standards

Informational