Comma-separated values

Filename extension .csv
Internet media type text/csv [1]
Initial release Unknown
Informational RFC Oct 2005 [2]
Type of format multi-platform, serial data streams
Container for database information organized as field separated lists
Standard RFC 4180

A comma-separated values (CSV) file is a delimited text file that uses a comma to separate values. Each line of the file is a data record. Each record consists of one or more fields, separated by commas. The use of the comma as a field separator is the source of the name for this file format. A CSV file typically stores tabular data (numbers and text) in plain text, in which case each line will have the same number of fields.


The CSV file format is not fully standardized. The basic idea of separating fields with a comma is clear, but that idea gets complicated when the field data may also contain commas or even embedded line breaks. CSV implementations may not handle such field data, or they may use quotation marks to surround the field. Quotation does not solve everything: some fields may need embedded quotation marks, so a CSV implementation may include escape characters or escape sequences.

In addition, the term "CSV" also denotes some closely related delimiter-separated formats that use different field delimiters, for example, semicolons. These include tab-separated values and space-separated values. A delimiter that is not present in the field data (such as tab) keeps the format parsing simple. These alternate delimiter-separated files are often even given a .csv extension despite the use of a non-comma field separator. This loose terminology can cause problems in data exchange. Many applications that accept CSV files have options to select the delimiter character and the quotation character. Semicolons are often used in some European countries, such as Italy, instead of commas.

Data exchange

CSV is a common data exchange format that is widely supported by consumer, business, and scientific applications. Among its most common uses is moving tabular data [3] [4] between programs that natively operate on incompatible (often proprietary or undocumented) formats. [1] This works despite lack of adherence to RFC 4180 (or any other standard), because so many programs support variations on the CSV format for data import.

For example, a user may need to transfer information from a database program that stores data in a proprietary format, to a spreadsheet that uses a completely different format. The database program most likely can export its data as "CSV"; the exported CSV file can then be imported by the spreadsheet program.

Specification

RFC 4180 proposes a specification for the CSV format; however, actual practice often does not follow the RFC and the term "CSV" might refer to any file that: [2] [5]

  1. is plain text using a character set such as ASCII, various Unicode character sets (e.g. UTF-8), EBCDIC, or Shift JIS,
  2. consists of records (typically one record per line),
  3. with the records divided into fields separated by delimiters (typically a single reserved character such as comma, semicolon, or tab; sometimes the delimiter may include optional spaces),
  4. where every record has the same sequence of fields.

Within these general constraints, many variations are in use. Therefore, without additional information (such as whether RFC 4180 is honored), a file claimed simply to be in "CSV" format is not fully specified. As a result, many applications supporting CSV files allow users to preview the first few lines of the file and then specify the delimiter character(s), quoting rules, etc. If a particular CSV file's variations fall outside what a particular receiving program supports, it is often feasible to examine and edit the file by hand (i.e., with a text editor) or write a script or program to produce a conforming format.
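
The preview-then-choose workflow described above can be approximated in code. The following is a minimal sketch, not a definitive implementation: it assumes Python's standard csv module, whose Sniffer class guesses a dialect heuristically from a sample. The sample data, the candidate delimiter list, and the function name guess_delimiter are illustrative assumptions.

```python
import csv

# Illustrative sample: semicolon-delimited data with decimal commas.
SAMPLE = (
    "Year;Make;Model;Length\n"
    "1997;Ford;E350;2,35\n"
    "2000;Mercury;Cougar;2,38\n"
)

def guess_delimiter(sample, candidates=";,\t"):
    """Guess the field delimiter from a text sample (heuristic, may fail)."""
    try:
        return csv.Sniffer().sniff(sample, delimiters=candidates).delimiter
    except csv.Error:
        # The sniffer could not decide; fall back to the comma.
        return ","

delimiter = guess_delimiter(SAMPLE)
rows = list(csv.reader(SAMPLE.splitlines(), delimiter=delimiter))
preview = rows[:2]  # the first lines a user might be shown for confirmation
```

Because the guess is heuristic, an application would normally still let the user override the detected delimiter.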

History

Comma-separated values is a data format that pre-dates personal computers by more than a decade: the IBM Fortran (level H extended) compiler under OS/360 supported them in 1972. [6] List-directed ("free form") input/output was defined in FORTRAN 77, approved in 1978. List-directed input used commas or spaces for delimiters, so unquoted character strings could not contain commas or spaces. [7]

The "comma-separated value" name and "CSV" abbreviation were in use by 1983. [8] The manual for the Osborne Executive computer, which bundled the SuperCalc spreadsheet, documents the CSV quoting convention that allows strings to contain embedded commas, but the manual does not specify a convention for embedding quotation marks within quoted strings. [9]

Comma-separated value lists were easier to type (for example onto punched cards) than fixed-column-aligned data, and were less prone to producing incorrect results if a value was punched one column off from its intended location.

Comma-separated files are used for the interchange of database information between machines of two different architectures. The plain-text character of CSV files largely avoids incompatibilities such as byte order and word size. The files are largely human-readable, so it is easier to deal with them in the absence of perfect documentation or communication. [10]

The main standardization initiative, which transformed the "de facto fuzzy definition" into a more precise, de jure one, came in 2005 with RFC 4180, which defined CSV as a MIME content type. Later, in 2013, some of RFC 4180's deficiencies were tackled by a W3C recommendation. [11]

In 2014 the IETF published RFC 7111, which describes the application of URI fragments to CSV documents. RFC 7111 specifies how row, column, and cell ranges can be selected from a CSV document using position indexes.

In 2015 the W3C, in an attempt to enhance CSV with formal semantics, published the first drafts of recommendations for CSV metadata standards, which became W3C Recommendations in December of the same year. [12]

General functionality

CSV formats are best used to represent sets or sequences of records in which each record has an identical list of fields. This corresponds to a single relation in a relational database, or to data (though not calculations) in a typical spreadsheet.

The format dates back to the early days of business computing and is widely used to pass data between computers with different internal word sizes, data formatting needs, and so forth. For this reason, CSV files are common on all computer platforms.

A CSV file is a delimited text file that uses a comma to separate values. Many implementations of CSV import/export tools allow other separators to be used; for example, a "Sep=^" row as the first row in the *.csv file will cause Excel to open the file expecting the caret "^" to be the separator instead of the comma ",". Simple CSV implementations may prohibit field values that contain a comma or other special characters such as newlines. More sophisticated CSV implementations permit them, often by requiring " (double quote) characters around values that contain reserved characters (such as commas, double quotes, or less commonly, newlines). Embedded double quote characters may then be represented by a pair of consecutive double quotes, [13] or by prefixing a double quote with an escape character such as a backslash (for example in Sybase Central).
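
The two conventions for embedded double quotes can be compared side by side. This is a minimal sketch using Python's standard csv module as one example implementation; the helper names dump and load and the sample record are illustrative assumptions, and other tools may format the escaped variant differently.

```python
import csv
import io

ROW = ["1999", "Chevy", 'Venture "Extended Edition, Very Large"', "5000.00"]

def dump(row, **fmt):
    """Serialize one record to a string with the given csv formatting options."""
    buf = io.StringIO(newline="")
    csv.writer(buf, **fmt).writerow(row)
    return buf.getvalue()

def load(text, **fmt):
    """Parse a string that holds exactly one record."""
    return next(csv.reader(io.StringIO(text, newline=""), **fmt))

# Convention 1: embedded quotes are doubled (the csv module's default).
doubled = dump(ROW)

# Convention 2: embedded quotes are prefixed with a backslash.
backslash_fmt = {"doublequote": False, "escapechar": "\\"}
escaped = dump(ROW, **backslash_fmt)

print(doubled, end="")
print(escaped, end="")
```

Both strings decode to the same record, but only when the reader is told which convention the writer used, which is why the convention has to be agreed on or documented separately.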

CSV formats are not limited to a particular character set. [1] They work just as well with Unicode character sets (such as UTF-8 or UTF-16) as with ASCII (although particular programs that support CSV may have their own limitations). CSV files normally will even survive naive translation from one character set to another (unlike nearly all proprietary data formats). CSV does not, however, provide any way to indicate what character set is in use, so that must be communicated separately, or determined at the receiving end (if possible).

Databases that include multiple relations cannot be exported as a single CSV file. Similarly, CSV cannot naturally represent hierarchical or object-oriented data. This is because every CSV record is expected to have the same structure. CSV is therefore rarely appropriate for documents created with HTML, XML, or other markup or word-processing technologies.

Statistical databases in various fields often have a generally relation-like structure, but with some repeatable groups of fields. For example, health databases such as the Demographic and Health Survey typically repeat some questions for each child of a given parent (perhaps up to a fixed maximum number of children). Statistical analysis systems often include utilities that can "rotate" such data; for example, a "parent" record that includes information about five children can be split into five separate records, each containing (a) the information on one child, and (b) a copy of all the non-child-specific information. CSV can represent either the "vertical" or "horizontal" form of such data.
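
The "rotation" from the horizontal to the vertical form can be sketched as follows. This is only an illustration under stated assumptions: the column names (parent_id, region, child1_name, and so on), the sample data, and the function name rotate are all hypothetical, and real statistical packages provide their own utilities for this.

```python
import csv
import io

# Hypothetical "horizontal" data: one record per parent, with a repeated
# group of child fields (up to two children in this toy example).
HORIZONTAL = (
    "parent_id,region,child1_name,child1_age,child2_name,child2_age\n"
    "P1,North,Ana,4,Ben,7\n"
    "P2,South,Cai,2,,\n"
)

def rotate(text, max_children=2):
    """Split each parent record into one record per child ("vertical" form)."""
    out = []
    for rec in csv.DictReader(io.StringIO(text)):
        for i in range(1, max_children + 1):
            name = rec[f"child{i}_name"]
            if not name:  # unused slot in the repeated group
                continue
            out.append({
                "parent_id": rec["parent_id"],  # copied to every child row
                "region": rec["region"],
                "child_name": name,
                "child_age": rec[f"child{i}_age"],
            })
    return out

vertical = rotate(HORIZONTAL)
```

Each resulting record has the same four fields, so the vertical form can be written back out as an ordinary CSV file.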

In a relational database, similar issues are readily handled by creating a separate relation for each such group, and connecting "child" records to the related "parent" records using a foreign key (such as an ID number or name for the parent). In markup languages such as XML, such groups are typically enclosed within a parent element and repeated as necessary (for example, multiple <child> nodes within a single <parent> node). With CSV there is no widely accepted single-file solution.

Standardization

The name "CSV" indicates the use of the comma to separate data fields. Nevertheless, the term "CSV" is widely used to refer to a large family of formats that differ in many ways. Some implementations allow or require single or double quotation marks around some or all fields; and some reserve the very first record as a header containing a list of field names. The character set being used is undefined: some applications require a Unicode byte order mark (BOM) to enforce Unicode interpretation (sometimes even a UTF-8 BOM). [1] Files that use the tab character instead of comma can be more precisely referred to as "TSV" for tab-separated values.
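
The effect of a UTF-8 byte order mark on a reader that does not expect it can be shown with a short sketch. It assumes Python's standard codecs ("utf-8" and "utf-8-sig") and csv module; the sample text and the helper name header are illustrative assumptions.

```python
import csv
import io

text = "Año,Make\r\n1997,Ford\r\n"

# "utf-8-sig" prepends the UTF-8 byte order mark (EF BB BF) when encoding.
data = text.encode("utf-8-sig")

def header(raw, encoding):
    """Decode raw bytes with the given encoding and return the header row."""
    return next(csv.reader(io.StringIO(raw.decode(encoding), newline="")))

naive = header(data, "utf-8")      # BOM survives as U+FEFF in the first name
aware = header(data, "utf-8-sig")  # BOM is stripped while decoding
```

A reader that decodes as plain UTF-8 ends up with an invisible extra character at the start of the first field name, which is a common source of failed column lookups.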

Other implementation differences include handling of more commonplace field separators (such as space or semicolon) and newline characters inside text fields. One more subtlety is the interpretation of a blank line: it can equally be the result of writing a record of zero fields, or a record of one field of zero length; thus decoding it is ambiguous.
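
The blank-line ambiguity can be observed in practice. The sketch below shows how one particular implementation, Python's standard csv module, resolves it; other CSV tools may behave differently, and the helper name write_row is an illustrative assumption.

```python
import csv
import io

def write_row(row):
    """Serialize one record with the csv module's default settings."""
    buf = io.StringIO(newline="")
    csv.writer(buf).writerow(row)
    return buf.getvalue()

zero_fields = write_row([])    # a record with no fields
one_empty = write_row([""])    # a record with one zero-length field

# Reading: this parser reports a blank line as a record with zero fields.
parsed = list(csv.reader(io.StringIO("a,b\r\n\r\nc,d\r\n", newline="")))
```

This writer avoids the ambiguity by quoting a lone empty field, so only the zero-field record is written as a truly blank line.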

OKI frictionless tabular data package

In 2011 Open Knowledge International (OKI) and various partners created a data protocols working group, which later evolved into the Frictionless Data initiative. One of the main formats they released was the Tabular Data Package. The Tabular Data Package was heavily based on CSV, using it as the main data transport format and adding basic type and schema metadata (CSV lacks any type information to distinguish the string "1" from the number 1). An initial v1 of Tabular Data Package was released in 2015, and after extensive real-world testing and tool development, v1 of a CSV-based Tabular Data Package was officially released in September 2017. [14] The Frictionless Data initiative has also provided a standard CSV Dialect Description Format for describing different dialects of CSV, for example specifying the field separator or quoting rules.
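
The missing type information can be illustrated with a short sketch. The schema below is a hypothetical Python dictionary invented for this example; it is not the Tabular Data Package or Table Schema format itself, only an illustration of why attaching type metadata to CSV is useful. The sample data and the function name typed_rows are likewise assumptions.

```python
import csv
import io

DATA = "id,model,price\n1,E350,3000.00\n2,Cougar,4900.00\n"

# Hypothetical schema: field name -> converter. This is NOT the Table Schema
# format, only an illustration of attaching type metadata to CSV fields.
SCHEMA = {"id": int, "model": str, "price": float}

def typed_rows(text, schema):
    """Yield records with each field converted according to the schema."""
    for rec in csv.DictReader(io.StringIO(text)):
        yield {name: schema[name](value) for name, value in rec.items()}

untyped = list(csv.DictReader(io.StringIO(DATA)))  # every value is a string
typed = list(typed_rows(DATA, SCHEMA))             # values carry real types
```

Without the schema, the reader cannot tell whether "1" was meant as a string or as a number; with it, the same file yields typed values.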

Internet W3C tabular data standard

In 2013 the W3C "CSV on the Web" working group began to specify technologies providing higher interoperability for web applications using CSV or similar formats. [15] The working group completed its work in February 2016 and was officially closed in March 2016 with the release of a set of documents and W3C recommendations [16] for modeling "Tabular Data", [17] and enhancing CSV with metadata and semantics.

RFC 4180 standard

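The technical standard RFC 4180 formalizes the CSV file format and defines the MIME type "text/csv" for handling of text-based fields. However, interpretation of the text of each field is still application-specific. Files that follow the RFC 4180 standard can simplify CSV exchange and should be widely portable. Among its requirements:

  1. lines end with CR/LF characters (optional for the last line),
  2. an optional header record may be present (there is no sure way to detect whether it is, so care is required when importing),
  3. each record should contain the same number of comma-separated fields,
  4. any field may be quoted (with double quotes),
  5. fields containing a line break, double quote, or comma should be quoted,
  6. if double quotes are used to enclose fields, a double quote inside a field must be represented by two double-quote characters.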

The format can be processed by most programs that claim to read CSV files. The exceptions are (a) programs may not support line-breaks within quoted fields, (b) programs may confuse the optional header with data or interpret the first data line as an optional header and (c) double quotes in a field may not be parsed correctly automatically.
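
The quoting rules that RFC 4180 describes (fields containing commas, double quotes, or line breaks are enclosed in double quotes; embedded double quotes are doubled; records end with CRLF) can be sketched in a few lines. This is a minimal illustration rather than a complete or authoritative implementation, and the function names rfc4180_field and rfc4180_record are assumptions made for this example.

```python
def rfc4180_field(value):
    """Quote a field only when it contains a comma, quote, CR, or LF."""
    if any(ch in value for ch in ',"\r\n'):
        return '"' + value.replace('"', '""') + '"'
    return value

def rfc4180_record(fields):
    """Join fields with commas and terminate the record with CRLF."""
    return ",".join(rfc4180_field(f) for f in fields) + "\r\n"

header = rfc4180_record(["Year", "Make", "Model", "Description", "Price"])
record = rfc4180_record(
    ["1996", "Jeep", "Grand Cherokee", "MUST SELL!\nair, moon roof, loaded", "4799.00"]
)
```

The second record keeps its line break inside the quoted field, which is exactly the case that some readers fail to support.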

Basic rules

Many informal documents exist that describe "CSV" formats. IETF RFC 4180 (summarized above) defines the format for the "text/csv" MIME type registered with the IANA.

Rules typical of these and other "CSV" specifications and implementations are as follows:

Example

Year | Make  | Model                                  | Description            | Price
1997 | Ford  | E350                                   | ac, abs, moon          | 3000.00
1999 | Chevy | Venture "Extended Edition"             |                        | 4900.00
1999 | Chevy | Venture "Extended Edition, Very Large" |                        | 5000.00
1996 | Jeep  | Grand Cherokee                         | MUST SELL!             | 4799.00
     |       |                                        | air, moon roof, loaded |

The above table of data may be represented in CSV format as follows:

Year,Make,Model,Description,Price
1997,Ford,E350,"ac, abs, moon",3000.00
1999,Chevy,"Venture ""Extended Edition""","",4900.00
1999,Chevy,"Venture ""Extended Edition, Very Large""",,5000.00
1996,Jeep,Grand Cherokee,"MUST SELL!
air, moon roof, loaded",4799.00
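
As a check on the representation above, the same text can be parsed back into records. This is a minimal sketch that assumes Python's standard csv module as the reader; the variable names are illustrative.

```python
import csv
import io

CSV_TEXT = (
    'Year,Make,Model,Description,Price\n'
    '1997,Ford,E350,"ac, abs, moon",3000.00\n'
    '1999,Chevy,"Venture ""Extended Edition""","",4900.00\n'
    '1999,Chevy,"Venture ""Extended Edition, Very Large""",,5000.00\n'
    '1996,Jeep,Grand Cherokee,"MUST SELL!\n'
    'air, moon roof, loaded",4799.00\n'
)

# newline="" leaves line breaks untouched so the parser sees them as written.
rows = list(csv.reader(io.StringIO(CSV_TEXT, newline="")))
header, records = rows[0], rows[1:]
```

Although the text spans six physical lines, the reader returns one header and four records, because the line break inside the quoted Description field belongs to the data.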

Example of a USA/UK CSV file (where the decimal separator is a period/full stop and the value separator is a comma):

Year,Make,Model,Length
1997,Ford,E350,2.35
2000,Mercury,Cougar,2.38

Example of an analogous European CSV/DSV file (where the decimal separator is a comma and the value separator is a semicolon):

Year;Make;Model;Length
1997;Ford;E350;2,35
2000;Mercury;Cougar;2,38

The latter format is not RFC 4180 compliant. [18] Compliance could be achieved by the use of a comma instead of a semicolon as a separator and either the international notation for the representation of the decimal mark or the practice of quoting all numbers that have a decimal mark.
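
Converting the semicolon-separated variant into the comma-separated one can be sketched as follows. The example assumes Python's standard csv module; the function name read_lengths and the choice to keep only the Model and Length columns are illustrative assumptions.

```python
import csv
import io

EUROPEAN = "Year;Make;Model;Length\n1997;Ford;E350;2,35\n2000;Mercury;Cougar;2,38\n"

def read_lengths(text):
    """Read semicolon-separated data and convert decimal commas to floats."""
    reader = csv.DictReader(io.StringIO(text), delimiter=";")
    return {rec["Model"]: float(rec["Length"].replace(",", ".")) for rec in reader}

lengths = read_lengths(EUROPEAN)

# Re-export with comma separators and period decimal marks.
out = io.StringIO(newline="")
writer = csv.writer(out)
writer.writerow(["Model", "Length"])
writer.writerows(lengths.items())
```

Replacing the decimal comma before changing the separator matters: swapping semicolons for commas first would make "2,35" look like two fields.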

Application support

The CSV file format is supported by almost all spreadsheets and database management systems, including Microsoft Excel, Apple Numbers, LibreOffice Calc, and Apache OpenOffice Calc.

CSV format is supported by libraries available for many programming languages. Most provide some way to specify the field delimiter, decimal separator, character encoding, quoting conventions, date format, etc.

The Emacs editor can operate on CSV files using csv-nav mode. [19]

Many utility programs on Unix-style systems (such as cut, paste, join, sort, uniq, awk) can split files on a comma delimiter, and can therefore process simple CSV files. However, this method does not correctly handle commas within quoted strings.
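
The limitation of splitting on every comma can be demonstrated briefly. This sketch contrasts a plain string split, which behaves like the simple delimiter-based tools, with a quote-aware parser (here Python's standard csv module, chosen only as an example).

```python
import csv

line = '1997,Ford,E350,"ac, abs, moon",3000.00'

naive = line.split(",")            # treats every comma as a separator
proper = next(csv.reader([line]))  # honors the quoted field
```

The naive split yields seven pieces, with the quoted description broken into three fragments, while the quote-aware parser returns the expected five fields.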

References

  1. Shafranovich, Y. (October 2005). Common Format and MIME Type for CSV Files. IETF. p. 1. doi:10.17487/RFC4180. RFC 4180.
  2. Shafranovich (2005) states, "This RFC documents the format of comma separated values (CSV) files and formally registers the "text/csv" MIME type for CSV in accordance with RFC 2048".
  3. "CSV - Comma Separated Values". Retrieved December 2, 2017.
  4. "CSV Files". Retrieved June 4, 2014.
  5. "Comma Separated Values (CSV) Standard File Format". Edoceo, Inc. Retrieved June 4, 2014.
  6. IBM FORTRAN Program Products for OS and the CMS Component of VM/370 General Information (PDF) (first ed.), July 1972, p. 17, GC28-6884-0, retrieved February 5, 2016, For users familiar with the predecessor FORTRAN IV G and H processors, these are the major new language capabilities
  7. "List-Directed I/O", Fortran 77 Language Reference, Oracle
  8. "SuperCalc², spreadsheet package for IBM, CP/M". Retrieved December 11, 2017.
  9. "Comma-Separated-Value Format File Structure". Retrieved December 11, 2017.
  10. "CSV, Comma Separated Values (RFC 4180)". Retrieved June 4, 2014.
  11. See sparql11-results-csv-tsv, the first W3C recommendation scoped in CSV and filling some of RFC 4180's deficiencies.
  12. "Model for Tabular Data and Metadata on the Web - W3C Recommendation 17 December 2015". Retrieved March 23, 2016.
  14. "Frictionless Data 1.0 released". Open Knowledge International. 2016. Retrieved September 4, 2017.
  15. "CSV on the Web Working Group". W3C CSV WG. 2013. Retrieved April 22, 2015.
  16. CSV on the Web Repository (on GitHub)
  17. Model for Tabular Data and Metadata on the Web (W3C Recommendation)
  18. Shafranovich (2005) states, "Within the header and each record, there may be one or more fields, separated by commas."
  19. "EmacsWiki: Csv Nav".
