Delimiter

A stylistic depiction of values inside of a so-named comma-separated values (CSV) text file. The commas (shown in red) are used as field delimiters.

A delimiter is a sequence of one or more characters for specifying the boundary between separate, independent regions in plain text, mathematical expressions or other data streams. [1] [2] An example of a delimiter is the comma character, which acts as a field delimiter in a sequence of comma-separated values. Another example of a delimiter is the time gap used to separate letters and words in the transmission of Morse code. [citation needed]


In mathematics, delimiters are often used to specify the scope of an operation, and can occur both as isolated symbols (e.g., the colon in "1 : 4") and as pairs of opposing-looking symbols (e.g., the angle brackets in ⟨a, b⟩).

Delimiters represent one of various means of specifying boundaries in a data stream. Declarative notation, for example, is an alternate method (without the use of delimiters) that uses a length field at the start of a data stream to specify the number of characters that the data stream contains. [3]
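As a rough illustration of the length-field idea, the following Python sketch prefixes each field with a fixed-width, four-digit decimal length. The function names and the four-digit width are assumptions made for this example only; they are not taken from Hollerith notation or any particular format.

```python
LENGTH_WIDTH = 4  # assumed width of the length field; limits a field to 9999 characters


def encode_length_prefixed(fields):
    """Write each field as a 4-digit length followed by the field text."""
    return "".join(f"{len(field):0{LENGTH_WIDTH}d}{field}" for field in fields)


def decode_length_prefixed(data):
    """Read fields back by trusting the length field instead of scanning for a delimiter."""
    fields = []
    pos = 0
    while pos < len(data):
        length = int(data[pos:pos + LENGTH_WIDTH])
        start = pos + LENGTH_WIDTH
        fields.append(data[start:start + length])
        pos = start + length
    return fields


fields = ["$30,000", "a:b", "two\nlines"]
encoded = encode_length_prefixed(fields)
print(repr(encoded))
print(decode_length_prefixed(encoded) == fields)
```

Because the reader never searches for a boundary character, commas, colons and line breaks inside a field need no special treatment in this sketch.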

Overview

Delimiters may be characterized as field and record delimiters, or as bracket delimiters.

Field and record delimiters

Field delimiters separate data fields. Record delimiters separate groups of fields. [4]

For example, the CSV format uses a comma as the delimiter between fields, and an end-of-line indicator as the delimiter between records:

fname,lname,age,salary
nancy,davolio,33,$30000
erin,borakova,28,$25250
tony,raphael,35,$28700

This specifies a simple flat-file database table using the CSV file format.
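A minimal Python sketch of how such a table might be read: the text is first split on the record delimiter (end of line) and each record is then split on the field delimiter (comma). The variable names are illustrative, and the sketch assumes that no field contains a comma or a line break.

```python
text = (
    "fname,lname,age,salary\n"
    "nancy,davolio,33,$30000\n"
    "erin,borakova,28,$25250\n"
    "tony,raphael,35,$28700\n"
)

# Record delimiter: end of line. Field delimiter: comma.
records = [line.split(",") for line in text.splitlines()]
header, rows = records[0], records[1:]

# Pair each field with its column name from the header record.
table = [dict(zip(header, row)) for row in rows]

for entry in table:
    print(entry["fname"], entry["salary"])
```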

Bracket delimiters

Bracket delimiters, also called block delimiters, region delimiters, or balanced delimiters, mark both the start and end of a region of text. [5] [6]

Common examples of bracket delimiters include: [7]

Delimiters  Description
( )         Parentheses. The Lisp programming language syntax is cited as recognizable primarily by its use of parentheses. [8]
{ }         Braces (also called curly brackets [9]).
[ ]         Brackets (commonly used to denote a subscript).
< >         Angle brackets. [10]
" "         Commonly used to denote string literals. [11]
' '         Commonly used to denote character literals. [11]
<? ?>       Used to indicate XML processing instructions. [12]
/* */       Used to denote comments in some programming languages. [13]
<% %>       Used in some web templates to specify language boundaries. [14]

Conventions

Historically, computing platforms have used certain delimiters by convention. [15] [16] The following tables depict a few examples for comparison.

Programming languages (see also Comparison of programming languages (syntax)).

Language  String literal              End of statement
Pascal    single quote                semicolon
Python    double quote, single quote  end of line (EOL)

Field and record delimiters (see also ASCII, Control character).

System                                      End of field                          End of record                           End of file
Unix-like systems including macOS, AmigaOS  Tab                                   LF                                      none
Windows, MS-DOS, OS/2, CP/M                 Tab                                   CRLF                                    none (except in CP/M), Control-Z [17]
Classic Mac OS, Apple DOS, ProDOS, GS/OS    Tab                                   CR                                      none
ASCII/Unicode                               UNIT SEPARATOR, position 31 (U+001F)  RECORD SEPARATOR, position 30 (U+001E)  FILE SEPARATOR, position 28 (U+001C)

Delimiter collision

Delimiter collision is a problem that occurs when an author or programmer introduces delimiters into text without actually intending them to be interpreted as boundaries between separate regions. [4] [18] In the case of XML, for example, this can occur whenever an author attempts to specify an angle bracket character.

In most file types there is both a field delimiter and a record delimiter, both of which are subject to collision. In the case of comma-separated values files, for example, field delimiter collision can occur whenever an author attempts to include a comma as part of a field value (e.g., salary = "$30,000"), and record delimiter collision occurs whenever a field contains multiple lines. Both record and field delimiter collision occur frequently in text files.
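The field collision described above can be reproduced in a few lines of Python. The sketch first joins and splits on a bare comma, which miscounts the fields, and then uses the quoting convention of Python's standard csv module as one possible remedy. The sample row is illustrative.

```python
import csv
import io

row = ["nancy", "davolio", "33", "$30,000"]

# Naive approach: the comma inside the salary is mistaken for a field boundary.
naive_line = ",".join(row)
naive_fields = naive_line.split(",")
print(len(row), len(naive_fields))

# The csv module wraps a field that contains the delimiter in double quotes.
buffer = io.StringIO()
csv.writer(buffer).writerow(row)
quoted_line = buffer.getvalue()
parsed_fields = next(csv.reader(io.StringIO(quoted_line)))
print(quoted_line.strip())
print(parsed_fields == row)
```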

In some contexts, a malicious user or attacker may seek to exploit this problem intentionally. Consequently, delimiter collision can be the source of security vulnerabilities and exploits. Malicious users can take advantage of delimiter collision in languages such as SQL and HTML to deploy such well-known attacks as SQL injection and cross-site scripting, respectively.
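The following Python sketch shows how a quote character in user input can collide with the string delimiter of an SQL statement. It uses an in-memory SQLite database; the table, its contents and the input string are invented for the example. The second query passes the value as a bound parameter, so the input is never parsed as part of the SQL text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE staff (lname TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO staff VALUES (?, ?)",
    [("davolio", 30000), ("borakova", 25250)],
)

# Hypothetical hostile input: the single quotes close and reopen the SQL string literal.
user_input = "nobody' OR '1'='1"

# Unsafe: the input is pasted between the quote delimiters of the statement.
unsafe_sql = f"SELECT lname FROM staff WHERE lname = '{user_input}'"
unsafe_rows = conn.execute(unsafe_sql).fetchall()

# Safer: the value is bound as a parameter and treated purely as data.
safe_rows = conn.execute(
    "SELECT lname FROM staff WHERE lname = ?", (user_input,)
).fetchall()

print(len(unsafe_rows), len(safe_rows))
```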

Solutions

Because delimiter collision is a very common problem, various methods for avoiding it have been invented. Some authors may attempt to avoid the problem by choosing a delimiter character (or sequence of characters) that is not likely to appear in the data stream itself. This ad hoc approach may be suitable, but it necessarily depends on a correct guess of what will appear in the data stream, and offers no security against malicious collisions. Other, more formal conventions are therefore applied as well.

ASCII delimited text

The ASCII and Unicode character sets were designed to solve this problem by the provision of non-printing characters that can be used as delimiters. These are the range from ASCII 28 to 31.

ASCII Dec  Symbol  Unicode name                Common name       Usage
28         ␜       INFORMATION SEPARATOR FOUR  file separator    End of file. Or between a concatenation of what might otherwise be separate files.
29         ␝       INFORMATION SEPARATOR THREE group separator   Between sections of data. Not needed in simple data files.
30         ␞       INFORMATION SEPARATOR TWO   record separator  End of a record or row.
31         ␟       INFORMATION SEPARATOR ONE   unit separator    Between fields of a record, or members of a row.

Using ASCII 31 (unit separator) as the field delimiter and ASCII 30 (record separator) as the record delimiter solves the problem of both field and record delimiters that appear in a text data stream. [19]
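A minimal Python sketch of ASCII delimited text follows. The function names are illustrative, and the sketch assumes that the data itself never contains the two control characters.

```python
UNIT_SEPARATOR = "\x1f"    # ASCII 31, between fields
RECORD_SEPARATOR = "\x1e"  # ASCII 30, between records


def dump(records):
    """Join fields with US and records with RS."""
    return RECORD_SEPARATOR.join(
        UNIT_SEPARATOR.join(fields) for fields in records
    )


def load(data):
    """Split on RS first, then on US."""
    return [record.split(UNIT_SEPARATOR) for record in data.split(RECORD_SEPARATOR)]


records = [["nancy", "$30,000"], ["erin", "line one\nline two"]]
data = dump(records)
print(load(data) == records)
```

Commas and line breaks pass through unchanged because neither plays a structural role in this encoding.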

Escape character

One method for avoiding delimiter collision is to use escape characters. From a language design standpoint, these are adequate, but they have drawbacks:

  • text can be rendered unreadable when littered with numerous escape characters, a problem referred to as leaning toothpick syndrome (due to the use of \ to escape / in Perl regular expressions, leading to sequences such as "\/\/");
  • text becomes difficult to parse with regular expressions;
  • they require a mechanism to "escape the escapes" when not intended as escape characters;
  • although easy to type, they can be cryptic to someone unfamiliar with the language; [20] and
  • they do not protect against injection attacks. [citation needed]
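The mechanics of an escape character, including the need to "escape the escapes", can be sketched in Python as follows. The backslash as escape character, the comma as delimiter and both function names are assumptions chosen for the example.

```python
def escape(field, delimiter=",", esc="\\"):
    """Escape the escape character first, then the delimiter."""
    return field.replace(esc, esc + esc).replace(delimiter, esc + delimiter)


def split_escaped(line, delimiter=",", esc="\\"):
    """Split on unescaped delimiters and drop the escape characters."""
    fields, current = [], []
    chars = iter(line)
    for ch in chars:
        if ch == esc:
            # The character after an escape is always taken literally.
            current.append(next(chars, ""))
        elif ch == delimiter:
            fields.append("".join(current))
            current = []
        else:
            current.append(ch)
    fields.append("".join(current))
    return fields


row = ["$30,000", r"C:\temp", "plain"]
line = ",".join(escape(field) for field in row)
print(line)
print(split_escaped(line) == row)
```

The order inside escape matters: if the delimiter were escaped first, the backslashes added in that step would themselves be doubled by the second step.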

Escape sequence

Escape sequences are similar to escape characters, except they usually consist of some kind of mnemonic instead of just a single character. One use is in string literals that include a doublequote (") character. For example in Perl, the code:

print "Nancy said \x22Hello World!\x22 to the crowd.";    ### use \x22

produces the same output as:

print "Nancy said \"Hello World!\" to the crowd.";    ### use escape char

One drawback of escape sequences, when used by people, is the need to memorize the codes that represent individual characters (see also: character entity reference, numeric character reference).

Dual quoting delimiters

In contrast to escape sequences and escape characters, dual delimiters provide yet another way to avoid delimiter collision. Some languages, for example, allow the use of either a single quote (') or a double quote (") to specify a string literal. For example, in Perl:

print 'Nancy said "Hello World!" to the crowd.';

produces the desired output without requiring escapes. This approach, however, only works when the string does not contain both types of quotation marks.
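Python offers the same choice of quote characters, so the idea can be sketched there as well. The variable names below are illustrative.

```python
# Single-quoted literal: the double quotes inside need no escaping.
single_quoted = 'Nancy said "Hello World!" to the crowd.'

# Equivalent double-quoted literal: the inner double quotes must be escaped.
double_quoted = "Nancy said \"Hello World!\" to the crowd."

# Double-quoted literal: the apostrophe inside needs no escaping.
with_apostrophe = "Nancy doesn't say hello."

print(single_quoted == double_quoted)
print(with_apostrophe)
```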

Padding quoting delimiters

In contrast to escape sequences and escape characters, padding delimiters provide yet another way to avoid delimiter collision. Visual Basic, for example, uses double quotes as string delimiters, and a double quote inside a string is written as two consecutive double quotes. This is similar to escaping the delimiter.

print "Nancy said ""Hello World!"" to the crowd."

produces the desired output without requiring escapes. Like regular escaping it can, however, become confusing when many quotes are used. The code to print the above source code would look more confusing:

print "print ""Nancy said """"Hello World!"""" to the crowd."""
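The doubling rule is mechanical enough to sketch in a few lines of Python. The helper names are hypothetical; the sketch only mirrors the convention described above and is not how any particular Visual Basic implementation works.

```python
def quote_doubled(text):
    """Wrap text in double quotes, doubling every double quote inside it."""
    return '"' + text.replace('"', '""') + '"'


def unquote_doubled(literal):
    """Reverse quote_doubled: strip the outer quotes and undouble the inner ones."""
    if len(literal) < 2 or literal[0] != '"' or literal[-1] != '"':
        raise ValueError("not a quoted literal")
    return literal[1:-1].replace('""', '"')


inner = 'Nancy said "Hello World!" to the crowd.'
once = quote_doubled(inner)              # the literal in the first statement
twice = quote_doubled("print " + once)   # the literal in the second statement

print(once)
print(twice)
print(unquote_doubled(once) == inner)
```

Each additional level of quoting doubles the number of quote characters, which is why the second statement is so much harder to read than the first.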

Configurable alternative quoting delimiters

In contrast to dual delimiters, multiple delimiters are even more flexible for avoiding delimiter collision. [7]:63

For example, in Perl:

print qq^Nancy doesn't want to say "Hello World!" anymore.^;
print qq@Nancy doesn't want to say "Hello World!" anymore.@;
print qq(Nancy doesn't want to say "Hello World!" anymore.);

all produce the desired output through use of quote operators, which allow any convenient character to act as a delimiter. Although this method is more flexible, few languages support it. Perl and Ruby are two that do. [7]:62 [21]

Content boundary

A content boundary is a special type of delimiter that is specifically designed to resist delimiter collision. It works by allowing the author to specify a sequence of characters that is guaranteed to always indicate a boundary between parts in a multi-part message, with no other possible interpretation. [22]

The delimiter is frequently generated from a random sequence of characters that is statistically improbable to occur in the content. This may be followed by an identifying mark such as a UUID, a timestamp, or some other distinguishing mark. Alternatively, the content may be scanned to guarantee that a delimiter does not appear in the text. This may allow the delimiter to be shorter or simpler, and increase the human readability of the document. (See e.g., MIME, Here documents).
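Both strategies described above, a random boundary and a scan of the content, can be combined in a short Python sketch. The marker layout is loosely modeled on multipart messages but is not the actual MIME format, and all names are assumptions for the example.

```python
import uuid


def choose_boundary(parts):
    """Pick a random boundary and scan the parts to confirm it does not occur in them."""
    while True:
        boundary = "boundary-" + uuid.uuid4().hex
        if not any(boundary in part for part in parts):
            return boundary


def join_parts(parts):
    """Return the chosen boundary and the message with a marker line between parts."""
    boundary = choose_boundary(parts)
    marker = "\n--" + boundary + "\n"
    return boundary, marker.join(parts)


def split_parts(boundary, message):
    """Split the message on the marker line built from the boundary."""
    return message.split("\n--" + boundary + "\n")


parts = ["fname,lname\nnancy,davolio", 'She said "hi"\n--not a boundary']
boundary, message = join_parts(parts)
print(split_parts(boundary, message) == parts)
```

The receiver must be told the boundary out of band, which is why join_parts returns it alongside the message.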

Whitespace or indentation

Some programming and computer languages allow the use of whitespace delimiters or indentation as a means of specifying boundaries between independent regions in text. [23]

Regular expression syntax

In specifying a regular expression, alternate delimiters may also be used to simplify the syntax for match and substitution operations in Perl. [24]

For example, a simple match operation may be specified in Perl with the following syntax:

$string1 = 'Nancy said "Hello World!" to the crowd.';    # specify a target string
print $string1 =~ m/[aeiou]+/;                           # match one or more vowels

The syntax is flexible enough to specify match operations with alternate delimiters, making it easy to avoid delimiter collision:

$string1 = 'Nancy said "http://Hello/World.htm" is not a valid address.';    # target string
print $string1 =~ m@http://@;    # match using alternate regular expression delimiter
print $string1 =~ m{http://};    # same as previous, but different delimiter
print $string1 =~ m!http://!;    # same as previous, but different delimiter.

Here document

A here document allows the inclusion of arbitrary content by describing a special end sequence. Many languages support this, including PHP, Bash, Ruby and Perl. A here document starts by declaring what the end sequence will be and continues until that sequence is seen at the start of a new line. [25]

Here is an example in Perl:

print <<ENDOFHEREDOC;
It's very hard to encode a string with "certain characters".
Newlines, commas, and other characters can cause delimiter collisions.
ENDOFHEREDOC

This code would print:

It's very hard to encode a string with "certain characters".
Newlines, commas, and other characters can cause delimiter collisions.

By using a special end sequence all manner of characters are allowed in the string.

ASCII armor

Although principally used as a mechanism for text encoding of binary data, ASCII armoring is a programming and systems administration technique that also helps to avoid delimiter collision in some circumstances. [26] [27] This technique differs from the other approaches described above in that it is more complicated, and is therefore not well suited to small applications and simple data storage formats. The technique employs a special encoding scheme, such as base64, to ensure that delimiters or other significant characters do not appear in transmitted data. The purpose is to avoid multilayered escaping, for example of double quotes.

This technique is used, for example, in Microsoft's ASP.NET web development technology, and is closely associated with the "VIEWSTATE" component of that system. [28]

Example

The following simplified example demonstrates how this technique works in practice.

The first code fragment shows a simple HTML tag in which the VIEWSTATE value contains characters that are incompatible with the delimiters of the HTML tag itself:

<input type="hidden" name="__VIEWSTATE" value="BookTitle:Nancy doesn't say "Hello World!" anymore." />

This first code fragment is not well-formed, and would therefore not work properly in a "real world" deployed system.

To store arbitrary text in an HTML attribute, HTML entities can be used. In this case "&quot;" stands in for the double-quote:

<input type="hidden" name="__VIEWSTATE" value="BookTitle:Nancy doesn't say &quot;Hello World!&quot; anymore." />

Alternatively, any encoding could be used that doesn't include characters that have special meaning in the context, such as base64:

<input type="hidden" name="__VIEWSTATE" value="Qm9va1RpdGxlOk5hbmN5IGRvZXNuJ3Qgc2F5ICJIZWxsbyBXb3JsZCEiIGFueW1vcmUu" />

Or percent-encoding:

<input type="hidden" name="__VIEWSTATE" value="BookTitle:Nancy%20doesn%27t%20say%20%22Hello%20World!%22%20anymore." />

This prevents delimiter collision and ensures that incompatible characters will not appear inside the HTML code, regardless of what characters appear in the original (decoded) text. [28]
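The three encodings in this example can be reproduced with Python's standard library, as in the sketch below. The variable names are illustrative. Note that html.escape also rewrites the apostrophe as a numeric character reference, and that the safe argument passed to quote is chosen here only so that the result matches the percent-encoded value shown above.

```python
import base64
import html
from urllib.parse import quote

text = 'BookTitle:Nancy doesn\'t say "Hello World!" anymore.'

# HTML entities: the double quote becomes &quot; (and the apostrophe &#x27;).
entity_encoded = html.escape(text, quote=True)

# Base64: the output alphabet contains no quote characters at all.
base64_encoded = base64.b64encode(text.encode("utf-8")).decode("ascii")

# Percent-encoding: ":" and "!" are left as-is to mirror the example above.
percent_encoded = quote(text, safe=":!")

for encoded in (entity_encoded, base64_encoded, percent_encoded):
    print(encoded)
```

None of the three results contains a double quote, so each can be placed inside a double-quoted HTML attribute value without collision.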


References

  1. "Definition: delimiter". Federal Standard 1037C - Telecommunications: Glossary of Telecommunication Terms. Archived from the original on 2013-03-05. Retrieved 2019-11-25.
  2. "What is a Delimiter?". www.computerhope.com. Retrieved 2020-08-09.
  3. Rohl, Jeffrey S. (1973). Programming in Fortran. Oxford Oxfordshire: Oxford University Press. ISBN 978-0-7190-0555-8. Describes the method in Hollerith notation under the Fortran programming language.
  4. de Moor, Georges J. (1993). Progress in Standardization in Health Care Informatics. IOS Press. ISBN 90-5199-114-2. p. 141.
  5. Friedl, Jeffrey E. F. (2002). Mastering Regular Expressions: Powerful Techniques for Perl and Other Tools. O'Reilly. ISBN 0-596-00289-0. p. 319.
  6. Scott, Michael Lee (1999). Programming Language Pragmatics. Morgan Kaufmann. ISBN 1-55860-442-1.
  7. Wall, Larry; Orwant, Jon (July 2000). Programming Perl (Third ed.). O'Reilly. ISBN 0-596-00027-8.
  8. Kaufmann, Matt (2000). Computer-Aided Reasoning: An Approach. Springer. ISBN 0-7923-7744-3. p. 3.
  9. Meyer, Mark (2005). Explorations in Computer Science. Oxford Oxfordshire: Oxford University Press. ISBN 978-0-7637-3832-7. References C-style programming languages prominently featuring curly brackets and semicolons.
  10. Dilligan, Robert (1998). Computing in the Web Age. Oxford Oxfordshire: Oxford University Press. ISBN 978-0-306-45972-6. Describes syntax and delimiters used in HTML.
  11. Schwartz, Randal (2005). Learning Perl. Oxford Oxfordshire: Oxford University Press. ISBN 978-0-596-10105-3. Describes string literals.
  12. Watt, Andrew (2003). Sams Teach Yourself Xml in 10 Minutes. Oxford Oxfordshire: Oxford University Press. ISBN 978-0-672-32471-0. Describes XML processing instruction. p. 21.
  13. Cabrera, Harold (2002). C# for Java Programmers. Oxford Oxfordshire: Oxford University Press. ISBN 978-1-931836-54-8. Describes single-line and multi-line comments. p. 72.
  14. "Jakarta Server Pages Specification, Version 4.0". GitHub. Retrieved 2023-02-10.
  15. ISO/TC 97/SC 2 (December 1, 1975). The set of control characters for ISO 646 (PDF). ITSCJ/IPSJ. ISO-IR-1.
  16. American National Standards Institute (December 1, 1975). ASCII graphic character set (PDF). ITSCJ/IPSJ. ISO-IR-6.
  17. Lewine, Donald (1991). Posix Programmer's Guide. Oxford Oxfordshire: Oxford University Press. ISBN 978-0-937175-73-6. Describes use of control-z. p. 156.
  18. Friedl, Jeffrey (2006). Mastering Regular Expressions. Oxford Oxfordshire: Oxford University Press. ISBN 978-0-596-52812-6. Describes solutions for embedded-delimiter problems. p. 472.
  19. Discussion on ASCII Delimited Text vs CSV and Tab Delimited.
  20. Kahrel, Peter (2006). Automating InDesign with Regular Expressions. O'Reilly. p. 11. ISBN 0-596-52937-6.
  21. Yukihiro, Matsumoto (2001). Ruby in a Nutshell. O'Reilly. ISBN 0-596-00214-9. In Ruby, these are indicated as general delimited strings. p. 11.
  22. Network Protocols Handbook. Javvin Technologies Inc. 2005. ISBN 0-9740945-2-8. p. 26.
  23. Computational Linguistics and Intelligent Text Processing. Oxford Oxfordshire: Oxford University Press. 2001. ISBN 978-3-540-41687-6. Describes whitespace delimiters. p. 258.
  24. Friedl, Jeffrey (2006). Mastering Regular Expressions. Oxford Oxfordshire: Oxford University Press. ISBN 978-0-596-52812-6. p. 472.
  25. Perl operators and precedence.
  26. Rhee, Man (2003). Internet Security: Cryptographic Principles, Algorithms and Protocols. John Wiley and Sons. ISBN 0-470-85285-2. (An example usage of ASCII armoring in encryption applications.)
  27. Gross, Christian (2005). Open Source for Windows Administrators. Charles River Media. ISBN 1-58450-347-5. (An example usage of ASCII armoring in encryption applications.)
  28. Kalani, Amit (2004). Developing and Implementing Web Applications with Visual C# .NET and Visual Studio .NET. Que. ISBN 0-7897-2901-6. (Describes the use of Base64 encoding and VIEWSTATE inside HTML source code.)