Empty string

Last updated

In formal language theory, the empty string, or empty word, is the unique string of length zero.

Contents

Formal theory

Formally, a string is a finite, ordered sequence of characters such as letters, digits or spaces. The empty string is the special case where the sequence has length zero, so there are no symbols in the string. There is only one empty string, because two strings are only different if they have different lengths or a different sequence of symbols. In formal treatments, [1] the empty string is denoted with ε or sometimes Λ or λ .

The empty string should not be confused with the empty language , which is a formal language (i.e. a set of strings) that contains no strings, not even the empty string.

The empty string has several properties:

In context-free grammars, a production rule that allows a symbol to produce the empty string is known as an ε-production, and the symbol is said to be "nullable".

Use in programming languages

In most programming languages, strings are a data type. Strings are typically stored at distinct memory addresses (locations). Thus, the same string (for example, the empty string) may be stored in two or more places in memory.

In this way, there could be multiple empty strings in memory, in contrast with the formal theory definition, for which there is only one possible empty string. However, a string comparison function would indicate that all of these empty strings are equal to each other.

Even a string of length zero can require memory to store it, depending on the format being used. In most programming languages, the empty string is distinct from a null reference (or null pointer) because a null reference points to no string at all, not even the empty string. The empty string is a legitimate string, upon which most string operations should work. Some languages treat some or all of the following in similar ways: empty strings, null references, the integer 0, the floating point number 0, the Boolean value false, the ASCII character NUL, or other such values.

The empty string is usually represented similarly to other strings. In implementations with string terminating character (null-terminated strings or plain text lines), the empty string is indicated by the immediate use of this terminating character.

Different functions, methods, macros, or idioms exist for checking if a string is empty in different languages.[ example needed ]

λ representationProgramming languages
"" C, C#, C++, Go, Haskell, Java, JavaScript, Julia, Lua, M, Objective-C (as a C string), OCaml, Perl, PHP, Python, Ruby, Scala, Standard ML, Swift, Tcl, Visual Basic .NET
'' APL, Delphi, JavaScript, Lua, MATLAB, Pascal, Perl, PHP, Python, R, Ruby, Smalltalk, SQL
character(0) R [3]
{'\0'} C, C++, Objective-C (as a C string)
std::string() C++
""s C++ (since the 2014 standard)
@"" Objective-C (as a constant NSString object)
[NSString string] Objective-C (as a new NSString object)
q(), qq() Perl
str() Python
%{}
%()
Ruby
String::new() [4] Rust
string.Empty C#, Visual Basic .NET
String.make 0 '-' OCaml
{} Tcl
[[]] Lua

Examples of empty strings

The empty string is a syntactically valid representation of zero in positional notation (in any base), which does not contain leading zeros. Since the empty string does not have a standard visual representation outside of formal language theory, the number zero is traditionally represented by a single decimal digit 0 instead.

Zero-filled memory area, interpreted as a null-terminated string, is an empty string.

Empty lines of text show the empty string. This can occur from two consecutive EOLs, as often occur in text files, and this is sometimes used in text processing to separate paragraphs, e.g. in MediaWiki.

See also

Related Research Articles

<span class="mw-page-title-main">Formal language</span> Sequence of words formed by specific rules

In logic, mathematics, computer science, and linguistics, a formal language consists of words whose letters are taken from an alphabet and are well-formed according to a specific set of rules called a formal grammar.

In mathematical logic and computer science, the Kleene star is a unary operation, either on sets of strings or on sets of symbols or characters. In mathematics, it is more commonly known as the free monoid construction. The application of the Kleene star to a set is written as . It is widely used for regular expressions, which is the context in which it was introduced by Stephen Kleene to characterize certain automata, where it means "zero or more repetitions".

  1. If is a set of strings, then is defined as the smallest superset of that contains the empty string and is closed under the string concatenation operation.
  2. If is a set of symbols or characters, then is the set of all strings over symbols in , including the empty string .
<span class="mw-page-title-main">Regular expression</span> Sequence of characters that forms a search pattern

A regular expression, sometimes referred to as rational expression, is a sequence of characters that specifies a match pattern in text. Usually such patterns are used by string-searching algorithms for "find" or "find and replace" operations on strings, or for input validation. Regular expression techniques are developed in theoretical computer science and formal language theory.

In theoretical computer science and formal language theory, a regular language is a formal language that can be defined by a regular expression, in the strict sense in theoretical computer science.

<span class="mw-page-title-main">String (computer science)</span> Sequence of characters, data type

In computer programming, a string is traditionally a sequence of characters, either as a literal constant or as some kind of variable. The latter may allow its elements to be mutated and the length changed, or it may be fixed. A string is generally considered as a data type and is often implemented as an array data structure of bytes that stores a sequence of elements, typically characters, using some character encoding. String may also denote more general arrays or other sequence data types and structures.

In formal language theory and computer programming, string concatenation is the operation of joining character strings end-to-end. For example, the concatenation of "snow" and "ball" is "snowball". In certain formalisations of concatenation theory, also called string theory, string concatenation is a primitive notion.

In computer science, extended Backus–Naur form (EBNF) is a family of metasyntax notations, any of which can be used to express a context-free grammar. EBNF is used to make a formal description of a formal language such as a computer programming language. They are extensions of the basic Backus–Naur form (BNF) metasyntax notation.

A string literal or anonymous string is a literal for a string value in the source code of a computer program. Modern programming languages commonly use a quoted sequence of characters, formally "bracketed delimiters", as in x = "foo", where "foo" is a string literal with value foo. Methods such as escape sequences can be used to avoid the problem of delimiter collision and allow the delimiters to be embedded in a string. There are many alternate notations for specifying string literals especially in complicated cases. The exact notation depends on the programming language in question. Nevertheless, there are general guidelines that most modern programming languages follow.

In computer programming, a null-terminated string is a character string stored as an array containing the characters and terminated with a null character. Alternative names are C string, which refers to the C programming language and ASCIIZ.

In information theory, linguistics, and computer science, the Levenshtein distance is a string metric for measuring the difference between two sequences. Informally, the Levenshtein distance between two words is the minimum number of single-character edits required to change one word into the other. It is named after the Soviet mathematician Vladimir Levenshtein, who considered this distance in 1965.

In computational linguistics and computer science, edit distance is a string metric, i.e. a way of quantifying how dissimilar two strings are to one another, that is measured by counting the minimum number of operations required to transform one string into the other. Edit distances find applications in natural language processing, where automatic spelling correction can determine candidate corrections for a misspelled word by selecting words from a dictionary that have a low distance to the word in question. In bioinformatics, it can be used to quantify the similarity of DNA sequences, which can be viewed as strings of the letters A, C, G and T.

In abstract algebra, the free monoid on a set is the monoid whose elements are all the finite sequences of zero or more elements from that set, with string concatenation as the monoid operation and with the unique sequence of zero elements, often called the empty string and denoted by ε or λ, as the identity element. The free monoid on a set A is usually denoted A. The free semigroup on A is the subsemigroup of A containing all elements except the empty string. It is usually denoted A+.

In mathematics, the lexicographic or lexicographical order is a generalization of the alphabetical order of the dictionaries to sequences of ordered symbols or, more generally, of elements of a totally ordered set.

In automata theory, a finite-state machine is called a deterministic finite automaton (DFA), if

In mathematics, and particularly in the theory of formal languages, shortlex is a total ordering for finite sequences of objects that can themselves be totally ordered. In the shortlex ordering, sequences are primarily sorted by cardinality (length) with the shortest sequences first, and sequences of the same length are sorted into lexicographical order. Shortlex ordering is also called radix, length-lexicographic, military, or genealogical ordering.

In formal language theory, an alphabet, sometimes called a vocabulary, is a non-empty set of indivisible symbols/glyphs, typically thought of as representing letters, characters, digits, phonemes, or even words. Alphabets in this technical sense of a set are used in a diverse range of fields including logic, mathematics, computer science, and linguistics. An alphabet may have any cardinality ("size") and, depending on its purpose, may be finite, countable, or even uncountable.

<span class="mw-page-title-main">Terminal and nonterminal symbols</span> Categories of symbols in formal grammars

In formal languages, terminal and nonterminal symbols are the lexical elements used in specifying the production rules constituting a formal grammar. Terminal symbols are the elementary symbols of the language defined as part of a formal grammar. Nonterminal symbols are replaced by groups of terminal symbols according to the production rules.

In computer science, in the area of formal language theory, frequent use is made of a variety of string functions; however, the notation used is different from that used for computer programming, and some commonly used functions in the theoretical realm are rarely used when programming. This article defines some of these basic terms.

The C programming language has a set of functions implementing operations on strings in its standard library. Various operations, such as copying, concatenation, tokenization and searching are supported. For character strings, the standard library uses the convention that strings are null-terminated: a string of n characters is represented as an array of n + 1 elements, the last of which is a "NUL character" with numeric value 0.

In theoretical computer science, in particular in formal language theory, the Brzozowski derivative of a set of strings and a string is the set of all strings obtainable from a string in by cutting off the prefix . Formally:

References

  1. Corcoran, John; Frank, William; Maloney, Michael (1974). "String theory". Journal of Symbolic Logic. 39 (4): 625–637. doi:10.2307/2272846. JSTOR   2272846. S2CID   2168826.
  2. CSE1002 Lecture Notes – Lexicographic
  3. There are two ways to create "empty strings" in R; the other is listed here as "". character(0) creates empty character vectors, which will output 0 when counted.
  4. "String in std::string - Rust". doc.rust-lang.org. Retrieved 2022-11-30.