String operations

Last updated

In computer science, in the area of formal language theory, frequent use is made of a variety of string functions; however, the notation used is different from that used for computer programming, and some commonly used functions in the theoretical realm are rarely used when programming. This article defines some of these basic terms.

Contents

Strings and languages

A string is a finite sequence of characters. The empty string is denoted by . The concatenation of two string and is denoted by , or shorter by . Concatenating with the empty string makes no difference: . Concatenation of strings is associative: .

For example, .

A language is a finite or infinite set of strings. Besides the usual set operations like union, intersection etc., concatenation can be applied to languages: if both and are languages, their concatenation is defined as the set of concatenations of any string from and any string from , formally . Again, the concatenation dot is often omitted for brevity.

The language consisting of just the empty string is to be distinguished from the empty language . Concatenating any language with the former doesn't make any change: , while concatenating with the latter always yields the empty language: . Concatenation of languages is associative: .

For example, abbreviating , the set of all three-digit decimal numbers is obtained as . The set of all decimal numbers of arbitrary length is an example for an infinite language.

Alphabet of a string

The alphabet of a string is the set of all of the characters that occur in a particular string. If s is a string, its alphabet is denoted by

The alphabet of a language is the set of all characters that occur in any string of , formally: .

For example, the set is the alphabet of the string , and the above is the alphabet of the above language as well as of the language of all decimal numbers.

String substitution

Let L be a language, and let Σ be its alphabet. A string substitution or simply a substitution is a mapping f that maps characters in Σ to languages (possibly in a different alphabet). Thus, for example, given a character a ∈ Σ, one has f(a)=La where La ⊆ Δ* is some language whose alphabet is Δ. This mapping may be extended to strings as

f(ε)=ε

for the empty string ε, and

f(sa)=f(s)f(a)

for string sL and character a ∈ Σ. String substitutions may be extended to entire languages as [1]

Regular languages are closed under string substitution. That is, if each character in the alphabet of a regular language is substituted by another regular language, the result is still a regular language. [2] Similarly, context-free languages are closed under string substitution. [3] [note 1]

A simple example is the conversion fuc(.) to uppercase, which may be defined e.g. as follows:

charactermapped to languageremark
xfuc(x)
a{ ‹A› }map lowercase char to corresponding uppercase char
A{ ‹A› }map uppercase char to itself
ß{ ‹SS› }no uppercase char available, map to two-char string
‹0›{ ε }map digit to empty string
‹!›{ }forbid punctuation, map to empty language
...similar for other chars

For the extension of fuc to strings, we have e.g.

For the extension of fuc to languages, we have e.g.

String homomorphism

A string homomorphism (often referred to simply as a homomorphism in formal language theory) is a string substitution such that each character is replaced by a single string. That is, , where is a string, for each character . [note 2] [4]

String homomorphisms are monoid morphisms on the free monoid, preserving the empty string and the binary operation of string concatenation. Given a language , the set is called the homomorphic image of . The inverse homomorphic image of a string is defined as

while the inverse homomorphic image of a language is defined as

In general, , while one does have

and

for any language .

The class of regular languages is closed under homomorphisms and inverse homomorphisms. [5] Similarly, the context-free languages are closed under homomorphisms [note 3] and inverse homomorphisms. [6]

A string homomorphism is said to be ε-free (or e-free) if for all a in the alphabet . Simple single-letter substitution ciphers are examples of (ε-free) string homomorphisms.

An example string homomorphism guc can also be obtained by defining similar to the above substitution: guc(‹a›) = ‹A›, ..., guc(‹0›) = ε, but letting guc be undefined on punctuation chars. Examples for inverse homomorphic images are

For the latter language, guc(guc−1({ ‹A›, ‹bb› })) = guc({ ‹a› }) = { ‹A› } ≠ { ‹A›, ‹bb› }. The homomorphism guc is not ε-free, since it maps e.g. ‹0› to ε.

A very simple string homomorphism example that maps each character to just a character is the conversion of an EBCDIC-encoded string to ASCII.

String projection

If s is a string, and is an alphabet, the string projection of s is the string that results by removing all characters that are not in . It is written as . It is formally defined by removal of characters from the right hand side:

Here denotes the empty string. The projection of a string is essentially the same as a projection in relational algebra.

String projection may be promoted to the projection of a language. Given a formal language L, its projection is given by

[ citation needed ]

Right and left quotient

The right quotient of a character a from a string s is the truncation of the character a in the string s, from the right hand side. It is denoted as . If the string does not have a on the right hand side, the result is the empty string. Thus:

The quotient of the empty string may be taken:

Similarly, given a subset of a monoid , one may define the quotient subset as

Left quotients may be defined similarly, with operations taking place on the left of a string.[ citation needed ]

Hopcroft and Ullman (1979) define the quotient L1/L2 of the languages L1 and L2 over the same alphabet as L1/L2 = { s | ∃tL2. stL1 }. [7] This is not a generalization of the above definition, since, for a string s and distinct characters a, b, Hopcroft's and Ullman's definition implies yielding {}, rather than { ε }.

The left quotient (when defined similar to Hopcroft and Ullman 1979) of a singleton language L1 and an arbitrary language L2 is known as Brzozowski derivative; if L2 is represented by a regular expression, so can be the left quotient. [8]

Syntactic relation

The right quotient of a subset of a monoid defines an equivalence relation, called the right syntactic relation of S. It is given by

The relation is clearly of finite index (has a finite number of equivalence classes) if and only if the family right quotients is finite; that is, if

is finite. In the case that M is the monoid of words over some alphabet, S is then a regular language, that is, a language that can be recognized by a finite state automaton. This is discussed in greater detail in the article on syntactic monoids.[ citation needed ]

Right cancellation

The right cancellation of a character a from a string s is the removal of the first occurrence of the character a in the string s, starting from the right hand side. It is denoted as and is recursively defined as

The empty string is always cancellable:

Clearly, right cancellation and projection commute:

[ citation needed ]

Prefixes

The prefixes of a string is the set of all prefixes to a string, with respect to a given language:

where .

The prefix closure of a language is

Example:

A language is called prefix closed if .

The prefix closure operator is idempotent:

The prefix relation is a binary relation such that if and only if . This relation is a particular example of a prefix order.[ citation needed ]

See also

Notes

  1. Although every regular language is also context-free, the previous theorem is not implied by the current one, since the former yields a shaper result for regular languages.
  2. Strictly formally, a homomorphism yields a language consisting of just one string, i.e. .
  3. This follows from the above-mentioned closure under arbitrary substitutions.

Related Research Articles

<span class="mw-page-title-main">Context-free grammar</span> Type of formal grammar

In formal language theory, a context-free grammar (CFG) is a formal grammar whose production rules can be applied to a nonterminal symbol regardless of its context. In particular, in a context-free grammar, each production rule is of the form

In formal language theory, a context-free language (CFL) is a language generated by a context-free grammar (CFG).

<span class="mw-page-title-main">Formal language</span> Sequence of words formed by specific rules

In logic, mathematics, computer science, and linguistics, a formal language consists of words whose letters are taken from an alphabet and are well-formed according to a specific set of rules called a formal grammar.

In algebra, a homomorphism is a structure-preserving map between two algebraic structures of the same type. The word homomorphism comes from the Ancient Greek language: ὁμός meaning "same" and μορφή meaning "form" or "shape". However, the word was apparently introduced to mathematics due to a (mis)translation of German ähnlich meaning "similar" to ὁμός meaning "same". The term "homomorphism" appeared as early as 1892, when it was attributed to the German mathematician Felix Klein (1849–1925).

In mathematical logic and computer science, the Kleene star is a unary operation, either on sets of strings or on sets of symbols or characters. In mathematics, it is more commonly known as the free monoid construction. The application of the Kleene star to a set is written as . It is widely used for regular expressions, which is the context in which it was introduced by Stephen Kleene to characterize certain automata, where it means "zero or more repetitions".

  1. If is a set of strings, then is defined as the smallest superset of that contains the empty string and is closed under the string concatenation operation.
  2. If is a set of symbols or characters, then is the set of all strings over symbols in , including the empty string .
<span class="mw-page-title-main">Pushdown automaton</span> Type of automaton

In the theory of computation, a branch of theoretical computer science, a pushdown automaton (PDA) is a type of automaton that employs a stack.

<span class="mw-page-title-main">Pauli matrices</span> Matrices important in quantum mechanics and the study of spin

In mathematical physics and mathematics, the Pauli matrices are a set of three 2 × 2 complex matrices which are Hermitian, involutory and unitary. Usually indicated by the Greek letter sigma, they are occasionally denoted by tau when used in connection with isospin symmetries.

<span class="mw-page-title-main">Uncertainty principle</span> Foundational principle in quantum physics

The uncertainty principle, also known as Heisenberg's indeterminacy principle, is a fundamental concept in quantum mechanics. It states that there is a limit to the precision with which certain pairs of physical properties, such as position and momentum, can be simultaneously known. In other words, the more accurately one property is measured, the less accurately the other property can be known.

<span class="mw-page-title-main">Clifford algebra</span> Algebra based on a vector space with a quadratic form

In mathematics, a Clifford algebra is an algebra generated by a vector space with a quadratic form, and is a unital associative algebra. As K-algebras, they generalize the real numbers, complex numbers, quaternions and several other hypercomplex number systems. The theory of Clifford algebras is intimately connected with the theory of quadratic forms and orthogonal transformations. Clifford algebras have important applications in a variety of fields including geometry, theoretical physics and digital image processing. They are named after the English mathematician William Kingdon Clifford (1845–1879).

<span class="mw-page-title-main">Hooke's law</span> Physical law: force needed to deform a spring scales linearly with distance

In physics, Hooke's law is an empirical law which states that the force needed to extend or compress a spring by some distance scales linearly with respect to that distance—that is, Fs = kx, where k is a constant factor characteristic of the spring, and x is small compared to the total possible deformation of the spring. The law is named after 17th-century British physicist Robert Hooke. He first stated the law in 1676 as a Latin anagram. He published the solution of his anagram in 1678 as: ut tensio, sic vis. Hooke states in the 1678 work that he was aware of the law since 1660.

In automata theory, a finite-state machine is called a deterministic finite automaton (DFA), if

<span class="mw-page-title-main">Granular material</span> Conglomeration of discrete solid, macroscopic particles

A granular material is a conglomeration of discrete solid, macroscopic particles characterized by a loss of energy whenever the particles interact. The constituents that compose granular material are large enough such that they are not subject to thermal motion fluctuations. Thus, the lower size limit for grains in granular material is about 1 μm. On the upper size limit, the physics of granular materials may be applied to ice floes where the individual grains are icebergs and to asteroid belts of the Solar System with individual grains being asteroids.

<span class="mw-page-title-main">Two-state quantum system</span> Simple quantum mechanical system

In quantum mechanics, a two-state system is a quantum system that can exist in any quantum superposition of two independent quantum states. The Hilbert space describing such a system is two-dimensional. Therefore, a complete basis spanning the space will consist of two independent states. Any two-state system can also be seen as a qubit.

<span class="mw-page-title-main">Beam emittance</span> Property of a charged particle beam

In accelerator physics, emittance is a property of a charged particle beam. It refers to the area occupied by the beam in a position-and-momentum phase space.

In mathematics and theoretical computer science, a semiautomaton is a deterministic finite automaton having inputs but no output. It consists of a set Q of states, a set Σ called the input alphabet, and a function T: Q × Σ → Q called the transition function.

In mathematics and computer science, a history monoid is a way of representing the histories of concurrently running computer processes as a collection of strings, each string representing the individual history of a process. The history monoid provides a set of synchronization primitives for providing rendezvous points between a set of independently executing processes or threads.

Indexed grammars are a generalization of context-free grammars in that nonterminals are equipped with lists of flags, or index symbols. The language produced by an indexed grammar is called an indexed language.

The purpose of this page is to provide supplementary materials for the ordinary least squares article, reducing the load of the main article with mathematics and improving its accessibility, while at the same time retaining the completeness of exposition.

In computer science, more specifically in automata and formal language theory, nested words are a concept proposed by Alur and Madhusudan as a joint generalization of words, as traditionally used for modelling linearly ordered structures, and of ordered unranked trees, as traditionally used for modelling hierarchical structures. Finite-state acceptors for nested words, so-called nested word automata, then give a more expressive generalization of finite automata on words. The linear encodings of languages accepted by finite nested word automata gives the class of visibly pushdown languages. The latter language class lies properly between the regular languages and the deterministic context-free languages. Since their introduction in 2004, these concepts have triggered much research in that area.

In mathematics, the injective tensor product of two topological vector spaces (TVSs) was introduced by Alexander Grothendieck and was used by him to define nuclear spaces. An injective tensor product is in general not necessarily complete, so its completion is called the completed injective tensor products. Injective tensor products have applications outside of nuclear spaces. In particular, as described below, up to TVS-isomorphism, many TVSs that are defined for real or complex valued functions, for instance, the Schwartz space or the space of continuously differentiable functions, can be immediately extended to functions valued in a Hausdorff locally convex TVS without any need to extend definitions from real/complex-valued functions to -valued functions.

References

  1. Hopcroft, Ullman (1979), Sect.3.2, p.60
  2. Hopcroft, Ullman (1979), Sect.3.2, Theorem 3.4, p.60
  3. Hopcroft, Ullman (1979), Sect.6.2, Theorem 6.2, p.131
  4. Hopcroft, Ullman (1979), Sect.3.2, p.60-61
  5. Hopcroft, Ullman (1979), Sect.3.2, Theorem 3.5, p.61
  6. Hopcroft, Ullman (1979), Sect.6.2, Theorem 6.3, p.132
  7. Hopcroft, Ullman (1979), Sect.3.2, p.62
  8. Janusz A. Brzozowski (1964). "Derivatives of Regular Expressions". J ACM. 11 (4): 481–494. doi: 10.1145/321239.321249 . S2CID   14126942.