Palindrome tree

Last updated
Palindrome tree
Type Tree
Invented2015
Invented byMikhail Rubinchik, Arseny M. Shur
Time complexity in big O notation
AlgorithmAverageWorst case
Space O(n) O(n)
Search O(n*log σ) O(n*log σ)
Insert O(log σ) O(n)

In computer science a palindrome tree, also called an EerTree, [1] is a type of search tree, that allows for fast access to all palindromes contained in a string. They can be used to solve the longest palindromic substring, the k-factorization problem [2] (can a given string be divided into exactly k palindromes), palindromic length of a string [3] (what is the minimum number of palindromes needed to construct the string), and finding and counting all distinct sub-palindromes. Palindrome trees do this in an online manner, that is it does not require the entire string at the start and can be added to character by character.

Contents

Description

Palindrome Tree example for TACOCAT, where solid lines are character edges and dashed lines are suffix edges Palindrome Tree TACOCAT Example.png
Palindrome Tree example for TACOCAT, where solid lines are character edges and dashed lines are suffix edges

Like most trees, a palindrome tree consists of vertices and directed edges. Each vertex in the tree represents a palindrome (e.g. 'tacocat') but only stores the length of the palindrome, and each edge represents either a character or a suffix. The character edges represent that when the character is appended to both ends of the palindrome represented by the source vertex, the palindrome in the destination vertex is created (e.g. an edge labeled 't' would connect the source vertex 'acoca' to the destination vertex 'tacocat'). The suffix edge connects each palindrome to the largest palindrome suffix it possesses (in the previous example 'tacocat' would have a suffix edge to 't', and 'atacocata' would have a suffix link to 'ata'). Where palindrome trees differ from regular trees, is that they have two roots (as they are in fact two separate trees). The two roots represent palindromes of length −1, and 0. That is, if the character 'a' is appended to both roots the tree will produce 'a' and 'aa' respectively. Since each edge adds (or removes) an even number of characters, the two trees are only ever connected by suffix edges.

Operations

Add

Since a palindrome tree follows an online construction, it maintains a pointer to the last palindrome added to the tree. To add the next character to the palindrome tree, add(x) first checks if the first character before the palindrome matches the character being added, if it does not, the suffix links are followed until a palindrome can be added to the tree. Once a palindrome has been found, if it already existed in the tree, there is no work to do. Otherwise, a new vertex is added with a link from the suffix to the new vertex, and a suffix link for the new vertex is added. If the length of the new palindrome is 1, the suffix link points to the root of the palindrome tree that represents a length of −1.

# S -> Input String# x -> position in the string of the character being addeddefadd(x:int)->bool:"""Add character to the palindrome tree."""whileTrue:ifx-1-current.length>=0andS[x-1-current.length]==S[x]:breakcurrent=current.suffixifcurrent.add[S[x]]isnotNone:returnFalsesuffix=currentcurrent=Palindrome_Vertex()current.length=suffix.length+2suffix.add[S[x]]=currentifcurrent.length==1:current.suffix=rootreturnTruewhileTrue:suffix=suffix.suffixifx-1-suffix.length>=0andS[x-1-suffix.length]==S[x]:current.suffix=suffix.add[S[x]]returnTrue

Joint trees

Finding palindromes that are common to multiple strings or unique to a single string can be done with additional space where is the number of strings being compared. This is accomplished by adding an array of length to each vertex, and setting the flag to 1 at index if that vertex was reached when adding string . The only other modification needed is to reset the current pointer to the root at the end of each string. By joining trees in such a manner the following problems can be solved:

Complexity

Time

Constructing a palindrome tree takes time, where is the length of the string and is the size of the alphabet. With calls to add(x), each call takes amortized time. This is a result of each call to add(x) increases the depth of the current vertex (the last palindrome in the tree) by at most one, and searching all possible character edges of a vertex takes time. By assigning the cost of moving up and down the tree to each call to add(x), the cost of moving up the tree more than once is 'paid for' by an equal number of calls to add(x) when moving up the tree did not occur.

Space

A palindrome tree takes space: At most vertices to store the sub-palindromes and two roots, edges, linking the vertices and suffix edges.

Space–time tradeoff

If instead of storing only the add edges that exist for each palindrome an array of length edges is stored, finding the correct edge can be done in constant time reducing construction time to while increasing space to , where is the number of palindromes.

Related Research Articles

<span class="mw-page-title-main">String (computer science)</span> Sequence of characters, data type

In computer programming, a string is traditionally a sequence of characters, either as a literal constant or as some kind of variable. The latter may allow its elements to be mutated and the length changed, or it may be fixed. A string is generally considered as a data type and is often implemented as an array data structure of bytes that stores a sequence of elements, typically characters, using some character encoding. String may also denote more general arrays or other sequence data types and structures.

<span class="mw-page-title-main">Simplex</span> Multi-dimensional generalization of triangle

In geometry, a simplex is a generalization of the notion of a triangle or tetrahedron to arbitrary dimensions. The simplex is so-named because it represents the simplest possible polytope in any given dimension. For example,

<span class="mw-page-title-main">Breadth-first search</span> Algorithm to search the nodes of a graph

Breadth-first search (BFS) is an algorithm for searching a tree data structure for a node that satisfies a given property. It starts at the tree root and explores all nodes at the present depth prior to moving on to the nodes at the next depth level. Extra memory, usually a queue, is needed to keep track of the child nodes that were encountered but not yet explored.

In the mathematical disciplines of topology and geometry, an orbifold is a generalization of a manifold. Roughly speaking, an orbifold is a topological space which is locally a finite group quotient of a Euclidean space.

<span class="mw-page-title-main">Deterministic finite automaton</span> Finite-state machine

In the theory of computation, a branch of theoretical computer science, a deterministic finite automaton (DFA)—also known as deterministic finite acceptor (DFA), deterministic finite-state machine (DFSM), or deterministic finite-state automaton (DFSA)—is a finite-state machine that accepts or rejects a given string of symbols, by running through a state sequence uniquely determined by the string. Deterministic refers to the uniqueness of the computation run. In search of the simplest models to capture finite-state machines, Warren McCulloch and Walter Pitts were among the first researchers to introduce a concept similar to finite automata in 1943.

<span class="mw-page-title-main">Suffix tree</span> Tree containing all suffixes of a given text

In computer science, a suffix tree is a compressed trie containing all the suffixes of the given text as their keys and positions in the text as their values. Suffix trees allow particularly fast implementations of many important string operations.

In computer science, a suffix array is a sorted array of all suffixes of a string. It is a data structure used in, among others, full-text indices, data-compression algorithms, and the field of bibliometrics.

<span class="mw-page-title-main">Centrality</span> Degree of connectedness within a graph

In graph theory and network analysis, indicators of centrality assign numbers or rankings to nodes within a graph corresponding to their network position. Applications include identifying the most influential person(s) in a social network, key infrastructure nodes in the Internet or urban networks, super-spreaders of disease, and brain networks. Centrality concepts were first developed in social network analysis, and many of the terms used to measure centrality reflect their sociological origin.

In computer science, a longest common substring of two or more strings is a longest string that is a substring of all of them. There may be more than one longest common substring. Applications include data deduplication and plagiarism detection.

In computational phylogenetics, tree alignment is a computational problem concerned with producing multiple sequence alignments, or alignments of three or more sequences of DNA, RNA, or protein. Sequences are arranged into a phylogenetic tree, modeling the evolutionary relationships between species or taxa. The edit distances between sequences are calculated for each of the tree's internal vertices, such that the sum of all edit distances within the tree is minimized. Tree alignment can be accomplished using one of several algorithms with various trade-offs between manageable tree size and computational effort.

In computer science, a kernelization is a technique for designing efficient algorithms that achieve their efficiency by a preprocessing stage in which inputs to the algorithm are replaced by a smaller input, called a "kernel". The result of solving the problem on the kernel should either be the same as on the original input, or it should be easy to transform the output on the kernel to the desired output for the original problem.

In computer science, more specifically in automata and formal language theory, nested words are a concept proposed by Alur and Madhusudan as a joint generalization of words, as traditionally used for modelling linearly ordered structures, and of ordered unranked trees, as traditionally used for modelling hierarchical structures. Finite-state acceptors for nested words, so-called nested word automata, then give a more expressive generalization of finite automata on words. The linear encodings of languages accepted by finite nested word automata gives the class of visibly pushdown languages. The latter language class lies properly between the regular languages and the deterministic context-free languages. Since their introduction in 2004, these concepts have triggered much research in that area.

<span class="mw-page-title-main">Top tree</span>

A top tree is a data structure based on a binary tree for unrooted dynamic trees that is used mainly for various path-related operations. It allows simple divide-and-conquer algorithms. It has since been augmented to maintain dynamically various properties of a tree such as diameter, center and median.

<span class="mw-page-title-main">Betweenness centrality</span> Measure of a graphs centrality, based on shortest paths

In graph theory, betweenness centrality is a measure of centrality in a graph based on shortest paths. For every pair of vertices in a connected graph, there exists at least one shortest path between the vertices such that either the number of edges that the path passes through or the sum of the weights of the edges is minimized. The betweenness centrality for each vertex is the number of these shortest paths that pass through the vertex.

In computer science, the longest palindromic substring or longest symmetric factor problem is the problem of finding a maximum-length contiguous substring of a given string that is also a palindrome. For example, the longest palindromic substring of "bananas" is "anana". The longest palindromic substring is not guaranteed to be unique; for example, in the string "abracadabra", there is no palindromic substring with length greater than three, but there are two palindromic substrings with length three, namely, "aca" and "ada". In some applications it may be necessary to return all maximal palindromic substrings rather than returning only one substring or returning the maximum length of a palindromic substring.

<span class="mw-page-title-main">Wavelet Tree</span>

The Wavelet Tree is a succinct data structure to store strings in compressed space. It generalizes the and operations defined on bitvectors to arbitrary alphabets.

In computer science, the longest common prefix array is an auxiliary data structure to the suffix array. It stores the lengths of the longest common prefixes (LCPs) between all pairs of consecutive suffixes in a sorted suffix array.

<span class="mw-page-title-main">Suffix automaton</span> Deterministic finite automaton accepting set of all suffixes of particular string

In computer science, a suffix automaton is an efficient data structure for representing the substring index of a given string which allows the storage, processing, and retrieval of compressed information about all its substrings. The suffix automaton of a string is the smallest directed acyclic graph with a dedicated initial vertex and a set of "final" vertices, such that paths from the initial vertex to final vertices represent the suffixes of the string.

<span class="mw-page-title-main">Graham–Pollak theorem</span>

In graph theory, the Graham–Pollak theorem states that the edges of an -vertex complete graph cannot be partitioned into fewer than complete bipartite graphs. It was first published by Ronald Graham and Henry O. Pollak in two papers in 1971 and 1972, in connection with an application to telephone switching circuitry.

<span class="mw-page-title-main">Simplex tree</span> Topological data

In topological data analysis, a simplex tree is a type of trie used to represent efficiently any general simplicial complex. Through its nodes, this data structure notably explicitly represents all the simplices. Its flexible structure allows implementation of many basic operations useful to computing persistent homology. This data structure was invented by Jean-Daniel Boissonnat and Clément Maria in 2014, in the article The Simplex Tree: An Efficient Data Structure for General Simplicial Complexes. This data structure offers efficient operations on sparse simplicial complexes. For dense or maximal simplices, Skeleton-Blocker representations or Toplex Map representations are used.

References

  1. Rubinchik, Mikhail; Shur, Arseny M. (2015). "Eertree: An Efficient Data Structure for Processing Palindromes in Strings". European Journal of Combinatorics . arXiv: 1506.04862v1 .
  2. Galil, Zvi; Seiferas, Joel (1978). "A Linear-Time On-Line Recognition Algorithm for Palstar". Journal of the ACM. 25 (1): 102–111. doi: 10.1145/322047.322056 . S2CID   41095273.
  3. Fici, Gabriele; Gagie, Travis; Kärkkäinen, Juha; Kempa, Dominik (2014). "A subquadratic algorithm for minimum palindromic factorization". Journal of Discrete Algorithms. 28: 41–48. arXiv: 1403.2431 . doi:10.1016/j.jda.2014.08.001. S2CID   14871164.