Palindrome tree

Type	Tree
Invented	2015
Invented by	Mikhail Rubinchik, Arseny M. Shur
Time complexity in big O notation
Operation	Average
Search	O(n log σ)
Insert	O(log σ)
Space complexity
Space	O(n)

Last updated August 09, 2024

In computer science, a palindrome tree, also called an EerTree,^[1] is a type of search tree that allows fast access to all palindromes contained in a string. It can be used to solve the longest palindromic substring problem, the k-factorization problem^[2] (can a given string be divided into exactly k palindromes?), the palindromic length of a string^[3] (what is the minimum number of palindromes needed to construct the string?), and finding and counting all distinct sub-palindromes. A palindrome tree does this in an online manner: it does not require the entire string at the start, and the string can be extended character by character.

Description

Figure: Palindrome tree for TACOCAT, where solid lines are character edges and dashed lines are suffix edges.

Like most trees, a palindrome tree consists of vertices and directed edges. Each vertex represents a palindrome (e.g. 'tacocat') but stores only the length of that palindrome, and each edge is either a character edge or a suffix edge. A character edge indicates that appending its character to both ends of the palindrome at the source vertex yields the palindrome at the destination vertex (e.g. an edge labeled 't' connects the source vertex 'acoca' to the destination vertex 'tacocat'). A suffix edge (also called a suffix link) connects each palindrome to its longest proper palindromic suffix (in the previous example 'tacocat' has a suffix edge to 't', and 'atacocata' would have a suffix edge to 'ata'). Palindrome trees differ from regular trees in that they have two roots, because they are in fact two separate trees. The two roots represent palindromes of length −1 and 0: appending the character 'a' to both ends of each root produces 'a' and 'aa' respectively. Since each character edge adds an even number of characters, one tree holds the odd-length palindromes and the other the even-length ones, and the two trees are only ever connected by suffix edges.
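The vertex layout described above can be sketched as follows. This is a minimal illustration, not a reference implementation: the names Vertex, extend, root_neg and root_zero are hypothetical, and pointing both roots' suffix links at the length −1 root is an assumed convention that is convenient for construction.

```python
class Vertex:
    """A palindrome vertex: it stores only a length, never the text itself."""

    def __init__(self, length):
        self.length = length
        self.edges = {}     # character edges: character -> Vertex
        self.suffix = None  # suffix edge: longest proper palindromic suffix


def extend(source, ch):
    """Follow (or create) the character edge labelled ch out of source."""
    if ch not in source.edges:
        # Wrapping a palindrome in ch on both ends adds two characters.
        source.edges[ch] = Vertex(source.length + 2)
    return source.edges[ch]


# The two roots: the imaginary palindrome of length -1 and the empty string.
root_neg = Vertex(-1)
root_zero = Vertex(0)
root_neg.suffix = root_neg   # assumed convention
root_zero.suffix = root_neg  # assumed convention

a = extend(root_neg, "a")    # -1 + 2 = 1, i.e. 'a'
aa = extend(root_zero, "a")  #  0 + 2 = 2, i.e. 'aa'
a.suffix = root_zero         # longest proper palindromic suffix of 'a' is ''
aa.suffix = a                # longest proper palindromic suffix of 'aa' is 'a'
```

Note that 'a' and 'aa' live in different trees (odd and even lengths) and are connected only through the suffix edge from 'aa' to 'a'.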

Operations

Add

Since a palindrome tree is constructed online, it maintains a pointer to the vertex of the longest palindromic suffix of the string processed so far (the palindrome reached by the previous call). To add the next character, add(x) first checks whether the character immediately before that palindrome matches the character being added. If it does not, suffix links are followed until a palindrome is found that the new character can extend on both ends. If the extended palindrome already exists in the tree, no vertex is created and the pointer simply moves to it. Otherwise, a new vertex is added with a character edge from the palindrome that was extended, and a suffix link for the new vertex is computed by continuing along the suffix links. If the length of the new palindrome is 1, its suffix link points to the root that represents length 0, since the longest proper palindromic suffix of a single character is the empty string.

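# S -> input string
# x -> position in the string of the character being added
# Assumed surrounding state (not shown here): root_neg and root_zero are the
# roots of length -1 and 0, both with suffix link root_neg; current starts at
# root_zero; each vertex keeps its character edges in a dict named add.
def add(x: int) -> bool:
    """Add S[x] to the palindrome tree; return True if a new vertex was created."""
    global current
    # Follow suffix links until S[x] can be placed on both ends.
    while True:
        if x - 1 - current.length >= 0 and S[x - 1 - current.length] == S[x]:
            break
        current = current.suffix
    existing = current.add.get(S[x])
    if existing is not None:
        current = existing  # palindrome already in the tree: just move to it
        return False
    suffix = current
    current = Palindrome_Vertex()
    current.length = suffix.length + 2
    suffix.add[S[x]] = current
    if current.length == 1:
        current.suffix = root_zero  # a single character has the empty suffix
        return True
    # Keep following suffix links to find the suffix link of the new vertex.
    while True:
        suffix = suffix.suffix
        if x - 1 - suffix.length >= 0 and S[x - 1 - suffix.length] == S[x]:
            current.suffix = suffix.add[S[x]]
            return True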
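For readers who want something runnable end to end, the following is a self-contained sketch of the add operation wrapped in a small class. The names Vertex, PalindromeTree, root_neg, root_zero and count_distinct_palindromes are hypothetical, and two details are assumptions of this sketch rather than statements from the source: both roots link to the length −1 root, and single-character palindromes link to the length 0 root (their longest proper palindromic suffix is the empty string).

```python
class Vertex:
    def __init__(self, length):
        self.length = length
        self.edges = {}     # character -> Vertex
        self.suffix = None  # longest proper palindromic suffix


class PalindromeTree:
    def __init__(self):
        self.text = []
        self.root_neg = Vertex(-1)
        self.root_zero = Vertex(0)
        self.root_neg.suffix = self.root_neg
        self.root_zero.suffix = self.root_neg
        self.current = self.root_zero  # longest palindromic suffix so far
        self.vertices = []             # non-root vertices in creation order

    def _extendable(self, v, x):
        """Walk suffix links from v until text[x] can wrap the palindrome."""
        while True:
            i = x - 1 - v.length
            if i >= 0 and self.text[i] == self.text[x]:
                return v  # always succeeds at root_neg, where i == x
            v = v.suffix

    def add(self, ch):
        """Append ch; return True if a new distinct palindrome appeared."""
        self.text.append(ch)
        x = len(self.text) - 1
        parent = self._extendable(self.current, x)
        if ch in parent.edges:
            self.current = parent.edges[ch]
            return False
        node = Vertex(parent.length + 2)
        if node.length == 1:
            node.suffix = self.root_zero
        else:
            node.suffix = self._extendable(parent.suffix, x).edges[ch]
        parent.edges[ch] = node
        self.vertices.append(node)
        self.current = node
        return True


def count_distinct_palindromes(s):
    tree = PalindromeTree()
    for ch in s:
        tree.add(ch)
    return len(tree.vertices)


print(count_distinct_palindromes("tacocat"))  # 7: t, a, c, o, coc, acoca, tacocat
```

Because every call creates at most one vertex, the number of non-root vertices equals the number of distinct non-empty palindromic substrings, which is what the helper returns.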

Joint trees

Finding palindromes that are common to multiple strings or unique to a single string can be done with $O(n*i)$ additional space, where $i$ is the number of strings being compared. This is accomplished by adding an array of $i$ flags to each vertex and setting the flag at index $j$ to 1 if that vertex was reached while adding string $j$. The only other modification needed is to reset the current pointer to the root at the end of each string. By joining trees in this manner the following problems can be solved:

Number of palindromes common to all strings
Number of unique palindromes in a string
Longest palindrome common to all strings
The number of palindromes that occur more often in one string than others
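The joint construction can be sketched as below, under the same assumptions about roots and suffix links as a single-string tree. All names (Vertex, build_joint_tree, count_common, count_unique_to, seen) are hypothetical and chosen for this illustration; the per-vertex seen list plays the role of the flag array, and the current pointer is reset to the length 0 root for each string.

```python
class Vertex:
    def __init__(self, length, num_strings):
        self.length = length
        self.edges = {}
        self.suffix = None
        self.seen = [False] * num_strings  # one flag per input string


def build_joint_tree(strings):
    k = len(strings)
    root_neg, root_zero = Vertex(-1, k), Vertex(0, k)
    root_neg.suffix = root_neg
    root_zero.suffix = root_neg
    vertices = []

    def extendable(v, s, x):
        # Walk suffix links until s[x] can wrap the palindrome at v.
        while True:
            i = x - 1 - v.length
            if i >= 0 and s[i] == s[x]:
                return v
            v = v.suffix

    for idx, s in enumerate(strings):
        current = root_zero  # reset the pointer for every string
        for x, ch in enumerate(s):
            parent = extendable(current, s, x)
            node = parent.edges.get(ch)
            if node is None:
                node = Vertex(parent.length + 2, k)
                if node.length == 1:
                    node.suffix = root_zero
                else:
                    node.suffix = extendable(parent.suffix, s, x).edges[ch]
                parent.edges[ch] = node
                vertices.append(node)
            node.seen[idx] = True  # vertex reached while adding string idx
            current = node
    return vertices


def count_common(strings):
    """Number of distinct palindromes that occur in every string."""
    return sum(all(v.seen) for v in build_joint_tree(strings))


def count_unique_to(strings, idx):
    """Number of distinct palindromes that occur only in strings[idx]."""
    return sum(v.seen[idx] and sum(v.seen) == 1
               for v in build_joint_tree(strings))


words = ["tacocat", "coconut"]
print(count_common(words))        # 4: t, c, o, coc
print(count_unique_to(words, 0))  # 3: a, acoca, tacocat
```

Flagging only the vertex reached at each position is enough here, because the first occurrence of any palindrome in a string is always the longest palindromic suffix at the position where it ends.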

Complexity

Time

Constructing a palindrome tree takes $O(n\log \sigma)$ time, where $n$ is the length of the string and $\sigma$ is the size of the alphabet. Construction makes $n$ calls to add(x), and each call takes $O(\log \sigma)$ amortized time. This is because each call to add(x) increases the depth of the current vertex (the last palindrome in the tree) by at most one, and searching the character edges of a vertex takes $O(\log \sigma)$ time. By assigning the cost of moving up and down the tree to each call to add(x), the cost of moving up the tree more than once is 'paid for' by an equal number of calls to add(x) in which no move up the tree occurred.
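One way to make the amortized argument concrete is the following sketch, which covers only the suffix links followed while searching for the palindrome to extend and measures progress by palindrome length rather than depth. The symbols $\ell_i$ and $k_i$ are introduced here for illustration: $\ell_i$ is the length of the current palindrome after the $i$-th call to add(x), with $\ell_0 = 0$ assuming construction starts at the root of length 0, and $k_i$ is the number of suffix links followed during that call. Each suffix link leads to a strictly shorter palindrome, and the final character edge adds two characters, so

$$\ell_i \le \ell_{i-1} - k_i + 2 \quad\Longrightarrow\quad \sum_{i=1}^{n} k_i \le 2n + \ell_0 - \ell_n \le 2n.$$

A call that follows many suffix links therefore shortens the current palindrome by at least as much, and only the two characters gained per call can be spent later. An analogous argument is usually given for the links followed when computing the suffix link of a new vertex. Multiplying the constant number of edge lookups per call by the $O(\log \sigma)$ lookup cost gives the stated bound.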

Space

A palindrome tree takes $O(n)$ space: at most $n+2$ vertices (the sub-palindromes and the two roots), at most $n$ character edges linking the vertices, and at most $n+2$ suffix edges.

Space–time tradeoff

If, instead of storing only the character edges that exist for each palindrome, every vertex stores an array of $\sigma$ edges, the correct edge can be found in constant time. This reduces construction time to $O(n+p*\sigma )$ while increasing space to $O(p*\sigma )$, where $p$ is the number of palindromes.
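The array-based variant can be sketched as a vertex that reserves one slot per alphabet symbol. The names ArrayVertex, child, set_child and SIGMA are hypothetical, and the alphabet is assumed to be the 26 lowercase ASCII letters purely for illustration.

```python
SIGMA = 26  # assumed alphabet size: lowercase ASCII letters only


class ArrayVertex:
    def __init__(self, length):
        self.length = length
        self.suffix = None
        # One slot per symbol, allocated up front whether or not it is used.
        self.edges = [None] * SIGMA

    def child(self, ch):
        """Constant-time edge lookup by direct indexing."""
        return self.edges[ord(ch) - ord("a")]

    def set_child(self, ch, vertex):
        self.edges[ord(ch) - ord("a")] = vertex


root_neg = ArrayVertex(-1)
c = ArrayVertex(1)
root_neg.set_child("c", c)
```

Allocating and initializing the $\sigma$ slots for each of the $p$ vertices is where the $p*\sigma$ term in both the time and the space bound comes from.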

References

[1] Rubinchik, Mikhail; Shur, Arseny M. (2015). "Eertree: An Efficient Data Structure for Processing Palindromes in Strings". European Journal of Combinatorics. arXiv:1506.04862v1.
[2] Galil, Zvi; Seiferas, Joel (1978). "A Linear-Time On-Line Recognition Algorithm for Palstar". Journal of the ACM. 25 (1): 102–111. doi:10.1145/322047.322056. S2CID 41095273.
[3] Fici, Gabriele; Gagie, Travis; Kärkkäinen, Juha; Kempa, Dominik (2014). "A subquadratic algorithm for minimum palindromic factorization". Journal of Discrete Algorithms. 28: 41–48. arXiv:1403.2431. doi:10.1016/j.jda.2014.08.001. S2CID 14871164.

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.

