Anatree

Last updated September 27, 2021

An anatree^[1] is a data structure designed to solve anagrams. Solving an anagram is the problem of finding a word from a given list of letters. These problems are commonly encountered in word games like Scrabble or in newspaper crossword puzzles. The problem for the wordwheel also has the condition that the central letter appear in all the words framed with the given set. Some other conditions may be introduced regarding the frequency (number of appearances) of each of the letters in the given input string. These problems are classified as Constraint satisfaction problem in computer science literature.

An anatree is represented as a directed tree which contains a set of words (W) encoded as strings in some alphabet. The internal vertices are labelled with some letter in the alphabet and the leaves contain words. The edges are labelled with non-negative integers. An anatree has the property that the sum of the edge labels from the root to the leaf is the length of the word stored at the leaf. If the internal vertices are labelled as $\alpha _{1}$ , $\alpha _{2}$ ... $\alpha _{l}$ , and the edge labels are $n_{1}$ , $n_{2}$ ... $n_{l}$ , then the path from the root to the leaf along these vertices and edges are a list of words that contain $n_{1}$ $\alpha _{1}$ s, $n_{2}$ $\alpha _{2}$ s and so on. Anatrees are intended to be read only data structures with all the words available at construction time.

Mixed anatree Image of a mixed anatree.png — Mixed anatree

A mixed anatree is an anatree where the internal vertices also store words. A mixed anatree can have words of varying lengths, where as in a regular anatree, all words are of the same length.

Data structures

A number of data structures have been proposed to solve anagrams in constant time. Two of the most commonly used data structures are the alphabetic map and the frequency map.

The alphabetic map maintains a hash table of all the possible words that can be in the language (this is referred to as the lexicon). For a given input string, sort the letters in alphabetic order. This sorted string maps onto a word in the hash table. Hence finding the anagram requires sorting the letters and looking up the word in the hash table. The sorting can be done in linear time with counting sort and hash table look ups can be done in constant time. For example, given the word ANATREE, the alphabetic map would produce a mapping of $\{AAEENRT->\{''anatree''\}\}$ .

A frequency map also stores the list of all possible words in the lexicon in a hash table. For a given input string, the frequency map maintains the frequencies (number of appearances) of all the letters and uses this count to perform a look up in the hash table. The worst case execution time is found to be linear in size of the lexicon. For example, given the word ANATREE, the alphabetic map would produce a mapping of $f(A)->2,f(E)->2,f(N)->1,f(R)->1,f(T)->1$ . The words that do not appear in the string are not written in the map.

Construction

The construction of an anatree begins by selecting a label for the root and partitioning words based on the label chosen for the root. This process is repeated recursively for all the labels of the tree. Anatree construction is non-canonical for a given set of words, depending on the label chosen for the root, the anatree will differ accordingly. The performance of the anatree is greatly impacted by the choice of labels.

The following are some heuristics for choosing labels:

Start labeling vertices in alphabetical order from the root. This approach reduces construction overhead
Start labeling vertices based on the relative frequency. A probabilistic approach is used to assign labels to vertices. If $W_{n}^{\alpha }$ is the set of words that contain $n\alpha s$ , then we label the vertex with $\alpha$ if it maximizes the expected distance to the leaf. This approach has the most frequently appearing characters (like E) labeled at the root and the least frequently appearing characters labeled at the leaves. The following equation is maximized $D_{\alpha }=\sum _{n}n{\frac {|W_{n}^{\alpha }|}{|W|}}$ . This approach prevents long sequences of zero labeled edges since they do not contribute letters to the words generated by the anatree.

Finding anagrams

To find a word in an anatree, start at the root, depending on the frequency of the label in the given input string, follow the edge that has that frequency till the leaf. The leaf contains the required word. For example, consider the anatree in the figure, to find the word $dog$ , the given string may be $ogd$ . Start at the root and follow the edge that has $1$ as the label. We follow this label since the given input string has $1$ $d$ . Traverse this edge until the leaf is encountered. That gives the required word.

Space and time requirements

A lexicon that stores $w$ words (each word can be $l$ characters long) in an alphabet $O$ has the following space requirements.

Data Structure	Space Requirements
Alphabetic Map	$O(wl(log\|O\|)$
Frequency Map	$O(w(\|O\|logl+llog\|O\|))$
Anatree	$O(w\|O\|(log\|O\|+l))$

The worst case execution time of an anatree is $O(|w|(l+w|O|^{2}))$

External links

Related Research Articles

A hash function is any function that can be used to map data of arbitrary size to fixed-size values. The values returned by a hash function are called hash values, hash codes, digests, or simply hashes. The values are usually used to index a fixed-size table called a hash table. Use of a hash function to index a hash table is called hashing or scatter storage addressing.

In computer science and information theory, a Huffman code is a particular type of optimal prefix code that is commonly used for lossless data compression. The process of finding or using such a code proceeds by means of Huffman coding, an algorithm developed by David A. Huffman while he was a Sc.D. student at MIT, and published in the 1952 paper "A Method for the Construction of Minimum-Redundancy Codes".

In graph theory, a tree is an undirected graph in which any two vertices are connected by exactly one path, or equivalently a connected acyclic undirected graph. A forest is an undirected graph in which any two vertices are connected by at most one path, or equivalently an acyclic undirected graph, or equivalently a disjoint union of trees.

The Post correspondence problem is an undecidable decision problem that was introduced by Emil Post in 1946. Because it is simpler than the halting problem and the Entscheidungsproblem it is often used in proofs of undecidability.

Breadth-first search Algorithm for searching the nodes of a graph in order by their hop count from a starting node

Breadth-first search (BFS) is an algorithm for searching a tree data structure for a node that satisfies a given property. It starts at the tree root and explores all nodes at the present depth prior to moving on to the nodes at the next depth level. Extra memory, usually a queue, is needed to keep track of the child nodes that were encountered but not yet explored.

Root system Geometric arrangements of points, foundational to Lie theory

In mathematics, a root system is a configuration of vectors in a Euclidean space satisfying certain geometrical properties. The concept is fundamental in the theory of Lie groups and Lie algebras, especially the classification and representation theory of semisimple Lie algebras. Since Lie groups and Lie algebras have become important in many parts of mathematics during the twentieth century, the apparently special nature of root systems belies the number of areas in which they are applied. Further, the classification scheme for root systems, by Dynkin diagrams, occurs in parts of mathematics with no overt connection to Lie theory. Finally, root systems are important for their own sake, as in spectral graph theory.

Weyl group Subgroup of a root systems isometry group

In mathematics, in particular the theory of Lie algebras, the Weyl group of a root system Φ is a subgroup of the isometry group of that root system. Specifically, it is the subgroup which is generated by reflections through the hyperplanes orthogonal to the roots, and as such is a finite reflection group. In fact it turns out that most finite reflection groups are Weyl groups. Abstractly, Weyl groups are finite Coxeter groups, and are important examples of these.

In mathematics, a Coxeter group, named after H. S. M. Coxeter, is an abstract group that admits a formal description in terms of reflections. Indeed, the finite Coxeter groups are precisely the finite Euclidean reflection groups; the symmetry groups of regular polyhedra are an example. However, not all Coxeter groups are finite, and not all can be described in terms of symmetries and Euclidean reflections. Coxeter groups were introduced in 1934 as abstractions of reflection groups, and finite Coxeter groups were classified in 1935.

In the theory of computation, a branch of theoretical computer science, a deterministic finite automaton (DFA)—also known as deterministic finite acceptor (DFA), deterministic finite-state machine (DFSM), or deterministic finite-state automaton (DFSA)—is a finite-state machine that accepts or rejects a given string of symbols, by running through a state sequence uniquely determined by the string. Deterministic refers to the uniqueness of the computation run. In search of the simplest models to capture finite-state machines, Warren McCulloch and Walter Pitts were among the first researchers to introduce a concept similar to finite automata in 1943.

In computer science, a suffix tree is a compressed trie containing all the suffixes of the given text as their keys and positions in the text as their values. Suffix trees allow particularly fast implementations of many important string operations.

In graph theory, a bridge, isthmus, cut-edge, or cut arc is an edge of a graph whose deletion increases the graph's number of connected components. Equivalently, an edge is a bridge if and only if it is not contained in any cycle. For a connected graph, a bridge can uniquely determine a cut. A graph is said to be bridgeless or isthmus-free if it contains no bridges.

In mathematics, a Sturmian word, named after Jacques Charles François Sturm, is a certain kind of infinitely long sequence of characters. Such a sequence can be generated by considering a game of English billiards on a square table. The struck ball will successively hit the vertical and horizontal edges labelled 0 and 1 generating a sequence of letters. This sequence is a Sturmian word.

A zero-suppressed decision diagram is a particular kind of binary decision diagram (BDD) with fixed variable ordering. This data structure provides a canonically compact representation of sets, particularly suitable for certain combinatorial problems. Recall the Ordered Binary Decision Diagram (OBDD) reduction strategy, i.e. a node is removed from the decision tree if both out-edges point to the same node. In contrast, a node in a ZDD is removed if its positive edge points to the terminal node 0. This provides an alternative strong normal form, with improved compression of sparse sets. It is based on a reduction rule devised by Shin-ichi Minato in 1993.

In cryptography, SWIFFT is a collection of provably secure hash functions. It is based on the concept of the fast Fourier transform (FFT). SWIFFT is not the first hash function based on FFT, but it sets itself apart by providing a mathematical proof of its security. It also uses the LLL basis reduction algorithm. It can be shown that finding collisions in SWIFFT is at least as difficult as finding short vectors in cyclic/ideal lattices in the worst case. By giving a security reduction to the worst-case scenario of a difficult mathematical problem, SWIFFT gives a much stronger security guarantee than most other cryptographic hash functions.

The generalized distributive law (GDL) is a generalization of the distributive property which gives rise to a general message passing algorithm. It is a synthesis of the work of many authors in the information theory, digital communications, signal processing, statistics, and artificial intelligence communities. The law and algorithm were introduced in a semi-tutorial by Srinivas M. Aji and Robert J. McEliece with the same title.

In computer science and information theory, Tunstall coding is a form of entropy coding used for lossless data compression.

In coding theory, Zemor's algorithm, designed and developed by Gilles Zemor, is a recursive low-complexity approach to code construction. It is an improvement over the algorithm of Sipser and Spielman.

In computer science, a suffix automaton is an efficient data structure for representing the substring index of a given string which allows the storage, processing, and retrieval of compressed information about all its substrings. The suffix automaton of a string $is the smallest directed acyclic graph with a dedicated initial vertex and a set of "final" vertices, such that paths from the initial vertex to final vertices represent the suffixes of the string.$

Each clue in a Jumble word puzzle is a word that has been “jumbled” by permuting the letters of each word to make an anagram. A dictionary of such anagrams may be used to solve puzzles or verify that a jumbled word is unique when creating puzzles.

An oblivious RAM (ORAM) simulator is a compiler that transforms algorithms in such a way that the resulting algorithms preserve the input-output behavior of the original algorithm but the distribution of memory access pattern of the transformed algorithm is independent of the memory access pattern of the original algorithm. The definition of ORAMs is motivated by the fact that an adversary can obtain nontrivial information about the execution of a program and the nature of the data that it is dealing with, just by observing the pattern in which various locations of memory are accessed during its execution. An adversary can get this information even if the data values are all encrypted. The definition suits equally well to the settings of protected programs running on unprotected shared memory as well as a client running a program on its system by accessing previously stored data on a remote server. The concept was formulated by Oded Goldreich in 1987.

References

↑ Reams, Charles (March 2012). "Anatree: A Fast Data Structure for Anagrams". Journal of Experimental Algorithmics. 17 (1): 2012. doi:10.1145/2133803.2133804.

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.

[1] Reams, Charles (March 2012). "Anatree: A Fast Data Structure for Anagrams". Journal of Experimental Algorithmics. 17 (1): 2012. doi:10.1145/2133803.2133804.

[1]