Longest palindromic substring

Last updated

In computer science, the longest palindromic substring or longest symmetric factor problem is the problem of finding a maximum-length contiguous substring of a given string that is also a palindrome. For example, the longest palindromic substring of "bananas" is "anana". The longest palindromic substring is not guaranteed to be unique; for example, in the string "abracadabra", there is no palindromic substring with length greater than three, but there are two palindromic substrings with length three, namely, "aca" and "ada". In some applications it may be necessary to return all maximal palindromic substrings (that is, all substrings that are themselves palindromes and cannot be extended to larger palindromic substrings) rather than returning only one substring or returning the maximum length of a palindromic substring.

Contents

Manacher (1975) invented an -time algorithm for listing all the palindromes that appear at the start of a given string of length . However, as observed e.g., by Apostolico, Breslauer & Galil (1995), the same algorithm can also be used to find all maximal palindromic substrings anywhere within the input string, again in time. Therefore, it provides an -time solution to the longest palindromic substring problem. Alternative -time solutions were provided by Jeuring (1994), and by Gusfield (1997), who described a solution based on suffix trees. A faster algorithm can be achieved in the word RAM model of computation if the size of the input alphabet is in . In particular, this algorithm runs in time using space. [1] Efficient parallel algorithms are also known for the problem. [2]

The longest palindromic substring problem should not be confused with the different problem of finding the longest palindromic subsequence.

Slower algorithm

This algorithm is slower than Manacher's algorithm, but is a good stepping stone for understanding Manacher's algorithm. It looks at each character as the center of a palindrome and loops to determine the largest palindrome with that center.

The loop at the center of the function only works for palindromes where the length is an odd number. The function works for even-length palindromes by modifying the input string. The character '|' is inserted between every character in the inputs string, and at both ends. So the input "book" becomes "|b|o|o|k|". The even-length palindrome "oo" in "book" becomes the odd-length palindrome "|o|o|".

    Longest_Palindrome_SLOW(string S, string S') {         // S' == S with a bogus character (eg. '|') inserted          // between each character (including outer boundaries)          // The radius of the longest palindrome centered on each place in S'         // note: length(S') = length(PalindromeRadii) = 2 × length(S) + 1          array PalindromeRadii = [0,...,0]                  Center = 0          while Center < length(S') {             // Determine the longest palindrome starting              // at Center-Radius and going to Center+Radius             Radius = 0              while Center-(Radius + 1) >= 0 and                    Center+(Radius + 1) < length(S') and                    S'[Center-(Radius + 1)] = S'[Center+(Radius + 1)] {                 Radius = Radius + 1             }                          // Save the radius of the longest palindrome in the array              PalindromeRadii[Center] = Radius                          Center = Center + 1         }                              // One can show that longest_palindrome_in_S is max(PalindromeRadii).         // if S'[i] == '|', PalindromeRadii[i] is even, otherwise you could increase PalindromeRadii[i] by 1,         // which is equivalent to inserting an extra '|' in each border.         // Remember that a palindrome centered in an '|' in S' corresponds to an even palindrome in S.         // if S'[i] != '|', PalindromeRadii[i] is odd (same argument), and corresponds to an odd palindrome.         // In this case, the length of the palindrome         // centered at that character is also x=PalindromeRadii[i], as we have (x-1)/2 characters on each side,         // plus the extra middle one ((x-1)/2*2+1=x)         longest_palindrome_in_S = max(PalindromeRadii)          return longest_palindrome_in_S      }

The runtime of this algorithm is . The outer loop runs times and the inner loop can run up to times.

Manacher's algorithm

Below is the pseudocode for Manacher's algorithm. The algorithm is faster than the previous algorithm because it exploits when a palindrome happens inside another palindrome.

For example, consider the input string "abacaba". By the time it gets to the "c", Manacher's algorithm will have identified the length of every palindrome centered on the letters before the "c". At the "c", it runs a loop to identify the largest palindrome centered on the "c": "abacaba". With that knowledge, everything after the "c" looks like the reflection of everything before the "c". The "a" after the "c" has the same longest palindrome as the "a" before the "c". Similarly, the "b" after the "c" has a longest palindrome that is at least the length of the longest palindrome centered on the "b" before the "c". There are some special cases to consider, but that trick speeds up the computation dramatically.[ citation needed ]

    Longest_Palindrome(string S, string S') {         // S' == S with a bogus character (eg. '|') inserted          // between each character (including outer boundaries)          // The radius of the longest palindrome centered on each place in S'         // note: length(S') = length(PalindromeRadii) = 2 × length(S) + 1          array PalindromeRadii = [0,...,0]                  Center = 0         Radius = 0                  while Center < length(S') {      // At the start of the loop, Radius is already set to a lower-bound      // for the longest radius. In the first iteration, Radius is 0, but      // it can be higher on later iterations.                   // Determine the longest palindrome starting at Center-Radius and      // going to Center+Radius              while Center-(Radius+1) >= 0 and             Center+(Radius+1) < length(S') and      S'[Center-(Radius+1)] = S'[Center+(Radius+1)] {                 Radius = Radius+1             }                                    // Save the radius of the longest palindrome in the array             PalindromeRadii[Center] = Radius                          // Below, Center is incremented.             // If any precomputed values can be reused, they are.             // Also, Radius may be set to a value greater than 0                          OldCenter = Center             OldRadius = Radius             Center = Center+1       // Radius' default value will be 0, if we reach the end of the      // following loop.              Radius = 0               while Center <= OldCenter + OldRadius {    // Because Center lies inside the old palindrome and every   // character inside a palindrome has a "mirrored" character   // reflected across its center, we can use the data that was   // precomputed for the Center's mirrored point.                   MirroredCenter = OldCenter - (Center - OldCenter)                 MaxMirroredRadius = OldCenter + OldRadius - Center                  if PalindromeRadii[MirroredCenter] < MaxMirroredRadius {        // The palindrome centered at MirroredCenter is entirely       // contained in the palindrome centered at OldCenter So,       // MirroredCenter and Center have the same sized palindrome                      PalindromeRadii[Center] = PalindromeRadii[MirroredCenter]                     Center = Center+1                 } else if PalindromeRadii[MirroredCenter] > MaxMirroredRadius {                        // The palindrome at MirroredCenter extends beyond the       // palindrome at OldCenter The palindrome at Center must       // end at the edge of the OldCenter palindrome Otherwise,       // the palindrome at OldCenter would be bigger                      PalindromeRadii[Center] = MaxMirroredRadius                     Center = Center+1                 } else { // PalindromeRadii[MirroredCenter] = MaxMirroredRadius        // Since the palindrome at MirroredCenter ends exactly at       // the edge of the palindrome centered at OldCenter, the       // palindrome at Center might be bigger Set Radius to the       // minimum size of the palindrome at Center so it doesn't       // recheck unnecessarily                       Radius = MaxMirroredRadius                     break                 }                }               }   // A palindrome's size is equal to its radius * 2. However, since our  // variable Radius considers our bogus characters to the side of the  // center, the size of its corresponding palindrome is actually 2 *  // (Radius / 2), which means a palindrome's size is equal to its  // corresponding Radius value in PalindromeRadii          longest_palindrome_in_S = max(PalindromeRadii)         return longest_palindrome_in_S     }

Special cases

Manacher's algorithm is faster because it reuses precomputed data when a palindrome exists inside another palindrome. There are 3 cases of this. They are represented by the "if / else if / else" statement in the pseudocode.

The first case is when the palindrome at MirroredCenter lies completely inside the "Old" palindrome. In this situation, the palindrome at Center will have the same length as the one at MirroredCenter. For example, if the "Old" palindrome is "abcbpbcba", we can see that the palindrome centered on "c" after the "p" must have the same length as the palindrome centered on the "c" before the "p".

The second case is when the palindrome at MirroredCenter extends outside the "Old" palindrome. That is, it extends "to the left" (or, contains characters with a lower index than any inside the "Old" palindrome). Because the "Old" palindrome is the largest possible palindrome centered on OldCenter, we know the characters before and after it are different. Thus, the palindrome at Center will run exactly up to the border of the "Old" palindrome, because the next character will be different than the one inside the palindrome at MirroredCenter. For example, if the string was "ababc", the "Old" palindrome could be "bab" with the Center being the second "b" and the MirroredCenter being the first "b". Since the palindrome at the MirroredCenter is "aba" and extends beyond the boundaries of the "Old" palindrome, we know the longest palindrome at the second "b" can only extend up to the border of the "Old" palindrome. We know this because if the character after the "Old" palindrome had been an "a" instead of a "c", the "Old" palindrome would have been longer.

The third and last case is when the palindrome at MirroredCenter extends exactly up to the border of the "Old" palindrome. In this case, we don't know if the character after the "Old" palindrome might make the palindrome at Center longer than the one at MirroredCenter. But we do know that the palindrome at Center is at least as long as the one at MirroredCenter. In this case, Radius is initialized to the radius of the palindrome at MirroredCenter and the search starts from there. An example string would be "abcbpbcbp" where the "Old" palindrome is "bcbpbcb" and the Center is on the second "c". The MirroredCenter is the first "c" and it has a longest palindrome of "bcb". The longest palindrome at the Center on the second "c" has to be at least that long and, in this case, is longer.

Runtime

The algorithm runs in linear time. This can be seen by noting that Center strictly increases after each outer loop and the sum Center + Radius is non-decreasing. Moreover, the number of operations in the first inner loop is linear in the increase of the sum Center + Radius while the number of operations in the second inner loop is linear in the increase of Center. Since Center ≤ 2n+1 and Radius ≤ n, the total number of operations in the first and second inner loops is and the total number of operations in the outer loop, other than those in the inner loops, is also . The overall running time is therefore .

Notes

  1. Charalampopoulos, Panagiotis; Pissis, Solon P.; Radoszewski, Jakub (Jun 2022). Bannai, Hideo; Holub, Jan (eds.). Longest Palindromic Substring in Sublinear Time. Combinatorial Pattern Matching. Leibniz International Proceedings in Informatics (LIPIcs). Vol. 223. Schloss Dagstuhl. doi:10.4230/LIPIcs.CPM.2022.20. Here: Theorem 1, p.20:2.
  2. Crochemore & Rytter (1991), Apostolico, Breslauer & Galil (1995).

See also

Related Research Articles

<span class="mw-page-title-main">Kolmogorov complexity</span> Measure of algorithmic complexity

In algorithmic information theory, the Kolmogorov complexity of an object, such as a piece of text, is the length of a shortest computer program that produces the object as output. It is a measure of the computational resources needed to specify the object, and is also known as algorithmic complexity, Solomonoff–Kolmogorov–Chaitin complexity, program-size complexity, descriptive complexity, or algorithmic entropy. It is named after Andrey Kolmogorov, who first published on the subject in 1963 and is a generalization of classical information theory.

In computer science, string-searching algorithms, sometimes called string-matching algorithms, are an important class of string algorithms that try to find a place where one or several strings are found within a larger string or text.

In computer science, the Cocke–Younger–Kasami algorithm is a parsing algorithm for context-free grammars published by Itiroo Sakai in 1961. The algorithm is named after some of its rediscoverers: John Cocke, Daniel Younger, Tadao Kasami, and Jacob T. Schwartz. It employs bottom-up parsing and dynamic programming.

<span class="mw-page-title-main">Longest common subsequence</span> Algorithmic problem on pairs of sequences

A longest common subsequence (LCS) is the longest subsequence common to all sequences in a set of sequences. It differs from the longest common substring: unlike substrings, subsequences are not required to occupy consecutive positions within the original sequences. The problem of computing longest common subsequences is a classic computer science problem, the basis of data comparison programs such as the diff utility, and has applications in computational linguistics and bioinformatics. It is also widely used by revision control systems such as Git for reconciling multiple changes made to a revision-controlled collection of files.

In computer science, the Knuth–Morris–Pratt algorithm is a string-searching algorithm that searches for occurrences of a "word" W within a main "text string" S by employing the observation that when a mismatch occurs, the word itself embodies sufficient information to determine where the next match could begin, thus bypassing re-examination of previously matched characters.

<span class="mw-page-title-main">Time complexity</span> Estimate of time taken for running an algorithm

In theoretical computer science, the time complexity is the computational complexity that describes the amount of computer time it takes to run an algorithm. Time complexity is commonly estimated by counting the number of elementary operations performed by the algorithm, supposing that each elementary operation takes a fixed amount of time to perform. Thus, the amount of time taken and the number of elementary operations performed by the algorithm are taken to be related by a constant factor.

In information theory, linguistics, and computer science, the Levenshtein distance is a string metric for measuring the difference between two sequences. The Levenshtein distance between two words is the minimum number of single-character edits required to change one word into the other. It is named after Soviet mathematician Vladimir Levenshtein, who defined the metric in 1965.

In computer science, the Rabin–Karp algorithm or Karp–Rabin algorithm is a string-searching algorithm created by Richard M. Karp and Michael O. Rabin that uses hashing to find an exact match of a pattern string in a text. It uses a rolling hash to quickly filter out positions of the text that cannot match the pattern, and then checks for a match at the remaining positions. Generalizations of the same idea can be used to find more than one match of a single pattern, or to find matches for more than one pattern.

In computer science, the Boyer–Moore string-search algorithm is an efficient string-searching algorithm that is the standard benchmark for practical string-search literature. It was developed by Robert S. Boyer and J Strother Moore in 1977. The original paper contained static tables for computing the pattern shifts without an explanation of how to produce them. The algorithm for producing the tables was published in a follow-on paper; this paper contained errors which were later corrected by Wojciech Rytter in 1980.

<span class="mw-page-title-main">Suffix tree</span> Tree containing all suffixes of a given text

In computer science, a suffix tree is a compressed trie containing all the suffixes of the given text as their keys and positions in the text as their values. Suffix trees allow particularly fast implementations of many important string operations.

In computer science, a suffix array is a sorted array of all suffixes of a string. It is a data structure used in, among others, full-text indices, data-compression algorithms, and the field of bibliometrics.

In computer science, a longest common substring of two or more strings is a longest string that is a substring of all of them. There may be more than one longest common substring. Applications include data deduplication and plagiarism detection.

In computer science, Ukkonen's algorithm is a linear-time, online algorithm for constructing suffix trees, proposed by Esko Ukkonen in 1995. The algorithm begins with an implicit suffix tree containing the first character of the string. Then it steps through the string, adding successive characters until the tree is complete. This order addition of characters gives Ukkonen's algorithm its "on-line" property. The original algorithm presented by Peter Weiner proceeded backward from the last character to the first one from the shortest to the longest suffix. A simpler algorithm was found by Edward M. McCreight, going from the longest to the shortest suffix.

In formal language theory, deterministic context-free languages (DCFL) are a proper subset of context-free languages. They are the context-free languages that can be accepted by a deterministic pushdown automaton. DCFLs are always unambiguous, meaning that they admit an unambiguous grammar. There are non-deterministic unambiguous CFLs, so DCFLs form a proper subset of unambiguous CFLs.

In computer science, Hirschberg's algorithm, named after its inventor, Dan Hirschberg, is a dynamic programming algorithm that finds the optimal sequence alignment between two strings. Optimality is measured with the Levenshtein distance, defined to be the sum of the costs of insertions, replacements, deletions, and null actions needed to change one string into the other. Hirschberg's algorithm is simply described as a more space-efficient version of the Needleman–Wunsch algorithm that uses divide and conquer. Hirschberg's algorithm is commonly used in computational biology to find maximal global alignments of DNA and protein sequences.

In computer science, the longest common prefix array is an auxiliary data structure to the suffix array. It stores the lengths of the longest common prefixes (LCPs) between all pairs of consecutive suffixes in a sorted suffix array.

<span class="mw-page-title-main">Suffix automaton</span> Deterministic finite automaton accepting set of all suffixes of particular string

In computer science, a suffix automaton is an efficient data structure for representing the substring index of a given string which allows the storage, processing, and retrieval of compressed information about all its substrings. The suffix automaton of a string is the smallest directed acyclic graph with a dedicated initial vertex and a set of "final" vertices, such that paths from the initial vertex to final vertices represent the suffixes of the string.

Gestalt pattern matching, also Ratcliff/Obershelp pattern recognition, is a string-matching algorithm for determining the similarity of two strings. It was developed in 1983 by John W. Ratcliff and John A. Obershelp and published in the Dr. Dobb's Journal in July 1988.

In computer science, the two-way string-matching algorithm is a string-searching algorithm, discovered by Maxime Crochemore and Dominique Perrin in 1991. It takes a pattern of size m, called a “needle”, preprocesses it in linear time O(m), producing information that can then be used to search for the needle in any “haystack” string, taking only linear time O(n) with n being the haystack's length.

In computer science a palindrome tree, also called an EerTree, is a type of search tree, that allows for fast access to all palindromes contained in a string. They can be used to solve the longest palindromic substring, the k-factorization problem, palindromic length of a string, and finding and counting all distinct sub-palindromes. Palindrome trees do this in an online manner, that is it does not require the entire string at the start and can be added to character by character.

References