Sequential pattern mining

Last updated December 12, 2023

Sequential pattern mining is a topic of data mining concerned with finding statistically relevant patterns between data examples where the values are delivered in a sequence.^[1]^[2] It is usually presumed that the values are discrete, and thus time series mining is closely related, but usually considered a different activity. Sequential pattern mining is a special case of structured data mining.

There are several key traditional computational problems addressed within this field. These include building efficient databases and indexes for sequence information, extracting the frequently occurring patterns, comparing sequences for similarity, and recovering missing sequence members. In general, sequence mining problems can be classified as string mining which is typically based on string processing algorithms and itemset mining which is typically based on association rule learning. Local process models^[3] extend sequential pattern mining to more complex patterns that can include (exclusive) choices, loops, and concurrency constructs in addition to the sequential ordering construct.

String mining

String mining typically deals with a limited alphabet for items that appear in a sequence, but the sequence itself may be typically very long. Examples of an alphabet can be those in the ASCII character set used in natural language text, nucleotide bases 'A', 'G', 'C' and 'T' in DNA sequences, or amino acids for protein sequences. In biology applications analysis of the arrangement of the alphabet in strings can be used to examine gene and protein sequences to determine their properties. Knowing the sequence of letters of a DNA or a protein is not an ultimate goal in itself. Rather, the major task is to understand the sequence, in terms of its structure and biological function. This is typically achieved first by identifying individual regions or structural units within each sequence and then assigning a function to each structural unit. In many cases this requires comparing a given sequence with previously studied ones. The comparison between the strings becomes complicated when insertions, deletions and mutations occur in a string.

A survey and taxonomy of the key algorithms for sequence comparison for bioinformatics is presented by Abouelhoda & Ghanem (2010), which include:^[4]

Repeat-related problems: that deal with operations on single sequences and can be based on exact string matching or approximate string matching methods for finding dispersed fixed length and maximal length repeats, finding tandem repeats, and finding unique subsequences and missing (un-spelled) subsequences.
Alignment problems: that deal with comparison between strings by first aligning one or more sequences; examples of popular methods include BLAST for comparing a single sequence with multiple sequences in a database, and ClustalW for multiple alignments. Alignment algorithms can be based on either exact or approximate methods, and can also be classified as global alignments, semi-global alignments and local alignment. See sequence alignment.

Itemset mining

Some problems in sequence mining lend themselves to discovering frequent itemsets and the order they appear, for example, one is seeking rules of the form "if a {customer buys a car}, he or she is likely to {buy insurance} within 1 week", or in the context of stock prices, "if {Nokia up and Ericsson up}, it is likely that {Motorola up and Samsung up} within 2 days". Traditionally, itemset mining is used in marketing applications for discovering regularities between frequently co-occurring items in large transactions. For example, by analysing transactions of customer shopping baskets in a supermarket, one can produce a rule which reads "if a customer buys onions and potatoes together, he or she is likely to also buy hamburger meat in the same transaction".

A survey and taxonomy of the key algorithms for item set mining is presented by Han et al. (2007).^[5]

The two common techniques that are applied to sequence databases for frequent itemset mining are the influential apriori algorithm and the more-recent FP-growth technique.

Applications

With a great variation of products and user buying behaviors, shelf on which products are being displayed is one of the most important resources in retail environment. Retailers can not only increase their profit but, also decrease cost by proper management of shelf space allocation and products display. To solve this problem, George and Binu (2013) have proposed an approach to mine user buying patterns using PrefixSpan algorithm and place the products on shelves based on the order of mined purchasing patterns.^[6]

Algorithms

Commonly used algorithms include:

GSP algorithm
Sequential Pattern Discovery using Equivalence classes (SPADE)
FreeSpan
PrefixSpan
MAPres^[7]
Seq2Pat (for constraint-based sequential pattern mining)^[8]^[9]

Related Research Articles

In computer science, string-searching algorithms, sometimes called string-matching algorithms, are an important class of string algorithms that try to find a place where one or several strings are found within a larger string or text.

In computer science, a trie, also called digital tree or prefix tree, is a type of k-ary search tree, a tree data structure used for locating specific keys from within a set. These keys are most often strings, with links between nodes defined not by the entire key, but by individual characters. In order to access a key, the trie is traversed depth-first, following the links between nodes, which represent each character in the key.

In bioinformatics, a sequence alignment is a way of arranging the sequences of DNA, RNA, or protein to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences. Aligned sequences of nucleotide or amino acid residues are typically represented as rows within a matrix. Gaps are inserted between the residues so that identical or similar characters are aligned in successive columns. Sequence alignments are also used for non-biological sequences, such as calculating the distance cost between strings in a natural language or in financial data.

A longest common subsequence (LCS) is the longest subsequence common to all sequences in a set of sequences. It differs from the longest common substring: unlike substrings, subsequences are not required to occupy consecutive positions within the original sequences. The problem of computing longest common subsequences is a classic computer science problem, the basis of data comparison programs such as the diff utility, and has applications in computational linguistics and bioinformatics. It is also widely used by revision control systems such as Git for reconciling multiple changes made to a revision-controlled collection of files.

Grammar theory to model symbol strings originated from work in computational linguistics aiming to understand the structure of natural languages. Probabilistic context free grammars (PCFGs) have been applied in probabilistic modeling of RNA structures almost 40 years after they were introduced in computational linguistics.

<span class="mw-page-title-main">Protein structure prediction</span> Type of biological prediction

Protein structure prediction is the inference of the three-dimensional structure of a protein from its amino acid sequence—that is, the prediction of its secondary and tertiary structure from primary structure. Structure prediction is different from the inverse problem of protein design. Protein structure prediction is one of the most important goals pursued by computational biology; and it is important in medicine and biotechnology.

In information theory, linguistics, and computer science, the Levenshtein distance is a string metric for measuring the difference between two sequences. Informally, the Levenshtein distance between two words is the minimum number of single-character edits required to change one word into the other. It is named after the Soviet mathematician Vladimir Levenshtein, who considered this distance in 1965.

In computational linguistics and computer science, edit distance is a string metric, i.e. a way of quantifying how dissimilar two strings are to one another, that is measured by counting the minimum number of operations required to transform one string into the other. Edit distances find applications in natural language processing, where automatic spelling correction can determine candidate corrections for a misspelled word by selecting words from a dictionary that have a low distance to the word in question. In bioinformatics, it can be used to quantify the similarity of DNA sequences, which can be viewed as strings of the letters A, C, G and T.

In biology, a sequence motif is a nucleotide or amino-acid sequence pattern that is widespread and usually assumed to be related to biological function of the macromolecule. For example, an N-glycosylation site motif can be defined as Asn, followed by anything but Pro, followed by either Ser or Thr, followed by anything but Pro residue.

Association rule learning is a rule-based machine learning method for discovering interesting relations between variables in large databases. It is intended to identify strong rules discovered in databases using some measures of interestingness. In any given transaction with a variety of items, association rules are meant to discover the rules that determine how or why certain items are connected.

In computer science, a suffix tree is a compressed trie containing all the suffixes of the given text as their keys and positions in the text as their values. Suffix trees allow particularly fast implementations of many important string operations.

<span class="mw-page-title-main">Substring</span> Contiguous part of a sequence of symbols

In formal language theory and computer science, a substring is a contiguous sequence of characters within a string. For instance, "the best of" is a substring of "It was the best of times". In contrast, "Itwastimes" is a subsequence of "It was the best of times", but not a substring.

Grammar-based codes or Grammar-based compression are compression algorithms based on the idea of constructing a context-free grammar (CFG) for the string to be compressed. Examples include universal lossless data compression algorithms. To compress a data sequence $, a grammar-based code transforms into a context-free grammar . The problem of finding a smallest grammar for an input sequence is known to be NP-hard, so many grammar-transform algorithms are proposed from theoretical and practical viewpoints. Generally, the produced grammar is further compressed by statistical encoders like arithmetic coding.$

In computer science, the longest increasing subsequence problem aims to find a subsequence of a given sequence in which the subsequence's elements are sorted in an ascending order and in which the subsequence is as long as possible. This subsequence is not necessarily contiguous or unique. The longest increasing subsequences are studied in the context of various disciplines related to mathematics, including algorithmics, random matrix theory, representation theory, and physics. The longest increasing subsequence problem is solvable in time $where denotes the length of the input sequence.$

GSP algorithm is an algorithm used for sequence mining. The algorithms for solving sequence mining problems are mostly based on the apriori (level-wise) algorithm. One way to use the level-wise paradigm is to first discover all the frequent items in a level-wise fashion. It simply means counting the occurrences of all singleton elements in the database. Then, the transactions are filtered by removing the non-frequent items. At the end of this step, each transaction consists of only the frequent elements it originally contained. This modified database becomes an input to the GSP algorithm. This process requires one pass over the whole database.

In computer science, a Cartesian tree is a binary tree derived from a sequence of distinct numbers. To construct the Cartesian tree, set its root to be the minimum number in the sequence, and recursively construct its left and right subtrees from the subsequences before and after this number. It is uniquely defined as a min-heap whose symmetric (in-order) traversal returns the original sequence.

In machine learning and data mining, a string kernel is a kernel function that operates on strings, i.e. finite sequences of symbols that need not be of the same length. String kernels can be intuitively understood as functions measuring the similarity of pairs of strings: the more similar two strings a and b are, the higher the value of a string kernel K(a, b) will be.

In computer science, frequent subtree mining is the problem of finding all patterns in a given database whose support is over a given threshold. It is a more general form of the maximum agreement subtree problem.

Frequent pattern discovery is part of knowledge discovery in databases, Massive Online Analysis, and data mining; it describes the task of finding the most frequent and relevant patterns in large datasets. The concept was first introduced for mining transaction databases. Frequent patterns are defined as subsets that appear in a data set with frequency no less than a user-specified or auto-determined threshold.

References

↑ Mabroukeh, N. R.; Ezeife, C. I. (2010). "A taxonomy of sequential pattern mining algorithms". ACM Computing Surveys. 43: 1–41. CiteSeerX 10.1.1.332.4745 . doi:10.1145/1824795.1824798. S2CID 207180619.
↑ Bechini, A.; Bondielli, A.; Dell'Oglio, P.; Marcellonii, F. (2023). "From basic approaches to novel challenges and applications in Sequential Pattern Mining". Applied Computing and Intelligence. 3 (1): 44–78. doi: 10.3934/aci.2023004 .
↑ Tax, N.; Sidorova, N.; Haakma, R.; van der Aalst, Wil M. P. (2016). "Mining Local Process Models". Journal of Innovation in Digital Ecosystems. 3 (2): 183–196. arXiv: 1606.06066 . doi:10.1016/j.jides.2016.11.001. S2CID 10872379.
↑ Abouelhoda, M.; Ghanem, M. (2010). "String Mining in Bioinformatics". In Gaber, M. M. (ed.). Scientific Data Mining and Knowledge Discovery. Springer. doi:10.1007/978-3-642-02788-8_9. ISBN 978-3-642-02787-1.
↑ Han, J.; Cheng, H.; Xin, D.; Yan, X. (2007). "Frequent pattern mining: current status and future directions". Data Mining and Knowledge Discovery. 15 (1): 55–86. doi: 10.1007/s10618-006-0059-1 .
↑ George, A.; Binu, D. (2013). "An Approach to Products Placement in Supermarkets Using PrefixSpan Algorithm". Journal of King Saud University-Computer and Information Sciences. 25 (1): 77–87. doi: 10.1016/j.jksuci.2012.07.001 .
↑ Ahmad, Ishtiaq; Qazi, Wajahat M.; Khurshid, Ahmed; Ahmad, Munir; Hoessli, Daniel C.; Khawaja, Iffat; Choudhary, M. Iqbal; Shakoori, Abdul R.; Nasir-ud-Din (1 May 2008). "MAPRes: Mining association patterns among preferred amino acid residues in the vicinity of amino acids targeted for post-translational modifications". Proteomics. 8 (10): 1954–1958. doi:10.1002/pmic.200700657. PMID 18491291. S2CID 22362167.
↑ Hosseininasab A, van Hoeve WJ, Cire AA (2019). "Constraint-Based Sequential Pattern Mining with Decision Diagrams". Proceedings of the AAAI Conference on Artificial Intelligence. 33: 1495–1502. arXiv: 1811.06086 . doi: 10.1609/aaai.v33i01.33011495 . S2CID 53427299.
↑ "Seq2Pat: Sequence-to-Pattern Generation Library". GitHub . 9 April 2022.

External links

SPMF includes open-source implementations of GSP, PrefixSpan, SPADE, SPAM and many others.

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.

[1] Mabroukeh, N. R.; Ezeife, C. I. (2010). "A taxonomy of sequential pattern mining algorithms". ACM Computing Surveys. 43: 1–41. CiteSeerX 10.1.1.332.4745 . doi:10.1145/1824795.1824798. S2CID 207180619.

[2] Bechini, A.; Bondielli, A.; Dell'Oglio, P.; Marcellonii, F. (2023). "From basic approaches to novel challenges and applications in Sequential Pattern Mining". Applied Computing and Intelligence. 3 (1): 44–78. doi: 10.3934/aci.2023004 .

[3] Tax, N.; Sidorova, N.; Haakma, R.; van der Aalst, Wil M. P. (2016). "Mining Local Process Models". Journal of Innovation in Digital Ecosystems. 3 (2): 183–196. arXiv: 1606.06066 . doi:10.1016/j.jides.2016.11.001. S2CID 10872379.

[4] Abouelhoda, M.; Ghanem, M. (2010). "String Mining in Bioinformatics". In Gaber, M. M. (ed.). Scientific Data Mining and Knowledge Discovery. Springer. doi:10.1007/978-3-642-02788-8_9. ISBN 978-3-642-02787-1.

[5] Han, J.; Cheng, H.; Xin, D.; Yan, X. (2007). "Frequent pattern mining: current status and future directions". Data Mining and Knowledge Discovery. 15 (1): 55–86. doi: 10.1007/s10618-006-0059-1 .

[6] George, A.; Binu, D. (2013). "An Approach to Products Placement in Supermarkets Using PrefixSpan Algorithm". Journal of King Saud University-Computer and Information Sciences. 25 (1): 77–87. doi: 10.1016/j.jksuci.2012.07.001 .

[7] Ahmad, Ishtiaq; Qazi, Wajahat M.; Khurshid, Ahmed; Ahmad, Munir; Hoessli, Daniel C.; Khawaja, Iffat; Choudhary, M. Iqbal; Shakoori, Abdul R.; Nasir-ud-Din (1 May 2008). "MAPRes: Mining association patterns among preferred amino acid residues in the vicinity of amino acids targeted for post-translational modifications". Proteomics. 8 (10): 1954–1958. doi:10.1002/pmic.200700657. PMID 18491291. S2CID 22362167.

[hosseininasab2019-8] Hosseininasab A, van Hoeve WJ, Cire AA (2019). "Constraint-Based Sequential Pattern Mining with Decision Diagrams". Proceedings of the AAAI Conference on Artificial Intelligence. 33: 1495–1502. arXiv: 1811.06086 . doi: 10.1609/aaai.v33i01.33011495 . S2CID 53427299.

[9] "Seq2Pat: Sequence-to-Pattern Generation Library". GitHub . 9 April 2022.

[1]

[2]

[3]

[4]

[5]

[6]

[7]

[8]

[9]

v t e Strings
String metric	Approximate string matching Bitap algorithm Damerau–Levenshtein distance Edit distance Gestalt pattern matching Hamming distance Jaro–Winkler distance Lee distance Levenshtein automaton Levenshtein distance Wagner–Fischer algorithm
String-searching algorithm	Apostolico–Giancarlo algorithm Boyer–Moore string-search algorithm Boyer–Moore–Horspool algorithm Knuth–Morris–Pratt algorithm Rabin–Karp algorithm Raita algorithm Trigram search Two-way string-matching algorithm Zhu–Takaoka string matching algorithm
Multiple string searching	Aho–Corasick Commentz-Walter algorithm
Regular expression	Comparison of regular-expression engines Regular grammar Thompson's construction Nondeterministic finite automaton
Sequence alignment	BLAST Hirschberg's algorithm Needleman–Wunsch algorithm Smith–Waterman algorithm
Data structure	DAFSA Suffix array Suffix automaton Suffix tree Generalized suffix tree Rope Ternary search tree Trie
Other	Parsing Pattern matching Compressed pattern matching Longest common subsequence Longest common substring Sequential pattern mining Sorting String rewriting systems String operations