K-optimal pattern discovery

Last updated April 16, 2021

K-optimal pattern discovery is a data mining technique that provides an alternative to the frequent pattern discovery approach that underlies most association rule learning techniques.

Frequent pattern discovery techniques find all patterns for which there are sufficiently frequent examples in the sample data. In contrast, k-optimal pattern discovery techniques find the k patterns that optimize a user-specified measure of interest. The parameter k is also specified by the user.
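The difference between the two approaches can be illustrated with a minimal Python sketch. It rests on several assumptions that are not part of the article: patterns are itemsets over a toy transaction list, the measure of interest is leverage (observed support minus the support expected if the items were independent), and candidates are enumerated exhaustively. Practical k-optimal algorithms prune the search space instead of enumerating it. All function and variable names are hypothetical.

```python
import heapq
from itertools import combinations


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def leverage(itemset, transactions):
    """Observed support minus the support expected under independence.

    Leverage is an assumed choice of measure; any user-specified
    function with the same signature could be substituted.
    """
    expected = 1.0
    for item in itemset:
        expected *= support(frozenset([item]), transactions)
    return support(itemset, transactions) - expected


def candidates(transactions, max_size):
    """Yield every itemset of size 2..max_size (exhaustive, no pruning)."""
    items = sorted(set().union(*transactions))
    for size in range(2, max_size + 1):
        for combo in combinations(items, size):
            yield frozenset(combo)


def frequent_patterns(transactions, min_support, max_size=3):
    """Frequent pattern discovery: keep all patterns above a support threshold."""
    return [p for p in candidates(transactions, max_size)
            if support(p, transactions) >= min_support]


def k_optimal_patterns(transactions, k, measure=leverage, max_size=3):
    """k-optimal pattern discovery: keep the k patterns that maximize `measure`."""
    scored = ((measure(p, transactions), sorted(p))
              for p in candidates(transactions, max_size))
    return heapq.nlargest(k, scored)


# Toy data, invented for illustration only.
transactions = [
    frozenset({"bread", "milk"}),
    frozenset({"bread", "butter"}),
    frozenset({"bread", "milk", "butter"}),
    frozenset({"milk"}),
    frozenset({"bread", "butter", "jam"}),
]

frequent = frequent_patterns(transactions, min_support=0.3)
top_two = k_optimal_patterns(transactions, k=2)
print([sorted(p) for p in frequent])
print([pattern for _, pattern in top_two])
```

On this toy data the itemset {bread, milk} passes the support threshold although its leverage is negative, while {bread, butter, jam} falls below the threshold yet ranks second by leverage. The user chooses k and the measure rather than a minimum support.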

Examples of k-optimal pattern discovery techniques include:

k-optimal classification rule discovery.^[1]
k-optimal subgroup discovery.^[2]
finding the k most interesting patterns using sequential sampling.^[3]
mining top-k frequent closed patterns without minimum support.^[4]
k-optimal rule discovery.^[5]

In contrast to k-optimal rule discovery and frequent pattern mining techniques, subgroup discovery focuses on mining interesting patterns with respect to a specified target property of interest. The target can be a binary, nominal, or numeric attribute,^[6] or a more complex target concept such as a correlation between several variables. Background knowledge,^[7] such as constraints and ontological relations, can often be applied to focus and improve the discovery results.
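A top-k subgroup search for a binary target might be sketched as follows. The sketch assumes weighted relative accuracy (WRAcc), a quality function commonly used in subgroup discovery, as the measure; the article does not prescribe a particular one. The attribute names, the data, and the exhaustive enumeration of conjunctions are likewise hypothetical choices made for illustration.

```python
import heapq
from itertools import combinations


def wracc(rows, conditions, target):
    """Weighted relative accuracy: coverage * (target rate in subgroup - overall rate)."""
    covered = [r for r in rows if all(r[attr] == val for attr, val in conditions)]
    if not covered:
        return 0.0
    coverage = len(covered) / len(rows)
    rate_subgroup = sum(r[target] for r in covered) / len(covered)
    rate_overall = sum(r[target] for r in rows) / len(rows)
    return coverage * (rate_subgroup - rate_overall)


def top_k_subgroups(rows, target, k, max_conditions=2):
    """Return the k conjunctions of attribute=value tests with the highest WRAcc."""
    selectors = sorted({(attr, r[attr]) for r in rows for attr in r if attr != target})
    scored = []
    for size in range(1, max_conditions + 1):
        for conditions in combinations(selectors, size):
            attrs = [attr for attr, _ in conditions]
            if len(set(attrs)) < len(attrs):
                continue  # skip contradictory tests on the same attribute
            scored.append((wracc(rows, conditions, target), conditions))
    return heapq.nlargest(k, scored)


# Hypothetical records; "ill" is the binary target property of interest.
rows = [
    {"smoker": "yes", "age": "old", "ill": 1},
    {"smoker": "yes", "age": "old", "ill": 1},
    {"smoker": "yes", "age": "young", "ill": 1},
    {"smoker": "no", "age": "old", "ill": 0},
    {"smoker": "no", "age": "young", "ill": 0},
    {"smoker": "no", "age": "young", "ill": 0},
    {"smoker": "yes", "age": "young", "ill": 0},
    {"smoker": "no", "age": "old", "ill": 0},
]

best = top_k_subgroups(rows, target="ill", k=2)
for quality, conditions in best:
    print(quality, conditions)
```

Unlike the untargeted setting, every candidate pattern here is scored against the same target attribute, so the result describes the subgroups in which the target deviates most from its overall distribution.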

References

[1] Webb, G. I. (1995). OPUS: An efficient admissible algorithm for unordered search. Journal of Artificial Intelligence Research, 3, 431-465.
[2] Wrobel, S. (1997). An algorithm for multi-relational discovery of subgroups. In Proceedings of the First European Symposium on Principles of Data Mining and Knowledge Discovery. Springer.
[3] Scheffer, T., & Wrobel, S. (2002). Finding the most interesting patterns in a database quickly by using sequential sampling. Journal of Machine Learning Research, 3, 833-862.
[4] Han, J., Wang, J., Lu, Y., & Tzvetkov, P. (2002). Mining top-k frequent closed patterns without minimum support. In Proceedings of the International Conference on Data Mining, pp. 211-218.
[5] Webb, G. I., & Zhang, S. (2005). K-optimal rule discovery. Data Mining and Knowledge Discovery, 10(1), 39-79.
[6] Kloesgen, W. (1996). EXPLORA: A multipattern and multistrategy discovery assistant. In Advances in Knowledge Discovery and Data Mining, pp. 249-271.
[7] Atzmueller, M., Puppe, F., & Buscher, H.-P. (2005). Exploiting background knowledge for knowledge-intensive subgroup discovery. In Proceedings of the 19th International Joint Conference on Artificial Intelligence, pp. 647-652. Morgan Kaufmann Publishers.

External links

"Bringing you the state-of-the-art in Data Science". 2017-05-06. Retrieved 2021-04-14.
Atzmueller, Martin (2015-05-17). "VIKAMINE: Subgroup Discovery and Analytics". VIKAMINE. Retrieved 2021-04-14.

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.
