Granular computing is an emerging computing paradigm of information processing that concerns the processing of complex information entities called "information granules", which arise in the process of data abstraction and derivation of knowledge from information or data. Generally speaking, information granules are collections of entities that usually originate at the numeric level and are arranged together due to their similarity, functional or physical adjacency, indistinguishability, coherency, or the like.
At present, granular computing is more a theoretical perspective than a coherent set of methods or principles. As a theoretical perspective, it encourages an approach to data that recognizes and exploits the knowledge present in data at various levels of resolution or scales. In this sense, it encompasses all methods which provide flexibility and adaptability in the resolution at which knowledge or information is extracted and represented.
As mentioned above, granular computing is not an algorithm or process; there is no particular method that is called "granular computing". It is rather an approach to looking at data that recognizes how different and interesting regularities in the data can appear at different levels of granularity, much as different features become salient in satellite images of greater or lesser resolution. On a low-resolution satellite image, for example, one might notice interesting cloud patterns representing cyclones or other large-scale weather phenomena, while in a higher-resolution image, one misses these large-scale atmospheric phenomena but instead notices smaller-scale phenomena, such as the interesting pattern that is the streets of Manhattan. The same is generally true of all data: At different resolutions or granularities, different features and relationships emerge. The aim of granular computing is to try to take advantage of this fact in designing more effective machine-learning and reasoning systems.
There are several types of granularity that are often encountered in data mining and machine learning, and we review them below:
One type of granulation is the quantization of variables. It is very common that in data mining or machine-learning applications the resolution of variables needs to be decreased in order to extract meaningful regularities. An example of this would be a variable such as "outside temperature" (temp), which in a given application might be recorded to several decimal places of precision (depending on the sensing apparatus). However, for purposes of extracting relationships between "outside temperature" and, say, "number of health-club applications" (club), it will generally be advantageous to quantize "outside temperature" into a smaller number of intervals.
There are several interrelated reasons for granulating variables in this fashion:
For example, a simple learner or pattern recognition system may seek to extract regularities satisfying a conditional probability threshold such as In the special case where this recognition system is essentially detecting logical implication of the form or, in words, "if then ". The system's ability to recognize such implications (or, in general, conditional probabilities exceeding threshold) is partially contingent on the resolution with which the system analyzes the variables.
As an example of this last point, consider the feature space shown to the right. The variables may each be regarded at two different resolutions. Variable may be regarded at a high (quaternary) resolution wherein it takes on the four values or at a lower (binary) resolution wherein it takes on the two values Similarly, variable may be regarded at a high (quaternary) resolution or at a lower (binary) resolution, where it takes on the values or respectively. At the high resolution, there are no detectable implications of the form since every is associated with more than one and thus, for all However, at the low (binary) variable resolution, two bilateral implications become detectable: and , since every occurs iff and occurs iff Thus, a pattern recognition system scanning for implications of this kind would find them at the binary variable resolution, but would fail to find them at the higher quaternary variable resolution.
It is not feasible to exhaustively test all possible discretization resolutions on all variables in order to see which combination of resolutions yields interesting or significant results. Instead, the feature space must be preprocessed (often by an entropy analysis of some kind) so that some guidance can be given as to how the discretization process should proceed. Moreover, one cannot generally achieve good results by naively analyzing and discretizing each variable independently, since this may obliterate the very interactions that we had hoped to discover.
A sample of papers that address the problem of variable discretization in general, and multiple-variable discretization in particular, is as follows: Chiu, Wong & Cheung (1991), Bay (2001), Liu et al. (2002), Wang & Liu (1998), Zighed, Rabaséda & Rakotomalala (1998), Catlett (1991), Dougherty, Kohavi & Sahami (1995), Monti & Cooper (1999), Fayyad & Irani (1993), Chiu, Cheung & Wong (1990), Nguyen & Nguyen (1998), Grzymala-Busse & Stefanowski (2001), Ting (1994), Ludl & Widmer (2000), Pfahringer (1995), An & Cercone (1999), Chiu & Cheung (1989), Chmielewski & Grzymala-Busse (1996), Lee & Shin (1994), Liu & Wellman (2002), Liu & Wellman (2004).
Variable granulation is a term that could describe a variety of techniques, most of which are aimed at reducing dimensionality, redundancy, and storage requirements. We briefly describe some of the ideas here, and present pointers to the literature.
A number of classical methods, such as principal component analysis, multidimensional scaling, factor analysis, and structural equation modeling, and their relatives, fall under the genus of "variable transformation." Also in this category are more modern areas of study such as dimensionality reduction, projection pursuit, and independent component analysis. The common goal of these methods in general is to find a representation of the data in terms of new variables, which are a linear or nonlinear transformation of the original variables, and in which important statistical relationships emerge. The resulting variable sets are almost always smaller than the original variable set, and hence these methods can be loosely said to impose a granulation on the feature space. These dimensionality reduction methods are all reviewed in the standard texts, such as Duda, Hart & Stork (2001), Witten & Frank (2005), and Hastie, Tibshirani & Friedman (2001).
A different class of variable granulation methods derive more from data clustering methodologies than from the linear systems theory informing the above methods. It was noted fairly early that one may consider "clustering" related variables in just the same way that one considers clustering related data. In data clustering, one identifies a group of similar entities (using a "measure of similarity" suitable to the domain — Martino, Giuliani & Rizzi (2018)), and then in some sense replaces those entities with a prototype of some kind. The prototype may be the simple average of the data in the identified cluster, or some other representative measure. But the key idea is that in subsequent operations, we may be able to use the single prototype for the data cluster (along with perhaps a statistical model describing how exemplars are derived from the prototype) to stand in for the much larger set of exemplars. These prototypes are generally such as to capture most of the information of interest concerning the entities.
Similarly, it is reasonable to ask whether a large set of variables might be aggregated into a smaller set of prototype variables that capture the most salient relationships between the variables. Although variable clustering methods based on linear correlation have been proposed (Duda, Hart & Stork 2001;Rencher 2002), more powerful methods of variable clustering are based on the mutual information between variables. Watanabe has shown (Watanabe 1960;Watanabe 1969) that for any set of variables one can construct a polytomic (i.e., n-ary) tree representing a series of variable agglomerations in which the ultimate "total" correlation among the complete variable set is the sum of the "partial" correlations exhibited by each agglomerating subset (see figure). Watanabe suggests that an observer might seek to thus partition a system in such a way as to minimize the interdependence between the parts "... as if they were looking for a natural division or a hidden crack."
One practical approach to building such a tree is to successively choose for agglomeration the two variables (either atomic variables or previously agglomerated variables) which have the highest pairwise mutual information ( Kraskov et al. 2003 ). The product of each agglomeration is a new (constructed) variable that reflects the local joint distribution of the two agglomerating variables, and thus possesses an entropy equal to their joint entropy. (From a procedural standpoint, this agglomeration step involves replacing two columns in the attribute-value table—representing the two agglomerating variables—with a single column that has a unique value for every unique combination of values in the replaced columns ( Kraskov et al. 2003 ). No information is lost by such an operation; however, if one is exploring the data for inter-variable relationships, it would generally not be desirable to merge redundant variables in this way, since in such a context it is likely to be precisely the redundancy or dependency between variables that is of interest; and once redundant variables are merged, their relationship to one another can no longer be studied.
In database systems, aggregations (see e.g. OLAP aggregation and Business intelligence systems) result in transforming original data tables (often called information systems) into the tables with different semantics of rows and columns, wherein the rows correspond to the groups (granules) of original tuples and the columns express aggregated information about original values within each of the groups. Such aggregations are usually based on SQL and its extensions. The resulting granules usually correspond to the groups of original tuples with the same values (or ranges) over some pre-selected original columns.
There are also other approaches wherein the groups are defined basing on, e.g., physical adjacency of rows. For example, Infobright implemented a database engine wherein data was partitioned onto rough rows, each consisting of 64K of physically consecutive (or almost consecutive) rows. Rough rows were automatically labeled with compact information about their values on data columns, often involving multi-column and multi-table relationships. It resulted in a higher layer of granulated information where objects corresponded to rough rows and attributes - to various aspects of rough information. Database operations could be efficiently supported within such a new framework, with an access to the original data pieces still available ( Slezak et al. 2013 ).
The origins of the granular computing ideology are to be found in the rough sets and fuzzy sets literatures. One of the key insights of rough set research—although by no means unique to it—is that, in general, the selection of different sets of features or variables will yield different concept granulations. Here, as in elementary rough set theory, by "concept" we mean a set of entities that are indistinguishable or indiscernible to the observer (i.e., a simple concept), or a set of entities that is composed from such simple concepts (i.e., a complex concept). To put it in other words, by projecting a data set (value-attribute system) onto different sets of variables, we recognize alternative sets of equivalence-class "concepts" in the data, and these different sets of concepts will in general be conducive to the extraction of different relationships and regularities.
We illustrate with an example. Consider the attribute-value system below:
Object | |||||
---|---|---|---|---|---|
1 | 2 | 0 | 1 | 1 | |
1 | 2 | 0 | 1 | 1 | |
2 | 0 | 0 | 1 | 0 | |
0 | 0 | 1 | 2 | 1 | |
2 | 1 | 0 | 2 | 1 | |
0 | 0 | 1 | 2 | 2 | |
2 | 0 | 0 | 1 | 0 | |
0 | 1 | 2 | 2 | 1 | |
2 | 1 | 0 | 2 | 2 | |
2 | 0 | 0 | 1 | 0 |
When the full set of attributes is considered, we see that we have the following seven equivalence classes or primitive (simple) concepts:
Thus, the two objects within the first equivalence class, cannot be distinguished from one another based on the available attributes, and the three objects within the second equivalence class, cannot be distinguished from one another based on the available attributes. The remaining five objects are each discernible from all other objects. Now, let us imagine a projection of the attribute value system onto attribute alone, which would represent, for example, the view from an observer which is only capable of detecting this single attribute. Then we obtain the following much coarser equivalence class structure.
This is in a certain regard the same structure as before, but at a lower degree of resolution (larger grain size). Just as in the case of value granulation (discretization/quantization), it is possible that relationships (dependencies) may emerge at one level of granularity that are not present at another. As an example of this, we can consider the effect of concept granulation on the measure known as attribute dependency (a simpler relative of the mutual information).
To establish this notion of dependency (see also rough sets), let represent a particular concept granulation, where each is an equivalence class from the concept structure induced by attribute set Q. For example, if the attribute set Q consists of attribute alone, as above, then the concept structure will be composed of
The dependency of attribute set Q on another attribute set P, is given by
That is, for each equivalence class in we add up the size of its "lower approximation" (see rough sets) by the attributes in P, i.e., More simply, this approximation is the number of objects which on attribute set P can be positively identified as belonging to target set Added across all equivalence classes in the numerator above represents the total number of objects which—based on attribute set P—can be positively categorized according to the classification induced by attributes Q. The dependency ratio therefore expresses the proportion (within the entire universe) of such classifiable objects, in a sense capturing the "synchronization" of the two concept structures and The dependency "can be interpreted as a proportion of such objects in the information system for which it suffices to know the values of attributes in P to determine the values of attributes in Q" (Ziarko & Shan 1995).
Having gotten definitions now out of the way, we can make the simple observation that the choice of concept granularity (i.e., choice of attributes) will influence the detected dependencies among attributes. Consider again the attribute value table from above:
Object | |||||
---|---|---|---|---|---|
1 | 2 | 0 | 1 | 1 | |
1 | 2 | 0 | 1 | 1 | |
2 | 0 | 0 | 1 | 0 | |
0 | 0 | 1 | 2 | 1 | |
2 | 1 | 0 | 2 | 1 | |
0 | 0 | 1 | 2 | 2 | |
2 | 0 | 0 | 1 | 0 | |
0 | 1 | 2 | 2 | 1 | |
2 | 1 | 0 | 2 | 2 | |
2 | 0 | 0 | 1 | 0 |
Consider the dependency of attribute set on attribute set That is, we wish to know what proportion of objects can be correctly classified into classes of based on knowledge of The equivalence classes of and of are shown below.
The objects that can be definitively categorized according to concept structure based on are those in the set and since there are six of these, the dependency of Q on P, This might be considered an interesting dependency in its own right, but perhaps in a particular data mining application only stronger dependencies are desired.
We might then consider the dependency of the smaller attribute set on the attribute set The move from to induces a coarsening of the class structure as will be seen shortly. We wish again to know what proportion of objects can be correctly classified into the (now larger) classes of based on knowledge of The equivalence classes of the new and of are shown below.
Clearly, has a coarser granularity than it did earlier. The objects that can now be definitively categorized according to the concept structure based on constitute the complete universe , and thus the dependency of Q on P, That is, knowledge of membership according to category set is adequate to determine category membership in with complete certainty; In this case we might say that Thus, by coarsening the concept structure, we were able to find a stronger (deterministic) dependency. However, we also note that the classes induced in from the reduction in resolution necessary to obtain this deterministic dependency are now themselves large and few in number; as a result, the dependency we found, while strong, may be less valuable to us than the weaker dependency found earlier under the higher resolution view of
In general it is not possible to test all sets of attributes to see which induced concept structures yield the strongest dependencies, and this search must be therefore be guided with some intelligence. Papers which discuss this issue, and others relating to intelligent use of granulation, are those by Y.Y. Yao and Lotfi Zadeh listed in the #References below.
Another perspective on concept granulation may be obtained from work on parametric models of categories. In mixture model learning, for example, a set of data is explained as a mixture of distinct Gaussian (or other) distributions. Thus, a large amount of data is "replaced" by a small number of distributions. The choice of the number of these distributions, and their size, can again be viewed as a problem of concept granulation. In general, a better fit to the data is obtained by a larger number of distributions or parameters, but in order to extract meaningful patterns, it is necessary to constrain the number of distributions, thus deliberately coarsening the concept resolution. Finding the "right" concept resolution is a tricky problem for which many methods have been proposed (e.g., AIC, BIC, MDL, etc.), and these are frequently considered under the rubric of "model regularization".
Granular computing can be conceived as a framework of theories, methodologies, techniques, and tools that make use of information granules in the process of problem solving. In this sense, granular computing is used as an umbrella term to cover topics that have been studied in various fields in isolation. By examining all of these existing studies in light of the unified framework of granular computing and extracting their commonalities, it may be possible to develop a general theory for problem solving.
In a more philosophical sense, granular computing can describe a way of thinking that relies on the human ability to perceive the real world under various levels of granularity (i.e., abstraction) in order to abstract and consider only those things that serve a specific interest and to switch among different granularities. By focusing on different levels of granularity, one can obtain different levels of knowledge, as well as a greater understanding of the inherent knowledge structure. Granular computing is thus essential in human problem solving and hence has a very significant impact on the design and implementation of intelligent systems.
Information theory is the mathematical study of the quantification, storage, and communication of information. The field was originally established by the works of Harry Nyquist and Ralph Hartley, in the 1920s, and Claude Shannon in the 1940s. The field, in applied mathematics, is at the intersection of probability theory, statistics, computer science, statistical mechanics, information engineering, and electrical engineering.
In information theory, the entropy of a random variable is the average level of "information", "surprise", or "uncertainty" inherent to the variable's possible outcomes. Given a discrete random variable , which takes values in the alphabet and is distributed according to , the entropy is
A hidden Markov model (HMM) is a Markov model in which the observations are dependent on a latent Markov process. An HMM requires that there be an observable process whose outcomes depend on the outcomes of in a known way. Since cannot be observed directly, the goal is to learn about state of by observing By definition of being a Markov model, an HMM has an additional requirement that the outcome of at time must be "influenced" exclusively by the outcome of at and that the outcomes of and at must be conditionally independent of at given at time Estimation of the parameters in an HMM can be performed using maximum likelihood. For linear chain HMMs, the Baum–Welch algorithm can be used to estimate the parameters.
A Bayesian network is a probabilistic graphical model that represents a set of variables and their conditional dependencies via a directed acyclic graph (DAG). While it is one of several forms of causal notation, causal networks are special cases of Bayesian networks. Bayesian networks are ideal for taking an event that occurred and predicting the likelihood that any one of several possible known causes was the contributing factor. For example, a Bayesian network could represent the probabilistic relationships between diseases and symptoms. Given symptoms, the network can be used to compute the probabilities of the presence of various diseases.
Factor analysis is a statistical method used to describe variability among observed, correlated variables in terms of a potentially lower number of unobserved variables called factors. For example, it is possible that variations in six observed variables mainly reflect the variations in two unobserved (underlying) variables. Factor analysis searches for such joint variations in response to unobserved latent variables. The observed variables are modelled as linear combinations of the potential factors plus "error" terms, hence factor analysis can be thought of as a special case of errors-in-variables models.
In applied mathematics, discretization is the process of transferring continuous functions, models, variables, and equations into discrete counterparts. This process is usually carried out as a first step toward making them suitable for numerical evaluation and implementation on digital computers. Dichotomization is the special case of discretization in which the number of discrete classes is 2, which can approximate a continuous variable as a binary variable.
A commitment scheme is a cryptographic primitive that allows one to commit to a chosen value while keeping it hidden to others, with the ability to reveal the committed value later. Commitment schemes are designed so that a party cannot change the value or statement after they have committed to it: that is, commitment schemes are binding. Commitment schemes have important applications in a number of cryptographic protocols including secure coin flipping, zero-knowledge proofs, and secure computation.
In statistics, Gibbs sampling or a Gibbs sampler is a Markov chain Monte Carlo (MCMC) algorithm for sampling from a specified multivariate probability distribution when direct sampling from the joint distribution is difficult, but sampling from the conditional distribution is more practical. This sequence can be used to approximate the joint distribution ; to approximate the marginal distribution of one of the variables, or some subset of the variables ; or to compute an integral. Typically, some of the variables correspond to observations whose values are known, and hence do not need to be sampled.
Decision tree learning is a supervised learning approach used in statistics, data mining and machine learning. In this formalism, a classification or regression decision tree is used as a predictive model to draw conclusions about a set of observations.
Association rule learning is a rule-based machine learning method for discovering interesting relations between variables in large databases. It is intended to identify strong rules discovered in databases using some measures of interestingness. In any given transaction with a variety of items, association rules are meant to discover the rules that determine how or why certain items are connected.
In statistical classification, two main approaches are called the generative approach and the discriminative approach. These compute classifiers by different approaches, differing in the degree of statistical modelling. Terminology is inconsistent, but three major types can be distinguished, following Jebara (2004):
In computer science, a rough set, first described by Polish computer scientist Zdzisław I. Pawlak, is a formal approximation of a crisp set in terms of a pair of sets which give the lower and the upper approximation of the original set. In the standard version of rough set theory, the lower- and upper-approximation sets are crisp sets, but in other variations, the approximating sets may be fuzzy sets.
This glossary of statistics and probability is a list of definitions of terms and concepts used in the mathematical sciences of statistics and probability, their sub-disciplines, and related fields. For additional related terms, see Glossary of mathematics and Glossary of experimental design.
In probability theory and statistics Chow–Liu tree is an efficient method for constructing a second-order product approximation of a joint probability distribution, first described in a paper by Chow & Liu (1968). The goals of such a decomposition, as with such Bayesian networks in general, may be either data compression or inference.
The dominance-based rough set approach (DRSA) is an extension of rough set theory for multi-criteria decision analysis (MCDA), introduced by Greco, Matarazzo and Słowiński. The main change compared to the classical rough sets is the substitution for the indiscernibility relation by a dominance relation, which permits one to deal with inconsistencies typical to consideration of criteria and preference-ordered decision classes.
Algorithmic inference gathers new developments in the statistical inference methods made feasible by the powerful computing devices widely available to any data analyst. Cornerstones in this field are computational learning theory, granular computing, bioinformatics, and, long ago, structural probability . The main focus is on the algorithms which compute statistics rooting the study of a random phenomenon, along with the amount of data they must feed on to produce reliable results. This shifts the interest of mathematicians from the study of the distribution laws to the functional properties of the statistics, and the interest of computer scientists from the algorithms for processing data to the information they process.
The Mivar-based approach is a mathematical tool for designing artificial intelligence (AI) systems. Mivar was developed by combining production and Petri nets. The Mivar-based approach was developed for semantic analysis and adequate representation of humanitarian epistemological and axiological principles in the process of developing artificial intelligence. The Mivar-based approach incorporates computer science, informatics and discrete mathematics, databases, expert systems, graph theory, matrices and inference systems. The Mivar-based approach involves two technologies:
Fairness in machine learning refers to the various attempts at correcting algorithmic bias in automated decision processes based on machine learning models. Decisions made by computers after a machine-learning process may be considered unfair if they were based on variables considered sensitive. For example gender, ethnicity, sexual orientation or disability. As it is the case with many ethical concepts, definitions of fairness and bias are always controversial. In general, fairness and bias are considered relevant when the decision process impacts people's lives. In machine learning, the problem of algorithmic bias is well known and well studied. Outcomes may be skewed by a range of factors and thus might be considered unfair with respect to certain groups or individuals. An example would be the way social media sites deliver personalized news to consumers.
In network theory, link prediction is the problem of predicting the existence of a link between two entities in a network. Examples of link prediction include predicting friendship links among users in a social network, predicting co-authorship links in a citation network, and predicting interactions between genes and proteins in a biological network. Link prediction can also have a temporal aspect, where, given a snapshot of the set of links at time , the goal is to predict the links at time . Link prediction is widely applicable. In e-commerce, link prediction is often a subtask for recommending items to users. In the curation of citation databases, it can be used for record deduplication. In bioinformatics, it has been used to predict protein-protein interactions (PPI). It is also used to identify hidden groups of terrorists and criminals in security related applications.
In network theory, collective classification is the simultaneous prediction of the labels for multiple objects, where each label is predicted using information about the object's observed features, the observed features and labels of its neighbors, and the unobserved labels of its neighbors. Collective classification problems are defined in terms of networks of random variables, where the network structure determines the relationship between the random variables. Inference is performed on multiple random variables simultaneously, typically by propagating information between nodes in the network to perform approximate inference. Approaches that use collective classification can make use of relational information when performing inference. Examples of collective classification include predicting attributes of individuals in a social network, classifying webpages in the World Wide Web, and inferring the research area of a paper in a scientific publication dataset.
{{citation}}
: CS1 maint: location missing publisher (link).{{citation}}
: CS1 maint: location missing publisher (link).{{citation}}
: CS1 maint: location missing publisher (link).{{citation}}
: CS1 maint: location missing publisher (link).{{citation}}
: CS1 maint: location missing publisher (link).{{citation}}
: CS1 maint: location missing publisher (link).