Oversampling and undersampling in data analysis

Within statistics, oversampling and undersampling in data analysis are techniques used to adjust the class distribution of a data set (i.e. the ratio between the different classes/categories represented). These terms are used both in statistical sampling and survey design methodology, and in machine learning.

Oversampling and undersampling are opposite and roughly equivalent techniques. There are also more complex oversampling techniques, including the creation of artificial data points with algorithms like the synthetic minority oversampling technique (SMOTE). [1] [2]

Motivation for oversampling and undersampling

Both oversampling and undersampling involve introducing a bias to select more samples from one class than from another, to compensate for an imbalance that is either already present in the data, or likely to develop if a purely random sample were taken. Data imbalance can be of the following types:

  1. Under-representation of a class in one or more important predictor variables. Suppose, to address the question of gender discrimination, we have survey data on salaries within a particular field, e.g., computer software. Women are known to be considerably under-represented in a random sample of software engineers, which would be important when adjusting for other variables such as years employed and current level of seniority. Suppose only 20% of software engineers are women, i.e., males are four times as frequent as females. If we were designing a survey to gather data, we would sample women at four times the rate of men, so that both genders are represented equally in the final sample. (See also Stratified Sampling.)
  2. Under-representation of one class in the outcome (dependent) variable. Suppose we want to predict, from a large clinical dataset, which patients are likely to develop a particular disease (e.g., diabetes). Assume, however, that only 10% of patients go on to develop the disease. Given a large existing dataset, we can then undersample the majority outcome, selecting for each patient who developed the disease an equal number of patients who did not, so that both outcomes are represented equally in the analysis sample.

Oversampling is generally employed more frequently than undersampling, especially when the detailed data have yet to be collected by survey, interview or otherwise. Undersampling is employed much less frequently. An overabundance of already collected data became an issue only in the "Big Data" era, and the reasons to use undersampling are mainly practical and related to resource costs. Specifically, while one needs a suitably large sample size to draw valid statistical conclusions, the data must be cleaned before it can be used. Cleansing typically involves a significant human component, is typically specific to the dataset and the analytical problem, and therefore takes time and money.

For these reasons, one will typically cleanse only as much data as is needed to answer a question with reasonable statistical confidence (see Sample Size), but not more than that.

Oversampling techniques for classification problems

Random oversampling

Random oversampling involves supplementing the training data with multiple copies of samples from the minority class. Oversampling can be done more than once (2x, 3x, 5x, 10x, etc.). This is one of the earliest proposed methods and has also proven to be robust. [3] Instead of duplicating every sample in the minority class, some of them may be randomly chosen with replacement.
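
As a minimal sketch (in Python with NumPy; the 90/10 dataset below is hypothetical), random oversampling can be implemented by drawing minority indices with replacement:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical imbalanced dataset: 90 majority (label 0), 10 minority (label 1).
X = rng.normal(size=(100, 3))
y = np.array([0] * 90 + [1] * 10)

# Draw minority indices with replacement until the classes are balanced.
minority_idx = np.flatnonzero(y == 1)
n_needed = np.sum(y == 0) - len(minority_idx)
extra_idx = rng.choice(minority_idx, size=n_needed, replace=True)

X_balanced = np.vstack([X, X[extra_idx]])
y_balanced = np.concatenate([y, y[extra_idx]])
print(np.bincount(y_balanced))   # [90 90]
```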

SMOTE

There are a number of methods available to oversample a dataset used in a typical classification problem (using a classification algorithm to classify a set of images, given a labelled training set of images). The most common technique is known as SMOTE: Synthetic Minority Over-sampling Technique. [4] However, this technique has been shown to yield poorly calibrated models, with an overestimated probability of belonging to the minority class. [5]

To illustrate how this technique works, consider some training data which has s samples and f features in the feature space of the data. Note that these features, for simplicity, are continuous. As an example, consider a dataset of birds for classification. The feature space for the minority class for which we want to oversample could be beak length, wingspan, and weight (all continuous). To oversample, take a sample x from the minority class and consider its k nearest neighbors in feature space. To create a synthetic data point, pick one of those k neighbors x', take the vector from x to x', and multiply it by a random number λ which lies between 0 and 1. Adding this to the current data point creates the new, synthetic data point x_new = x + λ(x' − x).
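
The interpolation step just described can be sketched as follows. This is an illustrative simplification rather than the reference SMOTE implementation; the function name smote_sample and its parameters are invented for this example, and the minority set X_min is assumed to contain more than k samples:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_sample(X_min, k=5, n_new=100, seed=0):
    """Sketch of SMOTE's interpolation step over minority samples X_min."""
    rng = np.random.default_rng(seed)
    # k + 1 neighbours because each point is its own nearest neighbour.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, neigh = nn.kneighbors(X_min)
    synthetic = np.empty((n_new, X_min.shape[1]))
    for i in range(n_new):
        j = rng.integers(len(X_min))           # pick a minority sample
        nb = X_min[rng.choice(neigh[j, 1:])]   # one of its k nearest neighbours
        lam = rng.random()                     # interpolation factor in [0, 1)
        synthetic[i] = X_min[j] + lam * (nb - X_min[j])
    return synthetic
```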

Many modifications and extensions have been made to the SMOTE method ever since its proposal. [6]

ADASYN

The adaptive synthetic sampling approach, or ADASYN algorithm, [7] builds on the methodology of SMOTE by shifting the importance of the classification boundary toward those minority examples that are difficult to learn. ADASYN uses a weighted distribution over the minority class examples according to their level of difficulty in learning: more synthetic data is generated for minority class examples that are harder to learn.
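
For practical use, the imbalanced-learn package listed under Implementations below provides an ADASYN resampler; a minimal usage sketch on a synthetic 95/5 dataset (the data itself is a placeholder):

```python
from sklearn.datasets import make_classification
from imblearn.over_sampling import ADASYN

# Synthetic two-class problem with roughly a 95/5 class split.
X, y = make_classification(n_samples=1000, weights=[0.95], random_state=0)

# ADASYN generates more synthetic points for minority samples whose
# neighbourhoods are dominated by the majority class.
X_res, y_res = ADASYN(random_state=0).fit_resample(X, y)
```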

Augmentation

Data augmentation in data analysis comprises techniques used to increase the amount of data by adding slightly modified copies of already existing data or newly created synthetic data derived from existing data. It acts as a regularizer and helps reduce overfitting when training a machine learning model. [8] (See: Data augmentation)
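
As a minimal illustration (assuming a single-channel image stored as a NumPy array; the array contents and noise scale are placeholders), two common augmentations are a horizontal flip and additive noise:

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.random((32, 32))   # placeholder for a single-channel training image

# Two common augmentations: a horizontal flip and additive Gaussian noise.
flipped = image[:, ::-1]
noisy = image + rng.normal(scale=0.01, size=image.shape)
```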

Undersampling techniques for classification problems

Random undersampling

Random undersampling randomly removes samples from the majority class, with or without replacement. This is one of the earliest techniques used to alleviate imbalance in a dataset; however, it may increase the variance of the classifier and is very likely to discard useful or important samples. [6]
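
A minimal sketch, analogous to the oversampling example above, on a hypothetical 90/10 dataset:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = np.array([0] * 90 + [1] * 10)   # hypothetical 90/10 dataset

# Keep all minority samples and an equally sized random subset of the
# majority class, drawn without replacement; the rest is discarded.
maj_idx = np.flatnonzero(y == 0)
min_idx = np.flatnonzero(y == 1)
keep = np.concatenate([rng.choice(maj_idx, size=len(min_idx), replace=False),
                       min_idx])
X_balanced, y_balanced = X[keep], y[keep]
print(np.bincount(y_balanced))   # [10 10]
```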

Cluster

Cluster centroids is a method that replaces a cluster of majority-class samples with the centroid of that cluster, as computed by a k-means algorithm; the number of clusters is set by the desired level of undersampling.
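
The idea can be sketched directly with scikit-learn's k-means (imbalanced-learn also provides a ClusterCentroids resampler); the sample counts below are illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X_maj = rng.normal(size=(90, 3))   # hypothetical majority-class samples

# Replace the 90 majority samples with 10 cluster centroids, matching
# the size of a hypothetical 10-sample minority class.
kmeans = KMeans(n_clusters=10, n_init=10, random_state=0).fit(X_maj)
X_maj_reduced = kmeans.cluster_centers_   # shape (10, 3)
```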

Tomek links

Tomek links remove unwanted overlap between classes: majority-class instances belonging to Tomek links are removed until all minimally distanced nearest-neighbor pairs are of the same class. A Tomek link is defined as follows: given an instance pair (x_i, x_j), where x_i belongs to the minority class, x_j belongs to the majority class, and d(x_i, x_j) is the distance between x_i and x_j, the pair (x_i, x_j) is called a Tomek link if there is no instance x_k such that d(x_i, x_k) < d(x_i, x_j) or d(x_j, x_k) < d(x_i, x_j). In this way, if two instances form a Tomek link then either one of these instances is noise or both are near a border. Thus, one can use Tomek links to clean up overlap between classes. By removing overlapping examples, one can establish well-defined clusters in the training set, which can lead to improved classification performance.
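
A minimal sketch of Tomek-link detection via mutual nearest neighbors follows; the function name tomek_links is invented for this example:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def tomek_links(X, y):
    """Return index pairs (i, j) forming Tomek links: mutual nearest
    neighbours that belong to different classes."""
    nn = NearestNeighbors(n_neighbors=2).fit(X)
    _, idx = nn.kneighbors(X)          # idx[:, 0] is the point itself
    nearest = idx[:, 1]                # nearest neighbour of each sample
    links = []
    for i, j in enumerate(nearest):
        # A link requires mutual nearest neighbours with different labels;
        # i < j avoids reporting each pair twice.
        if nearest[j] == i and y[i] != y[j] and i < j:
            links.append((i, int(j)))
    return links
```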

Undersampling with ensemble learning

One study showed that combining undersampling with ensemble learning can achieve better results; see IFME: information filtering by multiple examples with under-sampling in a digital library environment. [9]

Techniques for regression problems

Although sampling techniques have been developed mostly for classification tasks, growing attention is being paid to the problem of imbalanced regression. [10] Adaptations of popular strategies are available, including undersampling, oversampling and SMOTE. [11] [12] Sampling techniques have also been explored in the context of numerical prediction in dependency-oriented data, such as time series forecasting [13] and spatio-temporal forecasting. [14]

Additional techniques

It is possible to combine oversampling and undersampling techniques into a hybrid strategy. Common examples include SMOTE combined with Tomek links, or SMOTE combined with Edited Nearest Neighbors (ENN). Additional ways of learning from imbalanced datasets include weighting training instances, introducing different misclassification costs for positive and negative examples, and bootstrapping. [15]
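
Both hybrids mentioned above are available in the imbalanced-learn package; a minimal usage sketch on a synthetic 90/10 dataset:

```python
from sklearn.datasets import make_classification
from imblearn.combine import SMOTEENN, SMOTETomek

# Synthetic two-class problem with roughly a 90/10 class split.
X, y = make_classification(n_samples=1000, weights=[0.9], random_state=0)

# SMOTE followed by Tomek-link removal.
X_st, y_st = SMOTETomek(random_state=0).fit_resample(X, y)

# SMOTE followed by Edited Nearest Neighbours cleaning.
X_se, y_se = SMOTEENN(random_state=0).fit_resample(X, y)
```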

Implementations

A variety of data re-sampling techniques are implemented in the imbalanced-learn package, [1] which is compatible with the scikit-learn library. Implementations of many SMOTE variants are available in the smote_variants package. [2]

Criticism

Poor models in [the binary classification] setting are often a result of—any combination of—fitting deterministic classifiers, using re-sampling or re-weighting methods to balance class frequencies in the training data and evaluating the model with a score such as accuracy. ... No re-sampling technique will magically generate more information out of the few cases with the rare class.

Tobias Fissler, Christian Lorentzen, Michael Mayer, "Model Comparison and Calibration Assessment: User Guide for Consistent Scoring Functions in Machine Learning and Actuarial Practice", arXiv:2202.12780v3, 2023

Probabilistic machine learning models that aim to model a conditional distribution (through Bayes' rule) will be wrongly calibrated if the natural class distribution is modified during training by applying oversampling or undersampling. [16]

This point can be illustrated with a simple example: assume no predictive variables x and a binary outcome Y for which the proportion of Y = 1 is 0.01 and the proportion of Y = 0 is 0.99. Is a model which learns P(Y = 1) = 0.01 useless, and should it be modified via undersampling or oversampling? The answer is no. Class imbalance is not a problem in itself at all.
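
A short sketch of this argument on synthetic data: with no informative features, a logistic regression recovers the base rate of about 0.01, whereas the same model trained on randomly oversampled data predicts probabilities near 0.5 (the dataset, seed, and sizes below are illustrative):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# No informative features; the true P(Y = 1) is 0.01.
X = rng.normal(size=(20000, 2))
y = (rng.random(20000) < 0.01).astype(int)

model = LogisticRegression().fit(X, y)
print(model.predict_proba(X)[:, 1].mean())      # about 0.01: calibrated

# Randomly oversampling the positives to a 50/50 split inflates
# the predicted probabilities.
pos = np.flatnonzero(y == 1)
extra = rng.choice(pos, size=(y == 0).sum() - len(pos), replace=True)
X_bal = np.vstack([X, X[extra]])
y_bal = np.concatenate([y, y[extra]])

model_bal = LogisticRegression().fit(X_bal, y_bal)
print(model_bal.predict_proba(X)[:, 1].mean())  # near 0.5: miscalibrated
```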

Additionally,

  1. oversampling
  2. undersampling
  3. as well as assigning weights to samples

may be applied by practitioners in multi-class classification or in situations with a very imbalanced cost structure. This might be done in order to achieve "desirable", best performances for each class (potentially measured as precision and recall in each class). Finding the best multi-class classification performance, or the best tradeoff between precision and recall, is, however, inherently a multi-objective optimization problem. Such problems are well known to typically have multiple incomparable Pareto optimal solutions. Oversampling or undersampling, as well as assigning weights to samples, is an implicit way to find a certain Pareto optimum (and it sacrifices the calibration of the estimated probabilities). A more explicit alternative to oversampling or undersampling is to select a Pareto optimum directly.


References

  1. 1 2 "Scikit-learn-contrib/Imbalanced-learn". GitHub . 25 October 2021.
  2. 1 2 "Analyticalmindsltd/Smote_variants". GitHub . 26 October 2021.
  3. Ling, Charles X., and Chenghui Li. "Data mining for direct marketing: Problems and solutions." Kdd. Vol. 98. 1998.
  4. Chawla, N. V.; Bowyer, K. W.; Hall, L. O.; Kegelmeyer, W. P. (2002-06-01). "SMOTE: Synthetic Minority Over-sampling Technique". Journal of Artificial Intelligence Research. 16: 321–357. arXiv: 1106.1813 . doi:10.1613/jair.953. ISSN   1076-9757. S2CID   1554582.
  5. van den Goorbergh, Ruben; van Smeden, Maarten; Timmerman, Dirk; Van Calster, Ben (2022-09-01). "The harm of class imbalance corrections for risk prediction models: illustration and simulation using logistic regression". Journal of the American Medical Informatics Association. 29 (9): 1525–1534. doi:10.1093/jamia/ocac093. ISSN   1527-974X. PMC   9382395 . PMID   35686364.
  6. 1 2 Chawla, Nitesh V.; Herrera, Francisco; Garcia, Salvador; Fernandez, Alberto (2018-04-20). "SMOTE for Learning from Imbalanced Data: Progress and Challenges, Marking the 15-year Anniversary". Journal of Artificial Intelligence Research. 61: 863–905. doi: 10.1613/jair.1.11192 . hdl: 10481/56411 . ISSN   1076-9757.
  7. He, Haibo; Bai, Yang; Garcia, Edwardo A.; Li, Shutao (June 2008). "ADASYN: Adaptive synthetic sampling approach for imbalanced learning" (PDF). 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence). pp. 1322–1328. doi:10.1109/IJCNN.2008.4633969. ISBN   978-1-4244-1820-6. S2CID   1438164 . Retrieved 5 December 2022.
  8. Shorten, Connor; Khoshgoftaar, Taghi M. (2019). "A survey on Image Data Augmentation for Deep Learning". Mathematics and Computers in Simulation. 6. springer: 60. doi: 10.1186/s40537-019-0197-0 .
  9. Zhu, Mingzhu; Xu, Chao; Wu, Yi-Fang Brook (2013-07-22). IFME: information filtering by multiple examples with under-sampling in a digital library environment. ACM. pp. 107–110. doi:10.1145/2467696.2467736. ISBN   9781450320771. S2CID   13279787.
  10. Ribeiro, Rita P.; Moniz, Nuno (2020-09-01). "Imbalanced regression and extreme value prediction". Machine Learning. 109 (9): 1803–1835. doi: 10.1007/s10994-020-05900-9 . ISSN   1573-0565. S2CID   222143074.
  11. Torgo, Luís; Branco, Paula; Ribeiro, Rita P.; Pfahringer, Bernhard (June 2015). "Resampling strategies for regression". Expert Systems. 32 (3): 465–476. doi:10.1111/exsy.12081. S2CID   205129966.
  12. Torgo, Luís; Ribeiro, Rita P.; Pfahringer, Bernhard; Branco, Paula (2013). "SMOTE for Regression". In Correia, Luís; Reis, Luís Paulo; Cascalho, José (eds.). Progress in Artificial Intelligence. Lecture Notes in Computer Science. Vol. 8154. Berlin, Heidelberg: Springer. pp. 378–389. doi:10.1007/978-3-642-40669-0_33. hdl: 10289/8518 . ISBN   978-3-642-40669-0. S2CID   16253787.
  13. Moniz, Nuno; Branco, Paula; Torgo, Luís (2017-05-01). "Resampling strategies for imbalanced time series forecasting". International Journal of Data Science and Analytics. 3 (3): 161–181. doi: 10.1007/s41060-017-0044-3 . ISSN   2364-4168. S2CID   25975914.
  14. Oliveira, Mariana; Moniz, Nuno; Torgo, Luís; Santos Costa, Vítor (2021-09-01). "Biased resampling strategies for imbalanced spatio-temporal forecasting". International Journal of Data Science and Analytics. 12 (3): 205–228. doi:10.1007/s41060-021-00256-2. ISSN   2364-4168. S2CID   210931099.
  15. Haibo He; Garcia, E.A. (2009). "Learning from Imbalanced Data". IEEE Transactions on Knowledge and Data Engineering. 21 (9): 1263–1284. doi:10.1109/TKDE.2008.239. S2CID   206742563.
  16. "Imbalance correction led to models with strong miscalibration without better ability to distinguish between patients with and without the outcome event. The inaccurate probability estimates reduce the clinical utility of the model, because decisions about treatment are ill-informed.", The harm of class imbalance corrections for risk prediction models: illustration and simulation using logistic regression, 2022, Ruben van den Goorbergh, Maarten van Smeden, Dirk Timmerman, Ben Van Calster https://doi.org/10.1093/jamia/ocac093
  17. Encyclopedia of Machine Learning. (2011). Deutschland: Springer. Page 193, https://books.google.com/books?id=i8hQhp1a62UC&pg=PT193
  18. Elor, Yotam; Averbuch-Elor, Hadar (2022). "To SMOTE, or not to SMOTE?". arXiv: 2201.08528v3 [cs.LG].
  19. Guillaume Lemaitre EuroSciPy 2023 - Get the best from your scikit-learn classifier https://www.youtube.com/watch?v=6YnhoCfArQo