Synthetic minority oversampling technique


In statistics, the synthetic minority oversampling technique (SMOTE) is a method for oversampling the minority class when dealing with imbalanced classification categories in a dataset. The problem with performing statistical inference and modeling on imbalanced datasets is that the resulting inferences and models are biased towards the majority class. Another way to address the problem of imbalanced data is to undersample the majority class until it is represented in the data at the same rate as the minority class. Instead of undersampling the majority class, SMOTE oversamples the minority class by generating synthetic samples.[1][2]
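The two balancing strategies can be illustrated with a toy sketch (labels only, using naive duplication for the oversampling step; SMOTE itself replaces duplication with interpolation between neighboring minority samples, as described below):

```python
from collections import Counter
import random

random.seed(0)
# Toy imbalanced labels: 90 majority ("neg") vs 10 minority ("pos").
y = ["neg"] * 90 + ["pos"] * 10
neg = [label for label in y if label == "neg"]
pos = [label for label in y if label == "pos"]

# Undersampling: discard majority samples until the classes match.
undersampled = random.sample(neg, len(pos)) + pos

# Naive oversampling: duplicate minority samples until the classes match.
oversampled = neg + pos * (len(neg) // len(pos))

print(Counter(undersampled))  # Counter({'neg': 10, 'pos': 10})
print(Counter(oversampled))   # Counter({'neg': 90, 'pos': 90})
```

Undersampling balances the classes by throwing data away; oversampling keeps all majority samples at the cost of repeating (or, with SMOTE, synthesizing) minority samples.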


Limitations

SMOTE comes with some limitations and challenges:[3]

Variations

Two variations of the SMOTE algorithm were proposed in the initial SMOTE paper:[2]

  - SMOTE-NC (SMOTE-Nominal Continuous), which extends SMOTE to datasets with a mix of nominal and continuous features
  - SMOTE-N (SMOTE-Nominal), which handles datasets in which all features are nominal

Other variations include:[4]

  - Borderline-SMOTE, which only generates synthetic samples from minority samples close to the class boundary[7]
  - ADASYN, which adaptively generates more synthetic samples for minority samples that are harder to learn[6]
  - SMOTE combined with cleaning of the majority class, such as SMOTE with Tomek links or SMOTE with edited nearest neighbours[8]

Algorithm

The SMOTE algorithm can be abstracted with the following pseudocode:[2]

    if N < 100 then
        Randomize the T minority class samples
        T = (N/100) * T
        N = 100
    endif
    N = (int)(N/100)
    k = number of nearest neighbors
    numattrs = number of attributes
    Sample[][]: array for original minority class samples
    newindex: count of synthetic samples generated, initialized to 0
    Synthetic[][]: array for synthetic samples
    for i <- 1 to T
        Compute k nearest neighbors for i, and save the indices in nnarray
        Populate(N, i, nnarray)
    endfor

    Populate(N, i, nnarray):
        while N != 0
            Choose a random number between 1 and k, call it nn
            for attr <- 1 to numattrs
                Compute: dif = Sample[nnarray[nn]][attr] - Sample[i][attr]
                Compute: gap = random number between 0 and 1
                Synthetic[newindex][attr] = Sample[i][attr] + gap * dif
            endfor
            newindex++
            N = N - 1
        endwhile
        return

where T is the number of minority class samples, N is the amount of SMOTE to apply as a percentage of the minority class size (for example, N = 200 generates two synthetic samples per original sample), and k is the number of nearest neighbors considered.

If N is less than 100%, the minority class samples are randomized first, since only a random subset of them will have SMOTE applied.
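The pseudocode above translates fairly directly into NumPy. The following is a minimal from-scratch sketch (the function and variable names are illustrative, not from the paper, and it computes all pairwise distances directly rather than using a spatial index):

```python
import numpy as np

def smote(samples, n_percent, k, rng=None):
    """Generate synthetic minority samples by interpolating between each
    sample and one of its k nearest neighbors, following the pseudocode."""
    rng = np.random.default_rng(rng)
    samples = np.asarray(samples, dtype=float)
    T = len(samples)

    # If N < 100%, SMOTE only a random subset of the minority samples.
    if n_percent < 100:
        T = int(n_percent / 100 * T)
        samples = rng.permutation(samples)[:T]
        n_percent = 100
    N = int(n_percent / 100)  # synthetic samples per original sample

    # Pairwise distances; for each sample, the indices of its k nearest
    # neighbors (column 0 of the argsort is the sample itself, so skip it).
    dists = np.linalg.norm(samples[:, None, :] - samples[None, :, :], axis=2)
    nnarray = np.argsort(dists, axis=1)[:, 1:k + 1]

    synthetic = []
    for i in range(T):
        for _ in range(N):
            nn = rng.integers(k)                       # pick one neighbor
            dif = samples[nnarray[i][nn]] - samples[i]
            gap = rng.random()                         # uniform in [0, 1)
            synthetic.append(samples[i] + gap * dif)
    return np.array(synthetic)
```

With N = 200 and four minority points, the sketch returns eight synthetic points, each lying on the line segment between an original point and one of its three nearest neighbors.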

Implementations

Since the introduction of the SMOTE method, there have been a number of software implementations, including the imbalanced-learn library for Python, which provides SMOTE and several of its variants.[4]


References

  1. Chawla, N. V.; Bowyer, K. W.; Hall, L. O.; Kegelmeyer, W. P. (2011-06-09). "SMOTE: Synthetic Minority Over-sampling Technique". Journal of Artificial Intelligence Research. 16: 321–357. arXiv:1106.1813. doi:10.1613/jair.953.
  2. Chawla, N. V.; Bowyer, K. W.; Hall, L. O.; Kegelmeyer, W. P. (2002-06-01). "SMOTE: Synthetic Minority Over-sampling Technique". Journal of Artificial Intelligence Research. 16: 321–357. arXiv:1106.1813. doi:10.1613/jair.953. ISSN 1076-9757.
  3. Alkhawaldeh, Ibraheem M.; Albalkhi, Ibrahem; Naswhan, Abdulqadir Jeprel (2023-12-20). "Challenges and limitations of synthetic minority oversampling techniques in machine learning". World Journal of Methodology. 13 (5): 373–378. doi:10.5662/wjm.v13.i5.373. PMC 10789107. PMID 38229946.
  4. "Over-sampling methods — Version 0.13.0". imbalanced-learn.org. Retrieved 2025-07-16.
  5. Elreedy, Dina; Atiya, Amir F. (2019-12-01). "A Comprehensive Analysis of Synthetic Minority Oversampling Technique (SMOTE) for handling class imbalance". Information Sciences. 505: 32–64. doi:10.1016/j.ins.2019.07.070. ISSN 0020-0255.
  6. He, Haibo; Bai, Yang; Garcia, Edwardo A.; Li, Shutao (2008-06-01). "ADASYN: Adaptive synthetic sampling approach for imbalanced learning". 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence). IEEE. pp. 1322–1328. doi:10.1109/ijcnn.2008.4633969. ISBN 978-1-4244-1820-6.
  7. Han, Hui; Wang, Wen-Yuan; Mao, Bing-Huan (2005-08-23). "Borderline-SMOTE: A New Over-Sampling Method in Imbalanced Data Sets Learning". Advances in Intelligent Computing. ICIC'05. Vol. Part I. Berlin, Heidelberg: Springer-Verlag. pp. 878–887. doi:10.1007/11538059_91. ISBN 978-3-540-28226-6.
  8. Batista, Gustavo E. A. P. A.; Prati, Ronaldo C.; Monard, Maria Carolina (2004-06-01). "A study of the behavior of several methods for balancing machine learning training data". SIGKDD Explorations Newsletter. 6 (1): 20–29. doi:10.1145/1007730.1007735. ISSN 1931-0145.