In statistics, the synthetic minority oversampling technique (SMOTE) is a method for oversampling the minority class in datasets with imbalanced classification categories. Statistical inference and models fitted on imbalanced datasets tend to be biased towards the majority class. One way of addressing imbalanced data is to undersample the majority class so that it is represented in the data equally with the minority class. Instead of undersampling the majority class, SMOTE oversamples the minority class by generating synthetic samples. [1] [2]
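For contrast with SMOTE, the undersampling alternative mentioned above can be sketched in a few lines. This is only an illustration using the Python standard library; the function name `random_undersample` and the toy data are assumptions made for the example, not part of any cited source.

```python
import random
from collections import Counter


def random_undersample(features, labels, rng=None):
    """Randomly drop samples so that every class keeps as many
    samples as the smallest class (hypothetical helper)."""
    if rng is None:
        rng = random.Random()
    # Group the feature rows by their class label.
    by_class = {}
    for x, y in zip(features, labels):
        by_class.setdefault(y, []).append(x)
    # Every class is cut down to the size of the rarest class.
    target = min(len(rows) for rows in by_class.values())
    new_features, new_labels = [], []
    for y, rows in by_class.items():
        for x in rng.sample(rows, target):
            new_features.append(x)
            new_labels.append(y)
    return new_features, new_labels


# Toy imbalanced dataset: 95 majority samples (label 0), 5 minority (label 1).
features = [[float(i)] for i in range(100)]
labels = [0] * 95 + [1] * 5

balanced_features, balanced_labels = random_undersample(
    features, labels, rng=random.Random(0)
)
print(Counter(labels))           # 95 majority vs 5 minority samples
print(Counter(balanced_labels))  # 5 samples of each class
```

Undersampling balances the classes by discarding 90 of the 95 majority samples in this toy case, whereas an oversampling method such as SMOTE keeps all original samples and adds new minority ones.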
SMOTE does come with some limitations and challenges: [3]
Two variations to the SMOTE algorithm were proposed in the initial SMOTE paper: [2]
Other variations include: [4]
The SMOTE algorithm can be abstracted with the following pseudocode: [2]
if N < 100; then
    Randomize the T minority class samples
    T = (N/100) ∗ T
    N = 100
endif
N = (int)(N/100)
k = Number of nearest neighbors
numattrs = Number of attributes
Sample[ ][ ]: array for original minority class samples
newindex: keeps a count of number of synthetic samples generated, initialized to 0
Synthetic[ ][ ]: array for synthetic samples
for i <- 1 to T
    Compute k nearest neighbors for i, and save the indices in the nnarray
    Populate(N, i, nnarray)
endfor

Populate(N, i, nnarray):
    while N != 0
        Choose a random number between 1 and k, call it nn
        for attr <- 1 to numattrs
            Compute: dif = Sample[nnarray[nn]][attr] − Sample[i][attr]
            Compute: gap = random number between 0 and 1
            Synthetic[newindex][attr] = Sample[i][attr] + gap ∗ dif
        endfor
        newindex++
        N = N − 1
    endwhile
    return
where
N is the amount of SMOTE, where the amount of SMOTE is assumed to be a multiple of one hundred
T is the number of minority class samples
k is the number of nearest neighbors
Populate() is the generating function for new synthetic minority samples
If N is less than 100%, the minority class samples will be randomized, as only a random subset of them will have SMOTE applied to them.
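The pseudocode can be turned into a short runnable sketch. The version below uses only the Python standard library and follows the pseudocode step by step, including drawing a fresh gap for every attribute. The names `smote`, `nearest_neighbors`, `n_percent` and the optional `rng` argument are assumptions chosen for this illustration, as are the use of Euclidean distance and the choice to search for neighbors among all minority samples even when N is below 100.

```python
import math
import random


def nearest_neighbors(samples, i, k):
    """Indices of the k samples closest to samples[i], excluding i itself.
    Euclidean distance is an assumption of this sketch."""
    others = [j for j in range(len(samples)) if j != i]
    others.sort(key=lambda j: math.dist(samples[i], samples[j]))
    return others[:k]


def smote(samples, n_percent, k, rng=None):
    """Generate synthetic minority samples.

    samples   -- minority class feature vectors (Sample in the pseudocode)
    n_percent -- amount of SMOTE in percent (N in the pseudocode)
    k         -- number of nearest neighbors
    rng       -- optional random.Random, added here for reproducibility
    """
    if rng is None:
        rng = random.Random()
    samples = [list(row) for row in samples]  # copy: shuffling must not touch the input
    if not 1 <= k < len(samples):
        raise ValueError("k must be between 1 and len(samples) - 1")

    t = len(samples)
    if n_percent < 100:
        # Only a random subset of the minority samples gets SMOTE applied.
        rng.shuffle(samples)
        t = int(n_percent * t // 100)
        n_percent = 100
    per_sample = int(n_percent // 100)  # synthetic points per original sample

    synthetic = []
    for i in range(t):
        nnarray = nearest_neighbors(samples, i, k)
        # Populate(N, i, nnarray) from the pseudocode.
        for _ in range(per_sample):
            neighbor = samples[rng.choice(nnarray)]
            new_point = []
            for attr in range(len(samples[i])):
                dif = neighbor[attr] - samples[i][attr]
                gap = rng.random()  # drawn per attribute, as in the pseudocode
                new_point.append(samples[i][attr] + gap * dif)
            synthetic.append(new_point)
    return synthetic


minority = [[1.0, 1.0], [1.5, 2.0], [2.0, 1.5], [3.0, 3.0]]
synthetic = smote(minority, n_percent=200, k=2, rng=random.Random(0))
print(len(synthetic))  # 8: two synthetic points for each of the 4 originals
```

Because each attribute gets its own gap, a synthetic point lies inside the axis-aligned box spanned by a sample and its chosen neighbor rather than exactly on the line segment between them. Drawing a single gap per synthetic sample would place the point on that segment instead.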
Since the introduction of the SMOTE method, there have been a number of software implementations: