Synthetic data that feels real -- the standard method for fixing class imbalances in machine learning.
SMOTE stands for Synthetic Minority Over-sampling Technique and was introduced by Chawla et al. in 2002. It is probably the most widely used method to address class imbalances in machine learning datasets. Instead of simply duplicating existing minority examples (which would lead to overfitting), SMOTE generates synthetic but plausible new examples.
The mechanism is elegant: for each example of the minority class, SMOTE finds the k nearest neighbors of the same class. It then picks one of those neighbors at random and generates a new point at a random position on the line segment between the two. The result is a synthetic example that lies in the feature space between existing examples -- plausible, but new.
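The core of this procedure fits in a few lines. Below is a minimal NumPy sketch; the function name smote_sample and its parameters are our own for illustration (not from the original paper), and it assumes k is smaller than the number of minority examples.

```python
import numpy as np

def smote_sample(X_minority, n_synthetic, k=5, rng=None):
    """Generate synthetic minority examples by interpolating between
    each point and one of its k nearest minority-class neighbors."""
    rng = np.random.default_rng(rng)
    n = len(X_minority)

    # Pairwise Euclidean distances within the minority class only
    d = np.linalg.norm(X_minority[:, None] - X_minority[None, :], axis=-1)
    np.fill_diagonal(d, np.inf)                 # a point is not its own neighbor
    neighbors = np.argsort(d, axis=1)[:, :k]    # k nearest neighbors per point

    synthetic = []
    for _ in range(n_synthetic):
        i = rng.integers(n)                     # pick a random minority example
        j = rng.choice(neighbors[i])            # ... and one of its k neighbors
        gap = rng.random()                      # random position on the segment
        synthetic.append(X_minority[i] + gap * (X_minority[j] - X_minority[i]))
    return np.asarray(synthetic)
```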
Ben Kraiem et al. (2023) used SMOTE in their study to correct the class imbalance between Traditional and Agile projects. By generating additional synthetic Agile examples, their Gradient Boosting model was able to learn the differences between the two approaches more robustly.
SMOTE has three decisive advantages over simple oversampling:

1. No duplicates: every synthetic example is a genuinely new point, so the model cannot overfit by memorizing repeated rows.
2. More general decision boundaries: the synthetic points fill the space between existing minority examples, so the classifier learns broader decision regions instead of regions tightly wrapped around individual points.
3. Better generalization on the rare class: the added variety typically improves minority-class recall without altering the majority class at all.
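The first point is easy to verify. A short sketch, assuming scikit-learn and the imbalanced-learn package are installed (dataset and parameters are illustrative): naive random oversampling only copies rows, while SMOTE produces rows the model has never seen.

```python
from collections import Counter
import numpy as np
from sklearn.datasets import make_classification
from imblearn.over_sampling import RandomOverSampler, SMOTE

# Illustrative imbalanced dataset: roughly 5% minority class
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05],
                           random_state=0)
print("original class counts:", Counter(y))

# Naive oversampling: duplicates existing minority rows verbatim
X_dup, y_dup = RandomOverSampler(random_state=0).fit_resample(X, y)
print("unique rows after duplication:", len(np.unique(X_dup, axis=0)),
      "of", len(X_dup))

# SMOTE: every added row is a new point between existing neighbors
X_sm, y_sm = SMOTE(random_state=0).fit_resample(X, y)
print("unique rows after SMOTE:", len(np.unique(X_sm, axis=0)),
      "of", len(X_sm))
```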
There are still limitations: in high-dimensional spaces, nearest-neighbor distances lose their meaning, so SMOTE can generate implausible examples. And for extremely rare events (e.g., 0.01% fraud), SMOTE alone is not enough -- specialized variants like Borderline-SMOTE or ADASYN help here.
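Both variants ship with imbalanced-learn as drop-in replacements for plain SMOTE. A quick sketch (the 1% minority here is illustrative, chosen so the library defaults still find enough neighbors):

```python
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import BorderlineSMOTE, ADASYN

# Illustrative dataset with a 1% minority class
X, y = make_classification(n_samples=5000, weights=[0.99, 0.01],
                           random_state=42)
print("before:", Counter(y))

# Borderline-SMOTE oversamples only minority points near the class
# boundary, where the risk of misclassification is highest
X_bl, y_bl = BorderlineSMOTE(random_state=42).fit_resample(X, y)
print("Borderline-SMOTE:", Counter(y_bl))

# ADASYN shifts the sampling density toward minority points that are
# hardest to learn (those surrounded mostly by majority examples)
X_ad, y_ad = ADASYN(random_state=42).fit_resample(X, y)
print("ADASYN:", Counter(y_ad))
```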
Aversight uses SMOTE techniques as part of its data pipeline, but with an important addition: Instead of pure feature-based interpolation, we work with temporal and structural constraints. A synthetic budget escalation example must be plausible over time -- budget curves do not follow linear interpolation. Therefore, we combine SMOTE with domain-specific rules that ensure generated examples also make sense in the project context.
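Aversight's actual rules are part of its internal pipeline and are not reproduced here. Purely to illustrate the pattern, here is a sketch of constrained interpolation over whole budget curves; the plausibility check (cumulative spend never decreases) and all names are hypothetical assumptions, not the real implementation.

```python
import numpy as np

def is_plausible(curve):
    # Hypothetical domain rule for illustration: cumulative budget
    # spend may stall, but it can never decrease over time
    return np.all(np.diff(curve) >= 0)

def constrained_smote(curves, n_synthetic, k=3, rng=None, max_tries=1000):
    """SMOTE-style interpolation between whole budget curves (one row
    per project), keeping only candidates that pass the domain check."""
    rng = np.random.default_rng(rng)
    d = np.linalg.norm(curves[:, None] - curves[None, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    nbrs = np.argsort(d, axis=1)[:, :k]      # k most similar curves

    out, tries = [], 0
    while len(out) < n_synthetic and tries < max_tries:
        tries += 1
        i = rng.integers(len(curves))
        j = rng.choice(nbrs[i])
        cand = curves[i] + rng.random() * (curves[j] - curves[i])
        if is_plausible(cand):               # reject implausible candidates
            out.append(cand)
    return np.asarray(out)
```

Rejection sampling keeps the generator itself unchanged; all of the project knowledge lives in the plausibility check, which makes such rules easy to extend.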