Glossary

Imbalanced Data

When 95% of your projects are green and only 5% are red, the model learns to always say "green." That is not a bug -- it is math.

Definition

Imbalanced data describes a machine learning problem in which the target classes in a dataset are very unevenly distributed. A classic example: in a portfolio of 100 projects, 95 are stable and 5 are critical. A model that always predicts "stable" achieves 95% accuracy -- and is still worthless, because it does not flag a single critical project. Its recall on the critical class is 0%.
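The 95/5 example can be checked in a few lines. The following is a minimal sketch in plain Python; the label names and helper functions are illustrative and not part of any particular library.

```python
# Toy portfolio from the definition: 95 stable projects, 5 critical ones.
y_true = ["stable"] * 95 + ["critical"] * 5

# A "model" that ignores its input and always predicts the majority class.
y_pred = ["stable"] * len(y_true)


def accuracy(y_true, y_pred):
    """Share of all predictions that match the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def recall(y_true, y_pred, positive):
    """Share of truly positive cases that the model actually found."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    return tp / (tp + fn) if (tp + fn) else 0.0


acc = accuracy(y_true, y_pred)
rec = recall(y_true, y_pred, positive="critical")

print(f"accuracy: {acc:.2f}")              # accuracy: 0.95
print(f"recall on 'critical': {rec:.2f}")  # recall on 'critical': 0.00
```

Accuracy rewards the model for the 95 easy cases, while recall on the critical class shows that none of the 5 cases that matter were found.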

The problem is widespread. Credit card fraud (0.1% of all transactions), machine failures (2% of all devices), project escalations (5% of all projects) -- in every case, the minority class is the one that actually matters, yet it is statistically underrepresented.

Ben Kraiem et al. (2023) had a similar problem: 61 Traditional projects vs. 38 Agile projects. Without countermeasures, the model would systematically favor Traditional -- regardless of actual project characteristics.

Why it matters

Imbalanced data leads to three practical problems:

There are several countermeasures: SMOTE (Synthetic Minority Over-sampling Technique, which generates synthetic examples of the minority class), cost-sensitive learning (a higher penalty for errors on the minority class), or simply choosing the right evaluation metrics (precision, recall, and F1-score instead of accuracy).
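As an illustration of cost-sensitive learning, the sketch below computes class weights with one common heuristic, n_samples / (n_classes * class_count), the same formula scikit-learn documents for its `class_weight="balanced"` option. The function names and the weighted error measure are assumptions made for this example, not a prescribed method.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


def weighted_error(y_true, y_pred, weights):
    """Share of the total class weight that sits on misclassified samples."""
    total = sum(weights[t] for t in y_true)
    wrong = sum(weights[t] for t, p in zip(y_true, y_pred) if t != p)
    return wrong / total


labels = ["stable"] * 95 + ["critical"] * 5
weights = balanced_class_weights(labels)

# The always-"stable" baseline misses every critical project.
always_stable = ["stable"] * len(labels)
err = weighted_error(labels, always_stable, weights)

print(f"weight 'stable':   {weights['stable']:.3f}")    # 0.526
print(f"weight 'critical': {weights['critical']:.3f}")  # 10.000
print(f"weighted error of always-'stable': {err:.2f}")  # 0.50
```

With these weights, each missed critical project costs about 19 times as much as a misclassified stable one, so the always-"stable" baseline no longer looks 95% correct: half of the total weight is on cases it gets wrong.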

Aversight and Imbalanced Data

Aversight addresses imbalanced data on three levels. First, through cost-sensitive learning: a missed budget alert is weighted more heavily than a false alarm. Second, through dynamic threshold adjustment: when the escalation rate rises in a quarter, the system automatically lowers the alert threshold. Third, through continuous retraining: every new escalation event immediately flows into the model, so the minority class steadily grows and the model learns it better.
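How asymmetric error costs translate into a lower alert threshold can be sketched with a standard decision rule. This is not Aversight's implementation: the cost values and function names are hypothetical, and the rule assumes the model outputs reasonably calibrated probabilities.

```python
def cost_based_threshold(cost_false_alarm, cost_missed_alert):
    """Probability above which raising an alert has the lower expected cost.

    Alerting is cheaper when p * cost_missed_alert > (1 - p) * cost_false_alarm,
    which rearranges to p > cost_false_alarm / (cost_false_alarm + cost_missed_alert).
    """
    if cost_false_alarm <= 0 or cost_missed_alert <= 0:
        raise ValueError("costs must be positive")
    return cost_false_alarm / (cost_false_alarm + cost_missed_alert)


def should_alert(p_critical, threshold):
    """Raise an alert if the predicted risk exceeds the threshold."""
    return p_critical > threshold


# Hypothetical costs: a missed alert is assumed to be 4x as costly as a false alarm.
equal_costs = cost_based_threshold(1.0, 1.0)
asymmetric = cost_based_threshold(1.0, 4.0)

print(equal_costs)                      # 0.5
print(asymmetric)                       # 0.2
print(should_alert(0.3, equal_costs))   # False
print(should_alert(0.3, asymmetric))    # True
```

A project with a predicted 30% risk stays silent under equal costs but triggers an alert once missed alerts are treated as more expensive than false alarms.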
