I wrote a few quick bullet points down from the article “8 Tactics to Combat Imbalanced Classes in Your Machine Learning Dataset” for future reference.
Tactics
- An imbalanced dataset is one where a class occurs much less frequently than the others.
- If a model ignores the minority class, it can still achieve high classification accuracy, but that's not the result we want.
- Use a confusion matrix to check that you're getting acceptable performance on every class, not just high overall accuracy.
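A quick sketch of the accuracy trap described above, using made-up labels (90/10 split) and scikit-learn's `confusion_matrix`:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy labels, assumed for illustration: 90 negatives, 10 positives.
y_true = np.array([0] * 90 + [1] * 10)
# A model that ignores the minority class and always predicts 0.
y_pred = np.zeros(100, dtype=int)

accuracy = (y_true == y_pred).mean()
cm = confusion_matrix(y_true, y_pred)
print(accuracy)  # 0.9 despite missing every positive
print(cm)
# Rows = true class, columns = predicted class; the minority row is all misses:
# [[90  0]
#  [10  0]]
```

The 90% accuracy looks fine until the second row of the matrix shows that every minority instance was misclassified.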
- One potential solution to this problem is to collect more data
- Resampling your dataset (can be random or non-random, e.g. stratified):
  - Add copies of underrepresented classes (oversampling/sampling with replacement). Useful if you don't have much data: tens of thousands of instances or fewer.
  - Delete instances of classes that occur frequently (undersampling). Handy if you have a lot of data: tens to hundreds of thousands of instances.
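A minimal sketch of random oversampling with `sklearn.utils.resample`, on toy data (features and the 90/10 split are assumed for illustration):

```python
import numpy as np
from sklearn.utils import resample

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))      # toy features
y = np.array([0] * 90 + [1] * 10)  # 90/10 class imbalance

# Oversample the minority class with replacement until it matches the majority.
X_min_up, y_min_up = resample(
    X[y == 1], y[y == 1], replace=True, n_samples=90, random_state=0
)
X_bal = np.vstack([X[y == 0], X_min_up])
y_bal = np.concatenate([y[y == 0], y_min_up])
print(np.bincount(y_bal))  # [90 90]
```

The undersampling counterpart is the same call pointed at the majority class with `replace=False` and `n_samples=10`.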
- Try different algorithms: decision trees can perform well on imbalanced datasets
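A small sketch of a decision tree on toy imbalanced data (the data and class counts are assumed for illustration): the tree's splitting rules can carve out the minority region even with few samples.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
# Toy imbalanced data, assumed: 95 majority points near (0, 0),
# 5 minority points near (4, 4).
X = np.vstack([rng.normal(0, 1, size=(95, 2)), rng.normal(4, 0.5, size=(5, 2))])
y = np.array([0] * 95 + [1] * 5)

# An unpruned tree keeps splitting until leaves are pure, so the small
# minority cluster still gets its own leaf.
tree = DecisionTreeClassifier(random_state=0).fit(X, y)
print(tree.score(X, y))
```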
- Penalised models: add an extra penalty for misclassifying the minority class. Examples include penalized-SVM and penalized-LDA.
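In scikit-learn, one way to approximate a penalized model is the `class_weight` parameter, which scales the misclassification cost inversely to class frequency. A sketch on assumed toy data:

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
# Toy imbalanced data, assumed: 95 vs 5 samples with overlapping classes.
X = np.vstack([rng.normal(0, 1, size=(95, 2)), rng.normal(1.5, 1, size=(5, 2))])
y = np.array([0] * 95 + [1] * 5)

# class_weight='balanced' raises the penalty for minority-class errors in
# inverse proportion to class frequency (a penalized-SVM-style setup).
clf = LinearSVC(class_weight="balanced", max_iter=10000).fit(X, y)
preds = clf.predict(X)
```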
- There are areas of research dedicated to imbalanced datasets: look into anomaly detection and change detection.