I wrote a few quick bullet points down from the article “8 Tactics to Combat Imbalanced Classes in Your Machine Learning Dataset” for future reference.
Tactics
- An imbalanced dataset is one where a class occurs much less frequently than the others.
- If a model ignores the minority class, it can still achieve high classification accuracy, but that's not the result we want.
- Use a confusion matrix to check that you're getting acceptable performance on every class, not just high overall accuracy.
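A quick sketch of the accuracy trap described above, using made-up labels (90/10 split) and scikit-learn's `confusion_matrix`:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy labels, assumed for illustration: 90 negatives, 10 positives.
y_true = np.array([0] * 90 + [1] * 10)
# A model that ignores the minority class and always predicts 0.
y_pred = np.zeros(100, dtype=int)

accuracy = (y_true == y_pred).mean()
cm = confusion_matrix(y_true, y_pred)
print(accuracy)  # 0.9 despite missing every positive
print(cm)
# Rows = true class, columns = predicted class; the minority row is all misses:
# [[90  0]
#  [10  0]]
```

The 90% accuracy looks fine until the second row of the matrix shows that every minority instance was misclassified.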
- One potential solution to this problem is to collect more data
- Resampling your dataset (can be random or non-random, e.g. stratified):
  - Add copies of underrepresented classes (oversampling/sampling with replacement). Useful if you don't have much data: tens of thousands of instances or fewer.
  - Delete instances of classes that occur frequently (undersampling). Handy if you have a lot of data: tens to hundreds of thousands of instances.
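A minimal sketch of random oversampling with `sklearn.utils.resample`, on toy data (features and the 90/10 split are assumed for illustration):

```python
import numpy as np
from sklearn.utils import resample

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))      # toy features
y = np.array([0] * 90 + [1] * 10)  # 90/10 class imbalance

# Oversample the minority class with replacement until it matches the majority.
X_min_up, y_min_up = resample(
    X[y == 1], y[y == 1], replace=True, n_samples=90, random_state=0
)
X_bal = np.vstack([X[y == 0], X_min_up])
y_bal = np.concatenate([y[y == 0], y_min_up])
print(np.bincount(y_bal))  # [90 90]
```

The undersampling counterpart is the same call pointed at the majority class with `replace=False` and `n_samples=10`.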
- Try different algorithms: decision trees can perform well on imbalanced datasets
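A small sketch of a decision tree on toy imbalanced data (the data and class counts are assumed for illustration): the tree's splitting rules can carve out the minority region even with few samples.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
# Toy imbalanced data, assumed: 95 majority points near (0, 0),
# 5 minority points near (4, 4).
X = np.vstack([rng.normal(0, 1, size=(95, 2)), rng.normal(4, 0.5, size=(5, 2))])
y = np.array([0] * 95 + [1] * 5)

# An unpruned tree keeps splitting until leaves are pure, so the small
# minority cluster still gets its own leaf.
tree = DecisionTreeClassifier(random_state=0).fit(X, y)
print(tree.score(X, y))
```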
- Penalised models: add an extra penalty for misclassifying the minority class. Examples include penalized-SVM and penalized-LDA.
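In scikit-learn, one way to approximate a penalized model is the `class_weight` parameter, which scales the misclassification cost inversely to class frequency. A sketch on assumed toy data:

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
# Toy imbalanced data, assumed: 95 vs 5 samples with overlapping classes.
X = np.vstack([rng.normal(0, 1, size=(95, 2)), rng.normal(1.5, 1, size=(5, 2))])
y = np.array([0] * 95 + [1] * 5)

# class_weight='balanced' raises the penalty for minority-class errors in
# inverse proportion to class frequency (a penalized-SVM-style setup).
clf = LinearSVC(class_weight="balanced", max_iter=10000).fit(X, y)
preds = clf.predict(X)
```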
- There are areas of research dedicated to imbalanced datasets: look into anomaly detection and change detection.