I jotted down a few quick bullet points from the article “8 Tactics to Combat Imbalanced Classes in Your Machine Learning Dataset” for future reference.
Tactics
- Imbalanced datasets occur when you have a class that occurs much less frequently than the others.
- If a model ignores the minority class, it can still achieve high classification accuracy, but that’s not the result we want.
- Use a confusion matrix to check that you’re getting acceptable accuracy for every class, not just overall (see the sketch after this list).
- One potential solution to this problem is to collect more data
- Resampling your dataset (can be random or non-random, i.e. stratified); see the resampling sketch after this list:
  - Add copies of instances from the underrepresented class (oversampling/sampling with replacement). Useful if you don’t have much data – tens of thousands of instances or fewer.
  - Delete instances of the class that occurs frequently (undersampling). Handy if you have a lot of data – tens to hundreds of thousands of instances.
- Try different algorithms: decision trees can perform well on imbalanced datasets
- Penalised models: add extra penalties for misclassifying the minority class. Examples include penalized-SVM and penalized-LDA (see the class-weight sketch below).
- There are fields of research dedicated to imbalanced datasets: look into anomaly detection and change detection.
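A minimal sketch of the confusion-matrix check, assuming scikit-learn; the tiny `y_test`/`y_pred` arrays are made up purely for illustration. The point is that high overall accuracy can hide poor minority-class performance:

```python
# Check per-class performance rather than relying on overall accuracy.
# The labels below are illustrative only: 80% of samples belong to class 0.
import numpy as np
from sklearn.metrics import confusion_matrix, classification_report

y_test = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])   # imbalanced ground truth
y_pred = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 1])   # model mostly predicts the majority class

print(confusion_matrix(y_test, y_pred))
# [[8 0]
#  [1 1]]  -> 90% accuracy overall, yet half of the minority class is misclassified
print(classification_report(y_test, y_pred))
```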
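A rough sketch of random over- and undersampling with `sklearn.utils.resample`; the toy DataFrame and 90/10 class split are assumptions for illustration. The article doesn’t prescribe a library, and `imbalanced-learn` offers more complete implementations:

```python
# Random resampling to balance a binary dataset (illustrative data only).
import pandas as pd
from sklearn.utils import resample

df = pd.DataFrame({
    "feature": range(100),
    "label":   [0] * 90 + [1] * 10,   # 90/10 class imbalance
})
majority = df[df.label == 0]
minority = df[df.label == 1]

# Oversampling: duplicate minority rows (sampling with replacement) up to the majority size.
minority_up = resample(minority, replace=True, n_samples=len(majority), random_state=42)
balanced_over = pd.concat([majority, minority_up])

# Undersampling: randomly drop majority rows down to the minority size.
majority_down = resample(majority, replace=False, n_samples=len(minority), random_state=42)
balanced_under = pd.concat([majority_down, minority])

print(balanced_over.label.value_counts())
print(balanced_under.label.value_counts())
```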
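A minimal sketch of a penalised (cost-sensitive) SVM using scikit-learn’s `class_weight` parameter, which makes minority-class mistakes cost more; the synthetic dataset is an assumption for illustration:

```python
# Penalised SVM: weight errors on the rare class more heavily during training.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import classification_report

# Synthetic 95/5 imbalanced dataset, purely for demonstration.
X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# "balanced" sets penalties inversely proportional to class frequencies;
# an explicit dict such as {0: 1, 1: 10} also works.
clf = SVC(kernel="linear", class_weight="balanced")
clf.fit(X_train, y_train)

print(classification_report(y_test, clf.predict(X_test)))
```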