Working with Imbalanced Classes

I jotted down a few quick bullet points from the article “8 Tactics to Combat Imbalanced Classes in Your Machine Learning Dataset” for future reference.

Tactics

  • Imbalanced datasets occur when one class appears much less frequently than the others.
  • If a model simply ignores the minority class, it can still achieve high classification accuracy, but that’s not the result we want.
  • Use a confusion matrix to check that you’re getting acceptable accuracy on every class, not just overall (see the sketch after this list).
  • One potential solution to this problem is to collect more data
  • Resampling your dataset (can be random or non-random – stratified); see the resampling sketch after this list:
    • Add copies of instances from the under-represented classes (oversampling/sampling with replacement). Useful if you don’t have much data – tens of thousands of instances or fewer.
    • Delete instances of the classes that occur frequently (undersampling). Handy if you have a lot of data – tens to hundreds of thousands of instances.
  • Try different algorithms: decision trees can perform well on imbalanced datasets
  • Penalised models: add an extra cost for misclassifying the minority class during training. Examples include penalized-SVM and penalized-LDA (see the class-weight sketch after this list).
  • There are whole areas of research dedicated to imbalanced datasets: look into anomaly detection and change detection.
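
Below is a minimal sketch of the accuracy trap and why a confusion matrix matters. It assumes scikit-learn is available; the 99:1 class split and the always-predict-majority baseline are just illustrations, not anything from the original article.

```python
# Accuracy alone hides poor minority-class performance on imbalanced data.
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Generate a synthetic binary dataset with a 99:1 class imbalance.
X, y = make_classification(n_samples=10_000, weights=[0.99, 0.01], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# A "model" that always predicts the majority class.
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
y_pred = baseline.predict(X_test)

print(accuracy_score(y_test, y_pred))    # ~0.99, despite the model being useless
print(confusion_matrix(y_test, y_pred))  # bottom row: every minority instance is missed
```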
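
The resampling bullet can be sketched with plain scikit-learn utilities. This is an illustration for the binary case only: the random_oversample/random_undersample helper names are my own, and libraries such as imbalanced-learn provide more complete implementations (RandomOverSampler, RandomUnderSampler, SMOTE).

```python
# Random over- and undersampling for a binary problem using sklearn.utils.resample.
import numpy as np
from sklearn.utils import resample

def random_oversample(X, y, minority_label, random_state=0):
    """Duplicate minority-class rows (sampling with replacement) until the classes balance."""
    X_min, y_min = X[y == minority_label], y[y == minority_label]
    X_maj, y_maj = X[y != minority_label], y[y != minority_label]
    X_min_up, y_min_up = resample(X_min, y_min, replace=True,
                                  n_samples=len(y_maj), random_state=random_state)
    return np.vstack([X_maj, X_min_up]), np.concatenate([y_maj, y_min_up])

def random_undersample(X, y, minority_label, random_state=0):
    """Discard majority-class rows (sampling without replacement) down to the minority count."""
    X_min, y_min = X[y == minority_label], y[y == minority_label]
    X_maj, y_maj = X[y != minority_label], y[y != minority_label]
    X_maj_down, y_maj_down = resample(X_maj, y_maj, replace=False,
                                      n_samples=len(y_min), random_state=random_state)
    return np.vstack([X_maj_down, X_min]), np.concatenate([y_maj_down, y_min])
```

Remember to resample only the training split, never the test set, otherwise the evaluation will be overly optimistic.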
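
For the penalised-models bullet, one common way to get penalized-SVM-style behaviour in scikit-learn is the class_weight parameter, which scales the misclassification penalty inversely to class frequency. This is a sketch under that assumption, not necessarily the exact algorithm the article describes.

```python
# A weighted linear SVM: minority-class mistakes cost more during training.
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=10_000, weights=[0.95, 0.05], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# class_weight="balanced" sets each class's penalty to n_samples / (n_classes * class_count).
svm = LinearSVC(class_weight="balanced", max_iter=10_000).fit(X_train, y_train)
print(classification_report(y_test, svm.predict(X_test)))  # per-class precision and recall
```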