Working with Imbalanced Classes

by Jack Simpson

I jotted down a few quick bullet points from the article “8 Tactics to Combat Imbalanced Classes in Your Machine Learning Dataset” for future reference.

Tactics

  • A dataset is imbalanced when one class occurs far less frequently than the others.
  • A model that ignores the rare class can still achieve high classification accuracy (if 99% of instances belong to one class, always predicting that class scores 99%), but that’s not the result we want.
  • Use a confusion matrix to check that you’re getting acceptable performance on every class, not just overall.
  • One potential solution to this problem is to collect more data.
  • Resample your dataset (sampling can be random or non-random, e.g. stratified):
    • Add copies of instances from underrepresented classes (oversampling, i.e. sampling with replacement). Useful if you don’t have much data: tens of thousands of instances or fewer.
    • Delete instances from classes that occur frequently (undersampling). Handy if you have a lot of data: tens to hundreds of thousands of instances.
  • Try different algorithms: decision trees often perform well on imbalanced datasets.
  • Penalized models: these impose extra penalties for misclassifying the minority class. Examples include penalized-SVM and penalized-LDA.
  • Other areas of research deal with similar problems: look into anomaly detection and change detection.
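
To make the accuracy trap, the confusion-matrix check, and the two resampling tactics above a bit more concrete, here is a minimal sketch in plain Python. Everything in it is an assumption for illustration: the helper names (`confusion_matrix`, `per_class_recall`, `random_oversample`, `random_undersample`), the toy 95/5 split, and the "model" that always predicts the majority class are made up and do not come from the original article.

```python
import random
from collections import Counter


def confusion_matrix(y_true, y_pred, classes):
    """Return {actual: {predicted: count}} for the given classes."""
    matrix = {a: {p: 0 for p in classes} for a in classes}
    for actual, predicted in zip(y_true, y_pred):
        matrix[actual][predicted] += 1
    return matrix


def per_class_recall(matrix):
    """Fraction of each actual class that was predicted correctly."""
    recall = {}
    for actual, row in matrix.items():
        total = sum(row.values())
        recall[actual] = row[actual] / total if total else 0.0
    return recall


def random_oversample(rows, labels, rng):
    """Add copies (sampled with replacement) until every class matches the largest."""
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        pool = [r for r, l in zip(rows, labels) if l == cls]
        out_rows.extend(rng.choices(pool, k=target - n))
        out_labels.extend([cls] * (target - n))
    return out_rows, out_labels


def random_undersample(rows, labels, rng):
    """Keep a random subset (without replacement) so every class matches the smallest."""
    counts = Counter(labels)
    target = min(counts.values())
    out_rows, out_labels = [], []
    for cls in counts:
        pool = [r for r, l in zip(rows, labels) if l == cls]
        out_rows.extend(rng.sample(pool, target))
        out_labels.extend([cls] * target)
    return out_rows, out_labels


# Toy data (hypothetical): 95 "common" instances and 5 "rare" ones.
labels = ["common"] * 95 + ["rare"] * 5
rows = list(range(100))

# A stand-in "model" that ignores the rare class entirely.
predictions = ["common"] * 100
accuracy = sum(a == p for a, p in zip(labels, predictions)) / len(labels)
matrix = confusion_matrix(labels, predictions, ["common", "rare"])
recall = per_class_recall(matrix)
print("accuracy:", accuracy)        # high overall accuracy...
print("per-class recall:", recall)  # ...but the rare class is never found

rng = random.Random(0)  # fixed seed so the sketch is repeatable
over_rows, over_labels = random_oversample(rows, labels, rng)
under_rows, under_labels = random_undersample(rows, labels, rng)
print("after oversampling:", Counter(over_labels))
print("after undersampling:", Counter(under_labels))
```

In this toy case the overall accuracy looks fine while the per-class view shows the rare class is missed completely, which is the situation the confusion matrix is meant to expose. If you try resampling on real data, it is probably safest to resample only the training split so that duplicated rows don't end up in the evaluation set.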
