
Notes from ‘A Few Useful Things to Know about Machine Learning’

by Jack Simpson

I was reading a paper by Pedro Domingos this evening which had some tips and advice for people using machine learning. I’ve written down some bullet points for my own reference and I hope someone else finds them useful. I know I’ve made some of the mistakes he warns against.

  • Overfitting
    • Never forget that your ultimate goal is to generalise beyond the training data
    • Beginners frequently make the mistake of testing on the training data and concluding that their model is a success
    • Ensure that you set some data aside from the start to test your selected and tuned classifier
    • It is easy to contaminate your testing dataset by running frequent tests on it as you tune the hyperparameters of your model
    • With cross-validation you can compare differently tuned classifiers on subsets of the training data, leaving the test set untouched
  • Features
    • The most important factor is the features you train your classifier on
    • Typically you need to do some processing, as raw data is frequently in a format that is not immediately useful
    • Most of your time will probably be spent cleaning data and engineering features
  • More Data > Smarter Algorithms
    • If the accuracy of your model is not adequate you can either change the model or train it on more data
    • More data is usually better, but it can be time-consuming to gather and clean
  • Understand multiple models
    • Try to understand a variety of models
    • When tackling a problem, try the simpler learners first:
      • naive Bayes -> logistic regression -> k-nearest neighbour -> SVM
      • Fancier learners may be interesting, but they add complexity and can function as black boxes
    • Often the best learner for a problem can vary based on the goals of the project and the data you have access to
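
The overfitting advice above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the data is synthetic, the k-nearest-neighbour classifier is hand-written, and every name in it (make_data, knn_predict, accuracy, cross_val_accuracy) is made up for this example rather than taken from the paper. In practice you would more likely reach for an existing library.

```python
import random


def make_data(n, seed=0):
    """Synthetic two-class data: class 1 is shifted by +2 on both axes."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        x = (rng.gauss(2.0 * label, 1.0), rng.gauss(2.0 * label, 1.0))
        data.append((x, label))
    return data


def knn_predict(train, x, k):
    """Majority vote among the k training points closest to x (k is odd)."""
    def dist2(p):
        return (p[0][0] - x[0]) ** 2 + (p[0][1] - x[1]) ** 2
    nearest = sorted(train, key=dist2)[:k]
    votes = sum(label for _, label in nearest)
    return 1 if 2 * votes > k else 0


def accuracy(train, test, k):
    """Fraction of points in `test` classified correctly using `train`."""
    correct = sum(knn_predict(train, x, k) == y for x, y in test)
    return correct / len(test)


def cross_val_accuracy(data, k, folds=5):
    """Mean validation accuracy over `folds` splits of `data`."""
    scores = []
    for i in range(folds):
        val = data[i::folds]  # every folds-th point, starting at index i
        train = [p for j, p in enumerate(data) if j % folds != i]
        scores.append(accuracy(train, val, k))
    return sum(scores) / folds


data = make_data(300)

# Set the test set aside first and do not look at it while tuning.
dev, test = data[:240], data[240:]

# Tune the hyperparameter k with cross-validation on the dev data only.
candidates = [1, 3, 5, 9, 15]
cv_scores = {k: cross_val_accuracy(dev, k) for k in candidates}
best_k = max(cv_scores, key=cv_scores.get)

# Testing on the training data is misleading: with k=1 every point is
# its own nearest neighbour, so the score looks perfect.
train_acc_1nn = accuracy(dev, dev, 1)

# Evaluate the selected, tuned classifier once on the held-out test set.
test_acc = accuracy(dev, test, best_k)

print(f"accuracy of 1-NN on its own training data: {train_acc_1nn:.3f}")
print(f"best k by cross-validation: {best_k} (CV accuracy {cv_scores[best_k]:.3f})")
print(f"accuracy on the held-out test set: {test_acc:.3f}")
```

The point of the split is that the test set is only used on the last line of the evaluation. If you went back and re-tuned k after seeing that number, the test set would start leaking into your model selection, which is exactly the contamination described above.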
