I was reading a paper by Pedro Domingos this evening which had some tips and advice for people using machine learning. I’ve written down some bullet points for my own reference and I hope someone else finds them useful. I know I’ve made some of the mistakes he advises against.
- Overfitting
  - Never forget that your ultimate goal is to generalise beyond the training data
  - Beginners frequently make the mistake of testing on the training data and concluding that their model is a success
  - Set some data aside from the start, and use it only to test your selected and tuned classifier
  - It is easy to contaminate your test set by evaluating on it repeatedly as you tune your model's hyperparameters
  - With cross-validation you can compare differently tuned classifiers on held-out subsets of the training data, leaving the test set untouched
- Features
  - The most important factor is the set of features you train your classifier on
  - Raw data is frequently in a format that is not immediately useful, so you typically need to process it first
  - Most of your time will probably be spent cleaning data and engineering features
- More Data > Smarter Algorithms
  - If your model's accuracy is not adequate, you can either change the model or train it on more data
  - More data is usually better, but it can be time-consuming to gather and clean
- Understand multiple models
  - Try to understand a variety of models
  - When tackling a problem, try simpler learners first:
    - naive Bayes -> logistic regression -> k-nearest neighbour -> SVM
  - Fancy learners may be interesting, but they add complexity and can function as black boxes
  - The best learner for a problem often depends on the goals of the project and the data you have access to
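
Here is a minimal sketch of the hold-out and cross-validation advice from the overfitting notes. It is an illustration rather than anything taken from the paper: the data is synthetic, the helper names (`make_blobs`, `knn_predict`, `accuracy`, `cross_val_accuracy`) are made up for this example, and a hand-rolled k-nearest-neighbour classifier stands in for whatever model you are actually tuning.

```python
import random
from collections import Counter


def make_blobs(n_per_class, rng):
    """Synthetic 2-D data: two Gaussian blobs labelled 0 and 1 (illustrative only)."""
    data = []
    for label, centre in ((0, (0.0, 0.0)), (1, (2.5, 2.5))):
        for _ in range(n_per_class):
            point = (rng.gauss(centre[0], 1.0), rng.gauss(centre[1], 1.0))
            data.append((point, label))
    rng.shuffle(data)
    return data


def knn_predict(train, point, k):
    """Majority vote among the k training examples closest to `point`."""
    def squared_distance(example):
        (x, y), _ = example
        return (x - point[0]) ** 2 + (y - point[1]) ** 2

    nearest = sorted(train, key=squared_distance)[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def accuracy(train, evaluation, k):
    """Fraction of `evaluation` examples that k-NN, fitted on `train`, gets right."""
    correct = sum(knn_predict(train, p, k) == label for p, label in evaluation)
    return correct / len(evaluation)


def cross_val_accuracy(data, k, folds=5):
    """Mean validation accuracy over `folds` train/validation splits of `data`."""
    scores = []
    for i in range(folds):
        validation = data[i::folds]  # every folds-th example, starting at i
        training = [ex for j, ex in enumerate(data) if j % folds != i]
        scores.append(accuracy(training, validation, k))
    return sum(scores) / folds


rng = random.Random(0)
data = make_blobs(100, rng)

# 1. Set the test set aside first and do not look at it while tuning.
test_set, dev_set = data[:40], data[40:]

# 2. Tune the hyperparameter (k) with cross-validation on the development data only.
candidates = [1, 3, 5, 7, 9]
cv_scores = {k: cross_val_accuracy(dev_set, k) for k in candidates}
best_k = max(cv_scores, key=cv_scores.get)

# 3. Touch the test set exactly once, with the chosen setting.
test_accuracy = accuracy(dev_set, test_set, best_k)
print(f"best k = {best_k}, CV accuracy = {cv_scores[best_k]:.2f}, "
      f"test accuracy = {test_accuracy:.2f}")
```

In practice a library such as scikit-learn provides these pieces for you. The point here is only the ordering: split first, tune on the development data, and evaluate on the test set once at the end.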