Deep learning is a type of machine learning based on neural networks, which were inspired by neurons in the brain. The difference between a deep neural network and an ordinary neural network is the number of ‘hidden layers’ between the input and output layers.
I recently watched an excellent presentation on Deep Learning by Roelof Pieters titled ‘Python for image and text understanding; One model to rule them all!’ I can recommend watching it, and I’ve written this post to put down a few of my own bullet points from the talk for future reference.
Roelof had a five-point process for training a deep neural network:
- Preprocess the Data
  - Try to do as little as possible: the more transformations you apply, the less you allow the network to come up with its own representations of the data. The rawer the input, the better.
  - Mean subtraction (normalisation)
  - Divide by the standard deviation
  - If your data is noisy you may want to do some PCA and whitening to reduce the dimensions
  - Compute the statistics on the training data only, but apply them to all (training and test) data (see the sketch below)
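As a rough illustration of the preprocessing step, here is a minimal numpy sketch of mean subtraction and division by the standard deviation, with the statistics computed on the training split only and then applied to both splits. The variable names and the small epsilon guard are my own additions rather than anything from the talk, and PCA/whitening is left out for brevity.

```python
import numpy as np

def fit_normaliser(X_train):
    """Compute per-feature statistics on the training data only."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0) + 1e-8  # guard against zero variance
    return mean, std

def apply_normaliser(X, mean, std):
    """Mean subtraction followed by division by the standard deviation."""
    return (X - mean) / std

# Example usage: statistics come from the training split but are
# applied to both the training and the test split.
X_train = np.random.rand(1000, 20).astype(np.float32)
X_test = np.random.rand(200, 20).astype(np.float32)

mean, std = fit_normaliser(X_train)
X_train = apply_normaliser(X_train, mean, std)
X_test = apply_normaliser(X_test, mean, std)
```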
- Choose the architecture
  - Three choices:
    - Deep Belief Network (DBN): a series of restricted Boltzmann machines (RBMs). Useful for hierarchical data such as medical or audio datasets.
    - Convolutional Net (CNN): convolutional layers apply small filters across crops of the image and sum the results. Useful for images (see the sketch below).
    - Recurrent Net (RNN): loosely comparable to a Hidden Markov Model. Useful for natural language processing.
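To make the CNN option concrete, here is a minimal sketch of a small convolutional network defined with Lasagne (the library mentioned later in these notes). The layer sizes, filter counts and number of classes are arbitrary choices of mine, not values from the talk.

```python
import lasagne
from lasagne.layers import InputLayer, Conv2DLayer, MaxPool2DLayer, DenseLayer
from lasagne.nonlinearities import rectify, softmax

def build_cnn(input_var=None):
    # Input: batches of 1-channel 28x28 images (e.g. MNIST-sized).
    network = InputLayer(shape=(None, 1, 28, 28), input_var=input_var)
    # Convolutional layers: small filters slid across the image.
    network = Conv2DLayer(network, num_filters=32, filter_size=(3, 3),
                          nonlinearity=rectify)
    network = MaxPool2DLayer(network, pool_size=(2, 2))
    network = Conv2DLayer(network, num_filters=32, filter_size=(3, 3),
                          nonlinearity=rectify)
    network = MaxPool2DLayer(network, pool_size=(2, 2))
    # Fully connected layers on top, ending in a softmax over 10 classes.
    network = DenseLayer(network, num_units=256, nonlinearity=rectify)
    network = DenseLayer(network, num_units=10, nonlinearity=softmax)
    return network
```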
- Train
  - Assign the layer definitions and layer parameters, learning rate, etc. (a training-loop sketch follows below)
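The talk doesn’t prescribe a particular training loop, but as a reference point, a minimal Theano/Lasagne sketch that compiles a training function for the network above might look like this; the learning rate and momentum values are placeholders to be tuned in the next step.

```python
import theano
import theano.tensor as T
import lasagne

input_var = T.tensor4('inputs')
target_var = T.ivector('targets')

network = build_cnn(input_var)  # the sketch defined earlier

# Loss: categorical cross-entropy between predictions and targets.
prediction = lasagne.layers.get_output(network)
loss = lasagne.objectives.categorical_crossentropy(prediction, target_var).mean()

# Updates: stochastic gradient descent with Nesterov momentum.
params = lasagne.layers.get_all_params(network, trainable=True)
updates = lasagne.updates.nesterov_momentum(loss, params,
                                            learning_rate=0.01, momentum=0.9)

# Compile a function that performs one training step on a mini-batch.
train_fn = theano.function([input_var, target_var], loss, updates=updates)
```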
- Optimise/Regularise
  - Move between the Optimise/Regularise step and the Train step while improving
  - Visualise the loss curve – Lasagne comes with functions to achieve this
  - Visualise accuracy
  - You can visualise the weights: with images you want to see edges in the first layer
  - You can optimise hyperparameters (a random-search sketch follows this sub-list):
    - Grid search (won’t work for millions of parameters)
    - Random search (takes a long time)
    - Bayesian optimisation (seems to work the best; Spearmint and hypergrad libraries are available)
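To illustrate random search, here is a short sketch that samples the learning rate log-uniformly and the dropout probability uniformly, keeping the best configuration by validation accuracy. The `train_and_evaluate` function is a hypothetical stand-in for whatever training and evaluation code you already have; it is not from the talk or from any particular library.

```python
import numpy as np

def random_search(train_and_evaluate, n_trials=20, seed=0):
    """Randomly sample hyperparameters and keep the best by validation accuracy.

    `train_and_evaluate(learning_rate, dropout)` is assumed to train a model
    and return its validation accuracy; it is a hypothetical placeholder.
    """
    rng = np.random.RandomState(seed)
    best_accuracy, best_params = -np.inf, None
    for _ in range(n_trials):
        params = {
            # Sample the learning rate log-uniformly between 1e-5 and 1e-1.
            'learning_rate': 10 ** rng.uniform(-5, -1),
            # Sample the dropout probability uniformly.
            'dropout': rng.uniform(0.2, 0.8),
        }
        accuracy = train_and_evaluate(**params)
        if accuracy > best_accuracy:
            best_accuracy, best_params = accuracy, params
    return best_params, best_accuracy
```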
  - Data augmentation: with images you can scale, rotate, adjust the contrast and flip (see the sketch below)
  - Dropout: randomly switch off nodes during training, which forces the network to adapt (see the sketch below)
  - Batch normalisation
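For reference, here are small numpy sketches of two of the regularisers above: random horizontal flipping as a data-augmentation step, and (inverted) dropout as a random mask over activations. Both are generic illustrations rather than code from the talk.

```python
import numpy as np

def random_horizontal_flip(images, p=0.5, rng=np.random):
    """Flip each image (batch, channels, height, width) left-right with probability p."""
    flipped = images.copy()
    for i in range(len(images)):
        if rng.rand() < p:
            flipped[i] = flipped[i][:, :, ::-1]
    return flipped

def dropout(activations, p=0.5, rng=np.random):
    """Inverted dropout: randomly switch off a fraction p of the nodes and
    rescale the rest so the expected activation is unchanged.
    Applied at training time only; at test time the activations are used as-is."""
    keep_prob = 1.0 - p
    mask = (rng.rand(*activations.shape) < keep_prob) / keep_prob
    return activations * mask
```

In Lasagne the same dropout idea is available out of the box as lasagne.layers.DropoutLayer.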
- Tips/Tricks
  - Ensembles: train multiple models and let them vote on the prediction, or average their outputs for continuous data (see the sketch below). Make sure the classifiers are not correlated.
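As a final illustration, here is a small numpy sketch of combining an ensemble either by majority vote on predicted class labels or by averaging predicted probabilities; again a generic example, not code from the talk.

```python
import numpy as np

def majority_vote(label_predictions):
    """label_predictions: array of shape (n_models, n_samples) of class labels.
    Returns the most common label for each sample."""
    label_predictions = np.asarray(label_predictions)
    n_samples = label_predictions.shape[1]
    return np.array([np.bincount(label_predictions[:, i]).argmax()
                     for i in range(n_samples)])

def average_probabilities(prob_predictions):
    """prob_predictions: array of shape (n_models, n_samples, n_classes).
    Returns the class with the highest mean predicted probability per sample."""
    mean_probs = np.asarray(prob_predictions).mean(axis=0)
    return mean_probs.argmax(axis=1)
```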