I recently listened to a really interesting talk by Jonathan Whitmore where he discussed the approach his company has to working with data using the Jupyter Notebook. I’d recommend watching it, but I’ve made a brief summary below for my own future reference.
BioSky is a website I’ve been setting up with Rina Soetanto recently. We are both doing PhDs in the biological sciences and have a keen interest in research and the treatments being developed in the biotech and health industry.
I wrote a few quick bullet points down from the article “8 Proven Ways for improving the “Accuracy” of a Machine Learning Model” for future reference.
Improving Accuracy
- Add more data
- Fix missing values
- Continuous: impute with median/mean/mode
- Categorical: treat as separate class
- Predict missing classes with k-nearest neighbours
- Outliers
- Delete
- Bin
- Impute
- Treat them separately from the other observations
- Feature engineering
- Transform and normalise: scale features to a 0–1 range (a short imputation and scaling sketch follows this list)
- Eliminate skewness (e.g. with a log transform) for algorithms that assume a normal distribution
- Create features: Date of transactions might not be useful but day of the week may be
- Feature selection
- Best features to use: identify via visualisation or through domain knowledge
- Significance: use p-values and other metrics to identify the most informative features. You can also use dimensionality reduction while preserving the relationships in the data
- Test multiple machine learning algorithms and tune their parameters
- Ensemble methods: combine multiple weak predictors (bagging and boosting)
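To jog my memory later, here’s a rough sketch of the imputation, scaling and log-transform steps using scikit-learn and NumPy (the DataFrame and column names are made up for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler

# Hypothetical data with a missing continuous value and a skewed feature
df = pd.DataFrame({
    "age": [23, 31, np.nan, 45, 52],
    "income": [30000, 42000, 55000, 250000, 61000],
})

# Continuous feature: impute the missing value with the median
df["age"] = SimpleImputer(strategy="median").fit_transform(df[["age"]]).ravel()

# Scale both features to the 0-1 range
df[["age", "income"]] = MinMaxScaler().fit_transform(df[["age", "income"]])

# Reduce skewness with a log transform (log1p handles zeros safely)
df["log_income"] = np.log1p(df["income"])

print(df)
```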
I wrote a few quick bullet points down from the article “8 Tactics to Combat Imbalanced Classes in Your Machine Learning Dataset” for future reference.
Tactics
- Imbalanced datasets occur when one class appears much less frequently than the others
- If a model ignores the rare class it can still achieve high classification accuracy, but that’s not the result we want
- Make sure you use a confusion matrix to ensure that you’re getting acceptable accuracy for all your classes
- One potential solution to this problem is to collect more data
- Resampling your dataset (this can be random or non-random, e.g. stratified); see the sketch after this list:
- Add copies of the under-represented class (oversampling/sampling with replacement). Useful if you don’t have much data – tens of thousands of instances or fewer.
- Delete instances of the class that occurs frequently (undersampling). Handy if you have a lot of data – tens to hundreds of thousands of instances.
- Try different algorithms: decision trees can perform well on imbalanced datasets
- Penalised models: add extra penalties for misclassifying the minority class. Examples include penalised SVM and penalised LDA.
- There are areas of research dedicated to imbalanced datasets: can look into anomaly detection and change detection.
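Here’s a quick sketch of random oversampling using scikit-learn’s resample helper (the labels and counts are invented for illustration):

```python
import pandas as pd
from sklearn.utils import resample

# Hypothetical imbalanced dataset: 95 "normal" rows vs 5 "rare" rows
df = pd.DataFrame({
    "feature": range(100),
    "label": ["normal"] * 95 + ["rare"] * 5,
})

majority = df[df["label"] == "normal"]
minority = df[df["label"] == "rare"]

# Oversample the minority class with replacement until it matches the majority
minority_upsampled = resample(
    minority,
    replace=True,
    n_samples=len(majority),
    random_state=42,
)

balanced = pd.concat([majority, minority_upsampled])
print(balanced["label"].value_counts())
```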
I wrote a few quick bullet points down from the article “How To Implement Machine Learning Algorithm Performance Metrics From Scratch With Python”.
Metrics
- Classification accuracy
- Tests how well a model’s predictions do overall
- accuracy = correct predictions / total predictions
- Confusion matrix
- Use to identify how well your predictions did with different classes
- Very useful if you have an imbalanced dataset
- I wrote an extremely hacked-together confusion matrix for my tag-identification software. I had 4 classes (U, C, R, Q), and the confusion matrix shows what the model predicted against what the real category was:
|   | U   | C   | R   | Q  |
|---|-----|-----|-----|----|
| U | 175 | 17  | 67  | 1  |
| C | 11  | 335 | 14  | 0  |
| R | 26  | 8   | 298 | 0  |
| Q | 6   | 0   | 3   | 93 |
- Mean absolute error for regression
- Always positive – the average of how much your predicted values differ from the real values
- Root mean squared error for regression
- Square root of the mean of squared differences between the actual and predicted value
- Squaring the differences gives you positive numbers, and taking the square root brings the metric back to the original units (simple implementations of these metrics are sketched below).
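As a rough sketch, these metrics can be implemented from scratch in a few lines of plain Python (the example lists are made up):

```python
from math import sqrt


def accuracy(actual, predicted):
    # Fraction of predictions that exactly match the actual labels
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    return correct / len(actual)


def mean_absolute_error(actual, predicted):
    # Average absolute difference between actual and predicted values
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def root_mean_squared_error(actual, predicted):
    # Square the differences, average them, then take the square root
    return sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


print(accuracy(["U", "C", "R"], ["U", "C", "C"]))                 # 0.666...
print(mean_absolute_error([3.0, 2.5, 4.0], [2.5, 3.0, 4.0]))      # 0.333...
print(root_mean_squared_error([3.0, 2.5, 4.0], [2.5, 3.0, 4.0]))  # ~0.408
```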
I found an excellent tutorial series on Machine Learning on the Google Developers YouTube channel this weekend. It uses Python, scikit-learn and TensorFlow, and covers decision trees and k-nearest neighbours (KNN).
I really liked the focus on understanding what is going on under the hood. I followed along, implemented KNN from scratch, and expanded on the base class they described so that k is a configurable parameter. You can find my implementation in a Jupyter Notebook here.
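As a reminder of the general idea (this is an illustrative re-creation rather than the notebook code itself), a minimal k-nearest neighbours classifier with a configurable k might look something like this:

```python
from collections import Counter

import numpy as np


class SimpleKNN:
    """A minimal k-nearest neighbours classifier using Euclidean distance."""

    def __init__(self, k=3):
        self.k = k

    def fit(self, X_train, y_train):
        self.X_train = np.asarray(X_train, dtype=float)
        self.y_train = np.asarray(y_train)

    def predict(self, X_test):
        predictions = []
        for row in np.asarray(X_test, dtype=float):
            # Distance from this test point to every training point
            distances = np.sqrt(((self.X_train - row) ** 2).sum(axis=1))
            # Labels of the k closest training points
            nearest_labels = self.y_train[np.argsort(distances)[:self.k]]
            # Majority vote decides the predicted class
            predictions.append(Counter(nearest_labels).most_common(1)[0][0])
        return predictions


# Usage on some made-up toy data
X_train = [[1.0, 1.0], [1.2, 0.8], [6.0, 6.0], [5.8, 6.2]]
y_train = ["red", "red", "blue", "blue"]

model = SimpleKNN(k=3)
model.fit(X_train, y_train)
print(model.predict([[1.1, 0.9], [6.1, 5.9]]))  # ['red', 'blue']
```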
Sidenote: If you want to visualise the decision tree, you’ll need to install the following libraries. I used Homebrew to install graphviz, but you could also use a package manager on Linux:
brew install graphviz
pip3 install pydotplus
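With those installed, a rough sketch of exporting a fitted scikit-learn decision tree to an image might look like this (the classifier and output filename are just placeholders):

```python
import pydotplus
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_graphviz

# Fit a small tree on the iris dataset
iris = load_iris()
clf = DecisionTreeClassifier().fit(iris.data, iris.target)

# Export the tree to DOT format and render it with graphviz via pydotplus
dot_data = export_graphviz(
    clf,
    out_file=None,
    feature_names=iris.feature_names,
    class_names=iris.target_names,
    filled=True,
)
graph = pydotplus.graph_from_dot_data(dot_data)
graph.write_pdf("iris_tree.pdf")
```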
I frequently find myself working with large lists where I need to apply the same time-consuming function to each element in the list without concern for the order that these calculations are made. I’ve written a small class using Python’s multiprocessing module to help speed things up.
It will accept a list, break it into chunks (each containing as many elements as the number of processes), hand each chunk to one of the pool’s worker processes, and finally return a single list containing all the results.
```python
import multiprocessing


class ProcessHelper:
    def __init__(self, num_processes=4):
        self.num_processes = num_processes

    def split_list(self, data_list):
        # Break the list into chunks; the chunk size matches num_processes
        list_of_lists = []
        for i in range(0, len(data_list), self.num_processes):
            list_of_lists.append(data_list[i:i + self.num_processes])
        return list_of_lists

    def map_reduce(self, function, data_list):
        # Map the function over each chunk using a pool of worker processes,
        # then flatten the per-chunk results back into a single list
        split_data = self.split_list(data_list)
        processes = multiprocessing.Pool(processes=self.num_processes)
        results_list_of_lists = processes.map(function, split_data)
        processes.close()
        results_list = [item for sublist in results_list_of_lists
                        for item in sublist]
        return results_list
```
To demonstrate how this class works, I’ll create a list of 20 integers from 0-19. I’ve also created a function that will square every number in a list. When I run it, I’ll pass the function (job) and the list (data). The class will break the data into sublists and run the function on each sublist in a worker process.
```python
def job(num_list):
    return [i * i for i in num_list]


data = range(20)
p = ProcessHelper(4)
result = p.map_reduce(job, data)
print(result)
```
So if my data originally was a list that looked like this:
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
When I split it into sublists, I’ll end up with five sublists of four elements each (the chunk size matches the four processes I’ve asked for):
[[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15], [16, 17, 18, 19]]
Finally, the result will give me the list of squared values that looks like this:
[0, 1, 4, 9, 16, 25, 36, 49, 64, 81, 100, 121, 144, 169, 196, 225, 256, 289, 324, 361]
I’ll continue to build this class as I identify other handy helper methods that I could add.
I recently watched Rebecca Bilbro’s presentation at PyCon 2016 and thought I’d share a few of my short notes from her interesting presentation.
Model Selection Triple
When selecting a model, rather than going with your default favourite method, take 3 things into account:
- Feature analysis: intelligent feature selection and engineering
- Model selection: the model that makes the most sense for the problem/domain space
- Hyperparameter tuning: once the model and features have been selected, choose the parameters that result in optimal performance.
Visual Feature Analysis
- Boxplots are a useful starting tool for looking at all features as they show you:
- Central tendency
- Distribution
- Outliers
- Histograms let you examine the distribution of a feature
- SPLOMs (scatterplot matrices): pairwise plots of features that let you identify:
- pairwise linear, quadratic and exponential relationships between variables
- Homo/heteroscedasticity
- How features are distributed relative to each other
- Radviz: plots features around a circle and shows how much pull each one has
- Parallel coordinates: let you visualise multiple variables as line segments – you want to find separating chords, which can help with classification (see the pandas plotting sketch after this list)
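Pandas ships with plotting helpers for several of these; here’s a rough sketch using the iris dataset (assuming matplotlib and scikit-learn are installed):

```python
import matplotlib.pyplot as plt
import pandas as pd
from pandas.plotting import parallel_coordinates, radviz, scatter_matrix
from sklearn.datasets import load_iris

# Load iris into a DataFrame with a class column
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df["species"] = [iris.target_names[i] for i in iris.target]

# Boxplots and histograms for central tendency, spread and outliers
df.boxplot()
plt.show()
df.hist()
plt.show()

# Scatterplot matrix (SPLOM) for pairwise relationships
scatter_matrix(df[iris.feature_names])
plt.show()

# Radviz and parallel coordinates, coloured by class
radviz(df, "species")
plt.show()
parallel_coordinates(df, "species")
plt.show()
```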
Evaluation Tools
- Classification heat maps: show you the areas where the model is performing best
- ROC-AUC and Prediction Error Plots: Show you which models are performing better
- Residual plots: Show you which models are doing best and why
- Grid search and validation curves: show you the performance of a model across its parameter values. You can also plot grid search results as a heat map (a rough sketch follows this list)
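As a rough sketch of the grid search heat map idea, using scikit-learn and matplotlib (the SVM parameter grid is just an example):

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

iris = load_iris()

# Grid of hyperparameters to search over
C_values = [0.1, 1, 10, 100]
gamma_values = [0.001, 0.01, 0.1, 1]
param_grid = {"C": C_values, "gamma": gamma_values}

search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(iris.data, iris.target)

# Reshape the mean cross-validation scores into a C x gamma grid
scores = search.cv_results_["mean_test_score"].reshape(len(C_values), len(gamma_values))

plt.imshow(scores, interpolation="nearest", cmap="viridis")
plt.xticks(range(len(gamma_values)), gamma_values)
plt.yticks(range(len(C_values)), C_values)
plt.xlabel("gamma")
plt.ylabel("C")
plt.colorbar(label="mean CV accuracy")
plt.show()
```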
The other day I was working with R in a Jupyter Notebook when I discovered that I needed to include multiple figures in the same plot.
Surprisingly, R doesn’t include this capability out of the box, so I went searching and found this function that does the job. I’ve included the code below for my own future reference in case the linked site ever disappears.
```r
multiplot <- function(..., plotlist=NULL, file, cols=1, layout=NULL) {
  library(grid)

  # Make a list from the ... arguments and plotlist
  plots <- c(list(...), plotlist)

  numPlots = length(plots)

  # If layout is NULL, then use 'cols' to determine layout
  if (is.null(layout)) {
    # Make the panel
    # ncol: Number of columns of plots
    # nrow: Number of rows needed, calculated from # of cols
    layout <- matrix(seq(1, cols * ceiling(numPlots/cols)),
                     ncol = cols, nrow = ceiling(numPlots/cols))
  }

  if (numPlots == 1) {
    print(plots[[1]])

  } else {
    # Set up the page
    grid.newpage()
    pushViewport(viewport(layout = grid.layout(nrow(layout), ncol(layout))))

    # Make each plot, in the correct location
    for (i in 1:numPlots) {
      # Get the i,j matrix positions of the regions that contain this subplot
      matchidx <- as.data.frame(which(layout == i, arr.ind = TRUE))

      print(plots[[i]], vp = viewport(layout.pos.row = matchidx$row,
                                      layout.pos.col = matchidx$col))
    }
  }
}
```
Then to call the function, you just have to pass it the plots:
multiplot(p1, p2, p3, p4, cols=2)
One of my side projects has involved playing around with the camera on a Raspberry Pi, and I realised it would be handy to see the camera output. After finding out that the TightVNC server would do the job, the first step was to install it on the Raspberry Pi:
sudo apt-get install tightvncserver
With that done, I had to launch it:
vncserver :1 -geometry 800x600 -depth 24
The first time you run it, you’ll be asked to set a password. You’ll need to run the command above each time you restart your Raspberry Pi.
Finally, to connect from my Mac, I can go to Finder > Go > Connect to Server and enter the vnc:// address below (changing the username and IP address), or run the command from a terminal:
open vnc://pi@10.20.66.94:5901