Assessing machine learning algorithm performance

I wrote down a few quick bullet points from the article “How To Implement Machine Learning Algorithm Performance Metrics From Scratch With Python”.


  • Classification accuracy
    • Test how well predictions of a model do overall
    • accuracy = correct predictions / total predictions
  • Confusion matrix
    • Use to identify how well your predictions did with different classes
    • Very useful if you have an imbalanced dataset
    • I wrote a quick-and-dirty confusion matrix for my tag identification software. I had four classes (U, C, R, Q), and the confusion matrix shows what the model predicted against what the real category was.
  • Mean absolute error for regression
    • Always positive – the average of how much your predicted values differ from the real values
  • Root mean squared error for regression
    • Square root of the mean of the squared differences between the actual and predicted values
    • Squaring the values gives you positive numbers, and taking the root lets you compare the result to the original units.
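
As a rough from-scratch sketch of these metrics (my own version, not the article's exact code):

```python
def accuracy(actual, predicted):
    # Fraction of predictions that match the actual labels
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    return correct / len(actual)

def confusion_matrix(actual, predicted, labels):
    # matrix[i][j] counts rows whose actual label is labels[i]
    # and whose predicted label is labels[j]
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for a, p in zip(actual, predicted):
        matrix[index[a]][index[p]] += 1
    return matrix

def mae(actual, predicted):
    # Mean of the absolute differences
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    # Square root of the mean of the squared differences
    return (sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)) ** 0.5
```

For the confusion matrix, passing labels like ['U', 'C', 'R', 'Q'] gives you one row and column per class, with the diagonal holding the correct predictions.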
Machine Learning Recipes

I found an excellent tutorial series on Machine Learning on the Google Developers YouTube channel this weekend. It uses Python, scikit-learn and TensorFlow, and covers decision trees and k-nearest neighbours (KNN).

I really liked the focus on understanding what is going on under the hood. I followed along, implemented KNN from scratch, and expanded on the base class they described so that k can be passed in as a parameter. You can find my implementation in a Jupyter Notebook here.
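
A minimal sketch of the idea (assumed class and method names, not the video's exact code) looks like this:

```python
import math
from collections import Counter

class KNN:
    """A minimal k-nearest-neighbours classifier with k as a parameter."""

    def __init__(self, k=3):
        self.k = k

    def fit(self, X, y):
        # KNN is "lazy": fitting just stores the training data
        self.X, self.y = X, y

    def _distance(self, a, b):
        # Euclidean distance between two feature vectors
        return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

    def predict(self, X):
        predictions = []
        for row in X:
            # Sort training points by distance and vote among the k closest
            neighbours = sorted(range(len(self.X)),
                                key=lambda i: self._distance(row, self.X[i]))[:self.k]
            votes = Counter(self.y[i] for i in neighbours)
            predictions.append(votes.most_common(1)[0][0])
        return predictions
```

Usage mirrors the scikit-learn fit/predict convention: `clf = KNN(k=3); clf.fit(X_train, y_train); clf.predict(X_test)`.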

Sidenote: if you want to visualise the decision tree, you’ll need to install the following libraries. I used Homebrew to install graphviz, but you could also use a package manager on Linux:

brew install graphviz
pip3 install pydotplus

Multiprocessing in Python

I frequently find myself working with large lists where I need to apply the same time-consuming function to each element, without concern for the order in which these calculations are made. I’ve written a small class using Python’s multiprocessing module to help speed things up.

It will accept a list, break it into sublists whose size equals the number of processes you want to run in parallel, and then process each of the sublists as a separate task in a process pool. Finally, it will return a single list containing all the results.

import multiprocessing

class ProcessHelper:
    def __init__(self, num_processes=4):
        self.num_processes = num_processes

    def split_list(self, data_list):
        # Break the list into chunks of num_processes elements each
        list_of_lists = []
        for i in range(0, len(data_list), self.num_processes):
            list_of_lists.append(data_list[i:i + self.num_processes])
        return list_of_lists

    def map_reduce(self, function, data_list):
        split_data = self.split_list(data_list)
        processes = multiprocessing.Pool(processes=self.num_processes)
        results_list_of_lists =, split_data)
        # Flatten the list of result lists into a single list
        results_list = [item for sublist in results_list_of_lists for item in sublist]
        return results_list

To demonstrate how this class works, I’ll create a list of 20 integers from 0-19. I’ve also created a function that will square every number in a list. When I run it, I’ll pass the function (job) and the list (data). The class will then break this into a list of lists and then run the function as a separate process on each of the sublists.

def job(num_list):
    return [i*i for i in num_list]

data = list(range(20))

p = ProcessHelper(4)
result = p.map_reduce(job, data)

So if my data originally was a list that looked like this:

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]

When I split it into sublists, I’ll end up with a list of five sublists of four elements each (as I’ve initialised the helper with 4 processes, each chunk holds 4 elements):

[[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15], [16, 17, 18, 19]]

Finally, the result will give me the list of squared values that looks like this:

[0, 1, 4, 9, 16, 25, 36, 49, 64, 81, 100, 121, 144, 169, 196, 225, 256, 289, 324, 361]

I’ll continue to build this class as I identify other handy helper methods that I could add.

Visual Diagnostics for More Informed Machine Learning

I recently watched Rebecca Bilbro’s presentation at PyCon 2016 and thought I’d share a few of my short notes from it.

Model Selection Triple

When selecting a model, rather than going with your default favourite method, take 3 things into account:

  • Feature analysis: intelligent feature selection and engineering
  • Model selection: the model that makes the most sense for the problem/domain space
  • Hyperparameter tuning: once the model and features have been selected, choose the parameters that result in optimal performance

Visual Feature Analysis

  • Boxplots are a useful starting tool for looking at all features as they show you:
    • Central tendency
    • Distribution
    • Outliers
  • Histograms let you examine the distribution of a feature
  • SPLOMs (scatterplot matrices): pairwise plots of features to identify:
    • pairwise linear, quadratic and exponential relationships between variables
    • homo/heteroscedasticity
    • how features are distributed relative to each other
  • RadViz: plots features around a circle and shows how much pull each one has
  • Parallel coordinates: lets you visualise multiple variables as line segments – you want to find separating chords, which can help with classification
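
The first two tools are easy to try with pandas alone; here's a quick sketch on made-up data (the feature names and values are just placeholders):

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # render off-screen; drop this in a notebook
import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix

# Placeholder data: three numeric features
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "a": rng.normal(0, 1, 100),
    "b": rng.normal(5, 2, 100),
    "c": rng.uniform(0, 10, 100),
})

# Boxplots: central tendency, spread and outliers for every feature at once
df.plot(kind="box")

# SPLOM: pairwise scatter plots, with each feature's histogram on the diagonal
axes = scatter_matrix(df, diagonal="hist")
```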

Evaluation Tools

  • Classification heat maps: show you the areas where a model performs best
  • ROC-AUC and prediction error plots: show you which models perform better
  • Residual plots: show you which models are doing best, and why
  • Grid search and validation curves: show you the performance of a model across its parameters. You can create a visual heat map for grid search
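
As a from-scratch illustration of the ROC-AUC idea (my own sketch, not from the talk): AUC equals the probability that a randomly chosen positive example is scored higher than a randomly chosen negative one, which can be computed directly:

```python
def roc_auc(y_true, y_scores):
    # Rank-based AUC: compare every positive score against every negative score;
    # a win counts 1, a tie counts 0.5
    pos = [s for s, t in zip(y_scores, y_true) if t == 1]
    neg = [s for s, t in zip(y_scores, y_true) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

An AUC of 0.5 means the scores are no better than chance; 1.0 means the classes are perfectly separated.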
Multiple plots in figures with R

The other day I was working with R in a Jupyter Notebook when I discovered that I needed to include multiple plots in the same figure.

Surprisingly, R doesn’t include this capability out of the box, so I went searching and found this function, which does the job. I’ve included the code below for my own future reference, in case the linked site ever disappears.

multiplot <- function(..., plotlist=NULL, file, cols=1, layout=NULL) {
  library(grid)

  # Make a list from the ... arguments and plotlist
  plots <- c(list(...), plotlist)

  numPlots = length(plots)

  # If layout is NULL, then use 'cols' to determine layout
  if (is.null(layout)) {
    # Make the panel
    # ncol: Number of columns of plots
    # nrow: Number of rows needed, calculated from # of cols
    layout <- matrix(seq(1, cols * ceiling(numPlots/cols)),
                    ncol = cols, nrow = ceiling(numPlots/cols))
  }

  if (numPlots==1) {
    print(plots[[1]])

  } else {
    # Set up the page
    pushViewport(viewport(layout = grid.layout(nrow(layout), ncol(layout))))

    # Make each plot, in the correct location
    for (i in 1:numPlots) {
      # Get the i,j matrix positions of the regions that contain this subplot
      matchidx <- == i, arr.ind = TRUE))

      print(plots[[i]], vp = viewport(layout.pos.row = matchidx$row,
                                      layout.pos.col = matchidx$col))
    }
  }
}

Then to call the function, you just have to pass it the plots:

multiplot(p1, p2, p3, p4, cols=2)
Using VNC to view the Raspberry Pi GUI

One of my side projects has involved playing around with the camera on a Raspberry Pi, and I realised that it would be handy to see the camera output. After finding out that a TightVNC server would do the job, the first step was to install it on the Raspberry Pi:

sudo apt-get install tightvncserver

With that done, I had to launch it:

vncserver :1 -geometry 800x600 -depth 24

The first time you set it up, you should be asked to give it a new password. You’ll need to run the command above each time you restart your Raspberry Pi.

Finally, to connect from my Mac, I can go to Finder > Go > Connect to Server and enter the address below (change the username and IP address):

open vnc://pi@
Finding rows in dataframe with a 0 value using Pandas

Recently I needed to identify which of the rows in a CSV file contained 0 values. This was interesting because normally I tend to look at this problem within columns rather than rows. Pandas provides a neat solution to this which I’ll demonstrate below using this data as an example:

import pandas as pd

d = {'a': [2,0,2,3,0,5, 9], 'b': [5,0,1,0,11,4,6]}
df = pd.DataFrame(d)

This data frame should look like this:

   a   b
0  2   5
1  0   0
2  2   1
3  3   0
4  0  11
5  5   4
6  9   6

Now, the final step to subset the data based on the existence of a 0 value within a row is to use the apply function and check whether a 0 appears among the row values:

zero_rows_df = df[df.apply(lambda row: 0 in row.values, axis=1)]

That’s it! You should now have a dataframe containing only the 0 value rows:

   a   b
1  0   0
3  3   0
4  0  11
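
An equivalent vectorised approach (a sketch using boolean masking instead of apply, which is usually faster on large frames) gives the same subset:

```python
import pandas as pd

d = {'a': [2, 0, 2, 3, 0, 5, 9], 'b': [5, 0, 1, 0, 11, 4, 6]}
df = pd.DataFrame(d)

# (df == 0) builds a boolean frame; any(axis=1) flags rows with at least one 0
zero_rows_df = df[(df == 0).any(axis=1)]
```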

When to use ‘is’ and ‘==’ in Python

One of the things that may seem confusing in Python is that there appear to be two ways to test whether variables are the same: ‘==’ and ‘is’:

x = 200
y = 200
print(x == y)
print(x is y)

Both comparisons returned True, so they do the same thing, right? Well, not really. To illustrate this, I’ll change the integer value assigned:

x = 500
y = 500
print(x == y)
print(x is y)

Now we have ‘==’ returning True and ‘is’ returning False (at least when run in the interactive interpreter). What happened? Well, ‘==’ is referred to as the equality operator, and it checks that the values of the variables are the same. That’s why it returned True both times. On the other hand, ‘is’ is referred to as the identity operator and compares the ids of the objects that the variables point to. The first example only returned True because CPython caches small integers (-5 to 256), so both variables ended up pointing at the same cached object; 500 falls outside that range. I have a more detailed post on object ids available here.

So the moral of this is that usually you just want to use the equality operator (‘==’), as there is no guarantee that variables storing the same number will also point to the same object (and therefore have the same object id). The only times you really need to use the ‘is’ operator is when you explicitly need to check if one variable is actually pointing to the same object as another variable.
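
Lists make the distinction easy to see, because equal lists are never cached as a single object:

```python
a = [1, 2]
b = [1, 2]
c = a

# Equal values, but two distinct list objects
print(a == b)   # True
print(a is b)   # False

# c is just another name for the very same object as a
print(c is a)   # True
```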

Python object ids and mutable types

Did you know that every object in your Python program is given a unique identifier by the interpreter which you can return using the ‘id()’ function? Let’s see what happens when we assign variables to each other in Python and then print out the variable value and object id:

x = 5
y = x
print(x, id(x))
print(y, id(y))

Now I am printing out two things about each variable: the value it stores (the number 5) and the object id. The object id for both variables should be the same. This is telling us that both the x and y variables are referring to the same object. Now, what happens if I were to change the y variable?

x = 5
y = x
y += 1
print(x, id(x))
print(y, id(y))

Here I’ve added 1 to the y variable, and as you can see, both the value and the object id of the y variable are now different from those of the x variable. This change has occurred because integers in Python are immutable data types. Integers, floats, strings and tuples are examples of immutable objects in Python. What this means is that when you change the number a variable refers to, the Python interpreter points the variable to a completely different object.

Now, the distinction between mutable and immutable objects in Python starts to become more important when you start working with mutable data types like lists:

x = [1]
y = x
y.append(5)
print(x, id(x))
print(y, id(y))

The thing that may surprise a lot of people is that the list both x and y point to has changed! Now the list has a 1 and a 5 in it, even though I only appended to y. This is very easy to get tripped up on if you don’t know what Python is really doing under the hood.

Now how do you get around this? Luckily the standard library has us covered with a handy package called ‘copy’:

import copy

x = [1]
y = copy.copy(x)
y.append(5)
print(x, id(x))
print(y, id(y))

Once you use the ‘copy’ method on x, you can see from the output that y has a new object id and that changes you make to the y variable no longer affect x. Problem solved, right?

Well, there’s now just one final thing to consider – what happens if I have a list of lists and I use the copy method?

import copy

x = [[1, 2], [3, 4]]
y = copy.copy(x)
y[0].append(5)
print(x, id(x))
print(y, id(y))

You can see that I’ve made a copy of the x variable, then appended to the first list in y (the one containing [1, 2]). However, when you see the output, you’ll notice that, like before, the change made to y has also modified x. You can see the reason for this by looking at the object ids of the first lists of x and y:

print(x[0], id(x[0]))
print(y[0], id(y[0]))

Both lists refer to the same object, only the outer list was copied and received a new object id! To solve this, we’ll need to use the ‘deepcopy’ method. This will create unique copies of all the objects throughout the list – not just a copy of the outer list itself:

import copy

x = [[1, 2], [3, 4]]
y = copy.deepcopy(x)
print(x, id(x))
print(y, id(y))

Now we should see that any changes to the lists in x or y will no longer affect the other variable.

So how is this useful? The reason I decided to write this post this evening was because some of the code I was working on today involved initialising multiple dictionaries with many keys. I could have copied and pasted them, but it was a lot easier and more readable for me to create an initial variable, then use ‘deepcopy’ to initialise the other variables. If I hadn’t been aware of Python object ids and mutable types, then I could very easily have created a massive bug in my code by creating several variables that all modified the same dictionary.
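
The pattern from that day's code looked roughly like this (the template contents here are made up for illustration):

```python
import copy

# A placeholder template dictionary with nested structure
template = {"counts": {"pass": 0, "fail": 0}, "notes": []}

# deepcopy gives each variable fully independent copies of the nested objects
run_a = copy.deepcopy(template)
run_b = copy.deepcopy(template)

run_a["counts"]["pass"] += 1
run_a["notes"].append("first run")

# run_b is untouched; with plain assignment or copy.copy,
# the nested dict and list would have been shared and modified too
```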

Closing a frozen SSH session

Normally, when I leave an SSH session idle too long and it freezes the terminal, I just close that tab and start a new one. I never bothered to look up what I could do to unfreeze it and continue. I recently came across this article, which showed the three keystrokes you need to unfreeze your terminal in this situation. I’ve included them below so I can easily find them again in the future.

  1. [enter]
  2. ~
  3. .

