Announcing BioSky

BioSky is a website I’ve been setting up with Rina Soetanto recently. We are both doing PhDs in the biological sciences and have a keen interest in research and the treatments being developed in the biotech and health industry.

Read More
Improving Model Accuracy

I wrote a few quick bullet points down from the article “8 Proven Ways for improving the “Accuracy” of a Machine Learning Model” for future reference.

Improving Accuracy

  • Add more data
  • Fix missing values (see the sketch after this list)
    • Continuous: impute with median/mean/mode
    • Categorical: treat as separate class
    • Predict missing classes with k-nearest neighbours
  • Outliers
    • Delete
    • Bin
    • Impute
    • Treat as separate to the others
  • Feature engineering
    • Transform and normalise: scale features to between 0 and 1
    • Eliminate skewness (e.g. with a log transform) for algorithms that assume a normal distribution
    • Create features: Date of transactions might not be useful but day of the week may be
  • Feature selection
    • Best features to use: identify via visualisation or through domain knowledge
    • Significance: Use p-values and other metrics to identify the right features. Can also use dimensionality reduction while preserving relationships in the data
  • Test multiple machine learning algorithms and tune their parameters
  • Ensemble methods: combine multiple weak predictors (bagging and boosting)
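To make a couple of these points concrete, here is a minimal sketch of median imputation, treating missing categories as their own class, and 0-1 scaling. It uses a small made-up pandas DataFrame, so the column names and values are purely illustrative:

import pandas as pd

# Hypothetical example data with missing values
df = pd.DataFrame({
    "age": [25, None, 47, 31],               # continuous
    "colour": ["red", None, "blue", "red"],  # categorical
    "income": [40000, 52000, 61000, 38000],
})

# Continuous: impute with the median
df["age"] = df["age"].fillna(df["age"].median())

# Categorical: treat missing values as their own class
df["colour"] = df["colour"].fillna("missing")

# Transform and normalise: scale income between 0 and 1
df["income"] = (df["income"] - df["income"].min()) / (df["income"].max() - df["income"].min())

print(df)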
Read More
Working with Imbalanced Classes

I wrote a few quick bullet points down from the article “8 Tactics to Combat Imbalanced Classes in Your Machine Learning Dataset” for future reference.

Tactics

  • Imbalanced datasets occur when you have a class that occurs much less frequently than the others.
  • If a model ignores the class, it can still achieve a high classification accuracy, but that’s not the result we want
  • Make sure you use a confusion matrix to ensure that you’re getting acceptable accuracy for all your classes
  • One potential solution to this problem is to collect more data
  • Resampling your dataset (can be random or non-random – stratified):
    • Add copies of underrepresented classes (oversampling/sampling with replacement). Useful if you don’t have much data – tens of thousands of instances or fewer (see the sketch after this list).
    • Delete instances of classes that occur frequently (undersampling). Handy to use if you have a lot of data – tens to hundreds of thousands of instances.
  • Try different algorithms: decision trees can perform well on imbalanced datasets
  • Penalised models: Extra penalties for misclassifying minority class. Examples of these algorithms could include penalized-SVM and penalized-LDA.
  • There are areas of research dedicated to imbalanced datasets: can look into anomaly detection and change detection.
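As a rough sketch of the resampling tactics above (my own toy example, not code from the article), both oversampling and undersampling can be done with pandas’ sample:

import pandas as pd

# Hypothetical imbalanced dataset: 8 "common" instances, 2 "rare" ones
df = pd.DataFrame({
    "feature": range(10),
    "label": ["common"] * 8 + ["rare"] * 2,
})

majority = df[df["label"] == "common"]
minority = df[df["label"] == "rare"]

# Oversampling: add copies of the minority class (sampling with replacement)
oversampled = minority.sample(n=len(majority), replace=True, random_state=42)
balanced_up = pd.concat([majority, oversampled])

# Undersampling: randomly drop instances of the majority class
undersampled = majority.sample(n=len(minority), random_state=42)
balanced_down = pd.concat([undersampled, minority])

print(balanced_up["label"].value_counts())
print(balanced_down["label"].value_counts())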
Read More
Assessing machine learning algorithm performance

I wrote a few quick bullet points down from the article “How To Implement Machine Learning Algorithm Performance Metrics From Scratch With Python”.

Metrics

  • Classification accuracy
    • Test how well predictions of a model do overall
    • accuracy = correct predictions / total predictions
  • Confusion matrix
    • Use to identify how well your predictions did with different classes
    • Very useful if you have an imbalanced dataset
    • I wrote an extremely hacked-together confusion matrix for my tag identification software. I had 4 classes (U, C, R, Q) and the confusion matrix below shows you what your model predicted against what the real category was.

        U    C    R    Q
   U  175   17   67    1
   C   11  335   14    0
   R   26    8  298    0
   Q    6    0    3   93

  • Mean absolute error for regression
    • Always positive – the average of how much your predicted values differ from the real values
  • Root mean squared error for regression
    • Square root of the mean of squared differences between the actual and predicted value
    • Squaring the values gives you positive numbers and taking the root brings the values back to the original units (see the sketch after this list).
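A quick sketch of these three metrics implemented from scratch (my own paraphrase of the idea rather than the article’s exact code):

from math import sqrt

def accuracy(actual, predicted):
    # Fraction of predictions that match the actual labels
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    return correct / len(actual)

def mean_absolute_error(actual, predicted):
    # Average absolute difference between predicted and actual values
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def root_mean_squared_error(actual, predicted):
    # Square root of the mean of squared differences
    mean_squared = sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    return sqrt(mean_squared)

print(accuracy([0, 1, 1, 0], [0, 1, 0, 0]))             # 0.75
print(mean_absolute_error([3.0, 2.0], [2.5, 2.5]))      # 0.5
print(root_mean_squared_error([3.0, 2.0], [2.5, 2.5]))  # 0.5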
Read More
Machine Learning Recipes

I found an excellent tutorial series on Machine Learning on the Google Developers YouTube channel this weekend. It uses Python, scikit-learn and TensorFlow and covers decision trees and k-nearest neighbours (KNN).

I really liked the focus on understanding what was going on under the hood. I followed along, implemented KNN from scratch and expanded on the base class they described so that k could be passed in as a variable. You can find my implementation in a Jupyter Notebook here.
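For context, here is a stripped-down sketch of what a from-scratch KNN classifier with a configurable k can look like (not the exact code from my notebook): it stores the training data and, for each test point, takes a majority vote among the k closest training points by Euclidean distance.

from collections import Counter
from math import sqrt

class SimpleKNN:
    def __init__(self, k=3):
        self.k = k

    def fit(self, X_train, y_train):
        # KNN is "lazy": just remember the training data
        self.X_train = X_train
        self.y_train = y_train

    def _distance(self, a, b):
        return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    def predict(self, X_test):
        predictions = []
        for row in X_test:
            # Sort training points by distance to this test point
            distances = sorted(
                ((self._distance(row, train_row), label)
                 for train_row, label in zip(self.X_train, self.y_train)),
                key=lambda pair: pair[0],
            )
            # Majority vote among the k nearest neighbours
            labels = [label for _, label in distances[:self.k]]
            predictions.append(Counter(labels).most_common(1)[0][0])
        return predictions

clf = SimpleKNN(k=3)
clf.fit([[0, 0], [1, 1], [5, 5], [6, 6]], ["a", "a", "b", "b"])
print(clf.predict([[0.5, 0.5], [5.5, 5.5]]))  # ['a', 'b']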

Sidenote: If you want to visualise the decision tree, you’ll need to install the following libraries. I used Homebrew to install graphviz, but you could also use a package manager on Linux:


brew install graphviz
pip3 install pydotplus
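With those installed, the tree can be exported to an image along these lines (a rough sketch in the spirit of the tutorial; the iris dataset here is just a stand-in for your own data):

import pydotplus
from sklearn import tree
from sklearn.datasets import load_iris

iris = load_iris()
clf = tree.DecisionTreeClassifier()
clf.fit(iris.data, iris.target)

# Export the tree to DOT format, then render it to a PDF with graphviz
dot_data = tree.export_graphviz(
    clf,
    out_file=None,
    feature_names=iris.feature_names,
    class_names=iris.target_names,
    filled=True,
    rounded=True,
)
graph = pydotplus.graph_from_dot_data(dot_data)
graph.write_pdf("iris_tree.pdf")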

Read More
Multiprocessing in Python

I frequently find myself working with large lists where I need to apply the same time-consuming function to each element in the list without concern for the order that these calculations are made. I’ve written a small class using Python’s multiprocessing module to help speed things up.

It will accept a list, break it up into sublists whose length matches the number of processes you want to run in parallel, and then farm the sublists out to a pool of worker processes. Finally, it will return a single list containing all the results.

import multiprocessing

class ProcessHelper:
    def __init__(self, num_processes=4):
        self.num_processes = num_processes

    def split_list(self, data_list):
        # Break the list into sublists of num_processes elements each
        list_of_lists = []
        for i in range(0, len(data_list), self.num_processes):
            list_of_lists.append(data_list[i:i+self.num_processes])
        return list_of_lists

    def map_reduce(self, function, data_list):
        # Run the function over each sublist using a pool of worker processes
        split_data = self.split_list(data_list)
        processes = multiprocessing.Pool(processes=self.num_processes)
        results_list_of_lists = processes.map(function, split_data)
        processes.close()
        # Flatten the list of result lists back into a single list
        results_list = [item for sublist in results_list_of_lists for item in sublist]
        return results_list

To demonstrate how this class works, I’ll create a list of 20 integers from 0-19. I’ve also created a function that will square every number in a list. When I run it, I’ll pass the function (job) and the list (data). The class will then break the data into sublists and run the function over them in parallel worker processes.

def job(num_list):
    return [i*i for i in num_list]

data = list(range(20))

p = ProcessHelper(4)
result = p.map_reduce(job, data)
print(result)

So if my data originally was a list that looked like this:

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]

When I split it into sublists, I’ll end up with a list of 5 lists of 4 elements each (as I’ve indicated that I want to run 4 processes, each sublist contains 4 items):

[[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15], [16, 17, 18, 19]]

Finally, the result will give me the list of squared values that looks like this:

[0, 1, 4, 9, 16, 25, 36, 49, 64, 81, 100, 121, 144, 169, 196, 225, 256, 289, 324, 361]

I’ll continue to build this class as I identify other handy helper methods that I could add.

Read More
Visual Diagnostics for More Informed Machine Learning

I recently watched Rebecca Bilbro’s presentation at PyCon 2016 and thought I’d share a few short notes from it.

Model Selection Triple

When selecting a model, rather than going with your default favourite method, take 3 things into account:

  • Feature analysis: intelligent feature selection and engineering
  • Model selection: model that makes most sense for problem/domain space
  • Hyperparameter Tuning: once model and features have been selected, select the parameters that result in optimal performance.

Visual Feature Analysis

  • Boxplots are a useful starting tool for looking at all features as they show you:
    • Central tendency
    • Distribution
    • Outliers
  • Histograms let you examine the distribution of a feature
  • SPLOMs (scatterplot matrices): Pairwise plots of features to identify:
    • pairwise linear, quadratic and exponential relationships between variables
    • Homo/heteroscedasticity
    • How features are distributed relative to each other
  • Radviz: Plot features around a circle and show how much pull each one has (see the sketch after this list)
  • Parallel coordinates: lets you visualise multiple variables as line segments – you want to find separating chords which can help with classification
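Here is a minimal sketch of a few of these plots using pandas’ built-in plotting helpers (the iris data is just a stand-in for your own features):

import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix, radviz, parallel_coordinates
from sklearn.datasets import load_iris

iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df["species"] = [iris.target_names[i] for i in iris.target]

# Boxplots: central tendency, spread and outliers for each feature
df.boxplot()
plt.show()

# Histogram: distribution of a single feature
df["sepal length (cm)"].hist()
plt.show()

# Splom (scatterplot matrix): pairwise relationships between features
scatter_matrix(df[iris.feature_names])
plt.show()

# Radviz: each feature "pulls" points towards its position on the circle
radviz(df, "species")
plt.show()

# Parallel coordinates: look for separating chords between classes
parallel_coordinates(df, "species")
plt.show()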

Evaluation Tools

  • Classification heat maps: show you the areas where the model is performing best
  • ROC-AUC and Prediction Error Plots: Show you which models are performing better
  • Residual plots: Show you which models are doing best and why
  • Grid search and validation curves: show you the performance of a model across its parameters. You can create a visual heatmap for a grid search (see the sketch below)
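And a rough sketch of the grid-search heatmap idea (my own example with an SVC and two hypothetical parameter ranges, not code from the talk):

import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

iris = load_iris()
param_grid = {"C": [0.1, 1, 10, 100], "gamma": [0.001, 0.01, 0.1, 1]}

grid = GridSearchCV(SVC(), param_grid, cv=5)
grid.fit(iris.data, iris.target)

# Reshape the mean cross-validation scores into a C x gamma grid
scores = grid.cv_results_["mean_test_score"].reshape(
    len(param_grid["C"]), len(param_grid["gamma"])
)

# Heatmap of model performance across the parameter grid
plt.imshow(scores, interpolation="nearest", cmap="viridis")
plt.xlabel("gamma")
plt.ylabel("C")
plt.xticks(range(len(param_grid["gamma"])), param_grid["gamma"])
plt.yticks(range(len(param_grid["C"])), param_grid["C"])
plt.colorbar(label="mean CV accuracy")
plt.show()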
Read More
Multiple plots in figures with R

The other day I was working with R in a Jupyter Notebook when I discovered that I needed to include multiple figures in the same plot.

Surprisingly, ggplot2 doesn’t include this capability out of the box, so I went searching and found this function that does the job. I’ve included the code below for my own future reference in case the linked site ever disappears.

multiplot <- function(..., plotlist=NULL, file, cols=1, layout=NULL) {
  library(grid)

  # Make a list from the ... arguments and plotlist
  plots <- c(list(...), plotlist)

  numPlots = length(plots)

  # If layout is NULL, then use 'cols' to determine layout
  if (is.null(layout)) {
    # Make the panel
    # ncol: Number of columns of plots
    # nrow: Number of rows needed, calculated from # of cols
    layout <- matrix(seq(1, cols * ceiling(numPlots/cols)),
                    ncol = cols, nrow = ceiling(numPlots/cols))
  }

 if (numPlots==1) {
    print(plots[[1]])

  } else {
    # Set up the page
    grid.newpage()
    pushViewport(viewport(layout = grid.layout(nrow(layout), ncol(layout))))

    # Make each plot, in the correct location
    for (i in 1:numPlots) {
      # Get the i,j matrix positions of the regions that contain this subplot
      matchidx <- as.data.frame(which(layout == i, arr.ind = TRUE))

      print(plots[[i]], vp = viewport(layout.pos.row = matchidx$row,
                                      layout.pos.col = matchidx$col))
    }
  }
}

Then to call the function, you just have to pass it the plots:

multiplot(p1, p2, p3, p4, cols=2)
Read More
Using VNC to view the Raspberry Pi GUI

One of my side projects has involved playing around with the camera on a Raspberry Pi, and I realised that it would be handy to see the camera output. After finding out that TightVNC server would do the job, the first step was to install it on the Raspberry Pi:

sudo apt-get install tightvncserver

With that done, I had to launch it:

vncserver :1 -geometry 800x600 -depth 24

The first time you run it, you’ll be asked to set a password. You’ll need to run the command above each time you restart your Raspberry Pi.

Finally, to view the desktop from my Mac, I can run the command below in a terminal (or enter the vnc:// address via Finder > Go > Connect to Server); change the username and IP address to match your own setup.

open vnc://pi@10.20.66.94:5901
Read More
Finding rows in dataframe with a 0 value using Pandas

Recently I needed to identify which of the rows in a CSV file contained 0 values. This was interesting because normally I tend to look at this problem within columns rather than rows. Pandas provides a neat solution to this which I’ll demonstrate below using this data as an example:


import pandas as pd

d = {'a': [2,0,2,3,0,5, 9], 'b': [5,0,1,0,11,4,6]}
df = pd.DataFrame(d)

This data frame should look like this:


   a   b
0  2   5
1  0   0
2  2   1
3  3   0
4  0  11
5  5   4
6  9   6

Now, the final step is to subset the data based on whether a 0 value appears anywhere in a row. To do this, use the apply function across the rows (axis=1) and check whether 0 is in each row’s values:


import pandas as pd

zero_rows_df = df[df.apply(lambda row: 0 in row.values, axis=1)]

That’s it! You should now have a dataframe containing only the 0 value rows:


   a   b
1  0   0
3  3   0
4  0  11
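As a side note, the same rows can be selected without apply by using a vectorised comparison (an equivalent alternative, not what the original snippet used, assuming the same df as above):

# Compare every cell to 0 and keep rows where any comparison is True
zero_rows_df = df[(df == 0).any(axis=1)]
print(zero_rows_df)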

Read More