An issue I recently came across whilst using the Python requests module was that, while trying to parse HTML text, I couldn’t remove the newline characters ‘\n’ with strip().
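The catch is that strip() only removes leading and trailing whitespace, so newlines embedded in the middle of the text survive it. A minimal sketch of one way around it (the URL is just a placeholder for the example):

```python
import requests

# Placeholder URL for illustration
html_text = requests.get("https://example.com").text

# strip() only trims the ends of the string, so embedded newlines remain
trimmed = html_text.strip()

# Replace the embedded newlines instead...
cleaned = trimmed.replace("\n", " ")

# ...or collapse all runs of whitespace in one go
cleaned = " ".join(trimmed.split())
```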
I recently listened to a really interesting talk by Jonathan Whitmore where he discussed the approach his company has to working with data using the Jupyter Notebook. I’d recommend watching it, but I’ve made a brief summary below for my own future reference.
BioSky is a website I’ve been setting up with Rina Soetanto recently. We are both doing PhDs in the biological sciences and have a keen interest in research and the treatments being developed in the biotech and health industry.
I wrote a few quick bullet points down from the article “8 Proven Ways for improving the “Accuracy” of a Machine Learning Model” for future reference.
Improving Accuracy
- Add more data
- Fix missing values (a small imputation sketch follows this list)
  - Continuous: impute with median/mean/mode
  - Categorical: treat as a separate class
  - Predict missing classes with k-nearest neighbours
- Outliers
  - Delete
  - Bin
  - Impute
  - Treat as separate to the others
- Feature engineering
  - Transform and normalise: scale between 0-1
  - Eliminate skewness (e.g. log transform) for algorithms that require a normal distribution
  - Create features: the date of a transaction might not be useful, but the day of the week may be
- Feature selection
  - Best features to use: identify via visualisation or through domain knowledge
  - Significance: use p-values and other metrics to identify the right features. Can also use dimensionality reduction while preserving relationships in the data
- Test multiple machine learning algorithms and tune their parameters
- Ensemble methods: combine multiple weak predictors (bagging and boosting)
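As a rough illustration of the missing-value bullets above, here is a minimal pandas sketch (the column names are made up for the example): median imputation for a continuous column and a separate “missing” class for a categorical one.

```python
import pandas as pd
import numpy as np

# Toy frame with gaps in both a continuous and a categorical column
df = pd.DataFrame({
    "age": [23, 31, np.nan, 45, 27],
    "segment": ["a", None, "b", "b", None],
})

# Continuous: impute with the median
df["age"] = df["age"].fillna(df["age"].median())

# Categorical: treat missing values as their own class
df["segment"] = df["segment"].fillna("missing")

print(df)
```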
I wrote a few quick bullet points down from the article “8 Tactics to Combat Imbalanced Classes in Your Machine Learning Dataset” for future reference.
Tactics
- Imbalanced datasets occur when you have a class that occurs much less frequently than the others.
- If a model ignores that class, it can still achieve a high classification accuracy, but that’s not the result we want
- Make sure you use a confusion matrix to check that you’re getting acceptable accuracy for all your classes
- One potential solution to this problem is to collect more data
- Resampling your dataset (can be random or non-random, e.g. stratified) – a small oversampling sketch follows this list:
  - Add copies of underrepresented classes (oversampling/sampling with replacement). Useful if you don’t have much data – tens of thousands of instances or fewer.
  - Delete instances of classes that occur frequently (undersampling). Handy if you have a lot of data – tens to hundreds of thousands of instances
- Try different algorithms: decision trees can perform well on imbalanced datasets
- Penalised models: extra penalties for misclassifying the minority class. Examples of these algorithms include penalised SVM and penalised LDA.
- There are areas of research dedicated to imbalanced datasets: look into anomaly detection and change detection.
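A minimal sketch of the oversampling idea using scikit-learn’s resample helper (the DataFrame and class labels are placeholders for the example):

```python
import pandas as pd
from sklearn.utils import resample

# Placeholder data: 'label' is the class column, 1 is the rare class
df = pd.DataFrame({"feature": range(12),
                   "label":   [0] * 10 + [1] * 2})

majority = df[df["label"] == 0]
minority = df[df["label"] == 1]

# Oversample the minority class with replacement until it matches the majority
minority_upsampled = resample(minority,
                              replace=True,
                              n_samples=len(majority),
                              random_state=42)

balanced = pd.concat([majority, minority_upsampled])
print(balanced["label"].value_counts())
```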
I wrote a few quick bullet points down from the article “How To Implement Machine Learning Algorithm Performance Metrics From Scratch With Python”.
Metrics
- Classification accuracy
  - Tests how well the predictions of a model do overall
  - accuracy = correct predictions / total predictions
- Confusion matrix
  - Use it to identify how well your predictions did for the different classes
  - Very useful if you have an imbalanced dataset
  - I wrote an extremely hacked-together confusion matrix for my tag identification software. I had 4 classes (U, C, R, Q), and the confusion matrix shows what the model predicted against what the real category was.
|   | U   | C   | R   | Q  |
|---|-----|-----|-----|----|
| U | 175 | 17  | 67  | 1  |
| C | 11  | 335 | 14  | 0  |
| R | 26  | 8   | 298 | 0  |
| Q | 6   | 0   | 3   | 93 |
- Mean absolute error for regression
  - Always positive – the average of how much your predicted values differ from the real values
- Root mean squared error for regression
  - Square root of the mean of the squared differences between the actual and predicted values
  - Squaring the values gives you positive numbers, and taking the root lets you compare the values to the original units (a from-scratch sketch of accuracy and RMSE follows this list)
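In the spirit of the article, here is a minimal from-scratch sketch of classification accuracy and RMSE; the helper names are my own, not the article’s.

```python
from math import sqrt

def classification_accuracy(actual, predicted):
    """correct predictions / total predictions"""
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    return correct / len(actual)

def rmse(actual, predicted):
    """Square root of the mean of the squared differences."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return sqrt(sum(squared_errors) / len(actual))

print(classification_accuracy([0, 1, 1, 0], [0, 1, 0, 0]))  # 0.75
print(rmse([3.0, 2.5, 4.0], [2.5, 2.0, 5.0]))               # ~0.707
```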
I found an excellent tutorial series on Machine Learning on the Google Developers YouTube channel this weekend. It uses Python, scikit-learn and tensorflow and covers decision trees and k-nearest neighbours (KNN).
I really liked the focus on understanding what was going on underneath the hood. I followed along and implemented KNN from scratch and expanded on the base class they described to include the ability to include k as a variable. You can find my implementation in a Jupyter Notebook here.
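The notebook has the full version, but the core idea looks roughly like this (a simplified sketch of the approach with k as a parameter, not the notebook code itself): for each test point, take the k closest training points by Euclidean distance and let them vote.

```python
from collections import Counter
from math import sqrt

class SimpleKNN:
    """Bare-bones k-nearest neighbours classifier."""

    def __init__(self, k=3):
        self.k = k

    def fit(self, X_train, y_train):
        self.X_train = X_train
        self.y_train = y_train

    def _distance(self, a, b):
        return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    def predict(self, X_test):
        predictions = []
        for row in X_test:
            # Sort training points by distance and keep the k closest labels
            neighbours = sorted(zip(self.X_train, self.y_train),
                                key=lambda pair: self._distance(row, pair[0]))[:self.k]
            labels = [label for _, label in neighbours]
            # Majority vote among the k neighbours
            predictions.append(Counter(labels).most_common(1)[0][0])
        return predictions

# Tiny usage example with made-up points
clf = SimpleKNN(k=3)
clf.fit([[0, 0], [0, 1], [5, 5], [6, 5]], ["a", "a", "b", "b"])
print(clf.predict([[0.5, 0.5], [5.5, 5.0]]))  # ['a', 'b']
```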
Sidenote: If you want to visualise the decision tree, you’ll need to install the following libraries. I used Homebrew to install graphviz, but you could also use a package manager on Linux:
```bash
brew install graphviz
pip3 install pydotplus
```
I frequently find myself working with large lists where I need to apply the same time-consuming function to each element in the list without concern for the order in which these calculations are made. I’ve written a small class using Python’s multiprocessing module to help speed things up.
It will accept a list, break it up into a list of lists the size of the number of processes you want to run in parallel, and then process each of the sublists as a separate process. Finally, it will return a list containing all the results.
```python
import multiprocessing


class ProcessHelper:
    def __init__(self, num_processes=4):
        self.num_processes = num_processes

    def split_list(self, data_list):
        # Break the list into chunks of num_processes elements each
        list_of_lists = []
        for i in range(0, len(data_list), self.num_processes):
            list_of_lists.append(data_list[i:i+self.num_processes])
        return list_of_lists

    def map_reduce(self, function, data_list):
        # Run the function over each chunk in a pool of worker processes
        split_data = self.split_list(data_list)
        processes = multiprocessing.Pool(processes=self.num_processes)
        results_list_of_lists = processes.map(function, split_data)
        processes.close()
        # Flatten the per-chunk results back into a single list
        results_list = [item for sublist in results_list_of_lists for item in sublist]
        return results_list
```
To demonstrate how this class works, I’ll create a list of 20 integers from 0-19. I’ve also created a function that will square every number in a list. When I run it, I’ll pass the function (job) and the list (data). The class will then break this into a list of lists and then run the function as a separate process on each of the sublists.
```python
def job(num_list):
    return [i*i for i in num_list]

data = range(20)
p = ProcessHelper(4)
result = p.map_reduce(job, data)
print(result)
```
So if my data was originally a list that looked like this:
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
When I split it into sublists, I’ll end up with sublists of four elements each (the chunk size matches the 4 processes I’ve asked for), which for 20 items gives five sublists:
[[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15], [16, 17, 18, 19]]
Finally, the result will give me the list of squared values that looks like this:
[0, 1, 4, 9, 16, 25, 36, 49, 64, 81, 100, 121, 144, 169, 196, 225, 256, 289, 324, 361]
I’ll continue to build this class as I identify other handy helper methods that I could add.
I recently watched Rebecca Bilbro’s presentation at PyCon 2016 and thought I’d share a few of my short notes from her interesting presentation.
Model Selection Triple
When selecting a model, rather than going with your default favourite method, take 3 things into account:
- Feature analysis: intelligent feature selection and engineering
- Model selection: model that makes most sense for problem/domain space
- Hyperparameter Tuning: once model and features have been selected, select the parameters that result in optimal performance.
Visual Feature Analysis
- Boxplots are a useful starting tool for looking at all features, as they show you:
  - Central tendency
  - Distribution
  - Outliers
- Histograms let you examine the distribution of a feature
- Sploms (scatterplot matrices): pairwise plots of features to identify:
  - pairwise linear, quadratic and exponential relationships between variables
  - homo/heteroscedasticity
  - how features are distributed relative to each other
- Radviz: plot features around a circle and show how much pull they have
- Parallel coordinates: let you visualise multiple variables as line segments – you want to find separating chords, which can help with classification (a small plotting sketch follows this list)
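Several of these plots are available directly in pandas; here is a minimal sketch using the iris dataset as stand-in data (assumes pandas, matplotlib and scikit-learn are installed):

```python
import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix, radviz, parallel_coordinates
from sklearn.datasets import load_iris

iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df["species"] = iris.target

# Splom: pairwise scatter plots with histograms on the diagonal
scatter_matrix(df.drop(columns="species"), diagonal="hist")
plt.show()

# Radviz: each feature pulls points towards its position on the circle
radviz(df, "species")
plt.show()

# Parallel coordinates: look for separating chords between classes
parallel_coordinates(df, "species")
plt.show()
```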
Evaluation Tools
- Classification heat maps: show you the areas where a model is performing best
- ROC-AUC and prediction error plots: show you which models are performing better
- Residual plots: show you which models are doing best and why
- Grid search and validation curves: show you how a model’s performance varies across its parameters. You can create a visual heatmap for grid search (a small sketch follows this list)
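As a rough illustration of the last point, a minimal sketch that turns GridSearchCV results into a heatmap (the dataset and parameter grid are made up for the example):

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Made-up parameter grid for the example
param_grid = {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]}
search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X, y)

# Reshape the mean test scores into a C x gamma grid and plot as a heatmap
scores = search.cv_results_["mean_test_score"].reshape(
    len(param_grid["C"]), len(param_grid["gamma"]))
plt.imshow(scores, interpolation="nearest", cmap="viridis")
plt.xticks(range(len(param_grid["gamma"])), param_grid["gamma"])
plt.yticks(range(len(param_grid["C"])), param_grid["C"])
plt.xlabel("gamma")
plt.ylabel("C")
plt.colorbar(label="mean CV accuracy")
plt.show()
```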
The other day I was working with R in a Jupyter Notebook when I discovered that I needed to include multiple figures in the same plot.
Surprisingly, R doesn’t include this capability out of the box, so I went searching and found this function that does the job. I’ve included the code below for my own future reference in case the linked site ever disappears.
```r
multiplot <- function(..., plotlist=NULL, file, cols=1, layout=NULL) {
  library(grid)

  # Make a list from the ... arguments and plotlist
  plots <- c(list(...), plotlist)

  numPlots = length(plots)

  # If layout is NULL, then use 'cols' to determine layout
  if (is.null(layout)) {
    # Make the panel
    # ncol: Number of columns of plots
    # nrow: Number of rows needed, calculated from # of cols
    layout <- matrix(seq(1, cols * ceiling(numPlots/cols)),
                     ncol = cols, nrow = ceiling(numPlots/cols))
  }

  if (numPlots==1) {
    print(plots[[1]])
  } else {
    # Set up the page
    grid.newpage()
    pushViewport(viewport(layout = grid.layout(nrow(layout), ncol(layout))))

    # Make each plot, in the correct location
    for (i in 1:numPlots) {
      # Get the i,j matrix positions of the regions that contain this subplot
      matchidx <- as.data.frame(which(layout == i, arr.ind = TRUE))

      print(plots[[i]], vp = viewport(layout.pos.row = matchidx$row,
                                      layout.pos.col = matchidx$col))
    }
  }
}
```
Then to call the function, you just have to pass it the plots:
```r
multiplot(p1, p2, p3, p4, cols=2)
```