Category:

Content

Essential Libraries for Data Science on a Mac

by Jack Simpson December 28, 2016

I recently ran a fresh install on my Mac and thought I’d take the opportunity to document the libraries and programs I find incredibly useful.

The Python libraries I’ll frequently pip3 install include:

Content Programming

Removing webpage newline characters in Python

by Jack Simpson December 27, 2016

An issue I recently came across whilst using the Python requests module was that while I was trying to parse HTML text, I couldn’t remove the newline characters ‘\n’ with strip().

Content Data Science Tips & Tutorials

Best practices for data science with the Jupyter Notebook

by Jack Simpson December 18, 2016

I recently listened to a really interesting talk by Jonathan Whitmore where he discussed the approach his company has to working with data using the Jupyter Notebook. I’d recommend watching it, but I’ve made a brief summary below for my own future reference.

Blogging Content Research

Announcing BioSky

by Jack Simpson December 14, 2016

BioSky is a website I’ve been setting up with Rina Soetanto recently. We are both doing PhDs in the biological sciences and have a keen interest in research and the treatments being developed in the biotech and health industry.

Content Data Science Machine Learning

Improving Model Accuracy

by Jack Simpson December 11, 2016

I wrote a few quick bullet points down from the article “8 Proven Ways for improving the “Accuracy” of a Machine Learning Model” for future reference.

Improving Accuracy

Add more data
Fix missing values
- Continuous: impute with median/mean/mode
- Categorical: treat as separate class
- Predict missing classes with k-nearest neighbours
Outliers
- Delete
- Bin
- Impute
- Treat separately from the other values
Feature engineering
- Transform and normalise: scale between 0-1
- Eliminate skewness (e.g. log) for algorithms that require normal distribution
- Create features: Date of transactions might not be useful but day of the week may be
Feature selection
- Best features to use: identify via visualisation or through domain knowledge
- Significance: Use p-values and other metrics to identify the right features. Can also use dimensionality reduction while preserving relationships in the data
Test multiple machine learning algorithms and tune their parameters
Ensemble methods: combine multiple weak predictors (bagging and boosting)
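
A few of the steps above can be sketched in plain Python. This is a minimal illustration rather than anything from the article: the function names, the toy `amounts` column and the choice of `log1p` for the skew transform are all assumptions made for the example.

```python
import math
from datetime import date
from statistics import median

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]


def impute_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]


def log_transform(values):
    """Apply log(1 + x) to reduce right skew; assumes non-negative inputs."""
    return [math.log1p(v) for v in values]


def min_max_scale(values):
    """Rescale linearly so the smallest value maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def day_of_week(d):
    """Derive a day-of-week feature from a date (weekday() has Monday = 0)."""
    return DAY_NAMES[d.weekday()]


# Hypothetical toy column with one missing value and one very large value
amounts = [10.0, None, 30.0, 20.0, 1000.0]
filled = impute_median(amounts)
scaled = min_max_scale(log_transform(filled))
weekday = day_of_week(date(2016, 12, 11))
```

Here the missing amount is filled with the median of the observed values, the log transform pulls the large value closer to the rest before scaling, and the transaction date is reduced to the day it fell on.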

Content Data Science Machine Learning Tips & Tutorials

Working with Imbalanced Classes

by Jack Simpson December 11, 2016

I wrote a few quick bullet points down from the article “8 Tactics to Combat Imbalanced Classes in Your Machine Learning Dataset” for future reference.

Tactics

Imbalanced datasets occur when you have a class that occurs much less frequently than the others.
If a model ignores the class, it can still achieve a high classification accuracy, but that’s not the result we want
Make sure you use a confusion matrix to ensure that you’re getting acceptable accuracy for all your classes
One potential solution to this problem is to collect more data
Resampling your dataset (can be random or non-random – stratified):
- Add copies of underrepresented classes (oversampling/sampling with replacement). Useful if you don’t have much data – tens of thousands of instances or fewer.
- Delete instances of classes that occur frequently (undersampling). Handy to use if you have a lot of data – 10s-100s of thousands of instances
Try different algorithms: decision trees can perform well on imbalanced datasets
Penalised models: Extra penalties for misclassifying minority class. Examples of these algorithms could include penalized-SVM and penalized-LDA.
There are areas of research dedicated to imbalanced datasets: can look into anomaly detection and change detection.
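
As a rough sketch of the oversampling tactic, the snippet below duplicates minority-class rows at random until every class is as large as the biggest one. The `oversample` helper and the toy data are hypothetical, written only for this illustration; in practice a library such as imbalanced-learn would be a more likely choice.

```python
import random
from collections import Counter


def oversample(rows, labels, seed=0):
    """Randomly duplicate minority-class rows until all classes match the largest."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        pool = [r for r, label in zip(rows, labels) if label == cls]
        # Sample with replacement to make up this class's shortfall
        extra = rng.choices(pool, k=target - n)
        out_rows.extend(extra)
        out_labels.extend([cls] * len(extra))
    return out_rows, out_labels


# Hypothetical toy data: 8 "common" rows and 2 "rare" rows
rows = list(range(10))
labels = ["common"] * 8 + ["rare"] * 2

# A model that always predicts "common" is still 80% accurate on this data
baseline_accuracy = labels.count("common") / len(labels)

balanced_rows, balanced_labels = oversample(rows, labels)
balanced_counts = Counter(balanced_labels)
```

The baseline figure shows why accuracy alone is misleading here. It is usually safer to resample only the training split, so that duplicated rows do not leak into the data used for evaluation.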

Content Data Science Machine Learning Tips & Tutorials

Assessing machine learning algorithm performance

by Jack Simpson December 11, 2016

I wrote a few quick bullet points down from the article “How To Implement Machine Learning Algorithm Performance Metrics From Scratch With Python”.

Metrics

Classification accuracy
- Test how well predictions of a model do overall
- accuracy = correct predictions / total predictions
Confusion matrix
- Use to identify how well your predictions did with different classes
- Very useful if you have an imbalanced dataset
- I wrote an extremely hacked together confusion matrix for my tag identification software. I had 4 classes (U, C, R, Q) and the confusion matrix shows you what your model predicted against what the real category was.

	U	C	R	Q
U	175	17	67	1
C	11	335	14	0
R	26	8	298	0
Q	6	0	3	93

Mean absolute error for regression
- The average of the absolute differences between your predicted values and the real values, so it is never negative
Root mean squared error for regression
- Square root of the mean of squared differences between the actual and predicted value
- Squaring the values gives you positive numbers and finding the root lets you compare the values to the original units.
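
The four metrics above are short enough to write out directly. The following is a minimal sketch using only the standard library; the function names and toy data are my own choices for illustration, not code from the article. The confusion matrix here is indexed as `matrix[actual][predicted]`, which is an assumption of this sketch.

```python
from math import sqrt


def accuracy(actual, predicted):
    """Fraction of predictions that match the real labels."""
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    return correct / len(actual)


def confusion_matrix(actual, predicted, classes):
    """Return matrix[a][p]: how often real class a was predicted as class p."""
    matrix = {a: {p: 0 for p in classes} for a in classes}
    for a, p in zip(actual, predicted):
        matrix[a][p] += 1
    return matrix


def mean_absolute_error(actual, predicted):
    """Average absolute difference between real and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def root_mean_squared_error(actual, predicted):
    """Square root of the mean squared difference, in the original units."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return sqrt(sum(squared) / len(actual))


# Hypothetical classification results
actual_labels = ["U", "U", "C", "R", "Q", "R"]
predicted_labels = ["U", "R", "C", "R", "Q", "U"]
acc = accuracy(actual_labels, predicted_labels)
cm = confusion_matrix(actual_labels, predicted_labels, ["U", "C", "R", "Q"])

# Hypothetical regression results
actual_values = [1.0, 2.0, 3.0, 4.0]
predicted_values = [2.0, 2.0, 1.0, 4.0]
mae = mean_absolute_error(actual_values, predicted_values)
rmse = root_mean_squared_error(actual_values, predicted_values)
```

On the regression data the errors are 1, 0, 2 and 0, so the mean absolute error is 0.75, while the root mean squared error is larger (the square root of 1.25) because squaring weights the biggest miss more heavily.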

Content Data Science Machine Learning Tips & Tutorials

Machine Learning Recipes

by Jack Simpson December 3, 2016

I found an excellent tutorial series on Machine Learning on the Google Developers YouTube channel this weekend. It uses Python, scikit-learn and TensorFlow and covers decision trees and k-nearest neighbours (KNN).

I really liked the focus on understanding what was going on underneath the hood. I followed along and implemented KNN from scratch and expanded on the base class they described to include the ability to include k as a variable. You can find my implementation in a Jupyter Notebook here.
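
For readers who have not seen the idea, a k-nearest-neighbours classifier with k as a parameter can be sketched as follows. This is not the notebook implementation mentioned above: the `SimpleKNN` class, its method names and the toy points are assumptions made for illustration, and `math.dist` requires Python 3.8 or later.

```python
from collections import Counter
from math import dist


class SimpleKNN:
    """Classify a point by majority vote among its k nearest training points."""

    def __init__(self, k=3):
        self.k = k

    def fit(self, features, labels):
        # KNN has no training step beyond remembering the data
        self.features = list(features)
        self.labels = list(labels)
        return self

    def predict_one(self, point):
        # Sort training points by Euclidean distance and keep the k closest
        ranked = sorted(zip(self.features, self.labels),
                        key=lambda pair: dist(pair[0], point))
        votes = Counter(label for _, label in ranked[:self.k])
        return votes.most_common(1)[0][0]

    def predict(self, points):
        return [self.predict_one(p) for p in points]


# Hypothetical toy data: two well-separated clusters
train_x = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
train_y = ["a", "a", "a", "b", "b", "b"]

model = SimpleKNN(k=3).fit(train_x, train_y)
predictions = model.predict([(0.5, 0.5), (5.5, 5.5)])
```

With k set to 1 this reduces to a plain nearest-neighbour classifier; larger values of k smooth out the effect of individual noisy points.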

Sidenote: If you want to visualise the decision tree, you’ll need to install the following libraries. I used Homebrew to install graphviz but you could also use a package manager on Linux:


brew install graphviz
pip3 install pydotplus

Content Programming Tips & Tutorials

Multiprocessing in Python

by Jack Simpson November 23, 2016

I frequently find myself working with large lists where I need to apply the same time-consuming function to each element in the list without concern for the order in which these calculations are made. I’ve written a small class using Python’s multiprocessing module to help speed things up.

It will accept a list, break it up into sublists that are each as long as the number of processes you want to run in parallel, and then hand the sublists to a pool of worker processes. Finally, it will return a single list containing all the results.

import multiprocessing


class ProcessHelper:
    def __init__(self, num_processes=4):
        self.num_processes = num_processes

    def split_list(self, data_list):
        # Each sublist holds up to num_processes items
        list_of_lists = []
        for i in range(0, len(data_list), self.num_processes):
            list_of_lists.append(data_list[i:i+self.num_processes])
        return list_of_lists

    def map_reduce(self, function, data_list):
        split_data = self.split_list(data_list)
        processes = multiprocessing.Pool(processes=self.num_processes)
        # map() blocks until every sublist has been processed
        results_list_of_lists = processes.map(function, split_data)
        processes.close()
        # Wait for the worker processes to exit before returning
        processes.join()
        # Flatten the per-sublist results back into a single list
        results_list = [item for sublist in results_list_of_lists for item in sublist]
        return results_list

To demonstrate how this class works, I’ll create a list of 20 integers from 0-19. I’ve also created a function that will square every number in a list. When I run it, I’ll pass the function (job) and the list (data). The class will then break this into a list of lists and then run the function as a separate process on each of the sublists.

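def job(num_list):
    # Square every number in one sublist
    return [i*i for i in num_list]

if __name__ == "__main__":
    # The guard matters on platforms that start workers by re-importing this module
    data = list(range(20))

    p = ProcessHelper(4)
    result = p.map_reduce(job, data)
    print(result)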

So if my data originally was a list that looked like this:

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]

When I split it into sublists, I’ll end up with 5 sublists of 4 items each (the sublist length matches the 4 processes I’ve asked for):

[[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15], [16, 17, 18, 19]]

Finally, the result will give me the list of squared values that looks like this:

[0, 1, 4, 9, 16, 25, 36, 49, 64, 81, 100, 121, 144, 169, 196, 225, 256, 289, 324, 361]

I’ll continue to build this class as I identify other handy helper methods that I could add.
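
The split above produces sublists whose length equals the number of processes. If you would rather have exactly one sublist per process, one possible variation is sketched below. The `split_into_n_chunks` helper is hypothetical and not part of the class; it only illustrates the alternative.

```python
def split_into_n_chunks(data_list, n):
    """Split data_list into n contiguous chunks whose lengths differ by at most one."""
    base, extra = divmod(len(data_list), n)
    chunks = []
    start = 0
    for i in range(n):
        # The first `extra` chunks take one additional item each
        size = base + (1 if i < extra else 0)
        chunks.append(data_list[start:start + size])
        start += size
    return chunks


even = split_into_n_chunks(list(range(20)), 4)
uneven = split_into_n_chunks(list(range(10)), 4)
```

For 20 items and 4 processes this gives 4 sublists of 5 items. Fewer, larger sublists mean fewer tasks to hand out, although uneven workloads may balance better across many small sublists.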

Content Data Science Machine Learning

Visual Diagnostics for More Informed Machine Learning

by Jack Simpson November 3, 2016

I recently watched Rebecca Bilbro’s presentation at PyCon 2016 and thought I’d share a few short notes from it.

Model Selection Triple

When selecting a model, rather than going with your default favourite method, take 3 things into account:

Feature analysis: intelligent feature selection and engineering
Model selection: model that makes most sense for problem/domain space
Hyperparameter Tuning: once model and features have been selected, select the parameters that result in optimal performance.

Visual Feature Analysis

Boxplots are a useful starting tool for looking at all features as they show you:
- Central tendency
- Distribution
- Outliers
Histograms let you examine the distribution of a feature
SPLOMs (scatterplot matrices): pairwise plots of features to identify:
- pairwise linear, quadratic and exponential relationships between variables
- Homo/heteroscedasticity
- How features are distributed relative to each other
RadViz: Plot features around a circle and show how much pull they have
Parallel coordinates: lets you visualise multiple variables as line segments – you want to find separating chords which can help with classification
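
To make the boxplot point concrete, the numbers a boxplot draws can be computed directly. This is a small sketch under assumptions of my own: the `boxplot_summary` helper is hypothetical, the 1.5 × IQR rule is the common convention for flagging outliers, and quartile values depend on the interpolation method chosen (here the "inclusive" method of `statistics.quantiles`, available from Python 3.8).

```python
from statistics import quantiles


def boxplot_summary(values):
    """Return the quartiles and the points a 1.5 * IQR boxplot would flag."""
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    outliers = [v for v in values if v < low_fence or v > high_fence]
    return {"q1": q1, "median": q2, "q3": q3, "outliers": outliers}


# Hypothetical feature with one extreme value
feature = [1, 2, 3, 4, 5, 6, 7, 8, 9, 50]
summary = boxplot_summary(feature)
```

The median gives the central tendency, the gap between the first and third quartiles describes the spread, and the value 50 falls outside the upper fence and would be drawn as an outlier.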

Evaluation Tools

Classification heat maps: show you areas where the model is performing best
ROC-AUC and Prediction Error Plots: Show you which models are performing better
Residual plots: Show you which models are doing best and why
Grid search and validation curves: show you the performance of a model across its parameter values. You can create a visual heatmap for grid search
