Tag:

Python

How to vectorize conditional calculations in Python

by Jack Simpson April 4, 2021

written by Jack Simpson

Pandas and NumPy are fantastic libraries that enable you to take advantage of vectorization to write extremely efficient Python code. However, what happens when the calculation you wish to run changes based on the value in another column of your dataset?

For example, take a look at the dataset in the table below (along with the code to generate it):

Group	Value
A	1
A	1
B	1
C	1

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'Group':['A','A','B','C'],
    'Value':[1,1,1,1]
})

Imagine I wish to create a third column (‘Result’) based on the following logic:

Multiply Value by 2 if Group == ‘A’
Multiply Value by 3 if Group == ‘B’
Multiply Value by 4 if Group == ‘C’
Fill with a missing value (nan) if none of the above is true
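One way to express this logic without a Python-level loop is NumPy's `np.select`, which takes a list of boolean conditions and a matching list of choices. The sketch below assumes the same `df` as above; the variable names `conditions` and `choices` are illustrative, and this is only one possible approach.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Group': ['A', 'A', 'B', 'C'],
    'Value': [1, 1, 1, 1]
})

# One boolean mask per rule, evaluated over the whole column at once.
conditions = [
    df['Group'] == 'A',
    df['Group'] == 'B',
    df['Group'] == 'C',
]

# The calculation to apply where the matching condition is True.
choices = [
    df['Value'] * 2,
    df['Value'] * 3,
    df['Value'] * 4,
]

# Rows that match none of the conditions receive the default (nan).
df['Result'] = np.select(conditions, choices, default=np.nan)
print(df)
```

With this data the Result column holds 2.0, 2.0, 3.0 and 4.0. The values are floats because nan is the default, which forces a floating-point column.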


Content Programming

Google have released a Python to Go transcompiler

by Jack Simpson January 11, 2017


Google have released an open source project on GitHub called Grumpy that converts Python to Go, and then compiles it down to native code.

It’s an interesting development, but since they won’t be supporting C extension modules (which basically rules out all the scientific and machine learning libraries I use), it means I probably won’t end up using this new tool too much.


Content Data Science Machine Learning

Deep Learning PyData Talk

by Jack Simpson January 7, 2017


Deep learning is a type of machine learning based on neural networks, which were inspired by neurons in the brain. The difference between a deep neural network and a normal neural network is the number of ‘hidden layers’ between the input and output layers.

I recently watched an excellent presentation on Deep Learning by Roelof Pieters titled ‘Python for image and text understanding; One model to rule them all!’. I recommend watching it, and I’ve written this post to record a few of my own bullet points from the talk for future reference.


Content Data Science Machine Learning

Essential Libraries for Data Science on a Mac

by Jack Simpson December 28, 2016


I recently ran a fresh install on my Mac and thought I’d take the opportunity to document the libraries and programs I find incredibly useful.

The Python libraries I’ll frequently pip3 install include:


Content Programming

Removing webpage newline characters in Python

by Jack Simpson December 27, 2016


An issue I recently came across whilst using the Python requests module was that while I was trying to parse HTML text, I couldn’t remove the newline characters (‘\n’) with strip().
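The excerpt does not show the fix the author settled on, but the behaviour itself is easy to reproduce: `str.strip()` only removes whitespace at the start and end of a string, so newlines in the middle survive. The sketch below uses a made-up string (`raw`) in place of real HTML text and shows two common ways to deal with interior newlines.

```python
# Hypothetical text standing in for content pulled from a web page.
raw = "\n  First line\nSecond line\n\n"

# strip() trims only the leading and trailing whitespace.
stripped = raw.strip()
print(repr(stripped))      # 'First line\nSecond line'

# replace() swaps every newline, including the interior one.
no_newlines = raw.replace("\n", " ").strip()
print(repr(no_newlines))   # 'First line Second line'

# split() with no arguments breaks on any run of whitespace
# (spaces, tabs, \n, \r\n), so joining the pieces collapses it all.
collapsed = " ".join(raw.split())
print(repr(collapsed))     # 'First line Second line'
```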


Content Data Science Tips & Tutorials

Best practices for data science with the Jupyter Notebook

by Jack Simpson December 18, 2016


I recently listened to a really interesting talk by Jonathan Whitmore where he discussed the approach his company has to working with data using the Jupyter Notebook. I’d recommend watching it, but I’ve made a brief summary below for my own future reference.


Content Data Science Machine Learning Tips & Tutorials

Machine Learning Recipes

by Jack Simpson December 3, 2016


I found an excellent tutorial series on Machine Learning on the Google Developers YouTube channel this weekend. It uses Python, scikit-learn and TensorFlow and covers decision trees and k-nearest neighbours (KNN).

I really liked the focus on understanding what was going on underneath the hood. I followed along, implemented KNN from scratch, and expanded on the base class they described so that it accepts k as a parameter. You can find my implementation in a Jupyter Notebook here.
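To give a rough idea of what a from-scratch KNN with a configurable k can look like, here is a minimal sketch. It is not the code from the notebook or the tutorial series: the class name `SimpleKNN`, its method names and the toy data are all assumptions made for illustration. It uses Euclidean distance and a majority vote among the k nearest training points.

```python
import math
from collections import Counter


class SimpleKNN:
    """Hypothetical minimal k-nearest-neighbours classifier."""

    def __init__(self, k=3):
        self.k = k

    def fit(self, X_train, y_train):
        # KNN has no real training step: it just memorises the data.
        self.X_train = list(X_train)
        self.y_train = list(y_train)
        return self

    def predict(self, X_test):
        return [self._predict_one(row) for row in X_test]

    def _predict_one(self, row):
        # Pair every training label with its distance to the query row.
        distances = [
            (math.dist(row, train_row), label)
            for train_row, label in zip(self.X_train, self.y_train)
        ]
        # Keep the k closest points and take a majority vote on their labels.
        distances.sort(key=lambda pair: pair[0])
        nearest_labels = [label for _, label in distances[:self.k]]
        return Counter(nearest_labels).most_common(1)[0][0]


X = [[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.9]]
y = ['a', 'a', 'b', 'b']

model = SimpleKNN(k=3).fit(X, y)
predictions = model.predict([[1.1, 1.0], [5.1, 5.0]])
print(predictions)  # ['a', 'b']
```

Note that `math.dist` requires Python 3.8 or later.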

Sidenote: If you want to visualise the decision tree, you’ll need to install the following libraries. I used Homebrew to install graphviz, but you could also use a package manager on Linux:


brew install graphviz
pip3 install pydotplus


Content Programming Tips & Tutorials

Multiprocessing in Python

by Jack Simpson November 23, 2016


I frequently find myself working with large lists where I need to apply the same time-consuming function to each element in the list without concern for the order that these calculations are made. I’ve written a small class using Python’s multiprocessing module to help speed things up.

It will accept a list, break it up into sublists whose length equals the number of processes you want to run in parallel, and then distribute those sublists across a pool of worker processes. Finally, it will return a single list containing all the results.

import multiprocessing


class ProcessHelper:
    """Run a function over chunks of a list using a pool of worker processes."""

    def __init__(self, num_processes=4):
        self.num_processes = num_processes

    def split_list(self, data_list):
        # Each sublist holds up to num_processes items, so the number of
        # sublists depends on the length of the input.
        list_of_lists = []
        for i in range(0, len(data_list), self.num_processes):
            list_of_lists.append(data_list[i:i + self.num_processes])
        return list_of_lists

    def map_reduce(self, function, data_list):
        split_data = self.split_list(data_list)
        # The pool is shut down when the with-block exits.
        with multiprocessing.Pool(processes=self.num_processes) as processes:
            results_list_of_lists = processes.map(function, split_data)
        # Flatten the per-sublist results back into a single list.
        return [item for sublist in results_list_of_lists for item in sublist]

To demonstrate how this class works, I’ll create a list of 20 integers from 0-19. I’ve also created a function that will square every number in a list. When I run it, I’ll pass the function (job) and the list (data). The class will then break the data into a list of lists and run the function on each sublist using the pool of worker processes.

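def job(num_list):
    # Square every number in the sublist.
    return [i * i for i in num_list]


if __name__ == '__main__':
    # The guard is needed on platforms that spawn worker processes
    # (such as macOS and Windows).
    data = list(range(20))

    p = ProcessHelper(4)
    result = p.map_reduce(job, data)
    print(result)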

So if my data originally was a list that looked like this:

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]

When I split it into sublists, I’ll end up with a list of 5 lists, each containing 4 items (the sublist length matches the 4 processes I’ve asked for):

[[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15], [16, 17, 18, 19]]

Finally, the result will give me the list of squared values that looks like this:

[0, 1, 4, 9, 16, 25, 36, 49, 64, 81, 100, 121, 144, 169, 196, 225, 256, 289, 324, 361]

I’ll continue to build this class as I identify other handy helper methods that I could add.
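One such helper, sketched here as an assumption rather than as part of the original class, would split the data into exactly as many sublists as there are processes, instead of into sublists whose length equals the number of processes. The function name `split_into_n_chunks` is hypothetical.

```python
def split_into_n_chunks(data_list, num_chunks):
    """Split data_list into num_chunks contiguous sublists of near-equal size."""
    base_size, remainder = divmod(len(data_list), num_chunks)
    chunks = []
    start = 0
    for i in range(num_chunks):
        # The first `remainder` chunks each take one extra item.
        size = base_size + (1 if i < remainder else 0)
        chunks.append(data_list[start:start + size])
        start += size
    return chunks


chunks = split_into_n_chunks(list(range(20)), 4)
print(chunks)
# [[0, 1, 2, 3, 4], [5, 6, 7, 8, 9], [10, 11, 12, 13, 14], [15, 16, 17, 18, 19]]
```

If the list is shorter than the number of chunks, the trailing sublists are simply empty.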


Content Data Science Machine Learning

Visual Diagnostics for More Informed Machine Learning

by Jack Simpson November 3, 2016


I recently watched Rebecca Bilbro’s presentation at PyCon 2016 and thought I’d share a few of my short notes from her interesting presentation.

Model Selection Triple

When selecting a model, rather than going with your default favourite method, take 3 things into account:

Feature analysis: intelligent feature selection and engineering
Model selection: model that makes most sense for problem/domain space
Hyperparameter Tuning: once model and features have been selected, select the parameters that result in optimal performance.

Visual Feature Analysis

Boxplots are a useful starting tool for looking at all features as they show you:
- Central tendency
- Distribution
- Outliers
Histograms let you examine the distribution of a feature
SPLOMs (scatterplot matrices): Pairwise plots of features to identify:
- pairwise linear, quadratic and exponential relationships between variables
- Homo/heteroscedasticity
- How features are distributed relative to each other
Radviz: Plot features around a circle and show how much pull each one has on the data points
Parallel coordinates: lets you visualise multiple variables as line segments – you want to find separating chords which can help with classification

Evaluation Tools

Classification heat maps: show you areas where model is performing best
ROC-AUC and Prediction Error Plots: Show you which models are performing better
Residual plots: Show you which models are doing best and why
Grid search and validation curves: show you the performance of a model across its parameters. You can create a visual heatmap for grid search


Content Programming Tips & Tutorials

Finding rows in dataframe with a 0 value using Pandas

by Jack Simpson July 4, 2016


Recently I needed to identify which of the rows in a CSV file contained 0 values. This was interesting because normally I tend to look at this problem within columns rather than rows. Pandas provides a neat solution to this which I’ll demonstrate below using this data as an example:


import pandas as pd

d = {'a': [2,0,2,3,0,5, 9], 'b': [5,0,1,0,11,4,6]}
df = pd.DataFrame(d)

This data frame should look like this:

	a	b
0	2	5
1	0	0
2	2	1
3	3	0
4	0	11
5	5	4
6	9	6

Now, the final step to subset the data based on the existence of a 0 value within a row is to use the apply function and to look to see if a 0 is in the row values:


import pandas as pd

zero_rows_df = df[df.apply(lambda row: 0 in row.values, axis=1)]

That’s it! You should now have a dataframe containing only the 0 value rows:

	a	b
1	0	0
3	3	0
4	0	11
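Because apply with axis=1 calls the lambda once per row, it can become slow on large files. A possible vectorized alternative, sketched here with the same example data, compares the whole dataframe to 0 in one step and then asks whether any column in each row matched. The name `zero_mask` is just illustrative.

```python
import pandas as pd

d = {'a': [2, 0, 2, 3, 0, 5, 9], 'b': [5, 0, 1, 0, 11, 4, 6]}
df = pd.DataFrame(d)

# (df == 0) gives a dataframe of booleans; any(axis=1) collapses each
# row to True if at least one of its columns is 0.
zero_mask = (df == 0).any(axis=1)
zero_rows_df = df[zero_mask]
print(zero_rows_df)
```

This should select the same rows (index 1, 3 and 4) as the apply version.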

