I recently ran a fresh install on my Mac and thought I’d take the opportunity to document the libraries and programs I find incredibly useful.
The Python libraries I’ll frequently pip3 install include:
An issue I recently came across whilst using the Python requests module was that, while trying to parse HTML text, I couldn’t remove the newline characters ‘\n’ with strip().
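One possible cause, assuming the newlines sat in the middle of the text rather than at its edges, is that strip() only trims leading and trailing whitespace. A minimal sketch, using made-up strings in place of a real response body:

```python
# Hypothetical text standing in for something like response.text
raw = "\n  Hello\nworld\n  "

# strip() only trims whitespace at the two ends of the string,
# so the newline between the words survives
stripped = raw.strip()

# replace() removes every newline, wherever it sits
no_newlines = stripped.replace("\n", " ")

# split() + join() collapses any run of whitespace into a single space
collapsed = " ".join(raw.split())

# Another possibility (an assumption): the text holds a literal backslash
# followed by "n" (two characters), which strip() never treats as whitespace
literal = "Hello\\nworld"
fixed_literal = literal.replace("\\n", " ")

print(repr(stripped))
print(repr(no_newlines))
print(repr(collapsed))
print(repr(fixed_literal))
```

Which variant applies depends on how the text was obtained, so printing repr() of the string first is a quick way to see what is actually in it.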
I recently listened to a really interesting talk by Jonathan Whitmore where he discussed the approach his company has to working with data using the Jupyter Notebook. I’d recommend watching it, but I’ve made a brief summary below for my own future reference.
BioSky is a website I’ve been setting up with Rina Soetanto recently. We are both doing PhDs in the biological sciences and have a keen interest in research and the treatments being developed in the biotech and health industry.
I wrote a few quick bullet points down from the article “8 Proven Ways for improving the “Accuracy” of a Machine Learning Model” for future reference.
Improving Accuracy
I wrote a few quick bullet points down from the article “8 Tactics to Combat Imbalanced Classes in Your Machine Learning Dataset” for future reference.
Tactics
I wrote a few quick bullet points down from the article “How To Implement Machine Learning Algorithm Performance Metrics From Scratch With Python”.
Metrics
|   | U   | C   | R   | Q  |
|---|-----|-----|-----|----|
| U | 175 | 17  | 67  | 1  |
| C | 11  | 335 | 14  | 0  |
| R | 26  | 8   | 298 | 0  |
| Q | 6   | 0   | 3   | 93 |
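As a small worked example, overall accuracy can be read straight off the confusion matrix values above: it is the sum of the diagonal divided by the sum of all cells, which holds whichever axis is the predicted one. The function name `accuracy` is my own choice for this sketch, not taken from the article.

```python
# Confusion matrix values from above; class order is U, C, R, Q
labels = ["U", "C", "R", "Q"]
matrix = [
    [175, 17, 67, 1],
    [11, 335, 14, 0],
    [26, 8, 298, 0],
    [6, 0, 3, 93],
]


def accuracy(confusion):
    """Share of all samples that sit on the diagonal (correct predictions)."""
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return correct / total


# 901 correct out of 1054 samples
print(f"accuracy = {accuracy(matrix):.3f}")  # accuracy = 0.855
```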
I found an excellent tutorial series on Machine Learning on the Google Developers YouTube channel this weekend. It uses Python, scikit-learn and tensorflow and covers decision trees and k-nearest neighbours (KNN).
I really liked the focus on understanding what was going on under the hood. I followed along, implemented KNN from scratch, and extended the base class they described so that k can be passed in as a parameter. You can find my implementation in a Jupyter Notebook here.
Sidenote: If you want to visualise the decision tree, you’ll need to install the following libraries. I used homebrew to install graphviz but you could also use a package manager on Linux:
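To give a flavour of the idea, here is a minimal sketch of a KNN classifier with k as a constructor argument. It is not the code from the notebook or the tutorial; the class name `SimpleKNN` and the toy data are assumptions made for illustration, and it uses plain Euclidean distance with a majority vote.

```python
from collections import Counter
import math


class SimpleKNN:
    """Hypothetical k-nearest-neighbours classifier (illustrative only)."""

    def __init__(self, k=3):
        self.k = k

    def fit(self, X_train, y_train):
        # KNN has no real training step: it just memorises the data
        self.X_train = X_train
        self.y_train = y_train
        return self

    def predict(self, X_test):
        return [self._predict_one(row) for row in X_test]

    def _predict_one(self, row):
        # Pair every training point's Euclidean distance with its label
        distances = [
            (math.dist(row, x), label)
            for x, label in zip(self.X_train, self.y_train)
        ]
        # Keep the k closest points and take a majority vote on their labels
        distances.sort(key=lambda pair: pair[0])
        nearest = [label for _, label in distances[: self.k]]
        return Counter(nearest).most_common(1)[0][0]


# Toy data: two well-separated clusters
X = [[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]]
y = ["a", "a", "b", "b", "b"]

clf = SimpleKNN(k=3).fit(X, y)
predictions = clf.predict([[1.1, 0.9], [5.1, 5.0]])
print(predictions)  # ['a', 'b']
```

`math.dist` needs Python 3.8 or later. Larger values of k smooth out noisy neighbours at the cost of blurring class boundaries.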
    brew install graphviz
    pip3 install pydotplus
I frequently find myself working with large lists where I need to apply the same time-consuming function to each element, without concern for the order in which these calculations are made. I’ve written a small class using Python’s multiprocessing module to help speed things up.
It will accept a list, break it up into sublists whose length equals the number of processes you want to run in parallel, and then hand the sublists to a pool of worker processes. Finally, it will return a single flat list containing all the results.
    import multiprocessing


    class ProcessHelper:
        def __init__(self, num_processes=4):
            self.num_processes = num_processes

        def split_list(self, data_list):
            # Chunk the data into sublists of length num_processes
            list_of_lists = []
            for i in range(0, len(data_list), self.num_processes):
                list_of_lists.append(data_list[i:i+self.num_processes])
            return list_of_lists

        def map_reduce(self, function, data_list):
            split_data = self.split_list(data_list)
            processes = multiprocessing.Pool(processes=self.num_processes)
            # Each worker receives one sublist and returns a list of results
            results_list_of_lists = processes.map(function, split_data)
            processes.close()
            processes.join()
            # Flatten the per-sublist results back into a single list
            results_list = [item for sublist in results_list_of_lists for item in sublist]
            return results_list
To demonstrate how this class works, I’ll create a list of 20 integers from 0-19. I’ve also created a function that will square every number in a list. When I run it, I’ll pass the function (job) and the list (data). The class will then break this into a list of lists and then run the function as a separate process on each of the sublists.
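    def job(num_list):
        return [i*i for i in num_list]


    # The guard stops worker processes from re-running this block on import
    if __name__ == "__main__":
        data = list(range(20))
        p = ProcessHelper(4)
        result = p.map_reduce(job, data)
        print(result)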
So if my data originally was a list that looked like this:
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
When I split it into sublists, I’ll end up with a list of 5 sublists of 4 elements each (the sublist length matches the 4 processes I’ve asked for):
[[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15], [16, 17, 18, 19]]
Finally, the result will give me the list of squared values that looks like this:
[0, 1, 4, 9, 16, 25, 36, 49, 64, 81, 100, 121, 144, 169, 196, 225, 256, 289, 324, 361]
I’ll continue to build this class as I identify other handy helper methods that I could add.
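One candidate addition: split_list produces sublists whose length equals num_processes, so the number of sublists grows with the data. If exactly one sublist per process is preferred, a hypothetical helper such as `split_into_n_chunks` (my own name, not part of the class above) could be sketched like this:

```python
def split_into_n_chunks(data_list, n):
    """Split data_list into n contiguous chunks whose sizes differ by at most 1."""
    data_list = list(data_list)
    base, extra = divmod(len(data_list), n)
    chunks = []
    start = 0
    for i in range(n):
        # The first `extra` chunks absorb one leftover element each
        size = base + (1 if i < extra else 0)
        chunks.append(data_list[start:start + size])
        start += size
    return chunks


chunks = split_into_n_chunks(range(20), 4)
print(chunks)
# [[0, 1, 2, 3, 4], [5, 6, 7, 8, 9], [10, 11, 12, 13, 14], [15, 16, 17, 18, 19]]
```

If the list is shorter than n, some of the chunks come back empty, which a function like job handles without any special casing.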
I recently watched Rebecca Bilbro’s presentation at PyCon 2016 and thought I’d share a few of my short notes from her interesting presentation.
Model Selection Triple
When selecting a model, rather than going with your default favourite method, take 3 things into account:
Visual Feature Analysis
Evaluation Tools