Deep Learning PyData Talk

Deep learning is a type of machine learning based on neural networks, which were inspired by neurons in the brain. The difference between a deep neural network and a normal neural network is the number of ‘hidden layers’ between the input and output layers.
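To make the ‘hidden layers’ distinction concrete, here is a minimal sketch of a fully connected network in plain Python. It is an illustration only and not taken from the talk: the layer widths, the random weights, and the use of tanh at every layer are all assumptions, and the function names are hypothetical.

```python
import math
import random


def build_network(layer_sizes, seed=0):
    """Create random weights and zero biases for a fully connected network.

    layer_sizes lists the width of every layer: input, hidden..., output.
    """
    rng = random.Random(seed)
    layers = []
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        weights = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)]
        biases = [0.0] * n_out
        layers.append((weights, biases))
    return layers


def forward(layers, inputs):
    """Pass the inputs through each layer in turn, applying tanh every time."""
    activations = inputs
    for weights, biases in layers:
        activations = [
            math.tanh(sum(w * a for w, a in zip(row, activations)) + b)
            for row, b in zip(weights, biases)
        ]
    return activations


def count_hidden_layers(layer_sizes):
    """Hidden layers are everything except the input and output layers."""
    return max(len(layer_sizes) - 2, 0)


shallow_sizes = [4, 8, 2]        # input, one hidden layer, output
deep_sizes = [4, 8, 8, 8, 2]     # input, three hidden layers, output

shallow_output = forward(build_network(shallow_sizes), [0.1, 0.2, 0.3, 0.4])
deep_output = forward(build_network(deep_sizes), [0.1, 0.2, 0.3, 0.4])
```

Both networks take the same input and produce an output of the same size; the only structural difference is how many hidden layers sit in between.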

I recently watched an excellent presentation on deep learning by Roelof Pieters titled ‘Python for image and text understanding; One model to rule them all!’ I recommend watching it, and I’ve written this post to put down a few of my own bullet points from the talk for future reference.

Essential WordPress Plugins for Blogging

Over the past few years of blogging with WordPress I’ve found a number of essential plugins for the platform. I’ve written this post as a resource for myself and others, and I’ll try to update it periodically if I find new or better plugins.

Plugins I’m using at the moment:

Why I refused a job offer to teach corporate programming workshops

A couple of months ago I was approached by an organisation that provided programming training to staff at companies. They asked me if I was interested in becoming a trainer for them based on my experience running Software Carpentry workshops.

After seeking clarification and looking through the teaching materials, I refused.

Essential Libraries for Data Science on a Mac

I recently ran a fresh install on my Mac and thought I’d take the opportunity to document the libraries and programs I find incredibly useful.

The Python libraries I’ll frequently pip3 install include:

Excel confusing CSV file with SYLK file

I recently had an interesting experience whilst using pandas to write some data to a CSV file and then opening the file up with Excel to inspect its contents. To my surprise, I received a message from Excel informing me that I was attempting to open something called a ‘SYLK file’.
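This excerpt doesn’t say what triggered the message. One well-documented cause, assumed here, is that Excel treats a text file as SYLK when its first two characters are the uppercase letters ‘ID’, which happens easily when the first column header is `ID`. The sketch below uses the standard-library `csv` module rather than pandas so that it runs without extra installs; the helper names are hypothetical.

```python
import csv
import io


def looks_like_sylk(text):
    """Return True if the text begins with the uppercase letters 'ID'."""
    return text.startswith("ID")


def to_csv_text(header, rows):
    """Write a header and rows to an in-memory CSV and return it as a string."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


rows = [[1, "alpha"], [2, "beta"]]

# Header starts with uppercase 'ID', so the file begins with those two letters.
risky = to_csv_text(["ID", "name"], rows)

# Lowercasing the first header is one commonly suggested workaround.
safe = to_csv_text(["id", "name"], rows)
```

Checking `looks_like_sylk(risky)` before handing a file to Excel users is a cheap guard; renaming the first column is the simplest fix under this assumption.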

Removing webpage newline characters in Python

An issue I recently came across whilst using the Python requests module was that while I was trying to parse HTML text, I couldn’t remove the newline characters ‘\n’ with strip().
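The cause in the original case isn’t shown in this excerpt, so the following is a sketch of two common explanations rather than the post’s own fix. First, `strip()` only removes whitespace from the ends of a string, so newlines in the middle survive. Second, the text may contain a literal backslash followed by `n` (two characters), which `strip()` does not treat as whitespace at all. The helper names are hypothetical.

```python
def collapse_whitespace(text):
    """Replace every run of whitespace, including newlines, with one space."""
    return " ".join(text.split())


def unescape_literal_newlines(text):
    """Turn the two-character sequence backslash + n into a real newline."""
    return text.replace("\\n", "\n")


# Case 1: real newlines, some of them in the middle of the text.
html_text = "\n  Hello\nworld  \n"
stripped = html_text.strip()                 # interior newline is still there
collapsed = collapse_whitespace(html_text)   # all newlines removed

# Case 2: literal backslash-n characters, not real newlines.
literal = "Hello\\nworld\\n"
unchanged = literal.strip()                  # strip() finds no whitespace to remove
cleaned = unescape_literal_newlines(literal).strip()
```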

Best practices for data science with the Jupyter Notebook

I recently listened to a really interesting talk by Jonathan Whitmore in which he discussed his company’s approach to working with data using the Jupyter Notebook. I’d recommend watching it, but I’ve made a brief summary below for my own future reference.

Announcing BioSky

BioSky is a website I’ve been setting up with Rina Soetanto recently. We are both doing PhDs in the biological sciences and have a keen interest in research and the treatments being developed in the biotech and health industry.

Improving Model Accuracy

I wrote down a few quick bullet points from the article “8 Proven Ways for improving the ‘Accuracy’ of a Machine Learning Model” for future reference.

Improving Accuracy

  • Add more data
  • Fix missing values
    • Continuous: impute with median/mean/mode
    • Categorical: treat as separate class
    • Predict missing values, e.g. with k-nearest neighbours
  • Outliers
    • Delete
    • Bin
    • Impute
    • Treat separately from the other observations
  • Feature engineering
    • Transform and normalise: scale values to between 0 and 1
    • Eliminate skewness (e.g. with a log transform) for algorithms that assume normally distributed data
    • Create features: the date of a transaction might not be useful, but the day of the week may be
  • Feature selection
    • Best features to use: identify via visualisation or through domain knowledge
    • Significance: use p-values and other metrics to identify the right features. Can also use dimensionality reduction while preserving relationships in the data
  • Test multiple machine learning algorithms and tune their parameters
  • Ensemble methods: combine multiple weak predictors (bagging and boosting)
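A few of the data-preparation bullets above can be sketched with the standard library alone. This is an illustration under stated assumptions, not code from the article: the sample values, the choice of the median for imputation, `log(1 + x)` as the skew-reducing transform, and all function names are made up for the example.

```python
import math
import statistics
from datetime import date

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def impute_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    median = statistics.median(observed)
    return [median if v is None else v for v in values]


def log_transform(values):
    """Apply log(1 + x) to compress a long right tail (needs x > -1)."""
    return [math.log1p(v) for v in values]


def min_max_scale(values):
    """Rescale linearly so the smallest value maps to 0 and the largest to 1."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]


def day_of_week(iso_date):
    """Derive a day-of-week feature from an ISO date string."""
    return WEEKDAYS[date.fromisoformat(iso_date).weekday()]


amounts = [10.0, None, 30.0, 1000.0]     # one missing value, one large value
filled = impute_median(amounts)          # missing value becomes the median
scaled = min_max_scale(log_transform(filled))
weekday = day_of_week("2016-03-14")
```

In practice the imputation value and the scaling bounds would be computed on the training data only and then reused on the test data, so that no information leaks between the two.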
Working with Imbalanced Classes

I wrote down a few quick bullet points from the article “8 Tactics to Combat Imbalanced Classes in Your Machine Learning Dataset” for future reference.

Tactics

  • Imbalanced datasets occur when one class occurs much less frequently than the others.
  • If a model ignores that class, it can still achieve a high classification accuracy, but that’s not the result we want.
  • Use a confusion matrix to check that you’re getting acceptable accuracy for every class.
  • One potential solution to this problem is to collect more data
  • Resampling your dataset (can be random or non-random – stratified):
    • Add copies of underrepresented classes (oversampling, i.e. sampling with replacement). Useful if you don’t have much data: tens of thousands of instances or fewer.
    • Delete instances of classes that occur frequently (undersampling). Handy if you have a lot of data: tens to hundreds of thousands of instances.
  • Try different algorithms: decision trees can perform well on imbalanced datasets
  • Penalised models: Extra penalties for misclassifying minority class. Examples of these algorithms could include penalized-SVM and penalized-LDA.
  • There are areas of research dedicated to imbalanced datasets: look into anomaly detection and change detection.
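Two of the tactics above, checking a confusion matrix and random oversampling, can be sketched as follows. The 95/5 class split, the class names, and the function names are assumptions made up for this illustration; they do not come from the article.

```python
import random
from collections import Counter


def confusion_counts(actual, predicted, positive):
    """Return (tp, fp, fn, tn), treating `positive` as the class of interest."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if a == positive and p == positive:
            tp += 1
        elif a != positive and p == positive:
            fp += 1
        elif a == positive and p != positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen rows until every class matches the largest."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    target = max(len(members) for members in by_class.values())
    out_rows, out_labels = [], []
    for label, members in by_class.items():
        # Sampling with replacement from the class's own rows.
        extra = [rng.choice(members) for _ in range(target - len(members))]
        for row in members + extra:
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels


# A model that always predicts the majority class.
actual = ["common"] * 95 + ["rare"] * 5
predicted = ["common"] * 100

tp, fp, fn, tn = confusion_counts(actual, predicted, positive="rare")
accuracy = (tp + tn) / len(actual)   # looks good despite ignoring "rare"
recall = tp / (tp + fn)              # share of "rare" cases actually found

rows = list(range(100))              # stand-in feature rows
balanced_rows, balanced_labels = random_oversample(rows, actual)
class_counts = Counter(balanced_labels)
```

Here the accuracy is high while the recall for the rare class is zero, which is exactly the failure the confusion matrix exposes. Oversampling is normally applied to the training split only, so that duplicated rows never appear in the evaluation data.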