One of the best things about the iPython notebook is the number of easy-to-follow tutorials it has inspired. I thought I’d share a few that I’ve found on machine learning and statistics.

Python for Developers – great resource for those wanting to learn and/or deepen their understanding of Python.
Machine Learning with scikit-learn – provides a good introduction and background to machine learning.
Machine learning with Python – covers regression, neural networks, decision trees.
Machine Learning with Python – covers PCA, k-means clustering, k-nearest neighbours.
Learn Data Science with Python – covers regression, random forests, k-means clustering.
Probabilistic Programming & Bayesian Methods for Hackers – covers Bayesian methods including Markov Chain Monte Carlo.
Bayesian data analysis – covers how probabilistic programming works.
Supervised Learning SVM – covers Support Vector Machines (SVM)
Face Recognition– covers PCA, and SVM.
Particle Filter – covers the identification and tracking of objects in a video.

I’ll continue to update the list as I find new notebooks I find handy.

April 19, 2014 0 comments

Content Links Programming Resources

Useful links for using ggplot2 in R

by Jack Simpson April 19, 2014

written by Jack Simpson

The R package ggplot2 is one of the best data visualisation tools I’ve come across, and while it simplifies generating impressive graphics, there’s still a bit you have to learn to use it. Here are a few of the posts I’ve found really handy when using this package:

I’ll try to continue updating this list as I find other good resources.

April 19, 2014 0 comments

Content Image Processing OpenCV Research

Tracking bees using local features

by Jack Simpson April 18, 2014

written by Jack Simpson

The project I am working on at present is focused on building an automated system for tracking the movement of honeybees in an observation hive, filmed using a grayscale camera and infrared light. My recent attempt at this has involved extracting features (or keypoints) from the frames and then describing them using the SIFT (Scale Invariant Feature Transform) algorithm, implemented in the OpenCV imaging library. There are a number of other (potentially faster) feature detection algorithms available (SURF, ORB, etc), however as SIFT has been traditionally regarded as the most robust, I thought I would start with it. SIFT works first by extracting what it deems to be robust keypoints/features. For each of these, it will then compute a unique feature descriptor based on the 16×16 pixel area of the feature. By extracting and describing these features, you can try to compare images as shown below.

April 18, 2014 0 comments

Blogging Content Research

Essential Free Tools for Research

by Jack Simpson April 13, 2014

written by Jack Simpson

I commenced my PhD in bioinformatics at the Australian National University about a month ago and thought I’d share some of the tools I’ve found absolutely essential. All of these tools are free to use (some do have paid plans, however at this point I haven’t had the need to sign up for them).

Evernote + Evernote Web Clipper: Once I started to really get into using this tool, I couldn’t understand how I’d lived without it before. Now I can clip snippets from web pages for later reference, write notes from a seminar I attended or upload a Python script and have this all in one easily searchable (and tag-able) location. There’s also a plethora of phone/tablet apps which work really well.
Wunderlist: While I tried to use Evernote for keeping to-do lists, I found that this simple yet powerful program (with support for PC, Mac, phones and tablets) worked extremely well for keeping track of my tasks.
Dropbox: I do all my work within the Dropbox folder on my computer. This way, when I move between computers everything is synced and backed-up. While I have tried Google Drive before, the lack of support for Linux as well as issues with syncing have led me back to using Dropbox.
Zotero: Zotero is a referencing manager with a browser plug-in which is fantastic at importing the bibliographic details and pdf of the paper from the web. It can also sync between computers (although you only get 300 mb of free sync space). While I have used the alternative referencing manager Mendeley before (which has 2gb free sync space and a brilliant program which enables you to annotate your pdfs), the Mendeley web import tool was so bad at importing the bibliographic details and pdfs of papers that I went over to Zotero.
VirtualBox: If you find yourself having to use a lot of Linux-only tools (or you like the fact that you can install entire programs with multiple dependencies merely with the command ‘sudo apt-get’) but don’t really want to set up anything permanent , then VirtualBox could be the way to go. Its a free program that allows you to run a whole operating system on your computer within a window just like any other program.
Coursera/EdX/Udacity: These websites offer free courses from great unis in everything from computer science to genomics.
Stack Overflow/Biostars: Great websites for asking questions relating to programming or bioinformatics. If you can’t find anything after searching Google, then this is the place to post your question.
Twitter: Although this can be a bit of a time-waster, I find that by following other researchers in my field and research organisations I can hear about new papers or tutorials.
Google Plus: G+ may not be as popular as Facebook, but I find most of my programming friends tend to be quite active on it and if you join one of the programming communities it can be a great place to learn new things or ask questions.
Google Scholar Alerts: Sign up to receive emails whenever the keywords you select are published in a paper.

April 13, 2014 0 comments

Blogging Content Programming

Explicit is better than implicit

by Jack Simpson March 7, 2014

written by Jack Simpson

Line two of the zen of Python reads “explicit is better than implicit” and until relatively recently I never truly appreciated the wisdom of those words. My change of heart stems from a series of Python scripts, where a large portion of my code dealt with automating and retrieving the results of a BLAST search using the fantastic BioPython toolkit. I was filtering my results based on an expect value of 0.04, which during my initial testing worked perfectly. However, as I wanted to make this value variable, I rewrote it into the program as a command-line argument. What I had not considered (but definitely should have) was how Python implicitly processes a command-line argument – as a string! I was never thrown an error – the program continued to work, so I assumed it was still doing the job just like in my tests. However, behind the scenes the filtering of my results had completely ceased to function.

The second (and in my opinion less obvious) issue of I have had with implicit design decisions relates to the qblast method from BioPython. As far as I could see, I was retrieving plenty of sequences, therefore the program must be working. However, my PI was rather suspicious of how so few sequences were being retrieved compared to the hundreds that were coming up when she would BLAST our sequence with the online interface. I searched numerous sites and went through the BioPython documentation but could find no mention of a sequence retrieval cut-off. Finally in desparation I went through the source code itself from the module that I was using and found this:

A default limit of 50 unless overruled! A few extra characters and the problem was fixed, but until this discovery I was having serious problems later in the pipeline that I could not understand.

[Edit] After a Twitter conversation with Peter from the BioPython Project, I’d like to add that this issue is due to the default settings of the online BLAST tool I was calling, as well as potentially the settings of the BioPython wrapper. A good lesson in understanding the defaults of the tools (BLAST) your tools (BioPython) are calling!

These are two of the most recent examples that I have seen of how aware developers need to be about the implicit default values and methods that are present in the language and library they are calling. I hope my mistakes and learning experience will be useful to others who may come across similar issues.

This was originally posted by myself on the Australian Bioinformatics Network.

March 7, 2014 0 comments

Content Programming Tips & Tutorials

R Tutorial

by Jack Simpson May 9, 2013

written by Jack Simpson

This tutorial is a beginners guide for getting started with R, once you complete it you should have R installed on your computer and be able to import data, perform basic statistical tests and create graphics.

Index

Getting started
Basics
Importing Data
Tests
Graphics
Packages

Getting Started

The first things you will have to do is download R and install it on your computer. To do this you’ll need to visit a CRAN (Comprehensive R Archive Network) repository. There are a number of sites you can find easily by searching, however here in Australia it is hosted by the CSIRO here. When you visit the site you’ll be asked to click on the link to the R version for your computer (Linux, Mac, Windows). Once you do so, you can then proceed to download the software (although for Windows users make sure you select the base version of R to install).

Once R is installed, you’re ready to get going, although I would recommend installing one other piece of software before proceeding – RStudio which may be found here. RStudio is a fantastic development environment for writing R scripts and is especially useful for beginners.

May 9, 2013 0 comments

Content Programming Research Tips & Tutorials

BioPython Tutorial

by Jack Simpson May 6, 2013

written by Jack Simpson

This tutorial is a brief overview of what you can achieve using the Python BioPython module. Although I’m hoping to write up some more articles on this site for beginners when time permits, this post will assume that you have experience programming in Python and have a bit of an understanding of basic biological concepts such as DNA, restriction enzymes etc. If you’re still interested once you finish reading, feel free to consult the BioPython documentation, it will help give you a bit of an idea of how massive (and awesome) this module really is.

So to start I’ll show you how to install the BioPython module. While on Linux systems it can be as simple as typing ‘sudo apt-get install python-biopython’ or going to the Software Center, you can manually install a module by going to PyPI, downloading and extracting the file, opening the command-line or terminal and navigating into the root directory of the folder you just extracted and running the command ‘setup.py install’.

You will need to download two modules to install BioPython, each of which are hosted on their own site. The first is SciPy and the second is BioPython. Once you have installed these you’re ready to get into using BioPython.

May 6, 2013 0 comments

Content Programming Research Tips & Tutorials

Pyral Project

by Jack Simpson May 6, 2013

written by Jack Simpson

Pyral (Python + Viral) was the name of a project I worked on in Dr Joanne Macdonald’s lab between September 2012 – January 2013 (although I am still providing tech support for the code and helping manage the server to this date). Throughout this time I wrote a lot of Perl and Python code to run on the university’s Linux server. The aim of these programs were as follows:

Download all the viral ref-seq genomes from GenBank;
BLAST a sequence of interest and retrieve all similar files;
Concatenate all sequences into one file that was run through CD-HIT;
Analyse the CD-HIT output, returning a file with the cluster numbers that sequences of interest may be found in;
Find variable length conserved regions of DNA within a designated cluster;
Ensure conserved region of DNA is completely dissimilar to that found in other virus clusters.
Continue Reading

May 6, 2013 0 comments

Content Programming Tips & Tutorials

Python sys Module

by Jack Simpson February 8, 2013

written by Jack Simpson

The main use I’ve found for the Python sys module is allowing command-line arguments to be made to a script. Here is an example of how it looks:


import sys

if len(sys.argv) == 2:
    input_file = sys.argv[1]
else:
    print "Please input a command-line argument specifying the file"

This script checks that 2 command-line arguments had been passed to the program before assigning the value sys.argv[1] to a variable. We check for two command-line arguments because the first one (sys.argv[0]) is the name of the Python script currently being executed.

February 8, 2013 0 comments

Newer Posts

Older Posts