How to create random DNA sequences with Python

Python’s random module makes it extremely easy to generate random DNA bases.

import random
dna = ["A", "G", "C", "T"]
# output a random base
print(random.choice(dna))

Now to generate a specific number of random bases, all we have to do is use Python’s range function:
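
As a minimal sketch of that idea (the function name random_dna and the length of 20 are illustrative assumptions, not taken from the original post), range can drive one random choice per base:

```python
import random

DNA_BASES = ["A", "G", "C", "T"]


def random_dna(length):
    """Return a string of `length` bases, each drawn uniformly at random."""
    # range(length) simply repeats the random choice `length` times.
    return "".join(random.choice(DNA_BASES) for _ in range(length))


# Print a random 20-base sequence (hypothetical length chosen for the demo).
print(random_dna(20))
```

Joining the choices into a single string keeps the result easy to print or write to a FASTA file; returning a list instead would work just as well if the bases need to be modified later.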

Sublime Text OpenCV C++ Build System

If you want to compile and run a C++ program using OpenCV in Sublime Text, copy and paste the code below into a build system file. If you don’t need to use C++11 explicitly, you can delete the “-std=c++0x” flag.

"shell_cmd": "g++ -std=c++0x ${file} -o ${file_base_name} `pkg-config --cflags --libs opencv` && ./${file_base_name}"

Some interesting talks from PyCon 2014

Here are several talks from PyCon 2014 I thought looked rather interesting from a research perspective:



Where to find R tutorials

Here’s a list of a few R tutorials (in addition to the one I wrote) that I’ve found useful, or that look useful:

Google Developers R Tutorials

A slightly Different Introduction to R

Educational IPython Notebooks

One of the best things about the IPython notebook is the number of easy-to-follow tutorials it has inspired. I thought I’d share a few that I’ve found on machine learning and statistics.

I’ll continue to update the list as I come across new notebooks that prove handy.

Useful links for using ggplot2 in R

The R package ggplot2 is one of the best data visualisation tools I’ve come across, and while it simplifies generating impressive graphics, there’s still a bit you have to learn to use it. Here are a few of the posts I’ve found really handy when using this package:

I’ll try to continue updating this list as I find other good resources.

Tracking bees using local features

The project I am currently working on focuses on building an automated system for tracking the movement of honeybees in an observation hive, filmed using a grayscale camera and infrared light. My most recent attempt has involved extracting features (or keypoints) from the frames and then describing them using the SIFT (Scale Invariant Feature Transform) algorithm, as implemented in the OpenCV imaging library. There are a number of other (potentially faster) feature detection algorithms available (SURF, ORB, etc.); however, as SIFT has traditionally been regarded as the most robust, I thought I would start with it. SIFT works by first extracting what it deems to be robust keypoints/features. For each of these, it then computes a unique feature descriptor based on the 16×16 pixel area around the feature. By extracting and describing these features, you can try to compare images as shown below.
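
One way such a comparison might be sketched with OpenCV's Python bindings is shown here. This is an illustration rather than the project's actual code: it assumes OpenCV 4.4 or newer (where cv2.SIFT_create is available), the function names and file paths are hypothetical, and the ratio test used to filter matches is a commonly used technique that the post itself does not mention.

```python
def ratio_test(knn_matches, ratio=0.75):
    """Keep a best match only if it is clearly closer than the runner-up."""
    good = []
    for pair in knn_matches:
        if len(pair) < 2:
            continue  # fewer than two neighbours, so there is nothing to compare
        best, second = pair[0], pair[1]
        if best.distance < ratio * second.distance:
            good.append(best)
    return good


def match_frames(path_a, path_b, ratio=0.75):
    """Detect SIFT keypoints in two grayscale frames and match their descriptors."""
    # Imported here so that ratio_test can be used without OpenCV installed.
    import cv2

    img_a = cv2.imread(path_a, cv2.IMREAD_GRAYSCALE)
    img_b = cv2.imread(path_b, cv2.IMREAD_GRAYSCALE)
    if img_a is None or img_b is None:
        raise FileNotFoundError("could not read one of the input frames")

    sift = cv2.SIFT_create()  # assumption: OpenCV 4.4 or newer
    kp_a, des_a = sift.detectAndCompute(img_a, None)
    kp_b, des_b = sift.detectAndCompute(img_b, None)
    if des_a is None or des_b is None:
        return kp_a, kp_b, []  # no features found in at least one frame

    # SIFT descriptors are float vectors, so Euclidean (L2) distance is used.
    matcher = cv2.BFMatcher(cv2.NORM_L2)
    knn_matches = matcher.knnMatch(des_a, des_b, k=2)
    return kp_a, kp_b, ratio_test(knn_matches, ratio)
```

Calling match_frames("frame_001.png", "frame_002.png") (hypothetical file names) would return the keypoints of both frames and the matches that survive filtering. A lower ratio keeps fewer but more reliable matches.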

Essential Free Tools for Research

I commenced my PhD in bioinformatics at the Australian National University about a month ago and thought I’d share some of the tools I’ve found absolutely essential. All of these tools are free to use (some do have paid plans, however at this point I haven’t had the need to sign up for them).

  1. Evernote + Evernote Web Clipper: Once I started to really get into using this tool, I couldn’t understand how I’d lived without it before. Now I can clip snippets from web pages for later reference, write notes from a seminar I attended or upload a Python script and have this all in one easily searchable (and tag-able) location. There’s also a plethora of phone/tablet apps which work really well.
  2. Wunderlist: While I tried to use Evernote for keeping to-do lists, I found that this simple yet powerful program (with support for PC, Mac, phones and tablets) worked extremely well for keeping track of my tasks.
  3. Dropbox: I do all my work within the Dropbox folder on my computer. This way, when I move between computers everything is synced and backed up. While I have tried Google Drive before, the lack of support for Linux as well as issues with syncing have led me back to using Dropbox.
  4. Zotero: Zotero is a reference manager with a browser plug-in which is fantastic at importing the bibliographic details and PDF of a paper from the web. It can also sync between computers (although you only get 300 MB of free sync space). While I have used the alternative reference manager Mendeley before (which has 2 GB of free sync space and a brilliant program which enables you to annotate your PDFs), the Mendeley web import tool was so bad at importing the bibliographic details and PDFs of papers that I went over to Zotero.
  5. VirtualBox: If you find yourself having to use a lot of Linux-only tools (or you like the fact that you can install entire programs with multiple dependencies merely with the command ‘sudo apt-get’) but don’t really want to set up anything permanent, then VirtualBox could be the way to go. It’s a free program that allows you to run a whole operating system on your computer within a window, just like any other program.
  6. Coursera/EdX/Udacity: These websites offer free courses from great unis in everything from computer science to genomics.
  7. Stack Overflow/Biostars: Great websites for asking questions relating to programming or bioinformatics. If you can’t find anything after searching Google, then this is the place to post your question.
  8. Twitter: Although this can be a bit of a time-waster, I find that by following other researchers in my field and research organisations I can hear about new papers or tutorials.
  9. Google Plus: G+ may not be as popular as Facebook, but I find most of my programming friends tend to be quite active on it and if you join one of the programming communities it can be a great place to learn new things or ask questions.
  10. Google Scholar Alerts: Sign up to receive emails whenever a paper matching the keywords you select is published.
Explicit is better than implicit

Line two of the Zen of Python reads “explicit is better than implicit” and until relatively recently I never truly appreciated the wisdom of those words. My change of heart stems from a series of Python scripts, where a large portion of my code dealt with automating and retrieving the results of a BLAST search using the fantastic BioPython toolkit. I was filtering my results based on an expect value of 0.04, which during my initial testing worked perfectly. However, as I wanted to make this value variable, I rewrote it into the program as a command-line argument. What I had not considered (but definitely should have) was how Python implicitly processes a command-line argument – as a string! I was never thrown an error – the program continued to work, so I assumed it was still doing the job just like in my tests. However, behind the scenes the filtering of my results had completely ceased to function.
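
In Python 2, comparing a number with a string raised no error, which is consistent with the silent failure described above; Python 3 raises a TypeError instead. A minimal sketch of the explicit alternative follows, assuming argparse is used to read the option. The option name --evalue, the function names and the example hits are all made up for illustration and are not from the original scripts.

```python
import argparse


def parse_args(argv=None):
    """Parse command-line options, converting the e-value to a float explicitly."""
    parser = argparse.ArgumentParser(description="Filter hits by expect value.")
    parser.add_argument(
        "--evalue",
        type=float,  # without this, the value would arrive as a str
        default=0.04,
        help="maximum expect value to keep (default: 0.04)",
    )
    return parser.parse_args(argv)


def filter_hits(hits, threshold):
    """Return the (name, e-value) pairs whose e-value is below the threshold."""
    return [(name, evalue) for name, evalue in hits if evalue < threshold]


# Made-up hits for illustration; a real script would read them from BLAST output.
example_hits = [("hit_a", 1e-30), ("hit_b", 0.01), ("hit_c", 0.5)]

# Passing the argument list explicitly simulates `script.py --evalue 0.04`.
args = parse_args(["--evalue", "0.04"])
print(filter_hits(example_hits, args.evalue))
```

Declaring the type at the point where the argument is defined means a value such as "abc" is rejected immediately with a usage error, rather than flowing into the comparison later.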

The second (and in my opinion less obvious) issue I have had with implicit design decisions relates to the qblast function from BioPython. As far as I could see, I was retrieving plenty of sequences, therefore the program must be working. However, my PI was rather suspicious of how few sequences were being retrieved compared to the hundreds that were coming up when she would BLAST our sequence with the online interface. I searched numerous sites and went through the BioPython documentation but could find no mention of a sequence retrieval cut-off. Finally, in desperation, I went through the source code of the module that I was using and found this:

qblast code

A default limit of 50 unless overridden! A few extra characters and the problem was fixed, but until this discovery I was having serious problems later in the pipeline that I could not understand.
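
The fix might look something like the sketch below, which passes the limits explicitly instead of relying on defaults. It assumes a Biopython version that provides Bio.Blast.NCBIWWW.qblast with its expect and hitlist_size keyword arguments; the helper names, the "blastn" program, the "nt" database and the limit of 500 are illustrative assumptions rather than the values used in the original pipeline.

```python
def qblast_options(evalue=0.04, max_hits=500):
    """Collect the search limits explicitly rather than relying on defaults."""
    return {"expect": float(evalue), "hitlist_size": int(max_hits)}


def run_blast(sequence, evalue=0.04, max_hits=500):
    """Submit a nucleotide sequence to NCBI BLAST and return the result handle."""
    # Imported here because it needs Biopython installed; the call itself
    # also needs network access to reach the NCBI servers.
    from Bio.Blast import NCBIWWW

    return NCBIWWW.qblast(
        "blastn",  # assumed program
        "nt",      # assumed database
        sequence,
        **qblast_options(evalue, max_hits),
    )
```

Keeping the options in one small function makes the chosen limits visible in a single place, so a later reader does not have to dig through library source code to learn what cut-off was in effect.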

[Edit] After a Twitter conversation with Peter from the BioPython Project, I’d like to add that this issue is due to the default settings of the online BLAST tool I was calling, as well as potentially the settings of the BioPython wrapper. A good lesson in understanding the defaults of the tools (BLAST) your tools (BioPython) are calling!

These are two of the most recent examples that I have seen of how aware developers need to be about the implicit default values and methods that are present in the language and library they are calling. I hope my mistakes and learning experience will be useful to others who may come across similar issues.

I originally posted this on the Australian Bioinformatics Network.

R Tutorial

This tutorial is a beginner’s guide to getting started with R. Once you complete it, you should have R installed on your computer and be able to import data, perform basic statistical tests and create graphics.


Getting Started

The first thing you will have to do is download R and install it on your computer. To do this you’ll need to visit a CRAN (Comprehensive R Archive Network) repository. There are a number of mirror sites that you can find easily by searching; here in Australia, one is hosted by the CSIRO here. When you visit the site you’ll be asked to click on the link to the R version for your computer (Linux, Mac, Windows). Once you do so, you can proceed to download the software (Windows users should make sure they select the base version of R to install).

Once R is installed, you’re ready to get going, although I would recommend installing one other piece of software before proceeding – RStudio, which may be found here. RStudio is a fantastic development environment for writing R scripts and is especially useful for beginners.
