Explicit is better than implicit

Line two of the zen of Python reads “explicit is better than implicit” and until relatively recently I never truly appreciated the wisdom of those words. My change of heart stems from a series of Python scripts, where a large portion of my code dealt with automating and retrieving the results of a BLAST search using the fantastic BioPython toolkit. I was filtering my results based on an expect value of 0.04, which during my initial testing worked perfectly. However, as I wanted to make this value variable, I rewrote it into the program as a command-line argument. What I had not considered (but definitely should have) was how Python implicitly processes a command-line argument – as a string! I was never thrown an error – the program continued to work, so I assumed it was still doing the job just like in my tests. However, behind the scenes the filtering of my results had completely ceased to function.

The second (and in my opinion less obvious) issue of I have had with implicit design decisions relates to the qblast method from BioPython. As far as I could see, I was retrieving plenty of sequences, therefore the program must be working. However, my PI was rather suspicious of how so few sequences were being retrieved compared to the hundreds that were coming up when she would BLAST our sequence with the online interface. I searched numerous sites and went through the BioPython documentation but could find no mention of a sequence retrieval cut-off. Finally in desparation I went through the source code itself from the module that I was using and found this:

qblast code

A default limit of 50 unless overruled! A few extra characters and the problem was fixed, but until this discovery I was having serious problems later in the pipeline that I could not understand.

[Edit] After a Twitter conversation with Peter from the BioPython Project, I’d like to add that this issue is due to the default settings of the online BLAST tool I was calling, as well as potentially the settings of the BioPython wrapper. A good lesson in understanding the defaults of the tools (BLAST) your tools (BioPython) are calling!

These are two of the most recent examples that I have seen of how aware developers need to be about the implicit default values and methods that are present in the language and library they are calling. I hope my mistakes and learning experience will be useful to others who may come across similar issues.

This was originally posted by myself on the Australian Bioinformatics Network.

Read More
RStudio panes
R Tutorial

This tutorial is a beginners guide for getting started with R, once you complete it you should have R installed on your computer and be able to import data, perform basic statistical tests and create graphics.


Getting Started

The first things you will have to do is download R and install it on your computer. To do this you’ll need to visit a CRAN (Comprehensive R Archive Network) repository. There are a number of sites you can find easily by searching, however here in Australia it is hosted by the CSIRO here. When you visit the site you’ll be asked to click on the link to the R version for your computer (Linux, Mac, Windows). Once you do so, you can then proceed to download the software (although for Windows users make sure you select the base version of R to install).

Once R is installed, you’re ready to get going, although I would recommend installing one other piece of software before proceeding – RStudio which may be found here. RStudio is a fantastic development environment for writing R scripts and is especially useful for beginners.

Read More
BioPython Tutorial

This tutorial is a brief overview of what you can achieve using the Python BioPython module. Although I’m hoping to write up some more articles on this site for beginners when time permits, this post will assume that you have experience programming in Python and have a bit of an understanding of basic biological concepts such as DNA, restriction enzymes etc. If you’re still interested once you finish reading, feel free to consult the BioPython documentation, it will help give you a bit of an idea of how massive (and awesome) this module really is.

So to start I’ll show you how to install the BioPython module. While on Linux systems it can be as simple as typing ‘sudo apt-get install python-biopython’ or going to the Software Center, you can manually install a module by going to PyPI, downloading and extracting the file, opening the command-line or terminal and navigating into the root directory of the folder you just extracted and running the command ‘setup.py install’.

You will need to download two modules to install BioPython, each of which are hosted on their own site. The first is SciPy and the second is BioPython. Once you have installed these you’re ready to get into using BioPython.

Read More
Pyral Project

Pyral (Python + Viral) was the name of a project I worked on in Dr Joanne Macdonald’s lab between September 2012 – January 2013 (although I am still providing tech support for the code and helping manage the server to this date). Throughout this time I wrote a lot of Perl and Python code to run on the university’s Linux server. The aim of these programs were as follows:

  • Download all the viral ref-seq genomes from GenBank;
  • BLAST a sequence of interest and retrieve all similar files;
  • Concatenate all sequences into one file that was run through CD-HIT;
  • Analyse the CD-HIT output, returning a file with the cluster numbers that sequences of interest may be found in;
  • Find variable length conserved regions of DNA within a designated cluster;
  • Ensure conserved region of DNA is completely dissimilar to that found in other virus clusters.
Read More
Python sys Module

The main use I’ve found for the Python sys module is allowing command-line arguments to be made to a script. Here is an example of how it looks:

import sys

if len(sys.argv) == 2:
    input_file = sys.argv[1]
    print "Please input a command-line argument specifying the file"

This script checks that 2 command-line arguments had been passed to the program before assigning the value sys.argv[1] to a variable. We check for two command-line arguments because the first one (sys.argv[0]) is the name of the Python script currently being executed.

Read More
Python os Module

You can use the Python os module to send commands out to the operating system. Through os you can do anything from changing the current working directory through to listing the contents of a directory.

import os

# print the current working directory
print os.getcwd()
# change the working directory
# create a new directory
# list the contents of a directory
# remove directory
# remove a file
# rename a file/directory and where it will go
os.rename("/data/jack", "/james")

Read More
Useful Linux Commands

Linux is an operating system – similar to Windows or Mac. It is freely available and very popular amongst researchers and programmers. Since many of the scripts you write will be run from the terminal, a list of popular Linux commands can be quite handy:

  • cd Used to change the directory (folder) which you have open. If I wanted to navigate to one of my directories I may write the command “cd /data/research/project”.
  • ls List the contents of a directory.
  • pwd Show the path to the current working directory.
  • mkdir This command followed by a name creates a directory.
  • rm Can be used to specify an empty directory or file you want to remove, for instance if you had a directory called ‘data’ you could delete it using “rm data”. If I wanted to delete all the files and directories in the directory “data”, I could use this command: “rm -rf data/*”.
  • tar –zxvf file.tar.gz Used to unzip a file.
  • chmod 777 filename.py Used to change the permissions on a file – this allows uses to read, write and execute this file.
  • sudo su – Enable root user privileges. You’ll be asked to input the root password if you issue this command.
  • cp firstname secondname Copy a file somewhere.
  • mv firstname secondname  Move a file and rename it.
  • ssh Log into another server using an SSH client through the terminal by inputting “ssh username@serveraddress”.
Read More
Parsing CSV Files in Python

Almost anyone doing research at some point will have to input or extract data from an Excel spreadsheet. However, it is possible to export an Excel file as a .csv file (comma separated file) and then use Python to grab all the data instantly. The Python csv module makes all this incredibly simple:

import csv
column1 = []
column2 = []
column3 = []
openfile = csv.reader(open('data.csv'))
for row in openfile:
    first, second, third = row

This script will open your .csv file, and then use the ‘.next()’ method to create an array of all the lines in your file, which you can parse through with the for loop.

Read More
Logging into PuTTY with Perl

When you use the SSH client PuTTY a lot, it can become quite tedious selecting your server and then writing in your username and password. That’s why I decided to write a short script in Perl to speed things up, and it looks like this:

use strict;
use warnings;
open my $copy, "| xclip";
print $copy Password";
close $copy;
my $external = `putty yourserver.com.au -l YourUsername -pw YourPassord`;

I’m copying my sudo password to the Linux application xclip so its ready in my clipboard, before opening PuTTY and automatically signing in. I then converted the script into an executable file using the instructions found here.

Read More
Running an R script through Python

Before I made the switch to developing on a Linux machine, I noticed that the Python module for calling R (RPy2) seemed to be having some problems on Windows. This gave me an excuse to play around with writing my own Python script to create and run an R script. As you’ll see in the code below, I’ve used the subprocess module to execute the R script I created and then pipe the results back into the terminal.

import subprocess

# rscript is the content of the R file we'll create
rscript = '''#!/usr/bin/Rscript

cat('This is a simple program') # same as c()
NumOfIterations for (i in 1:NumOfIterations) { # 1:10 is range
cat(i, 'Hello world!')
cat(' ')
r_file = open("example.r", "w")

r_file_name = "example.r"
ppath = r'C:Program FilesRR-2.15.2bini386Rscript.exe'
proc = subprocess.Popen("%s %s" % (ppath, r_file_name), stdout=subprocess.PIPE)
output = proc.stdout.read()
print output
Read More