Pyral Project

Pyral (Python + Viral) was the name of a project I worked on in Dr Joanne Macdonald’s lab between September 2012 and January 2013 (although I still provide tech support for the code and help manage the server to this date). During this time I wrote a lot of Perl and Python code to run on the university’s Linux server. The aims of these programs were as follows:

  • Download all the viral ref-seq genomes from GenBank;
  • BLAST a sequence of interest and retrieve all similar files;
  • Concatenate all sequences into one file and run it through CD-HIT;
  • Analyse the CD-HIT output, returning a file with the numbers of the clusters that sequences of interest may be found in;
  • Find variable-length conserved regions of DNA within a designated cluster;
  • Ensure each conserved region of DNA is completely dissimilar to anything found in other virus clusters.
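
As a rough illustration of the cluster-analysis step in the list above, here is a minimal sketch that scans CD-HIT's `.clstr` output and reports which clusters contain sequences of interest. It is not the private Pyral code: the function name `clusters_containing`, the sequence IDs, and the toy data are all made up for the example, and it assumes the usual `.clstr` layout, where each cluster starts with a `>Cluster N` line and each member line carries the sequence name between `>` and `...`.

```python
import re

# Matches the sequence name in a member line such as "0\t120nt, >seqA... *"
MEMBER_RE = re.compile(r">(.+?)\.\.\.")


def clusters_containing(clstr_lines, ids_of_interest):
    """Map cluster number -> list of wanted sequence IDs found in that cluster."""
    wanted = set(ids_of_interest)
    hits = {}
    current = None
    for line in clstr_lines:
        line = line.strip()
        if line.startswith(">Cluster"):
            # Header line, e.g. ">Cluster 12"
            current = int(line.split()[1])
        elif line and current is not None:
            match = MEMBER_RE.search(line)
            if match and match.group(1) in wanted:
                hits.setdefault(current, []).append(match.group(1))
    return hits


# Invented toy data in the .clstr layout (hypothetical IDs and lengths)
sample = [
    ">Cluster 0",
    "0\t120nt, >seqA... *",
    "1\t118nt, >seqB... at +/91.20%",
    ">Cluster 1",
    "0\t95nt, >seqC... *",
]
print(clusters_containing(sample, ["seqB", "seqC"]))
```

In practice the lines would come from an open file handle on the `.clstr` file rather than a hard-coded list.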

The result of the program should be several sequences of DNA conserved within a closely related group of viruses (or possibly just one species of virus) which can be used when designing diagnostic kits to detect the presence of these viruses.
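
To make the idea of a conserved region concrete, the sketch below finds runs of positions where every sequence in a group carries the same base. It rests on a big assumption that the post does not state: that the sequences in a cluster have already been aligned to the same length (with `-` for gaps). The real Pyral approach may well differ, and the name `conserved_regions` and the `min_length` parameter are hypothetical.

```python
def conserved_regions(aligned_seqs, min_length=20):
    """Return (start, end, sequence) for runs of identical columns; end is exclusive."""
    if not aligned_seqs:
        return []
    length = len(aligned_seqs[0])
    if any(len(s) != length for s in aligned_seqs):
        raise ValueError("sequences must be aligned to the same length")
    regions = []
    run_start = None
    # Iterate one position past the end so a run touching the end is closed off
    for i in range(length + 1):
        # A column is conserved when every sequence has the same non-gap base there
        conserved = (
            i < length
            and aligned_seqs[0][i] != "-"
            and all(s[i] == aligned_seqs[0][i] for s in aligned_seqs)
        )
        if conserved and run_start is None:
            run_start = i
        elif not conserved and run_start is not None:
            if i - run_start >= min_length:
                regions.append((run_start, i, aligned_seqs[0][run_start:i]))
            run_start = None
    return regions


# Toy alignment, invented for illustration
toy = ["ACGTACGTTT", "ACGTACCTTT", "ACGTACGTTA"]
print(conserved_regions(toy, min_length=3))  # [(0, 6, 'ACGTAC')]
```

Each candidate found this way would still need the final check from the list above: confirming that it does not resemble anything in other virus clusters.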

Even though I did not know much about viruses, I joined this project because I had been keen to get some experience in the field of bioinformatics so I could try to combine my interests in science and programming. I found that I really enjoyed this type of work and would love to do more of it in the future.

While the code I wrote lives privately on the server (as it is currently in use), I will share a sample of it as a short tutorial for beginners, showing how to retrieve the viral RefSeq genomes from GenBank. This program uses Python’s ftplib and tarfile modules:

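#!/usr/bin/env python
from ftplib import FTP
import os
import tarfile

# Connect anonymously to the NCBI FTP server and move to the viral genomes directory
ftp = FTP("ftp.ncbi.nih.gov")
ftp.login('anonymous')
ftp.cwd("genomes/Viruses")

# Download the archive containing all the viral RefSeq genomes
filematch = "all.fna.tar.gz"
for filename in ftp.nlst(filematch):
    if filename == filematch:
        fhandle = open(filename, 'wb')
        ftp.retrbinary('RETR ' + filename, fhandle.write)
        fhandle.close()
ftp.quit()

# Unpack the genomes into a "refseqs" directory, then delete the archive
tar = tarfile.open("all.fna.tar.gz")
tar.extractall("refseqs")
tar.close()
os.remove("all.fna.tar.gz")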
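
Once the archive is unpacked, the next step in the pipeline is to concatenate all the sequences into one file for CD-HIT. The following is a minimal sketch of how that could be done, not the code running on the server. It assumes the extracted genomes are FASTA files ending in `.fna` somewhere under the `refseqs` directory; the function name `concatenate_fasta` and the output file name are hypothetical.

```python
import os


def concatenate_fasta(src_dir, out_path, suffix=".fna"):
    """Append every file under src_dir ending in `suffix` to out_path; return the count.

    out_path should live outside src_dir so the output is not read back in.
    """
    count = 0
    with open(out_path, "wb") as out:
        for root, dirs, files in os.walk(src_dir):
            dirs.sort()  # visit sub-directories in a stable order
            for name in sorted(files):
                if not name.endswith(suffix):
                    continue
                with open(os.path.join(root, name), "rb") as src:
                    data = src.read()
                out.write(data)
                if data and not data.endswith(b"\n"):
                    out.write(b"\n")  # keep the next FASTA header on its own line
                count += 1
    return count


if __name__ == "__main__" and os.path.isdir("refseqs"):
    total = concatenate_fasta("refseqs", "all_viruses.fna")
    print("Concatenated %d files" % total)
```

Reading each genome fully into memory is fine for viral genomes, which are small; for much larger files, copying in chunks with `shutil.copyfileobj` would be the safer choice.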
Computational biology PhD candidate at the Australian National University. I love writing (both articles and software), learning more about the world around us, and beekeeping. I also write for BioSky.co

Written by Jack Simpson