Computational biology PhD researcher. Interested in science, software development, and machine learning. I write about medical research at BioSky.co and contribute content to a variety of additional publications.
Pyral (Python + Viral) was the name of a project I worked on in Dr Joanne Macdonald’s lab between September 2012 and January 2013 (although I still provide tech support for the code and help manage the server to this day). Throughout this time I wrote a lot of Perl and Python code to run on the university’s Linux server. The aims of these programs were as follows:
- Download all the viral ref-seq genomes from GenBank;
- BLAST a sequence of interest and retrieve all similar files;
- Concatenate all sequences into one file to be run through CD-HIT;
- Analyse the CD-HIT output, returning a file with the cluster numbers that sequences of interest may be found in;
- Find variable length conserved regions of DNA within a designated cluster;
- Ensure the conserved region of DNA is completely dissimilar to that found in other virus clusters.
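To make the CD-HIT analysis step in the list above more concrete, here is a minimal sketch of reading a CD-HIT `.clstr` file and reporting which cluster numbers contain the sequences of interest. It is not the code running on the server. The function names `parse_clstr` and `clusters_containing` are hypothetical, and the sketch assumes the sequence IDs appear in the `.clstr` file exactly as you pass them in (CD-HIT can truncate long names, depending on its description-length setting).

```python
import re
from collections import defaultdict

# Matches the sequence ID in a member line such as
# "0\t10735nt, >NC_001477... *"
MEMBER_RE = re.compile(r">(\S+?)\.\.\.")


def parse_clstr(lines):
    """Map each cluster number to the list of sequence IDs it contains."""
    clusters = defaultdict(list)
    current = None
    for line in lines:
        line = line.strip()
        if line.startswith(">Cluster"):
            # Header line, e.g. ">Cluster 12"
            current = int(line.split()[1])
        elif line and current is not None:
            match = MEMBER_RE.search(line)
            if match:
                clusters[current].append(match.group(1))
    return dict(clusters)


def clusters_containing(clusters, ids_of_interest):
    """Return the sorted cluster numbers holding any of the given IDs."""
    wanted = set(ids_of_interest)
    return sorted(n for n, members in clusters.items() if wanted & set(members))
```

With a real output file you would call something like `parse_clstr(open("viruses.clstr"))`, where the file name is only a placeholder, and then pass the IDs returned by your BLAST search to `clusters_containing`.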
The result of the program should be several sequences of DNA conserved within a closely related group of viruses (or possibly just one species of virus) which can be used when designing diagnostic kits to detect the presence of these viruses.
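One simple way to picture the conserved-region search is sketched below. This is an illustration under strong assumptions rather than the method used in Pyral: it assumes the sequences in a cluster have already been aligned to the same length (with `-` for gaps), and it treats a region as conserved only when every sequence matches exactly. The function name `conserved_regions` and the `min_len` parameter are hypothetical.

```python
def conserved_regions(aligned, min_len=20):
    """Return (start, sequence) for every run of identical, gap-free
    columns at least min_len long.

    `aligned` is a list of equal-length strings from an alignment.
    """
    if not aligned:
        return []
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("sequences must be aligned to the same length")

    regions = []
    run_start = None
    # Iterate one position past the end so a run touching the end is closed.
    for i in range(length + 1):
        # A column is conserved when every sequence has the same non-gap base.
        conserved = (
            i < length
            and aligned[0][i] != "-"
            and all(seq[i] == aligned[0][i] for seq in aligned)
        )
        if conserved:
            if run_start is None:
                run_start = i
        elif run_start is not None:
            if i - run_start >= min_len:
                regions.append((run_start, aligned[0][run_start:i]))
            run_start = None
    return regions
```

Because each run is extended for as long as the columns keep matching, the regions returned naturally vary in length. Checking that a region is absent from other virus clusters would be a separate step, for example by searching for it in the sequences of every other cluster.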
Even though I did not know much about viruses, I joined this project because I had been keen to get some experience in the field of bioinformatics so I could try to combine my interests in science and programming. I found that I really enjoyed this type of work and would love to do more of it in the future.
While the code I wrote exists privately on the server (as it is currently being used), I will provide a sample of it as a short tutorial for beginners, showing how to retrieve the viral ref-seqs from GenBank. This program uses the Python ftplib and tarfile modules:
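    #!/usr/bin/env python
    from ftplib import FTP
    import os
    import tarfile

    # Log in anonymously to the NCBI FTP server and move to the viral genomes directory
    ftp = FTP("ftp.ncbi.nih.gov")
    ftp.login('anonymous')
    ftp.cwd("genomes/Viruses")

    filematch = "all.fna.tar.gz"

    # Download the archive if the server lists it
    for filename in ftp.nlst(filematch):
        if filename == filematch:
            fhandle = open(filename, 'wb')
            ftp.retrbinary('RETR ' + filename, fhandle.write)
            fhandle.close()

    ftp.quit()

    # Unpack the genomes into the refseqs directory, then delete the archive
    tar = tarfile.open("all.fna.tar.gz")
    tar.extractall("refseqs")
    tar.close()
    os.remove("all.fna.tar.gz")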
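Once the archive is unpacked, the next step in the pipeline is to combine the sequences into a single file for CD-HIT. A minimal sketch of that step follows. It assumes the extracted FASTA files end in `.fna` and sit somewhere beneath the `refseqs` directory. The function name `concatenate_fasta` and the output file name are hypothetical, and the output file should live outside `refseqs` so it is not read back in as input.

```python
import os


def concatenate_fasta(root_dir, out_path, extension=".fna"):
    """Append every file under root_dir ending in `extension` to out_path.

    Returns the number of files written.
    """
    count = 0
    with open(out_path, "w") as out:
        for dirpath, dirnames, filenames in os.walk(root_dir):
            dirnames.sort()  # visit subdirectories in a stable order
            for name in sorted(filenames):
                if not name.endswith(extension):
                    continue
                with open(os.path.join(dirpath, name)) as handle:
                    data = handle.read()
                out.write(data)
                # Make sure the next record's header starts on its own line.
                if data and not data.endswith("\n"):
                    out.write("\n")
                count += 1
    return count
```

For example, `concatenate_fasta("refseqs", "all_viruses.fasta")` would write one combined FASTA file in the current directory.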