Data scientist at Port Jackson Partners in Sydney, Australia. My PhD was in computational biology. In my spare time I write about medical research at BioSky.co.
This tutorial is a brief overview of what you can achieve using the BioPython module for Python. Although I’m hoping to write up some more articles on this site for beginners when time permits, this post assumes that you have experience programming in Python and some understanding of basic biological concepts such as DNA and restriction enzymes. If you’re still interested once you finish reading, feel free to consult the BioPython documentation, which will give you an idea of how massive (and awesome) this module really is.
So to start, I’ll show you how to install the BioPython module. On Debian-based Linux systems it can be as simple as typing ‘sudo apt-get install python-biopython’ or going to the Software Center. Alternatively, you can install a module manually by going to PyPI, downloading and extracting the archive, opening the command line or terminal, navigating into the root directory of the folder you just extracted and running the command ‘python setup.py install’.
You will need to download two modules to install BioPython, each of which is hosted on its own site. The first is NumPy, which BioPython depends on, and the second is BioPython itself. Once you have installed both, you’re ready to get into using BioPython.
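A quick way to confirm that the install worked is to ask Python whether it can find the packages. The sketch below uses only the standard library; `is_installed` is a helper name I made up for illustration, and it assumes the import names are `numpy` for NumPy and `Bio` for BioPython.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be located without importing it."""
    return importlib.util.find_spec(module_name) is not None


# Assumed import names: "numpy" for NumPy and "Bio" for BioPython.
for name in ("numpy", "Bio"):
    status = "found" if is_installed(name) else "missing"
    print(name, status)
```

If either package is reported as missing, the most likely cause is that it was installed for a different Python interpreter than the one you are running.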
One of the biggest strengths of BioPython is just how easy it makes manipulating files containing nucleotide and protein sequences. If you want to open a FASTA file, all you need is one line of code after you have imported the module:
from Bio import SeqIO
oneseq = SeqIO.read('dnafile.fasta', 'fasta')
As you can see from the example code above, you just have to designate the file name and file type (in this case FASTA) and BioPython will do the rest. Once you have imported the sequence into your program using the code above, you can easily access the different parts of the record:
print(oneseq)       # Outputs general information about the sequence
print(oneseq.id)    # Identifying information about the sequence
print(oneseq.seq)   # This will output the entire sequence
print(len(oneseq))  # Prints the length of the sequence
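To make it clearer what the ID, sequence and length correspond to in the file itself, here is a rough standard-library sketch of reading a single-record FASTA file. This is not how BioPython implements its parser; `read_single_fasta` and the example record are made up for illustration. Like SeqIO.read, the sketch refuses input that does not contain exactly one record.

```python
import io


def read_single_fasta(handle):
    """Parse a FASTA handle that must contain exactly one record.

    Returns (record_id, sequence). Hypothetical helper for illustration;
    this is not BioPython's parser.
    """
    record_id = None
    chunks = []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if record_id is not None:
                raise ValueError("More than one record found")
            # The ID is the first whitespace-delimited word after '>'.
            words = line[1:].split()
            record_id = words[0] if words else ""
        else:
            if record_id is None:
                raise ValueError("Sequence data found before a header line")
            chunks.append(line)
    if record_id is None:
        raise ValueError("No records found")
    return record_id, "".join(chunks)


# A made-up record held in memory instead of a file on disk.
example = io.StringIO(">seq1 made-up example record\nGATTACA\nGGGCCC\n")
record_id, sequence = read_single_fasta(example)
print(record_id)      # seq1
print(sequence)       # GATTACAGGGCCC
print(len(sequence))  # 13
```

The header line supplies the identifier, and every following line is joined into one sequence string, which is why the length covers both sequence lines.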
Although reading FASTA files on your computer is cool, BioPython lets you make this process even faster by allowing you to download records from GenBank and read them as though they were already sitting on your hard drive.
from Bio import Entrez, SeqIO
Entrez.email = "email@example.com"  # NCBI asks for a contact address
handle = Entrez.efetch(db="nucleotide", rettype="fasta", retmode="text", id="294489415")
seqrecord = SeqIO.read(handle, "fasta")
handle.close()
Here you specify which database you’re interested in searching (in this case nucleotide), the file type and the ID of the record you’re after. You then use the SeqIO module I showed you above to read the record you have retrieved from the website.
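Under the hood, a fetch like this is a web request to NCBI’s E-utilities service. The sketch below only builds the URL for such a request and does not touch the network. `build_efetch_url` is a hypothetical helper of mine, and the exact request BioPython sends may differ in detail (it may add further parameters, for example).

```python
from urllib.parse import parse_qs, urlencode, urlparse

# Base address of the NCBI E-utilities efetch endpoint.
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_efetch_url(db, record_id, rettype, retmode, email):
    """Build an efetch URL from the same arguments used in the call above."""
    params = {
        "db": db,
        "id": record_id,
        "rettype": rettype,
        "retmode": retmode,
        "email": email,
    }
    return EFETCH_URL + "?" + urlencode(params)


url = build_efetch_url("nucleotide", "294489415", "fasta", "text", "email@example.com")
print(url)

# Decode the query string again to check that each argument survived.
print(parse_qs(urlparse(url).query))
```

Seeing the request as a plain URL makes it easier to understand what each keyword argument controls: the database, the record, and the format of the response.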
BioPython has over 600 different restriction enzymes built into it which you can use to digest any nucleotide sequence you like. Using the ‘seqrecord’ variable from the code snippet above I’ll show you how you can use Bio.Restriction.
from Bio.Restriction import Restriction
print(Restriction.Sau3AI.site)  # Recognition site of Sau3AI
digest = Restriction.ApaI.catalyse(seqrecord.seq)  # Digest the sequence with ApaI
print("Number of fragments is", len(digest))
print(digest)
This program prints the recognition site of the Sau3AI restriction enzyme, then digests the sequence with a second enzyme, ApaI, and reports the number of fragments produced along with the sequence of each fragment. BioPython also provides RestrictionBatch, which lets you test many restriction enzymes against a DNA sequence at once.
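If you’re curious what a digest actually involves, here is a simplified pure-Python sketch for a linear sequence. It is not BioPython’s implementation: `find_sites`, `digest_linear` and the test sequence are made up for illustration, and it ignores circular DNA, ambiguous bases and the bottom strand. It assumes the usual top-strand cut positions, GGGCC^C for ApaI and ^GATC for Sau3AI, and loops over both enzymes to mimic the idea of testing several enzymes at once.

```python
def find_sites(sequence, site):
    """Return the 0-based start of every occurrence of site (overlaps allowed)."""
    positions = []
    start = sequence.find(site)
    while start != -1:
        positions.append(start)
        start = sequence.find(site, start + 1)
    return positions


def digest_linear(sequence, site, cut_offset):
    """Cut a linear sequence cut_offset bases into each occurrence of site."""
    fragments = []
    previous = 0
    for position in find_sites(sequence, site):
        cut = position + cut_offset
        fragments.append(sequence[previous:cut])
        previous = cut
    fragments.append(sequence[previous:])  # whatever follows the last cut
    return fragments


# Recognition site and assumed top-strand cut offset for each enzyme.
ENZYMES = {"ApaI": ("GGGCCC", 5), "Sau3AI": ("GATC", 0)}

dna = "AAGGGCCCTTGATCAAGGGCCCTT"  # made-up sequence with two ApaI sites
for name, (site, offset) in ENZYMES.items():
    fragments = digest_linear(dna, site, offset)
    print(name, len(fragments), fragments)
# ApaI 3 ['AAGGGCC', 'CTTGATCAAGGGCC', 'CTT']
# Sau3AI 2 ['AAGGGCCCTT', 'GATCAAGGGCCCTT']
```

Joining the fragments back together always gives the original sequence, which is a handy sanity check on any digest.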
This is just a taste of what you can easily do using the BioPython module. In the future I plan to write another article about automating BLAST searches and performing alignments with this module.