Find sequence descriptions with BioPython

Imagine you have a lot of nucleotide sequence identifiers and want to find out which organism each sequence comes from. You could look every identifier up on the NCBI website, which takes a long time, or you could write a short Python script using BioPython that fetches the FASTA record for each identifier and prints its header:

import re
from Bio import Entrez
from Bio import SeqIO

# My identifiers were cluttered with other text, so I extracted them with a regular expression
all_id = "'26245730': 817, '389595538': 735, '541129065': 529, '541129071': 340, '558870185': 305, '444325280': 287, '573974252': 272, '281314044': 222"
unique_id = re.findall(r"'(.*?)'", all_id)

# NCBI asks for a contact email address with Entrez requests
email = "my@email.com"
Entrez.email = email

for each_id in unique_id:
    # Download the record in FASTA format, parse it, and print its header description
    fetch_seq = Entrez.efetch(db="nucleotide", rettype="fasta", retmode="text", id=each_id)
    seq_record = SeqIO.read(fetch_seq, "fasta")
    fetch_seq.close()
    print(seq_record.description)

This script takes your list of sequence identifiers, downloads the FASTA record for each one, and prints the description from its header line.
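If you have many identifiers, one request per identifier can get slow. Here is a minimal sketch of the identifier handling on its own, using only the standard library: it pulls the quoted identifiers out of the messy string, drops any repeats, and groups them into comma-separated batches. The helper names `extract_ids` and `batch_ids` and the `batch_size` parameter are my own for illustration; they are not part of BioPython.

```python
import re


def extract_ids(text):
    """Return the quoted identifiers in text, in order, without duplicates."""
    seen = set()
    ids = []
    for found in re.findall(r"'(.*?)'", text):
        if found not in seen:
            seen.add(found)
            ids.append(found)
    return ids


def batch_ids(ids, batch_size):
    """Join identifiers into comma-separated strings of at most batch_size ids."""
    return [",".join(ids[i:i + batch_size]) for i in range(0, len(ids), batch_size)]


# A shortened version of the identifier string from the script
all_id = "'26245730': 817, '389595538': 735, '541129065': 529"
ids = extract_ids(all_id)
print(ids)                # ['26245730', '389595538', '541129065']
print(batch_ids(ids, 2))  # ['26245730,389595538', '541129065']
```

Each batch string can be passed as the `id` argument to `Entrez.efetch`, which accepts a comma-separated list of identifiers. The response then contains several FASTA records, so you would loop over `SeqIO.parse(fetch_seq, "fasta")` rather than calling `SeqIO.read`, which expects exactly one record.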

About the author (Jack Simpson): Computational biology PhD candidate at the Australian National University. I love writing (both articles and software), learning more about the world around us, and beekeeping. I also write for BioSky.co
