Find sequence descriptions with BioPython

by Jack Simpson

Imagine you have a lot of nucleotide sequence identifiers and want to find out which organism each DNA sequence comes from. You could look each one up on the NCBI website, which takes a long time, or you could write a short Python script that uses BioPython to fetch the FASTA header of the record each identifier refers to:

import re
from Bio import Entrez
from Bio import SeqIO

# I had a bit of mess cluttering my identifiers, so I extracted them with regular expressions
all_id = "'26245730': 817, '389595538': 735, '541129065': 529, '541129071': 340, '558870185': 305, '444325280': 287, '573974252': 272, '281314044': 222"
unique_id = re.findall(r"'(.*?)'", all_id)

# NCBI asks for a contact email address with Entrez requests
email = "my@email.com"
Entrez.email = email
for each_id in unique_id:
    # Download one record in FASTA format and parse it
    fetch_seq = Entrez.efetch(db="nucleotide", rettype="fasta", retmode="text", id=each_id)
    seq_record = SeqIO.read(fetch_seq, "fasta")
    fetch_seq.close()
    # The description is the FASTA header line, which usually includes the organism name
    print(seq_record.description)
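The regular expression step can be tried on its own, without any network access. This is a minimal sketch, assuming the cluttered input looks like the quoted-identifier text used in the script. The second pattern and the `id_counts` name are illustrative additions, not part of the original script:

```python
import re

# A shortened copy of the cluttered identifier text
all_id = "'26245730': 817, '389595538': 735, '541129065': 529"

# Non-greedy match of whatever sits between each pair of single quotes
unique_id = re.findall(r"'(.*?)'", all_id)
print(unique_id)  # ['26245730', '389595538', '541129065']

# Hypothetical variation: also keep the number that follows each identifier
id_counts = {gi: int(n) for gi, n in re.findall(r"'(\d+)':\s*(\d+)", all_id)}
print(id_counts["26245730"])  # 817
```

The non-greedy `.*?` matters here: a greedy `.*` would run from the first quote to the last one and return a single long string.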

The full script takes your list of sequence identifiers, fetches each record from the NCBI nucleotide database, and prints the description from its FASTA header.
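With a long list of identifiers, one request per identifier can be slow. A possible variation, sketched here under some assumptions, is to send several identifiers per request as a comma-separated string and read the response with `SeqIO.parse`, which handles more than one FASTA record. The helper names `batches` and `fetch_descriptions` and the batch size of 200 are illustrative choices, not something the original script uses:

```python
def batches(ids, size=200):
    """Yield successive slices of at most `size` identifiers."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


def fetch_descriptions(ids, email, size=200):
    """Return {record id: description} for the given nucleotide identifiers."""
    # Imported here so that batches() can be used without Biopython installed
    from Bio import Entrez, SeqIO

    Entrez.email = email
    descriptions = {}
    for batch in batches(ids, size):
        # Several identifiers can be requested at once, separated by commas
        handle = Entrez.efetch(db="nucleotide", rettype="fasta",
                               retmode="text", id=",".join(batch))
        try:
            # SeqIO.parse iterates over every FASTA record in the response
            for record in SeqIO.parse(handle, "fasta"):
                descriptions[record.id] = record.description
        finally:
            handle.close()
    return descriptions
```

Calling `fetch_descriptions(unique_id, "my@email.com")` would return a dictionary of descriptions. Note that the keys come from the FASTA headers, so they may not be the same strings as the identifiers you asked for.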
