Data scientist at Port Jackson Partners in Sydney, Australia. My PhD was in computational biology. In my spare time I write about medical research at BioSky.co.
Imagine you have a lot of nucleotide sequence identifiers and want to find out what organism the DNA is from. You could go to the NCBI website and spend a long time looking each one up, or you could write a short Python script using Biopython to fetch the header of the FASTA record each identifier refers to:
    import re

    from Bio import Entrez
    from Bio import SeqIO

    # I had a bit of mess cluttering my identifiers, so I extracted them with regular expressions
    all_id = "'26245730': 817, '389595538': 735, '541129065': 529, '541129071': 340, '558870185': 305, '444325280': 287, '573974252': 272, '281314044': 222"
    unique_id = re.findall("'(.*?)'", all_id)

    # NCBI asks for a contact email address with Entrez requests
    email = "firstname.lastname@example.org"
    Entrez.email = email

    for each_id in unique_id:
        # Download the FASTA record for this identifier and read it
        fetch_seq = Entrez.efetch(db="nucleotide", rettype="fasta", retmode="text", id=each_id)
        seq_record = SeqIO.read(fetch_seq, "fasta")
        fetch_seq.close()
        print(seq_record.description)
This script takes a list of your sequence identifiers and prints the description line of each FASTA record, which for NCBI nucleotide records usually names the organism.
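If you have many identifiers, one request per identifier gets slow. Here is a rough sketch of a batched variant that sends all identifiers in a single `efetch` call and reads the results with `SeqIO.parse`. The function names `extract_ids` and `fetch_descriptions` are my own placeholders, not part of Biopython, and I am assuming the identifiers are purely numeric, as in the example above. I have not checked that the records always come back in the order requested, so don't rely on that.

```python
import re


def extract_ids(cluttered):
    """Return the quoted numeric identifiers found in a cluttered string, in order.

    Hypothetical helper: assumes identifiers are digits wrapped in single quotes.
    """
    return re.findall(r"'(\d+)'", cluttered)


def fetch_descriptions(ids, email):
    """Fetch all records in one efetch request and return their FASTA descriptions.

    Hypothetical helper: needs Biopython and network access. The import is
    local so that extract_ids can be used without either.
    """
    from Bio import Entrez, SeqIO

    Entrez.email = email  # contact address sent along with the request
    handle = Entrez.efetch(
        db="nucleotide",
        rettype="fasta",
        retmode="text",
        id=",".join(ids),  # comma-separated list of identifiers
    )
    try:
        # The response holds several FASTA records, so use parse rather than read
        return [record.description for record in SeqIO.parse(handle, "fasta")]
    finally:
        handle.close()
```

To use it, call `fetch_descriptions(extract_ids(all_id), "firstname.lastname@example.org")` and print each item of the returned list. For very long identifier lists it is probably safer to split them into smaller chunks and call the function once per chunk.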