Computational biology PhD researcher. Interested in science, software development, and machine learning. I write about medical research at BioSky.co and contribute content to a variety of additional publications.
Imagine you have a lot of nucleotide sequence identifiers and want to find out which organism each piece of DNA comes from. You could go to the NCBI website and spend a long time looking each one up, or you could write a short Python script using Biopython to fetch the header of the FASTA record each identifier refers to:
import re
from Bio import Entrez
from Bio import SeqIO

# I had a bit of mess cluttering my identifiers, so I extracted them with regular expressions
all_id = "'26245730': 817, '389595538': 735, '541129065': 529, '541129071': 340, '558870185': 305, '444325280': 287, '573974252': 272, '281314044': 222"
unique_id = re.findall("'(.*?)'", all_id)

# NCBI asks for a contact email with every Entrez request
email = "firstname.lastname@example.org"
Entrez.email = email

for each_id in unique_id:
    # Download the FASTA record for this identifier and print its header line
    fetch_seq = Entrez.efetch(db="nucleotide", rettype="fasta", retmode="text", id=each_id)
    seq_record = SeqIO.read(fetch_seq, "fasta")
    fetch_seq.close()
    print(seq_record.description)
This script takes your list of sequence identifiers, downloads the FASTA record for each one, and prints its description line, which is where NCBI puts the organism name.
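If you have many identifiers, one request per identifier gets slow. Here is a rough sketch of a variation on the script above, not a drop-in replacement. It makes two assumptions that the original does not state: that the numbers after each colon in `all_id` are counts worth keeping, and that you would rather send all the identifiers in a single `efetch` call and read the result with `SeqIO.parse`. The helper names `parse_id_counts` and `fetch_descriptions` are made up for this example.

```python
import re

# Same string as in the script above. The numbers after each colon are
# assumed here to be per-identifier counts (the post does not say what they are).
all_id = "'26245730': 817, '389595538': 735, '541129065': 529, '541129071': 340, '558870185': 305, '444325280': 287, '573974252': 272, '281314044': 222"


def parse_id_counts(text):
    """Return (identifier, count) pairs from "'id': count" fragments."""
    pairs = re.findall(r"'(\d+)':\s*(\d+)", text)
    return [(seq_id, int(count)) for seq_id, count in pairs]


def fetch_descriptions(ids, email):
    """Fetch the FASTA descriptions for all ids in one Entrez request."""
    # Imported here so the parsing helper above works without Biopython installed.
    from Bio import Entrez, SeqIO

    Entrez.email = email
    # efetch accepts several identifiers as one comma-separated string
    handle = Entrez.efetch(db="nucleotide", rettype="fasta",
                           retmode="text", id=",".join(ids))
    try:
        # SeqIO.parse yields one record per FASTA entry in the response
        return [record.description for record in SeqIO.parse(handle, "fasta")]
    finally:
        handle.close()


id_counts = parse_id_counts(all_id)
unique_id = [seq_id for seq_id, _ in id_counts]
```

Calling `fetch_descriptions(unique_id, "you@example.org")` (with your own address) should then return the descriptions as a list. For very long identifier lists you would probably want to split them into smaller batches rather than send everything at once.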