Data scientist at Port Jackson Partners in Sydney, Australia. My PhD was in computational biology. In my spare time I write about medical research at BioSky.co.
Line two of the Zen of Python reads “explicit is better than implicit”, and until relatively recently I never truly appreciated the wisdom of those words. My change of heart stems from a series of Python scripts in which a large portion of my code automated a BLAST search and retrieved its results using the fantastic BioPython toolkit. I was filtering my results on an expect value of 0.04, which worked perfectly during my initial testing. However, as I wanted to make this value variable, I rewrote it into the program as a command-line argument. What I had not considered (but definitely should have) was how Python implicitly processes a command-line argument: as a string! No error was ever thrown and the program continued to run, so I assumed it was still doing the job just like in my tests. Behind the scenes, however, the filtering of my results had completely ceased to function.
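For anyone wanting to avoid the same trap, here is a minimal sketch of making the conversion explicit. It is not my original script: the `--evalue` flag and the `parse_args` and `keep_hit` helpers are hypothetical names chosen for illustration. The idea is to let `argparse` convert the argument to a float up front, so a bad value fails loudly at start-up rather than silently breaking a comparison. (Python 2 allows an ordering comparison between a number and a string without raising; Python 3 raises a `TypeError` instead.)

```python
import argparse


def parse_args(argv=None):
    """Parse the command line, converting the threshold to a float explicitly."""
    parser = argparse.ArgumentParser(description="Filter BLAST hits by expect value")
    parser.add_argument(
        "--evalue",          # hypothetical flag name, for illustration only
        type=float,          # explicit conversion: without it the value stays a str
        default=0.04,
        help="maximum expect value a hit may have to be kept",
    )
    return parser.parse_args(argv)


def keep_hit(hit_evalue, threshold):
    """Return True when a hit's expect value is at or below the threshold."""
    return hit_evalue <= threshold


if __name__ == "__main__":
    args = parse_args()
    # Both operands are floats here, so the comparison means what it says.
    print(keep_hit(1e-5, args.evalue))
```

Calling `float(sys.argv[1])` by hand would work just as well; the point is that the conversion appears in the code rather than being assumed.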
The second (and in my opinion less obvious) issue I have had with implicit design decisions relates to the qblast method from BioPython. As far as I could see, I was retrieving plenty of sequences, so the program had to be working. However, my PI was rather suspicious that so few sequences were being retrieved compared to the hundreds that came up when she BLASTed our sequence through the online interface. I searched numerous sites and went through the BioPython documentation but could find no mention of a sequence retrieval cut-off. Finally, in desperation, I went through the source code of the module I was using and found the cause in the function's default arguments.
A default limit of 50 unless overruled! A few extra characters and the problem was fixed, but until this discovery I was having serious problems later in the pipeline that I could not understand.
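Reading the source is not the only way to surface defaults like this one. The sketch below uses the standard library's `inspect` module to list every parameter of a function that has a default value. The `explicit_defaults` helper is my own name, and `fake_qblast` is a hypothetical stand-in that mirrors only the parameters discussed here, so the example runs without BioPython installed; the real function takes many more arguments.

```python
import inspect


def explicit_defaults(func):
    """Return {parameter name: default value} for each parameter that has a default."""
    params = inspect.signature(func).parameters
    return {
        name: param.default
        for name, param in params.items()
        if param.default is not inspect.Parameter.empty
    }


# Hypothetical stand-in for illustration only; not the real BioPython function.
def fake_qblast(program, database, sequence, expect=10.0, hitlist_size=50):
    pass


if __name__ == "__main__":
    # Print each hidden default so it can be reviewed before trusting the results.
    for name, value in explicit_defaults(fake_qblast).items():
        print(name, "=", value)
```

With BioPython installed, the same helper could be pointed at `Bio.Blast.NCBIWWW.qblast`. Assuming the limit is exposed as the `hitlist_size` keyword, as it is in the releases I am aware of, the fix is then to pass it explicitly in the call, for example `hitlist_size=500`, instead of relying on the default.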
[Edit] After a Twitter conversation with Peter from the BioPython Project, I’d like to add that this issue is due to the default settings of the online BLAST tool I was calling, as well as potentially the settings of the BioPython wrapper. A good lesson in understanding the defaults of the tools (BLAST) your tools (BioPython) are calling!
These are two of the most recent examples I have seen of how aware developers need to be of the implicit default values and behaviours built into the languages and libraries they call. I hope my mistakes and learning experience will be useful to others who may come across similar issues.
I originally posted this on the Australian Bioinformatics Network.