Here’s a list of a few R tutorials (in addition to the one I wrote) that I’ve found, or that look, rather useful:
Google Developers R Tutorials
A slightly Different Introduction to R
Currently fascinated by energy markets and electrical engineering. In another life I was a beekeeper that did a PhD in computational biology writing image analysis software and using machine learning to quantify honeybee behaviour in the hive.
One of the best things about the IPython Notebook is the number of easy-to-follow tutorials it has inspired. I thought I’d share a few that I’ve found on machine learning and statistics.
I’ll continue to update the list as I come across new notebooks that I find handy.
The R package ggplot2 is one of the best data visualisation tools I’ve come across, and while it simplifies generating impressive graphics, there’s still a bit you have to learn to use it. Here are a few of the posts I’ve found really handy when using this package:
I’ll try to continue updating this list as I find other good resources.
The project I am working on at present is focused on building an automated system for tracking the movement of honeybees in an observation hive, filmed using a grayscale camera and infrared light. My recent attempt at this has involved extracting features (or keypoints) from the frames and then describing them using the SIFT (Scale Invariant Feature Transform) algorithm, implemented in the OpenCV imaging library. There are a number of other (potentially faster) feature detection algorithms available (SURF, ORB, etc), however as SIFT has been traditionally regarded as the most robust, I thought I would start with it. SIFT works first by extracting what it deems to be robust keypoints/features. For each of these, it will then compute a unique feature descriptor based on the 16×16 pixel area of the feature. By extracting and describing these features, you can try to compare images as shown below.
I commenced my PhD in bioinformatics at the Australian National University about a month ago and thought I’d share some of the tools I’ve found absolutely essential. All of these tools are free to use (some do have paid plans, however at this point I haven’t had the need to sign up for them).
Line two of the Zen of Python reads “explicit is better than implicit” and until relatively recently I never truly appreciated the wisdom of those words. My change of heart stems from a series of Python scripts, where a large portion of my code dealt with automating and retrieving the results of a BLAST search using the fantastic BioPython toolkit. I was filtering my results based on an expect value of 0.04, which during my initial testing worked perfectly. However, as I wanted to make this value variable, I rewrote it into the program as a command-line argument. What I had not considered (but definitely should have) was how Python implicitly processes a command-line argument – as a string! I was never thrown an error – the program continued to work, so I assumed it was still doing the job just like in my tests. However, behind the scenes the filtering of my results had completely ceased to function.
The second (and in my opinion less obvious) issue I have had with implicit design decisions relates to the qblast method from BioPython. As far as I could see, I was retrieving plenty of sequences, therefore the program must be working. However, my PI was rather suspicious of how few sequences were being retrieved compared to the hundreds that were coming up when she would BLAST our sequence with the online interface. I searched numerous sites and went through the BioPython documentation but could find no mention of a sequence retrieval cut-off. Finally, in desperation, I went through the source code of the module that I was using and found this:
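To make the string pitfall concrete, here is a minimal sketch of one way to guard against it, assuming the threshold is read with argparse. The names parse_args, filter_hits and the --evalue flag are hypothetical and are not taken from my original scripts. In Python 2, comparing a float with a string silently returns a result instead of raising an error, which matches the behaviour I saw; Python 3 raises a TypeError instead. Converting the argument explicitly avoids relying on either.

```python
import argparse


def parse_args(argv):
    """Parse the command line, converting the threshold to a float explicitly."""
    parser = argparse.ArgumentParser(description="Filter BLAST hits by expect value")
    # type=float makes the conversion explicit; without it the value stays a string
    parser.add_argument("--evalue", type=float, default=0.04,
                        help="maximum expect value to keep (hypothetical flag)")
    return parser.parse_args(argv)


def filter_hits(hits, threshold):
    """Keep (name, expect_value) pairs whose expect value is below the threshold."""
    return [(name, e) for name, e in hits if e < threshold]


if __name__ == "__main__":
    args = parse_args(["--evalue", "0.01"])
    # Made-up hits purely for illustration
    hits = [("seq_a", 1e-30), ("seq_b", 0.02), ("seq_c", 3.0)]
    print(filter_hits(hits, args.evalue))
```

If you stick with sys.argv, wrapping the value in float() at the point where it is read has the same effect.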
A default limit of 50 unless overruled! A few extra characters and the problem was fixed, but until this discovery I was having serious problems later in the pipeline that I could not understand.
[Edit] After a Twitter conversation with Peter from the BioPython Project, I’d like to add that this issue is due to the default settings of the online BLAST tool I was calling, as well as potentially the settings of the BioPython wrapper. A good lesson in understanding the defaults of the tools (BLAST) your tools (BioPython) are calling!
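For anyone wanting to avoid the same trap, here is a rough sketch of passing the limit explicitly rather than relying on the default. It assumes the cut-off corresponds to the hitlist_size keyword of Bio.Blast.NCBIWWW.qblast, and that your BioPython version still exposes that module; the run_blast wrapper, its default of 500 and the injectable qblast parameter are my own hypothetical additions so that the function can be tried without a network connection.

```python
def run_blast(sequence, hitlist_size=500, qblast=None):
    """Submit a nucleotide BLAST search with an explicit hit limit.

    qblast can be replaced with a stand-in function for offline testing.
    """
    if qblast is None:
        # Imported lazily: needs BioPython installed and network access to NCBI
        from Bio.Blast import NCBIWWW
        qblast = NCBIWWW.qblast
    # State the limit explicitly instead of trusting the implicit default
    return qblast("blastn", "nt", sequence, hitlist_size=hitlist_size)
```

Calling run_blast(my_sequence, hitlist_size=1000) would then request up to 1000 hits, subject to whatever limits the online BLAST service itself applies.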
These are two of the most recent examples that I have seen of how aware developers need to be about the implicit default values and methods that are present in the language and library they are calling. I hope my mistakes and learning experience will be useful to others who may come across similar issues.
This was originally posted by me on the Australian Bioinformatics Network.
This tutorial is a beginner’s guide to getting started with R. Once you complete it, you should have R installed on your computer and be able to import data, perform basic statistical tests and create graphics.
Index
The first thing you will have to do is download R and install it on your computer. To do this you’ll need to visit a CRAN (Comprehensive R Archive Network) repository. There are a number of sites you can find easily by searching, however here in Australia it is hosted by the CSIRO here. When you visit the site you’ll be asked to click on the link to the R version for your computer (Linux, Mac, Windows). Once you do so, you can then proceed to download the software (although Windows users should make sure they select the base version of R to install).
Once R is installed, you’re ready to get going, although I would recommend installing one other piece of software before proceeding: RStudio, which may be found here. RStudio is a fantastic development environment for writing R scripts and is especially useful for beginners.
This tutorial is a brief overview of what you can achieve using the Python BioPython module. Although I’m hoping to write up some more articles on this site for beginners when time permits, this post will assume that you have experience programming in Python and have a bit of an understanding of basic biological concepts such as DNA, restriction enzymes etc. If you’re still interested once you finish reading, feel free to consult the BioPython documentation, which will give you a bit of an idea of how massive (and awesome) this module really is.
So to start I’ll show you how to install the BioPython module. On Linux systems it can be as simple as typing ‘sudo apt-get install python-biopython’ or going to the Software Center. Alternatively, you can manually install a module by going to PyPI, downloading and extracting the file, opening the command line or terminal, navigating into the root directory of the folder you just extracted and running the command ‘python setup.py install’.
You will need to download two modules to install BioPython, each of which is hosted on its own site. The first is SciPy and the second is BioPython. Once you have installed these you’re ready to get into using BioPython.
Pyral (Python + Viral) was the name of a project I worked on in Dr Joanne Macdonald’s lab between September 2012 and January 2013 (although I am still providing tech support for the code and helping manage the server to this date). Throughout this time I wrote a lot of Perl and Python code to run on the university’s Linux server. The aims of these programs were as follows:
The main use I’ve found for the Python sys module is allowing command-line arguments to be passed to a script. Here is an example of how it looks:
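import sys

# sys.argv[0] is the script name, so one user-supplied argument gives a length of 2
if len(sys.argv) == 2:
    input_file = sys.argv[1]
else:
    print("Please input a command-line argument specifying the file")
    sys.exit(1)  # stop here so input_file is never used while undefined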
This script checks that sys.argv contains two items before assigning the value sys.argv[1] to a variable. We check for two because the first item (sys.argv[0]) is the name of the Python script currently being executed, so a length of two means that exactly one argument was supplied by the user.
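The same check can also be wrapped in a small function, which makes it easy to try out without touching the real command line. This is only a sketch, and get_input_file is a hypothetical name of my own choosing:

```python
def get_input_file(argv):
    """Return the single user-supplied argument, or None if the count is wrong.

    argv is expected to look like sys.argv: script name first, then arguments.
    """
    if len(argv) == 2:
        return argv[1]
    return None


if __name__ == "__main__":
    # Simulated argument lists; the file name is made up for illustration
    print(get_input_file(["script.py", "reads.fasta"]))
    print(get_input_file(["script.py"]))
```

In a real script you would import sys and call get_input_file(sys.argv), then print the usage message and exit if it returns None.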