Recently I used FFmpeg to automatically extract the date and time that some videos I’ve been using for my research were created. FFmpeg is a really useful tool for manipulating videos and images at the command-line. You can do things like change the format, quality or length of your video files with relative ease. The image processing library OpenCV even uses FFmpeg under the hood for opening videos you want to process. You can also use FFmpeg to extract metadata about your video files, and that is what I was using it for. However, I soon discovered that the metadata I retrieved with the program wasn’t completely accurate: video file creation times were off by a couple of hours. Normally that wouldn’t matter much, but if you’re breaking your data down hour by hour over a two-week period, discrepancies like these really throw your analysis out. It was a great reminder to always thoroughly check the output of your analysis pipelines and to never assume that a tool will work perfectly all the time.
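For anyone wanting to try this, here is a minimal sketch of one way to read a creation time from Python using ffprobe, the inspection tool that ships with FFmpeg. It is not the exact command I ran. The function names are made up for illustration, and the timestamp format is an assumption: some FFmpeg versions print it differently, and some files carry no creation_time tag at all.

```python
import json
import subprocess
from datetime import datetime, timezone


def parse_creation_time(ffprobe_json):
    """Pull the container-level creation_time tag out of ffprobe's JSON output.

    Returns a timezone-aware datetime, or None if the tag is missing.
    """
    tags = json.loads(ffprobe_json).get("format", {}).get("tags", {})
    stamp = tags.get("creation_time")
    if stamp is None:
        return None
    # Assumed format: "2015-03-02T04:15:30.000000Z" (the trailing Z means UTC)
    parsed = datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%S.%fZ")
    return parsed.replace(tzinfo=timezone.utc)


def probe_creation_time(path):
    """Run ffprobe on a video file and return its creation time (or None)."""
    result = subprocess.run(
        ["ffprobe", "-v", "quiet", "-print_format", "json", "-show_format", path],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_creation_time(result.stdout)
```

Whatever value comes back, it is worth comparing it against a recording whose real start time you know before trusting it across a whole dataset.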
Tips & Tutorials
In this tutorial, I’m going to provide an introduction to the basics of the programming language C++. I’ll describe how to compile your first program with the gcc compiler on Linux and Mac, although the code should also work on Windows using the Visual Studio compiler. If you’re new to programming, I’d probably recommend getting started with an easier scripting language like Python, before you get into C++. That said, hopefully the information in this post should still be useful for the complete beginner.
What is C++?
C++ was created by Bjarne Stroustrup, who began work on it in 1979 as an extension of the popular C programming language. It added classes and the ability to write object-oriented code.
Is C++ worth learning?
Although C and C++ came out decades ago, they remain extremely popular when speed and performance are crucial requirements. C is a great language, but C++ added a lot of functionality which comes in handy for big projects. While knowing C will give you a head-start when understanding C++, it is not essential. Personally, I started off learning C before changing to C++ because the imaging library I use for a lot of my work (OpenCV) is written in C++.
Imagine you have a lot of nucleotide sequence identifiers and want to find out what organism the DNA is from. You could go to the NCBI website and spend a long time looking each one up, or you could write a short Python script using BioPython to retrieve the header of the FASTA record each identifier refers to:
Before today, the only real use I’d had for regular expressions in Python was to just find the first instance of a pattern. For example, if I want to find the contents of the text between the first set of single quotation marks (in this case ‘26245730’), I would proceed like so:
import re

all_id = "'26245730': 817, '389595538': 735, '541129065': 529, '541129071': 340, '558870185': 305, '444325280': 287, '573974252': 272, '281314044': 222"
first_id = re.search("'(.*?)'", all_id)
print(first_id.group(1))
The arguments passed to re.search define the pattern I am looking for: The single quotation marks on either side of the brackets show that I am looking for a pattern between them. The “.” within the brackets tells Python that I am happy with finding any character, number, etc and the “*” next to these mean it will look for 0 or more instances of this text. Finally, the “?” ensures that the expression isn’t greedy. What does it mean to be greedy with a regular expression? It means that instead of finding the pattern between the first two single quotation marks, it will find the pattern between the first and the last quotation marks! So I’ll end up with practically all of my string being returned!
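To see the difference for yourself, here is a small sketch that runs the lazy and the greedy pattern side by side on a shortened copy of the string above. It also uses re.findall, which I haven't covered so far, as one way to collect every quoted identifier instead of only the first.

```python
import re

# Shortened copy of the example string, to keep the output readable
all_id = "'26245730': 817, '389595538': 735, '541129065': 529"

# Lazy: stops at the first closing quotation mark
lazy = re.search("'(.*?)'", all_id).group(1)

# Greedy: runs on to the last quotation mark in the string
greedy = re.search("'(.*)'", all_id).group(1)

# findall returns the captured group from every non-overlapping match
every_id = re.findall("'(.*?)'", all_id)

print(lazy)      # 26245730
print(greedy)    # 26245730': 817, '389595538': 735, '541129065
print(every_id)  # ['26245730', '389595538', '541129065']
```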
When I first started learning OpenCV, I was working exclusively with Python. While I am still a huge fan of the language, today all of my OpenCV programs are written in C++. Why?
- Some of the deeper functionality of OpenCV has not been completely ported to Python (although hopefully the release of OpenCV 3.0 will fix most of these issues).
- Most in-depth textbooks on image processing and computer vision that cover OpenCV use C++ as their primary language. It was therefore easier to learn from these resources by adopting the language.
- A lot of the computer vision techniques I use (SIFT, machine learning, etc), are better documented in C++.
- Passing images back and forth between NumPy arrays has overhead that C++ doesn’t have to worry about.
- As has been suggested to me by Carl Bell, Python struggles to perform well with overloaded functions.
Python’s random module makes it extremely easy to generate random DNA bases.
import random

dna = ["A", "G", "C", "T"]

# output a random base
print(random.choice(dna))
Now to generate a specific number of random bases, all we have to do is use Python’s range function:
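A minimal sketch of that idea is below. The function name random_sequence and the length of 20 are just illustrative choices, and joining the bases into a single string is one option among several (you could equally keep them as a list).

```python
import random

dna = ["A", "G", "C", "T"]


def random_sequence(length):
    """Build a string of `length` bases, each picked independently at random."""
    return "".join(random.choice(dna) for _ in range(length))


# output a random 20-base sequence (different on every run)
print(random_sequence(20))
```

If you need the same sequence every time, for example when testing, call random.seed with a fixed number before generating it.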
If you want to compile and run a C++ program using OpenCV in Sublime Text, then copy and paste the code below into a build system file. If you’re not interested in explicitly using C++11 you can delete the “-std=c++0x” section.
{ "shell_cmd": "g++ -std=c++0x ${file} -o ${file_base_name} `pkg-config --cflags --libs opencv` && ./${file_base_name}" }
This tutorial is a beginner’s guide to getting started with R. Once you complete it, you should have R installed on your computer and be able to import data, perform basic statistical tests and create graphics.
The first thing you will have to do is download R and install it on your computer. To do this you’ll need to visit a CRAN (Comprehensive R Archive Network) repository. There are a number of sites you can find easily by searching; here in Australia, one is hosted by the CSIRO. When you visit the site you’ll be asked to click on the link to the R version for your computer (Linux, Mac, Windows). Once you do so, you can then proceed to download the software (although Windows users should make sure they select the base version of R to install).
Once R is installed, you’re ready to get going, although I would recommend installing one other piece of software before proceeding – RStudio which may be found here. RStudio is a fantastic development environment for writing R scripts and is especially useful for beginners.
This tutorial is a brief overview of what you can achieve using the BioPython module for Python. Although I’m hoping to write up some more articles on this site for beginners when time permits, this post will assume that you have experience programming in Python and have a bit of an understanding of basic biological concepts such as DNA, restriction enzymes etc. If you’re still interested once you finish reading, feel free to consult the BioPython documentation; it will give you a bit of an idea of how massive (and awesome) this module really is.
So to start I’ll show you how to install the BioPython module. On Linux systems it can be as simple as typing ‘sudo apt-get install python-biopython’ or going to the Software Center. Alternatively, you can manually install a module by going to PyPI, downloading and extracting the file, opening the command-line or terminal, navigating into the root directory of the folder you just extracted and running the command ‘python setup.py install’.
You will need to download two modules to install BioPython, each of which is hosted on its own site. The first is NumPy, which BioPython depends on, and the second is BioPython itself. Once you have installed these you’re ready to get into using BioPython.
Pyral (Python + Viral) was the name of a project I worked on in Dr Joanne Macdonald’s lab between September 2012 and January 2013 (although I am still providing tech support for the code and helping manage the server to this date). Throughout this time I wrote a lot of Perl and Python code to run on the university’s Linux server. The aims of these programs were as follows:
- Download all the viral RefSeq genomes from GenBank;
- BLAST a sequence of interest and retrieve all similar files;
- Concatenate all sequences into one file that was run through CD-HIT;
- Analyse the CD-HIT output, returning a file with the cluster numbers that sequences of interest may be found in;
- Find variable length conserved regions of DNA within a designated cluster;
- Ensure the conserved region of DNA is completely dissimilar to those found in other virus clusters.
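To give a flavour of the conserved-region step, here is a toy sketch. This is not the Pyral code. It assumes the sequences in a cluster have already been aligned to the same length (with "-" marking gaps), and the function name and min_length parameter are invented for illustration.

```python
def conserved_regions(aligned, min_length=3):
    """Return (start, end) column ranges where every aligned sequence agrees.

    Ranges are half-open, so aligned[0][start:end] is the conserved stretch.
    Columns containing a gap character ("-") never count as conserved.
    """
    if not aligned:
        return []
    width = len(aligned[0])
    if any(len(seq) != width for seq in aligned):
        raise ValueError("sequences must be aligned to the same length")

    regions = []
    start = None
    # Go one column past the end so a run reaching the last column is closed
    for i in range(width + 1):
        same = (
            i < width
            and aligned[0][i] != "-"
            and len({seq[i] for seq in aligned}) == 1
        )
        if same and start is None:
            start = i
        elif not same and start is not None:
            if i - start >= min_length:
                regions.append((start, i))
            start = None
    return regions


cluster = ["ACGTTGCA", "ACGATGCA", "ACGCTGCA"]
print(conserved_regions(cluster))                # [(0, 3), (4, 8)]
print(conserved_regions(cluster, min_length=4))  # [(4, 8)]
```

Because min_length only sets a lower bound, the regions returned can be any length above it, which is one simple reading of "variable length" in the list above.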