Python hmmlearn installation issues

I’ve recently started learning how to apply a Hidden Markov Model (HMM) to some states of honeybee behaviour in my data and have been trying to install Python’s hmmlearn library. Unfortunately, I kept getting this frustrating error because the installer was unable to locate the NumPy headers:

 hmmlearn error: 'numpy/arrayobject.h' file not found 

After a bit of searching I found the solution in a ticket on GitHub, but I thought I’d include it here a) for my own future reference and b) because I’ve updated the command to work with Python 3.5.

 export CFLAGS="-I /usr/local/lib/python3.5/site-packages/numpy/core/include/ $CFLAGS" 

Once you’ve run this, you should be able to install hmmlearn via pip without any problems.
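If your NumPy headers live somewhere other than the path above (a different Python version, a virtualenv and so on), NumPy can tell you where they are; numpy.get_include() is part of NumPy’s public API. A minimal sketch:

import numpy
print(numpy.get_include())  # this directory contains numpy/arrayobject.h

You can then substitute that directory into the CFLAGS export above before installing hmmlearn.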

Mistakes in machine learning datasets

One of the things you realise once you start learning about machine learning is just how important a well-annotated training dataset is. Your predictive model will only ever be as good as the labelled data you originally gave it.

A little while back, when I was trying to learn how to use the deep learning library Caffe, I started watching a series of educational webinars that Nvidia had released for free. I can’t recommend these videos enough if you want to learn more about deep learning – I learned a huge amount from watching them. However, during this course, I noticed that one of the images used as an example of the “bee” category in the data looked like this:

Hover Fly Mistaken For Bee

Despite the black and yellow stripes, this insect is not a bee; it is actually a hover fly. I quickly found that images of these flies are misclassified as bees all over the internet – apparently I only noticed because I’m one of the few bee researchers interested in machine learning.

I think that this misclassification is polluting a lot of the images labelled as containing bees in the major datasets that people use for training. I would urge any computer scientist interested in a project involving insect identification to be extremely careful and to collaborate closely with entomologists before selecting images from a dataset to use for training.

This also raises the question: how do you test the accuracy of your model if you don’t know if the original labels have any basis in reality? I wonder how many other major categories in these datasets are mistakenly labelled because the people putting the data together don’t have the domain knowledge to make these kinds of subtle distinctions?

Compiling C++ Code Using Caffe

As part of my PhD project, I have been writing a program in C++ to track hundreds of bees that I have tagged and to identify the pattern on the tags. Initially, I had thought that recognising the tags would be rather simple: I could threshold out the tags, which are reflective under IR light, and then measure the shape of the object within each tag. Unfortunately, the tags lost their reflectiveness after a short period of time and became smeared (amongst other issues), and so I had to turn to machine learning to try to solve my recognition problem.

To cut a long story short, the deep learning framework Caffe helped me to solve this problem after a lot of frustration (on the positive side, I did learn a lot about machine learning in the process). However, there is one thing that I found surprisingly frustrating when using this framework: compiling your own standalone C++ program that links against Caffe. For anyone in the same situation as me, this is how to do it (keeping in mind that I was not using GPUs):

 

  1. At the top of your .cpp program which links to the Caffe library, and before any Caffe headers are included, you will need to make the following definition:
    #define CPU_ONLY
  2. If we try to compile anything now, Caffe will make this complaint:
    caffe/proto/caffe.pb.h: No such file or directory
    Some of the header files are missing from the Caffe include directory. Thus, you’ll need to generate them with these commands from within the Caffe root directory:
    protoc src/caffe/proto/caffe.proto --cpp_out=.
    mkdir include/caffe/proto
    mv src/caffe/proto/caffe.pb.h include/caffe/proto
  3. Finally, I copied libcaffe.so into /usr/lib and the caffe directory containing the header files ($caffe_root/include/caffe) into the /usr/include directory. To compile this on a Mac (after installing OpenBLAS with Homebrew), I just had to run:
    g++ classification.cpp -lcaffe -lglog -lopencv_core -lopencv_highgui -lopencv_imgproc -I /usr/local/Cellar/openblas/0.2.14_1/include -L /usr/local/Cellar/openblas/0.2.14_1/lib -o classifier
  4. Alternatively, you could do what I did on my Linux machine: instead of copying the header files, just link directly to those directories when compiling:
    g++ classification.cpp -lcaffe -lglog -lopencv_core -lopencv_highgui -lopencv_imgproc -I ~/caffe/include -L ~/caffe/build/lib -I /usr/local/Cellar/openblas/0.2.14_1/include -L /usr/local/Cellar/openblas/0.2.14_1/lib -o classifier
Interesting Readings

Work has kept me pretty busy lately, but I’ve been meaning to put together another post with some of the interesting readings I’ve come across. The first thing I’ll mention is that the IEEE (Institute of Electrical and Electronics Engineers) has released its rankings for programming language popularity. Python (ranked #4) and R (ranked #6) have continued to increase in popularity, while Java remains at the top, which is to be expected given how popular it is in industry. Below are some of the other articles I’ve found:

Big Data

R

Python

Version Control

Saving OpenCV matrices to disc and loading them

Recently I needed to save some matrices I had generated with OpenCV and C++. This snippet of code shows you how you can do this and then open the files again to retrieve the matrices. You can save the files in either the .yml or .xml format. In the example below, “trainData” and “trainLabels” are the two matrices I want to save:

// Write the matrices out to a YAML file
cv::FileStorage file("tag_data.yml", cv::FileStorage::WRITE);
file << "data" << trainData;
file << "classes" << trainLabels;
file.release();

// Later (in another program), when you want to retrieve the stored matrices
cv::Mat trainDataCopy, trainLabelsCopy;
cv::FileStorage file("tag_data.yml", cv::FileStorage::READ);
file["data"] >> trainDataCopy;
file["classes"] >> trainLabelsCopy;
file.release();

Done. You should now be able to use both matrices.

Be careful with FFmpeg metadata

Recently I used the program FFmpeg to automatically extract metadata about the time and date that some videos I’ve been using for my research were created. FFmpeg is a really useful tool for manipulating videos and images at the command line. You can do things like change the format, quality or length of your video files with relative ease. The image processing library OpenCV even uses FFmpeg under the hood for opening videos you want to process. You can also use FFmpeg to extract metadata about your video files, and that is what I was using it for. However, I soon discovered that the metadata I retrieved with the program wasn’t completely accurate: video file creation times were off by a couple of hours. Normally that wouldn’t matter much; however, if you’re breaking your data down hour by hour over a two-week period, these kinds of discrepancies really throw your analysis out. It was a great reminder to always thoroughly check the output of your analysis pipelines and to never assume that a tool will work perfectly all the time.
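If you want to see exactly what FFmpeg is reporting, this is a minimal sketch of how you could pull the creation time out from Python using ffprobe (which ships with FFmpeg); the file name is just a placeholder. Note that the creation_time tag is typically stored in UTC, which is one common source of multi-hour offsets, so it’s worth converting it to local time explicitly before using it in an analysis.

import json
import subprocess

def video_creation_time(path):
    # Ask ffprobe for the container-level metadata as JSON
    out = subprocess.check_output(
        ["ffprobe", "-v", "quiet", "-print_format", "json", "-show_format", path])
    tags = json.loads(out.decode()).get("format", {}).get("tags", {})
    return tags.get("creation_time")  # usually an ISO-style timestamp

print(video_creation_time("hive_video.mp4"))  # placeholder file name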

Image processing with scikit-image

I’ve been using OpenCV over the past couple of years for all my image processing work. It’s a really extensive and useful library, and just recently, OpenCV 3.0 was released, which includes bindings for Python 3! Finally, one of my last reasons to continue using Python 2 has disappeared. However, there is another image processing library that I’ve just found out about – scikit-image. Yes, that’s the same scikit as the one from scikit-learn, the awesome machine learning library for Python that I linked to previously in my machine learning resources post. While I plan to continue using OpenCV, I had a read through some of the scikit-image tutorials with interest, and if you’re looking for a lightweight library for image processing that plays well with SciPy, I’d definitely check it out.
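To give a flavour of the library, here is a minimal sketch of the sort of thing the scikit-image tutorials cover; it uses one of the sample images bundled with the library, and the function names come straight from the skimage API.

from skimage import data, filters

image = data.coins()                       # sample greyscale image bundled with skimage
edges = filters.sobel(image)               # Sobel edge detection
threshold = filters.threshold_otsu(image)  # Otsu's automatic threshold
binary = image > threshold                 # boolean mask of the brighter regions

Everything is just a NumPy array, so the results drop straight into SciPy or matplotlib without any conversion.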

If you’re interested in learning more about image processing with OpenCV, I’m preparing a presentation for the Canberra Python Users Group, written as an IPython Notebook, which provides an introduction to using this library. I’ll upload the notebook to GitHub and write a post linking to it in a few weeks.

Machine Learning Graphics from Melanie Warrick’s PyCon 2014 Presentation

Recently, I watched Melanie Warrick’s great talk on getting started with machine learning at PyCon 2014. I linked to it in my post on materials for learning machine learning; however, I wanted to write another post with pictures of some slides from her presentation that I found incredibly helpful. I strongly recommend watching the video and then using these slides when trying to figure out what you should do when working on a machine learning problem.

Machine Learning Project Flow
Machine Learning Algorithms
Resources for Learning Haskell

I’ve had a few recommendations lately from experienced programmers that I should learn a “pure” functional language, as this is supposed to make you think about problems and programming in a different way. I’ve used some functional aspects of Python before (like map); however, this will be my first real attempt at a proper functional language: Haskell.

As I’ve been reading up on the language and running through tutorials, I thought I’d start up a list of learning materials I found rather helpful:

PyCon 2015 Talks I Found Interesting

Here are some of the PyCon 2015 talks I found rather interesting. Please note that some of the talks that refer to machine learning have been linked to in my post here.
