When to use ‘is’ and ‘==’ in Python

One of the things that may seem confusing in Python is that there appear to be two ways to test whether variables are the same: ‘==’ and ‘is’:

x = 200
y = 200
print(x == y)
print(x is y)

Both comparison methods returned True, so they do the same thing, right? Well, not really. To illustrate this, I’ll change the integer value assigned:

x = 500
y = 500
print(x == y)
print(x is y)

Now we have ‘==’ returning True and ‘is’ returning False (at least when each line is entered in the interactive interpreter; within a single script the compiler may fold identical constants into one object). What happened? Well, ‘==’ is referred to as the equality operator, and it checks that the values of the variables are the same. That’s why it returned True both times. On the other hand, ‘is’ is referred to as the identity operator, and it compares the ids of the objects that the variables point to. The first example only returned True because CPython caches the small integers from -5 to 256, so x and y were both pointing at the same cached object; 500 falls outside that range, so each assignment created a new object. I have a more detailed post on object ids available here.
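
Another way to think about it: ‘x is y’ asks whether two names point to the very same object, which is the same question as comparing their ids. Lists make this unambiguous, since two list literals always build two separate objects:

x = [1, 2]
y = [1, 2]
print(x == y)           # True: the values match
print(x is y)           # False: two distinct list objects
print(id(x) == id(y))   # False: 'is' is equivalent to this id comparison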

So the moral of this is that usually you just want the equality operator (‘==’), as there is no guarantee that variables storing the same number will also point to the same object (and therefore share an object id). The only time you really need the ‘is’ operator is when you explicitly want to check whether one variable points to the very same object as another variable.
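
The classic example is testing for None. Since None is a singleton, PEP 8 recommends the identity check; a minimal sketch:

x = None

# there is only ever one None object, so an identity check is exact;
# a class can override __eq__ and make '== None' give surprising results
if x is None:
    print('x is not set')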

Python object ids and mutable types

Did you know that every object in your Python program is given a unique identifier by the interpreter, which you can retrieve with the ‘id()’ function? Let’s see what happens when we assign one variable to another in Python and then print out each variable’s value and object id:

x = 5
y = x
print(x, id(x))
print(y, id(y))

Now I am printing out two things about each variable: the value it stores (the number 5) and its object id. The object id for both variables should be the same, which tells us that the x and y variables are referring to the same object. Now, what happens if I change the y variable?

x = 5
y = x
y += 1
print(x, id(x))
print(y, id(y))

Here I’ve added 1 to the y variable, and as you can see, both the value and the object id of the y variable are now different from those of the x variable. This happened because integers in Python are immutable: integers, floats, strings and tuples are all examples of immutable types. What this means is that when you ‘change’ the number a variable refers to, the Python interpreter actually points the variable at a completely different object.
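
The same rebinding happens with the other immutable types. For example, ‘adding’ to a string really builds a brand new string object:

s = 'bee'
print(s, id(s))
s += 's'           # creates a new string and rebinds s to it
print(s, id(s))    # the value and the object id have both changed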

Now, the distinction between mutable and immutable objects in Python starts to become more important when you start working with mutable data types like lists:

x = [1]
y = x
y.append(5)
print(x, id(x))
print(y, id(y))

The thing that may surprise a lot of people is that the list both x and y point to has changed! The list now contains both a 1 and a 5, even though I only appended to y. Because lists are mutable, the line ‘y = x’ didn’t create a new list – it just gave the same list object a second name, so appending through either name changes it. This is very easy to get tripped up on if you don’t know what Python is really doing under the hood.

Now how do you get around this? Luckily the standard library has us covered with a handy module called ‘copy’:

import copy

x = [1]
y = copy.copy(x)
y.append(5)
print(x, id(x))
print(y, id(y))

Once you use the ‘copy.copy()’ function on x, you can see from the output that y has a new object id and that changes you make to the y variable no longer affect x. Problem solved, right?
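
As an aside, for lists specifically you can get the same shallow copy without importing anything; each of these is equivalent to copy.copy(x) here:

x = [1]
y = x[:]        # slice copy
y = list(x)     # constructor copy
y = x.copy()    # the list.copy() method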

Well, there’s just one final thing to consider – what happens if I have a list of lists and I use copy.copy()?

import copy

x = [[1, 2], [3, 4]]
y = copy.copy(x)
y[0].append(5)
print(x, id(x))
print(y, id(y))

You can see that I’ve made a copy of the x variable, then appended to the first inner list of y (the one containing [1, 2]). However, when you look at the output, you’ll notice that, like before, the change made to y has also modified x. You can see the reason for this by printing the object id of the first inner list of each variable:

print(x[0], id(x[0]))
print(y[0], id(y[0]))

Both inner lists are the same object – only the outer list was copied and received a new object id! To solve this, we’ll need the ‘deepcopy’ function, which creates unique copies of all the objects throughout the list – not just a copy of the outer list itself:

import copy

x = [[1, 2], [3, 4]]
y = copy.deepcopy(x)
y[0].append(5)
print(x, id(x))
print(y, id(y))

Now any changes to the inner lists of x or y will no longer affect the other variable.

So how is this useful? The reason I decided to write this post this evening is that some of the code I was working on today involved initialising multiple dictionaries with many keys. I could have copied and pasted them, but it was a lot easier and more readable to create one initial variable and then use ‘deepcopy’ to initialise the others. If I hadn’t been aware of Python object ids and mutable types, I could very easily have created a massive bug by making several variables that all modified the same dictionary.
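
As a sketch of that pattern (the dictionary here is a made-up stand-in for the ones I was actually working with):

import copy

# a hypothetical template with nested, mutable values
template = {'counts': [], 'settings': {'threshold': 0.5}}

# every experiment gets a fully independent copy of the template
night = copy.deepcopy(template)
day = copy.deepcopy(template)

night['counts'].append(10)
print(night['counts'])   # [10]
print(day['counts'])     # [] - the other dictionary is untouched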

Closing a frozen SSH session

Normally, when I leave an SSH session idle too long and it freezes the terminal, I just close that tab and start a new one – I had never bothered to look up how to unfreeze it and continue. I recently came across an article showing the three keystrokes you need to close the dead session, and I’ve included them below so I can easily find them again in the future:

  1. [Enter]
  2. ~ (tilde)
  3. . (full stop)

This works because they are OpenSSH escape sequences: the ‘~’ is only recognised at the start of a line, which is why you press Enter first, and ‘~.’ tells the client to terminate the connection.

Positioning a legend outside the figure with Matplotlib and Python

Something that has been a little frustrating lately is what to do when you need a legend for your plot, yet there’s so much content in the plot that the legend has to sit beside the axes rather than within them. The standard way to create a plot with the legend inside it looks like this:

Standard Legend

import matplotlib.pyplot as plt

plt.figure()
plt.plot([5] * 40, '-go', label='bottom_right_section')
plt.xlim([0, 39])
plt.title('Bee Speed By Quarter of Frame')
plt.xlabel('Nights and Days')
plt.ylabel('Number of fast cells')
plt.legend(loc='upper left')
plt.show()

My first attempt at moving the legend outside the axes was to add a ‘bbox_to_anchor’ argument to the ‘legend’ call:

plt.legend(loc='upper left', bbox_to_anchor=(1, 1))

bbox_test

Unfortunately, the legend was cut off on the right-hand side. I then tried shrinking the legend (as it was rather large), and when that didn’t work, I found that I could pass a padding argument to ‘tight_layout’, which finally solved the issue:

plt.legend(loc='upper left', prop={'size':6}, bbox_to_anchor=(1,1))
plt.tight_layout(pad=7)

matplotlib_padding

Here is the final code, which solves the problem by establishing a large margin around the figure:

import matplotlib.pyplot as plt

plt.figure()
plt.plot([5] * 40, '-go', label='bottom_right_section')
plt.xlim([0, 39])
plt.title('Bee Speed By Quarter of Frame')
plt.xlabel('Nights and Days')
plt.ylabel('Number of fast cells')
plt.legend(loc='upper left', prop={'size': 6}, bbox_to_anchor=(1, 1))
plt.tight_layout(pad=7)
plt.show()
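
As an aside, if you are saving the figure to a file rather than showing it on screen, I believe you can sidestep the padding guesswork by letting ‘savefig’ grow the saved area around the legend. The ‘bbox_extra_artists’ and ‘bbox_inches’ arguments are standard Matplotlib; the file name here is just for illustration:

import matplotlib.pyplot as plt

plt.figure()
plt.plot([5] * 40, '-go', label='bottom_right_section')
plt.xlim([0, 39])
lgd = plt.legend(loc='upper left', bbox_to_anchor=(1, 1))
# expand the saved bounding box so the external legend is included
plt.savefig('bee_speed.png', bbox_extra_artists=(lgd,), bbox_inches='tight')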

Never trust your factors

I recently helped a friend out with a dataset – she was struggling to merge two dataframes, each read from a CSV file, into one dataframe in R. I thought this would be quite simple, and yet I could not get it to work with merge or dplyr – it just kept giving me weird results. The problem was that I was too trusting of data that had been input by human hand. Here’s what happened when I started to critically interrogate the data.

Firstly, I read in the CSV files – so far so good:

tree = read.csv("TreeData.csv", header = T)
full = read.csv("FullAnalysisRawData.csv", header = T)

Next I used the View command to see what the dataframes looked like in RStudio:

View(tree)
View(full)

Now, in this particular dataset I wanted to cbind (column-bind) the rows that shared the same TreeID and Sample.type across the two CSV files. The first error I noticed in the data was a duplicated record with the ID ‘B01(2)’ – this is how I got rid of that row:

full_rm_dup <- full[!(full$TreeID == 'B01(2)'),]

Next, I noticed that the levels of the Sample.type factor were capitalised in one CSV file but not the other. The easiest fix was to rename the factor levels in one of the files:

levels(tree$Sample.type) <- c("Bark", "Leaf")

Finally, I wrote a function that would go through each row in the first dataframe and look at the value for TreeID and Sample.type. It would then look for rows in the other dataframe that matched these two values. I decided to print out the resulting values from this like so:

merge_dfs <- function(each_row) {
 # apply() converts the dataframe to a character matrix, so index by
 # position and convert explicitly
 full_tree_id <- as.character(each_row[1])
 full_sample_type <- as.character(each_row[3])
 # find the row in 'tree' matching both the TreeID and the Sample.type
 matching_row <- tree[(tree$TreeID == full_tree_id & tree$Sample.type == full_sample_type),]
 print(paste(full_tree_id, matching_row$TreeID, full_sample_type, matching_row$Sample.type))
}

apply(full, 1, merge_dfs)

The apply function in R applied my function to every row (that’s what the number 1 did; to apply it to every column I would have passed 2 as this argument instead). When I looked at the output from this function, I saw quite a few rows where the TreeID in one CSV file did not match any of the TreeIDs in the other file. I emailed all of this to the researcher, who then knew what was wrong with the dataset and could fix the mistakes that had occurred. With that done, the original merge function she was using worked perfectly.

Moral of the story: never trust that the factors in your dataset are correct – capitalisation, duplication and input mistakes happen really easily and can be quite subtle in large datasets.

Python hmmlearn installation issues

I’ve recently started learning how to apply a Hidden Markov Model (HMM) to some states of honeybee behaviour in my data, and have been trying to install Python’s hmmlearn library. Unfortunately, I kept getting this frustrating error because the build could not locate the NumPy headers:

 hmmlearn error: 'numpy/arrayobject.h' file not found 

After a bit of searching I found the solution in a ticket on GitHub, but I thought I’d include it here a) for my own future reference and b) because I’ve updated the command to be applicable to Python 3.5.

 export CFLAGS="-I /usr/local/lib/python3.5/site-packages/numpy/core/include/ $CFLAGS"

Once you’ve run this, you should be able to install hmmlearn via pip without any problems.
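
If the NumPy headers live somewhere other than the path above, you can ask NumPy itself rather than guessing – ‘numpy.get_include()’ is part of NumPy’s public API and returns the directory containing ‘numpy/arrayobject.h’:

import numpy

# prints the include directory to pass to the compiler via -I
print(numpy.get_include())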

Mistakes in machine learning datasets

One of the things you realise once you start learning about machine learning is just how important a well-annotated training dataset is. Your predictive model will only ever be as good as the labelled data you originally gave it.

A little while back, when I was trying to learn how to use the deep learning library Caffe, I started watching a series of educational webinars that Nvidia had released for free. I can’t recommend these videos enough if you want to learn more about deep learning – I learned a huge amount from watching them. However, during the course I noticed that one of the images used as an example of the “bee” category in the data looked like this:

Hover Fly Mistaken For Bee

Despite the black and yellow stripes, this insect is not a bee – it is actually a hover fly. I quickly found that images of these flies are misclassified as bees all over the internet; apparently I only noticed because I’m one of the few bee researchers interested in machine learning.

I think this misclassification is polluting a lot of the images labelled as containing bees in the major datasets that people use for training. I would urge any computer scientist interested in a project involving insect identification to be extremely careful and to collaborate closely with entomologists before selecting training images from a dataset.

This also raises the question: how do you test the accuracy of your model if you don’t know whether the original labels have any basis in reality? I wonder how many other major categories in these datasets are mislabelled because the people putting the data together don’t have the domain knowledge to make these kinds of subtle distinctions.

Compiling C++ Code Using Caffe

As part of my PhD project, I have been writing a program in C++ to track hundreds of bees that I have tagged and to identify the pattern on each tag. Initially, I had thought that recognising the tags would be rather simple – I could threshold out the tags, which are reflective under IR light, and then measure the shape of the object within each tag. Unfortunately, the tags lost their reflectiveness after a short period of time and became smeared (amongst other issues), and thus I had to turn to machine learning to try to find a solution to my recognition problem.

To cut a long story short, the deep learning framework Caffe helped me to solve this problem after a lot of frustration (on the positive side, I did learn a lot about machine learning in the process). However, there is one thing I found surprisingly frustrating when using this framework: compiling your own standalone C++ program that links to Caffe. For anyone in the same situation as me, this is how to achieve it (keeping in mind that I was not using GPUs):

  1. At the top of the .cpp program that links to the Caffe library, add the following definition:
    #define CPU_ONLY
  2. If we try to compile anything at this point, Caffe will complain:
    caffe/proto/caffe.pb.h: No such file or directory
    Some of the header files are missing from the Caffe include directory, so you’ll need to generate them with these commands from within the Caffe root directory:
    protoc src/caffe/proto/caffe.proto --cpp_out=.
    mkdir include/caffe/proto
    mv src/caffe/proto/caffe.pb.h include/caffe/proto
  3. Finally, I copied libcaffe.so into /usr/lib and the caffe directory containing the header files ($caffe_root/include/caffe) into /usr/include. To compile on a Mac (after installing OpenBLAS with Homebrew), I just had to run:
    g++ classification.cpp -lcaffe -lglog -lopencv_core -lopencv_highgui -lopencv_imgproc -I /usr/local/Cellar/openblas/0.2.14_1/include -L /usr/local/Cellar/openblas/0.2.14_1/lib -o classifier
  4. Alternatively, you could do what I did on my Linux machine: instead of copying the header files, link directly to those directories when compiling:
    g++ classification.cpp -lcaffe -lglog -lopencv_core -lopencv_highgui -lopencv_imgproc -I ~/caffe/include -L ~/caffe/build/lib -I /usr/local/Cellar/openblas/0.2.14_1/include -L /usr/local/Cellar/openblas/0.2.14_1/lib -o classifier
Interesting Readings

Work has kept me pretty busy lately, but I’ve been meaning to put together another post with some of the interesting readings I’ve come across. The first thing I’ll mention is that the IEEE (Institute of Electrical and Electronics Engineers) has released its rankings of programming language popularity: Python (ranked #4) and R (ranked #6) have continued to climb, while Java remains at the top, which is to be expected given how popular it is in industry. The rest of the articles I’ve collected fall under Big Data, R, Python and Version Control.


Saving OpenCV matrices to disc and loading them

Recently I needed to save some matrices I had generated with OpenCV and C++. This snippet of code shows you how to do that and then open the files later to retrieve the matrices. You can save the files in either the .yml or the .xml format. In the example below, “trainData” and “trainLabels” are the two matrices I want to save:

cv::FileStorage file("tag_data.yml", cv::FileStorage::WRITE);

file << "data" << trainData;
file << "classes" << trainLabels;
file.release();

// later (in another program when you want to get the stored matrices)

cv::Mat trainDataCopy, trainLabelsCopy;

cv::FileStorage file("tag_data.yml", cv::FileStorage::READ);

file["data"] >> trainDataCopy;
file["classes"] >> trainLabelsCopy;

file.release();

Done. You should now be able to use both matrices.
