One of the things you realise once you start learning about machine learning is just how important a well-annotated training dataset is. Your predictive model will only ever be as good as the labelled data you originally gave it.
A little while back, when I was trying to learn how to use the deep learning library Caffe, I started watching a series of educational webinars that Nvidia had released for free. I can’t recommend these videos enough if you want to learn more about deep learning – I learned a huge amount from watching them. During the course, however, I noticed that one of the images used as an example of the “bee” category in the data looked like this:
Despite the black and yellow stripes, this insect is not a bee; it is actually a hoverfly. I quickly found that images of these flies are misclassified as bees all over the internet – I apparently only noticed because I’m one of the few bee researchers interested in machine learning.
I think this misclassification is polluting a lot of the images labelled as containing bees in the major datasets people use for training. I would urge any computer scientist working on a project involving insect identification to be extremely careful, and to collaborate closely with entomologists before selecting training images from a dataset.
This also raises the question: how do you test the accuracy of your model when you can’t be sure the original labels have any basis in reality? I wonder how many other major categories in these datasets are mislabelled because the people putting the data together lack the domain knowledge to make these kinds of subtle distinctions.
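One pragmatic answer is to have a domain expert re-audit a random sample of the labels and estimate the error rate directly, before trusting any accuracy figures computed against them. A minimal sketch of that idea – all names and numbers here are hypothetical, not drawn from any real dataset:

```python
def estimate_label_error(audited_labels, claimed_label="bee"):
    """Fraction of an expert-audited sample whose true label
    disagrees with the label the dataset claims for it."""
    errors = sum(1 for lab in audited_labels if lab != claimed_label)
    return errors / len(audited_labels)

# Suppose an entomologist re-checks 20 images the dataset calls "bee"
# and finds that 4 of them are actually hoverflies (hypothetical numbers).
audit = ["bee"] * 16 + ["hoverfly"] * 4
print(estimate_label_error(audit))  # 0.2
```

If the estimated error rate is non-trivial, any “test accuracy” measured against those same labels has roughly that much noise baked into it.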