Never trust your factors

I recently helped a friend out with a dataset – she was struggling to merge the CSV files from two dataframes in R into one dataframe. I thought this would be quite simple and yet could not get it to work with merge or dplyr – it just kept giving me weird results. The problem was that I was too trusting of the data that was input by human hand. Here’s what happened when I started to critically interrogate the data.

Firstly, I read in the CSV files – so far so good:

tree = read.csv("TreeData.csv", header = T)
full = read.csv("FullAnalysisRawData.csv", header = T)

Next I used the View command to see what the dataframes looked like in RStudio:

View(tree)
View(full)

Now in this particular dataset, I wanted to cbind (add the columns) of the rows with the same TreeID and Sample.type from the different CSV files. The first error in the data I noticed was that there was a duplicated record with the ID ‘B01(2)’ – so this was how I got rid of that row:

full_rm_dup <- full[!(full$TreeID == 'B01(2)'),]

Next, I noticed that the levels for the Sample.type factor were capitalised in one CSV file and not the other. The easiest way to fix this was to rename the factors in one of the files:

 
levels(tree$Sample.type) <- c("Bark", "Leaf")

Finally, I wrote a function that would go through each row in the first dataframe and look at the value for TreeID and Sample.type. It would then look for rows in the other dataframe that matched these two values. I decided to print out the resulting values from this like so:

merge_dfs <- function(each_row) {
 full_tree_id <- as.character(each_row[1])
 full_sample_type <- as.character(each_row[3])
 matching_row <- tree[(tree$TreeID == full_tree_id & tree$Sample.type == full_sample_type),]
 print(paste(full_tree_id, matching_row$TreeID, full_sample_type, matching_row$Sample.type))
}

apply( full, 1, function(each_row) merge_dfs(each_row))

The apply function in R applied my function to every row (that’s what the number 1 did, if I’d wanted to apply it to every column I would have used 2 for this argument instead). When I looked at the output from this function, I saw that there were quite a few rows where the TreeID of one CSV file did not match of the TreeIDs in the other file. I emailed all this to the researcher who now knew what was wrong with the dataset and could fix the mistakes that had occurred. With that done, the original merge function she was using worked perfectly.

Moral of the story: never trust that the factors in your dataset are correct – capitalisations, duplications and input mistakes can occur really easily and can be quite subtle in large datasets.

The following two tabs change content below.
Computational biology PhD candidate at the Australian National University. I love writing (both articles and software), learning more about the world around us, and beekeeping. I also write for BioSky.co

Latest posts by Jack Simpson (see all)

Comments are closed.