I recently ran a fresh install on my Mac and thought I’d take the opportunity to document the libraries and programs I find incredibly useful.
The Python libraries I’ll frequently pip3 install include:
The other day I was working with R in a Jupyter Notebook when I discovered that I needed to include multiple figures in the same plot.
Surprisingly, R doesn’t include this capability out of the box, so I went searching and found this function that does the job. I’ve included the code below for my own future reference in case the linked site ever disappears.
multiplot <- function(..., plotlist = NULL, file, cols = 1, layout = NULL) {
  library(grid)

  # Make a list from the ... arguments and plotlist
  plots <- c(list(...), plotlist)
  numPlots = length(plots)

  # If layout is NULL, then use 'cols' to determine layout
  if (is.null(layout)) {
    # Make the panel
    # ncol: Number of columns of plots
    # nrow: Number of rows needed, calculated from # of cols
    layout <- matrix(seq(1, cols * ceiling(numPlots/cols)),
                     ncol = cols, nrow = ceiling(numPlots/cols))
  }

  if (numPlots == 1) {
    print(plots[[1]])
  } else {
    # Set up the page
    grid.newpage()
    pushViewport(viewport(layout = grid.layout(nrow(layout), ncol(layout))))

    # Make each plot, in the correct location
    for (i in 1:numPlots) {
      # Get the i,j matrix positions of the regions that contain this subplot
      matchidx <- as.data.frame(which(layout == i, arr.ind = TRUE))

      print(plots[[i]], vp = viewport(layout.pos.row = matchidx$row,
                                      layout.pos.col = matchidx$col))
    }
  }
}
Then to call the function, you just have to pass it the plots:
multiplot(p1, p2, p3, p4, cols=2)
I recently helped a friend out with a dataset – she was struggling to merge two CSV files into a single dataframe in R. I thought this would be quite simple, and yet I could not get it to work with merge or dplyr – they just kept giving me weird results. The problem was that I was too trusting of data that had been entered by hand. Here’s what happened when I started to critically interrogate the data.
Firstly, I read in the CSV files – so far so good:
tree = read.csv("TreeData.csv", header = T)
full = read.csv("FullAnalysisRawData.csv", header = T)
Next I used the View command to see what the dataframes looked like in RStudio:
View(tree)
View(full)
Now in this particular dataset, I wanted to cbind (column-bind) the rows with the same TreeID and Sample.type from the two CSV files. The first error in the data I noticed was a duplicated record with the ID ‘B01(2)’ – so this was how I got rid of that row:
full_rm_dup <- full[!(full$TreeID == 'B01(2)'),]
Next, I noticed that the levels for the Sample.type factor were capitalised in one CSV file and not the other. The easiest way to fix this was to rename the factors in one of the files:
levels(tree$Sample.type) <- c("Bark", "Leaf")
Finally, I wrote a function that would go through each row in the first dataframe and look at the value for TreeID and Sample.type. It would then look for rows in the other dataframe that matched these two values. I decided to print out the resulting values from this like so:
merge_dfs <- function(each_row) {
  full_tree_id <- as.character(each_row[1])
  full_sample_type <- as.character(each_row[3])
  matching_row <- tree[(tree$TreeID == full_tree_id &
                        tree$Sample.type == full_sample_type),]
  print(paste(full_tree_id, matching_row$TreeID,
              full_sample_type, matching_row$Sample.type))
}

apply(full, 1, function(each_row) merge_dfs(each_row))
The apply function in R applied my function to every row (that’s what the number 1 did; if I’d wanted to apply it to every column I would have used 2 for this argument instead). When I looked at the output from this function, I saw that there were quite a few rows where the TreeID in one CSV file did not match any of the TreeIDs in the other file. I emailed all of this to the researcher, who now knew what was wrong with the dataset and could fix the mistakes that had occurred. With that done, the original merge function she was using worked perfectly.
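For future reference, once the keys line up, the merge itself needs nothing fancier than base R’s merge on the two key columns. Here’s a minimal sketch – the dataframes below are made-up stand-ins for the cleaned full_rm_dup and tree from above:

```r
# Hypothetical stand-ins for the cleaned dataframes
full_rm_dup <- data.frame(TreeID = c("A01", "A02"),
                          Sample.type = c("Bark", "Leaf"),
                          Height = c(12.1, 9.8))
tree <- data.frame(TreeID = c("A01", "A02"),
                   Sample.type = c("Bark", "Leaf"),
                   Mass = c(0.31, 0.18))

# Merge on both key columns; rows without a match in the other
# dataframe are dropped unless all = TRUE is supplied
combined <- merge(full_rm_dup, tree, by = c("TreeID", "Sample.type"))
```

dplyr’s inner_join with the same by argument would do the equivalent job.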
Moral of the story: never trust that the factors in your dataset are correct – capitalisations, duplications and input mistakes can occur really easily and can be quite subtle in large datasets.
Work has kept me pretty busy lately, but I’ve been meaning to put together another post with some of the interesting readings I’ve come across. The first thing I’ll mention is that the IEEE (Institute of Electrical and Electronics Engineers) has released its rankings of programming language popularity. Python (ranked #4) and R (ranked #6) have continued to increase in popularity, while Java remains at the top, which is to be expected given how popular it is in industry. Below are some of the other articles I’ve found:
Big Data
R
Python
Version Control
I thought I’d start a list of some code examples I’ve found online which enable you to perform parallel operations in R and take advantage of multi-core processors.
I’ll try to add to this list from time to time as I come across new examples.
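To start the list off, here’s a minimal sketch using the parallel package that ships with R. Note that mclapply relies on forking, so it only parallelises on Unix-alikes (Mac/Linux); on Windows you’d use makeCluster and parLapply instead:

```r
library(parallel)

# mclapply is a drop-in parallel version of lapply: here it squares
# each element of 1:8, spreading the calls across two cores
slow_square <- function(x) { Sys.sleep(0.01); x^2 }
results <- mclapply(1:8, slow_square, mc.cores = 2)
unlist(results)  # 1 4 9 16 25 36 49 64
```

Setting mc.cores = detectCores() uses every available core, though leaving one free keeps the machine responsive.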
Here’s a list of a few R tutorials (in addition to the one I wrote) which I’ve found useful, or which look useful:
Google Developers R Tutorials
A slightly Different Introduction to R
The R package ggplot2 is one of the best data visualisation tools I’ve come across, and while it simplifies generating impressive graphics, there’s still a fair bit to learn before you can use it well. Here are a few of the posts I’ve found really handy when using this package:
I’ll try to continue updating this list as I find other good resources.
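For quick reference, here’s the basic ggplot2 pattern that most of the posts above build on: map columns to aesthetics, then add geoms and labels. This sketch uses the built-in mtcars dataset:

```r
library(ggplot2)

# Scatterplot of car weight against fuel economy,
# coloured by the number of cylinders
p <- ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl))) +
  geom_point(size = 2) +
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon", colour = "Cylinders")
print(p)
```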
This tutorial is a beginner’s guide to getting started with R. Once you complete it, you should have R installed on your computer and be able to import data, perform basic statistical tests and create graphics.
The first thing you will have to do is download R and install it on your computer. To do this you’ll need to visit a CRAN (Comprehensive R Archive Network) repository. There are a number of sites you can find easily by searching; here in Australia it is hosted by the CSIRO here. When you visit the site you’ll be asked to click on the link to the R version for your computer (Linux, Mac, Windows). Once you do so, you can then proceed to download the software (Windows users should make sure to select the base version of R to install).
Once R is installed you’re ready to get going, although I would recommend installing one other piece of software before proceeding: RStudio, which may be found here. RStudio is a fantastic development environment for writing R scripts and is especially useful for beginners.
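Once both are installed, you can sanity-check your setup by running the whole tutorial workflow in miniature at the R console. The snippet below uses the built-in iris dataset as a stand-in for your own data – for a real project you’d call read.csv on your own file instead:

```r
# Import data (a built-in dataset here; read.csv("yourfile.csv") for your own)
data(iris)
summary(iris$Sepal.Length)

# A basic statistical test: do two species differ in sepal length?
tt <- t.test(Sepal.Length ~ Species,
             data = droplevels(subset(iris, Species != "virginica")))
print(tt)

# A simple graphic
hist(iris$Sepal.Length, main = "Sepal length", xlab = "cm")
```

If all three steps run without errors, your installation is working.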
Before I made the switch to developing on a Linux machine, I noticed that the Python module for calling R (RPy2) seemed to be having some problems on Windows. This gave me an excuse to play around with writing my own Python script to create and run an R script. As you’ll see in the code below, I’ve used the subprocess module to execute the R script I created and then pipe the results back into the terminal.
import subprocess

# rscript is the content of the R file we'll create
rscript = '''#!/usr/bin/Rscript
cat('This is a simple program\n')
NumOfIterations <- 10 # same as c(10)
for (i in 1:NumOfIterations) { # 1:10 is range
    cat(i, 'Hello world!')
    cat('\n')
}
'''

r_file_name = "example.r"
r_file = open(r_file_name, "w")
r_file.write(rscript)
r_file.close()

# Path to Rscript.exe on the Windows machine I was using at the time
ppath = r'C:\Program Files\R\R-2.15.2\bin\i386\Rscript.exe'

# Passing the arguments as a list avoids problems with the space in the path
proc = subprocess.Popen([ppath, r_file_name], stdout=subprocess.PIPE)
output = proc.stdout.read()
print(output)