Data scientist at Port Jackson Partners in Sydney, Australia. My PhD was in computational biology. In my spare time I write about medical research at BioSky.co.
This tutorial is a beginner’s guide to getting started with R. Once you complete it you should have R installed on your computer and be able to import data, perform basic statistical tests and create graphics.
The first thing you will have to do is download R and install it on your computer. To do this you’ll need to visit a CRAN (Comprehensive R Archive Network) repository. There are a number of mirror sites you can find easily by searching; here in Australia one is hosted by the CSIRO. When you visit the site you’ll be asked to click on the link to the R version for your computer (Linux, Mac, Windows). Once you do so, you can proceed to download the software (Windows users should make sure they select the base version of R to install).
Once R is installed, you’re ready to get going, although I would recommend installing one other piece of software before proceeding – RStudio which may be found here. RStudio is a fantastic development environment for writing R scripts and is especially useful for beginners.
As you can see from the image, my RStudio setup has the console (where data is output by R) on the top left, the source code editor on the top right, the workspace and history on the bottom left and the files, plots, packages and help section on the bottom right. I should note now that you can execute commands directly through the R console, but you cannot save any of this code – it has to be written in the source editor for you to save it. To achieve a similar layout to my setup (if that is what you’d like), select Tools>Options>Pane Layout and arrange the sections as shown in the image below.
Now that everything is installed, open RStudio up and select File>New>R Script. You should see in the top right pane that a blank box has appeared where you can input text. Save this into a folder of your choosing, making sure you end the name with the file type “.R” – for instance I saved my file as “learning.R”. You’re now ready to start writing scripts in R.
To get started we’re going to do a few simple equations directly in the R console (top left pane). Click next to the > and input the code below, pressing enter at the end of each line. I should note here that you do not have to write the “#” and the text that follows it (such as “# addition”). This is known as a comment, and R will ignore anything that comes after a “#” on a line. Comments are really useful once you start writing large scripts to remind you about the purpose of sections of your code. Although at the time they might seem like a bit of a hassle, you’ll thank yourself 6 months down the track when you dust off an old script and try to recall the way your previous analysis had worked.
3+2 # addition
4-1 # subtraction
10/2 # division
5*3 # multiplication
2^3 # powers
sqrt(36) # square root
exp(5) # exponential (e to the power of 5)
pi
After you have done this, your console should look a little like this:
Alternatively if you write this out in the source pane (top right), you’ll see that pressing enter allows you to drop down to the next line. To execute a line of code you need to hold down control and press enter. If you want to run multiple lines of code at once you should highlight them before holding down control and pressing enter. You should see the output appear in the console pane.
The next important thing to learn about R is that you can assign data or values to a variable using the “<-” operator. Variables are effectively placeholder names for values or the results of operations you perform. Try executing the following code:
data <- 25
more_data <- sqrt(36)
data # outputs the value for data
more_data # outputs the value for more_data
data * more_data # you can perform operations on variables
Variables are extremely useful, although there are a few things worth knowing about them:
- Give them meaningful names – don’t call all your variables var1, var2, etc. Give them names that help you refer to the data or operation they’re pointing to – for example mean_distance, fish_lengths etc.
- Notice how I used the underscore rather than a space between the words in the variable names. This is because R does not permit you to have spaces in variable names – use underscores or full stops instead (a hyphen would be read as a minus sign).
- Additionally, a variable name may not start with a number or an underscore, although it can have numbers, underscores and full stops in it after the first character, which should be a letter. (A name may technically start with a full stop, provided the next character is not a number, but this is best avoided.) For example the variable name “1chicken” is not valid but “chicken1” is.
- R is case-sensitive – so it treats the variable names Var1 and var1 as completely different.
- If you give a variable a value and then later assign the same name a new value, it will overwrite the old value.
To remove a variable, you need to use the remove function – “rm”:
rm(more_data) # remove variable
rm(data, more_data) # remove multiple variables
rm(list=ls()) # erase all data in workspace including variables
Vectors are simply a list of multiple values which you will often assign to a variable. For instance, if you were measuring the weights of dogs, you might have an Excel spreadsheet with a column with the title ‘weights’ with a series of values from your measurements such as “25, 10, 23, 12, 32, 10, 8, 3, 30”. So how would we write this in R?
dog_weight <- c(25, 10, 23, 12, 32, 10, 8, 3, 30)
So what has this done? The “c” in this code stands for the R combine function, and we have just told R to assign all these values to the variable “dog_weight”. So what can we find out about this dataset?
length(dog_weight) # number of measurements taken
mean(dog_weight) # mean
median(dog_weight) # median
var(dog_weight) # variance
sd(dog_weight) # standard deviation
range(dog_weight) # range
max(dog_weight)
min(dog_weight)
summary(dog_weight) # summary table
Something you may notice missing from this list is how to calculate the standard error for a dataset, however you can easily write this yourself as it is just the standard deviation of the data divided by the square root of the number of measurements taken:
sd(dog_weight) / sqrt(length(dog_weight))
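If you need the standard error often, the formula above can be wrapped in a small function. This is only a sketch: the name std_error is made up for illustration (it is not a built-in R function), and dropping missing values before counting is my own assumption about how you would want it to behave.

```r
# Hypothetical helper (not part of base R): standard error of the mean
std_error <- function(x) {
  x <- x[!is.na(x)]          # assumption: ignore missing values
  sd(x) / sqrt(length(x))    # standard deviation / sqrt(number of measurements)
}

dog_weight <- c(25, 10, 23, 12, 32, 10, 8, 3, 30)
std_error(dog_weight)        # same result as sd(dog_weight) / sqrt(length(dog_weight))
```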
If we wanted to select a value individually from a vector in R, we refer to it using the variable name and a number in brackets which describes its location. For instance:
dog_weight[1] # outputs first value: 25
dog_weight[1:3] # outputs the first 3 values
dog_weight[-1] # returns the complete vector except for the first element
It is also possible to rapidly create vectors in R if you just need a list of numbers by using the colon operator (optionally wrapped in the combine function) as shown below. However, this always increments in steps of 1, so if you want to alter that you’ll have to use the “seq” function:
a_list <- c(5:100) # creates vector with the whole numbers from 5 to 100
b_list <- seq(from=1, to=20, by=3) # seq function incrementing up by 3 each time
c_list <- seq(from=0, to=200, length.out=10) # R calculates the vector based on your request for 10 numbers between 0 and 200
d_list <- rep(100, times=25) # repeats the number 100, 25 times
It is worth noting that you cannot mix vector data types in R – in other words you can have a vector with numbers, logical values or strings (words) but you cannot have a vector with more than one of these types (if you try, R will silently convert all the values to a single type). Also, trying to combine two vectors into one results in them both combining to form one flat vector; you do not end up with a vector containing two separate vectors. If you do want to combine multiple data types together, you should use the “list” function:
a_list <- list(1, "Jack", TRUE, c(1,2,3,4))
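To see the difference between a vector and a list in practice, here is a small sketch. The variable names mixed and combined are just ones I have picked for illustration.

```r
# Mixing types in c() does not fail: everything is converted to one type
mixed <- c(1, "Jack", TRUE)
class(mixed)                 # "character"
mixed                        # "1" "Jack" "TRUE"

# Combining two vectors gives one flat vector, not a vector of vectors
combined <- c(c(1, 2), c(3, 4))
length(combined)             # 4

# A list keeps each element in its original type
a_list <- list(1, "Jack", TRUE, c(1, 2, 3, 4))
class(a_list[[1]])           # "numeric"
class(a_list[[2]])           # "character"
a_list[[4]]                  # the whole vector 1 2 3 4 is stored as one element
```

Note the double square brackets, which pull a single element out of a list.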
Vector operations allow you to perform calculations on the elements of different vectors. In the examples below, each element in “vector_a” is multiplied by the element in the same position in “vector_b”, so the first result is 5 * 5 = 25. You can also do this with division, addition etc.
vector_a <- c(5, 2, 3, 6)
vector_b <- c(5, 12, 13, 2)
vector_a * vector_b
vector_a * 2 # every element in the vector is multiplied by 2
To create a matrix in R, you use the matrix function:
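A minimal sketch of the matrix function, with values and variable names chosen purely for illustration:

```r
# Fill a 2 x 3 matrix with the numbers 1 to 6 (filled column by column by default)
m <- matrix(1:6, nrow = 2, ncol = 3)
m
dim(m)            # 2 3 (rows, columns)
m[1, 2]           # row 1, column 2 -> 3

# byrow = TRUE fills the matrix row by row instead
m_by_row <- matrix(1:6, nrow = 2, byrow = TRUE)
m_by_row[1, 2]    # 2

t(m)              # transpose: swaps rows and columns
```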
Before I move on, I should note that R has a couple of help commands built in which you can use when trying to work out what a function does. So for instance if you wanted to learn more about square roots or means, you could try the code below.
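As a minimal sketch, the built-in help commands look like this (the search phrase is just an example):

```r
?sqrt                        # open the help page for the sqrt function
help(mean)                   # the same thing written as a function call
help.search("square root")   # search the installed help pages for a phrase
```

Use the first two when you already know the name of the function, and help.search when you only know what you are trying to do.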
There is also a search engine called R Seek where you can search for code examples or explanations of different functions. This site is very useful because it can be hard searching for R-related questions via Google because the name of the programming language is just one letter.
Most of the time when you’re using R you’ll be working off an Excel spreadsheet you’ve been using to collate your data. Luckily, R allows you to easily import the data into the program. However, rather than importing your Excel spreadsheet directly in, you should convert it into a file type known as CSV (Comma Separated Values). This effectively flattens out any equations etc that you’ve been using in your spreadsheet so you don’t have to worry about R importing your data incorrectly. To save your Excel spreadsheet in this format, open it up in Excel and go to File>Save As. Select the CSV option which should have the file type “.csv”. Once you save it, you’re ready to import your data into R.
Now that you’re going to be importing data directly into R, you’re going to need to set up a working directory. This is the directory (or folder) that R will automatically look for files in. For instance, if you saved your CSV file in a folder called Scripts on the desktop, the path to this directory (on a Windows computer) would be “C:\Users\Jack Simpson\Desktop\Scripts”. Notice how in Windows the path is separated by the backslash (“\”); those of you familiar with a Mac or Linux machine will know that they use a forward slash (“/”) instead. This is what R elects to go with when referring to directories, so I’ll have to change the path of my working directory to fit in with that: “C:/Users/Jack Simpson/Desktop/Scripts”. Now that this is ready, I can input the set working directory command (“setwd”) into R.
In this folder I have placed my CSV file, which I can now import into R using the following code:
setwd("C:/Users/Jack Simpson/Desktop/Scripts")
bee_data <- read.csv("bees.csv", header = T)
This code tells R to assign to the variable bee_data all of the data in the CSV file “bees.csv”. The “header = T” part tells R that each column has a header on top of it (in this case the two headers are “Site” and “Hives” – I would recommend you take a quick look at the CSV file I’ve provided before opening it in R). If you did not have headers for each column you would replace the “T” with an “F”.
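If you don’t have the CSV file to hand, the sketch below writes a tiny CSV to a temporary folder and reads it back, so you can see what the “header” option does. The file name bees_example.csv and all of the values in it are made up for illustration; they are not the contents of bees.csv.

```r
# Made-up values for illustration only (not the real bees.csv)
csv_path <- file.path(tempdir(), "bees_example.csv")
writeLines(c("Site,Hives",
             "Brisbane,12",
             "Brisbane,15",
             "Canberra,8",
             "Canberra,10"), csv_path)

# header = TRUE: the first line is used for the column names
bee_example <- read.csv(csv_path, header = TRUE)
names(bee_example)    # "Site" "Hives"
nrow(bee_example)     # 4

# header = FALSE: the first line is treated as data and R invents column names
no_header <- read.csv(csv_path, header = FALSE)
names(no_header)      # "V1" "V2"
nrow(no_header)       # 5
```

Writing TRUE and FALSE in full is a little safer than T and F, because T and F are ordinary variables that can be overwritten.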
So now we have our CSV file in R, it’s often worth doing a couple of short checks to see that everything has worked out alright:
head(bee_data, n=9)
tail(bee_data, n=9)
names(bee_data) # names of variables
str(bee_data) # structure of data
summary(bee_data)
The “head” function instructs R to display the values for the first few rows of each column in the dataset (“tail” does the same for the last few rows). You can leave out the comma and “n=9” part if you wish and it will show the default of 6 rows, or alternatively you can instruct it to show more or fewer than 9. The “summary” function gives you a broad overview of the data such as the mean, median etc.
Now say I wanted to refer to a column of data specifically. You can do this by stating the variable the dataset is named under (in this case “bee_data”), followed by a dollar sign (“$”) and then the title of the column. So if I wanted to see all the values in my “Hives” column I would input bee_data$Hives.
It is worth noting here again that R is a case-sensitive language and as the title of my column “Hives” started with a capital, I had to ensure that I continued to use a capital when referring to it in this way, otherwise R would not understand what I was pointing to.
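Here is a small sketch of the dollar operator and of what happens when the capitalisation is wrong. The data frame bee_example and its values are made up for illustration.

```r
# Made-up data frame for illustration
bee_example <- data.frame(Site  = c("Brisbane", "Canberra"),
                          Hives = c(12, 8))

bee_example$Hives    # 12 8
bee_example$hives    # NULL: there is no column called "hives"
```

Notice that R does not raise an error for the misspelt column, it just quietly returns NULL, so this kind of mistake can be easy to miss.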
Sometimes you might end up with missing data for a row and you may have to write in “NA” for that value. There is a command in R to explicitly remove these values, for instance if you were trying to calculate the mean:
mean(bee_data$Hives, na.rm = TRUE)
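To see why this matters, here is a quick sketch with a made-up vector containing one missing value:

```r
hives <- c(12, NA, 8, 10)    # made-up counts with one missing value

mean(hives)                  # NA: one missing value makes the whole result missing
mean(hives, na.rm = TRUE)    # 10: the NA is dropped before calculating the mean
sum(is.na(hives))            # 1: counts how many values are missing
```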
You can count how often each combination of values occurs in the columns of a dataset with the “table” command. Although it’s not really useful here, in a very large dataset this could be handy:
split_data <- table(bee_data$Site, bee_data$Hives) # create new table
You now have a variable (“split_data”) which holds a count for each combination of values from the two columns specified above.
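A small sketch of what “table” returns, using a made-up data frame (the values are for illustration only):

```r
# Made-up data for illustration
bee_example <- data.frame(
  Site  = c("Brisbane", "Brisbane", "Canberra", "Canberra", "Canberra"),
  Hives = c(12, 12, 8, 10, 8)
)

table(bee_example$Site)    # how many rows there are for each site

# One row per site, one column per hive count, each cell is a count of rows
site_by_hives <- table(bee_example$Site, bee_example$Hives)
site_by_hives
site_by_hives["Canberra", "8"]   # 2 rows are from Canberra with 8 hives
```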
If I wanted to compare the number of hives at the two sites in my dataset (Brisbane and Canberra), I could split them into their own individual subsets of data with the following code:
brisbane <- subset(bee_data, bee_data$Site == "Brisbane")
canberra <- subset(bee_data, bee_data$Site == "Canberra")
Now the variable brisbane only contains the rows recorded at that site and the variable canberra only contains the rows recorded in Canberra. What the above code did was use the dollar operator to refer to a column specifically (in this case the “Site” column), and the “==” is known as a comparison operator – it keeps only the rows which have the exact word “Brisbane” in this column. R has a number of comparison operators (==, !=, <, >, <=, >=). These are (in order): equal to, not equal to, less than, greater than, less than or equal to, greater than or equal to.
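Comparison operators return TRUE or FALSE for every element, and you can use that result to pick values out of a vector. A short sketch with made-up values:

```r
# Made-up values for illustration
hives <- c(12, 15, 8, 10)
site  <- c("Brisbane", "Brisbane", "Canberra", "Canberra")

hives > 10                 # TRUE TRUE FALSE FALSE
hives[hives > 10]          # 12 15
site != "Brisbane"         # FALSE FALSE TRUE TRUE

# Conditions can be combined with & (and) or | (or)
hives[site == "Canberra" & hives >= 10]   # 10
```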
Now we’ve separated out the Brisbane and Canberra hives we can compare them by looking at their means or standard deviation. In the next section we’ll actually perform some proper statistical tests on this data.
mean(brisbane$Hives)
mean(canberra$Hives)
sd(brisbane$Hives)
sd(canberra$Hives)
Here are a few simple statistical tests that you can use R for. In later posts, where I can go into more detail, I’ll be able to cover more advanced statistics.
One sample t-test
According to the Australian Bureau of Statistics, the average number of children in a family is 1.9. Say we didn’t believe this was true and surveyed 25 couples and asked them how many children they had. For this test the mu is 1.9 so the null hypothesis is that the average number of children is 1.9, while the alternative hypothesis is that the average is different (two tailed test).
children <- c(2, 2, 2, 2, 2, 1, 2, 2, 1, 1, 2, 2, 2, 2, 2, 1, 1, 2, 2, 2, 1, 1, 2, 2, 2)
mean(children)
t.test(children, mu=1.9)
The survey found that each couple had on average 1.72 children, but we had to perform the t-test to see if this was a statistically significant difference from 1.9. With a p-value of 0.06122, it was not significant at the 0.05 level.
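If you are curious where the p-value comes from, the sketch below reproduces the calculation by hand using the usual one sample t-test formula. The variable names are my own, and this is only meant to illustrate the idea rather than to replace “t.test”.

```r
children <- c(2, 2, 2, 2, 2, 1, 2, 2, 1, 1, 2, 2, 2, 2, 2, 1, 1, 2, 2, 2, 1, 1, 2, 2, 2)
mu <- 1.9                                   # value stated in the null hypothesis
n  <- length(children)

# t statistic: distance of the sample mean from mu, measured in standard errors
t_stat <- (mean(children) - mu) / (sd(children) / sqrt(n))

# two tailed p-value from the t distribution with n - 1 degrees of freedom
p_value <- 2 * pt(-abs(t_stat), df = n - 1)

t_stat
p_value                                     # matches the p-value reported by t.test
```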
Now this version of the t-test which only requires the data and the mu makes use of several default options. If we were to write out the code for this test explicitly stating each option it would look like this:
t.test(children, alternative = "two.sided", mu=1.9, conf.level=0.95)
Here we specify that the test has two tails and has a confidence level of 0.95. As I stated, this is the default setting, however it can be useful to write out all the options explicitly to remind yourself in case you ever need to change one of the parameters. If we wanted to do a one tailed test we could write it out like this:
t.test(children, alternative = "less", mu=1.9)
Two sample t-tests
For this test I’m going to use the bee CSV dataset mentioned earlier in this tutorial to compare the number of hives recorded at sites in Brisbane and Canberra.
Again, just as with the one sample t-test there is a short way to perform the test where you rely on the default assumptions built into R, or you can explicitly state the default options as I have also done below:
t.test(brisbane$Hives, canberra$Hives) # short version relying on the defaults
t.test(brisbane$Hives, canberra$Hives, alternative = "two.sided", mu=0, paired=FALSE, var.equal=FALSE, conf.level=0.95) # defaults stated explicitly
Notice how the “paired” option has been set to FALSE (without quotation marks, as it is a logical value rather than a word) – if you wanted to perform a paired t-test you would set that value to TRUE instead.
Single Factor ANOVA
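A single factor (one-way) ANOVA compares the means of more than two groups. The following is a minimal sketch using R’s “aov” function. The hive counts and the three site names are made up for illustration and are not part of the bee dataset used above.

```r
# Made-up hive counts for three hypothetical sites (illustration only)
anova_example <- data.frame(
  Site  = factor(rep(c("SiteA", "SiteB", "SiteC"), each = 4)),
  Hives = c(10, 12, 11, 13,  20, 22, 21, 23,  30, 31, 29, 32)
)

# Null hypothesis: the mean number of hives is the same at every site
fit <- aov(Hives ~ Site, data = anova_example)
summary(fit)      # ANOVA table with the F statistic and p-value

# If the ANOVA is significant, Tukey's test shows which pairs of sites differ
TukeyHSD(fit)
```

The formula “Hives ~ Site” reads as “hives explained by site”, so the measurements go on the left and the grouping column goes on the right.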
One of the strengths of R is the ease and control it gives you when creating brilliant graphics to visualise your data.
To create a boxplot you can add as many different sets of data as you want to show in the one graph. For instance the first line of code below will create a graphic with a single boxplot while the second will create a graphic with two:
boxplot(brisbane$Hives)
boxplot(brisbane$Hives, canberra$Hives)
Now you’ve seen how easily you can create graphics in R, I’ll give you a taste of how much control R can give you. The first line of the code below will generate a histogram with the default settings while the second line will allow you to specify the labels and how the data frequency is broken up – the “breaks” argument.
hist(brisbane$Hives)
hist(brisbane$Hives, breaks=5, freq=NULL, main='Graph Title', xlab = 'Hive Numbers', ylab = 'Frequency')
To generate a simple scatter plot you just need to use the “plot” function:
plot(brisbane$Hives)
However, let’s customise things a little. We’ll add our labels to the graph and give it some colour:
plot(brisbane$Hives, main='Hives by site', xlab = 'Hive Number', ylab = 'Number of hives', pch=2, col = 'blue', lwd = 1, cex = 1.5)
Here is what the new parameters do:
- pch: Shape of each point – 0 makes them squares but there are multiple different shapes you can set with numbers.
- col: Sets the colours of the points.
- lwd: How thick the lines drawing out each point are, the higher the number the thicker the line.
- cex: How big you want the symbol for each point to be.
We can also add additional points and lines to the scatterplot once it has been created too:
plot(brisbane$Hives, main='Hives by site', xlab = 'Hive Number', ylab = 'Number of hives', pch=0, col = 'blue', lwd = 1, cex = 1.5)
points(canberra$Hives, pch=1, col='green')
lines(x=c(0,6), y=c(12,12), lwd = 2, lty='dashed', col = 'blue')
The lines section of the code above may look a little confusing. All I’m doing is telling the program how far along the x axis to run the line – from hive number 0 to 6 and how high to start up the y axis at both the beginning and finishing sites for the line. “lty” refers to the type of line – which was set to “dashed” although you can easily set it as something else such as “solid”.
You also do not necessarily have to use just points when creating a plot, you can tell R to use lines instead:
plot(canberra$Hives, type='l', lty='dashed', lwd = 3, col = 'red')
As you can see you can select the type of plot you want manually – by default it is set to ‘p’ for points, although you can change this to an ‘l’ for lines or ‘b’ if you want points and lines.
I’d like to give one more example of a scatterplot which I used to visualise some of the measurements I’d been taking of the width and length of the body of a parasitic mite I’ve been studying. I ended up with two columns of data, with each row specifying the length and width of one mite. I wanted to compare the measurement distribution of mites parasitising the Asian honeybee (Apis cerana) and the European honeybee (Apis mellifera). The first thing I did was perform a log10 transformation on all the sets of data:
mel_loglen <- log10(melliferadat$Idiosoma_Length)
mel_logwid <- log10(melliferadat$Idiosoma_Width)
cer_loglen <- log10(ceranadat$Idiosoma_Length)
cer_logwid <- log10(ceranadat$Idiosoma_Width)
I then plotted the transformed variables:
plot(mel_loglen, mel_logwid, pch=0, col = 'blue', lwd = 1, main='Varroa jacobsoni body sizes', xlab = 'log10(length)', ylab = 'log10(width)')
points(cer_loglen, cer_logwid, pch=1, col = 'red', lwd = 1)
This was the result (note that although I only plotted the data for two populations in the code above, in the example below I also included the dataset from a third population, hence why you can see three different types of points):
Packages are collections of code and data that you can install in R (or that come pre-installed), which you can use no differently than if you had written them yourself. To load a dataset that comes with a package, you need to use the “data” command:
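As a minimal sketch, here is the “data” command used on the iris dataset, which ships with R in the pre-installed datasets package. I have picked iris only as a convenient example.

```r
data(iris)                    # load the iris dataset into the workspace
head(iris)                    # first 6 rows
names(iris)                   # column names
summary(iris$Sepal.Length)    # overview of a single column
```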
You can now use this data as you would any other dataset – for instance you could use the “head” or “summary” functions to examine the data. To view all the packages currently installed on your computer, you need to use the “library” command:
library() # see installed packages on computer
To install a package on your computer you just need to find out the name of the package and you can install it in an R script:
install.packages("seqinr") # installs seqinr package
I wrote this tutorial as much for myself as for anyone else so that I could have an easy resource to access when I needed to remember how to do something in R. While I was writing this guide I drew upon a range of materials that were provided to me in books and during workshops. I’d like to thank Dr David Schoeman (USC), Dr Anthony Richardson (UQ) and Dr Bill Venables (CSIRO) whose R workshops I attended and learned so much from.