I recently ran a fresh install on my Mac and thought I’d take the opportunity to document the libraries and programs I find incredibly useful.
The Python libraries I’ll frequently pip3 install include:
The other day I was working with R in a Jupyter Notebook when I discovered that I needed to include multiple figures in the same plot.
Surprisingly, R doesn’t include this capability out of the box, so I went searching and found this function that does the job. I’ve included the code below for my own future reference in case the linked site ever disappears.
multiplot <- function(..., plotlist = NULL, file, cols = 1, layout = NULL) {
  library(grid)

  # Make a list from the ... arguments and plotlist
  plots <- c(list(...), plotlist)
  numPlots = length(plots)

  # If layout is NULL, then use 'cols' to determine layout
  if (is.null(layout)) {
    # Make the panel
    # ncol: Number of columns of plots
    # nrow: Number of rows needed, calculated from # of cols
    layout <- matrix(seq(1, cols * ceiling(numPlots/cols)),
                     ncol = cols, nrow = ceiling(numPlots/cols))
  }

  if (numPlots == 1) {
    print(plots[[1]])
  } else {
    # Set up the page
    grid.newpage()
    pushViewport(viewport(layout = grid.layout(nrow(layout), ncol(layout))))

    # Make each plot, in the correct location
    for (i in 1:numPlots) {
      # Get the i,j matrix positions of the regions that contain this subplot
      matchidx <- as.data.frame(which(layout == i, arr.ind = TRUE))

      print(plots[[i]], vp = viewport(layout.pos.row = matchidx$row,
                                      layout.pos.col = matchidx$col))
    }
  }
}
Then to call the function, you just have to pass it the plots:
multiplot(p1, p2, p3, p4, cols=2)
I recently helped a friend out with a dataset – she was struggling to merge two CSV files into a single dataframe in R. I thought this would be quite simple, and yet I could not get it to work with merge or dplyr – they just kept giving me weird results. The problem was that I was too trusting of data that had been entered by hand. Here’s what happened when I started to critically interrogate the data.
Firstly, I read in the CSV files – so far so good:
tree = read.csv("TreeData.csv", header = T)
full = read.csv("FullAnalysisRawData.csv", header = T)
Next I used the View command to see what the dataframes looked like in RStudio:
View(tree)
View(full)
Now in this particular dataset, I wanted to cbind (column-bind) the rows with the same TreeID and Sample.type from the two CSV files. The first error in the data I noticed was a duplicated record with the ID ‘B01(2)’ – so this was how I got rid of that row:
full_rm_dup <- full[!(full$TreeID == 'B01(2)'),]
Next, I noticed that the levels for the Sample.type factor were capitalised in one CSV file and not the other. The easiest way to fix this was to rename the factors in one of the files:
levels(tree$Sample.type) <- c("Bark", "Leaf")
Finally, I wrote a function that would go through each row in the first dataframe and look at the value for TreeID and Sample.type. It would then look for rows in the other dataframe that matched these two values. I decided to print out the resulting values from this like so:
merge_dfs <- function(each_row) {
  full_tree_id <- as.character(each_row[1])
  full_sample_type <- as.character(each_row[3])
  matching_row <- tree[(tree$TreeID == full_tree_id &
                        tree$Sample.type == full_sample_type),]
  print(paste(full_tree_id, matching_row$TreeID,
              full_sample_type, matching_row$Sample.type))
}

apply(full, 1, function(each_row) merge_dfs(each_row))
The apply function in R applied my function to every row (that’s what the number 1 did; if I’d wanted to apply it to every column I would have used 2 for this argument instead). When I looked at the output from this function, I saw that there were quite a few rows where the TreeID in one CSV file did not match any of the TreeIDs in the other file. I emailed all of this to the researcher, who now knew what was wrong with the dataset and could fix the mistakes that had occurred. With that done, the original merge function she was using worked perfectly.
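For future reference, once the keys line up, the merge itself needs nothing fancier than base R’s merge on the two key columns. Here’s a minimal sketch – the dataframes below are made-up stand-ins for the cleaned full_rm_dup and tree from above:

```r
# Hypothetical stand-ins for the cleaned dataframes
full_rm_dup <- data.frame(TreeID = c("A01", "A02"),
                          Sample.type = c("Bark", "Leaf"),
                          Height = c(12.1, 9.8))
tree <- data.frame(TreeID = c("A01", "A02"),
                   Sample.type = c("Bark", "Leaf"),
                   Mass = c(0.31, 0.18))

# Merge on both key columns; rows without a match in the other
# dataframe are dropped unless all = TRUE is supplied
combined <- merge(full_rm_dup, tree, by = c("TreeID", "Sample.type"))
```

dplyr’s inner_join with the same by argument would do the equivalent job.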
Moral of the story: never trust that the factors in your dataset are correct – capitalisations, duplications and input mistakes can occur really easily and can be quite subtle in large datasets.
Work has kept me pretty busy lately, but I’ve been meaning to put together another post with some of the interesting readings I’ve come across. The first thing I’ll mention is that the IEEE (Institute of Electrical and Electronics Engineers) has released its rankings of programming language popularity. Python (ranked #4) and R (ranked #6) have continued to increase in popularity, while Java remains at the top, which is to be expected given how popular it is in industry. Below are some of the other articles I’ve found:
Big Data
R
Python
Version Control
I thought I’d start a list of some code examples I’ve found online which enable you to perform parallel operations in R and take advantage of multi-core processors.
I’ll try to add to this list from time to time as I come across new examples.
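To start the list off, here’s a minimal sketch using the parallel package that ships with R. Note that mclapply relies on forking, so it only parallelises on Unix-alikes (Mac/Linux); on Windows you’d use makeCluster and parLapply instead:

```r
library(parallel)

# mclapply is a drop-in parallel version of lapply: here it squares
# each element of 1:8, spreading the calls across two cores
slow_square <- function(x) { Sys.sleep(0.01); x^2 }
results <- mclapply(1:8, slow_square, mc.cores = 2)
unlist(results)  # 1 4 9 16 25 36 49 64
```

Setting mc.cores = detectCores() uses every available core, though leaving one free keeps the machine responsive.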
Here’s a list of a few R tutorials (in addition to the one I wrote) which I’ve found useful, or which look useful:
Google Developers R Tutorials
A slightly Different Introduction to R
The R package ggplot2 is one of the best data visualisation tools I’ve come across, and while it simplifies generating impressive graphics, there’s still a fair bit to learn before you can use it well. Here are a few of the posts I’ve found really handy when using this package:
I’ll try to continue updating this list as I find other good resources.
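For quick reference, here’s the basic ggplot2 pattern that most of the posts above build on: map columns to aesthetics, then add geoms and labels. This sketch uses the built-in mtcars dataset:

```r
library(ggplot2)

# Scatterplot of car weight against fuel economy,
# coloured by the number of cylinders
p <- ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl))) +
  geom_point(size = 2) +
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon", colour = "Cylinders")
print(p)
```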
This tutorial is a beginner’s guide to getting started with R. Once you complete it, you should have R installed on your computer and be able to import data, perform basic statistical tests and create graphics.
The first thing you will have to do is download R and install it on your computer. To do this you’ll need to visit a CRAN (Comprehensive R Archive Network) repository. There are a number of sites you can find easily by searching; here in Australia it is hosted by the CSIRO here. When you visit the site you’ll be asked to click on the link to the R version for your computer (Linux, Mac, Windows). Once you do so, you can then proceed to download the software (Windows users should make sure to select the base version of R to install).
Once R is installed you’re ready to get going, although I would recommend installing one other piece of software before proceeding: RStudio, which may be found here. RStudio is a fantastic development environment for writing R scripts and is especially useful for beginners.
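Once both are installed, you can sanity-check your setup by running the whole tutorial workflow in miniature at the R console. The snippet below uses the built-in iris dataset as a stand-in for your own data – for a real project you’d call read.csv on your own file instead:

```r
# Import data (a built-in dataset here; read.csv("yourfile.csv") for your own)
data(iris)
summary(iris$Sepal.Length)

# A basic statistical test: do two species differ in sepal length?
tt <- t.test(Sepal.Length ~ Species,
             data = droplevels(subset(iris, Species != "virginica")))
print(tt)

# A simple graphic
hist(iris$Sepal.Length, main = "Sepal length", xlab = "cm")
```

If all three steps run without errors, your installation is working.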
Before I made the switch to developing on a Linux machine, I noticed that the Python module for calling R (RPy2) seemed to be having some problems on Windows. This gave me an excuse to play around with writing my own Python script to create and run an R script. As you’ll see in the code below, I’ve used the subprocess module to execute the R script I created and then pipe the results back into the terminal.
import subprocess

# rscript is the content of the R file we'll create
rscript = '''#!/usr/bin/Rscript
cat('This is a simple program\n')
NumOfIterations <- 10 # same as c(10)
for (i in 1:NumOfIterations) { # 1:10 is range
    cat(i, 'Hello world!')
    cat('\n')
}
'''

r_file_name = "example.r"
r_file = open(r_file_name, "w")
r_file.write(rscript)
r_file.close()

# Path to Rscript.exe on the Windows machine I was using at the time
ppath = r'C:\Program Files\R\R-2.15.2\bin\i386\Rscript.exe'

# Passing the arguments as a list avoids problems with the space in the path
proc = subprocess.Popen([ppath, r_file_name], stdout=subprocess.PIPE)
output = proc.stdout.read()
print(output)