Best practices for data science with the Jupyter Notebook

I recently listened to a really interesting talk by Jonathan Whitmore in which he discussed his company’s approach to working with data in the Jupyter Notebook. I’d recommend watching it, but I’ve made a brief summary below for my own future reference.

Notebook Types

Use Jupyter notebooks to collaborate and share data/analyses amongst teams and clients. Utilise two types of notebooks:

1. Lab Notebooks

  • Use to keep a record of work such as exploratory analysis
  • You don’t change/update it – it’s a historical record for each person
  • Naming notebooks
    • [Date]-[Initials]-[2-4 word description].ipynb
    • 2016-12-18-JS-iris-dataset-exploration.ipynb
  • Split a notebook when it reaches a certain size, or by topic
  • Example notebook:
    • Title: purpose of notebook
    • What is in the notebook
    • What you were trying to achieve/analyse/hypotheses
    • Can say whether different analyses worked or were a dead end
    • Import libraries, use magics
    • Use the version_information package (its %version_information magic) to output version numbers of libraries
    • Import data
    • Can link back to deliverable notebooks you’ve built on – e.g. notebook explaining how data was cleaned
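
The naming convention above is easy to automate. Here is a minimal sketch of a helper that builds a lab notebook file name from a date, initials and a short description. The function name lab_notebook_name and the 2-4 word check are my own assumptions for illustration, not something from the talk.

```python
import re
from datetime import date


def lab_notebook_name(day: date, initials: str, description: str) -> str:
    """Build a '[Date]-[Initials]-[2-4 word description].ipynb' file name."""
    # Lower-case the description and keep only its alphanumeric words.
    words = re.findall(r"[a-z0-9]+", description.lower())
    if not 2 <= len(words) <= 4:
        raise ValueError("description should be 2-4 words")
    # ISO dates (YYYY-MM-DD) make the notebooks sort chronologically.
    return f"{day.isoformat()}-{initials.upper()}-{'-'.join(words)}.ipynb"


print(lab_notebook_name(date(2016, 12, 18), "JS", "iris dataset exploration"))
# prints 2016-12-18-JS-iris-dataset-exploration.ipynb
```

Putting the date first in ISO format means a plain directory listing shows each person's notebooks in the order they were written.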

2. Deliverable Notebooks

  • Notebooks you’ll want to reference in the future
  • Processing and cleaning raw data: record of transformation
  • Use as evidence of analysis when making pull requests
  • Used and shared by entire team

Directory Organisation

  • data [backed up outside of version control]
  • deliver [notebooks to deliver/continually use]
  • develop [lab notebooks]
  • figures [where your figures are stored]
  • src [scripts/modules]
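
If you want to set this layout up for a new project, something like the following would do it. This is only a sketch: the create_project_layout function is hypothetical, and the demo runs in a temporary directory so it doesn't touch any real project.

```python
import tempfile
from pathlib import Path

# Folder names and purposes, taken from the list above.
PROJECT_DIRS = {
    "data": "backed up outside of version control",
    "deliver": "notebooks to deliver/continually use",
    "develop": "lab notebooks",
    "figures": "where your figures are stored",
    "src": "scripts/modules",
}


def create_project_layout(root):
    """Create the project folders under root and return their paths."""
    root = Path(root)
    created = []
    for name in PROJECT_DIRS:
        folder = root / name
        # exist_ok=True makes it safe to re-run on an existing project.
        folder.mkdir(parents=True, exist_ok=True)
        created.append(folder)
    return created


# Demo in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    for folder in create_project_layout(tmp):
        print(f"{folder.name:8} [{PROJECT_DIRS[folder.name]}]")
```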

Teamwork/Version Control Recommendations

  • Each data scientist has a dev branch that they push to daily
  • Merge to master via pull request
  • Commit the .ipynb, .py and .html versions of each notebook, plus its figures (so the work is saved 3-4 different ways)
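
One way to produce the .py and .html copies is with jupyter nbconvert, which can export a notebook to a script or to HTML. The sketch below only builds and prints the commands. The helper names and the notebook path deliver/iris-cleaning.ipynb are made up for illustration, and the talk may well have automated this step differently.

```python
import subprocess


def export_commands(notebook):
    """Return the nbconvert commands that export a notebook to a script and HTML."""
    notebook = str(notebook)
    return [
        # "--to script" writes a .py file for notebooks with a Python kernel.
        ["jupyter", "nbconvert", "--to", "script", notebook],
        ["jupyter", "nbconvert", "--to", "html", notebook],
    ]


def export_notebook(notebook):
    """Run the export commands; needs Jupyter and nbconvert installed."""
    for command in export_commands(notebook):
        subprocess.run(command, check=True)


# Hypothetical notebook path, printed rather than executed.
for command in export_commands("deliver/iris-cleaning.ipynb"):
    print(" ".join(command))
```

Calling export_notebook before each commit would leave the .py and .html files next to the notebook, ready to be added alongside it.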

Benefits

  • Record complete analysis, including dead ends, so it’s easy for others to review
  • Managers can see analysis in notebooks on GitHub or with HTML output.
Computational biology PhD candidate at the Australian National University. I love writing (both articles and software), learning more about the world around us, and beekeeping. I also write for BioSky.co
