Best practices for data science with the Jupyter Notebook

by Jack Simpson December 18, 2016

I recently listened to a really interesting talk by Jonathan Whitmore where he discussed the approach his company has to working with data using the Jupyter Notebook. I’d recommend watching it, but I’ve made a brief summary below for my own future reference.

Notebook Types

Use Jupyter notebooks to collaborate and share data/analyses amongst teams and clients. Utilise two types of notebooks:

1. Lab Notebooks

Use these to keep a record of work such as exploratory analysis
You don’t change or update a lab notebook once it is written: it’s a historical record for each person
Naming notebooks
- [Date]-[Initials]-[2-4 word description].ipynb
- 2016-12-18-JS-iris-dataset-exploration.ipynb
Split when notebook reaches a certain size or by topic
Example notebook:
- Title: purpose of notebook
- What is in the notebook
- What you were trying to achieve/analyse/hypotheses
- Can say whether different analyses worked or were a dead end
- Import libraries, use magics
- Use version_information package to output version numbers of libraries
- Import data
- Can link back to deliverable notebooks you’ve built on – e.g. notebook explaining how data was cleaned
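The naming convention and the version-recording step above can be sketched in a few lines of Python. This is only an illustration: the helper names `lab_notebook_name` and `library_versions` are my own, and the version lookup uses the standard library's `importlib.metadata` as a stand-in rather than the version_information package mentioned in the talk.

```python
import re
from datetime import date
from importlib import metadata


def lab_notebook_name(initials, description, day=None):
    """Build a name of the form [Date]-[Initials]-[2-4 word description].ipynb.

    Hypothetical helper: the convention comes from the talk, the function does not.
    """
    day = day or date.today()
    # Keep only alphanumeric words so the file name stays shell-friendly.
    words = re.findall(r"[A-Za-z0-9]+", description.lower())
    if not 2 <= len(words) <= 4:
        raise ValueError("description should be 2-4 words")
    return f"{day.isoformat()}-{initials.upper()}-{'-'.join(words)}.ipynb"


def library_versions(names):
    """Return a dict mapping each package name to its installed version.

    Stand-in for version_information: packages that are not installed
    are reported as 'not installed' instead of raising an error.
    """
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = "not installed"
    return versions


if __name__ == "__main__":
    print(lab_notebook_name("JS", "iris dataset exploration", date(2016, 12, 18)))
    print(library_versions(["numpy", "pandas"]))
```

Calling `library_versions` in the first cell of a lab notebook gives a similar record of the environment, so a later reader can see which library versions produced the results.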

2. Deliverable Notebooks

Notebooks you’ll want to reference in the future
Processing and cleaning raw data: record of transformation
Use as evidence of analysis when making pull requests
Used and shared by entire team

Directory Organisation

data [backed up outside of version control]
deliver [notebooks to deliver/continually use]
develop [lab notebooks]
figures [where your figures are stored]
src [scripts/modules]
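If you want to set this layout up for a new project, a small script along these lines would do it. The function name `create_project_layout` is hypothetical; only the five directory names come from the talk.

```python
import tempfile
from pathlib import Path

# Directory names taken from the layout above.
PROJECT_DIRS = ("data", "deliver", "develop", "figures", "src")


def create_project_layout(root):
    """Create the project directories under root and return their paths."""
    root = Path(root)
    created = []
    for name in PROJECT_DIRS:
        path = root / name
        # exist_ok=True makes the script safe to re-run on an existing project.
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created


if __name__ == "__main__":
    # Demonstrate in a throwaway directory rather than the current one.
    with tempfile.TemporaryDirectory() as tmp:
        for path in create_project_layout(tmp):
            print(path.name)
```

Since the data directory is backed up outside of version control, you would also need to exclude it from the repository yourself; the script does not do that.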

Teamwork/Version Control Recommendations

Each data scientist has a dev branch that they push to daily
Merge to master via pull request
Commit the .ipynb, .py and .html versions of each notebook along with its figures (saving the work 3-4 different ways)
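One way to produce the .html and .py copies before committing is to call `jupyter nbconvert` from a script. The sketch below is an assumption about how you might automate it, not the workflow from the talk; the function names are hypothetical, and `--to script` only yields a .py file for notebooks that use a Python kernel.

```python
import subprocess


def nbconvert_commands(notebook):
    """Return the nbconvert commands that export a notebook to HTML and to a script."""
    notebook = str(notebook)
    return [
        ["jupyter", "nbconvert", "--to", "html", notebook],
        ["jupyter", "nbconvert", "--to", "script", notebook],
    ]


def export_notebook(notebook):
    """Run each export; requires jupyter and nbconvert to be installed."""
    for command in nbconvert_commands(notebook):
        # check=True raises CalledProcessError if an export fails.
        subprocess.run(command, check=True)


if __name__ == "__main__":
    # Print the commands instead of running them, so the demo has no side effects.
    for command in nbconvert_commands("2016-12-18-JS-iris-dataset-exploration.ipynb"):
        print(" ".join(command))
```

The exported files land next to the notebook by default, so they can be added to the same commit as the .ipynb file.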

Benefits

Record the complete analysis, including dead ends, so it’s easy for others to review
Managers can see analysis in notebooks on GitHub or with HTML output.
