I recently listened to a really interesting talk by Jonathan Whitmore in which he discussed his company’s approach to working with data in Jupyter notebooks. I’d recommend watching it, but I’ve written a brief summary below for my own future reference.
Notebook Types
Use Jupyter notebooks to collaborate and share data/analyses amongst teams and clients. Utilise two types of notebooks:
1. Lab Notebooks
- Use to keep a record of things like exploratory analysis etc.
- You don’t change/update it – it’s a historical record for each person
- Naming notebooks
  - Format: [Date]-[Initials]-[2-4 word description].ipynb
  - Example: 2016-12-18-JS-iris-dataset-exploration.ipynb
- Split when notebook reaches a certain size or by topic
- Example notebook:
  - Title: purpose of notebook
  - What is in the notebook
  - What you were trying to achieve/analyse/hypotheses
  - Can say whether different analyses worked or were a dead end
  - Import libraries, use magics
  - Use version_information package to output version numbers of libraries
  - Import data
  - Can link back to deliverable notebooks you’ve built on – e.g. notebook explaining how data was cleaned
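To make the naming convention and the version record concrete for myself, here is a minimal sketch. The function names `lab_notebook_name` and `library_versions` are my own invention, not something from the talk. The version helper uses the standard library's `importlib.metadata` as a stand-in, so it is not the version_information package mentioned above, just a rough equivalent that runs without extra installs.

```python
import re
from datetime import date
from importlib import metadata
from typing import Dict, Iterable, Optional


def lab_notebook_name(initials: str, description: str,
                      day: Optional[date] = None) -> str:
    """Build a name of the form [Date]-[Initials]-[2-4 word description].ipynb.

    Hypothetical helper: the 2-4 word limit is enforced here as an assumption
    about how strictly the convention should be applied.
    """
    day = day or date.today()
    # Keep only alphanumeric words so the name is safe on any filesystem.
    words = re.findall(r"[A-Za-z0-9]+", description.lower())
    if not 2 <= len(words) <= 4:
        raise ValueError("description should be 2-4 words")
    return f"{day.isoformat()}-{initials.upper()}-{'-'.join(words)}.ipynb"


def library_versions(packages: Iterable[str]) -> Dict[str, str]:
    """Return installed versions for the given distribution names.

    Stand-in for the version_information package, using only the stdlib.
    """
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = "not installed"
    return versions
```

Calling `lab_notebook_name("js", "iris dataset exploration", date(2016, 12, 18))` reproduces the example filename above, and printing `library_versions(["numpy", "pandas"])` near the top of a notebook leaves a record of the environment it was run in.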
2. Deliverable Notebooks
- Notebooks you’ll want to reference in the future
- Processing and cleaning raw data: record of transformation
- Use as evidence of analysis when making pull requests
- Used and shared by entire team
Directory Organisation
- data [backed up outside of version control]
- deliver [notebooks to deliver/continually use]
- develop [lab notebooks]
- figures [where your figures are stored]
- src [scripts/modules]
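A new project can be given this layout in a couple of lines. This is only a sketch: the helper name `create_project_layout` is hypothetical, and it does nothing about backing up `data` or excluding it from version control, which would still need to be set up separately.

```python
from pathlib import Path
from typing import List, Union

# Directory names taken from the list above.
PROJECT_DIRS = ("data", "deliver", "develop", "figures", "src")


def create_project_layout(root: Union[str, Path]) -> List[Path]:
    """Create the five project directories under root and return their paths."""
    root = Path(root)
    created = []
    for name in PROJECT_DIRS:
        path = root / name
        # exist_ok=True makes the helper safe to re-run on an existing project.
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created
```

Running `create_project_layout("my-analysis")` creates any directories that are missing and leaves existing ones untouched.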
Teamwork/Version Control Recommendations
- Each data scientist has a dev branch that they push to daily
- Merge to master via pull request
- Commit the .ipynb, .py and .html versions of each notebook, plus its figures (so the same work is saved 3-4 different ways)
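One way to produce the .html and .py copies before committing is `jupyter nbconvert`. The sketch below only builds and optionally runs the commands. The function names are my own, and I am assuming `jupyter` is on the PATH and that figures are saved by the notebook itself rather than by this script.

```python
import subprocess
from pathlib import Path
from typing import List, Union


def export_commands(notebook: Union[str, Path]) -> List[List[str]]:
    """Return the nbconvert commands that export a notebook to HTML and a script."""
    nb = str(Path(notebook))
    return [
        ["jupyter", "nbconvert", "--to", "html", nb],
        # "script" writes a .py file when the notebook uses a Python kernel.
        ["jupyter", "nbconvert", "--to", "script", nb],
    ]


def export_notebook(notebook: Union[str, Path]) -> None:
    """Run each export command, stopping at the first failure."""
    for cmd in export_commands(notebook):
        subprocess.run(cmd, check=True)
```

After `export_notebook("deliver/cleaning.ipynb")` (a made-up path), the .ipynb, .html and .py files sit side by side and can be committed together.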
Benefits
- Record complete analysis, including dead ends, so it’s easy for others to review
- Managers can see analysis in notebooks on GitHub or with HTML output.