Best practices for data science with the Jupyter Notebook

by Jack Simpson

I recently listened to a really interesting talk by Jonathan Whitmore where he discussed the approach his company has to working with data using the Jupyter Notebook. I’d recommend watching it, but I’ve made a brief summary below for my own future reference.

Notebook Types

Use Jupyter notebooks to collaborate and share data/analyses amongst teams and clients. Utilise two types of notebooks:

1. Lab Notebooks

  • Use to keep a record of work such as exploratory analysis
  • You don’t change/update it – it’s a historical record for each person
  • Naming notebooks
    • [Date]-[Initials]-[2-4 word description].ipynb
    • 2016-12-18-JS-iris-dataset-exploration.ipynb
  • Split a notebook when it reaches a certain size, or by topic
  • Example notebook:
    • Title: purpose of notebook
    • What is in the notebook
    • What you were trying to achieve/analyse/hypotheses
    • Can say whether different analyses worked or were a dead end
    • Import libraries, use magics
    • Use the version_information package to output the version numbers of libraries
    • Import data
    • Can link back to deliverable notebooks you’ve built on – e.g. notebook explaining how data was cleaned
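
To make the naming convention above concrete, here is a minimal sketch of a helper that builds a lab notebook file name. The function name `lab_notebook_name` and its parameters are my own invention for illustration (they are not from the talk), and I am assuming the date is written in ISO format and the description is lower-cased and hyphenated, as in the example file name.

```python
import re
from datetime import date
from typing import Optional


def lab_notebook_name(initials: str, description: str,
                      day: Optional[date] = None) -> str:
    """Build a '[Date]-[Initials]-[2-4 word description].ipynb' file name.

    Hypothetical helper: the name and signature are assumptions.
    """
    day = day or date.today()
    # Lower-case the description and keep only its alphanumeric words.
    words = re.findall(r"[A-Za-z0-9]+", description.lower())
    if not 2 <= len(words) <= 4:
        raise ValueError("description should be 2-4 words")
    return f"{day.isoformat()}-{initials.upper()}-{'-'.join(words)}.ipynb"


print(lab_notebook_name("js", "Iris dataset exploration", date(2016, 12, 18)))
# prints 2016-12-18-JS-iris-dataset-exploration.ipynb
```

Putting the date first means the notebooks in the develop folder sort chronologically without any extra effort.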

2. Deliverable Notebooks

  • Notebooks you’ll want to reference in the future
  • Processing and cleaning raw data: record of transformation
  • Use as evidence of analysis when making pull requests
  • Used and shared by entire team

Directory Organisation

  • data [backed up outside of version control]
  • deliver [notebooks to deliver/continually use]
  • develop [lab notebooks]
  • figures [where your figures are stored]
  • src [scripts/modules]
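
The layout above could be set up with a few lines of Python. This is only a sketch: the function name `create_project_skeleton` is hypothetical, and writing `data/` into a `.gitignore` file is my own assumption about how to keep the data folder out of version control (the talk only says it is backed up elsewhere).

```python
from pathlib import Path

# Folder names taken from the list above.
PROJECT_DIRS = ["data", "deliver", "develop", "figures", "src"]


def create_project_skeleton(root):
    """Create the project folders under `root` and return their paths.

    Hypothetical helper: the name and the .gitignore step are assumptions.
    """
    root = Path(root)
    created = []
    for name in PROJECT_DIRS:
        folder = root / name
        # exist_ok=True makes the function safe to re-run on an existing project.
        folder.mkdir(parents=True, exist_ok=True)
        created.append(folder)
    # Assumption: ignore data/ in git, since it is backed up outside version control.
    gitignore = root / ".gitignore"
    if not gitignore.exists():
        gitignore.write_text("data/\n")
    return created
```

Calling `create_project_skeleton("my-analysis")` would create the five folders inside a `my-analysis` directory, where `my-analysis` is just a placeholder project name.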

Teamwork/Version Control Recommendations

  • Each data scientist has a dev branch that they push to daily
  • Merge to master via pull request
  • Commit the .ipynb, .py and .html versions of each notebook, plus its figures (so the work is saved 3-4 different ways)
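
One way to produce the .py and .html copies before committing is the `jupyter nbconvert` command, which supports `--to script` and `--to html`. The sketch below wraps it in Python. The function names and the example notebook path are hypothetical, and I am assuming `jupyter` is installed and on the PATH. Figures are not handled here and would still need to be saved from the notebook itself.

```python
import subprocess

# Export formats matching the .py and .html files mentioned above.
EXPORT_FORMATS = ("script", "html")


def nbconvert_commands(notebook):
    """Return one `jupyter nbconvert` command per export format."""
    return [
        ["jupyter", "nbconvert", "--to", fmt, str(notebook)]
        for fmt in EXPORT_FORMATS
    ]


def export_notebook(notebook):
    """Run the exports; raises CalledProcessError if nbconvert fails."""
    for command in nbconvert_commands(notebook):
        subprocess.run(command, check=True)


# Show the commands for a placeholder notebook without running them.
for command in nbconvert_commands("deliver/clean-raw-data.ipynb"):
    print(" ".join(command))
```

Building the commands separately from running them makes it easy to check what will be executed before touching any files.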

Benefits

  • Record the complete analysis, including dead ends, so it’s easy for others to review
  • Managers can see analysis in notebooks on GitHub or with HTML output.
