How to verify the analysis of your data scientist

by Jack Simpson November 16, 2019


Purpose

I wrote this article for non-technical consultants at my firm to provide guidance on how to interact with a data scientist on a project and have confidence in their outputs.

The goal is to suggest the kinds of checks you should be running to make sure no mistakes have crept into the analysis. I will follow this post with another article I wrote, which gives a more general overview of the data science workflow for managers.

Checking data & outputs

While you should be doing spot and sense checks of the numbers you are given, the level of detail you go into will come down to how much you trust the competence of your data scientist. Remember, ultimately you are the one who will have to stand in front of the client and stake your reputation on the numbers you present. Tech/analytics people are often so focussed on finishing the analysis and getting you those numbers as quickly as possible that, once they have a number, they don’t always spend the extra time to check that it makes sense. The guide below sets out the checks you can run.

High Level Checks

Spot checks

Pick a few cases (e.g. rows from your raw dataset) and make sure they end up matched with the correct data in the outputs
As the consultant, you often have more domain expertise than your data scientist – try to look for edge cases in the data that can be used to test our assumptions about the analysis
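
If you want to show your data scientist what a spot check could look like in Python, here is a minimal sketch using pandas. The column names (`customer_id`, `revenue`, `segment`), the numbers, and the `spot_check` helper are all made up for illustration, and it assumes each ID appears only once in each table.

```python
import pandas as pd

# Hypothetical raw data and analysis output (invented for this example).
raw = pd.DataFrame({
    "customer_id": [101, 102, 103, 104],
    "revenue": [250.0, 90.0, 410.0, 35.0],
})
output = pd.DataFrame({
    "customer_id": [101, 102, 103, 104],
    "revenue": [250.0, 90.0, 410.0, 35.0],
    "segment": ["mid", "low", "high", "low"],
})


def spot_check(raw, output, key, column, ids):
    """Return the ids whose value in `column` differs between raw and output.

    Assumes `key` is unique in both tables.
    """
    raw_values = raw.set_index(key)[column]
    out_values = output.set_index(key)[column]
    mismatches = []
    for i in ids:
        # An id that vanished from the output counts as a mismatch too.
        if i not in out_values.index or raw_values.loc[i] != out_values.loc[i]:
            mismatches.append(i)
    return mismatches


# Pick a couple of cases you know well and follow them through.
mismatched_ids = spot_check(raw, output, "customer_id", "revenue", [101, 103])
print(mismatched_ids)
```

An empty list means the chosen rows came through unchanged. Anything in the list is a row worth asking about.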

Sense checks

Do a ‘back of the envelope’ calculation to make sure that the numbers you’re given are within reason. If they are not then either:
- Something is wrong with the analysis
- Something is wrong with the raw data
- You’ve found something novel that may be of value
If it is difficult to come up with a meaningful calculation, e.g. if the data is too big to sum yourself in Excel, then ask the data scientist to do some simple sums of columns or groups in the data.
Errors need to be traced back to their root cause (e.g. sometimes a major problem in the outputs can be a simple fix like correcting the data type of a column right back at the beginning of the analysis pipeline)
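
The simple sums mentioned above take only a few lines in pandas. This is a rough sketch with invented data; the column names and the assumed price of about 10 per unit are placeholders for whatever rule of thumb you would use on your own project.

```python
import pandas as pd

# Hypothetical sales data (invented for this example).
sales = pd.DataFrame({
    "region": ["North", "North", "South", "South", "West"],
    "units": [10, 15, 7, 3, 20],
    "revenue": [100.0, 150.0, 70.0, 30.0, 200.0],
})

# Simple sums: one for the whole column, one per group.
grand_total = sales["revenue"].sum()
by_region = sales.groupby("region")["revenue"].sum()

# Back-of-the-envelope estimate: total units times an assumed price of 10.
estimate = sales["units"].sum() * 10
relative_gap = abs(grand_total - estimate) / estimate

print(grand_total)
print(by_region)
print(relative_gap)
```

If the gap between the estimate and the reported total is large, that is the cue to work through the three possibilities listed above.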

Granular Checks

Checks for individual datasets once they are read into Python/R

Are all the expected columns/rows there? Keep track of those numbers.
What data types are the columns (e.g. datetime, integer, decimal, etc.)? This is more important than you would think, and an easy way for subtle errors to creep into the analysis.
Are there missing values in the places we expected? Are there more missing values than we expected?
Are some rows or values duplicated in a way we did not expect?
Are there unexpected unique values that could have crept into some columns?
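
As a rough illustration, the questions above can be bundled into one small report that you ask for each time a dataset is loaded. The `profile` function, the column names and the list of allowed values are assumptions made up for this sketch; they are not part of pandas.

```python
import pandas as pd

# Hypothetical dataset with a duplicated row, a missing date and a stray
# lower-case state code (invented for this example).
df = pd.DataFrame({
    "order_id": [1, 2, 2, 4],
    "order_date": ["2019-01-05", "2019-01-06", "2019-01-06", None],
    "state": ["NSW", "VIC", "VIC", "nsw"],
})


def profile(df, expected_columns, allowed_values):
    """Summarise row counts, columns, types, missing values and duplicates."""
    return {
        "n_rows": len(df),
        "missing_columns": sorted(set(expected_columns) - set(df.columns)),
        # Data type of each column, as text so it is easy to read.
        "dtypes": df.dtypes.astype(str).to_dict(),
        "missing_per_column": df.isna().sum().to_dict(),
        "n_duplicate_rows": int(df.duplicated().sum()),
        # Values present in the data but not in the list we expected.
        "unexpected_values": {
            col: sorted(set(df[col].dropna()) - set(allowed))
            for col, allowed in allowed_values.items()
        },
    }


report = profile(
    df,
    expected_columns=["order_id", "order_date", "state", "amount"],
    allowed_values={"state": ["NSW", "VIC"]},
)
for name, value in report.items():
    print(name, value)
```

In this toy data the date column has been read in as text rather than as a date, which is exactly the kind of data type problem that is easy to miss.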

Checks when combining datasets

Frequently, combining multiple datasets will be a multistage process
You want to make sure that the rows from your different raw datasets match up with the correct rows in the outputs
Sometimes you may have a single row from one dataset match with multiple rows from another – ensure this matching process is correct.
Does the number of rows and columns in your newly combined outputs make sense given the number of rows and columns in the raw data?
Check that some rows which are meant to be missing after you merge datasets are actually gone
Check that all the rows that are meant to remain are still there too
Is it possible that the ID columns you’re combining data on may have missing values? You need to come up with a way to resolve this.
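
Here is one way these merge checks might look in pandas. It is a sketch with invented tables and column names, not a complete recipe. It uses two real options of `DataFrame.merge`: `validate`, which raises an error if the matching is not the kind you expected, and `indicator`, which adds a `_merge` column saying where each row came from.

```python
import pandas as pd

# Hypothetical tables (invented): one row per customer, many orders per customer.
customers = pd.DataFrame({
    "customer_id": ["C1", "C2", "C3"],
    "name": ["Ana", "Ben", "Cy"],
})
orders = pd.DataFrame({
    "customer_id": ["C1", "C1", "C2", "C4", None],
    "amount": [20.0, 35.0, 15.0, 50.0, 5.0],
})

# Count missing IDs before merging, so they are dealt with on purpose.
n_missing_ids = int(orders["customer_id"].isna().sum())

# An outer merge keeps every row from both sides so nothing disappears silently.
merged = customers.merge(
    orders,
    on="customer_id",
    how="outer",
    validate="one_to_many",  # fails if a customer appears twice in `customers`
    indicator=True,          # adds the `_merge` column
)

# How many rows matched, and how many exist on one side only?
match_counts = merged["_merge"].value_counts().to_dict()

# Rows that would remain if only matched rows were kept.
kept = merged[merged["_merge"] == "both"]

print(n_missing_ids)
print(match_counts)
print(len(kept))
```

Comparing these counts with the row counts of the raw tables tells you whether the rows you expected to drop are gone and the rows you expected to keep are still there.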
