I wrote this article for non-technical consultants at my firm to provide guidance on how to interact with a data scientist on a project and have confidence in their outputs.
The goal is to give some advice on the kinds of checks you should be running to make sure no mistakes have crept into the analysis. I will follow this post with another article I wrote, which gives a more general overview of the data science workflow for managers.
Checking data & outputs
While you should always spot-check and sense-check the numbers you are given, how much detail you go into will depend on how much you trust the competence of your data scientist. Remember, you are ultimately the one who has to stand in front of the client and stake your reputation on the numbers you present. Many tech/analytics people are so focussed on finishing the analysis and getting you those numbers as quickly as possible that, once they have a number, they don’t always spend the extra time checking that it makes sense. The guide below sets out the checks you can run.
High Level Checks
- Pick a few cases (e.g. rows from your raw dataset) and make sure they end up matched with the correct data in the outputs
- As the consultant, you often have more domain expertise than your data scientist – try to look for edge cases in the data that can be used to test our assumptions about the analysis
- Do a ‘back of the envelope’ calculation to make sure that the numbers you’re given are within reason. If they are not then either:
  - Something is wrong with the analysis
  - Something is wrong with the raw data
  - You’ve found something novel that may be of value
- If it is difficult to come up with a meaningful calculation, e.g. if the data is too big to sum yourself in Excel, then you should ask the data scientist to do some simple sums of columns or groups in the data.
- Errors need to be traced back to their root cause (e.g. sometimes a major problem in the outputs can be a simple fix like correcting the data type of a column right back at the beginning of the analysis pipeline)
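To give a feel for what those simple sums might look like when you ask for them, here is a minimal sketch in Python using the pandas library. The `sales` table, its column names and the `reported_total` figure are all made up for illustration; your data scientist would swap in the real data and the real headline number.

```python
import pandas as pd

# Hypothetical raw data; in practice this would be loaded from the client's files.
sales = pd.DataFrame({
    "region": ["North", "North", "South", "South", "East"],
    "revenue": [120.0, 80.0, 200.0, 50.0, 150.0],
})

# Simple sums that can be compared against figures you already know.
total_revenue = sales["revenue"].sum()
revenue_by_region = sales.groupby("region")["revenue"].sum()

# Hypothetical headline number taken from the analysis outputs.
reported_total = 600.0

# The group sums should add back up to the overall total, and the overall
# total should agree with the reported figure within a small tolerance (1% here).
groups_reconcile = abs(revenue_by_region.sum() - total_revenue) < 1e-6
matches_report = abs(total_revenue - reported_total) < 0.01 * reported_total

print(revenue_by_region)
print("Groups reconcile:", groups_reconcile)
print("Matches reported total:", matches_report)
```

If either check comes back `False`, that is the cue to start tracing the discrepancy back through the pipeline.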
Checks for individual datasets once they are read into Python/R
- Are all the expected columns/rows there? Keep track of those numbers.
- What data types are the columns (e.g. datetime, integer, decimal, etc.)? This is more important than you would think, and an easy way for subtle errors to creep into the analysis.
- Are there missing values in the places we expected? Are there more missing values than we expected?
- Are some rows or values duplicated in a way we did not expect?
- Are there unexpected unique values that could have crept into some columns?
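The checks above could be sketched roughly as follows in Python with pandas. This is an illustration rather than a recipe: the `customers` table and its columns are hypothetical, and the right checks will depend on your data.

```python
import pandas as pd

# Hypothetical dataset standing in for a file that has been read into Python.
customers = pd.DataFrame({
    "customer_id": [1, 2, 2, 4],
    "signup_date": ["2021-01-05", "2021-02-10", "2021-02-10", None],
    "segment": ["Retail", "retail", "retail", "Wholesale"],
})

# 1. Are the expected columns and rows there? Record the counts.
expected_columns = {"customer_id", "signup_date", "segment"}
missing_columns = expected_columns - set(customers.columns)
n_rows, n_cols = customers.shape

# 2. What data types are the columns? Dates read in as text are a common trap,
#    so inspect the types and convert explicitly.
column_types = customers.dtypes
customers["signup_date"] = pd.to_datetime(customers["signup_date"])

# 3. How many missing values does each column have?
missing_per_column = customers.isna().sum()

# 4. Are any rows exact duplicates of an earlier row?
n_duplicate_rows = int(customers.duplicated().sum())

# 5. Which unique values appear in a categorical column?
segment_values = sorted(customers["segment"].unique())

print("Rows, columns:", n_rows, n_cols)
print("Missing columns:", missing_columns)
print(column_types)
print(missing_per_column)
print("Duplicate rows:", n_duplicate_rows)
print("Segment values:", segment_values)
```

In this made-up data the unique-value check shows both "Retail" and "retail", the kind of inconsistency that silently splits one group into two in later summaries.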
Checks when combining datasets
- Frequently, combining multiple datasets will be a multistage process
- You want to make sure that the rows from your different raw datasets match up correctly to the same row in the outputs
- Sometimes you may have a single row from one dataset match with multiple rows from another – ensure this matching process is correct.
- Does the number of rows and columns in your newly combined outputs make sense given the number of rows and columns in the raw data?
- Check that rows which are meant to be dropped when you merge datasets are actually gone
- Check that all the rows that are meant to remain are still there too
- Could the ID columns you’re combining data on have missing values? If so, you need to come up with a way to resolve this.
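As a rough sketch of how a data scientist might run these checks, the example below joins two small tables with pandas. The `orders` and `customers` tables and their columns are hypothetical. The `validate` and `indicator` options of `merge` are used here as one convenient way to surface matching problems, not the only one.

```python
import pandas as pd

# Hypothetical tables; names, columns and values are placeholders.
orders = pd.DataFrame({
    "order_id": [101, 102, 103, 104],
    "customer_id": ["C1", "C2", "C2", None],
    "amount": [50.0, 20.0, 30.0, 10.0],
})
customers = pd.DataFrame({
    "customer_id": ["C1", "C2", "C3"],
    "segment": ["Retail", "Retail", "Wholesale"],
})

# Missing IDs cannot be matched, so count them before merging.
n_missing_ids = int(orders["customer_id"].isna().sum())

# validate="many_to_one" raises an error if a customer_id appears more than
# once in customers, which would silently multiply order rows.
# indicator=True adds a "_merge" column recording where each row was found.
merged = orders.merge(
    customers,
    on="customer_id",
    how="left",
    validate="many_to_one",
    indicator=True,
)

# A left join against unique keys should keep exactly the original order rows.
row_count_preserved = len(merged) == len(orders)

# Orders that found no matching customer (including those with a missing ID).
n_unmatched = int((merged["_merge"] == "left_only").sum())

# Customers that never appear in any order.
unused_customers = set(customers["customer_id"]) - set(orders["customer_id"].dropna())

print("Orders with missing ID:", n_missing_ids)
print("Row count preserved:", row_count_preserved)
print("Unmatched orders:", n_unmatched)
print("Customers with no orders:", unused_customers)
```

Here the one unmatched order is the one with the missing ID, which is exactly the case you would need to decide how to handle before trusting any totals by segment.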