I feel the urge to throw in my two cents on the data science process. There are plenty of competing opinions circulating in the industry about how we do what we do. I have spent (probably too much) time researching the data science delivery processes recommended by the industry headliners (Amazon, Microsoft and IBM), along with opinions picked up during my career and those of colleagues and bloggers alike. The aim here is to take the best bits of all the recommended processes and squash them together into a series of blog posts that will be easy for anyone to understand.
What is the Data Science Process?
The data science process is the set of rules that we follow in order to successfully deliver a data science project. It needs to have all of the following qualities:
- Be robust enough to provide a guideline that is repeatable. We don’t want to have to reinvent the wheel every time we pick up a new project.
- Be dynamic enough to be flexible. The process acts as a guideline – we sometimes have to add/remove/change pieces of it in order to satisfy a requirement or specific project.
- Be iterative – it is science after all, and we don’t always get it right the first time!
I’ll try to litter the post with diagrams so it doesn’t get boring, like the one below.
High Level Description
This section contains a high-level description of the data science process that will be broken down across the next few posts, along with a fancy reference diagram.
- Define Business Problem
Here we work closely with business subject matter experts (SMEs) to understand the business problem. What is the issue? What are the potential causes? Can we identify any hypotheses?
- Understand Analytical Problem
Now that we understand the problem, we need to look at how we plan to solve it using analytics. Classify what type of problem it is and which techniques could potentially be used to solve it.
- Define Technological Architecture
This step relates more to enterprise-level projects. Is there already an architecture in place? Is something reusable required, or are we just answering a question?
- Data Acquisition and Understanding
Get the data, then work with the customer’s SMEs to understand it. What is the quality of the data? Are there any relationships we need to better understand?
- Data Preparation
Put together a plan to clean the data, agree it with the business, then perform the cleaning activities.
- Data Modelling
The fun part – experiment with the data – try out some statistical methods and try to draw some conclusions.
Present conclusions to the customer and tie them back to any hypotheses.
Make any analytics models and processes available for production use.
Work with the customer to understand the success of the project.
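If it helps to see the flow as something you could track in code, here is a rough Python sketch of the stages above as an ordered list, with a way to loop back to an earlier stage (the "iterative" quality). It is purely illustrative: the `Stage` enum and `next_stage` function are hypothetical names of my own, and the last three labels are placeholders I made up for the final three steps, which have no headings here.

```python
from enum import IntEnum


class Stage(IntEnum):
    """Stages of the process, in the order described above."""
    DEFINE_BUSINESS_PROBLEM = 1
    UNDERSTAND_ANALYTICAL_PROBLEM = 2
    DEFINE_TECHNOLOGICAL_ARCHITECTURE = 3
    DATA_ACQUISITION_AND_UNDERSTANDING = 4
    DATA_PREPARATION = 5
    DATA_MODELLING = 6
    # The three labels below are placeholders (assumptions), not official names.
    PRESENT_CONCLUSIONS = 7
    MAKE_AVAILABLE_FOR_PRODUCTION = 8
    REVIEW_SUCCESS = 9


def next_stage(current, rework_from=None):
    """Return the stage to work on next, or None when the process is finished.

    Pass rework_from to loop back to the current or an earlier stage,
    for example when modelling shows the data needs more cleaning.
    """
    if rework_from is not None:
        if rework_from > current:
            raise ValueError("can only loop back to the current or an earlier stage")
        return rework_from
    if current is Stage.REVIEW_SUCCESS:
        return None  # nothing left to do
    return Stage(current + 1)


# Normal forward step, then a loop back from modelling to preparation.
print(next_stage(Stage.DATA_PREPARATION).name)
print(next_stage(Stage.DATA_MODELLING, rework_from=Stage.DATA_PREPARATION).name)
```

In a real project the stages overlap and get added, removed or changed, so treat this as a picture of the ordering rather than a tool to run a project with.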
The next series of posts will analyse the separate areas of the data science process in depth.