This is the 3rd post in the Data Science Process Series. Now that we have looked at the process holistically, performed the necessary planning, agreed the architecture and defined the problem from both a business ‘human’ and an analytical ‘techy’ mindset, we can move on to the more hands-on areas of the process.
Acquiring the Data
There are many different ways to acquire data for an analytics project: querying a database, calling an API and scraping the web, to name but a few. On a data science project we sometimes get lucky when working with a large organisation and have a DBA at hand, ready to help us grab the data and get it where we need it. If that's the case then great: grab a beer, put your feet up and wait for the dataset to be delivered to you. More often it isn't, and the data scientist will have to use their own coding skills to pull the data from an on-premises database.
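Pulling a table into a pandas DataFrame is often the first step. The sketch below uses an in-memory SQLite database as a stand-in for the client's system, and the `historical_sales` table and its columns are entirely hypothetical; in practice you would point the connection at the real database instead.

```python
import sqlite3

import pandas as pd

# Stand-in for an on-premises database: an in-memory SQLite instance.
# In a real project, connect to the client's database here instead.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE historical_sales (postcode TEXT, price INTEGER)")
conn.executemany(
    "INSERT INTO historical_sales VALUES (?, ?)",
    [("E1 6AN", 450000), ("SW1A 1AA", 1250000)],
)
conn.commit()

# Pull the raw table into a DataFrame for analysis.
sales = pd.read_sql_query("SELECT * FROM historical_sales", conn)
conn.close()
```

For production databases you would typically swap the `sqlite3` connection for a SQLAlchemy engine, but the `read_sql_query` call stays the same.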
Importantly, here we need to loop back to the agreements we made previously. Remember in the previous post we mentioned putting an agreement in place covering how we handle the client's data? This is where we refer back to that agreement. We need to be very careful when taking any client data off site; there are security concerns with moving data. Where possible we can minimise that risk by agreeing to work on the customer's existing architecture. If we are only performing a proof of concept, we need to make a judgement call on whether it's safe to port files around as CSV/Parquet/Excel.
So, we have a pile of customer data. The next question I would ask at this point: Is there any way in which we can enrich the dataset by adding data from external sources? These may have been mentioned earlier on in the definition phases of the project.
Let’s take the example from our previous posts: the Property Developer problem. We have extracted all of Property Developer X’s historical sales data from their database, and we have also pulled sales data for the local area directly from the UK Land Registry’s API, as outlined in our definition discussions. What else could we add to enrich the dataset? It’s widely accepted that a property’s proximity to a tube or rail station impacts its price, so let’s find a list of London tube stations with their latitude/longitude locations. We can save this for later use when we are understanding and modelling our data.
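One way to turn that station list into a feature is to compute each property's distance to its nearest station with the haversine formula. This is a minimal sketch: the two stations and all coordinates below are illustrative, and real coordinates would come from an open dataset of tube station locations.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

# Illustrative station list; a real one would cover every London station.
stations = [
    ("Oxford Circus", 51.5152, -0.1419),
    ("Bank", 51.5133, -0.0886),
]

def nearest_station_km(lat, lon):
    """Distance from a property to the closest station in the list."""
    return min(haversine_km(lat, lon, s_lat, s_lon) for _, s_lat, s_lon in stations)

# Enrich a property record with its distance to the nearest station.
dist = nearest_station_km(51.5145, -0.1400)
```

Applied across the whole sales dataset, `nearest_station_km` gives us a new numeric column we can later correlate against price.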
Understanding the Data
Once we have the data, we need to dig in to understand more about it and the correlations between the different features (columns) of the dataset. Digging around and performing analysis before we start modelling can uncover patterns in the data, raise new questions about it, and let us accept or reject our original hypotheses.
There are plenty of methods we can use to better understand our data. I have highlighted some of my favourites below, but won’t go into any serious level of detail.
The overall message here is to understand the data well enough that you look smart when presenting back to the business. If you can tell the business something they didn’t know – great! You already look like the smartest person in the room.
We can use a correlation matrix to see straight off the bat whether our metrics are correlated. A highly correlated metric is one whose value appears to move closely with another’s. Be very careful what you infer here: correlation does not imply causation. Just because two metrics appear to be related does not mean one drives the other; sometimes we have to dig deeper to find the true cause behind a correlated pair of metrics.
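In pandas this is a one-liner. The toy dataset below stands in for the enriched property data; the column names and values are made up for illustration.

```python
import pandas as pd

# Toy stand-in for the enriched sales dataset (values are illustrative).
df = pd.DataFrame({
    "price": [300, 450, 500, 620, 700],              # in £1000s
    "floor_area": [50, 70, 80, 100, 115],            # in square metres
    "dist_to_station_km": [2.0, 1.5, 1.2, 0.8, 0.5],
})

# Pairwise Pearson correlations between all numeric features.
corr = df.corr()
print(corr.round(2))
```

Plotting `corr` as a heatmap (e.g. with Seaborn's `heatmap`) makes the strong positive and negative relationships jump out at a glance.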
Nulls in Data
We can write simple code to count the nulls in each column – it’s pretty much bread and butter for data cleaning, which we will get into later.
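A quick sketch of that bread-and-butter check, again on a made-up DataFrame:

```python
import numpy as np
import pandas as pd

# Made-up data with some missing values.
df = pd.DataFrame({
    "price": [300000, np.nan, 450000, 500000],
    "garden": [np.nan, np.nan, np.nan, "yes"],
})

# Nulls per column, as a raw count and as a share of rows.
null_counts = df.isnull().sum()
null_share = df.isnull().mean()
```

`isnull().mean()` is a handy trick: because booleans average as 0s and 1s, it gives the fraction of missing values per column directly.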
Scatter plots are really useful when analysing linear relationships, and we can use a pair plot to plot all our variables against each other.
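One way to do this with just pandas and matplotlib is `scatter_matrix` (Seaborn's `pairplot` is a popular alternative). The data here is the same illustrative toy set as before.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; no display needed

import pandas as pd

# Illustrative toy data standing in for the real dataset.
df = pd.DataFrame({
    "price": [300, 450, 500, 620, 700],
    "floor_area": [50, 70, 80, 100, 115],
    "dist_to_station_km": [2.0, 1.5, 1.2, 0.8, 0.5],
})

# Every numeric feature plotted against every other,
# with histograms down the diagonal.
axes = pd.plotting.scatter_matrix(df, figsize=(6, 6))
```

With three features this produces a 3×3 grid; the off-diagonal panels are the pairwise scatter plots where linear relationships show up clearly.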
It’s widely appreciated that ~70% of the work in any data science project involves moving and cleaning data.
I’m not going to linger on the detail here; data preparation is arguably the largest and most intensive area of any data science project. There are many posts on it, both on this blog and on the internet in general, and I would strongly suggest taking a look and learning more about data preparation.
In a large organisation we may be given a data engineer who is responsible for data quality, but in the real world it’s likely that the data scientist will have to apply a few techniques themselves.
The types of questions we need to ask are:
- Are there any columns in the data that are useless and can be removed? Less data = better performance.
- Are there any columns that contain so many nulls that they are also useless? As a general rule of thumb I remove any feature that is more than 80% null – after checking how it impacts the target variable we are trying to predict, of course.
- Are there likely to be keying errors in the data? Most data is typed in by a person at some point in its lifetime. What are the chances of human error, and what can we do to correct it?
- Are there any outliers that we want to remove from the data, which may decrease the accuracy of the model?
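Two of those checks can be sketched in a few lines of pandas: dropping features that are more than 80% null, and filtering outliers with the common 1.5 × IQR rule (one of several reasonable outlier definitions). The data and column names are illustrative.

```python
import numpy as np
import pandas as pd

# Illustrative data: one obvious price outlier and one near-empty column.
df = pd.DataFrame({
    "price": [300, 450, 500, 620, 5000],       # 5000 is an obvious outlier
    "floor_area": [50, 70, 80, 100, 90],
    "mostly_empty": [np.nan] * 5,              # entirely null feature
})

# 1. Drop any feature that is more than 80% null.
keep = df.columns[df.isnull().mean() <= 0.8]
df = df[keep]

# 2. Remove rows whose price falls outside 1.5 * IQR of the quartiles.
q1, q3 = df["price"].quantile([0.25, 0.75])
iqr = q3 - q1
in_range = df["price"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
df = df[in_range]
```

As the bullet above warns, always check how a dropped feature relates to the target before discarding it, and sanity-check removed "outliers" – sometimes they are real, important observations rather than errors.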
By the end of these two processes, we should have a solid, clean dataset on which we can begin to build a model.