Data Science Modelling

Not all Models are Created Equal…

On to the fancy part of the Data Science process. This is what Data Scientists are renowned for: producing a model that takes data, performs some magic, then spits out the correct answer – at least, that's what the papers say. Realistically, the modelling stage of this process is worth very little without the 'dirty' work that goes on beforehand.

As per the strapline, not all models are created equal. A huge part of data science is experimentation: once we have our 'clean' data set, it is relatively easy to pass it through any number of models and then compare the results.

Once you have classified the type of problem you are facing – in the “Understanding Analytical Problem” phase – just try a whole heap of algorithms until you find something that scores well. Note down your findings and reuse the approach next time!
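For illustration, here is a minimal sketch of that 'try a heap of algorithms and compare' approach, assuming a tabular regression problem and scikit-learn; the generated data is just a stand-in for your cleaned data set.

```python
# A minimal sketch of "try several algorithms and compare", assuming a tabular
# regression problem and scikit-learn; the generated data is a stand-in for
# your cleaned data set.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

X, y = make_regression(n_samples=500, n_features=10, noise=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

candidates = {
    "Linear Regression": LinearRegression(),
    "Random Forest": RandomForestRegressor(random_state=0),
    "k-Nearest Neighbours": KNeighborsRegressor(),
}

for name, model in candidates.items():
    model.fit(X_train, y_train)          # train on the training set
    score = model.score(X_test, y_test)  # R^2 on the held-out test set
    print(f"{name}: R^2 = {score:.3f}")
```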

Automated Algorithm Selection – The Future

Currently there is a significant effort from all the major players to automate the machine learning algorithm selection process – Microsoft's AutoML is one example. It's worth considering that the industry really wants to get rid of the manual task of selecting an analytics approach. Not hugely relevant for this post, but something to look out for in the future.

The Project Manager’s Dilemma – Cost vs Accuracy

Something worth considering when choosing a model: Do I want to hand code my model from scratch or use something that is already prebuilt?

With many industry leaders creating ‘black box’ algorithms (meaning that the coding is already done, a data scientist just needs to call the algorithm in the code they are writing), one wonders if there is ever any need for an individual to perform the complex mathematics that has been historically associated with data science and machine learning.

Of course, in the research world there is a huge need to crunch the numbers manually and produce something exclusive and innovative, but in the commercial world the opinion of many a CEO is that a business would rather save £100,000 on resources and get to market with a working solution that 'does the job' than spend it improving the accuracy of an algorithm by 0.001%.

High Level Process

This information lifted from Microsoft is pretty handy in explaining the modelling process.

  • Split the data into training and test sets.
  • Train the model using the training set we just created. We now have a 'trained model'.
  • Remove the variable we are trying to predict from the test set.
  • Test the trained model using the test data set.
  • Score the model by comparing its predictions to the actual values in the test set.



[Figure: High Level Modelling and Evaluation Process]
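As a rough illustration of those five steps, here is a minimal sketch using pandas and scikit-learn on a property-style data set; the file name houses.csv and the price column are hypothetical stand-ins, not part of the original example.

```python
# A minimal sketch of the five steps above on a property-style data set.
# 'houses.csv' and the 'price' column are hypothetical stand-ins.
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

houses = pd.read_csv("houses.csv")  # the cleaned data set from earlier steps

# 1. Split the data into training and test sets.
train, test = train_test_split(houses, test_size=0.3, random_state=42)

# 2. Train the model using the training set -> a 'trained model'.
model = LinearRegression()
model.fit(train.drop(columns="price"), train["price"])

# 3. Remove the variable we are trying to predict from the test set.
test_features = test.drop(columns="price")

# 4. Test the trained model by making predictions on the test set.
predictions = model.predict(test_features)

# 5. Score by comparing the predictions to the actual values.
print("MSE:", mean_squared_error(test["price"], predictions))
```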


Train and Test

When we train a model, we need to split our data into training and testing sets so that we don't train the model on all of the data. Why? Because we need a way to test the model's performance, and we achieve this by splitting the data set in two: one set is used to train the model (give it instructions on how to operate), and a subset is held back to test the trained model's effectiveness later.

So we feed the training data into a model to train it. When we then test the trained model, we must make sure we don't give it the answer; we make the model predict it. In other words, we remove the column containing the value we are trying to predict from the test data set – in our property development example, that would be the house price.

Next, compare the predictions against the actual values to see how well the model performed. One thing to watch for is data leakage, which occurs when the training set contains information about the value we are trying to predict – for example, a column that is derived from the answer itself.

When using time series data, it's essential to split the data based on time rather than at random, so the model is never trained on information from 'the future' relative to the rows it is tested on.
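As an example of a time-aware split, scikit-learn provides TimeSeriesSplit, which always puts earlier rows in the training fold and later rows in the test fold; a minimal sketch, assuming the rows are already ordered by time:

```python
# A minimal sketch of a time-based split, assuming the rows are already
# ordered by time (oldest first).
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(20).reshape(-1, 1)  # stand-in feature matrix, ordered by time

for train_idx, test_idx in TimeSeriesSplit(n_splits=4).split(X):
    # Training rows always come before test rows, so the model never
    # 'sees the future' during training.
    print("train:", train_idx, "test:", test_idx)
```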


Cross-validate the data

Without getting too low level, a great way to get a more reliable picture of model performance is cross-validation.

This method splits the data into several subsets of the full dataset to ensure we're not overfitting the model to one particular training set. Overfitting means that the model works well only on the data used to train it, and it often happens when too many data elements are used relative to the amount of training data. A warning sign of overfitting is a suspiciously high level of prediction accuracy, such as anything close to 100%. Cross-validation is often used in tandem with the train/test split.
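A minimal sketch of cross-validation using scikit-learn's cross_val_score, again on stand-in regression data:

```python
# A minimal sketch of k-fold cross-validation with scikit-learn's cross_val_score.
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=500, n_features=10, noise=10, random_state=0)

# Train and score the model on 5 different train/test partitions of the data.
scores = cross_val_score(LinearRegression(), X, y, cv=5)
print("R^2 per fold:", scores)
print("Mean R^2:", scores.mean())
```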


Evaluate the Results

Once we have a trained model, a set of predictions from that model, and a set of data containing the actual values we were trying to predict, we can evaluate the model using an error metric. A very popular one is the Mean Squared Error (MSE).

The maths is shown below. If you don't understand it, don't worry – it's easy enough to call this from your preferred coding language without performing any manual maths whatsoever.

\operatorname{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(Y_{i}-\hat{Y}_{i}\right)^{2}

Equation for Mean Squared Error


Mean Squared Error essentially calculates the average squared difference between the actual results (Y) and the predicted results (Ŷ). There are loads of these metrics – have a google and look into the scikit-learn documentation for more options.
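For example, here is a minimal sketch of calculating MSE, both by hand with NumPy and with scikit-learn's mean_squared_error; the numbers are made up for illustration.

```python
# A minimal sketch: Mean Squared Error by hand and via scikit-learn.
# The actual and predicted values are made up for illustration.
import numpy as np
from sklearn.metrics import mean_squared_error

y_actual = np.array([200_000.0, 310_000.0, 250_000.0])
y_predicted = np.array([195_000.0, 300_000.0, 260_000.0])

# By hand: the average of the squared differences.
mse_manual = np.mean((y_actual - y_predicted) ** 2)

# The same calculation via scikit-learn's built-in metric.
mse_sklearn = mean_squared_error(y_actual, y_predicted)

print(mse_manual, mse_sklearn)  # both print the same number
```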

Last Page – Deployment and Feedback