Oracle Cloud Analytics Services

The purpose of this experiment is to investigate the application of the data science process within industry-leading cloud-based analytics tools.

The goal of the experiment is to evaluate whether an accurate machine learning experiment can be completed within Oracle Analytics Cloud Services without writing code.   The ability to complete this task provides an indication of how easily an end user with little or no coding knowledge could pick up the tool and apply machine learning techniques to their data.

Hypothesis

The experiment attempts to provide evidence to support or disprove the following hypothesis:  

“I believe a user with minimal experience in data science and machine learning can create the products of a simple end-to-end Data Science process without writing any complex code, directly in Oracle’s cloud Analytics Tool, OACS – within half the time taken to hand code in Python”

Any final analysis will link directly back to this statement.

Why?

Many of the industry-leading BI software vendors have adapted their data visualisation tools towards the industry trend of applying advanced analytics and data science techniques, enabling customers to gain insights from their data that were not possible in the past.

Many of the BI leaders' cloud-based offerings are now seeking to "bridge the gap" between traditional Business Intelligence (BI) and Machine Learning and Advanced Analytics (AI).  This experiment attempted to prove the concept for OACS.


The Experiment

Baseline

The baseline for the experiment will be a script written in Python.  The experimental datasets come from a Kaggle machine learning competition designed to test a data scientist's ability to predict house prices using machine learning models and data science techniques.


Here is the experiment in Python – as a downloadable HTML file. Still working on getting the GitHub integration working on this blog.



Training and Test Datasets are available from Kaggle

Scoring

Models trained using both Python and the cloud analytics technology will be scored independently by Kaggle.  Further information about the competition, including data descriptions and specifics, is available here.

The generated predictions on the test dataset will be scored using Root Mean Squared Error (RMSE).

More info on Root Mean Squared Deviation
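
For a sense of what that metric computes, here is a minimal sketch in Python (the baseline language). For this competition the error is taken on the log of the sale price, so expensive houses don't dominate the score; the numbers below are illustrative only.

    import numpy as np

    def rmse_log(y_true, y_pred):
        """RMSE on log-transformed sale prices, as used on the Kaggle leaderboard."""
        return np.sqrt(np.mean((np.log(y_true) - np.log(y_pred)) ** 2))

    # Illustrative values only
    print(rmse_log(np.array([200000.0, 150000.0]), np.array([195000.0, 160000.0])))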

Technology Considerations

  • The Baseline is coded in Python 3.6 in a Jupyter NoteBook.
  • The Cloud Analytics Tool is Oracle Cloud Analytics Services v18.



Execution Steps – OACS

Data Exploration

The first stage of the experiment was to undertake some analysis to find out more about the data set that had been presented.

Correlation Matrix

The first step was to build an understanding of the dataset as a whole.  A solid place to start is a correlation matrix, which shows how strongly each pair of metrics is correlated.
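
For comparison, the Python baseline produces the same view in a few lines. This is a minimal sketch assuming the Kaggle training file has been downloaded as train.csv.

    import pandas as pd
    import seaborn as sns
    import matplotlib.pyplot as plt

    train = pd.read_csv("train.csv")

    # Correlation between all numeric columns, drawn as a heatmap
    corr = train.select_dtypes(include="number").corr()
    sns.heatmap(corr, cmap="viridis", square=True)
    plt.show()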

Identify Relationships and Outliers

Next, the highly correlated metrics were identified and plotted as individual analyses to understand more about them; the key goal was to surface any linear relationships and identify any outliers in the dataset.
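
The equivalent step in the Python baseline is a simple scatter plot of a highly correlated feature against the target; GrLivArea against SalePrice is the classic example in this dataset (a sketch, not the exact notebook code).

    import pandas as pd
    import matplotlib.pyplot as plt

    train = pd.read_csv("train.csv")

    # Plot a highly correlated feature against the target and eyeball the outliers
    plt.scatter(train["GrLivArea"], train["SalePrice"], alpha=0.5)
    plt.xlabel("GrLivArea")
    plt.ylabel("SalePrice")
    plt.show()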

One feature that was especially useful here was the ability to quickly select a datapoint and change its colour within the scatterplot.  This made it easy to highlight the outliers in the leftmost scatterplot (in red).

Oracle DV’s annotation feature was similarly helpful here, especially if the analysis was to be presented at a later date.

Interrogate Target Variable

The next step in the data science process was to learn more about our target variable.  This was very straightforward: the dataset was opened in Data Visualiser, our metric was identified and right-clicked, and "Explain" was selected.  OACS did all the hard work here.  OACS's underlying machine learning algorithms went away and thought about all of the relationships that might be prominent, then automatically rendered a whole set of analyses that could be interrogated.

It was possible to select the individual analyses and drop them onto a DV canvas to further dig into them. 

This functionality is exceptional; it removed a significant amount of time from the process and made it simple to identify patterns in the dataset at the click of a button.



Distribution of Target Variable

Having taken the analysis automatically presented by OACS and added it to a canvas, minor alterations were made to the chart to adjust the granularity of the X-axis and more accurately visualise the distribution of the target variable.

Within a few clicks the leftmost chart (below) was presented.  It's clear from this visualisation that the data is heavily skewed, with most values bunched at the lower end and a long tail of expensive houses.  A custom metric was created applying LOG() to the variable, which makes the graph appear as per the right chart: a clean(ish) bell shape.
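
In the Python baseline the same transformation is a single NumPy call. A sketch (np.log1p is used so any zero values are handled cleanly):

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt

    train = pd.read_csv("train.csv")

    # Reduce the skew in the target so its distribution approximates a bell curve
    train["SalePrice_log"] = np.log1p(train["SalePrice"])
    train["SalePrice_log"].hist(bins=50)
    plt.show()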


Side note:  I could not find a way to identify nulls across a whole dataset in OACS.  In Python we can write a FOR loop that iterates a piece of code over all of the columns in the dataset.  In OACS we can add our own Python code, but doing so would contradict the hypothesis.

This will need to be investigated further.  Currently a user would need to write individual logic to identify the nulls in each column, which would take a significant amount of time.
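
For reference, this is the kind of loop meant above; a minimal sketch assuming the data is in a pandas DataFrame.

    import pandas as pd

    train = pd.read_csv("train.csv")

    # One loop covers every column, which I could not replicate in OACS without code
    for col in train.columns:
        missing = train[col].isnull().sum()
        if missing > 0:
            print(f"{col}: {missing} missing values")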


Data Preparation with Oracle Data Flows

The next phase was to prepare the data so that it could be used to train a model.  The below data flow was created to perform the previously identified actions on the data.

  • Remove the outliers that were previously identified.
  • Add a calculated column that could be useful (Total Surface Area), created by adding other columns in the dataset together.
  • Add a LOG() column to the dataset (if not completed in the previous step).
  • Remove a count column that was useless for prediction.

The dataset was saved (the last node in the data flow here).  It was also possible to change the datatypes of the output dataset at this point, which is useful because it means no code needs to be written to engineer these features.
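
For comparison, the same four preparation steps take only a few lines in the Python baseline. This is a sketch: the outlier threshold, the columns summed into Total Surface Area, and the dropped Id column are illustrative choices, not necessarily the exact ones used in the data flow.

    import numpy as np
    import pandas as pd

    train = pd.read_csv("train.csv")

    # 1. Remove the outliers identified earlier (illustrative threshold)
    train = train[~((train["GrLivArea"] > 4000) & (train["SalePrice"] < 300000))]

    # 2. Add a calculated Total Surface Area column by summing existing area columns
    train["TotalSF"] = train["TotalBsmtSF"] + train["1stFlrSF"] + train["2ndFlrSF"]

    # 3. Add the log-transformed target
    train["SalePrice_log"] = np.log1p(train["SalePrice"])

    # 4. Drop a column that is useless for prediction
    train = train.drop(columns=["Id"])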

Something automatic appears to happen to nulls when converting between text and number datatypes; this requires further investigation.

Imputer for Nulls and NAs

Due to the lack of an ability to iterate over columns, the NAs in this dataset were not imputed individually, as this would have required an individual case statement for each column.  Instead, the selected algorithm had an imputer function, which allowed the user to select an imputation method for categorical or continuous variables.  The restriction here was that the user had to use the same imputation method for all columns of the same type; i.e. it wasn't possible to impute NAs in continuous variable A with the mean and NAs in continuous variable B with the median.
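
In Python the same imputation is a few lines, and it is trivial to mix strategies per column, which is exactly the flexibility that was missing here. A sketch (column names are from the Kaggle dataset, but the strategies are illustrative):

    import pandas as pd

    train = pd.read_csv("train.csv")

    # One strategy for every continuous column...
    numeric_cols = train.select_dtypes(include="number").columns
    train[numeric_cols] = train[numeric_cols].fillna(train[numeric_cols].mean())

    # ...or a different strategy for an individual column, which the OACS imputer cannot do
    # train["LotFrontage"] = train["LotFrontage"].fillna(train["LotFrontage"].median())

    # Categorical columns: fill with the most frequent value
    for col in train.select_dtypes(include="object").columns:
        train[col] = train[col].fillna(train[col].mode()[0])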

Model Selection and Training

There were 4 different models available in the “numeric prediction” model category.  

Note: we can add our own Python scripts containing custom models; this functionality will be evaluated later.

For the purpose of this experiment, "Linear Regression for Model Training" was selected.  This offered a Lasso regression method, which matches the Python code that was created as a baseline.

This area of OACS requires some knowledge of how to select an ML model.  If OACS is truly attempting to make data science available to everyone, there should be a help link explaining how a user selects a specific model, or at least a high-level description of what each model, or group of models, is generally good and bad at.   The trained model was saved; it lives in the "Machine Learning" section of OACS, available from the home menu.
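
For reference, the baseline's Lasso model boils down to a handful of scikit-learn calls. This is a minimal sketch; the feature list and alpha value are illustrative rather than the exact notebook settings.

    import numpy as np
    import pandas as pd
    from sklearn.linear_model import Lasso

    train = pd.read_csv("train.csv")

    # A small illustrative feature set; the real notebook uses many more columns
    features = ["GrLivArea", "OverallQual", "TotalBsmtSF", "GarageCars"]
    X = train[features].fillna(0)
    y = np.log1p(train["SalePrice"])   # model the log of the target, as above

    model = Lasso(alpha=0.0005)
    model.fit(X, y)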

Apply Model

Once we have a trained model, we need to apply it to the test dataset.   The test dataset is uploaded into OACS and saved as a dataset.   We then create another data flow in which we select 'Apply Model'; this presents a list of models, in which the model we just trained appears.   We apply our new model and select the name of the column that will hold the prediction.

Tidy Up and Submit

We are left with a dataset that contains just an ID and a prediction, ready for submission.
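
In the Python baseline, this apply-and-tidy-up step is a predict call and a two-column CSV. A sketch, continuing from the Lasso example above (model and features already defined):

    import numpy as np
    import pandas as pd

    test = pd.read_csv("test.csv")

    # Score the test set and undo the log transform applied during training
    X_test = test[features].fillna(0)
    predictions = np.expm1(model.predict(X_test))

    # Kaggle expects just an Id column and the predicted SalePrice
    submission = pd.DataFrame({"Id": test["Id"], "SalePrice": predictions})
    submission.to_csv("submission.csv", index=False)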



Results

Experiment     Accuracy Score (lower is better)    Time Taken (days)
Python Code    0.12429                             3
OACS           0.12895                             1

The table shows that the two linear regression experiments were very close in accuracy.  The Python code just edged out OACS, but OACS took significantly less time to get there.

In tying back to the hypothesis:

“I believe a user with minimal experience in data science and machine learning can create the products of a simple end-to-end Data Science process without writing any complex code, directly in Oracle’s cloud Analytics Tool, OACS – within half the time taken to hand code in Python”

  • No custom code was added into OACS to create the artefacts required
  • No significant training in machine learning would be required to enable a user to follow the steps; just a basic understanding of how a generic data science process works, for example which order to complete tasks in and what different models do.
  • All work was undertaken directly in OACS.
  • It took less than half the time to reproduce in OACS.

I believe this to be proven and the experiment a success, albeit with some caveats that will require further investigation. These are listed below.


Notable Points for Further investigation

Collaborating Online

This goes hand in hand with the ability to add our own scripts into the product.   Power BI has a great platform in which users can create visuals in R and share them amongst the online community; once a user has found a visual online, they can point and click to add it into their instance.  Something like this, an ability to collaborate and share code online, would really make the tool more appealing.  I wonder if this is why Oracle recently acquired DataScience.com?

Adding our own scripts into OACS

This functionality requires a working knowledge of Python and XML, so it is certainly not for a user with minimal ability and is excluded from the experiment.   It requires that the user defines a function in Python (or R) and puts it in a generic XML wrapper before uploading.

What would be very appealing is if the product were improved to remove the need for the XML wrapper; that way users could take code from other sites and try it out in the OACS environment efficiently.

Stacking Models

It seems this is possible.  It would require multiple data flows and a number of datasets, but that is no real problem, and it can greatly improve the accuracy of a model.
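
As a rough illustration of the idea in Python: the simplest version is a weighted blend of two models' predictions (true stacking would feed those predictions into a second-level model). The models and weights below are illustrative only.

    import numpy as np
    import pandas as pd
    from sklearn.ensemble import GradientBoostingRegressor
    from sklearn.linear_model import Lasso

    train = pd.read_csv("train.csv")
    features = ["GrLivArea", "OverallQual", "TotalBsmtSF", "GarageCars"]
    X = train[features].fillna(0)
    y = np.log1p(train["SalePrice"])

    lasso = Lasso(alpha=0.0005).fit(X, y)
    gbr = GradientBoostingRegressor().fit(X, y)

    # A simple weighted average of the two models' predictions
    blended = 0.6 * lasso.predict(X) + 0.4 * gbr.predict(X)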

Iterating over Columns

As mentioned previously, if the product could take a piece of simple OACS code (e.g. a CASE statement) and run it over each column, or a selected group of columns, in a dataset, this would save huge amounts of time.
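
For comparison, this is a one-liner per rule in Python. A sketch of a CASE-style rule ("when the value is missing then 'None' else keep it") applied across a selected group of columns:

    import pandas as pd

    train = pd.read_csv("train.csv")

    # Apply the same rule to a selected group of columns in one pass
    basement_cols = ["BsmtQual", "BsmtCond", "BsmtExposure"]
    for col in basement_cols:
        train[col] = train[col].where(train[col].notnull(), "None")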

Sequences

All the data flows mentioned in the experiment can be put together in sequences and run back to back, to create one "ETL"-style pipeline.  The experiment did not touch on this functionality.
