AI vs Machine Learning: Why Data Volume Matters

on January 15, 2018

So many corporations are rushing to use AI, but do they really need it?

Corporations should not rush into adopting AI just because it is the latest buzzword; every business decision should be made with intent. Business, in essence, is a min-max problem: minimize your costs and maximize your revenue.

What can AI do for your business? What sort of problem are you attempting to solve? These are the questions you should ask yourself before you attempt to hire a PhD in Mathematics to “implement AI”. I lay out a case here where a Machine Learning algorithm beats Neural Networks (AI). Hiring someone with expertise in both is great, but hiring someone with Machine Learning knowledge may be more useful to your business (and probably more affordable).

I present here a simple case study showing that Machine Learning algorithms can beat Neural Networks on a relatively small data set (about 4000 rows, or roughly 1.5 million data points).

Project Context

The premise of this project is to compare Neural Networks against Machine Learning algorithms. The data set comes from a previous Kaggle competition hosted by Daimler Mercedes-Benz. Their cars come in a huge variety of models and specifications, and testing all the features and options takes different amounts of time. In this challenge Kagglers were asked to predict the time a car with a given set of features and options would spend on the test bench, which would “contribute to speedier testing, resulting in lower carbon dioxide emissions without reducing Daimler’s standards.”

In other words predict time based on some variables.

Most Kaggle competitions are set up in a way where you have a training data set and a testing data set.

Training data set — gives you both what you are trying to predict and the variables. In this case that is the time (the ‘y’ column) plus all the other variables (X0, X1, X2, and so on). This is the data set used to train your models. For this competition the training data set had 4209 rows and 378 columns.

[Figure: part of the training data set]

Testing data set — gives you just the variables, so you plug it into your model and see what prediction you get. You then submit that prediction to Kaggle, where it is given a score. The test data set had 4209 rows and 377 columns (it is missing the time to be predicted).
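As a sketch of how these files are typically handled, here is a minimal example using pandas on a tiny synthetic stand-in for the real CSVs. The column names follow the competition's layout, but the values below are made up, not the actual Kaggle data:

```python
import pandas as pd

# Tiny synthetic stand-in for the Kaggle files (the real train.csv has
# 4209 rows and 378 columns: ID, y, and features X0, X1, ... X8, X10, ...).
train = pd.DataFrame({
    "ID": [0, 6, 7],
    "y": [101.5, 88.2, 76.3],          # time on the test bench (invented values)
    "X0": ["k", "az", "t"],            # categorical feature
    "X10": [0, 1, 0],                  # binary feature
})
test = train.drop(columns="y")         # the test file ships without the target

# Separate the target from the predictors before training a model.
X_train = train.drop(columns=["ID", "y"])
y_train = train["y"]
print(X_train.shape, test.shape)
```

The same `drop` pattern scales unchanged to the real 4209-row files.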

The score on Kaggle is produced by a metric set out in the competition outline. In this instance they used R squared, and I will compare my Neural Network submission against one of the top submissions. R squared usually produces a value between 0 and 1 (it can even go negative for very poor models); the closer it is to 1, the better your model (this is an ultra-simplified version of what it is).
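As a small illustration of the metric, scikit-learn's `r2_score` compares predictions against true values. The numbers below are made up for the example, not competition results:

```python
from sklearn.metrics import r2_score

# Hypothetical test-bench times (true) and a model's predictions.
y_true = [100.0, 90.0, 110.0, 95.0]
y_pred = [98.0, 92.0, 107.0, 96.0]

score = r2_score(y_true, y_pred)
print(score)  # close to 1 because the predictions track the true values well
```

A constant prediction at the mean of `y_true` would score exactly 0, and predictions worse than that go negative.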

Neural Network Setup (Technical babble)

This project mainly uses Keras with TensorFlow as the backend; essentially I use MLPs (multilayer perceptrons) to predict the ‘y’ values.

I first attempted this without any dimensionality reduction, and later applied PCA and ICA, then compared which resulted in the better prediction model (I actually ended up using F-regression feature selection instead). I applied dimensionality reduction because there are so many features, and looking at the top submissions mentioned above you can see that many of them used dimensionality reduction techniques before feeding the data into an XGBoost model.
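A hedged sketch of the three approaches mentioned (PCA, ICA, and F-regression feature selection), using scikit-learn on synthetic data; the component counts are illustrative, not the values used in the project:

```python
import numpy as np
from sklearn.decomposition import PCA, FastICA
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic data: 300 rows, 50 features, with only a few informative columns.
rng = np.random.default_rng(1)
X = rng.random((300, 50))
y = 3 * X[:, 0] + X[:, 1] - 2 * X[:, 2] + 0.1 * rng.standard_normal(300)

# PCA and ICA transform the features into a smaller set of components.
X_pca = PCA(n_components=10).fit_transform(X)
X_ica = FastICA(n_components=10, random_state=1).fit_transform(X)

# F-regression instead *selects* the k original features most related to y.
X_freg = SelectKBest(f_regression, k=10).fit_transform(X, y)

print(X_pca.shape, X_ica.shape, X_freg.shape)
```

The key difference: PCA/ICA ignore `y` entirely, while F-regression scores each feature against the target, which is why it can behave quite differently on a data set like this one.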

A quick note: in this competition there was leaked information in the ID column, meaning there was a very high correlation between “y” (the time to predict) and “ID” (the index used).
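To illustrate how such a leak shows up, here is a correlation check on synthetic data (the real competition values are not reproduced here; the 0.05 slope is an invented stand-in for the leak):

```python
import numpy as np

# If the row ID carries information about y, their correlation is far from 0.
ids = np.arange(1000)
y = 0.05 * ids + np.random.default_rng(2).standard_normal(1000)

leak = np.corrcoef(ids, y)[0, 1]
print(leak)  # near 1: the "index" predicts the target, which it never should
```

Running the same one-liner on a real training set is a cheap sanity check before modelling.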

Result


In this post I focus on the Public Leaderboard scores for ‘reduced’ and ‘xgboost’. For the other results you can read my project file at the bottom of the post. The top scorer managed a score of 0.5697 compared to my 0.52723. You might think that is fairly close, but that is the difference between the top 5% and the bottom 5% of predictions on Kaggle. My model took an average of ~5 minutes to train on my own computer; their model took 30 seconds to a minute on the tiny server provided by Kaggle. When you build a Neural Network there is a lot of guesswork involved: the number of layers, the nodes per layer, and what to feed into the model. This gets a little easier if someone has already created a neural network for a similar purpose that is suitable for your use case.

One thing I learnt from this entire process is that you need a lot of data before you should even think about using Neural Networks. In this project we had ~4000 rows and ~380 columns, giving us ~1.5 million data points, and the Neural Network still couldn’t beat simpler machine learning algorithms.

I will concede one point where Neural Networks beat Machine Learning algorithms: over-fitting. With any model you create, you train on a smaller sample first so that the model scales to however many data points you have in the future. This reduces the chance of your model fitting only your current data; the obvious weakness is that you cannot be maximally accurate, since you are hiding a lot of data from the model. In Neural Networks the over-fitting issue is much less severe: I managed to use around 60% of the ~4000 rows to train my model, whereas with classical machine learning algorithms you would normally fit on only around 20% to 25% of your data.
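The differing training fractions described above can be sketched with scikit-learn's `train_test_split`; the 60% and 25% figures come from the paragraph, while the data itself is a synthetic placeholder:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# ~4000 rows, mirroring the size of the competition's training set.
X = np.arange(4000).reshape(-1, 1)
y = np.arange(4000)

# Train on ~60% of the rows, as in the neural-network run described above...
X_tr60, X_hold60, y_tr60, y_hold60 = train_test_split(
    X, y, train_size=0.60, random_state=0)

# ...versus fitting a classical model on a much smaller ~25% slice.
X_tr25, X_hold25, y_tr25, y_hold25 = train_test_split(
    X, y, train_size=0.25, random_state=0)

print(len(X_tr60), len(X_tr25))
```

The held-out remainder in each case is what the model never sees, which is the guard against fitting only the data you currently have.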

I would suggest that companies with less data try a machine learning implementation first, before even thinking about AI and deep learning. The same applies to individuals who are interested in the field.

In the end you are trying to solve a problem, so why over-complicate it?

About Me

I currently work for a startup in Hong Kong called Accelerate. We provide a platform for those who are interested in breaking into the tech industry through coding boot camps. Hong Kong may seem like a tech hub to outsiders, but in reality we are nowhere near cities like Shanghai, New York, or Silicon Valley. Our goal is to empower individuals with the skills and technical knowledge to drive technological change and improve people’s lives.

Link to project on github

For technical setup and process click here.


  AI
  machine learning