Data Science Day 1 – Machine Learning for the Rest of Us

My last blog post was more than 9 months ago and, at this rate, this blog should probably be presumed dead since not even I really recognize it. Still, here I am again, attempting to breathe life back into this oft-neglected space, and what better way to do it than with a series that focuses on all things machine learning?

Over the course of this blog series, I will introduce a broad and diverse range of machine learning topics from the perspective of a practitioner. These topics will range from a focus on various models and techniques such as classification, clustering and recommenders to a discussion about tools, platforms and challenges such as operationalization of machine learning models.

I’ll spare you the typical machine learning is pervasive/disruptive/transformative/{insert descriptive adjective here} pep talk, as I highly suspect that if you are reading this you already understand the value of machine learning. Instead, let’s dive right into the introduction.

Introduction to Predictive Analytics

Predictive analytics, in its simplest form, involves using past or historical data (called experiences) to predict future events or behavior. A variety of statistical, machine learning and data mining techniques can be used to this end, all of which we will classify as either supervised or unsupervised learning techniques.

Supervised Learning

In supervised machine learning, we are interested in training a model using past data where the desirable behavior or event, such as a purchase or click, is known. This data is said to be labelled, and the label is sometimes referred to as the ground truth or our target. The inputs into the model, known as attributes or features, are used to find patterns which can be exploited to predict the target.

This training step, which relies on labelled data, is what differentiates supervised from unsupervised techniques. Following model training, it is necessary to test our model on labelled data that it has not seen, often called held-out data. During this step, the model is used to predict our target (i.e. will the user buy, click, etc.) on the test data, and that prediction is then evaluated against the ground truth to determine how well the model performs.
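To make the train-then-test idea concrete, here is a minimal sketch in plain Python. Everything in it is an assumption made up for illustration: the data is synthetic (minutes on site versus whether the visitor bought), and the "model" is just a single learned threshold, the simplest classifier I could think of. It is not meant as a recommended approach, only as a picture of the workflow.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

# Synthetic labelled data (an assumption for illustration):
# feature = minutes on site, label = 1 if the visitor bought, else 0.
data = []
for _ in range(200):
    minutes = random.uniform(0, 20)
    bought = 1 if minutes + random.gauss(0, 2) > 10 else 0  # noisy ground truth
    data.append((minutes, bought))

# Hold out 25% of the labelled data for testing.
random.shuffle(data)
split = int(len(data) * 0.75)
train, test = data[:split], data[split:]

def accuracy(threshold, rows):
    """Fraction of rows where 'predict buy if minutes > threshold' matches the label."""
    correct = sum(1 for x, y in rows if (1 if x > threshold else 0) == y)
    return correct / len(rows)

def fit_threshold(rows):
    """Training step: pick the threshold that best separates the training labels."""
    candidates = sorted(x for x, _ in rows)
    return max(candidates, key=lambda t: accuracy(t, rows))

threshold = fit_threshold(train)          # learn from the training data only
test_accuracy = accuracy(threshold, test)  # evaluate against held-out ground truth

print(f"learned threshold: {threshold:.2f} minutes")
print(f"accuracy on held-out data: {test_accuracy:.2f}")
```

The important part is that `fit_threshold` only ever sees `train`, while the accuracy we report comes from `test`, data the model has never seen.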


Within the context of predictive analytics, there are two classes of techniques that you will encounter in the supervised space: classification and regression. Classification simply means that we are trying to classify things or events into one of two (binary) or one of several (multiclass) groups using a technique such as decision trees, logistic regression or neural networks.

Examples of classification activities are:

  1. Binary – Will customer X buy?
  2. Binary – Is the transaction fraudulent?
  3. Binary – Is the email spam?
  4. Multiclass – What data plan is best for the customer? (Economy/Standard/Premium)

Regression, on the other hand, is used when the target we are trying to predict is numerical. A technique like linear regression could be used to predict targets such as the call volume for a call center or the sales volume forecast for a product line.
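As a rough illustration of regression, the sketch below fits a straight line with ordinary least squares in plain Python. The numbers are invented for the example (active customers in thousands versus daily call volume), and the names `fit_line` and `predict` are my own, not from any library.

```python
# Hypothetical history, invented for illustration:
# active customers (thousands) and the daily call volume observed.
customers = [10, 12, 15, 18, 20]
calls = [205, 238, 300, 362, 395]

def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    return slope, intercept

slope, intercept = fit_line(customers, calls)

def predict(x):
    """Predicted daily call volume for x thousand active customers."""
    return intercept + slope * x

print(f"calls = {intercept:.1f} + {slope:.1f} * customers")
print(f"forecast for 16k customers: {predict(16):.0f} calls")
```

Unlike the classification examples above, the output here is a number on a continuous scale rather than a group.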

Unsupervised Learning

Whereas in supervised learning we had to train a model on labelled data before it was useful, unsupervised techniques have no such requirement. In fact, using one of the most common unsupervised learning techniques, called k-means clustering, our data is divided into distinct groups based on the distance between points. The members of each group share similar traits, which can be exploited for predictive purposes.


This is common in activities such as customer segmentation or in churn analysis. In both of these activities, the groups formed as part of the clustering exercise are used to predict whether a customer will or won’t buy, whether they will upgrade or even whether they will stay.
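For the curious, here is a bare-bones sketch of k-means in plain Python, applied to a toy customer segmentation. The customer data (monthly spend, visits per month) is made up, and the `kmeans` function is my own simplified version for illustration; real libraries add smarter initialization and other refinements.

```python
import math
import random

def kmeans(points, k, iterations=20, seed=0):
    """Simplified k-means: returns (centroids, labels). Illustrative only."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k randomly chosen points
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = []
        for i, cluster in enumerate(clusters):
            if cluster:
                new_centroids.append(tuple(sum(c) / len(cluster) for c in zip(*cluster)))
            else:
                new_centroids.append(centroids[i])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # assignments have stopped changing
        centroids = new_centroids
    labels = [min(range(k), key=lambda i: math.dist(p, centroids[i])) for p in points]
    return centroids, labels

# Made-up customers: (monthly spend, visits per month). No labels anywhere.
customers = [(10, 1), (12, 2), (11, 1), (13, 2),
             (80, 9), (85, 10), (82, 11), (90, 12)]

centroids, labels = kmeans(customers, k=2)
for customer, label in zip(customers, labels):
    print(customer, "-> segment", label)
```

Notice that we never told the algorithm which customers were which. The two segments fall out purely from the distance between points, and the segment numbers themselves are arbitrary.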

So, the next logical question you are probably asking is: how do you choose the most appropriate technique for task X? Well, that is a far more nuanced question that requires a lot more than a single post. My hope is that I will help you discover and answer that question for yourself over the course of this series.

Wrap-up and what’s next?

The purpose of this post was to build a foundation and introduce machine learning in the broadest possible sense. With the common techniques introduced and the common terminology familiar, the stage is set. In the next post we will begin looking at classification techniques for binary and multiclass predictive analytics.

We will start by exploring some of the most common techniques, including linear and non-linear classifiers and decision trees, among others. This will then become the basis for a more formal discussion of the data science process, including data prep/wrangling, feature selection, data partitioning, model selection and model evaluation.

Till next time!


