Data Science Day #8 – Model Selection

In our last post, we discussed several metrics that we have at our disposal for evaluating a classification model. While this evaluation process is leading us down the road towards model selection, we first need to consider optimizing, or tuning, our model for the task at hand.

Model Parameters

Over the course of this series we introduced a number of different algorithms (by no means an exhaustive list) that can be used for both binary and multiclass classification tasks. All of these models require parameters to do their jobs. Some parameters are learned during the training process and do not require any special handling, while others are set before training and can be tuned to affect the behavior of the model. The latter are often referred to as hyperparameters and will be the focus of this post.


My highly uneducated and unofficial guess is that many of you, like me, haven’t the slightest clue what a given hyperparameter is or what impact it will have on an algorithm. Couple that with the fact that different algorithms have different hyperparameters and it’s enough to make your head spin. Fear not, there is a light at the end of the tunnel and we will arrive there soon. Before we do, however, let’s consider some of the hyperparameters you are likely to encounter.

  • Logistic Regression – Regularization (often referred to as ‘C’)
  • SVM – Kernel, Gamma and Regularization
  • Decision Trees – Depth
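To make the list above a little more concrete, here is a minimal sketch that asks scikit-learn for the default values of these hyperparameters using get_params(), which every scikit-learn estimator provides. The choice of which keys to print for each model is mine and purely illustrative.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Illustrative selection of hyperparameters to inspect for each model
models = {
    'Logistic Regression': (LogisticRegression(), ['C']),
    'SVM': (SVC(), ['kernel', 'gamma', 'C']),
    'Decision Tree': (DecisionTreeClassifier(), ['max_depth']),
}

defaults = {}
for name, (model, keys) in models.items():
    params = model.get_params()  # dict of every hyperparameter and its current value
    defaults[name] = {key: params[key] for key in keys}
    print(name, defaults[name])
```

These defaults are what we have been silently accepting every time we created a model without arguments.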

Optimal Configuration and Grid Search

To date, we’ve largely ignored algorithm hyperparameters by accepting the default values. Without understanding the inner workings of an algorithm, it’s difficult to find the optimal configuration by anything other than iterative guesswork. Enter hyperparameter tuning in the form of a technique called grid search.

The grid search technique is simply a brute-force method where we exhaustively iterate over every combination of a set of candidate parameter values to find the optimal configuration for a model. Once all the possible combinations have been considered, the “best” model is returned. As you can imagine, this is a computationally expensive process, since the number of models to train multiplies with every parameter added to the grid. Many machine learning libraries have built-in implementations of this technique; conveniently enough, the scikit-learn implementation is called GridSearchCV.
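To get a feel for how quickly the work adds up, here is a small sketch that enumerates a grid without training anything. It uses scikit-learn’s ParameterGrid helper on an SVM-style grid of two kernels and eight values of C; the fold count of 5 is an assumption for illustration.

```python
from sklearn.model_selection import ParameterGrid

parameters = {'kernel': ['linear', 'rbf'],
              'C': [0.0001, 0.001, 0.01, 0.1, 1.0, 10.0, 100.0, 1000.0]}

# Every combination of kernel and C that a grid search would try
grid = list(ParameterGrid(parameters))

n_folds = 5  # assumed number of cross validation folds
print('Combinations: %d' % len(grid))
print('Model fits with %d-fold CV: %d' % (n_folds, len(grid) * n_folds))

# Peek at the first few candidate configurations
for combo in grid[:3]:
    print(combo)
```

Two kernels times eight values of C gives 16 combinations, and each one is trained once per fold, so even this tiny grid means 80 model fits.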

To use this technique we need to think through a couple of things up front. First and most importantly is a scoring metric. Going back to our prior post, the scoring metric depends largely on the task at hand. If we choose something like accuracy, then GridSearchCV evaluates all the possible combinations of parameter values and returns the “best” model, meaning the one with the highest accuracy score.

Next we need to specify a list or range of possible parameter values that we want to evaluate. For obvious reasons these vary between algorithms, and if you are not sure where to start, your friendly inter-webs search engine or the documentation for the ML library of your choice is going to be your friend.

With these two pieces in place we can use the code provided below to run our grid search to find the optimal model configuration.

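# Note: sklearn.cross_validation and sklearn.grid_search were merged into
# sklearn.model_selection in later scikit-learn releases
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report
from sklearn.svm import SVC

# Candidate values to search: 2 kernels x 8 values of C = 16 combinations
parameters = {'kernel':('linear', 'rbf'), 
              'C': [0.0001, 0.001, 0.01, 0.1, 1.0, 10.0, 100.0, 1000.0]}
svc = SVC()

# X_train and y_train are assumed to come from an earlier train_test_split
clf = GridSearchCV(estimator = svc, 
    param_grid = parameters,
    scoring ='accuracy',
    cv = 5,
    n_jobs =-1)
clf.fit(X_train, y_train)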

print('Best Config: %s' % clf.best_params_)
print('Best Score: %.3f' % clf.best_score_)
Best Config: {'kernel': 'linear', 'C': 1.0}
Best Score: 0.971

The preceding code outputs the optimal hyperparameter configuration as well as the highest cross-validated score for the best performing estimator. To see how our model performs on unseen data, we then train this optimal model and score it using our test data as seen below. The results show that in this particular example our model generalizes pretty well and actually has a slightly higher accuracy score against our test dataset.

best = clf.best_estimator_
best.fit(X_train, y_train)

print('Score (Accuracy): %.3f' % best.score( X_test, y_test))
Score (Accuracy): 0.978

Nested Cross Validation

We originally discussed cross validation, and more specifically k-fold cross validation, in the prior post on model evaluation. We will extend that concept here in the context of model selection, when we are interested in comparing one model to another while minimizing bias. Nested cross validation uses k-fold cross validation twice: an inner loop tunes the hyperparameters and an outer loop evaluates the tuned model on data the tuning never saw. The result is a score (in our case an accuracy score) for each outer fold. Let’s look at an example.

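import numpy as np
from sklearn.model_selection import cross_val_score

# clf is the GridSearchCV object, so every outer fold runs its own grid search
cv_scores = cross_val_score(clf, X, y, scoring='accuracy', cv=5)

print(cv_scores)
print('Accuracy: %.3f +/- %.3f' % (np.mean(cv_scores), np.std(cv_scores)))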
[ 0.96666667  1.          0.9         0.96666667  1.        ]
Accuracy: 0.967 +/- 0.037

A quick look at the results using 5 folds shows that our average model accuracy across the held-out folds is 96.7% with a standard deviation of 3.7%. Looking at the detailed scores, we see that our estimator was perfect on two folds and had a low score of 90%. If we were evaluating multiple models, for example a Logistic Regression or Decision Tree, we could compare these scores to determine which model is likely to provide the best results.
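The comparison described above could be sketched as follows. This is only an illustration under my own assumptions: the full Iris dataset, a 3-fold inner loop, a 5-fold outer loop, and small parameter grids that I picked arbitrarily for each model.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Candidate models with illustrative (assumed) parameter grids
candidates = {
    'Logistic Regression': (LogisticRegression(max_iter=1000),
                            {'C': [0.01, 0.1, 1.0, 10.0, 100.0]}),
    'Decision Tree': (DecisionTreeClassifier(random_state=0),
                      {'max_depth': [1, 2, 3, 4, 5, None]}),
}

results = {}
for name, (estimator, grid) in candidates.items():
    # Inner loop: the grid search tunes hyperparameters on each training split
    inner = GridSearchCV(estimator, param_grid=grid, scoring='accuracy', cv=3)
    # Outer loop: scores the tuned model on folds the search never saw
    scores = cross_val_score(inner, X, y, scoring='accuracy', cv=5)
    results[name] = scores
    print('%s: %.3f +/- %.3f' % (name, np.mean(scores), np.std(scores)))
```

Whichever model shows the higher mean (and ideally the smaller spread) across the outer folds is the more promising candidate.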


In this post we discussed model selection through brute force hyperparameter tuning using grid search and nested cross validation as a means to minimize bias during evaluation. With that we’ve covered the basics within the context of a classification problem. In our next post we will discuss a means for tying all these concepts together in a streamlined and repeatable manner using pipelines.


Till next time!


Data Science Day #6 – Dimensional Reduction

Between fending off bad food and crisscrossing time zones, the wheels have come off this series’ schedule. At any rate, in the last post, the fifth iteration of this blog series, we looked at several techniques for ranking feature importance and then one method for subsequently doing feature selection using recursive feature elimination (RFE). We used these feature selection methods as a means to address datasets which are highly dimensional, simplifying the data in the hope that it improves our machine learning model. In this post, we will wrap up data preparation by looking at another way to handle highly dimensional data through a process known as feature extraction or, more accurately, dimensional reduction.

Feature Selection vs Dimensional Reduction

Before diving in, it’s worth spending a few moments to set some context. We spent the prior post discussing feature selection, and for many this probably seems relatively intuitive. If we think in terms of a dataset with many columns and rows, feature selection is all about reducing the number of columns by simply removing them from the dataset. In this case the removed data is in effect lost, since it is not included when we go through our model training process. We attempt to minimize this data loss by selecting the most meaningful features, as we saw previously.

Dimensional Reduction, on the other hand, is more of a compression technique, so to speak. Specifically, it seeks to compress the data while losing as little information as possible. Continuing with our analogy, we reduce the number of features or columns in our data by combining or projecting two or more of the columns into a single new column. This in essence is a transformation being applied to your data, and as we will see shortly there are two common methods used to accomplish this.

Principal Component Analysis (PCA)

PCA is an unsupervised reduction technique that tries to maximize the variance along new axes within a dataset. Don’t worry if your head is spinning and the math nerds around you are giggling. Let’s look at an example to better illustrate what PCA does and how it works. We will look at both a linear and a non-linear data set, illustrated below.

Since we are interested in doing classification on these data sets, and most methods look to linearly separate data for this purpose, the ideal outcome of any technique applied to the data would be to enhance the divisions in the data to make it more readily separable. Now obviously these data sets are neither big nor particularly complex, but let’s see if we can use PCA to reshape them.

Linear Data Set
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification

X, y = make_classification(n_features=6, n_redundant=0, n_informative=6,
                    random_state=1, n_clusters_per_class=1, n_classes=3)

plt.scatter(X[y==0, 0], X[y==0, 1], color='red', alpha=0.5)
plt.scatter(X[y==1, 0], X[y==1, 1], color='blue', alpha=0.5)
plt.scatter(X[y==2, 0], X[y==2, 1], color='green', alpha=0.5)


Non-Linear Data Set
from sklearn.datasets import make_circles

X, y = make_circles(n_samples=1000, random_state=123, noise=0.1, factor=0.2)


plt.scatter(X[y==0, 0], X[y==0, 1], color='red', alpha=0.5)
plt.scatter(X[y==1, 0], X[y==1, 1], color='blue', alpha=0.5)
plt.title('Concentric circles')


Beginning with the linear example, we created a “highly” dimensional set with six meaningful features. The initial plot, which shows only the first two of them, is a little bit of a mess, so we will apply PCA to the data set and reduce the dimensionality to just two features. The code sample provided uses scikit-learn and then plots the results. As you can see we now have a much simpler data set that is clearly easier to separate.

from sklearn.decomposition import PCA

pca = PCA(n_components=2) 
X_pca = pca.fit_transform(X)

plt.scatter(X_pca[y==0, 0], X_pca[y==0, 1], color='red', alpha=0.5)
plt.scatter(X_pca[y==1, 0], X_pca[y==1, 1], color='blue', alpha=0.5)
plt.scatter(X_pca[y==2, 0], X_pca[y==2, 1], color='green', alpha=0.5)

plt.title('Linear PCA')
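If you are curious what “maximizing variance along new axes” looks like in practice, here is a rough sketch of PCA done by hand with NumPy: center the data, compute the covariance matrix, and project onto the eigenvectors with the largest eigenvalues. This is the textbook eigen-decomposition route, written for illustration; I am not claiming it is how scikit-learn implements PCA internally, only that the two agree on the result.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA

X, y = make_classification(n_features=6, n_redundant=0, n_informative=6,
                           random_state=1, n_clusters_per_class=1, n_classes=3)

# 1. Center the data so each feature has a mean of zero
X_centered = X - X.mean(axis=0)

# 2. Covariance matrix of the six features (6 x 6)
cov = np.cov(X_centered, rowvar=False)

# 3. Eigen-decomposition; eigh returns eigenvalues in ascending order, so re-sort
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# 4. Project onto the two directions with the largest variance
X_manual = X_centered @ eigenvectors[:, :2]

# Compare with scikit-learn (the sign of each component is arbitrary,
# so compare magnitudes rather than raw values)
pca = PCA(n_components=2)
X_sklearn = pca.fit_transform(X)
print(np.allclose(np.abs(X_manual), np.abs(X_sklearn), atol=1e-6))
print(np.allclose(eigenvalues[:2], pca.explained_variance_, atol=1e-6))
```

The eigenvalues are the variances along the new axes, which is exactly what PCA reports as explained variance.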


The next example is more challenging and is representative of a non-linear dataset. The concentric circles are not linearly separable, and while we could apply a non-linear technique we can also use PCA. When dealing with non-linear data, if we just apply standard PCA, the results are interesting but still not linearly separable. Instead, we will use a special form of PCA that leverages the same kernel trick we saw applied to support vector machines way back in the second post. This implementation is called KernelPCA in scikit-learn, and the code and resulting plot are provided below.

Linear Techniques against Non-Linear Data
import numpy as np
from sklearn.decomposition import PCA

pca = PCA(n_components=2) 
X_pca = pca.fit_transform(X)

# Plot the first principal component only, offsetting each class vertically
plt.scatter(X_pca[y==0, 0], np.zeros((500,1))+0.1, color='red', alpha=0.5)
plt.scatter(X_pca[y==1, 0], np.zeros((500,1))-0.1, color='blue', alpha=0.5)
plt.title('Linear PCA on Non-Linear Data')


Kernel PCA on Non-Linear Data
from sklearn.decomposition import KernelPCA

kpca = KernelPCA(n_components=2, kernel='rbf', gamma=15)
X_kpca = kpca.fit_transform(X)

plt.scatter(X_kpca[y==0, 0], X_kpca[y==0, 1], color='red', alpha=0.5)
plt.scatter(X_kpca[y==1, 0], X_kpca[y==1, 1], color='blue', alpha=0.5)

plt.title('Kernel PCA')


Note that we’ve transformed a data set that was not linearly separable into one that is, and can now use linear-based machine learning techniques like logistic regression, support vector machines or naïve Bayes.

The one point we haven’t discussed is the number of components we should target during the reduction process. The samples, as you’ve probably caught on, somewhat arbitrarily use two, even though in the case of the linear dataset we have six meaningful features. We can plot the amount of explained variance for each component using PCA, as seen in the code below. The resulting plot allows us to select a number of components that explains most of the variance within our dataset.

from sklearn.decomposition import PCA
pca = PCA() 

X_pca = pca.fit_transform(X)

plt.figure(1, figsize=(4, 3))
plt.plot(pca.explained_variance_, linewidth=2)
plt.ylabel('explained variance')
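If you would rather pick the number of components from numbers than from a plot, one common approach is to keep adding components until the running total of explained variance crosses a threshold. The sketch below does that with explained_variance_ratio_; the 95% cut-off is my assumption for illustration, not a rule.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA

X, y = make_classification(n_features=6, n_redundant=0, n_informative=6,
                           random_state=1, n_clusters_per_class=1, n_classes=3)

pca = PCA().fit(X)

# Running total of the share of variance explained by the first k components
cumulative = np.cumsum(pca.explained_variance_ratio_)

threshold = 0.95  # assumed cut-off for illustration
# Position of the first component where the running total reaches the threshold
n_components = int(np.argmax(cumulative >= threshold)) + 1

print('Cumulative explained variance:', np.round(cumulative, 3))
print('Components needed for %.0f%%: %d' % (threshold * 100, n_components))
```

Because all six features in this dataset are informative, do not be surprised if most of the components are needed to reach the threshold.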


Linear Discriminant Analysis (LDA)

LDA, unlike the PCA discussed in the previous section, is a supervised method that seeks to optimize class separability. Since it is a supervised method for dimensional reduction, it uses the class labels during the training step to do its optimization. Note that LDA is a linear method and that it cannot (or more accurately…should not) be applied to non-linear datasets. That being said, let’s look at a code example as applied to our linear dataset.

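# Note: sklearn.lda was replaced by sklearn.discriminant_analysis in later scikit-learn releases
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA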

lda = LDA(n_components=2) 
X_lda = lda.fit_transform(X,y)

plt.scatter(X_lda[y==0, 0], X_lda[y==0, 1], color='red', alpha=0.5)
plt.scatter(X_lda[y==1, 0], X_lda[y==1, 1], color='blue', alpha=0.5)
plt.scatter(X_lda[y==2, 0], X_lda[y==2, 1], color='green', alpha=0.5)

plt.title('Linear LDA')


In the preceding code, we used our sample data to train the model and then transform, or reduce, our feature set from six features down to two. Both were accomplished in the fit_transform step. When we plot the result we can observe at first glance that using the LDA method has resulted in better linear separability among the three classes of data.
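One detail worth knowing when choosing n_components for LDA: because the new axes are built to separate classes, you can get at most one fewer component than you have classes. Here is a small sketch using the Iris dataset (my choice of data, simply because it ships with scikit-learn and has three classes).

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)
n_classes = len(set(y))

# With three classes, LDA can produce at most n_classes - 1 = 2 components
lda = LinearDiscriminantAnalysis(n_components=n_classes - 1)
X_lda = lda.fit_transform(X, y)  # the labels y are required, unlike PCA

print('Classes: %d' % n_classes)
print('Shape before:', X.shape, 'after:', X_lda.shape)
```

So for a binary problem LDA would reduce everything down to a single column, no matter how many features you start with.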


In this post, we looked at using Principal Component Analysis and Linear Discriminant Analysis for reducing the dimensionality of both linear and non-linear data. The one question that we left unanswered, however, is: given the two methods, which one is better? Unfortunately, as you might have guessed, “better” is relative and there is not a clear answer. As with most things in this space, they are both useful to varying degrees in various situations. In fact, sometimes they are even used together. I’d invite you to explore both of these in more depth if you are so inclined.

In our next post on this random walk through machine learning, we are going to leave behind data preparation and move on to modeling or more accurately model selection. We will look at how models are evaluated and some of the common measures you can use to determine how well your selected model performs.

Till next time!


Data Science Day #4 – Data Wrangling

In the preceding post we spent a considerable amount of time understanding both supervised and unsupervised machine learning methods for doing predictions or classifications. With that context, we are going to loop back and begin working our way through a “typical” machine learning project.

Project Framework

Like most other technology projects, there is a common framework for tackling machine learning projects that we can use to guide us from start to finish. Known as CRISP-DM, or the Cross-Industry Standard Process for Data Mining, this iterative framework sets up guide posts that we will use.


While it is beyond the scope of this post to discuss each step in detail, we can summarize the broad steps.

  • Business Understanding – understand the project objectives and define business requirements. This is the step where we outline the problem definition.
  • Data Understanding – initial data collection, data familiarization, profiling and identification of data quality problems. Typically this is the most time consuming step with some estimates saying 90% of project time is spent here.
  • Data Preparation – this is where we do our data wrangling or data preparation and transformation. We want to identify not only the data we will use but also the relevant features for the task at hand.
  • Modeling – model selection, including model tuning and calibration
  • Evaluation – evaluate or measure model results against the stated business objectives
  • Deployment – deploy and operationalize model including maintenance plans for the required care and feeding

Note that this process is highly iterative. Feedback is important from step to step and the process is evolutionary in that feedback is often reincorporated in previous steps meaning that where you start is likely to not be where you end up.

Business & Data Understanding

We are not going to spend any real time discussing these points beyond saying there are a plethora of tools and techniques that you can leverage on the technical side. If you have spent time doing traditional ETL or business intelligence types of tasks then you have inevitably encountered data quality issues which must be corrected before moving forward.

Data Wrangling

The remainder of this post will focus on data preparation, or wrangling the data into a state in which it can be used for machine learning. Specifically we will focus on five of the most common tasks encountered: handling missing data, dealing with outliers, handling categorical or non-numeric data, binning or bucketing numeric data and finally data standardization or normalization.

Handling Missing Data

Data is not only often dirty, it is often incomplete. Missing data elements in your dataset can appear as NULL values, blanks or other placeholders. Not only do they happen regularly in the real world, they are often beyond your control. Compounding the problem, these missing values are typically incompatible with machine learning models or can lead to suboptimal results. So what can we do when we encounter this scenario?

The first and most obvious answer is that the data can be removed. When we talk about data removal, this can mean eliminating the offending row or record of data or eliminating the entire feature (column) from the data set. But removing this data comes at the cost of information loss and can be particularly problematic in smaller datasets where every observation or record is valuable.

So we need an alternative, and that comes in the form of what’s known as imputed values. As the name implies, this technique for handling missing values allows us to substitute the missing value with a meaningful value. This meaningful value is generated using a strategy that is typically based on one of the measures of central tendency (mean, median, or mode, the most common value).

We can demonstrate this technique using the sample provided straight out of the scikit-learn documentation. Note that there are equivalent features in nearly every toolset (R, AzureML, etc.) if you are not using Python and scikit-learn.

In the sample code below, note that we create our imputer using the mean strategy and identify what constitutes a missing value (NULLs are represented as NaN in Python). Next we go through a fit process where the imputer learns the mean (or whatever strategy you chose) for the data set. The transform call on the imputer substitutes out the missing values in the dataset. The sample code and output are provided below.

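import numpy as np
# Note: Imputer was replaced by sklearn.impute.SimpleImputer in later scikit-learn releases
from sklearn.impute import SimpleImputer

# Learn the per-column means from the training data
imp = SimpleImputer(missing_values=np.nan, strategy='mean')
imp.fit([[1, 2], [np.nan, 3], [7, 6]])

# Column 0 mean is 4.0 (from 1 and 7); column 1 mean is about 3.667 (from 2, 3 and 6)
X = [[np.nan, 2], [6, np.nan], [7, 6]]
print(imp.transform(X))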


Running the code above replaces each missing value with the mean of its column as learned during the fit step.
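If your data lives in a pandas DataFrame, both options discussed in this section (removing rows and imputing values) are one-liners. The sketch below uses a tiny made-up DataFrame with hypothetical column names a and b, and the median rather than the mean just to show a different strategy.

```python
import numpy as np
import pandas as pd

# Tiny made-up dataset with one missing value in each column
df = pd.DataFrame({'a': [1.0, np.nan, 7.0],
                   'b': [2.0, 3.0, np.nan]})

# Option 1: drop any row that contains a missing value (only one row survives)
dropped = df.dropna()

# Option 2: fill each missing value with the median of its column
filled = df.fillna(df.median())

print(dropped)
print(filled)
```

Notice how dropping rows leaves us with a single record out of three, which is exactly the information loss problem described earlier.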

Detecting and Handling Outliers

Outliers in your dataset are observations or points that are extreme. Outliers can be valid points (think comparing Bill Gates’ net worth to the net worth of people in your surrounding postal code) or invalid points (occurring as a result of measurement error), and they have the potential to skew your machine learning model, leading to poor or suboptimal results.

The first step in dealing with outliers is being able to identify them, and the first place to start is looking for values which fall outside the possible range of values. Examples include negative values where they are not expected (e.g. negative values for the height or weight of a person), unrealistic numbers (such as infant weights greater than 100 lbs.) or other values which shouldn’t or can’t occur. This is what I call the smell test, and such values can be identified using basic summary statistics (minimum, maximum, etc.).

Next, if your data follows a normal distribution, you can potentially use the mean and standard deviation, following the 68-95-99.7 rule, to identify outliers: roughly 99.7% of normally distributed values fall within three standard deviations of the mean, so anything further out is suspect. Note that this is not always considered statistically sound since it is highly dependent on the distribution of the data (meaning you should proceed with the utmost care and caution).
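A rough sketch of that rule in code: compute how many standard deviations each value sits from the mean (its z-score) and flag anything beyond three. The height data here is randomly generated and the 300 cm value is planted by me, so treat it purely as an illustration.

```python
import numpy as np

# Made-up, roughly normal heights in cm, plus one planted implausible value
rng = np.random.default_rng(0)
heights = rng.normal(loc=170, scale=10, size=200)
heights = np.append(heights, 300.0)

# z-score: distance from the mean measured in standard deviations
z_scores = (heights - heights.mean()) / heights.std()

# Flag anything more than three standard deviations from the mean
outlier_mask = np.abs(z_scores) > 3

print('Number flagged:', outlier_mask.sum())
print('Flagged values:', heights[outlier_mask])
```

One caveat: the outlier itself inflates the mean and standard deviation, so on small datasets an extreme value can hide itself. That is part of why the caution above applies.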

Finally, since we are discussing machine learning, you could guess that there are machine learning techniques which allow us to identify both inliers and outliers. Out of the box, scikit-learn has a couple of options for outlier detection. You could also resort to using a model such as linear regression to fit a line to your data and then remove some percentage of the data based on the error (predicted value vs actual).

The task at hand determines how we handle outliers. Some models are sensitive to outliers and leaving them in place can lead to skew and ultimately bad predictions (this is common in linear regression). Other models are more robust and can handle outliers meaning it is not necessary to remove them.

Handling Categorical Data

Many machine learning models have specific requirements when it comes to feature data. Most machine learning models are not capable of handling categorical data and instead require all data to be numeric. These cases require that we convert our data.

Categorical data comes in two types, and we first must determine which variety we are dealing with. Ordered categorical data, or ordinal data, has an implicit order (e.g. clothing sizes, customer ratings), and in many cases simply substituting the order sequence number for the category label will suffice. For non-ordered data this won’t work, since some models may inappropriately derive a relationship based on the numeric order.

In this situation we use a process to encode the categories as a series of indicators or dummy variables. These dummy variables are simply binary fields and function as a pivoted set of categories. To illustrate this consider a feature that has three possible categories: red, blue and orange.

When encoded as dummy variables, our dataset would have three new features, is_red, is_blue and is_orange, where each feature’s value is either 0 or 1 based on the sample’s categorical value. To create these dummy features in Python we can use the Pandas library. In the sample code below the get_dummies function handles the pivot and transforms our dataset in a single step.

import pandas as pd

data = ['red', 'blue', 'orange']

# One indicator column per category
pd.get_dummies(data)
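Putting both kinds of categorical data side by side, here is a small sketch with a made-up DataFrame that has one ordinal column (size) and one non-ordered column (color). The column names and the size-to-number mapping are hypothetical; the prefix argument is used so the dummy columns come out named is_red, is_blue and is_orange.

```python
import pandas as pd

# Made-up data: size is ordinal, color has no natural order
df = pd.DataFrame({'size': ['M', 'S', 'L'],
                   'color': ['red', 'blue', 'orange']})

# Ordinal data: substitute the position in the order for the label
size_order = {'S': 1, 'M': 2, 'L': 3}
df['size_code'] = df['size'].map(size_order)

# Non-ordered data: one indicator (dummy) column per category
dummies = pd.get_dummies(df['color'], prefix='is')

encoded = pd.concat([df[['size_code']], dummies], axis=1)
print(encoded)
```

Each row has exactly one of the is_ columns switched on, which is why this is sometimes called one-hot encoding.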



Binning and Bucketing Numerical Data

Another case in which we may want to transform our data involves situations where our data is numeric and/or continuous in nature. In many cases we may choose to bin or bucket it, either because it is required or because it is necessary to optimize our model.

This process is referred to by many names including discretizing, binning or bucketing and involves converting our data into a discrete or fixed number of possible values. While there are multiple ways to accomplish this, ranging from fixed buckets to using quartiles, we will look at one example using a fictitious customer annual income feature.

import pandas as pd

income = [10000, 35000, 65000, 95000, 1500000, 5000000]
bins = [0, 30000, 75000, 250000, 1000000000]
names = ['Low', 'Middle', 'High', 'Baller']

pd.cut(income, bins, labels=names)

The above code contains an array of incomes ranging from $10,000 to $5,000,000. We’ve defined bins and labels and then used the Pandas cut function to slot each value into the appropriate bucket. With these bins, the six incomes land in Low, Middle, Middle, High, Baller and Baller respectively.
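For comparison, here is a sketch of the quantile-based alternative mentioned earlier, using the Pandas qcut function. Instead of us choosing the bin edges by hand, qcut picks edges so that each bucket holds roughly the same number of values. The choice of three buckets and their labels is mine for illustration.

```python
import pandas as pd

income = [10000, 35000, 65000, 95000, 1500000, 5000000]

# Fixed bins: edges chosen by hand (the right edge of each bin is inclusive)
bins = [0, 30000, 75000, 250000, 1000000000]
names = ['Low', 'Middle', 'High', 'Baller']
fixed = pd.cut(income, bins, labels=names)

# Quantile bins: edges chosen so each bucket gets about the same number of values
quantile = pd.qcut(income, q=3, labels=['Low', 'Middle', 'High'])

print(list(fixed))     # ['Low', 'Middle', 'Middle', 'High', 'Baller', 'Baller']
print(list(quantile))  # ['Low', 'Low', 'Middle', 'Middle', 'High', 'High']
```

Notice that the same $35,000 income is Middle under the fixed bins but Low under the quantile bins, so the method you choose changes what a label means.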


Data Standardization and Normalization

The final data wrangling technique we will look at involves bringing features within our dataset to a common scale. But why is this necessary? Oftentimes the features in our dataset include data with a diverse set of scales, for example sales quantity in units and sales revenue in dollars. If these features do not share a common scale, many machine learning models may weigh the largest feature more heavily, skewing the model and giving potentially inaccurate results.

Two common techniques for this are normalization and standardization, and the difference between the two is subtle. Both techniques bring features to a common scale. Using normalization, we rescale all values to a new scale between 0 and 1. This technique is useful when we are trying to measure distance between points, such as when doing k-means clustering.

The second technique, standardization, scales data by shifting it to have a mean of 0 and a standard deviation of 1. This technique preserves the shape of the distribution of the data when that is important.

To demonstrate these techniques we can use the iris dataset from the prior post and the preprocessing functions in scikit-learn and compare the minimum and maximum values before and after the transformations are applied.

from sklearn.datasets import load_iris
from sklearn import preprocessing

iris = load_iris()
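x = iris.data

# normalize rescales each sample (row) to unit norm; scale centers each feature on 0 with unit variance
normalized_x = preprocessing.normalize(x)
standardized_x = preprocessing.scale(x)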

print("Original Min %.3f - Max %.3f" % (x.min(), x.max()))
print("Normalize Min %.3f - Max %.3f" % (
	normalized_x.min(), normalized_x.max()))
print ("Standardized Min %.3f - Max %.3f" % (
	standardized_x.min(), standardized_x.max()))

Note that in the resulting output the normalized values are now all between 0 and 1 and the standardized values are now centered on 0.
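If you want each feature (column) rescaled on its own, scikit-learn also provides scaler classes that learn the scaling from your data. The sketch below uses MinMaxScaler to squeeze every feature into the range 0 to 1 and StandardScaler to give every feature a mean of 0 and a standard deviation of 1. These are my suggested alternatives, not the functions used in the code above.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = load_iris().data

# Min-max scaling: (x - min) / (max - min), computed per feature
X_minmax = MinMaxScaler().fit_transform(X)

# Z-score scaling: (x - mean) / standard deviation, computed per feature
X_zscore = StandardScaler().fit_transform(X)

print('Min-max min:', X_minmax.min(axis=0), 'max:', X_minmax.max(axis=0))
print('Z-score mean:', np.round(X_zscore.mean(axis=0), 3),
      'std:', np.round(X_zscore.std(axis=0), 3))
```

A practical benefit of the scaler classes is that you can fit them on your training data and then apply the same transformation to your test data.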



In this post, we explored data wrangling and more specifically five of the most common tasks you will encounter. We discussed handling missing data, identifying outliers, converting categorical data to numeric, converting numeric data into discrete categories through binning and finally bringing data to common scale through either standardization or normalization.

In our next post, we will continue our journey as we discuss techniques for identifying the most important features through a process called feature selection.

Till next time!


Data Science Day #3 – Classification Models

In the prior post, we discussed model selection at a meta level. In this post, we will explore various modeling techniques with examples. So let’s go…

Getting your own Demo Environment

This entire series will be presented in Python using Jupyter Notebooks. I will also lean heavily on the scikit-learn library. My intention is that once I get to a good break, I will revisit the series and provide parallel samples in R.

To follow along you will simply need access to a Jupyter Notebook. The good news is that this is easy and doesn’t require you to install, set up or configure anything on your local machine if you use Microsoft Azure Machine Learning (Azure ML) studio. The Azure ML service is free to sign up for and the workspace allows you to create both Jupyter Notebooks and of course machine learning experiments (we will talk about this later). To sign up for a free account visit:

The Demo Data

We will be using the Iris dataset from the UCI Machine Learning Repository since it’s widely available and well known. The Iris dataset contains various features about flowers and is used to predict the class of flower based on its features. Best of all, using it is simple through the scikit-learn library.

For our demo, we will limit our examples to only the features that describe petal length and width as well as the label. The label is multiclass since there are three classes (setosa, versicolor, virginica) of flowers represented.

Using Python, scikit-learn provides easy access to the dataset and the code to access the data and plot it on a graph is provided below:

import matplotlib.pyplot as plt
from sklearn import datasets 

iris = datasets.load_iris() 

X = iris.data[:, [2, 3]] #only use petal length and width
y = iris.target

plt.scatter( X[y == 0,0], X[y == 0,1], 
            color ='red', marker ='^', alpha = 0.5) 
plt.scatter( X[y == 1,0], X[y == 1,1], 
            color ='blue', marker ='o', alpha = 0.5)
plt.scatter( X[y == 2,0], X[y == 2,1], 
            color ='green', marker ='x', alpha = 0.5)

The resulting code generates a plot of our data with the petal length on the X-axis and petal width on the Y-axis.


The features of our data (petal length and width) are both numeric and you can tell by the shape of the data that it is linearly separable. So we are at a pretty good place to get started. Before we jump in, though, we need to divide our dataset into training and testing datasets. This is necessary, if you recall, for supervised learning models since they must be trained.

To get a randomized split, we use the train_test_split function from scikit-learn as seen below using 70% of the data for training and 30% of the data for testing.

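# Note: sklearn.cross_validation was replaced by sklearn.model_selection in later scikit-learn releases
from sklearn.model_selection import train_test_split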
import numpy as np 

#Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, 
     test_size = 0.3, random_state = 0)
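As a quick sanity check on the split, the sketch below confirms the sizes of the two sets. It also adds one thing that is not in the code above: the stratify argument, which asks train_test_split to keep the class proportions the same in both sets. That is my addition for illustration, so drop it if you want to match the original split exactly.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X = iris.data[:, [2, 3]]  # petal length and width only
y = iris.target

# stratify=y (an addition, not in the original code) keeps class balance in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

print('Train:', X_train.shape, 'Test:', X_test.shape)
print('Test samples per class:', np.bincount(y_test))
```

With 150 samples, a 70/30 split leaves 105 for training and 45 for testing, which is where the “45 test samples” in the results that follow comes from.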

Next we will walk through five different models that can be used for classification. The demos are presented at a fairly high level and I will not get into parameter (or hyper-parameter) tuning. In a future post we will discuss parameter tuning techniques such as grid search.

It should also be noted that the plots presented were generated by a function Sebastian Raschka presented in his book Python Machine Learning. Out of respect, I will not reproduce his code here and will invite you to buy his book if you are interested in the snippet.

Without further ado, let’s get started.

NOTE: You can download a copy of the Python Notebook used for these demos HERE.

Classification Using Logistic Regression

The linear logistic regression algorithm that is built into the scikit-learn library is a probabilistic model that is highly flexible and can be implemented with different solvers. These solvers make it capable of handling both small binary classification as well as multiclass classification on large datasets.

One of the most important parameters in logistic regression is the regularization. In scikit-learn it is controlled by the parameter ‘C’ in the code sample below, which is the inverse of the regularization strength, so higher values indicate weaker regularization.

The default value of this parameter when it is not specified is set to 1.0. Running the experiment with the default values (no C value specified) results in a model that is 68.9% accurate in classifying our data which translates into 14 of the 45 test samples being classified incorrectly.

When we tweak the regularization parameter, say to 1000 as was done below, at the risk of overfitting the model to our training data (more on this later), we come up with much better results. Running the code below results in a model that is 97.8% accurate and only results in 1 of the 45 tests being misclassified.

from sklearn.linear_model import LogisticRegression

#Create a model and train it
lr = LogisticRegression(C=1000)
lr.fit(X_train, y_train)

pred = lr.predict(X_test)
print("Number of mislabeled points out of a total %d points : %d"
      % (X_test.shape[0],(y_test != pred).sum()))

#Score the model...should result in 97.8% accuracy
score = lr.score(X_test, y_test)
print('Model Accuracy: %.3f' % (score)) 
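To see the effect of the regularization parameter for yourself, here is a sketch that trains the same model with several values of C and records the test accuracy for each. The list of C values and the max_iter setting are my choices, and the exact numbers you get will depend on your scikit-learn version and its default solver, so they may not match the figures quoted in this post.

```python
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X, y = iris.data[:, [2, 3]], iris.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Train one model per value of C and record its test accuracy
accuracy_by_C = {}
for C in [0.01, 1.0, 100.0, 1000.0]:
    lr = LogisticRegression(C=C, max_iter=1000)
    lr.fit(X_train, y_train)
    accuracy_by_C[C] = lr.score(X_test, y_test)
    print('C=%-8g test accuracy: %.3f' % (C, accuracy_by_C[C]))
```

Trying values by hand like this is exactly the kind of iterative guessing that grid search automates.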

To further illustrate how the models are working, I plot out the decision regions for each trained model using Sebastian Raschka’s technique. This visual is helpful for pulling back the covers and understanding how each algorithm works by plotting multiple points between the min/max values for our X and Y axes, which correspond to the petal length and width. The shape of the decision region, as you will see in subsequent examples, may be linear or non-linear (quadratic).


Classification Using Naïve Bayes

The second model we will explore is also a probabilistic model that is noted for being extremely high performance and provides good results even though it is relatively unsophisticated. With scikit-learn there are multiple Naïve Bayes algorithms available that you can experiment with. For our demo we will only look at one.

from sklearn.naive_bayes import GaussianNB

#(X_train is assumed to hold the training features that pair with y_train)
nb = GaussianNB()
nb.fit(X_train, y_train)

pred = nb.predict(X_test)
print("Number of mislabeled points out of a total %d points : %d"
      % (X_test.shape[0],(y_test != pred).sum()))

score = nb.score(X_test, y_test)
print('Model Accuracy: %.3f' % (score)) 

Running the sample code in which default configuration is used across the board, we wind up with a model that performs really well with our data. Out of the box, the Gaussian Naïve Bayes model matches our prior example with 97.8% accuracy. When we plot the decision regions, note the difference in the shapes that were produced.


Classification Using Support Vector Machines

In our third example, we will look at an implementation of a support vector machine (SVM). SVM models perform very similarly to logistic regression models except that they can handle both linear and non-linear data using what’s known as the kernel trick. I have demonstrated this in the code sample below.

from sklearn.svm import SVC

svm = SVC(kernel='linear')
#svm = SVC(kernel='rbf')   #swap this in to try the non-linear RBF kernel
svm.fit(X_train, y_train)

pred = svm.predict(X_test)
print("Number of mislabeled points out of a total %d points : %d"
      % (X_test.shape[0],(y_test != pred).sum()))

score = svm.score(X_test, y_test)
print('Model Accuracy: %.3f' % (score)) 


Classification Using Decision Trees

In our fourth example we implement a decision tree classifier. Recall from the prior blog post that a decision tree algorithm seeks to split the data set using a type of rule to maximize information gain and minimize entropy. One of the main parameters or arguments for your tree is the maximum depth (max_depth). This parameter sets the maximum number of levels the tree is allowed to grow before the algorithm stops. Setting max_depth to a higher value will usually make the model fit the training data better, but you are then likely to overfit the model to your training data, and it will generally perform more poorly in the real world.

The implementation is straightforward and, like all the models we’ve looked at so far, it performs really well with our data (97.8% accuracy).

from sklearn.tree import DecisionTreeClassifier

#(X_train is assumed to hold the training features that pair with y_train)
tree = DecisionTreeClassifier(criterion='entropy', max_depth=3)
tree.fit(X_train, y_train)

pred = tree.predict(X_test)
print("Number of mislabeled points out of a total %d points : %d"
      % (X_test.shape[0],(y_test != pred).sum()))

score = tree.score(X_test, y_test)
print('Model Accuracy: %.3f' % (score)) 
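One way to watch the effect of max_depth is to compare training and test accuracy for a few depths. The sketch below is self-contained and rests on my own assumptions (Iris petal features, a 30% hold-out split with a fixed seed), so treat the printed numbers as illustrative rather than as the results from this post.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumption: petal length and width (columns 2 and 3) of the Iris data set.
iris = load_iris()
X, y = iris.data[:, [2, 3]], iris.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

scores = {}
for depth in [1, 3, None]:   # None places no limit on the depth of the tree
    tree = DecisionTreeClassifier(criterion='entropy', max_depth=depth,
                                  random_state=0)
    tree.fit(X_train, y_train)
    scores[depth] = (tree.score(X_train, y_train), tree.score(X_test, y_test))
    print('max_depth=%s train=%.3f test=%.3f' % ((depth,) + scores[depth]))
```

A gap that opens up between the training and test scores as the depth grows is the usual sign of overfitting.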

When we plot the decision regions, however, we can see a pretty big difference. Since the decision tree splits our data rather than attempting to fit a line, we wind up with boxes as seen below.


Classification Using K-Means Clustering

The final classification algorithm we will look at is significantly different than the prior examples. K-means clustering is a form of unsupervised learning. In the prior examples, we had to fit or train our supervised models to our labelled training data. We then could use the trained model to make predictions, which we did in the form of scoring the model using our held-out test data.

Clustering, on the other hand, does not use labelled data. Instead it seeks to form clusters of points based on the distance between them. To do typical clustering we must provide, at a minimum, the number of clusters to form. We can cheat a little here since we know there are three classes we are trying to predict.

For each cluster an initial center point or centroid is placed (at random in the basic form of the algorithm). The algorithm will then iterate (up to our max_iter parameter), assigning points to the nearest centroid and adjusting the centroids to improve the fit of the points to the clusters.
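To make that loop concrete, here is a bare-bones sketch of k-means in plain Python. It is not how scikit-learn implements it; the function name, the made-up points and the purely random starting centroids are all assumptions for illustration. Note that with random starts the result can vary and may settle on a poor grouping, which is one reason libraries offer smarter initialization.

```python
import math
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Plain k-means on a list of (x, y) tuples; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)          # random starting centroids
    labels = [0] * len(points)
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest centroid.
        labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                  for p in points]
        # Update step: move each centroid to the mean of its points.
        new_centroids = []
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                new_centroids.append(tuple(sum(v) / len(members)
                                           for v in zip(*members)))
            else:
                new_centroids.append(centroids[c])   # empty cluster: keep as is
        if new_centroids == centroids:               # nothing moved: converged
            break
        centroids = new_centroids
    return centroids, labels

# Made-up points forming three visually obvious groups.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1),
          (5.0, 5.0), (5.1, 4.9), (4.8, 5.2),
          (9.0, 1.0), (9.2, 0.9), (8.9, 1.2)]
centroids, labels = kmeans(points, k=3)
print(labels)
```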

from sklearn.cluster import KMeans
from sklearn import metrics

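km = KMeans(n_clusters=3, init='k-means++', max_iter=100, n_init=1)
#Fit on the features only; no labels are used
#(X_train is assumed to hold the training features)
km.fit(X_train)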

We can now plot our data, including the centroids to better visualize this. Note that the clusters it found are not necessarily tied to our true labels or the ground truth. When we use this model to predict or classify a point, the result will be a cluster number. It is up to you to associate the clusters back to your labels.
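One simple way to tie clusters back to labels is a majority vote: for each cluster, take the most common true label among its members. This is just one possible approach, it assumes you have true labels for at least some samples, and the helper name and the sample values below are hypothetical.

```python
from collections import Counter

def map_clusters_to_labels(cluster_ids, true_labels):
    """Map each cluster number to the most common true label among its members."""
    members = {}
    for cluster, label in zip(cluster_ids, true_labels):
        members.setdefault(cluster, []).append(label)
    return {cluster: Counter(labels).most_common(1)[0][0]
            for cluster, labels in members.items()}

# Hypothetical cluster assignments and ground-truth labels for eight samples.
cluster_ids = [2, 2, 0, 0, 0, 1, 1, 2]
true_labels = ['setosa', 'setosa', 'versicolor', 'versicolor',
               'virginica', 'virginica', 'virginica', 'setosa']

mapping = map_clusters_to_labels(cluster_ids, true_labels)
predicted = [mapping[c] for c in cluster_ids]
errors = sum(p != t for p, t in zip(predicted, true_labels))
print(mapping, errors)
```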




In this post we used Python and scikit-learn to experiment with multiple different classification models. We looked at five implementations, covering both supervised and unsupervised methods, that are capable of handling various types of data. We still have barely even scratched the surface.

In the next post, we are actually going to take a step backwards as we start a discussion on data wrangling and preparation.

Till next time!


Data Science Day 2 – About Model Selection

Continuing this series of Data Science from a practitioner’s perspective…

In part one of this series we introduced the very basics of one part of machine learning: predictive analytics. In this post, we will begin to dig a little deeper as we dive in and explore some of the more common binary and multiclass classification techniques, beginning with an emphasis on understanding the very basics of model selection.

As a quick refresher, recall that classification is one way we can make predictions using supervised learning. If we think in terms of the canonical e-commerce or digital marketing examples we can quickly come up with a couple of scenarios we might want to predict:

  • Scenario 1: Whether a customer/website visitor will purchase our product or will click on a link
  • Scenario 2: Whether an economy or premium product is best suited to our customer

Both of these scenarios are classification tasks. Binary classification simply means that we only have two classes or possible states (yes/no, true/false, etc.) we are trying to predict, and this typically fits the first scenario.

Multiclass classification then is just an extension of binary classification and involves predicting membership in one of multiple possible classes (e.g. free, economy, standard, premium) as seen in the second scenario. Many binary classifiers are extended to multiclass problems using a one-versus-all (OvA) approach in which one classifier is trained for each class and their outputs are then compared to determine the “best” class.
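The one-versus-all idea can be tried out directly with scikit-learn’s OneVsRestClassifier wrapper. The sketch below is an illustration under my own assumptions (the Iris data set and a logistic regression base model); it is not meant to suggest this is how every algorithm handles multiple classes.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

iris = load_iris()
X, y = iris.data, iris.target

# One binary logistic regression is trained per class (class k versus the rest).
ova = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)
print(len(ova.estimators_))              # one fitted binary classifier per class

# For a sample, the class whose classifier scores highest is the prediction.
sample = X[:1]
scores = ova.decision_function(sample)   # one score per class
print(scores.argmax(axis=1), ova.predict(sample))
```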

Understanding model selection

Within the classification space there are a number of mathematical, statistical and machine learning techniques we can apply to accomplish the task at hand. Choosing the appropriate one involves understanding your data, its features and the nuances of each model.

All of the models we are concerned with begin with your data. In supervised learning, the model you select is iteratively trained. In each iteration a cost or error function is used to evaluate the model. When a defined optimum state or the maximum number of iterations has been reached, the training process is completed. So why does it matter?
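To picture that training loop, here is a small sketch in plain Python: a one-feature logistic regression trained by gradient descent, which stops either when the cost stops improving or when the iteration limit is hit. The data, the learning rate and the tolerance are made up for illustration, and real libraries use more sophisticated optimizers.

```python
import math

def train_logistic(xs, ys, learning_rate=0.1, max_iter=1000, tol=1e-6):
    """Fit p(y=1|x) = sigmoid(w*x + b) by gradient descent on the log loss."""
    w, b = 0.0, 0.0
    previous_cost = float('inf')
    for iteration in range(1, max_iter + 1):
        preds = [1.0 / (1.0 + math.exp(-(w * x + b))) for x in xs]
        # Cost function: average log loss over the training examples.
        cost = -sum(y * math.log(p) + (1 - y) * math.log(1 - p)
                    for y, p in zip(ys, preds)) / len(xs)
        # Stop once the cost has (nearly) stopped improving.
        if previous_cost - cost < tol:
            break
        previous_cost = cost
        # Gradient step: nudge the parameters in the direction that lowers cost.
        grad_w = sum((p - y) * x for x, y, p in zip(xs, ys, preds)) / len(xs)
        grad_b = sum(p - y for y, p in zip(ys, preds)) / len(xs)
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b, cost, iteration

# Made-up data: larger x values mostly belong to class 1.
xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
ys = [0, 0, 0, 1, 0, 1, 1, 1]
w, b, cost, iteration = train_logistic(xs, ys)
print('w=%.3f b=%.3f cost=%.3f iterations=%d' % (w, b, cost, iteration))
```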

The key to this process is your data. More specifically, the features (attributes or variables) and examples (rows) within your data. Different models not only handle different data types but also perform differently based on the number of features within your data. For now we won’t concern ourselves with data wrangling, preparation, cleansing, feature selection, etc. as those will be topics of future posts. So about your data….

Raw features appear in a couple of different forms. First and most obvious is the numeric feature. Numeric features, as the name implies, are simply numbers that can be either discrete or continuous in nature. Examples of numeric features include attributes such as age, mileage, income or, as we will see in the upcoming examples, the length and width of flower petals.

The second type of feature you will likely encounter in your data is called categorical or nominal. Examples of this type of feature include things like gender, ethnicity, color, and size. Within this type there is a special sub-type to be aware of, the ordinal. Ordinal features are simply categorical features where the order is potentially important. Examples include sizes (e.g. Small, Medium, Large) or product/customer review ratings.
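Since many models need numbers, categorical features are commonly converted into a numeric form. A minimal sketch of two common conversions follows; the category values are made up, and in practice you would likely reach for a library rather than hand-rolling this.

```python
# Ordinal feature: the order matters, so map each level to a rank.
size_order = {'Small': 0, 'Medium': 1, 'Large': 2}
sizes = ['Medium', 'Small', 'Large', 'Small']
encoded_sizes = [size_order[s] for s in sizes]

# Nominal feature: no order, so use one indicator column per category.
colors = ['red', 'green', 'blue', 'green']
categories = sorted(set(colors))            # ['blue', 'green', 'red']
one_hot = [[1 if c == cat else 0 for cat in categories] for c in colors]

print(encoded_sizes)   # [1, 0, 2, 0]
print(one_hot[0])      # [0, 0, 1], the row for 'red'
```

Ranking the nominal colors the same way would imply an order (blue < green < red) that does not exist, which is why the indicator columns are used instead.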

I’d be remiss if I didn’t briefly mention the need to cleanse and prep your data. Many of the models we will explore require data preparation steps such as feature scaling to perform their best. Again, I will defer discussing this until a later post.

So we know we have to consider the type of data we are working with, what else…

Data Shape and Linear vs. Non-Linear Separation…Say what?!?

If we generalize, we can think of each example within our data as a vector (array) of features. To visualize this, you could take each vector and plot it on a graph or chart. Once your data is plotted, think of the model you choose as being responsible for trying to separate the classes along the chart’s planes. This gets really hard/impossible to picture with many dimensions (i.e. lots of features) so we will stick with just two features, or a two-dimensional example.

If we think in two-dimensional terms, we can easily visualize linearly and non-linearly separable data. In the diagrams below, I have generated two plots, one linear and the second non-linear. Imagine now that you must construct a line to separate the data points.

[Figures: linearly separable data (left) and non-linear, moon-shaped data (right)]

Clearly the diagram on the left is linear since we can easily divide the plot with a line (the drawn yellow line). The diagram on the right, however, with the moon shapes, is illustrative of non-linear data and cannot be accurately separated with a line. So why does this matter in model selection?

Simple. Some models are designed for and work best when data is clearly linearly separable, while others work on non-linear data. I will spare you the math, and we will explore examples of both.
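You can get a quick feel for the difference by fitting a linear and a non-linear model to moon-shaped data. The sketch below uses scikit-learn’s make_moons generator and two SVM kernels; the sample size, noise level and gamma value are assumptions I picked for illustration, and the scores are measured on the training data only.

```python
from sklearn.datasets import make_moons
from sklearn.svm import SVC

# Two interleaving half-moons: a classic non-linearly separable data set.
X, y = make_moons(n_samples=200, noise=0.1, random_state=0)

# A straight-line boundary versus a curved (RBF kernel) boundary.
linear_acc = SVC(kernel='linear').fit(X, y).score(X, y)
rbf_acc = SVC(kernel='rbf', gamma=2).fit(X, y).score(X, y)
print('linear kernel: %.3f, rbf kernel: %.3f' % (linear_acc, rbf_acc))
```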

So about the models…

Now that we have context for some of the factors that come into play in model selection, let’s very, very briefly look at a non-math introduction to some of the most common models/techniques.

  • Linear Logistic Regression – handles linear data and estimates probabilities (using a logistic function) that a data point is associated with a class. The features for this model must be numeric but it is common to convert categorical features into numeric representations.
  • Naïve Bayes – also uses probabilities for classification and is based on Bayes’ theorem with a strong (Naïve) assumption of independence between features. It is frequently used for text classification (e.g. spam vs. not spam), performs well with large feature sets and is highly scalable.
  • Support Vector Machines (SVM) – a non-probabilistic model that is capable of handling both linear and non-linear data. SVM works by trying to maximize the space (margin) between classes or categories.
  • Decision Trees – most people are familiar with the graphical tree-like representation of decision trees. A decision tree is typically built using an algorithm that seeks to maximize information gain and minimize what’s known as entropy. Entropy measures the impurity of a split: for example, a split that results in a pure class (i.e. all purchasers) would have an entropy of 0, while a split that results in a 50/50 mix (i.e. purchasers and non-purchasers) would have an entropy of 1. The information gain is simply the decrease in entropy after a split occurs. Decision trees are flexible, can handle categorical and numerical data and work on both linear and non-linear data. They are also very transparent in how decisions are made.
  • Neural Networks – modelled after the human brain, where multiple layers of neurons exchange messages as part of the process to approximate functions. Neural networks are largely black box and can solve extremely tough classification problems that exist within the fields of computer vision and speech recognition.
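The entropy and information gain numbers mentioned for decision trees are easy to verify by hand. Here is a small sketch in plain Python; the function names and the buy/no labels are just for illustration.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    proportions = [n / total for n in Counter(labels).values()]
    return sum(p * math.log2(1 / p) for p in proportions)

def information_gain(parent, children):
    """Decrease in entropy after splitting `parent` into `children`."""
    total = len(parent)
    weighted = sum(len(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted

pure = ['buy'] * 4
mixed = ['buy', 'buy', 'no', 'no']
print(entropy(pure))    # 0.0, a pure node
print(entropy(mixed))   # 1.0, a 50/50 node
print(information_gain(mixed, [['buy', 'buy'], ['no', 'no']]))  # 1.0
```

The last line shows the best case: a split that turns a 50/50 node into two pure nodes removes all of the entropy.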


In this post, we discussed the basics of model selection as you prepare to get started in a classification exercise. Start by developing a deep understanding of the data you are working with. Remember this is just a starting point and oftentimes you will want to compare a couple of different approaches to determine which will provide the most utility, as we will see in the next post.

One of my favorite sayings is “All models are wrong, but some are useful…” by George Edward Pelham Box. Keep this in mind as you begin your journey as a data science practitioner.

In our next post, we continue the focus on the most common classification models, using samples to illustrate and understand each of the various techniques.


Till next time!


Data Science Day 1 – Machine Learning for the Rest of Us

My last blog post was more than 9 months ago and at this rate, this blog should probably be presumed dead since not even I really recognize it. At any rate, here I am again, attempting to breathe life back into this oft neglected space and what better way to do it than with a series that focuses on all things machine learning.

Over the course of this blog series, I will introduce a broad and diverse range of machine learning topics from the perspective of a practitioner. These topics will range from a focus on various models and techniques such as classification, clustering and recommenders to a discussion about tools, platforms and challenges such as operationalization of machine learning models.

I’ll spare you the typical machine learning is pervasive/disruptive/transformative/{insert descriptive adjective here} pep talk as I highly suspect that if you are reading this you already understand the value of machine learning. Instead, let’s dive right in to the introduction.

Introduction to Predictive Analytics

Predictive analytics in its simplest form involves using past or historical data (called experiences) to predict future events or behavior. There are a variety of statistical, machine learning and data mining techniques that can be used to meet this end, all of which we will classify as either supervised or unsupervised learning techniques.

Supervised Learning

In supervised machine learning, we are interested in training a model using past data where the desirable behavior or event, such as a purchase or click, is known. This data is said to be labelled, and the label is sometimes referred to as the ground truth or our target. The inputs into the model, known as attributes or features, will be used to observe potential patterns which can be exploited to predict the target.

This training step is what differentiates supervised from unsupervised techniques. Following model training, it is necessary to test our model on labelled, unseen or held out data. During this step, the model is used to predict our target (i.e. will user buy/click/etc.) on the test data and then that prediction is evaluated against the ground truth to determine how well the model performs.


Within the context of predictive analytics there are two classes of techniques that you will encounter in the supervised space: classification and regression. Classification simply means that we are trying to classify things or events into one of two (binary) or more (multiclass) groups using a technique such as decision trees, logistic regression or neural networks.

Examples of classification activities are:

  1. Binary – Will customer X buy?
  2. Binary – Is transaction fraudulent?
  3. Binary – Is the email spam?
  4. Multiclass – What data plan is best for customer? (Economy/Standard/Premium)

Regression on the other hand is used when the target we are trying to predict is numerical. A technique like linear regression could be used to predict targets such as the call volume for a call center or a product line sales volume forecast.
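As a tiny illustration of predicting a numeric target, here is ordinary least squares for a single feature in plain Python. The numbers are entirely made up (a hypothetical count of marketing emails sent versus calls received), so read it as a sketch of the idea rather than a forecasting recipe.

```python
# Made-up history: marketing emails sent (thousands) vs. calls received.
emails = [1.0, 2.0, 3.0, 4.0, 5.0]
calls = [110.0, 205.0, 290.0, 410.0, 495.0]

# Ordinary least squares with one feature: calls ~ slope * emails + intercept.
mean_x = sum(emails) / len(emails)
mean_y = sum(calls) / len(calls)
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(emails, calls))
         / sum((x - mean_x) ** 2 for x in emails))
intercept = mean_y - slope * mean_x

# Predict a numeric target for an unseen input.
predicted_calls = slope * 6.0 + intercept
print('slope=%.1f intercept=%.1f prediction=%.1f'
      % (slope, intercept, predicted_calls))
```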

Unsupervised Learning

Whereas in supervised learning we had to train a model on labelled data before it was useful, unsupervised techniques have no such requirement because they work on unlabelled data. Using one of the most common unsupervised learning techniques, called k-means clustering, our data will be divided into distinct groups based on the distance between points. The members of each group share similar traits, which can be exploited for predictive purposes.


This is common in activities such as customer segmentation or in churn analysis. In both of these activities, the groups formed as part of the clustering exercise are used to predict whether a customer will or won’t buy, whether they will upgrade or even whether they will stay.

So, the next logical question you are probably thinking is….how do you choose the most appropriate technique for task X? Well, that is a far more nuanced question that requires a lot more than a single post. My hope is that I will help you discover and answer that question for yourself over the course of this series.

Wrap-up and what’s next?

The purpose of this post was to build a foundation and introduce machine learning in the broadest possible sense. By introducing common techniques and familiarizing yourself with the common terminology the stage is set. In the next post we will begin looking at classification techniques for binary and multiclass predictive analytics.

We will start by exploring some of the most common techniques including linear and non-linear classifiers and decision trees, among others. This will then become the basis for a more formal discussion of the data science process including data prep/wrangling, feature selection, data partitioning, model selection and model evaluation.

Till next time!


Azure Machine Learning – Data Understanding

Continuing on with our introduction of the Azure Machine Learning service, we will step back from the high-level demos used in the two previous posts and begin a more pragmatic look at this service. This post will explore the methods available for data ingress and highlight some of the most common and useful tasks in beginning your machine learning experiment.

The Data Mining Methodology

Before we dig in, let’s add some context and review one of the prevailing process methodologies used in machine learning. CRISP-DM (the Cross Industry Standard Process for Data Mining) defines six major phases for data mining or machine learning projects. These six phases are mostly self-explanatory and are as follows:

  1. Business Understanding
  2. Data Understanding
  3. Data Preparation
  4. Modeling
  5. Evaluation
  6. Deployment

Please note that while these steps are listed in a linear fashion, this methodology is designed to be iterative, such that in subsequent iterations we continually refine our experiment through increased understanding. In this post, we are going to focus on step two: Data Understanding within Azure Machine Learning Services.

Collecting Data

With the CRISP methodology, the Data Understanding phase begins with data acquisition or the initial data collection. The Azure Machine Learning service offers us a number of different options to onboard data into data sets for use within our experiments. These options range from uploading data from your local file system, as seen in our initial experiments, to reading data directly from an Azure SQL database or the web. Let’s dig in and look at all the options that are currently available.

  • Dataset from a Local File – Uploads a file directly from your file system into the Azure Machine Learning service. Multiple file types are supported and include: Comma Separated Values (*.csv), Tab Separated Values (*.tsv), Text (*.txt), SVM Light (*.svmlight), Attribute Relation File (*.arff), Zip (*.zip) and R Object/Workspace (*.rdata)
  • Dataset from a Reader – The Reader task provides a significant amount of flexibility and allows you to connect to and load data directly from HTTP (in multiple formats), SQL Azure databases, Azure Table Storage, Azure Blob Storage, a Hive query and the slightly confusingly named Power Query, which is really an OData data feed.
  • Dataset from an Experiment – The result of an experiment (or any step within the experiment) can also be used by simply right-clicking on the task output and selecting the ‘Save as Dataset’ option.

Understanding Data

After data collection comes understanding, as we begin to familiarize ourselves with the data and its many facets. This process, which gives us our first insight into the data, often starts with a profiling exercise that not only identifies potential data quality problems but can also lead to the discovery of interesting subsets of so-called hidden information. During this phase of data understanding some of the common activities are:

  • Identifying data types (string, integer, dates, boolean)
  • Determining the distribution (Discrete/Categorical or Continuous)
  • Population of values or identification of missing values (dense or sparse)
  • Generating a statistical profile of the data (Min, Max, Mean, Counts, Distinct Count, etc)
  • Identifying correlation within the data set
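Azure Machine Learning handles these activities through its own tasks, but the ideas are easy to see in a few lines of plain Python. The sketch below profiles the columns of a tiny made-up data set; the column names, the values and the `profile` helper are all hypothetical and have nothing to do with the service’s API.

```python
import statistics

# A tiny made-up data set; None marks a missing value.
rows = [
    {'age': 34, 'plan': 'Economy'},
    {'age': 51, 'plan': 'Premium'},
    {'age': None, 'plan': 'Economy'},
    {'age': 28, 'plan': None},
]

def profile(rows, column):
    """Return basic profiling facts for one column."""
    values = [row[column] for row in rows]
    present = [v for v in values if v is not None]
    summary = {
        'type': type(present[0]).__name__ if present else 'unknown',
        'count': len(present),
        'missing': len(values) - len(present),
        'distinct': len(set(present)),
    }
    # Numeric columns also get min, max and mean.
    if present and all(isinstance(v, (int, float)) for v in present):
        summary.update(min=min(present), max=max(present),
                       mean=statistics.mean(present))
    return summary

for column in ('age', 'plan'):
    print(column, profile(rows, column))
```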

To facilitate this step, the Azure Machine Learning service provides a number of useful tasks and features. The first and easiest to use is the “Visualize” option found on the context menu (right-click) of each task output. Using this option provides a basic summarization of the data including a list of columns, basic statistics, a count of unique and missing values and the feature (data) type.



While the Visualize feature is a great tool for initial insight, to further our understanding we will often need a broader and deeper look into the data. For this we have two tasks, Descriptive Statistics and Elementary Statistics, both found under the Statistical Functions category.


The Descriptive Statistics task calculates a broad set of statistics including counts, ranges, summaries and percentiles to describe the data set, while the Elementary Statistics task independently calculates a single statistical measure for each column, which is useful for determining the central tendency, dispersion and shape of the data. Both of these tasks output a tabular report of the results which can be exported and analyzed independently of the service.

Finally, we look at the Linear Correlation (Statistical Functions) task. This task calculates a matrix of correlation coefficients which measure the degree of relationship or correlation between all possible pairs of variables or columns within our dataset. Using this correlation coefficient, we can identify how variables change in relation to one another, with a coefficient of zero meaning there is no linear relationship and a value of (+/-) one implying they are perfectly correlated.
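To see what such a matrix contains, here is a sketch that computes Pearson correlation coefficients in plain Python. It is not the service’s implementation, and the columns are made up to show the three reference cases: perfectly positive, perfectly negative and zero correlation.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Made-up columns chosen to show the reference cases.
columns = {
    'a': [1, 2, 3, 4, 5],
    'b': [2, 4, 6, 8, 10],   # rises in step with a
    'c': [5, 4, 3, 2, 1],    # falls as a rises
    'd': [1, 2, 3, 2, 1],    # rises then falls, no linear trend against a
}

# Correlation of every column with every other column.
names = list(columns)
matrix = {r: {c: pearson(columns[r], columns[c]) for c in names} for r in names}
for r in names:
    print(r, ['%.2f' % abs(matrix[r][c]) for c in names])
```

Column d is worth a second look: it clearly depends on a, yet its coefficient against a is zero, which is a reminder that this measure only captures linear relationships. The printout shows magnitudes only; the signed values are in `matrix`.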



In this post, we began a more pragmatic look at the Azure Machine Learning service. Focusing specifically on the Data Understanding phase of the CRISP Data Mining Methodology, we looked at the various options for data ingress or collection and the tasks available to help us build an understanding of the data we are working with. In the next post, we will move on to the third phase of the methodology: Data Preparation.

Till next time!