In the prior post, we discussed model selection at a meta level. In this post, we will explore various modeling techniques with examples. So let’s go…
Getting your own Demo Environment
This entire series will be presented in Python using Jupyter Notebooks. I will also lean heavily on the scikit-learn library (http://scikit-learn.org/stable/). My intention is that once I get to a good break, I will revisit the series and provide parallel samples in R.
To follow along you will simply need access to a Jupyter Notebook. The good news is that this is easy and doesn’t require you to install, set up or configure anything on your local machine if you use Microsoft Azure Machine Learning (Azure ML) studio. The Azure ML service is free to sign up for, and the workspace allows you to create both Jupyter Notebooks and, of course, machine learning experiments (we will talk about this later). To sign up for a free account visit: https://studio.azureml.net/.
The Demo Data
We will be using the Iris dataset from the UCI Machine Learning Repository (http://archive.ics.uci.edu/ml/datasets/Iris) since it’s widely available and well known. The Iris dataset contains measurements (features) of iris flowers and is used to predict the class of a flower based on those features. Best of all, using it is simple through the scikit-learn library.
For our demo, we will limit our examples to only the features that describe petal length and width as well as the label. The label is multiclass since there are three classes (setosa, versicolor, virginica) of flowers represented.
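Before plotting anything, it can help to confirm which columns and labels we are actually working with. The snippet below is a small sketch that only inspects the dataset as scikit-learn ships it; the variable name petal_features is mine and is used purely for illustration.

```python
from sklearn import datasets

iris = datasets.load_iris()

# Columns 2 and 3 hold the petal measurements used throughout this post.
petal_features = iris.feature_names[2:4]
X = iris.data[:, [2, 3]]
y = iris.target

print(petal_features)            # names of the two selected columns
print(list(iris.target_names))   # class names for labels 0, 1 and 2
print(X.shape, y.shape)          # number of samples and features
```

The integer labels in y index into iris.target_names, so label 0 is setosa, 1 is versicolor and 2 is virginica.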
scikit-learn provides easy access to the dataset from Python. The code to load the data and plot it on a graph is provided below:
    import matplotlib.pyplot as plt
    from sklearn import datasets

    iris = datasets.load_iris()
    X = iris.data[:, [2, 3]]  # only use petal length and width
    y = iris.target

    # One scatter series per class (0, 1, 2), each with its own color and marker
    plt.scatter(X[y == 0, 0], X[y == 0, 1], color='red', marker='^', alpha=0.5)
    plt.scatter(X[y == 1, 0], X[y == 1, 1], color='blue', marker='o', alpha=0.5)
    plt.scatter(X[y == 2, 0], X[y == 2, 1], color='green', marker='x', alpha=0.5)
    plt.show()
The code generates a plot of our data with the petal length on the X-axis and petal width on the Y-axis.
The features of our data (petal length and width) are both numeric, and you can tell by the shape of the data that it is close to linearly separable: setosa separates cleanly, while versicolor and virginica overlap slightly. So we are at a pretty good place to get started. Before we jump in, though, we need to divide our dataset into training and testing datasets. This is necessary, if you recall, for supervised learning models since they must be trained.
To get a randomized split, we use the train_test_split function from scikit-learn, as seen below, with 70% of the data for training and 30% of the data for testing.
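    # train_test_split lives in sklearn.model_selection
    # (the older sklearn.cross_validation module was removed in scikit-learn 0.20)
    from sklearn.model_selection import train_test_split
    import numpy as np

    # Split into training and test sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)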
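As a quick sanity check on the split, the sketch below confirms the 70/30 sizes and shows how many samples of each class landed in each subset. It repeats the loading step so that it runs on its own.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X = iris.data[:, [2, 3]]
y = iris.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

print(X_train.shape, X_test.shape)

# How many samples of each class ended up in each subset
print("train class counts:", np.bincount(y_train))
print("test class counts: ", np.bincount(y_test))
```

Because the split is random, the classes are not guaranteed to be evenly represented. If that matters for your data, train_test_split also accepts stratify=y to preserve the class proportions.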
Next we will walk through five different models that can be used for classification. The demos are presented at a fairly high level and I will not get into parameter (or hyper-parameter) tuning. In a future post we will discuss parameter tuning techniques such as grid search.
It should also be noted that the plots presented were generated by a function Sebastian Raschka presented in his book Python Machine Learning. Out of respect, I will not reproduce his code here and will invite you to buy his book if you are interested in the snippet.
Without further ado, let’s get started.
NOTE: You can download a copy of the Python Notebook used for these demos HERE.
Classification Using Logistic Regression
The linear logistic regression algorithm that is built into the scikit-learn library is a probabilistic model that is highly flexible and can be implemented with different solvers. These solvers make it capable of handling both small binary classification problems and multiclass classification on large datasets.
One of the most important parameters in logistic regression is the regularization strength. It is controlled by the parameter ‘C’ in the code sample below, which is the inverse of the regularization strength, so higher values indicate weaker regularization.
The default value of this parameter when it is not specified is 1.0. Running the experiment with the default values (no C value specified) results in a model that is 68.9% accurate in classifying our data, which translates into 14 of the 45 test samples being classified incorrectly.
When we tweak the regularization parameter, say to 1000 as was done below, at the risk of overfitting the model to our training data (more on this later), we come up with much better results. Running the code below results in a model that is 97.8% accurate, with only 1 of the 45 test samples being misclassified.
    from sklearn.linear_model import LogisticRegression

    # Create a model and train it
    lr = LogisticRegression(C=1000)
    lr.fit(X_train, y_train)

    # Count the misclassified test samples (shape[0] is the number of rows)
    pred = lr.predict(X_test)
    print("Number of mislabeled points out of a total %d points : %d" % (X_test.shape[0], (y_test != pred).sum()))

    # Score the model...should result in 97.8% accuracy
    score = lr.score(X_test, y_test)
    print('Model Accuracy: %.3f' % (score))
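To see the effect of C for yourself, here is a rough sketch that trains the same model at a few regularization settings and reports training and test accuracy side by side. The particular C values and the raised max_iter are my own choices for illustration, and the exact numbers you get will depend on your scikit-learn version and its default solver, so I have not listed expected output.

```python
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X, y = iris.data[:, [2, 3]], iris.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

results = {}
for C in [0.01, 1.0, 1000.0]:
    # max_iter is raised (an assumption, not part of the original demo)
    # to give the solver room to converge when regularization is weak
    lr = LogisticRegression(C=C, max_iter=1000)
    lr.fit(X_train, y_train)
    train_acc = lr.score(X_train, y_train)
    test_acc = lr.score(X_test, y_test)
    results[C] = (train_acc, test_acc)
    print("C=%g  train=%.3f  test=%.3f" % (C, train_acc, test_acc))
```

A large gap between training and test accuracy at high C values is one sign that the weaker regularization is starting to overfit.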
To further illustrate how the models are working, I plot the decision regions for each trained model using Sebastian Raschka’s technique. This visual is helpful for pulling back the covers and understanding how each algorithm works: it classifies many points between the min/max values of our X and Y axes, which correspond to the petal length and width. The shape of the decision region, as you will see in subsequent examples, may be linear or non-linear (quadratic).
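The general idea behind a decision region plot can be sketched without any plotting code at all. The snippet below is not Sebastian Raschka’s function; it is a generic grid evaluation using NumPy’s meshgrid, and the step size and the one-unit margin around the data are arbitrary choices of mine.

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression

iris = datasets.load_iris()
X, y = iris.data[:, [2, 3]], iris.target
model = LogisticRegression(C=1000, max_iter=1000).fit(X, y)

step = 0.02  # grid resolution (arbitrary, for illustration)
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1

# Build a dense grid of (petal length, petal width) points
xx, yy = np.meshgrid(np.arange(x_min, x_max, step),
                     np.arange(y_min, y_max, step))

# Ask the trained model for a class at every grid point
grid_points = np.c_[xx.ravel(), yy.ravel()]
Z = model.predict(grid_points).reshape(xx.shape)

print(Z.shape, np.unique(Z))
```

Passing xx, yy and Z to matplotlib’s contourf would shade each predicted class as a region, which is what makes the boundary between classes visible.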
Classification Using Naïve Bayes
The second model we will explore is also a probabilistic model. It is noted for its high performance and provides good results even though it is relatively unsophisticated. With scikit-learn there are multiple Naïve Bayes algorithms available that you can experiment with. For our demo we will only look at one.
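    from sklearn.naive_bayes import GaussianNB

    # Create a Gaussian Naive Bayes model with default settings and train it
    nb = GaussianNB()
    nb.fit(X_train, y_train)

    # Count the misclassified test samples (shape[0] is the number of rows)
    pred = nb.predict(X_test)
    print("Number of mislabeled points out of a total %d points : %d" % (X_test.shape[0], (y_test != pred).sum()))

    score = nb.score(X_test, y_test)
    print('Model Accuracy: %.3f' % (score))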
Running the sample code, in which the default configuration is used across the board, we wind up with a model that performs really well with our data. Out of the box, the Gaussian Naïve Bayes model matches our prior example with 97.8% accuracy. When we plot the decision regions, note the difference in the shapes that were produced.
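Since Gaussian Naïve Bayes is a probabilistic model, we can look at what it learned and at the probabilities behind its predictions. This is a small sketch that repeats the data setup so it runs on its own; looking at only the first three test samples is an arbitrary choice.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

iris = datasets.load_iris()
X, y = iris.data[:, [2, 3]], iris.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

nb = GaussianNB().fit(X_train, y_train)

# Per-class mean of each feature: one row per class, one column per feature
print(nb.theta_)

# Class membership probabilities for the first three test samples
probs = nb.predict_proba(X_test[:3])
print(probs.round(3))
```

Each row of probs sums to 1, and the predicted class is simply the column with the highest probability.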
Classification Using Support Vector Machines
In our third example, we will look at an implementation of a support vector machine (SVM). SVM models perform very similarly to logistic regression models, except that they can handle both linear and non-linear data using what’s known as the kernel trick. I have demonstrated this in the code sample below.
    from sklearn.svm import SVC

    # Linear kernel; swap in the commented line to try a non-linear (RBF) kernel
    svm = SVC(kernel='linear')
    #svm = SVC(kernel='rbf')
    svm.fit(X_train, y_train)

    # Count the misclassified test samples (shape[0] is the number of rows)
    pred = svm.predict(X_test)
    print("Number of mislabeled points out of a total %d points : %d" % (X_test.shape[0], (y_test != pred).sum()))

    score = svm.score(X_test, y_test)
    print('Model Accuracy: %.3f' % (score))
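Rather than commenting lines in and out, you can compare the two kernels in a loop. The sketch below uses default settings for both and also reports how many support vectors each model kept; I have not listed expected numbers since they may vary with your scikit-learn version.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

iris = datasets.load_iris()
X, y = iris.data[:, [2, 3]], iris.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

scores = {}
for kernel in ["linear", "rbf"]:
    svm = SVC(kernel=kernel)
    svm.fit(X_train, y_train)
    scores[kernel] = svm.score(X_test, y_test)
    # n_support_ holds the number of support vectors per class
    print("%-6s accuracy=%.3f support vectors=%d"
          % (kernel, scores[kernel], svm.n_support_.sum()))
```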
Classification Using Decision Trees
In our fourth example we implement a decision tree classifier. Recall from the prior blog post that a decision tree algorithm seeks to split the data set with a series of rules, choosing each split to maximize information gain (that is, to reduce entropy the most). One of the main parameters or arguments for your tree is the maximum depth (max_depth). This parameter sets the maximum number of levels the tree is allowed to grow before it stops. Setting max_depth to a higher value should make the model fit the training data better, but you are likely to overfit the model to your training data and it will generally perform more poorly in the real world.
The implementation is straightforward and, like all the models we’ve looked at so far, it performs really well with our data (97.8% accuracy).
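To make entropy and information gain concrete, here is a small worked example in plain Python. The parent node uses the real class counts of the Iris dataset (50 flowers per class), but the split itself is an idealized one I made up for illustration: one branch isolates a single class and the other holds the remaining two.

```python
from math import log2

def entropy(counts):
    """Shannon entropy (in bits) of a node, given its per-class counts."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    total = sum(parent)
    weighted = sum(sum(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted

parent = [50, 50, 50]                  # three classes, evenly mixed
children = [[50, 0, 0], [0, 50, 50]]   # idealized split: one pure branch

gain = information_gain(parent, children)
print(round(entropy(parent), 3))  # 1.585
print(round(gain, 3))             # 0.918
```

The pure branch has an entropy of 0 and the two-class branch has an entropy of 1 bit, so the split removes 0.918 of the original 1.585 bits of uncertainty. The tree picks whichever candidate split yields the largest gain.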
    from sklearn.tree import DecisionTreeClassifier

    # Split on entropy (information gain) and limit the tree to three levels
    tree = DecisionTreeClassifier(criterion='entropy', max_depth=3)
    tree.fit(X_train, y_train)

    # Count the misclassified test samples (shape[0] is the number of rows)
    pred = tree.predict(X_test)
    print("Number of mislabeled points out of a total %d points : %d" % (X_test.shape[0], (y_test != pred).sum()))

    score = tree.score(X_test, y_test)
    print('Model Accuracy: %.3f' % (score))
When we plot the decision regions, however, we can see a pretty big difference. Since the decision tree splits our data rather than attempting to fit a line, we wind up with boxes as seen below.
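One way to see where the boxes come from is to print the rules the tree learned. The sketch below uses scikit-learn’s export_text helper; the short feature names and the fixed random_state are my own additions for readability and repeatability.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

iris = datasets.load_iris()
X, y = iris.data[:, [2, 3]], iris.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

tree = DecisionTreeClassifier(criterion='entropy', max_depth=3, random_state=0)
tree.fit(X_train, y_train)

# Each rule compares a single feature against a threshold
rules = export_text(tree, feature_names=["petal length", "petal width"])
print(rules)
```

Every rule tests one feature against one threshold, so each split is a horizontal or vertical cut through the plot. Stacking those cuts is what produces rectangular regions.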
Classification Using K-Means Clustering
The final algorithm we will look at is significantly different from the prior examples. K-means clustering is a form of unsupervised learning. In the prior examples, we had to fit or train our supervised models to our labelled training data. We could then use the trained model to make predictions, which we did in the form of scoring the model using our held-out test data.
Clustering, on the other hand, does not use labelled data. Instead it seeks to form clusters of points based on the distance between them. To do typical clustering we must provide, at a minimum, the number of clusters to form. We can cheat a little here since we know there are three classes we are trying to predict.
For each cluster, an initial center point, or centroid, is placed (at random, or with the k-means++ seeding used in the code below). The algorithm then iterates, up to the limit set by our max_iter parameter, assigning each point to its nearest centroid and moving each centroid to the mean of its assigned points, so that the points fit their clusters as tightly as possible.
    from sklearn.cluster import KMeans
    from sklearn import metrics

    # Three clusters, k-means++ seeding, a single initialization run
    km = KMeans(n_clusters=3, init='k-means++', max_iter=100, n_init=1)
    km.fit(X_train)
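Once the model is fit, you can inspect what it found. The sketch below repeats the setup so it runs on its own and adds a fixed random_state (my addition) so the result is repeatable from run to run.

```python
from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X, y = iris.data[:, [2, 3]], iris.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

km = KMeans(n_clusters=3, init='k-means++', max_iter=100, n_init=1,
            random_state=0)
km.fit(X_train)  # note: no labels are passed in

print(km.cluster_centers_)    # one (petal length, petal width) row per centroid
print(km.inertia_)            # sum of squared distances to the nearest centroid
print(km.predict(X_test[:5])) # cluster numbers, not class labels
```

The inertia_ value is the quantity the algorithm is trying to drive down as it moves the centroids.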
We can now plot our data, including the centroids, to better visualize this. Note that the clusters it found are not necessarily tied to our true labels or the ground truth. When we use this model to predict or classify a point, the result will be a cluster number. It is up to you to associate the clusters back to your labels.
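One simple way to associate clusters with labels, assuming you do have labels for the training data, is a majority vote: each cluster takes the most common true label among its members. This is only a sketch of one possible approach, and the names cluster_to_label and accuracy are mine.

```python
import numpy as np
from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X, y = iris.data[:, [2, 3]], iris.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_train)

# Majority vote: map each cluster number to its most frequent true label
cluster_to_label = {}
for cluster in range(3):
    members = y_train[km.labels_ == cluster]
    cluster_to_label[cluster] = int(np.bincount(members).argmax())

# Translate predicted cluster numbers into class labels and score them
pred = np.array([cluster_to_label[c] for c in km.predict(X_test)])
accuracy = (pred == y_test).mean()
print(cluster_to_label)
print('Accuracy after mapping: %.3f' % accuracy)
```

Keep in mind that nothing forces each cluster to map to a different label. If two clusters vote for the same class, that is a hint the clusters do not line up well with the ground truth.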
In this post we used Python and scikit-learn to experiment with multiple different classification models. We looked at five implementations, covering both supervised and unsupervised methods, that are capable of handling various types of data. We have still barely scratched the surface.
In the next post, we are actually going to take a step backwards as we start a discussion on data wrangling and preparation.
Till next time!