Data Science Day #7 – Model Evaluation

Outside of the introduction to some of the common algorithms or models used for a classification task, this series has focused on all the small yet important foundational things that need to be handled or considered prior to modeling and model selection. In this post we are going to drive the bus back in from left field and discuss model evaluation within the context of a supervised classification task. Without further ado, let’s get started.

 Getting Data

We’ve spent a number of posts talking about preparing data for optimal use in a machine learning activity, and there’s one more step that needs to be considered. Since we are going to be working with supervised machine learning techniques, our models are going to require sets of data for training. Likewise, since we will talk about model evaluation shortly, we will need data for that too. Further complicating matters is the need for validation data once we discuss model selection. So how do we go about dividing our data appropriately?

Believe it or not, this is a relatively easy task. Accomplishing it can be as simple as splitting the data randomly by some percentage, say using 80% for training and leaving 20% untouched for testing and evaluation. We could then take our training data and further carve out a set of validation data to help prevent us from overfitting our model during model selection. This is classically known as the holdout method; while the percentages vary, with 70/30 and 80/20 being common splits, it is widely used because of its simplicity. A minimal sketch is shown below.
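
To make this concrete, here is a minimal sketch of a holdout split using scikit-learn’s train_test_split (assuming a recent scikit-learn; the generated dataset and split percentages are placeholders for illustration):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

#Generate a toy dataset to split
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

#Hold out 20% of the data for testing (an 80/20 split)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

#Carve a validation set out of the training data (25% of the 80%, i.e. a 60/20/20 split overall)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=0)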

A second alternative to the holdout technique is known as cross-validation. Cross-validation has a few different implementations, with the most common known as k-fold cross-validation. In cross-validation the dataset is divided into k folds (for example 5), where each observation exists in only a single fold and each fold takes a turn both as part of the training set and as the validation set. To better illustrate the differences between the holdout method and cross-validation with k folds, a diagram has been included below, followed by a short code sketch.

Validaton
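
And here is a minimal sketch of k-fold cross-validation using scikit-learn’s cross_val_score (again assuming a recent scikit-learn; the logistic regression model is just a placeholder, any estimator would do):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

#Evaluate the model across 5 folds; each fold takes a turn as the validation set
scores = cross_val_score(LogisticRegression(), X, y, cv=5)

print(scores)         #one score per fold
print(scores.mean())  #average performance across the folds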

Before moving on, it’s important to note that the way you divide up your data can have a significant impact on model performance. You shouldn’t shortcut yourself into potential issues like overfitting a model by reusing your training data later as validation data during model selection, or overlook small details like randomization when splitting up your data.

 Model Evaluation

So we have our data and we’ve selected a model, or potentially models, that we suspect will work nicely with our data. How do we evaluate the model’s actual performance? At first glance it’s simple, right? This is a classification activity and regardless of whether it’s a binary or multiclass problem, the model either picked the right class or it didn’t. This task is often referred to as scoring and involves testing our chosen model on a set of unseen data to see if its predicted class matches our known label, or the ground truth.

Unfortunately, once we’ve trained, predicted and scored our model, there are multiple ways or metrics we can use to measure how our model performed. Let’s briefly look at some of the most common ways to evaluate model performance.

Confusion Matrix

The simplest method for evaluating a model’s performance at a classification task is known as the confusion matrix (pictured below). The confusion matrix in a binary classification task is made up of four quadrants, each containing a straight count of how our model’s predicted class aligns with our ground truth or known label. The four quadrants in this example matrix are:

  • True Positive (TP) – predict true, known true
  • False Negative (FN) – predict false, known true, sometimes called Type-2 error
  • False Positive (FP) – predict true, known false, sometimes called Type-1 error
  • True Negative (TN) – predict false, known false

To further illustrate how this works, let’s consider a simple binary classification task where we are trying to predict whether a credit card transaction is fraudulent or not. In this case, we are simply predicting either true or false, with true indicating a fraudulent transaction. In the example pictured below our model correctly classified 92 fraudulent transactions and another 104 as non-fraudulent. Further, it misclassified 28 transactions as okay when they were in fact fraudulent and, finally, 26 as fraudulent when they were in fact okay transactions. Taken together this gives us a straightforward summary of how our selected model performed and will be the basis of many of the performance metrics we will discuss next.

confusion_matrix

 

Performance Metrics

When I started this series my goal was to do the whole thing without invoking a single math formula. While my intentions were probably good, like many of the politicians filling the airwaves I’m going to backtrack and present some math to highlight the various derivations that can be made from our confusion matrix. Don’t worry, you won’t be tested on this stuff as it’s presented only for conceptual understanding, and all of these metrics are often provided for you by ML libraries such as scikit-learn.

Error

Summarizes the percentage of observations that were misclassified by the model.

Error = (FP + FN) / (TP + TN + FP + FN)

Accuracy

Summarizes the percentage of observations that were correctly classified by the model.

Accuracy = (TP + TN) / (TP + TN + FP + FN)

Precision

Can be calculated for either positive or negative results and is the proportion of true positives to all positive predictions.

Precision

Recall

Also referred to as sensitivity, it is the proportion of true positive classifications to all cases which should have been positive. Higher recall means that we are correctly classifying most positive cases, while ignoring false positives. This metric can also be calculated for negative results, where it is called specificity.

Recall (Sensitivity) = TP / (TP + FN)
Specificity = TN / (TN + FP)

F1-Score

Measures overall accuracy by considering both precision and recall, weighted equally using the harmonic mean. The highest F1 score is one, with the lowest being zero. This measure is typically used to compare one model to another.

F1 = 2 × (Precision × Recall) / (Precision + Recall)
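
To make these formulas concrete, here is a quick sketch (assuming Python 3 division) that plugs in the counts from the fraud example pictured earlier, where TP = 92, TN = 104, FN = 28 and FP = 26:

#Counts taken from the example confusion matrix above
TP, TN, FN, FP = 92, 104, 28, 26
total = TP + TN + FN + FP

error = (FP + FN) / total                             #0.216
accuracy = (TP + TN) / total                          #0.784
precision = TP / (TP + FP)                            #~0.780
recall = TP / (TP + FN)                               #~0.767 (sensitivity)
specificity = TN / (TN + FP)                          #0.800
f1 = 2 * (precision * recall) / (precision + recall)  #~0.773

print(accuracy, precision, recall, specificity, f1)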

Receiver Operating Characteristic (ROC) Curve

The ROC curve is a visual plot that compares the true-positive rate to the false-positive rate. In such a plot, a perfectly predictive model follows the axis borders while a model with no predictive value (i.e. random guessing) forms the diagonal. We can plot and compare multiple models to one another. Using these plots we can generate another measure called area under the curve (AUC) to quantify the graphically presented data.
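
As a rough sketch of how an ROC curve and its AUC might be produced with scikit-learn (using a synthetic dataset and a logistic regression model purely for illustration; roc_curve and roc_auc_score come from sklearn.metrics):

import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression().fit(X_train, y_train)
probs = clf.predict_proba(X_test)[:, 1]  #predicted probability of the positive class

#True-positive rate vs false-positive rate at each classification threshold
fpr, tpr, thresholds = roc_curve(y_test, probs)
print('AUC: %.3f' % roc_auc_score(y_test, probs))

plt.plot(fpr, tpr, label='model')
plt.plot([0, 1], [0, 1], linestyle='--', label='random guessing')
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.legend()
plt.show()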

 

 

Model Evaluation & Selection

Now that we’ve explored some of the tools we have at our disposal to evaluate the performance of our model, let’s look at a practical application for how these metrics can be leveraged in evaluating whether our model is useful or not. Continuing with our credit card fraud model, let’s review and interpret the example results for this binary classification task.

  • Accuracy – 97.7% – Very general, telling us that our model predicted the correct label for 97.7% of our samples. It does not inform us of whether those misses were primarily one class or another. In our example this can be potentially problematic if, for example, we were extremely sensitive to fraud yet our model primarily missed on fraudulent charges. Likewise this can be misleading in cases where we have significant class imbalances.
  • Consider a situation where 99.5% of transactions were okay and only 0.5% of transactions were fraudulent. If our model simply guessed that every transaction was okay it would be 99.5% accurate but would have missed every single fraudulent charge. For this reason, we typically want to consider some of the following metrics.
  • Precision – 74.4% – Of the transactions our model flagged as fraudulent, 74.4% actually were fraudulent; in other words, this metric captures how well the model limits false positives. This would be useful if we were worried about having an overly aggressive model and favored a more conservative one that limits false alarms.
  • Sensitivity (Recall) – 95.5% – Our model correctly classified 95.5% of fraudulent transactions, meaning that 4.5% of fraudulent transactions were missed. Whereas our precision metric takes into consideration false positives, this metric is only concerned with truly fraudulent transactions. If our tolerance for fraud were very low, we would be interested in optimizing our model for this metric.
  • Specificity – 97% – Indicates that 97% of good transactions were correctly classified and that 3% of good transactions were misclassified as fraudulent. Very similar to precision in that optimizing this metric would lead to a more conservative model if we were overly concerned with sounding false alarms.

As you have probably surmised by this point, it is nearly if not totally impossible to build a model that performs well across all the metrics. In the real world we typically focus on one area, and part of the model selection process involves tuning or optimizing our model for the task at hand. This tuning process is called hyperparameter tuning and will be the focus of our next post.

Metrics Example

Since this post has been code-free so far (and I’m itching for some code), I’ve included a brief sample that you can run to highlight how several of these metrics can be calculated using Python and scikit-learn. The example uses a generated set of data, with the emphasis being placed on how the various metrics discussed to this point can be calculated.

from sklearn.cross_validation import train_test_split
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import classification_report

#Generate a synthetic binary classification dataset
X, y = make_classification(n_samples=1000, n_features=10, n_classes=2, n_informative=5)

#Split into training and held-out test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

#Train a logistic regression model and predict against the test set
classifier = LogisticRegression()
y_pred = classifier.fit(X_train, y_train).predict(X_test)

#Confusion Matrix
print(confusion_matrix(y_test, y_pred))

#Accuracy Score
print(accuracy_score(y_test, y_pred))

#Precision Score
print(precision_score(y_test, y_pred))

#Classification Report
print(classification_report(y_test, y_pred, target_names=['No', 'Yes']))

 

Wrap-Up

In this post we discussed the various facets of evaluating a classification model. From a straightforward strategy for dividing our experiment data to an introduction of some of the basic metrics that can be used to measure model performance, we’ve set the stage for model evaluation. In the next post, we will look at a strategy to tune our model using a hyperparameter tuning technique called grid search to optimize our model for a given target metric.

 

Till next time!

Chris

Data Science Day #6 – Dimensional Reduction

Between fending off bad food and crisscrossing time zones, the wheels have come off this series’ schedule. At any rate… in the last post, the fifth iteration of this blog series, we looked at several techniques for ranking feature importance and then one method for subsequently doing feature selection using recursive feature elimination (RFE). We used these feature selection methods as a means to address datasets which are highly dimensional, simplifying the data in the hope that it improves our machine learning model. In this post, we will wrap up data preparation by looking at another way to handle highly dimensional data, through a process known as feature extraction or, more accurately, dimensional reduction.

Feature Selection vs Dimensional Reduction

Before diving in, it’s worth spending a few moments to set some context. We spent the prior post discussing feature selection, and for many this probably seems relatively intuitive. If we think in terms of a dataset with many columns and rows, feature selection is all about reducing the number of columns by simply removing them from the dataset. In this case the data is, in effect, lost since it is not included when we go through our model training process. We attempt to minimize this data loss by selecting the most meaningful features, as we saw previously.

Dimensional reduction, on the other hand, is more of a compression technique so to speak. Specifically, it seeks to compress the data while losing as little information as possible. Continuing with our analogy, we reduce the number of features or columns in our data by combining or projecting one or more of the columns into a single new column. This is, in essence, a transformation being applied to your data, and as we will see shortly there are two common methods used to accomplish this.

Principal Component Analysis (PCA)

PCA is an unsupervised reduction technique that tries to maximize variance along new axes within a dataset. Don’t worry if your head is spinning and the math nerds around you are giggling. Let’s look at an example to better illustrate what PCA does and how it works. We will look at both a linear and a non-linear dataset, illustrated below.

Since we are interested in doing classification on these datasets to predict class, and most methods look to linearly separate data for this purpose, the ideal outcome of any technique applied to the data would be to enhance the divisions in the data and make it more readily separable. Now obviously these datasets are neither big nor particularly complex in nature, but let’s see if we can use PCA to reshape them.

Linear Data Set
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification

X, y = make_classification(n_features=6, n_redundant=0, n_informative=6,
                    random_state=1, n_clusters_per_class=1, n_classes=3)

plt.scatter(X[y==0, 0], X[y==0, 1], color='red', alpha=0.5)
plt.scatter(X[y==1, 0], X[y==1, 1], color='blue', alpha=0.5)
plt.scatter(X[y==2, 0], X[y==2, 1], color='green', alpha=0.5)
plt.title('Linear')
plt.show()

LinearData

Non-Linear Data Set
from sklearn.datasets import make_circles

X, y = make_circles(n_samples=1000, random_state=123, noise=0.1, factor=0.2)

plt.figure(figsize=(8,6))

plt.scatter(X[y==0, 0], X[y==0, 1], color='red', alpha=0.5)
plt.scatter(X[y==1, 0], X[y==1, 1], color='blue', alpha=0.5)
plt.title('Concentric circles')
plt.show()

NonLinearData

Beginning with the linear example, we created a “highly” dimensional set with six meaningful features. The initial plot is a little bit of a mess, so we will apply PCA to the dataset and reduce the dimensionality to just two features. The code sample provided uses scikit-learn and then plots the results. As you can see, we now have a much simpler dataset that is clearly easier to separate.

from sklearn.decomposition import PCA

pca = PCA(n_components=2) 
X_pca = pca.fit_transform(X)

plt.scatter(X_pca[y==0, 0], X_pca[y==0, 1], color='red', alpha=0.5)
plt.scatter(X_pca[y==1, 0], X_pca[y==1, 1], color='blue', alpha=0.5)
plt.scatter(X_pca[y==2, 0], X_pca[y==2, 1], color='green', alpha=0.5)

plt.title('Linear PCA')
plt.show()

LinearPCA

The next example is more challenging and is representative of a non-linear dataset. The concentric circles are not linearly separable, and while we could apply a non-linear technique, we can also use PCA. When dealing with non-linear data, if we just apply standard PCA the results are interesting but still not linearly separable. Instead, we will use a special form of PCA that leverages the same kernel trick we saw applied to support vector machines way back in the second post. This implementation is called KernelPCA in scikit-learn, and the code and resulting plots are provided below.

Linear Techniques against Non-Linear Data
import numpy as np
from sklearn.decomposition import PCA

pca = PCA(n_components=2) 
X_pca = pca.fit_transform(X)

plt.scatter(X[y==0, 0], np.zeros((500,1))+0.1, color='red', alpha=0.5)
plt.scatter(X[y==1, 0], np.zeros((500,1))-0.1, color='blue', alpha=0.5)
plt.ylim([-15,15])
plt.title('Linear PCA on Non-Linear Data')
plt.show()

LinearPCAOnNonLinearData

Kernel PCA on Non-Linear Data
from sklearn.decomposition import KernelPCA

kpca = KernelPCA(n_components=2, kernel='rbf', gamma=15)
X_kpca = kpca.fit_transform(X)

plt.scatter(X_kpca[y==0, 0], X_kpca[y==0, 1], color='red', alpha=0.5)
plt.scatter(X_kpca[y==1, 0], X_kpca[y==1, 1], color='blue', alpha=0.5)

plt.title('Kernel PCA')
plt.show()

KernalPCA

Note that we’ve transformed a non-linear dataset into a linearly separable one and can now use linear machine learning techniques like logistic regression, support vector machines or naïve Bayes.
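
To illustrate the point, here is a quick sketch (reusing the X, y and X_kpca variables from the snippets above) that fits a plain logistic regression to both the original circles data and the kernel PCA projection and compares training accuracy:

from sklearn.linear_model import LogisticRegression

#A linear model struggles on the raw concentric circles...
print('Raw data accuracy: %.3f' % LogisticRegression().fit(X, y).score(X, y))

#...but should separate the kernel PCA projection cleanly
print('Kernel PCA accuracy: %.3f' % LogisticRegression().fit(X_kpca, y).score(X_kpca, y))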

The one point we haven’t discussed is the number of components we should target during the reduction process. The samples, as you’ve probably caught on, somewhat arbitrarily use two even though in the case of the linear dataset we have six meaningful features. We can plot the amount of explained variance using PCA as seen in the code below. The resulting plot allows us to select a number of components that explains the bulk of the variance within our dataset.

from sklearn.decomposition import PCA
pca = PCA() 

X_pca = pca.fit_transform(X)

plt.figure(1, figsize=(4, 3))
plt.clf()
plt.plot(pca.explained_variance_, linewidth=2)
plt.axis('tight')
plt.xlabel('n_components')
plt.ylabel('explained variance')

explained_variance

Linear Discriminant Analysis (LDA)

LDA, unlike the PCA discussed in the previous section, is a supervised method that seeks to optimize class separability. Since it is a supervised method for dimensional reduction, it uses the class labels during the training step to do its optimization. Note that LDA is a linear method and that it cannot (or more accurately, should not) be applied to non-linear datasets. That being said, let’s look at a code example as applied to our linear dataset.

from sklearn.lda import LDA 

lda = LDA(n_components=2) 
X_lda = lda.fit_transform(X,y)

plt.scatter(X_lda[y==0, 0], X_lda[y==0, 1], color='red', alpha=0.5)
plt.scatter(X_lda[y==1, 0], X_lda[y==1, 1], color='blue', alpha=0.5)
plt.scatter(X_lda[y==2, 0], X_lda[y==2, 1], color='green', alpha=0.5)

plt.title('Linear LDA')
plt.show()

LinearLDA

In the preceding code, we used our sample data to train the model and then transform, or reduce, our feature set down from six features to two. Both were accomplished in the fit_transform step. When we plot the result we can observe at first glance that using the LDA method has resulted in better linear separability among the three classes of data.

Wrap-Up

In this post, we looked at using Principal Component Analysis and Linear Discriminant Analysis for reducing the dimensionality of both linear and non-linear data. The one question that we left unanswered, however, is: given the two methods, which is better? Unfortunately, as you might have guessed, “better” is relative and there is not a clear answer. As with most things in this space, each is useful in different situations. In fact, sometimes they are even used together. I’d invite you to explore both of these in more depth if you are so inclined.

In our next post on this random walk through machine learning, we are going to leave behind data preparation and move on to modeling or more accurately model selection. We will look at how models are evaluated and some of the common measures you can use to determine how well your selected model performs.

Till next time!

Chris

Data Science Day 5 – Feature Selection

In my prior post, we introduced the CRISP-DM methodology as a broad process to guide us on our adventures in machine learning. We then continued this somewhat random walk by looking at several common techniques for handling or wrangling data in the data preparation step. In this post, we will continue this little adventure by discussing another facet of the data preparation step: feature selection.

The Curse of Dimensionality

In your various travels through the world of data, you’ve probably been given the advice that the more data you can throw at a problem the better off you will be. Well, in most cases this is true, except when it’s not… let me explain.

The curse of dimensionality, within the context of machine learning, refers to the phenomenon in which we have highly dimensional data (a lot of features/attributes/columns) resulting in data sparsity that many machine learning models are ill-equipped to handle. This ill fit can be either a result of a poorly performing model (i.e. poor accuracy or generalization/overfitting) or a model that is computationally too expensive in terms of either resources or time.

When we encounter this issue, we have two options:

  • Find a different model that is better suited to our highly dimensional data
  • Reduce the dimensionality of our data

As you might have guessed, this post is about the latter, as we will explain feature selection as a means to reduce dimensionality by dialing in on those important features within our dataset. Note that a secondary option called feature extraction, in which we reduce dimensionality by reshaping our existing data, can also be used, and this will be the subject of the next post.

Be more selective

While there may be value in all the data within our dataset, typically some features or dimensions are going to be more predictive than others. In this regard, it may be possible to train our model using only those most predictive features thereby reducing the dimensionality and complexity of our data. So how do we go about identifying or “selecting” these features?

Correlation

If you are like me and have been away from school for more than just a few years, it may be useful to start our discussion by doing a small review. Recalling from that statistics class you took, correlation is a measure of the relationship between two variables and is represented as a value or coefficient between 1 and -1.

Correlation measures how two variables move together, and a correlation value of zero is said to indicate no relationship between the variables, meaning they are unrelated. A correlation value of 1 indicates a relationship that is perfectly positively correlated, meaning the two values stay in sync and either grow or shrink together. A correlation value of -1 indicates a perfectly negative relationship, meaning an inverse relationship exists. In this case, as one variable grows the other shrinks and vice versa.
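
As a quick illustration of those coefficient values, here is a minimal sketch using NumPy’s corrcoef:

import numpy as np

np.random.seed(0)
x = np.arange(1000)

print(np.corrcoef(x, 2 * x)[0, 1])                  #1.0 (perfectly positive)
print(np.corrcoef(x, -2 * x)[0, 1])                 #-1.0 (perfectly negative)
print(np.corrcoef(x, np.random.randn(1000))[0, 1])  #close to 0 (no linear relationship)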

We can leverage correlation to help identify relationships for our predictive model and calculating a correlation matrix is trivial in most statistical tools. Let’s look at a quick example.

In the code below, we are again leaning on the Iris dataset and using Python and the Pandas library to solve for the correlation coefficient between each variable or feature in the dataset. Note that I have combined the feature and target data into a single data frame.

import pandas as pd
from pandas import DataFrame
from sklearn import datasets 

iris = datasets.load_iris() 

X = iris.data
y = iris.target

feature_names = iris['feature_names']

df = pd.DataFrame(data=iris.data, columns=feature_names)
df['target'] = iris.target
df.corr()

The corr() function call produces a matrix of coefficients as seen below and as expected you can see that each variable is perfectly correlated to itself.

corr

We can disregard most of the chart since the primary relationships we are interested in are between our target and features. What do you notice? Straight away, petal length and width clearly have the strongest correlation with our target while sepal length is somewhat weaker.

As you can see this was fairly easy, and while informative it’s a rather unsophisticated way to arrive at an understanding of feature importance. If you’ve made it this far in the series you know by now that there are more sophisticated methods at our disposal.

Swing in the trees

A second method that we can use to understand feature importance relative to our target leverages ensembles of decision trees. Decision tree ensembles such as random forests, when trained, produce a weight or score for each feature’s importance within the model. We can use this value to rank features much like we did with correlation.

In the code below, I used the ExtraTreesClassifier within scikit-learn to train a model against the iris dataset. Since this is an ensemble, referred to as a forest in this case, the classifier has multiple trees. We average the feature importances across the trees that make up our forest into a single sorted array, using the standard deviation across trees as an error bar. We use the remaining code to print out the sorted results in both textual and graphical format.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import ExtraTreesClassifier

forest = ExtraTreesClassifier(n_estimators=250,
                              random_state=0)

forest.fit(X, y)
importances = forest.feature_importances_
std = np.std([tree.feature_importances_ for tree in forest.estimators_],
             axis=0)
indices = np.argsort(importances)[::-1]

# Print the feature ranking
print("Feature ranking:")

for f in range(X.shape[1]):
    print("%d. %s (%f)" % (f + 1, 
    feature_names[indices[f]], importances[indices[f]]))

# Plot the feature importances of the forest
fig = plt.figure()
plt.title("Feature importances")
plt.bar(range(X.shape[1]), importances[indices],
       color="r", yerr=std[indices], align="center")
plt.xticks(range(X.shape[1]), indices)
plt.xlim([-1, X.shape[1]])

ax = fig.add_subplot(111)
xtickNames = ax.set_xticklabels(feature_names)
plt.setp(xtickNames, rotation=90, fontsize=10)

plt.show()

trees

When we compare the results of this method with our correlation findings, we confirm that the ranking in terms of importance is identical.

Feature Elimination

So far, we have only discussed methods for identifying and ranking features by their importance. If our goal however is to reduce the dimensionality of our dataset by reducing the number of features, then at some point we need to draw the proverbial line in the sand. So we ask ourselves, where should that line be? What is the right number of features?

We could start by arbitrarily picking some “n” number of features and begin iteratively evaluating our model either manually or by using a technique such as Recursive Feature Elimination (RFE), which I’ve done in the Jupyter notebook provided. While this may get us going, we can lean again on the scikit-learn library and use a method known as Recursive Feature Elimination with Cross Validation (RFECV) to get a more intelligent answer.

Using this method we will, as the name implies, recursively evaluate a defined model against our dataset using stratified cross-validation. The scoring criterion is specified by us and is evaluated at each step as features are recursively eliminated. From the results, we identify the optimal number of meaningful features.

In the code below we are using a linear support vector machine model to recursively evaluate our feature set optimizing for accuracy.

from sklearn.svm import SVC
from sklearn.cross_validation import StratifiedKFold
from sklearn.feature_selection import RFECV

# Create the RFE object and compute a cross-validated score.
svc = SVC(kernel="linear")

rfecv = RFECV(estimator=svc, step=1, cv=StratifiedKFold(y, 2),
              scoring='accuracy')
rfecv.fit(X, y)

print("Optimal number of features : %d" % rfecv.n_features_)

rfecv1

Unsurprisingly, since we are using a toy dataset for our demo, the results indicate that all four of our features should be included in our model. To show the overall improvement in accuracy (our stated scoring criterion), we can plot the results.

# Plot number of features VS. cross-validation scores
plt.figure()
plt.xlabel("# of features")
plt.ylabel("Cross validation score")
plt.plot(range(1, len(rfecv.grid_scores_) + 1), rfecv.grid_scores_)
plt.show()

rfecv2

We can get a better sense of this function’s value if we use a generated dataset. The next snippet of code does just that, and although the generated data is nonsensical, I hope that it gives you a better feel for what this function is accomplishing.

from sklearn.svm import SVC
from sklearn.cross_validation import StratifiedKFold
from sklearn.feature_selection import RFECV
from sklearn.datasets import make_classification

# Use 5 informative features
X, y = make_classification(n_samples=1000, n_features=25, n_informative=5,
                           n_redundant=2, n_repeated=0, n_classes=8,
                           n_clusters_per_class=1, random_state=0)

svm = SVC(kernel="linear")
rfecv = RFECV(estimator=svm, step=1, cv=StratifiedKFold(y, 2),
              scoring='accuracy')
rfecv.fit(X, y)

rfecv3

Wrap-Up

In this post we discussed feature selection. We began by highlighting a couple of different techniques for ranking features by importance, including correlation and decision tree ensembles. We also looked at using recursive feature elimination as a way to intelligently trim our feature set. Keep in mind that this post is not an exhaustive discussion of all things feature selection and is meant only as an introduction.

In the next post, we will dive into a different technique that we can also use to reduce the dimensionality of our data, feature extraction.

Till next time,

Chris

Data Science Day #4 – Data Wrangling

In the preceding post we spent a fairly considerable amount of time understanding both supervised and unsupervised machine learning methods for doing predictions or classifications. With that context, we are going to loop back and begin working our way through a “typical” machine learning project.

Project Framework

Like most other technology projects, there is a common framework for tackling machine learning projects that we can use to guide us from start to finish. Known as CRISP-DM, or the Cross-Industry Standard Process for Data Mining, this iterative framework sets up guideposts that we will use.

crisp-dm_process_diagram

While it is beyond the scope of this post to discuss each step in detail, we can summarize the broad steps.

  • Business Understanding – understand the project objectives and define business requirements. This is the step where we outline the problem definition.
  • Data Understanding – initial data collection, data familiarization, profiling and identification of data quality problems. Typically this is the most time consuming step with some estimates saying 90% of project time is spent here.
  • Data Preparation – this is where we do our data wrangling or data preparation and transformation. We want to identify not only the data we will use but also the relevant features for the task at hand.
  • Modeling – model selection and training, including model tuning/calibration
  • Evaluation – evaluate or measure model results against the stated business objectives
  • Deployment – deploy and operationalize model including maintenance plans for the required care and feeding

Note that this process is highly iterative. Feedback is important from step to step and the process is evolutionary in that feedback is often reincorporated in previous steps meaning that where you start is likely to not be where you end up.

Business & Data Understanding

We are not going to spend any real time discussing these points beyond saying there is a plethora of tools and techniques that you can leverage on the technical side. If you have spent time doing traditional ETL or business intelligence types of tasks, then you have inevitably encountered data quality issues which must be corrected before moving forward.

Data Wrangling

The remainder of this post will focus on data preparation, or wrangling the data into a state where it can be used for machine learning. Specifically, we will focus on five of the most common tasks encountered: handling missing data, dealing with outliers, handling categorical or non-numeric data, binning or bucketing numeric data and, finally, data standardization or normalization.

Handling Missing Data

Data is not only often dirty, it is often incomplete. Missing data elements in your dataset can appear as NULL values, blanks or other placeholders; they not only happen regularly in the real world, they are often beyond your control. Compounding the problem, these missing values are typically incompatible with machine learning models or can lead to suboptimal results. So what can we do when we encounter this scenario?

The first and most obvious answer is that the data can be removed. When we talk about data removal, this can mean eliminating the offending row or record of data or eliminating the entire feature (column) from the dataset. But removing this data comes at the cost of information loss and can be particularly problematic in smaller datasets where every observation or record is valuable.

So we need an alternative, and it comes in the form of what are known as imputed values. As the name implies, this technique for handling missing values allows us to substitute the missing value with a meaningful value. This meaningful value is generated using a strategy that is typically based on one of the measures of central tendency (mean, median or mode – the most common value).

We can demonstrate this technique using the sample provided straight out of the scikit-learn documentation. Note that there are equivalent features in nearly every toolset (R, AzureML, etc.) if you are not using Python and scikit-learn.

In the sample code below, note that we create our imputer using the mean strategy and identify what constitutes a missing value (NULLs are represented as NaN in Python). Next we go through a fit process where the imputer learns the mean (or whatever strategy you chose) for the dataset. The transform call on the imputer substitutes out the missing values in the dataset. The sample code and output are provided below.

import numpy as np
from sklearn.preprocessing import Imputer

imp = Imputer(missing_values='NaN', strategy='mean', axis=0)
imp.fit([[1, 2], [np.nan, 3], [7, 6]])  #learn the per-column means

X = [[np.nan, 2], [6, np.nan], [7, 6]]

print(imp.transform(X))  #substitute missing values with the learned means

The results of running the code above are displayed below.
missing_values

Detecting and Handling Outliers

Outliers in your dataset are observations or points that are extreme. Outliers can be valid (think comparing Bill Gates’ net worth to the net worth of people in your surrounding postal code) or invalid points (occurring as a result of measurement error), and they have the potential to skew your machine learning model, leading to poor or suboptimal results.

The first step in dealing with outliers is being able to identify them, and the first place to start is by looking for values which fall outside the possible range of values. Examples of these can include negative values where they are not expected (i.e. negative values for the height/weight of a person), unrealistic numbers (such as infant weights greater than 100 lbs.) or other values which shouldn’t or can’t occur. This is what I call the smell test and can be done using basic summary statistics (minimum, maximum, etc.).

Next, if your data follows a normal distribution, you can potentially use the mean and standard deviation, following the 68-95-99.7 rule, to identify outliers. Note that this is not always considered statistically sound since it is highly dependent on the distribution of the data (meaning you should proceed with the utmost care and caution).
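
As a rough sketch of that rule of thumb (assuming a roughly normal, one-dimensional feature), we can flag anything more than three standard deviations from the mean:

import numpy as np

np.random.seed(0)
#A normally distributed feature with two injected extreme values
values = np.append(np.random.normal(loc=70, scale=10, size=1000), [160, -25])

mean, std = values.mean(), values.std()
outliers = values[np.abs(values - mean) > 3 * std]

print(outliers)  #should surface the injected extreme points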

Finally, since we are discussing machine learning, you could guess that there are machine learning techniques which allow us to identify both inliers and outliers. Out of the box, scikit-learn has a couple of options for outlier detection (http://scikit-learn.org/stable/modules/outlier_detection.html). You could also resort to using a model such as linear regression to fit a line to your data and then remove some percentage of the data based on the error (predicted value vs actual).
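
And as a sketch of the scikit-learn route, using IsolationForest as one of the available detectors (EllipticEnvelope or OneClassSVM could be swapped in, depending on your data):

import numpy as np
from sklearn.ensemble import IsolationForest

np.random.seed(0)
points = np.concatenate([np.random.normal(0, 1, size=(500, 2)),    #inliers
                         np.random.uniform(-8, 8, size=(10, 2))])  #scattered extreme points

detector = IsolationForest(contamination=0.02, random_state=0)
labels = detector.fit(points).predict(points)  #+1 for inliers, -1 for flagged outliers

print((labels == -1).sum(), 'points flagged as outliers')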

The task at hand determines how we handle outliers. Some models are sensitive to outliers and leaving them in place can lead to skew and ultimately bad predictions (this is common in linear regression). Other models are more robust and can handle outliers meaning it is not necessary to remove them.

Handling Categorical Data

Many machine learning models have specific requirements when it comes to feature data. Most machine learning models are not capable of handling categorical data and instead require all data to be numeric. These cases require that we convert our data.

Categorical data comes in two varieties, and we must first determine which we are dealing with. Ordered categorical data, or ordinal data, has an implicit order (i.e. clothing sizes, customer ratings) and in many cases simply substituting the order sequence number for the category label will suffice (a quick sketch of this is shown below). For non-ordered data this won’t work, since some models may inappropriately derive a relationship based on the numeric order.
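
Here is a minimal sketch of that ordinal substitution, using a hypothetical clothing-size feature and a pandas map:

import pandas as pd

sizes = pd.Series(['S', 'M', 'L', 'M', 'XL'])

#Substitute each ordered category with its position in the ordering
size_order = {'S': 1, 'M': 2, 'L': 3, 'XL': 4}
print(sizes.map(size_order))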

For non-ordered data we instead use a process that encodes the categories as a series of indicators or dummy variables. These dummy variables are simply binary fields and function as a pivoted set of categories. To illustrate this, consider a feature that has three possible categories: red, blue and orange.

When encoded as dummy variables, our dataset would have three new features, is_red, is_blue and is_orange, where each feature’s value is either 0 or 1 based on the sample’s categorical value. To create these dummy features in Python we can use the Pandas library. In the sample code below the get_dummies function handles the pivot and transforms our dataset in a single step.

import pandas as pd

data = ['red', 'blue', 'orange']

pd.get_dummies(data)

categorical

Binning and Bucketing Numerical Data

Another case in which we may want to transform our data involves situations where our data is numeric and/or continuous in nature. In many cases we may choose to bin or bucket it, either because it is required or because it helps optimize our model.

This process is referred to by many names, including discretizing, binning or bucketing, and involves converting our data into a discrete or fixed number of possible values. While there are multiple ways to accomplish this, ranging from fixed buckets to quantiles, we will look at one example using a fictitious customer annual income feature.

import pandas as pd

income = [10000, 35000, 65000, 95000, 1500000, 5000000]
bins = [0, 30000, 75000, 250000, 1000000000]
names = ['Low', 'Middle', 'High', 'Baller']

pd.cut(income, bins, labels=names)

The above code contains an array of incomes ranging from $10,000 to $5,000,000. We’ve defined bins and labels and then used the Pandas cut function to slot each value into the appropriate bucket. The results are displayed below.

continous
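
For the quantile-based alternative mentioned above, a similar sketch using the pandas qcut function splits the same incomes into equally sized groups rather than fixed dollar ranges:

import pandas as pd

income = [10000, 35000, 65000, 95000, 1500000, 5000000]

#Divide the values into 3 quantile-based buckets (equal counts per bucket)
print(pd.qcut(income, q=3, labels=['Low', 'Middle', 'High']))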

Data Standardization and Normalization

The final data wrangling technique we will look at involves bringing features within our dataset to a common scale. But why is this necessary? Oftentimes the features in our dataset include data on diverse scales, for example sales quantity in units and sales revenue in dollars. If these features do not share a common scale, many machine learning models may weigh the largest feature more heavily, skewing the model and giving potentially inaccurate results.

Two common techniques for this are normalization and standardization, and the difference between the two is subtle. Both techniques will bring features to a common scale; with normalization we rescale values to a common range, typically between 0 and 1. This technique is useful when we are trying to measure distance between points, such as when doing k-means clustering.

The second technique, standardization, scales data by shifting it to have a mean of 0 and a standard deviation of 1. This technique preserves the distribution of the data when that is important.

To demonstrate these techniques we can use the iris dataset from the prior post and the preprocessing functions in scikit-learn and compare the minimum and maximum values before and after the transformations are applied.

from sklearn.datasets import load_iris
from sklearn import preprocessing

iris = load_iris()
x = iris.data

normalized_x = preprocessing.normalize(x)
standardized_x = preprocessing.scale(x)

print("Original Min %.3f - Max %.3f" % (x.min(), x.max()))
print("Normalize Min %.3f - Max %.3f" % (
	normalized_x.min(), normalized_x.max()))
print ("Standardized Min %.3f - Max %.3f" % (
	standardized_x.min(), standardized_x.max()))

Note that in the resulting screen snip the normalized values are now all between 0 and 1 and the standardized values are now centered on 0.

scaling
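
One note: the normalize function above rescales each sample (row) to unit length, which is why the values land between 0 and 1 here. If you want to explicitly rescale each feature (column) to the 0 to 1 range, scikit-learn also provides MinMaxScaler; a minimal sketch, reusing the x variable from the snippet above:

from sklearn.preprocessing import MinMaxScaler

#Rescale every feature (column) to the 0-1 range
minmax_x = MinMaxScaler().fit_transform(x)
print("Min-Max Min %.3f - Max %.3f" % (minmax_x.min(), minmax_x.max()))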

Wrap-Up

In this post, we explored data wrangling and more specifically five of the most common tasks you will encounter. We discussed handling missing data, identifying outliers, converting categorical data to numeric, converting numeric data into discrete categories through binning and finally bringing data to common scale through either standardization or normalization.

In our next post, we will continue our journey as we discuss techniques for identifying the most important features through a process called feature selection.

Till next time!

Chris

Data Science Day #3 – Classification Models

In the prior post, we discussed model selection at a meta level. In this post, we will explore various modeling techniques with examples. So let’s go…

Getting your own Demo Environment

This entire series will be presented in Python using Jupyter Notebooks. I will also lean heavily on the scikit-learn library (http://scikit-learn.org/stable/). My intention is, once I get to a good break, to revisit the series and provide parallel samples in R.

To follow along you will simply need access to a Jupyter Notebook. The good news is that this is easy and doesn’t require you to install, set-up or configure anything on your local machine if you use Microsoft Azure Machine Learning (Azure ML) studio. The Azure ML service is free to sign-up for and the workspace allows you to create both Jupyter Notebooks and of course machine learning experiments (we will talk about this later). To sign-up for a free account visit: https://studio.azureml.net/.

The Demo Data

We will be using the Iris dataset from the UCI Machine Learning Repository (http://archive.ics.uci.edu/ml/datasets/Iris) since it’s widely available and well known. The Iris dataset contains various features about flowers and is used to predict the class of flower based on its features. Best of all, using it is simple through the scikit-learn library.

For our demo, we will limit our examples to only the features that describe petal length and width as well as the label. The label is multiclass since there are three classes (setosa, versicolor, virginica) of flowers represented.

Using Python, scikit-learn provides easy access to the dataset and the code to access the data and plot it on a graph is provided below:

import matplotlib.pyplot as plt
from sklearn import datasets

iris = datasets.load_iris() 

X = iris.data[:, [2, 3]] #only use petal length and width
y = iris.target

plt.scatter( X[y == 0,0], X[y == 0,1], 
            color ='red', marker ='^', alpha = 0.5) 
plt.scatter( X[y == 1,0], X[y == 1,1], 
            color ='blue', marker ='o', alpha = 0.5)
plt.scatter( X[y == 2,0], X[y == 2,1], 
            color ='green', marker ='x', alpha = 0.5)

plt.show()

The code generates a plot of our data with the petal length on the X-axis and petal width on the Y-axis.

iris

The features of our data (petal length and width) are both numeric, and you can tell by the shape of the data that it is linearly separable. So we are at a pretty good place to get started. Before we jump in, though, we need to divide our dataset into training and testing datasets. This is necessary, if you recall, for supervised learning models since they must be trained.

To get a randomized split, we use the train_test_split function from scikit-learn as seen below using 70% of the data for training and 30% of the data for testing.

from sklearn.cross_validation import train_test_split
import numpy as np 

#Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, 
     test_size = 0.3, random_state = 0)

Next we will walk through five different models that can be used for classification. The demos are presented at a fairly high level and I will not get into parameter (or hyper-parameter) tuning. We will discuss parameter tuning techniques such as grid search in a future post.

It should also be noted that the plots presented were generated by a function Sebastian Raschka presented in his book Python Machine Learning. Out of respect, I will not reproduce his code here and will invite you to buy his book if you are interested in the snippet.

Without further ado, let’s get started.

NOTE: You can download a copy of the Python Notebook used for these demos HERE.

Classification Using Logistic Regression

The linear logistic regression algorithm that is built into the scikit-learn library is a probabilistic model that is highly flexible and can be implemented with different solvers. These solvers make it capable of handling both small binary classification problems as well as multiclass classification on large datasets.

One of the most important parameters in logistic regression is the regularization strength. This coefficient is represented by the parameter ‘C’ in the code sample below, and higher values indicate weaker regularization.

The default value of this parameter when it is not specified is set to 1.0. Running the experiment with the default values (no C value specified) results in a model that is 68.9% accurate in classifying our data which translates into 14 of the 45 test samples being classified incorrectly.

When we tweak the regularization parameter, say to 1000 as was done below, at the risk of overfitting the model to our training data (more on this later), we come up with much better results. Running the code below produces a model that is 97.8% accurate, with only 1 of the 45 test samples misclassified.

from sklearn.linear_model import LogisticRegression

#Create a model and train it
lr = LogisticRegression(C=1000)
lr.fit(X_train, y_train)

pred = lr.predict(X_test)
print("Number of mislabeled points out of a total %d points : %d"
      % (X_test_std.shape[0],(y_test != pred).sum()))

#Score the model...should result in 97.8% accuracy
score = lr.score(X_test, y_test)
print('Model Accuracy: %.3f' % (score)) 

To further illustrate how the models are working, I plotted the decision regions for each trained model using Sebastian Raschka’s technique. This visual is helpful for pulling back the covers and understanding how each algorithm works by plotting multiple points between the min/max values of our X and Y axes, which correspond to the petal length and width. The shape of the decision region, as you will see in subsequent examples, may be linear or non-linear (quadratic).

logisticregression

Classification Using Naïve Bayes

The second model we will explore is also a probabilistic model, noted for being extremely high performance and providing good results even though it is relatively unsophisticated. With scikit-learn there are multiple Naïve Bayes algorithms available that you can experiment with. For our demo we will only look at one.

from sklearn.naive_bayes import GaussianNB

nb = GaussianNB()
nb.fit(X_train, y_train)

pred = nb.predict(X_test)
print("Number of mislabeled points out of a total %d points : %d"
      % (X_test.shape[0],(y_test != pred).sum()))

score = nb.score(X_test, y_test)
print('Model Accuracy: %.3f' % (score)) 

Running the sample code in which default configuration is used across the board, we wind up with a model that performs really well with our data. Out of the box, the Gaussian Naïve Bayes model matches our prior example with 97.8% accuracy. When we plot the decision regions, note the difference in the shapes that were produced.

naivebayes

Classification Using Support Vector Machines

In our third example, we will look at an implementation of a support vector machine (SVM). SVM models perform very similarly to logistic regression models except that we can handle both linear and non-linear data using what’s known as the kernel trick. I have demonstrated this in the code sample below.

from sklearn.svm import SVC

svm = SVC(kernel='linear')
#svm = SVC(kernel='rbf')

svm.fit(X_train, y_train)

pred = svm.predict(X_test)
print("Number of mislabeled points out of a total %d points : %d"
      % (X_test.shape[0],(y_test != pred).sum()))

score = svm.score(X_test, y_test)
print('Model Accuracy: %.3f' % (score)) 

svm

Classification Using Decision Trees

In our fourth example we implement a decision tree classifier. Recall from the prior blog post that a decision tree algorithm seeks to split the dataset using rules that maximize information gain and minimize entropy. One of the main parameters or arguments for your tree is the maximum depth (max_depth). This parameter sets the maximum number of levels the algorithm will consider before it stops. Setting max_depth to a higher level may make the model perform better on the training data, but you are likely to overfit the model and it will generally perform more poorly in the real world.

The implementation is straightforward and, like all the models we’ve looked at so far, it performs really well with our data (97.8% accuracy).

from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(criterion='entropy', max_depth=3)
tree.fit(X_train, y_train)

pred = tree.predict(X_test)
print("Number of mislabeled points out of a total %d points : %d"
      % (X_test.shape[0],(y_test != pred).sum()))

score = tree.score(X_test, y_test)
print('Model Accuracy: %.3f' % (score)) 

When we plot the decision regions we can see a pretty big difference, however; since the decision tree splits our data rather than attempting to fit a line, we wind up with boxes as seen below.

trees

Classification Using K-Means Clustering

The final classification algorithm we will look at is significantly different from the prior examples. K-means clustering is a form of unsupervised learning. In the prior examples, we had to fit or train our supervised models to our labelled training data. We then could use the trained model to make predictions, which we did in the form of scoring the model using our held-out test data.

Clustering, on the other hand, does not use labelled data. Instead it seeks to form clusters of points based on distance. To do typical clustering we must provide, at a minimum, the number of clusters to form. We can cheat a little here since we know there are three classes we are trying to predict.

For each cluster a random center point, or centroid, is placed. The algorithm will then iterate (up to our max_iter parameter) and adjust the centroids to maximize the fit of the points to the clusters.

from sklearn.cluster import KMeans
from sklearn import metrics

km = KMeans(n_clusters=3, init='k-means++', max_iter=100, n_init=1)
km.fit(X_train)

We can now plot our data, including the centroids, to better visualize this. Note that the clusters it found are not necessarily tied to our true labels or the ground truth. When we use this model to predict or classify a point, the result will be a cluster number. It is up to you to associate the clusters back to your labels.
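
A minimal sketch of how such a plot could be generated (reusing the X_train and km objects from the snippet above):

import matplotlib.pyplot as plt

#Color each training point by the cluster k-means assigned to it
plt.scatter(X_train[:, 0], X_train[:, 1], c=km.labels_, alpha=0.5)

#Overlay the learned centroids
plt.scatter(km.cluster_centers_[:, 0], km.cluster_centers_[:, 1],
            color='red', marker='*', s=200, label='centroids')

plt.legend()
plt.show()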

 

 

 Wrap-Up

In this post we used Python and scikit-learn to experiment with multiple different classification models. We looked at five implementations covering both supervised and unsupervised methods that are capable of handling various types of data. We still have barely even scratched the surface.

In the next post, we are actually going to take a step backwards as we start a discussion on data wrangling and preparation.

Till next time!

Chris

Geospatial Queries Using Hive

During one recent engagement, I was helping my customer align ETL activities that were originally developed using SQL Server and T-SQL with the capabilities that were available using Hadoop and Hive. Most of the translations were straightforward, but when asked about geospatial support in Hive I was admittedly stumped. After full disclosure to the customer that I am not a geospatial expert, I began digging for an answer.

Knowing that most if not all of the query functionality available in Hive happens through the use of user-defined functions (UDFs), I began my research by looking to see if someone else had already skinned the proverbial cat, and I found a gold mine. ESRI, a provider of GIS software, has not only already solved the problem, but has provided their toolkit of Hive functions to the open-source community via GitHub (get it HERE). For the remainder of this post, I will review the steps necessary to set up and use this toolkit and also walk through a real-world demo.

Set-Up

To begin, there are two JAR files needed:

  • esri-geometry-api.jar – contains all the base/foundational classes
  • spatial-sdk-hadoop.jar – contains UDF and SerDe implementations

Since you can download these jar files already compiled from the ESRI GitHub site, I’ll forgo the particulars on building each from its source and get right into the steps required to run these on a HDInsight cluster.

  1. Upload both jar files to a container in your blob storage account. Make note of whatever folder/directory structure you use.
  2. Use the following format for the path to your jar files:
    • wasb://<container>@<account>.blob.core.windows.net/<directory>/esri-geometry-api.jar
    • wasb://<container>@<account>.blob.core.windows.net/<directory>/spatial-sdk-hadoop.jar
  3. Using the Hive CLI or the HDInsight Query Console, declare both jar files; alternatively this can be done at the time you spin up your cluster.
    add jar wasb://<container>@<account>.blob.core.windows.net/<directory>/esri-geometry-api.jar;
    add jar wasb://<container>@<account>.blob.core.windows.net/<directory>/spatial-sdk-hadoop.jar;
  4. For each function that you intend to use, declare a temporary function as seen in the example. Note that Hive v0.13 supports permanent functions if you will be making widespread use of these functions.
    create temporary function ST_Point as 'com.esri.hadoop.hive.ST_Point';
    create temporary function ST_Intersects as 'com.esri.hadoop.hive.ST_Intersects';

Now you’re ready to test. To test these functions without a spatial dataset, we can use two fixed points defined by their latitude and longitude. Since I spend an inordinate amount of time on an airplane, I’ve picked a point near La Guardia International (LGA) airport in New York and a second point near Los Angeles International (LAX) in California. From these two points, we can build a line string which will then allow us to measure the geodesic distance between the two points. The query, including the function definitions discussed above, looks like this:

set hive.execution.engine=tez;

add jar wasb://gis@bluewaterdata.blob.core.windows.net/lib/esri-geometry-api.jar;
add jar wasb://gis@bluewaterdata.blob.core.windows.net/lib/spatial-sdk-hadoop.jar;

create temporary function ST_Point as 'com.esri.hadoop.hive.ST_Point';
create temporary function ST_LineString as 'com.esri.hadoop.hive.ST_LineString'; 
create temporary function ST_GeodesicLengthWGS84 as 'com.esri.hadoop.hive.ST_GeodesicLengthWGS84'; 
create temporary function ST_SetSRID as 'com.esri.hadoop.hive.ST_SetSRID'; 

SELECT
    ST_GeodesicLengthWGS84(
        ST_SetSRID(
            ST_Linestring(40.773773, -73.872307, 33.944115, -118.40377),
            4326))/1000.0 * 0.62137119 AS DistanceInMiles;

image

Building a Hive Geospatial Demo

Now that I’ve whetted your appetite for the subject matter, let’s turn our attention to building a more real-world demo, beginning with the data. There are a number of examples floating around the interwebs that use a mythical set of Uber data. I say mythical because there are bigfoot-like sightings of it, but it appears as though the dataset has been removed from the public domain and is no longer available.

That being said, a suitable substitute is the NYC Taxi Trip data made available by the NYC Taxi Commission and Chris Whong via a FOIL request (read more about it and get the data here). This dataset consists of two parts: one with trip information (pick-up/drop-off date, time and locations) and the second with fare and payment information. To build/run the demo below you will need to download the data from one of its many free sources and then upload it to a blob storage account connected to your HDInsight cluster. Now spin up an appropriately sized cluster and let’s go.

First let’s create two external tables over top of the data file:

DROP TABLE IF EXISTS fares;
CREATE EXTERNAL TABLE fares (
medallion STRING,
hack_license STRING,
vendor_id STRING,
pickup_datetime STRING,
payment_type STRING,
fare_amount FLOAT,
surcharge FLOAT,
mta_tax FLOAT,
tip_amount FLOAT,
tolls_amount FLOAT,
total_amount FLOAT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION 'wasb://<container>@<account>.blob.core.windows.net/fares/'
TBLPROPERTIES ("skip.header.line.count"="1");

DROP TABLE IF EXISTS trips;
CREATE EXTERNAL TABLE trips (
medallion STRING,
hack_license STRING,
vendor_id STRING,
rate_code STRING,
store_and_fwd_flag STRING,
pickup_datetime STRING,
dropoff_datetime STRING,
passenger_count INT,
trip_time_in_secs INT,
trip_distance FLOAT,
pickup_longitude STRING,
pickup_latitude STRING,
dropoff_longitude STRING,
dropoff_latitude STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION 'wasb://<container>@<account>.blob.core.windows.net/trips/'
TBLPROPERTIES ("skip.header.line.count"="1");

Once the tables are in place, issue a few SELECT statements using the LIMIT clause to verify that everything is primed and ready to go.
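
For example, a quick sanity check might look like the following (a sketch; adjust the row counts and columns as you see fit):

SELECT * FROM trips LIMIT 10;
SELECT * FROM fares LIMIT 10;
SELECT COUNT(*) AS trip_rows FROM trips;
SELECT COUNT(*) AS fare_rows FROM fares;

With the set-up out of the way, let's turn our attention to a use case.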

The narrative for this demo is straightforward:

The Contoso Taxi company is exploring new business opportunities and would like to analyze trips that either originated or terminated at La Guardia International Airport during the last year.

The primary challenge, as far as this post is concerned, is identifying which trips are in scope for the purpose of analysis.

To start, the data for this time period is moderately substantial, with about 2 GB generated each month, which makes using Hive beneficial since we can leverage its distributed nature for processing. Complicating matters, however, is that the trip data only contains a latitude/longitude pair for each pickup and drop-off location. So how can we go about solving this problem in Hive using the ESRI Geospatial UDFs?

First, I used one of the many free mapping tools found on the interwebs to define a polygon that encapsulates La Guardia. This polygon is defined by a series of lat/long points (note the latitude/longitude ordering, which matters when we build points to test against it) and is represented in well-known text (WKT) format as seen below.

POLYGON ((40.773506 -73.889508, 40.774081 -73.884028, 40.779855 -73.884919, 
40.780605 -73.879043, 40.781755 -73.879373, 40.782805 -73.878614, 
40.78143 -73.87528, 40.786004 -73.871153, 40.785379 -73.869668, 
40.785079 -73.869734, 40.783705 -73.869304, 40.78233 -73.87046, 
40.782255 -73.871648, 40.78063 -73.872804, 40.772881 -73.856991, 
40.772355 -73.856661, 40.770381 -73.858081, 40.76753 -73.852666, 
40.765155 -73.851214, 40.764905 -73.851676, 40.767105 -73.853062, 
40.768781 -73.856694, 40.76703 -73.858707, 40.766805 -73.862768, 
40.770356 -73.867753, 40.771481 -73.870228, 40.771931 -73.872143, 
40.772031 -73.873893, 40.771755 -73.876236, 40.771256 -73.877887, 
40.770505 -73.87967, 40.769606 -73.881321, 40.76853 -73.8834, 
40.767756 -73.885513, 40.76723 -73.887428, 40.767105 -73.888055, 
40.773506 -73.889508))

Using the shape data, I can then create a polygon object using the ST_Polygon UDF. Next, I use the ST_Point UDF to create a point for each trip’s pickup and drop-off location, passing the coordinates in the same latitude/longitude order used by the polygon so that ST_Contains compares like with like. Finally, I use ST_Contains as a predicate to find trips that either begin or end within the polygon. The resulting abbreviated query looks as follows:

SELECT *
FROM trips
WHERE 
    ST_Contains(ST_Polygon("<OMITTED FOR BREVITY>"), ST_Point(pickup_latitude, pickup_longitude))
OR 
    ST_Contains(ST_Polygon("<OMITTED FOR BREVITY>"), ST_Point(dropoff_latitude, dropoff_longitude));

Since we are interested in analyzing only the trips identified, we would most likely load these into a new table, making whatever data transformations are needed in the process. The resulting full example, including a new partitioned internal table and the function declarations, is provided below.

DROP TABLE IF EXISTS nyc_taxi_analysis;

CREATE TABLE nyc_taxi_analysis (
    medallion STRING,
    hack_license STRING,
    vendor_id STRING,
    rate_code STRING,
    store_and_fwd_flag STRING,
    passenger_count INT,
    trip_time_in_secs INT,
    trip_distance FLOAT,
    pickup_longitude STRING,
    pickup_latitude STRING,
    dropoff_longitude STRING,
    dropoff_latitude STRING,
    trip_hour INT,
    trip_week INT,
    payment_type STRING,
    fare_amount FLOAT,
    surcharge FLOAT,
    mta_tax FLOAT,
    tip_amount FLOAT,
    tolls_amount FLOAT,
    total_amount FLOAT
)
PARTITIONED BY (trip_year INT, trip_month INT, trip_day INT)
STORED AS ORC;

set hive.execution.engine=tez;
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

add jar wasb://gis@bluewaterdata.blob.core.windows.net/lib/esri-geometry-api.jar;
add jar wasb://gis@bluewaterdata.blob.core.windows.net/lib/spatial-sdk-hadoop.jar;

create temporary function ST_Point as 'com.esri.hadoop.hive.ST_Point';
create temporary function ST_Contains as 'com.esri.hadoop.hive.ST_Contains';
create temporary function ST_Polygon as 'com.esri.hadoop.hive.ST_Polygon';

INSERT OVERWRITE TABLE nyc_taxi_analysis PARTITION (trip_year, trip_month, trip_day)
SELECT
    t.medallion,
    t.hack_license,
    t.vendor_id,
    t.rate_code,
    t.store_and_fwd_flag,
    t.passenger_count,
    t.trip_time_in_secs,
    t.trip_distance,
    t.pickup_longitude,
    t.pickup_latitude,
    t.dropoff_longitude,
    t.dropoff_latitude,
    HOUR(t.pickup_datetime) AS trip_hour,
    WEEKOFYEAR(t.pickup_datetime) AS trip_week,
    f.payment_type,
    f.fare_amount,
    f.surcharge,
    f.mta_tax,
    f.tip_amount,
    f.tolls_amount,
    f.total_amount,
    YEAR(t.pickup_datetime) AS trip_year,
    MONTH(t.pickup_datetime) AS trip_month,
    DAY(t.pickup_datetime) AS trip_day
FROM Trips t
JOIN Fares f ON t.medallion = f.medallion
AND t.hack_license = f.hack_license
AND t.vendor_id = f.vendor_id
AND t.pickup_datetime = f.pickup_datetime
WHERE 
    ST_Contains(ST_Polygon("POLYGON ((40.773506 -73.889508, 40.774081 -73.884028, 
40.779855 -73.884919, 40.780605 -73.879043, 40.781755 -73.879373, 
40.782805 -73.878614, 40.78143 -73.87528, 40.786004 -73.871153, 
40.785379 -73.869668, 40.785079 -73.869734, 40.783705 -73.869304, 
40.78233 -73.87046, 40.782255 -73.871648, 40.78063 -73.872804, 40.772881 -73.856991, 
40.772355 -73.856661, 40.770381 -73.858081, 40.76753 -73.852666, 40.765155 -73.851214, 
40.764905 -73.851676, 40.767105 -73.853062, 40.768781 -73.856694, 40.76703 -73.858707, 
40.766805 -73.862768, 40.770356 -73.867753, 40.771481 -73.870228, 40.771931 -73.872143, 
40.772031 -73.873893, 40.771755 -73.876236, 40.771256 -73.877887, 40.770505 -73.87967, 
40.769606 -73.881321, 40.76853 -73.8834, 40.767756 -73.885513, 40.76723 -73.887428, 
40.767105 -73.888055, 40.773506 -73.889508))"), 
ST_Point(pickup_latitude, pickup_longitude)) -- lat/long order matches the polygon vertices
OR 
    ST_Contains(ST_Polygon("POLYGON ((40.773506 -73.889508, 40.774081 -73.884028, 
40.779855 -73.884919, 40.780605 -73.879043, 40.781755 -73.879373, 
40.782805 -73.878614, 40.78143 -73.87528, 40.786004 -73.871153, 
40.785379 -73.869668, 40.785079 -73.869734, 40.783705 -73.869304, 
40.78233 -73.87046, 40.782255 -73.871648, 40.78063 -73.872804, 40.772881 -73.856991, 
40.772355 -73.856661, 40.770381 -73.858081, 40.76753 -73.852666, 40.765155 -73.851214, 
40.764905 -73.851676, 40.767105 -73.853062, 40.768781 -73.856694, 40.76703 -73.858707, 
40.766805 -73.862768, 40.770356 -73.867753, 40.771481 -73.870228, 40.771931 -73.872143, 
40.772031 -73.873893, 40.771755 -73.876236, 40.771256 -73.877887, 40.770505 -73.87967, 
40.769606 -73.881321, 40.76853 -73.8834, 40.767756 -73.885513, 40.76723 -73.887428, 
40.767105 -73.888055, 40.773506 -73.889508))"), 
ST_Point(dropoff_latitude, dropoff_longitude));

I ran the above script on a 16-node cluster and it completed in 638 seconds (your results may vary). Query the results to sample and review them and ensure everything is in order.
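
As a quick sketch (the column and partition names follow the table defined above; adjust to taste), a sampling query might look like this:

SELECT trip_year, trip_month, COUNT(*) AS trips, ROUND(AVG(total_amount), 2) AS avg_total_fare
FROM nyc_taxi_analysis
GROUP BY trip_year, trip_month
ORDER BY trip_year, trip_month;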

Wrap-Up

In this post I introduced the ESRI Geospatial UDF library and the steps needed to run it on an HDInsight cluster. I then walked through building a demo on real-world data to highlight a portion of the capabilities these functions provide. I hope you were able to take away at least a high-level overview of the geospatial capabilities available within Hive and that, should the need arise in a future project, you can leverage them as appropriate.

Till next time!

Chris

Hello…Azure Data Factory!

Overview

The boundaries between on-premise and cloud-born data continue to blur as more and more organizations move to hybrid data landscapes. The blurring of these lines introduces a number of complexities, from data locality to scale and even the diversity of the data we must consume and process. To meet these challenges head-on, Microsoft introduced a new managed cloud service known as Azure Data Factory (ADF).

Four Pillars of Azure Data Factory

So what is Azure Data Factory? That question is best answered by describing the four (current) pillars of ADF.

  1. Moving on-premise data to the cloud (and cloud data to on-premise). One of the primary use-cases of Azure Data Factory is as a pipeline to move or copy data across the on-premise/cloud boundary. Whether ingress or egress from the cloud, ADF is built to facilitate not only the orchestration of data movement but also the transformations necessary to move your data to and from a variety of sources and destinations. The currently supported sources and destinations are: SQL Server (on-premise), Azure Blobs, Azure Tables and Azure SQL databases. Please also note that ADF can be used to move data from one cloud source to another.
  2. Processing data at scale (HDInsight). While moving data from point A to point B is useful in a number of situations, oftentimes it is necessary to transform and enrich your data along the way. Handling this processing can be non-trivial, particularly when scale is an issue. Azure Data Factory can handle processing data of any size by leveraging HDInsight. Using either a cluster of your own or one that Azure Data Factory creates on demand, you can process, transform and enrich your data using C#, Pig, Hive and even MapReduce.
  3. Integrate and enrich disparate data. The diversity of data and how you bring it all together presents interesting and unique challenges. How do you process that JSON data from your social media platform to enrich the more traditional, relational data from your on-premise CRM system? What about all that XML data provided by your partner or that old-school order management system? With Azure Data Factory, you can build both complex (and trivial) workflows, orchestrating the outputs of one pipeline into the inputs of another. For example, use one pipeline to shred that JSON data with a Pig script, while another loads your transactional CRM data from an on-premise SQL Server into Azure blob storage. A third pipeline could then take both outputs, integrating them together using Hive or a custom activity written in C#.
  4. Orchestrate and monitor data processing workflows. If you've done any amount of ETL work, you know firsthand that it's one thing to build a workflow and yet another to fully operationalize it. Too often in ETL work, things like logging and monitoring are afterthoughts. Not with Azure Data Factory. Whether you use the Azure Dashboard Portal or PowerShell, you not only have robust access to the details of every process or workflow in your Data Factory but also out-of-the-box alerting in the event that something does go awry.

Wrap-Up

So now that you have an idea of what Azure Data Factory is, I suspect that some of you, like me, are itching to start using it. Over the next couple of posts, we will dig into the Data Factory, explore specific use cases and walk through functional examples. If you would like to get a jump start, I'd invite you to check out the documentation and examples over at http://azure.microsoft.com/en-us/documentation/services/data-factory/.

Lastly, keep in mind that Azure Data Factory is currently in preview. If you haven't previously worked with any of the Azure services in preview, that basically means it will be subject to changes with little to no notice as it evolves towards a more formal GA release.

Till next time!

Chris