Continuing this series on Data Science from a practitioner's perspective…

In part one of this series we introduced the very basics of one part of machine learning: predictive analytics. In this post, we will dig a little deeper and explore some of the more common binary and multiclass classification techniques, beginning with the very basics of model selection.

As a quick refresher, recall that classification is one way we can make predictions using supervised learning. If we think in terms of the canonical e-commerce or digital marketing examples, we can quickly come up with a couple of scenarios we might want to predict:

- Scenario 1: Whether a customer/website visitor will purchase our product or will click on a link
- Scenario 2: Whether an economy or premium product is best suited to our customer

Both of these scenarios are classification tasks. Binary classification simply means that we only have two classes or possible states (yes/no, true/false, etc.) to predict, which fits the first scenario.

Multiclass classification is an extension of binary classification and involves predicting membership in one of several possible classes (e.g. free, economy, standard, premium), as seen in the second scenario. Many algorithms handle multiclass problems using a one-versus-all (OvA) approach, in which a binary classifier is trained for each class and their outputs are compared to determine the “best” class.
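To make the one-versus-all idea a little more concrete, here is a minimal sketch. It assumes each class already has a trained binary scorer; the `make_linear_scorer` helper, the single "monthly spend" feature and the hand-picked weights are all hypothetical and only there for illustration.

```python
from typing import Callable, Dict, Sequence

Scorer = Callable[[Sequence[float]], float]


def make_linear_scorer(weights: Sequence[float], bias: float) -> Scorer:
    """Build a toy 'class vs. everything else' scorer (hypothetical stand-in
    for a trained binary classifier). Higher score = more confident."""
    def score(features: Sequence[float]) -> float:
        return sum(w * x for w, x in zip(weights, features)) + bias
    return score


def predict_one_vs_all(scorers: Dict[str, Scorer], features: Sequence[float]) -> str:
    """Ask every per-class scorer for its score and return the best class."""
    scores = {label: scorer(features) for label, scorer in scorers.items()}
    return max(scores, key=scores.get)


# One scorer per class; the weights are made up, not learned from data.
scorers = {
    "free": make_linear_scorer([-0.10], 1.0),
    "economy": make_linear_scorer([0.00], 0.0),
    "premium": make_linear_scorer([0.10], -3.0),
}

# The single feature is an assumed "monthly spend" value.
for monthly_spend in (5.0, 20.0, 50.0):
    print(monthly_spend, predict_one_vs_all(scorers, [monthly_spend]))
```

In a real project the scorers would come from whichever binary model you trained for each class; the comparison step stays the same.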

### Understanding model selection

Within the classification space there are a number of mathematical, statistical and machine learning techniques we can apply to accomplish the task at hand. Choosing the appropriate one involves understanding your data, its features and the nuances of each model.

All of the models we are concerned with begin with your data. In supervised learning, the model you select is typically trained iteratively. In each iteration a cost or error function is used to evaluate the model. When a defined optimum state or the maximum number of iterations has been reached, the training process is complete. So why does it matter?
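The training loop described above can be sketched in a few lines. This is only an illustration under some assumptions of mine: a logistic model with a single, already-scaled feature, log-loss as the cost function, plain gradient descent, and made-up values for `learning_rate`, `max_iterations` and `tolerance`.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def log_loss(weight, bias, xs, ys):
    """Cost function: average log-loss of the current model on the data."""
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(weight * x + bias)
        total -= y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
    return total / len(xs)


def train(xs, ys, learning_rate=0.1, max_iterations=1000, tolerance=1e-6):
    weight, bias = 0.0, 0.0
    cost = log_loss(weight, bias, xs, ys)
    iterations_run = 0
    for _ in range(max_iterations):
        iterations_run += 1
        # Gradient of the cost with respect to the weight and the bias.
        errors = [sigmoid(weight * x + bias) - y for x, y in zip(xs, ys)]
        grad_w = sum(e * x for e, x in zip(errors, xs)) / len(xs)
        grad_b = sum(errors) / len(xs)
        weight -= learning_rate * grad_w
        bias -= learning_rate * grad_b
        # Evaluate the model again and stop once the cost barely changes.
        new_cost = log_loss(weight, bias, xs, ys)
        converged = abs(cost - new_cost) < tolerance
        cost = new_cost
        if converged:
            break
    return weight, bias, cost, iterations_run


# Toy data: one scaled feature, label 1 = purchased, 0 = did not purchase.
xs = [-2.0, -1.5, -1.0, 1.0, 1.5, 2.0]
ys = [0, 0, 0, 1, 1, 1]
weight, bias, cost, iterations_run = train(xs, ys)
print(f"stopped after {iterations_run} iterations with cost {cost:.4f}")
```

Whichever stopping rule fires first (the cost has stopped improving, or the iteration budget is spent) ends the training.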

The key to this process is your data. More specifically, the features (attributes or variables) and examples (rows) within your data. Different models are not only capable of handling different data types but also perform differently based on the number of features within your data. For now we won’t concern ourselves with data wrangling, preparation, cleansing, feature selection, etc., as those will be topics of future posts. So about your data….

Raw features appear in a couple of different forms. First and most obvious is numeric. Numeric features, as the name implies, are simply numbers that can be either discrete or continuous in nature. Examples of numeric features include attributes such as age, mileage, income or, as we will see in the upcoming examples, the length and width of flower petals.

The second type of feature you will likely encounter in your data is called categorical or nominal. Examples of this type of feature include things like gender, ethnicity, color, and size. Within this type there is a special sub-type to be aware of: the ordinal. Ordinal features are simply categorical features where the order is potentially important. Examples include sizes (e.g. Small, Medium, Large) or product/customer review ratings.
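Since many models expect numbers, categorical features are commonly converted into a numeric representation. Below is a small sketch of two common approaches: one-hot encoding for an unordered feature and a rank mapping for an ordinal one. The `COLORS` list, the `SIZE_RANK` mapping and the helper names are illustrative assumptions.

```python
# Assumed category values, for illustration only.
COLORS = ["red", "green", "blue"]                   # nominal: no natural order
SIZE_RANK = {"Small": 0, "Medium": 1, "Large": 2}   # ordinal: order matters


def one_hot(value, categories):
    """Return a list with a 1 in the position of `value` and 0 elsewhere."""
    return [1 if value == category else 0 for category in categories]


def encode_example(color, size):
    """Turn one (color, size) example into a purely numeric feature vector."""
    return one_hot(color, COLORS) + [SIZE_RANK[size]]


print(encode_example("green", "Medium"))
print(encode_example("blue", "Large"))
```

One-hot encoding avoids implying an order between colors, while the rank mapping deliberately keeps the Small < Medium < Large ordering.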

I’d be remiss if I didn’t briefly discuss the need to cleanse and prep your data. Many of the models we will explore require data preparation steps such as feature scaling to perform their best. Again, I will defer on discussing this until a later post.

So we know we have to consider the type of data we are working with, what else…

### Data Shape and Linear vs. Non-Linear Separation…Say what?!?

If we generalize, we can think of each example within our data as a vector (array) of features. To visualize this, you could take each vector and plot it on a graph or chart. Once your data is plotted, think of the model you choose as being responsible for trying to separate the points on the chart into their classes. This gets really hard (if not impossible) to picture with many dimensions (i.e. lots of features), so we will stick with a two-feature, two-dimensional example.

If we think in two-dimensional terms, we can easily visualize linearly and non-linearly separable data. In the diagrams below, I have generated two plots: the first linear and the second non-linear. Imagine now that you must construct a line to separate the data points.

Clearly the diagram on the left is linear, since we can easily divide the plot with a line (the yellow line). The diagram on the right with the moon shapes, however, is illustrative of non-linear data and cannot be accurately separated with a line. So why does this matter in model selection?

Simple. Some models are designed for and work best when data is clearly linearly separable, while others work on non-linear data. I will spare you the math, and we will explore examples of both.
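If you would like to see the difference in code, here is a rough sketch. It uses a perceptron, a simple linear model that I am swapping in for illustration, and two tiny hand-made datasets instead of the plotted ones: an AND-style labelling that a line can separate and an XOR-style labelling that no line can. The function name and the `max_epochs` value are assumptions.

```python
def perceptron_separates(points, labels, max_epochs=1000):
    """Train a perceptron and report whether it ever classifies every
    point correctly, i.e. whether it found a separating line."""
    w1, w2, bias = 0.0, 0.0, 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for (x1, x2), label in zip(points, labels):
            predicted = 1 if w1 * x1 + w2 * x2 + bias > 0 else 0
            error = label - predicted
            if error != 0:
                # Nudge the line towards the misclassified point.
                w1 += error * x1
                w2 += error * x2
                bias += error
                mistakes += 1
        if mistakes == 0:
            return True
    return False


points = [(0, 0), (0, 1), (1, 0), (1, 1)]
and_labels = [0, 0, 0, 1]   # linearly separable
xor_labels = [0, 1, 1, 0]   # not linearly separable

print("AND separable by a line:", perceptron_separates(points, and_labels))
print("XOR separable by a line:", perceptron_separates(points, xor_labels))
```

The same four points become easy or impossible for a linear model depending only on how they are labelled, which is why the shape of your data matters when picking a model.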

### So about the models…

Now that we have context for some of the factors that come into play in model selection, let’s very, very briefly look at a non-math introduction to some of the most common models/techniques.

- **Linear Logistic Regression**: handles linear data and estimates the probability (using a logistic function) that a data point is associated with a class. The features for this model must be numeric, but it is common to convert categorical features into numeric representations.
- **Naïve Bayes**: also uses probabilities for classification and is based on Bayes' theorem with a strong (naïve) assumption of independence between features. It is frequently used for text classification (e.g. spam vs. not spam), performs well with large feature sets and is highly scalable.
- **Support Vector Machines (SVM)**: a non-probabilistic model that is capable of handling both linear and non-linear data. SVM works by trying to maximize the space (margin) between classes or categories.
- **Decision Trees**: most people are familiar with the graphical, tree-like representation of decision trees. A decision tree is typically built using an algorithm that seeks to maximize information gain and minimize what’s known as entropy. Entropy measures how mixed the classes are after a split. For example, a split that results in a pure class (e.g. all purchasers) would have an entropy of 0, while a split that results in a 50/50 mix of two classes (e.g. purchasers and non-purchasers) would have an entropy of 1. The information gain is simply the decrease in entropy after a split occurs. Decision trees are flexible: they can handle categorical and numerical data and work on both linear and non-linear data. They are also very transparent in how decisions are made.
- **Neural Networks**: modelled after the human brain, where multiple layers of neurons exchange messages as part of the process of approximating functions. Neural networks are largely black boxes and can solve extremely tough classification problems such as those found in the fields of computer vision and speech recognition.
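The entropy and information gain figures quoted for decision trees are easy to check with a short sketch. It assumes base-2 logarithms (which is what gives a 50/50 split an entropy of 1) and uses made-up purchase labels; the function names are mine.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    proportions = [count / total for count in Counter(labels).values()]
    return sum(-p * math.log2(p) for p in proportions)


def information_gain(parent, children):
    """Decrease in entropy after splitting `parent` into `children`,
    weighting each child's entropy by its share of the examples."""
    weighted = sum(len(child) / len(parent) * entropy(child) for child in children)
    return entropy(parent) - weighted


pure = ["buy", "buy", "buy", "buy"]
mixed = ["buy", "buy", "no", "no"]
parent = pure + ["no", "no", "no", "no"]

print(entropy(pure))    # a pure group
print(entropy(mixed))   # a 50/50 group
# A split that separates the classes perfectly vs. one that does nothing.
print(information_gain(parent, [pure, ["no"] * 4]))
print(information_gain(parent, [mixed, mixed]))
```

A tree-building algorithm would evaluate candidate splits like these and prefer the one with the larger information gain.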

### Wrap-Up

In this post, we discussed the basics of model selection as you prepare to get started on a classification exercise. Start by developing a deep understanding of the data you are working with. Remember this is just a starting point, and oftentimes you will want to compare a couple of different approaches to determine which will provide the most utility, as we will see in the next post.

One of my favorite sayings is “All models are wrong, but some are useful…” by George Edward Pelham Box. Keep this in mind as you begin your journey as a data science practitioner.

In our next post, we continue the focus on the most common classification models, using samples to illustrate and understand each of the various techniques.

Till next time!

Chris