Introduction
Classification is a large domain in the field of statistics and machine learning. Generally, classification can be broken down into two areas:
- Binary classification, where we wish to group an outcome into one of two groups.
- Multi-class classification, where we wish to group an outcome into one of multiple (more than two) groups.
In this post, the main focus will be on using a variety of classification algorithms across both of these domains; less emphasis will be placed on the theory behind them.
We can use libraries in Python such as Scikit-Learn for machine learning models, and Pandas to import data as data frames.
These can easily be installed and imported into Python with pip:
$ python3 -m pip install scikit-learn
$ python3 -m pip install pandas
import sklearn as sk
import pandas as pd
Binary Classification
For binary classification, we are interested in classifying data into one of two binary groups - these are usually represented as 0s and 1s in our data.
We will look at data regarding coronary heart disease (CHD) in South Africa. The goal is to predict whether a person has CHD using variables such as tobacco usage, family history, LDL cholesterol levels, alcohol usage, obesity and more.
A full description of this dataset is available in the "Data" section of the Elements of Statistical Learning website.
The code below reads the data into a Pandas data frame, and then separates the data frame into a y vector of the response and an X matrix of explanatory variables:
import pandas as pd
import os
os.chdir('/Users/stevenhurwitt/Documents/Blog/Classification')
heart = pd.read_csv('SAHeart.csv', sep=',', header=0)
heart.head()
y = heart.iloc[:,9]
X = heart.iloc[:,:9]
When running this code, just be sure to change the file system path in the os.chdir call to suit your setup.
|   | sbp | tobacco | ldl | adiposity | famhist | typea | obesity | alcohol | age | chd |
|---|-----|---------|-----|-----------|---------|-------|---------|---------|-----|-----|
| 0 | 160 | 12.00 | 5.73 | 23.11 | 1 | 49 | 25.30 | 97.20 | 52 | 1 |
| 1 | 144 | 0.01 | 4.41 | 28.61 | 0 | 55 | 28.87 | 2.06 | 63 | 1 |
| 2 | 118 | 0.08 | 3.48 | 32.28 | 1 | 52 | 29.14 | 3.81 | 46 | 0 |
| 3 | 170 | 7.50 | 6.41 | 38.03 | 1 | 51 | 31.99 | 24.26 | 58 | 1 |
| 4 | 134 | 13.60 | 3.50 | 27.78 | 1 | 60 | 25.99 | 57.34 | 49 | 1 |
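If you don't have the CSV locally, pandas can also read straight from a URL. A hedged sketch, assuming the raw dataset is still hosted at the Elements of Statistical Learning website (the raw file differs slightly from the preprocessed SAHeart.csv used above):

import pandas as pd

# Assumption: the raw SAheart.data file is still hosted at the ESL website.
# It names its first column row.names and stores famhist as 'Present'/'Absent',
# so it needs light preprocessing to match the CSV used in this post.
url = 'https://web.stanford.edu/~hastie/ElemStatLearn/datasets/SAheart.data'
heart_raw = pd.read_csv(url, sep=',', header=0, index_col=0)
heart_raw['famhist'] = (heart_raw['famhist'] == 'Present').astype(int)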
Logistic Regression
Logistic Regression is a type of Generalized Linear Model (GLM) that uses a logistic function to model a binary response variable based on a set of independent variables.
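Concretely, the logistic function maps any real-valued linear combination of the predictors to a probability between 0 and 1. A quick illustrative sketch with NumPy:

import numpy as np

# The logistic (sigmoid) function maps any real number into (0, 1),
# which is what lets the model output probabilities
def logistic(z):
    return 1 / (1 + np.exp(-z))

print(logistic(0.0))   # 0.5 - right on the decision boundary
print(logistic(3.0))   # ~0.95 - strong evidence for class 1
print(logistic(-3.0))  # ~0.05 - strong evidence for class 0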
To fit a binary logistic regression with sklearn, we use the LogisticRegression module with multi_class set to "ovr" and fit X and y.
We can then use the predict method to predict class labels for new data, as well as the score method to get the mean prediction accuracy:
import sklearn as sk
from sklearn.linear_model import LogisticRegression
import pandas as pd
import os
os.chdir('/Users/stevenhurwitt/Documents/Blog/Classification')
heart = pd.read_csv('SAHeart.csv', sep=',',header=0)
heart.head()
y = heart.iloc[:,9]
X = heart.iloc[:,:9]
LR = LogisticRegression(random_state=0, solver='lbfgs', multi_class='ovr').fit(X, y)
LR.predict(X.iloc[460:,:])
round(LR.score(X,y), 4)
array([1, 1])
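Note that predict returns class labels. If you want the probabilities themselves, LogisticRegression also provides a predict_proba method; a short sketch reusing the fitted LR and X from above:

# Class probability estimates for the same two observations:
# each row contains P(chd=0) and P(chd=1)
LR.predict_proba(X.iloc[460:,:])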
Support Vector Machines
Support Vector Machines (SVMs) are a more flexible type of classification algorithm - they can perform linear classification, but can also use non-linear kernel functions. The following example uses a linear classifier to fit a hyperplane that separates the data into two classes:
import sklearn as sk
from sklearn import svm
import pandas as pd
import os
os.chdir('/Users/stevenhurwitt/Documents/Blog/Classification')
heart = pd.read_csv('SAHeart.csv', sep=',',header=0)
y = heart.iloc[:,9]
X = heart.iloc[:,:9]
SVM = svm.LinearSVC()
SVM.fit(X, y)
SVM.predict(X.iloc[460:,:])
round(SVM.score(X,y), 4)
array([0, 1])
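To let the SVM fit a non-linear decision boundary instead, you could swap in svm.SVC with its default radial basis function (RBF) kernel. A sketch reusing X and y from above (this variant is not in the original workflow):

from sklearn import svm

# Non-linear SVM using the default radial basis function (RBF) kernel
SVM_rbf = svm.SVC(kernel='rbf', gamma='scale')
SVM_rbf.fit(X, y)
round(SVM_rbf.score(X, y), 4)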
Random Forests
Random Forests are an ensemble learning method that fits multiple Decision Trees on subsets of the data and averages the results. We can again fit them using sklearn, and use them to predict outcomes, as well as get the mean prediction accuracy:
import sklearn as sk
from sklearn.ensemble import RandomForestClassifier
RF = RandomForestClassifier(n_estimators=100, max_depth=2, random_state=0)
RF.fit(X, y)
RF.predict(X.iloc[460:,:])
round(RF.score(X,y), 4)
0.7338
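One side benefit of Random Forests is a built-in estimate of how much each variable contributes to the prediction. A short sketch, reusing the fitted RF and X from above:

import pandas as pd

# Impurity-based importances of each explanatory variable, largest first
importances = pd.Series(RF.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))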
Neural Networks
Neural Networks are a machine learning algorithm that involves fitting many hidden layers of interconnected neurons, where weighted connections and activation functions pass signals between them. These essentially use a very simplified model of the brain to model and predict data.
We use sklearn for consistency in this post; however, libraries such as TensorFlow and Keras are better suited to fitting and customizing neural networks, of which there are several varieties used for different purposes:
import sklearn as sk
from sklearn.neural_network import MLPClassifier
NN = MLPClassifier(solver='lbfgs', alpha=1e-5, hidden_layer_sizes=(5, 2), random_state=1)
NN.fit(X, y)
NN.predict(X.iloc[460:,:])
round(NN.score(X,y), 4)
0.6537
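MLPs are sensitive to the scale of their inputs, so standardizing the features often improves convergence and accuracy. A hedged sketch using StandardScaler (this step is not part of the original workflow):

from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

# Standardize each feature to zero mean and unit variance before fitting
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

NN_scaled = MLPClassifier(solver='lbfgs', alpha=1e-5,
                          hidden_layer_sizes=(5, 2), random_state=1)
NN_scaled.fit(X_scaled, y)
round(NN_scaled.score(X_scaled, y), 4)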
Multi-Class Classification
While binary classification alone is incredibly useful, there are times when we would like to model and predict data that has more than two classes. Many of the same algorithms can be used with slight modifications.
Additionally, it is common to split data into training and test sets. This means we use a certain portion of the data to fit the model (the training set) and save the remaining portion of it to evaluate the predictive accuracy of the fitted model (the test set).
There's no official rule to follow when deciding on a split proportion, though in most cases you'd want about 70% of the data dedicated to the training set and around 30% to the test set.
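When a dataset doesn't come with a predefined split (the vowel data below does), scikit-learn's train_test_split can create one. A sketch applying a 70/30 split to the heart data from earlier:

from sklearn.model_selection import train_test_split

# Randomly hold out 30% of the heart data for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)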
To explore both multi-class classification and training/test data, we will look at another dataset from the Elements of Statistical Learning website. This is data used to determine which of eleven vowel sounds was spoken:
import pandas as pd
vowel_train = pd.read_csv('vowel.train.csv', sep=',', header=0)
vowel_test = pd.read_csv('vowel.test.csv', sep=',', header=0)
vowel_train.head()
y_tr = vowel_train.iloc[:,0]
X_tr = vowel_train.iloc[:,1:]
y_test = vowel_test.iloc[:,0]
X_test = vowel_test.iloc[:,1:]
|   | y | x.1 | x.2 | x.3 | x.4 | x.5 | x.6 | x.7 | x.8 | x.9 | x.10 |
|---|---|-----|-----|-----|-----|-----|-----|-----|-----|-----|------|
| 0 | 1 | -3.639 | 0.418 | -0.670 | 1.779 | -0.168 | 1.627 | -0.388 | 0.529 | -0.874 | -0.814 |
| 1 | 2 | -3.327 | 0.496 | -0.694 | 1.365 | -0.265 | 1.933 | -0.363 | 0.510 | -0.621 | -0.488 |
| 2 | 3 | -2.120 | 0.894 | -1.576 | 0.147 | -0.707 | 1.559 | -0.579 | 0.676 | -0.809 | -0.049 |
| 3 | 4 | -2.287 | 1.809 | -1.498 | 1.012 | -1.053 | 1.060 | -0.567 | 0.235 | -0.091 | -0.795 |
| 4 | 5 | -2.598 | 1.938 | -0.846 | 1.062 | -1.633 | 0.764 | 0.394 | -0.150 | 0.277 | -0.396 |
We will now fit models and test them as is normally done in statistics/machine learning: by training them on the training set and evaluating them on the test set.
Additionally, since this is multi-class classification, some arguments will have to be changed within each algorithm:
import pandas as pd
import sklearn as sk
from sklearn.linear_model import LogisticRegression
from sklearn import svm
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
vowel_train = pd.read_csv('vowel.train.csv', sep=',',header=0)
vowel_test = pd.read_csv('vowel.test.csv', sep=',',header=0)
y_tr = vowel_train.iloc[:,0]
X_tr = vowel_train.iloc[:,1:]
y_test = vowel_test.iloc[:,0]
X_test = vowel_test.iloc[:,1:]
LR = LogisticRegression(random_state=0, solver='lbfgs', multi_class='multinomial').fit(X_tr, y_tr)
LR.predict(X_test)
round(LR.score(X_test,y_test), 4)
SVM = svm.SVC(decision_function_shape="ovo").fit(X_tr, y_tr)
SVM.predict(X_test)
round(SVM.score(X_test, y_test), 4)
RF = RandomForestClassifier(n_estimators=1000, max_depth=10, random_state=0).fit(X_tr, y_tr)
RF.predict(X_test)
round(RF.score(X_test, y_test), 4)
NN = MLPClassifier(solver='lbfgs', alpha=1e-5, hidden_layer_sizes=(150, 10), random_state=1).fit(X_tr, y_tr)
NN.predict(X_test)
round(NN.score(X_test, y_test), 4)
0.5455
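Beyond a single accuracy number, a confusion matrix shows which vowels are mistaken for which. A sketch using sklearn.metrics, reusing the fitted SVM and the test set from above:

from sklearn.metrics import confusion_matrix

# Rows are true vowel classes, columns are predicted classes;
# off-diagonal entries show which vowels get confused for which
print(confusion_matrix(y_test, SVM.predict(X_test)))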
Although the implementations of these models were rather naive (in practice there are a variety of hyperparameters that can and should be tuned for each model), we can still compare the predictive accuracy across the models. This will tell us which one is the most accurate for this specific training and test dataset:
| Model | Predictive Accuracy |
|-------|---------------------|
| Logistic Regression | 46.1% |
| Support Vector Machine | 64.07% |
| Random Forest | 57.58% |
| Neural Network | 54.55% |
This shows us that for the vowel data, an SVM using the default radial basis function was the most accurate.
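For reference, the accuracies in the table above can be gathered programmatically, reusing the four fitted models:

# Collect test-set accuracy for each fitted model in one place
models = {'Logistic Regression': LR, 'Support Vector Machine': SVM,
          'Random Forest': RF, 'Neural Network': NN}
for name, model in models.items():
    print(name, round(model.score(X_test, y_test), 4))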
Conclusion
To summarize this post, we began by exploring the simplest form of classification: binary. This helped us to model data where our response could take one of two states.
We then moved further into multi-class classification, where the response variable can take any number of states.
We also saw how to fit and evaluate models with training and test sets. From here, we could explore additional ways to refine model fitting, such as tuning each algorithm's hyperparameters or validating models with cross-validation, as sketched below.
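As a quick illustration of that last point, scikit-learn's cross_val_score averages accuracy over several train/test splits instead of relying on a single one. A minimal sketch on the heart data, assuming X and y from the binary classification section:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# 5-fold cross-validated accuracy for a Random Forest: each fold trains
# on 80% of the data and evaluates on the held-out 20%
RF = RandomForestClassifier(n_estimators=100, random_state=0)
scores = cross_val_score(RF, X, y, cv=5)
print(round(scores.mean(), 4))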