Classification in Python with Scikit-Learn and Pandas

Introduction

Classification is a large domain in the field of statistics and machine learning. Generally, classification can be broken down into two areas:

  1. Binary classification, where we wish to group an outcome into one of two groups.

  2. Multi-class classification, where we wish to group an outcome into one of multiple (more than two) groups.
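The distinction between the two cases comes down to how many distinct values the response takes. As a small illustration, scikit-learn's type_of_target utility reports which case a label vector falls into; the two label lists below are made up for this sketch and are not taken from any dataset in this post:

```python
from sklearn.utils.multiclass import type_of_target

# Hypothetical label vectors, invented for illustration only.
binary_labels = [0, 1, 1, 0]       # exactly two distinct groups
multi_labels = [1, 2, 3, 11, 5]    # more than two distinct groups

print(type_of_target(binary_labels))
print(type_of_target(multi_labels))
```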

In this post, the main focus will be on applying a variety of classification algorithms to both kinds of problem; less emphasis will be placed on the theory behind them.

We can use libraries in Python such as scikit-learn for machine learning models, and Pandas to import data as data frames.

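These can be installed with pip (note that the package name on PyPI is scikit-learn, even though it is imported as sklearn):

$ python3 -m pip install scikit-learn
$ python3 -m pip install pandas

They can then be imported into Python:

import sklearn as sk
import pandas as pd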

Binary Classification

For binary classification, we are interested in classifying data into one of two groups, usually coded as 0 and 1 in our data.

We will look at data regarding coronary heart disease (CHD) in South Africa. The goal is to use variables such as tobacco usage, family history, LDL cholesterol levels, alcohol usage, obesity and more to predict whether a person has CHD.

A full description of this dataset is available in the "Data" section of the Elements of Statistical Learning website.

The code below reads the data into a Pandas data frame, and then separates the data frame into a y vector of the response and an X matrix of explanatory variables:

import pandas as pd
import os

os.chdir('/Users/stevenhurwitt/Documents/Blog/Classification')
heart = pd.read_csv('SAHeart.csv', sep=',', header=0)
heart.head()

y = heart.iloc[:,9]
X = heart.iloc[:,:9]

When running this code, be sure to change the file system path on line 4 to suit your setup. The call to heart.head() displays the first five rows:

   sbp  tobacco   ldl  adiposity  famhist  typea  obesity  alcohol  age  chd
0  160    12.00  5.73      23.11        1     49    25.30    97.20   52    1
1  144     0.01  4.41      28.61        0     55    28.87     2.06   63    1
2  118     0.08  3.48      32.28        1     52    29.14     3.81   46    0
3  170     7.50  6.41      38.03        1     51    31.99    24.26   58    1
4  134    13.60  3.50      27.78        1     60    25.99    57.34   49    1
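Positional indexing with iloc works, but it quietly selects the wrong columns if the column order ever changes. Below is a minimal sketch of the same split done by column name. It uses a tiny hand-made data frame (the hypothetical name toy, holding the first two rows and a subset of the columns shown above) as a stand-in for SAHeart.csv, so that it runs without the file:

```python
import pandas as pd

# Stand-in for the heart data: two rows and a few of the columns from the output above.
toy = pd.DataFrame({
    "sbp": [160, 144],
    "tobacco": [12.00, 0.01],
    "age": [52, 63],
    "chd": [1, 1],
})

# Response vector: the column we want to predict.
y_toy = toy["chd"]
# Explanatory variables: every column except the response.
X_toy = toy.drop(columns="chd")

print(X_toy.shape, y_toy.shape)
```

With the real file, the same two lines would apply to heart in place of toy.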

Logistic Regression

Logistic Regression is a type of Generalized Linear Model (GLM) that uses a logistic function to model the probability of a binary outcome as a function of one or more independent variables.

To fit a binary logistic regression with sklearn, we use the LogisticRegression class with multi_class set to "ovr" and fit X and y. (In recent scikit-learn releases the multi_class argument is deprecated; for a binary response it can simply be omitted.)

We can then use the predict method to predict class labels (here, for the last two rows of the data), as well as the score method to get the mean prediction accuracy:

import sklearn as sk
from sklearn.linear_model import LogisticRegression
import pandas as pd
import os

os.chdir('/Users/stevenhurwitt/Documents/Blog/Classification')
heart = pd.read_csv('SAHeart.csv', sep=',',header=0)
heart.head()

y = heart.iloc[:,9]
X = heart.iloc[:,:9]

LR = LogisticRegression(random_state=0, solver='lbfgs', multi_class='ovr').fit(X, y)
LR.predict(X.iloc[460:,:])
round(LR.score(X,y), 4)
array([1, 1])
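The predict method returns hard class labels; the estimated probabilities behind them come from predict_proba. The sketch below shows the difference on synthetic data generated with make_classification, which is an assumption standing in for the heart data. The names X_demo, y_demo and clf are hypothetical, and max_iter=1000 is simply a generous iteration limit chosen for this example:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary data (an assumption; it stands in for the heart data).
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

labels = clf.predict(X_demo[:3])        # hard class labels, 0 or 1
probs = clf.predict_proba(X_demo[:3])   # one column per class; each row sums to 1

print(labels)
print(probs.round(3))
```

Each predicted label corresponds to the class with the larger probability in that row.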

Support Vector Machines

Support Vector Machines (SVMs) are a more flexible type of classification algorithm: they can perform linear classification, but they can also use kernels to fit non-linear decision boundaries. The following example uses a linear classifier to fit a hyperplane that separates the data into two classes:

import sklearn as sk
from sklearn import svm
import pandas as pd
import os

os.chdir('/Users/stevenhurwitt/Documents/Blog/Classification')
heart = pd.read_csv('SAHeart.csv', sep=',',header=0)

y = heart.iloc[:,9]
X = heart.iloc[:,:9]

SVM = svm.LinearSVC()
SVM.fit(X, y)
SVM.predict(X.iloc[460:,:])
round(SVM.score(X,y), 4)
array([0, 1])
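SVMs are sensitive to the scale of the features, and a linear boundary is not always enough. As a rough sketch of both points, the code below standardizes the inputs in a pipeline and compares a linear kernel with the RBF kernel. It runs on the synthetic two-moons data from make_moons, an assumption chosen because its classes are not linearly separable; it is not the heart data, and the variable names are hypothetical:

```python
from sklearn.datasets import make_moons
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic, non-linearly separable data (an assumption for illustration).
X_moons, y_moons = make_moons(n_samples=300, noise=0.2, random_state=0)

# Scale the features first, then fit an SVM with each kernel.
linear_svm = make_pipeline(StandardScaler(), SVC(kernel="linear")).fit(X_moons, y_moons)
rbf_svm = make_pipeline(StandardScaler(), SVC(kernel="rbf")).fit(X_moons, y_moons)

print("linear:", round(linear_svm.score(X_moons, y_moons), 4))
print("rbf:   ", round(rbf_svm.score(X_moons, y_moons), 4))
```

On data shaped like this, the RBF kernel typically scores higher because it can bend the boundary around the two moons.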

Random Forests

Random Forests are an ensemble learning method that fits many Decision Trees, each on a bootstrap sample of the data, and combines their predictions by averaging. We can again fit them using sklearn, and use them to predict outcomes, as well as get mean prediction accuracy:

import sklearn as sk
from sklearn.ensemble import RandomForestClassifier

RF = RandomForestClassifier(n_estimators=100, max_depth=2, random_state=0)
RF.fit(X, y)
RF.predict(X.iloc[460:,:])
round(RF.score(X,y), 4)
0.7338
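To make the averaging concrete, the sketch below fits a small forest and checks that its predicted class probabilities equal the mean of the probabilities predicted by its individual trees. The data is synthetic (make_classification), an assumption standing in for the heart data, and the names forest, tree_probs and avg_probs are hypothetical:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic binary data (an assumption; it stands in for the heart data).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=3, n_redundant=0, random_state=0
)

forest = RandomForestClassifier(n_estimators=100, max_depth=2, random_state=0)
forest.fit(X_demo, y_demo)

# Collect each tree's class probabilities for the first five rows...
tree_probs = np.stack([tree.predict_proba(X_demo[:5]) for tree in forest.estimators_])
# ...and average them across the 100 trees.
avg_probs = tree_probs.mean(axis=0)

# Compare the manual average with the forest's own output.
print(np.allclose(avg_probs, forest.predict_proba(X_demo[:5])))
```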

Neural Networks

Neural Networks are a machine learning algorithm built from layers of simple units ("neurons"), including one or more hidden layers, that are connected by weights and passed through activation functions. They are loosely inspired by a very simplified model of the brain.

We use sklearn for consistency in this post; however, libraries such as TensorFlow and Keras are better suited to fitting and customizing neural networks, of which there are a few varieties used for different purposes:

import sklearn as sk
from sklearn.neural_network import MLPClassifier

NN = MLPClassifier(solver='lbfgs', alpha=1e-5, hidden_layer_sizes=(5, 2), random_state=1)
NN.fit(X, y)
NN.predict(X.iloc[460:,:])
round(NN.score(X,y), 4)
0.6537
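The hidden_layer_sizes argument determines the shape of the network. As a small sketch, the code below fits the same (5, 2) architecture on synthetic data with nine features (an assumption that mirrors the nine explanatory variables of the heart data) and prints the shape of each weight matrix. The name net and the max_iter=500 setting are choices made for this example:

```python
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

# Synthetic binary data with nine features (an assumption for illustration).
X_demo, y_demo = make_classification(n_samples=200, n_features=9, random_state=0)

net = MLPClassifier(solver="lbfgs", alpha=1e-5, hidden_layer_sizes=(5, 2),
                    random_state=1, max_iter=500)
net.fit(X_demo, y_demo)

# One weight matrix per connection between consecutive layers:
# inputs -> first hidden layer -> second hidden layer -> output.
for i, weights in enumerate(net.coefs_):
    print(f"layer {i}: weights {weights.shape}")
```

For a binary response the final layer has a single output unit, so the last weight matrix has one column.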

Multi-Class Classification

While binary classification alone is incredibly useful, there are times when we would like to model and predict data that has more than two classes. Many of the same algorithms can be used with slight modifications.

Additionally, it is common to split data into training and test sets. This means we use a certain portion of the data to fit the model (the training set) and save the remaining portion of it to evaluate the predictive accuracy of the fitted model (the test set).

There's no official rule to follow when deciding on a split proportion, though a split of roughly 70% for the training set and 30% for the test set is a common choice.

To explore both multi-class classification and training/test data, we will look at another dataset from the Elements of Statistical Learning website. This is data used to determine which one of eleven vowel sounds was spoken:
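When a dataset does not come pre-split, scikit-learn's train_test_split can create the two sets. Below is a minimal sketch of a 70/30 split on synthetic data; the data and the names X_train, X_hold, y_train and y_hold are assumptions for illustration. Passing stratify keeps the class proportions similar in both parts, and random_state makes the split reproducible:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic data with 100 rows (an assumption for illustration).
X_demo, y_demo = make_classification(n_samples=100, n_features=4, random_state=0)

# Hold out 30% of the rows for testing; the remaining 70% are used for training.
X_train, X_hold, y_train, y_hold = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0, stratify=y_demo
)

print(X_train.shape, X_hold.shape)
```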

import pandas as pd

vowel_train = pd.read_csv('vowel.train.csv', sep=',', header=0)
vowel_test = pd.read_csv('vowel.test.csv', sep=',', header=0)

vowel_train.head()

y_tr = vowel_train.iloc[:,0]
X_tr = vowel_train.iloc[:,1:]

y_test = vowel_test.iloc[:,0]
X_test = vowel_test.iloc[:,1:]

   y     x.1     x.2     x.3     x.4     x.5     x.6     x.7     x.8     x.9    x.10
0  1  -3.639   0.418  -0.670   1.779  -0.168   1.627  -0.388   0.529  -0.874  -0.814
1  2  -3.327   0.496  -0.694   1.365  -0.265   1.933  -0.363   0.510  -0.621  -0.488
2  3  -2.120   0.894  -1.576   0.147  -0.707   1.559  -0.579   0.676  -0.809  -0.049
3  4  -2.287   1.809  -1.498   1.012  -1.053   1.060  -0.567   0.235  -0.091  -0.795
4  5  -2.598   1.938  -0.846   1.062  -1.633   0.764   0.394  -0.150   0.277  -0.396

We will now fit models and test them as is normally done in statistics/machine learning: by training them on the training set and evaluating them on the test set.

Additionally, since this is multi-class classification, some arguments will have to be changed within each algorithm:

import pandas as pd
import sklearn as sk
from sklearn.linear_model import LogisticRegression
from sklearn import svm
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier

vowel_train = pd.read_csv('vowel.train.csv', sep=',',header=0)
vowel_test = pd.read_csv('vowel.test.csv', sep=',',header=0)

y_tr = vowel_train.iloc[:,0]
X_tr = vowel_train.iloc[:,1:]

y_test = vowel_test.iloc[:,0]
X_test = vowel_test.iloc[:,1:]

LR = LogisticRegression(random_state=0, solver='lbfgs', multi_class='multinomial').fit(X_tr, y_tr)
LR.predict(X_test)
round(LR.score(X_test,y_test), 4)

SVM = svm.SVC(decision_function_shape="ovo").fit(X_tr, y_tr)
SVM.predict(X_test)
round(SVM.score(X_test, y_test), 4)

RF = RandomForestClassifier(n_estimators=1000, max_depth=10, random_state=0).fit(X_tr, y_tr)
RF.predict(X_test)
round(RF.score(X_test, y_test), 4)

NN = MLPClassifier(solver='lbfgs', alpha=1e-5, hidden_layer_sizes=(150, 10), random_state=1).fit(X_tr, y_tr)
NN.predict(X_test)
round(NN.score(X_test, y_test), 4)
0.5455
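Because every model follows the same fit-then-score pattern, the comparison can also be written as a loop over a dictionary of estimators. The sketch below does this on the iris data bundled with scikit-learn, an assumption used so that the example runs without the vowel files. The settings for each model are illustrative and are not the ones used above:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

# Three-class iris data as a stand-in for the vowel data (an assumption).
X_iris, y_iris = load_iris(return_X_y=True)
X_train, X_hold, y_train, y_hold = train_test_split(
    X_iris, y_iris, test_size=0.3, random_state=0, stratify=y_iris
)

# Illustrative settings only; none of these are tuned.
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Support Vector Machine": SVC(),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "Neural Network": MLPClassifier(solver="lbfgs", hidden_layer_sizes=(20,),
                                    random_state=1, max_iter=1000),
}

# Fit each model on the training set and score it on the held-out set.
accuracy = {
    name: model.fit(X_train, y_train).score(X_hold, y_hold)
    for name, model in models.items()
}

for name, acc in accuracy.items():
    print(f"{name}: {acc:.2%}")
```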

Although the implementations of these models were rather naive (in practice there are a variety of parameters that can and should be varied for each model), we can still compare the predictive accuracy across the models. This will tell us which one is the most accurate for this specific training and test dataset:

Model                     Predictive Accuracy
Logistic Regression       46.1%
Support Vector Machine    64.07%
Random Forest             57.58%
Neural Network            54.55%

This shows us that for the vowel data, an SVM using the default radial basis function (RBF) kernel was the most accurate.
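One common way to vary model parameters systematically is a cross-validated grid search. The sketch below tunes C and gamma for an RBF-kernel SVM with GridSearchCV. It uses the iris data bundled with scikit-learn as a stand-in for the vowel data, and the grid values are arbitrary choices made for this example, not recommendations:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Three-class iris data as a stand-in for the vowel data (an assumption).
X_iris, y_iris = load_iris(return_X_y=True)
X_train, X_hold, y_train, y_hold = train_test_split(
    X_iris, y_iris, test_size=0.3, random_state=0, stratify=y_iris
)

# Candidate values to try (arbitrary, for illustration).
param_grid = {"C": [0.1, 1, 10], "gamma": ["scale", 0.01, 0.1]}

# 5-fold cross-validation on the training set only; the test set stays untouched.
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5).fit(X_train, y_train)

print(search.best_params_)
print(round(search.score(X_hold, y_hold), 4))
```

Tuning on cross-validation folds of the training set, rather than on the test set, keeps the final test accuracy an honest estimate.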

Conclusion

To summarize this post, we began by exploring the simplest form of classification: binary. This helped us to model data where our response could take one of two states.

We then moved on to multi-class classification, where the response variable can take more than two states.

We also saw how to fit and evaluate models with training and test sets. From here, we could explore additional ways to refine model fitting, such as tuning the parameters of each algorithm.
