Implementing SVM and Kernel SVM with Python's Scikit-Learn

Introduction

This guide is the first part of three guides about Support Vector Machines (SVMs). In this series, we will work on a forged bank notes use case, learn about the simple SVM, then about SVM hyperparameters and, finally, learn a concept called the kernel trick and explore other types of SVMs.

If you wish to read all the guides or see which ones interest you the most, below is the table of topics covered in each guide:

1. Implementing SVM and Kernel SVM with Python's Scikit-Learn

  • Use case: forged bank notes
  • Background of SVMs
  • Simple (Linear) SVM Model
    • About the Dataset
    • Importing the Dataset
    • Exploring the Dataset
  • Implementing SVM with Scikit-Learn
    • Dividing Data into Train/Test Sets
    • Training the Model
    • Making Predictions
    • Evaluating the Model
    • Interpreting Results
2. Understanding SVM Hyperparameters
  • The C Hyperparameter
  • The Gamma Hyperparameter
3. Implementing Other SVM Flavors with Python's Scikit-Learn
  • The General Idea of SVMs (a recap)
  • Kernel (trick) SVM
  • Implementing non-linear kernel SVM with Scikit-Learn
  • Importing libraries
    • Importing the dataset
    • Dividing data into features (X) and target (y)
    • Dividing Data into Train/Test Sets
    • Training the Algorithm
  • Polynomial kernel
    • Making Predictions
    • Evaluating the Algorithm
  • Gaussian kernel
    • Prediction and Evaluation
  • Sigmoid Kernel
    • Prediction and Evaluation
  • Comparison of Non-Linear Kernel Performances

Use Case: Forged Bank Notes

Sometimes people find a way to forge bank notes. If there is a person looking at those notes and verifying their validity, it might be hard to be deceived by them.

But what happens when there isn't a person to look at each note? Is there a way to automatically know if bank notes are forged or real?

There are many ways to answer those questions. One answer is to photograph each received note, compare its image with a forged note's image, and then classify it as real or forged. Since it might be tedious or critical to wait for the note's validation, it would also be interesting to do that comparison quickly.

Since images are being used, they can be compacted, reduced to grayscale, and have their measurements extracted or quantized. In this way, the comparison would be between image measurements, instead of each image's pixels.

So far, we've found a way to process and compare bank notes, but how will they be classified into real or forged? We can use machine learning to do that classification. There is a classification algorithm called Support Vector Machine, mainly known by its abbreviated form: SVM.

Background of SVMs

SVMs were initially introduced in the 1960s by Vladimir Vapnik and Alexey Chervonenkis. At that time, their algorithm was limited to the classification of data that could be separated using just one straight line, that is, data that was linearly separable. We can see what that separation would look like:

In the above image, we have a line in the middle, with some points to its left and others to its right. Notice that both groups of points are perfectly separated: there are no points in between or even close to the line. There seems to be a margin between similar points and the line that divides them, and that margin is called the separation margin. The wider the separation margin, the bigger the space between the similar points and the line that divides them. SVM achieves that by using some of the points, the ones closest to the line, and calculating their perpendicular vectors to support the decision for the line's margin. Those are the support vectors that are part of the name of the algorithm. We will understand more about them later. The straight line that we see in the middle is found by methods that maximize the space between the line and the points, that is, methods that maximize the separation margin. Those methods originate from the field of Optimization Theory.

In the example we've just seen, both groups of points can be easily separated, since each individual point is close to the points similar to it, and the two groups are far from each other.
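The idea of support vectors and a separation margin can be sketched with a few lines of Scikit-Learn. This is only an illustration under assumptions: the six points in `X_toy` and their labels are hypothetical, and the very large `C` value is used to approximate a margin with no points inside it. For a linear SVM, the width of the margin can be computed as 2 divided by the length of the vector `w` that is perpendicular to the separating line.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical toy data: two well-separated groups of 2-D points
X_toy = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.0],
                  [6.0, 5.0], [6.5, 6.0], [7.0, 5.5]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

# A very large C approximates a hard margin (no points allowed inside it)
toy_svc = SVC(kernel='linear', C=1e6)
toy_svc.fit(X_toy, y_toy)

w = toy_svc.coef_[0]                   # vector perpendicular to the separating line
margin_width = 2 / np.linalg.norm(w)   # distance between the two margin borders

print('Support vectors:\n', toy_svc.support_vectors_)
print('Margin width:', margin_width)
```

Only the points listed in `support_vectors_` determine where the line is placed. Moving any of the other points, as long as they stay outside the margin, would not change the result.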

But what happens if there is not a way to separate the data using one straight line? If there are messy out of place points, or if a curve is needed?

To solve that problem, SVM was later refined in the 1990s to be able to also classify data that had points that were far from its central tendency, such as outliers, or more complex problems that had more than two dimensions and weren't linearly separable.

What is curious is that only in recent years have SVMs become widely adopted, mainly due to their ability to sometimes achieve more than 90% of correct answers, or accuracy, for difficult problems.

SVMs are implemented in a unique way when compared to other machine learning algorithms, since they are based on statistical explanations of what learning is, or on Statistical Learning Theory.

In this article, we'll see what Support Vector Machines algorithms are, the brief theory behind a support vector machine, and their implementation in Python's Scikit-Learn library. We will then move towards another SVM concept, known as Kernel SVM, or Kernel trick, and will also implement it with the help of Scikit-Learn.

Simple (Linear) SVM Model

About the Dataset

Following the example given in the introduction, we will use a dataset that has measurements of images of real and forged bank notes.

When looking at two notes, our eyes usually scan them from left to right and check where there might be similarities or dissimilarities. We look for a black dot coming before a green dot, or a shiny mark that is above an illustration. This means that there is an order in which we look at the notes. If we knew there were green and black dots, but not whether the green dot came before the black one or the black came before the green, it would be harder to discriminate between notes.

There is a method similar to what we have just described that can be applied to the bank note images. In general terms, this method consists of translating the image's pixels into a signal, and then taking into consideration the order in which each different signal happens in the image by transforming it into little waves, or wavelets. After obtaining the wavelets, there is a way to know the order in which one signal happens before another, or the time, but not exactly which signal it is. To know that, the image's frequencies need to be obtained. They are obtained by a method that decomposes each signal, called the Fourier method.

Once the time dimension is obtained through the wavelets, and the frequency dimension through the Fourier method, time and frequency are superimposed to see where both of them match. This is the convolution analysis. The convolution obtains a fit that matches the wavelets with the image's frequencies and finds out which frequencies are more prominent.

This method that involves finding the wavelets, their frequencies, and then fitting both of them, is called Wavelet transform. The wavelet transform has coefficients, and those coefficients were used to obtain the measurements we have in the dataset.
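To make the idea of wavelet coefficients more concrete, here is a minimal sketch of one level of the Haar transform, one of the simplest wavelets, applied to a single row of pixels. It is an assumption made for illustration only: the `haar_step()` helper and the `pixel_row` values are hypothetical, and this is not necessarily the transform the dataset authors used. It only shows how pixel values can be turned into coefficients from which statistics, such as the variance, are later taken.

```python
import numpy as np

def haar_step(signal):
    """One level of the Haar wavelet transform of an even-length 1-D signal."""
    signal = np.asarray(signal, dtype=float)
    pairs = signal.reshape(-1, 2)                       # group neighboring values
    approx = (pairs[:, 0] + pairs[:, 1]) / np.sqrt(2)   # scaled local sums (coarse shape)
    detail = (pairs[:, 0] - pairs[:, 1]) / np.sqrt(2)   # scaled local differences (fine changes)
    return approx, detail

# Hypothetical row of 8 grayscale pixel intensities
pixel_row = [10, 12, 11, 15, 50, 52, 49, 40]
approx, detail = haar_step(pixel_row)

print('Approximation coefficients:', approx)
print('Detail coefficients:', detail)
print('Variance of the detail coefficients:', detail.var())
```

The approximation coefficients keep the coarse shape of the row, while the detail coefficients record where neighboring pixels differ, so the position of a change is preserved.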

Importing the Dataset

The bank notes dataset that we are going to use in this section is the same that was used in the classification section of the decision tree tutorial.

Note: You can download the dataset here.

Let's import the data into a pandas dataframe structure, and take a look at its first five rows with the head() method.

Notice that the data is saved in a txt (text) file format, separated by commas, and it is without a header. We can reconstruct it as a table by reading it as a csv, specifying the separator as a comma, and adding the column names with the names argument.

Let's follow those three steps at once, and then look at the first five rows of the data:

import pandas as pd

data_link = "https://archive.ics.uci.edu/ml/machine-learning-databases/00267/data_banknote_authentication.txt"
col_names = ["variance", "skewness", "curtosis", "entropy", "class"]

bankdata = pd.read_csv(data_link, names=col_names, sep=",", header=None)
bankdata.head()

This results in:

    variance    skewness    curtosis    entropy     class
0   3.62160     8.6661      -2.8073     -0.44699    0
1   4.54590     8.1674      -2.4586     -1.46210    0
2   3.86600     -2.6383     1.9242      0.10645     0
3   3.45660     9.5228      -4.0112     -3.59440    0
4   0.32924     -4.4552     4.5718      -0.98880    0

Note: You can also save the data locally and substitute data_link for data_path, and pass in the path to your local file.
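If you want to try the same reading steps without downloading anything, the sketch below embeds the five rows shown above as text and reads them with the same `read_csv()` arguments. The `sample_text` and `sample` names are hypothetical, and `io.StringIO` only stands in for a file here. It also shows two quick checks that are often useful right after loading: the inferred column types and the number of missing values.

```python
import io
import pandas as pd

# The first five rows shown above, embedded as text (no header row, comma-separated)
sample_text = """3.62160,8.6661,-2.8073,-0.44699,0
4.54590,8.1674,-2.4586,-1.46210,0
3.86600,-2.6383,1.9242,0.10645,0
3.45660,9.5228,-4.0112,-3.59440,0
0.32924,-4.4552,4.5718,-0.98880,0
"""

col_names = ["variance", "skewness", "curtosis", "entropy", "class"]
sample = pd.read_csv(io.StringIO(sample_text), names=col_names, sep=",", header=None)

print(sample.dtypes)              # four float columns and one integer column
print(sample.isna().sum().sum())  # total number of missing values
```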

We can see that there are five columns in our dataset, namely, variance, skewness, curtosis, entropy, and class. In the five rows, the first four columns are filled with numbers such as 3.62160, 8.6661, -2.8073 or continuous values, and the last class column has its first five rows filled with 0s, or a discrete value.

Since our objective is to predict whether a bank currency note is authentic or not, we can do that based upon the four attributes of the note:

  • variance of Wavelet Transformed image. Generally, the variance is a continuous value that measures how close to or far from the data's average value the data points are. If the points are closer to the data's average value, the variance is lower and the values are more tightly clustered, which usually makes them somewhat easier to predict. In the current image context, this is the variance of the coefficients that result from the wavelet transform. The less variance, the closer the coefficients were to translating the actual image.

  • skewness of Wavelet Transformed image. The skewness is a continuous value that indicates the asymmetry of a distribution. If the distribution has a longer tail to the left of the mean, it is negatively skewed; if it has a longer tail to the right of the mean, it is positively skewed; and if the mean, mode and median are the same, the distribution is symmetrical. The more symmetrical a distribution is, the closer it is to a normal distribution, also having its values more well distributed. In the present context, this is the skewness of the coefficients that result from the wavelet transform. The more symmetrical, the closer the coefficients were to translating the actual image.

  • curtosis (or kurtosis) of Wavelet Transformed image. The kurtosis is a continuous value that, like skewness, also describes the shape of a distribution. Depending on the kurtosis coefficient (k), a distribution, when compared to the normal distribution, can be more or less flat, or have more or less data in its extremities, or tails. When the distribution is flatter, with lighter tails, it is called platykurtic; when its kurtosis is similar to that of the normal distribution, mesokurtic; and when it is more concentrated in the middle, with heavier tails, it is called leptokurtic. This is the same as in the variance and skewness cases: the more mesokurtic the distribution is, the closer the coefficients were to translating the actual image.

  • entropy of image. The entropy is also a continuous value, and it usually measures the randomness or disorder in a system. In the context of an image, entropy measures the difference between a pixel and its neighboring pixels. For our context, the more entropy the coefficients have, the more loss there was when transforming the image, and the smaller the entropy, the smaller the information loss.
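The signs of skewness and kurtosis described above can be checked on simulated data. This sketch is not part of the bank notes analysis: the four samples are hypothetical and drawn from well-known distributions, and it assumes SciPy is installed (it usually is alongside Scikit-Learn). Note that `scipy.stats.kurtosis()` returns the excess kurtosis by default, which is approximately 0 for a normal distribution.

```python
import numpy as np
from scipy.stats import skew, kurtosis

rng = np.random.default_rng(42)  # fixed seed so the sketch is reproducible

right_tailed = rng.exponential(size=10_000)  # long tail to the right
left_tailed = -right_tailed                  # mirrored: long tail to the left
flat = rng.uniform(size=10_000)              # flatter than a normal distribution
peaked = rng.laplace(size=10_000)            # sharper peak and heavier tails

print('skew, right-tailed:', skew(right_tailed))     # positive
print('skew, left-tailed:', skew(left_tailed))       # negative
print('excess kurtosis, flat:', kurtosis(flat))      # negative (platykurtic)
print('excess kurtosis, peaked:', kurtosis(peaked))  # positive (leptokurtic)
```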

The fifth variable is the class variable, which probably has 0 and 1 values that say if the note is real or forged.

We can check if the fifth column contains only zeros and ones with Pandas' unique() method:

bankdata['class'].unique()

The above method returns:

array([0, 1]) 

This means that the only values contained in our class rows are zeros and ones, so the column is ready to be used as the target in our supervised learning.

  • class of image. This is an integer value: it is 0 when the image is forged, and 1 when the image is real.

Since we have a column with the annotations of real and forged images, this means that our type of learning is supervised.

Advice: to know more about the reasoning behind the Wavelet Transform on the bank notes images and the use of SVM, read the published paper of the authors.

We can also see how many records, or images we have, by looking at the number of rows in the data via the shape property:

bankdata.shape

This outputs:

(1372, 5)

The above line means that there are 1,372 rows of transformed bank notes images, and 5 columns. This is the data we will be analyzing.

We have imported our dataset and made a few checks. Now we can explore our data to understand it better.

Exploring the Dataset

We've just seen that there are only zeros and ones in the class column, but we can also know in what proportion they are, in other words, if there are more zeros than ones, more ones than zeros, or if the number of zeros is the same as the number of ones, meaning they are balanced.

To know the proportion we can count each of the zero and one values in the data with value_counts() method:

bankdata['class'].value_counts()

This outputs:

0    762
1    610
Name: class, dtype: int64

In the result above, we can see that there are 762 zeros and 610 ones, or 152 more zeros than ones. This means that we have a few more forged than real images. If that discrepancy were bigger, for instance, 5500 zeros and 610 ones, it could negatively impact our results. Since we are trying to use those examples in our model, more examples usually mean more information for the model to decide between forged and real notes. If there are few examples of real notes, the model is prone to be mistaken when trying to recognize them.

We already know that there are 152 more forged notes, but can we be sure those are enough examples for the model to learn? Knowing how many examples are needed for learning is a very hard question to answer. Instead, we can try to understand, in percentage terms, how big that difference between classes is.

The first step is to use pandas value_counts() method again, but now let's see the percentage by including the argument normalize=True:

bankdata['class'].value_counts(normalize=True)

The normalize=True calculates the percentage of the data for each class. So far, the percentage of forged (0) and real data (1) is:

0    0.555394
1    0.444606
Name: class, dtype: float64

This means that approximately (~) 56% of our dataset is forged and 44% of it is real. This gives us a 56:44 ratio, which is a difference of about 11 percentage points. As a rule of thumb, this is usually treated as a small difference, so the data can be considered balanced. If instead of a 56:44 proportion, there was an 80:20 or 70:30 proportion, then our data would be considered imbalanced, and we would need to do some imbalance treatment, but, fortunately, this is not the case.
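The arithmetic behind this comparison can be reproduced from the two class counts alone. The sketch below does not need the dataset: it rebuilds the counts reported above in a hypothetical `counts` Series and computes the difference from the unrounded proportions, so the result can differ slightly from subtracting the rounded percentages.

```python
import pandas as pd

# Class counts reported above: 762 zeros and 610 ones
counts = pd.Series({0: 762, 1: 610}, name='class')

proportions = counts / counts.sum()
difference_pp = (proportions[0] - proportions[1]) * 100   # in percentage points
majority_to_minority = counts.max() / counts.min()

print(proportions.round(4))
print(f'Difference: {difference_pp:.1f} percentage points')
print(f'Majority-to-minority ratio: {majority_to_minority:.2f}')
```

The majority-to-minority ratio is another common way of expressing the same information. A value close to 1 means the classes have similar sizes.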

We can also see this difference visually, by taking a look at the class or target's distribution with Pandas' built-in histogram, by using:

bankdata['class'].plot.hist();

This plots a histogram using the dataframe structure directly, in combination with the matplotlib library that is behind the scenes.

By looking at the histogram, we can be sure that our target values are either 0 or 1 and that the data is balanced.

This was an analysis of the column that we were trying to predict, but what about analyzing the other columns of our data?

We can have a look at the statistical measurements with the describe() dataframe method. We can also use .T, or transpose, to invert columns and rows, making it more direct to compare across values:

bankdata.describe().T

This results in:

            count   mean        std         min         25%         50%         75%         max
variance    1372.0  0.433735    2.842763    -7.0421     -1.773000   0.49618     2.821475    6.8248
skewness    1372.0  1.922353    5.869047    -13.7731    -1.708200   2.31965     6.814625    12.9516
curtosis    1372.0  1.397627    4.310030    -5.2861     -1.574975   0.61663     3.179250    17.9274
entropy     1372.0  -1.191657   2.101013    -8.5482     -2.413450   -0.58665    0.394810    2.4495
class       1372.0  0.444606    0.497103    0.0000       0.000000   0.00000     1.000000    1.0000

Notice that the skewness and curtosis columns have the largest standard deviations (about 5.87 and 4.31) and the widest ranges between their min and max values. This indicates that they have values that are further from the data's central tendency, or a greater variability.

We can also take a peek at each feature's distribution visually, by plotting each feature's histogram inside a for loop. Besides looking at the distribution, it would be interesting to look at how the points of each class are separated regarding each feature. To do that, we can plot a scatter plot making a combination of features between them, and assign different colors to each point in regards to its class.

Let's start with each feature's distribution, and plot the histogram of each data column except for the class column. The class column will not be taken into consideration by its position in the bankdata columns array. All columns will be selected except for the last one with columns[:-1]:

import matplotlib.pyplot as plt

for col in bankdata.columns[:-1]:
    plt.title(col)
    bankdata[col].plot.hist() #plotting the histogram with Pandas
    plt.show();

After running the above code, we can see that the skewness and entropy data distributions are negatively skewed and the curtosis distribution is positively skewed. Variance is the only distribution that is roughly symmetrical and close to normal.

We can now move on to the second part, and plot the scatter plot of each variable. To do this, we can also select all columns except for the class, with columns[:-1], use Seaborn's scatterplot() and two for loops to obtain the variations in pairing for each of the features. We can also exclude the pairing of a feature with itself, by testing if the first feature equals the second one with an if statement.

import seaborn as sns

for feature_1 in bankdata.columns[:-1]:
    for feature_2 in bankdata.columns[:-1]:
        if feature_1 != feature_2: # test if the features are different
            print(feature_1, feature_2) # prints features names
            sns.scatterplot(x=feature_1, y=feature_2, data=bankdata, hue='class') # plots each feature points with its color depending on the class column value
            plt.show();

Notice that, in all graphs, the real and forged data points are not clearly separated from each other, which means there is some kind of superposition of classes. Since an SVM model uses a line to separate classes, could any of those groups in the graphs be separated using only one line? It seems unlikely. This is what most real data looks like. The closest we can get to a separation is in the combination of skewness and variance, or entropy and variance plots. This is probably due to variance data having a distribution shape that is closer to normal.

But looking at all of those graphs in sequence can be a little hard. We have the alternative of looking at all the distribution and scatter plot graphs together by using Seaborn's pairplot().

Both previous for loops we had done can be substituted by just this line:

sns.pairplot(bankdata, hue='class');

Looking at the pair plot, it seems that, actually, curtosis and variance would be the easiest combination of features to separate by a line, or the closest to being linearly separable.

If most data is far from being linearly separable, we can try to preprocess it by reducing its dimensions, and also normalize its values to try to make the distribution closer to normal.

For this case, let's use the data as it is, without further preprocessing. Later, we can go back one step, add preprocessing to the data, and compare the results.

Advice: When working with data, information is usually lost when transforming it, because we are making approximations, instead of collecting more data. Working with the initial data first as it is, if possible, offers a baseline before trying other preprocessing techniques. When following this path, the initial result using raw data can be compared with another result that uses preprocessing techniques on the data.

Note: Usually in Statistics, when building models, it is common to follow a procedure depending on the kind of data (discrete, continuous, categorical, numerical), its distribution, and the model assumptions. While in Computer Science (CS), there is more space for trial, error and new iterations. In CS it is common to have a baseline to compare against. In Scikit-Learn, there is an implementation of dummy models (or dummy estimators), some aren't better than tossing a coin, and just answer yes (or 1) 50% of the time. It is interesting to use dummy models as a baseline for the actual model when comparing results. It is expected that the actual model results are better than a random guess, otherwise, using a machine learning model wouldn't be necessary.
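A dummy baseline of this kind can be sketched with Scikit-Learn's `DummyClassifier`. To keep the example self-contained, it uses hypothetical stand-in arrays, `X_demo` and `y_demo`, that only reproduce the class counts of the bank notes data (762 zeros and 610 ones). The feature values are all zeros, since dummy estimators ignore them.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical stand-in data with the same class counts as the bank notes dataset
y_demo = np.array([0] * 762 + [1] * 610)
X_demo = np.zeros((len(y_demo), 4))  # dummy estimators ignore the feature values

# 'most_frequent' always predicts the majority class
majority = DummyClassifier(strategy='most_frequent').fit(X_demo, y_demo)
# 'uniform' picks a class at random with equal probability, like a coin toss
coin_toss = DummyClassifier(strategy='uniform', random_state=42).fit(X_demo, y_demo)

print('Majority-class accuracy:', majority.score(X_demo, y_demo))
print('Coin-toss accuracy:', coin_toss.score(X_demo, y_demo))
```

Any real model trained on the data should score clearly above both of these values to be worth using.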

Implementing SVM with Scikit-Learn

Before getting more into the theory of how SVM works, we can build our first baseline model with the data, and Scikit-Learn's Support Vector Classifier or SVC class.

Our model will receive the wavelet coefficients and try to classify them based on the class. The first step in this process is to separate the coefficients, or features, from the class, or target. The second step is to further divide the data into a set that will be used for the model's learning, or train set, and another one that will be used for the model's evaluation, or test set.

Note: The nomenclature of test and evaluation can be a little confusing, because you can also split your data between train, evaluation and test sets. In this way, instead of having two sets, you would have an intermediary set just to use and see if your model's performance is improving. This means that the model would be trained with the train set, improved with the evaluation set, and a final metric would be obtained with the test set.

Some people say that the evaluation is that intermediary set, others will say that the test set is the intermediary set, and that the evaluation set is the final set. This is another way to try to guarantee that the model isn't seeing the same example in any way, or that some kind of data leakage isn't happening, and that there is a model generalization by the improvement of the last set metrics. If you want to follow that approach, you can further divide the data once more as described in this Scikit-Learn's train_test_split() - Training, Testing and Validation Sets guide.
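One common way to obtain three sets is to call `train_test_split()` twice. The sketch below is an assumption about how that could look, not the approach used in the rest of this guide: the `X_demo` and `y_demo` arrays are hypothetical stand-ins with the same number of rows as the bank notes data, and the 60/20/20 proportions are just an example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

SEED = 42
# Hypothetical stand-in arrays with the same number of rows as the bank notes data
X_demo = np.arange(1372 * 4).reshape(1372, 4)
y_demo = np.array([0] * 762 + [1] * 610)

# First split: hold out 20% of the data as the final test set
X_rest, X_test, y_rest, y_test = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=SEED)
# Second split: 25% of the remaining 80% becomes the intermediary set (20% overall)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=SEED)

print(len(X_train), len(X_val), len(X_test))
```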

Dividing Data into Train/Test Sets

In the previous section, we understood and explored the data. Now, we can divide our data into two arrays: one for the four features, and another for the fifth, or target, feature. Since we want to predict the class depending on the wavelet coefficients, our y will be the class column and our X will be the variance, skewness, curtosis, and entropy columns.

To separate the target and features, we can attribute only the class column to y, later dropping it from the dataframe to attribute the remaining columns to X with .drop() method:

y = bankdata['class']
X = bankdata.drop('class', axis=1) # axis=1 means dropping from the column axis

Once the data is divided into attributes and labels, we can further divide it into train and test sets. This could be done by hand, but the model_selection module of Scikit-Learn contains the train_test_split() method that allows us to randomly divide data into train and test sets.

To use it, we can import it, call the train_test_split() method, pass in X and y data, and define a test_size to pass as an argument. In this case, we will define it as 0.20, which means 20% of the data will be used for testing, and the other 80% for training.

This method randomly takes samples respecting the percentage we've defined, but keeps the X-y pairs together, so that the sampling doesn't mix up the relationship between features and labels.

Since the sampling process is inherently random, we would get different results each time we run the method. To have the same results, or reproducible results, we can define a constant called SEED with the value of 42 and pass it as the random_state argument.

You can execute the following script to do so:

from sklearn.model_selection import train_test_split

SEED = 42

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = SEED)

Notice that the train_test_split() method already returns the X_train, X_test, y_train, y_test sets in this order. We can print the number of samples separated for train and test by getting the first (0) element of the shape property returned tuple:

xtrain_samples = X_train.shape[0]
xtest_samples = X_test.shape[0]

print(f'There are {xtrain_samples} samples for training and {xtest_samples} samples for testing.')

This shows that there are 1097 samples for training and 275 for testing.
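The split above is purely random, so the share of each class in the train and test sets can drift a little from the share in the full dataset. `train_test_split()` also accepts a `stratify` argument that keeps those shares approximately equal. It is not used in this guide, and the sketch below only illustrates it on hypothetical stand-in arrays (`X_demo`, `y_demo`) that reproduce the class counts reported earlier.

```python
import numpy as np
from sklearn.model_selection import train_test_split

SEED = 42
# Hypothetical stand-in data with the class counts reported earlier
y_demo = np.array([0] * 762 + [1] * 610)
X_demo = np.zeros((len(y_demo), 4))

# stratify=y_demo asks for the same class proportions in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=SEED, stratify=y_demo)

print('Share of class 1 overall:', y_demo.mean())
print('Share of class 1 in train:', y_tr.mean())
print('Share of class 1 in test:', y_te.mean())
```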

Training the Model

We have divided the data into train and test sets. Now it is time to create and train an SVM model on the train data. To do that, we can import the Support Vector Classifier class, or SVC class, from Scikit-Learn's svm module.

After importing the class, we can create an instance of it. Since we are creating a simple SVM model, we are trying to separate our data linearly, that is, to draw a line that divides our data. This is the same as using a linear function, which we select by defining kernel='linear' as an argument for the classifier:

from sklearn.svm import SVC
svc = SVC(kernel='linear')

This way, the classifier will try to find a linear function that separates our data. After creating the model, let's train it, or fit it with the train data, employing the fit() method and giving the X_train features and y_train targets as arguments.

We can execute the following code in order to train the model:

svc.fit(X_train, y_train)

Just like that, the model is trained. So far, we have understood the data, divided it, created a simple SVM model, and fitted the model to the train data.
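With a linear kernel, the fitted classifier exposes the linear function it found through its `coef_` and `intercept_` attributes. The sketch below illustrates this on hypothetical stand-in data generated with `make_blobs()`, not on the bank notes data. It recomputes the score of each sample by hand as w · x + b and checks that it matches what `decision_function()` and `predict()` return.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Hypothetical stand-in data: two blobs of points with four features each
X_demo, y_demo = make_blobs(n_samples=200, centers=2, n_features=4, random_state=42)

demo_svc = SVC(kernel='linear').fit(X_demo, y_demo)

# For a linear kernel, the learned function is w . x + b
w, b = demo_svc.coef_[0], demo_svc.intercept_[0]
manual_scores = X_demo @ w + b

# The sign of the score decides the class: positive -> class 1, otherwise class 0
manual_pred = (manual_scores > 0).astype(int)

print(np.allclose(manual_scores, demo_svc.decision_function(X_demo)))
print((manual_pred == demo_svc.predict(X_demo)).all())
```

The same attributes are available on the `svc` object trained above, since it also uses `kernel='linear'`.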

The next step is to understand how well that fit managed to describe our data. In other words, to answer if a linear SVM was an adequate choice.

Making Predictions

A way to answer if the model managed to describe the data is to calculate and look at some classification metrics.

Considering that the learning is supervised, we can make predictions with X_test and compare those prediction results - which we might call y_pred - with the actual y_test, or ground truth.

To predict some of the data, the model's predict() method can be employed. This method receives the test features, X_test, as an argument and returns a prediction, either 0 or 1, for each one of X_test's rows.

After predicting the X_test data, the results are stored in a y_pred variable. So each of the classes predicted with the simple linear SVM model are now in the y_pred variable.

This is the prediction code:

y_pred = svc.predict(X_test)

Considering we have the predictions, we can now compare them to the actual results.

Evaluating the Model

There are several ways of comparing predictions with actual results, and they measure different aspects of a classification. Some most used classification metrics are:

  1. Confusion Matrix: when we need to know how many samples we got right or wrong for each class. The values that were positive and correctly predicted as positive are called true positives, and the ones that were predicted as positives but weren't positives are called false positives. The same nomenclature of true negatives and false negatives is used for negative values;

  2. Precision: when our aim is to understand how many of the values predicted as positive by our classifier were actually positive. Precision divides the true positives by all the samples that were predicted as positives;

$$
precision = \frac{\text{true positives}}{\text{true positives} + \text{false positives}}
$$

  3. Recall: commonly calculated along with precision to understand how many of the actual positives were identified by our classifier. The recall is calculated by dividing the true positives by everything that should have been predicted as positive;

$$
recall = \frac{\text{true positives}}{\text{true positives} + \text{false negatives}}
$$

  4. F1 score: the balanced or harmonic mean of precision and recall. The lowest value is 0 and the highest is 1. When the f1-score is equal to 1, it means all classes were correctly predicted. This is a very hard score to obtain with real data (exceptions almost always exist).

$$
\text{f1-score} = 2* \frac{\text{precision} * \text{recall}}{\text{precision} + \text{recall}}
$$
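The three formulas above can be checked by hand. The sketch below uses the counts from the test-set confusion matrix reported later in this guide, taking class 1 (real notes) as the positive class. The `y_true` and `y_hat` arrays are hypothetical reconstructions with the same counts, built only to cross-check the manual results against Scikit-Learn's own metric functions.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score, f1_score

# Counts from the test-set confusion matrix, with class 1 as "positive"
tp, fn = 125, 2   # real notes classified correctly / incorrectly
tn, fp = 146, 2   # forged notes classified correctly / incorrectly

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * (precision * recall) / (precision + recall)

# Rebuild label arrays with the same counts to cross-check against Scikit-Learn
y_true = np.array([1] * tp + [1] * fn + [0] * tn + [0] * fp)
y_hat = np.array([1] * tp + [0] * fn + [0] * tn + [1] * fp)

print(round(precision, 2), round(recall, 2), round(f1, 2))
print(np.isclose(precision, precision_score(y_true, y_hat)))
print(np.isclose(recall, recall_score(y_true, y_hat)))
print(np.isclose(f1, f1_score(y_true, y_hat)))
```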

Now that we are acquainted with the confusion matrix, precision, recall, and F1 score measures, we can calculate them by importing Scikit-Learn's metrics module. This module contains the classification_report and confusion_matrix methods. The classification report method returns the precision, recall, and f1 score. Both classification_report and confusion_matrix can be readily used to find out the values for all those important metrics.

For calculating the metrics, we import the methods, call them, and pass as arguments the actual classification labels, y_test (the y_true parameter), and the predicted classifications, y_pred.

For a better visualization of the confusion matrix, we can plot it in a Seaborn heatmap along with quantity annotations, and for the classification report, it is best to print its outcome, so its results are formatted. This is the code:

from sklearn.metrics import classification_report, confusion_matrix

cm = confusion_matrix(y_test,y_pred)
sns.heatmap(cm, annot=True, fmt='d').set_title('Confusion matrix of linear SVM') # fmt='d' formats the numbers as digits, which means integers

print(classification_report(y_test,y_pred))

This displays:

                precision    recall  f1-score   support

           0       0.99      0.99      0.99       148
           1       0.98      0.98      0.98       127

    accuracy                           0.99       275
   macro avg       0.99      0.99      0.99       275
weighted avg       0.99      0.99      0.99       275

In the classification report, we can see there is a precision of 0.99, recall of 0.99 and an f1 score of 0.99 for the forged notes, or class 0. Those measurements were obtained using 148 samples as shown in the support column. Meanwhile, for class 1, or real notes, the results were one percentage point lower: 0.98 of precision, 0.98 of recall, and 0.98 of f1 score. This time, 127 image measurements were used for obtaining those results.

If we look at the confusion matrix, we can also see that from 148 class 0 samples, 146 were correctly classified, and there were 2 false positives, while for 127 class 1 samples, there were 2 false negatives and 125 true positives.

We can read the classification report and the confusion matrix, but what do they mean?

Interpreting Results

To find out the meaning, let's look at all the metrics combined.

Almost all the samples for class 1 were correctly classified: there were 2 mistakes for our model when identifying actual bank notes. This is the same as 0.98, or 98%, recall. Something similar can be said of class 0: only 2 samples were classified incorrectly, while 146 are true negatives, totaling a precision of 99%.

Besides those results, all others are marking 0.99, which is almost 1, a very high metric. Most of the time, when such a high metric happens with real life data, this might indicate a model that is over adjusted to the data, or overfitted.

When there is an overfit, the model might work well when predicting the data that is already known, but it loses the ability to generalize to new data, which is important in real world scenarios.

A quick test to find out if an overfit is happening is to also predict with the train data. If the model has somewhat memorized the train data, the metrics will be very close to 1 or 100%. Remember that the train data is larger than the test data, so try to look at it proportionally: more samples mean more chances of making mistakes, unless there has been some overfit.

To predict with train data, we can repeat what we have done for test data, but now with X_train:

y_pred_train = svc.predict(X_train)

cm_train = confusion_matrix(y_train,y_pred_train)
sns.heatmap(cm_train, annot=True, fmt='d').set_title('Confusion matrix of linear SVM with train data')

print(classification_report(y_train,y_pred_train))

This outputs:

                precision    recall  f1-score   support

           0       0.99      0.99      0.99       614
           1       0.98      0.99      0.99       483

    accuracy                           0.99      1097
   macro avg       0.99      0.99      0.99      1097
weighted avg       0.99      0.99      0.99      1097

The train metrics are also 99%, even with 4 times more data, so there might be an overfit. Keep in mind that an overfit usually shows up as train metrics that are clearly higher than the test metrics, which is not the case here, so this result alone doesn't confirm it. What can be done in this scenario?

To reduce an overfit, we can add more train observations, use a method of training with different parts of the dataset, such as cross validation, and also change the hyperparameters, which are the parameters that are set prior to training, when creating our model. Most of the time, Scikit-learn sets some parameters as default, and this can happen silently if there is not much time dedicated to reading the documentation.
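As a brief preview, cross validation with an explicitly set hyperparameter could look like the sketch below. It is only an illustration under assumptions: the data is a hypothetical stand-in generated with `make_classification()`, the choice of 5 folds is arbitrary, and `C=1.0` simply writes out the value Scikit-Learn already uses by default for `SVC`.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Hypothetical stand-in data: 500 samples with four features and two classes
X_demo, y_demo = make_classification(
    n_samples=500, n_features=4, n_informative=3, n_redundant=1, random_state=42)

# C is one of the SVC hyperparameters; 1.0 is Scikit-Learn's default value
demo_svc = SVC(kernel='linear', C=1.0)

# 5-fold cross validation: train on 4 parts, score on the 5th, repeated 5 times
scores = cross_val_score(demo_svc, X_demo, y_demo, cv=5)

print('Accuracy per fold:', scores)
print('Mean accuracy:', scores.mean())
```

If the accuracy varies a lot from one fold to another, that is a hint that the model's performance depends heavily on which samples it was trained with.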

You can check the second part of this guide (coming soon!) to see how to implement cross validation and perform a hyperparameter tuning.

Conclusion

In this article we studied the simple linear kernel SVM. We got the intuition behind the SVM algorithm, used a real dataset, explored the data, and saw how this data can be used along with SVM by implementing it with Python's Scikit-Learn library.

To keep practicing, you can try other real-world datasets available at places like Kaggle, UCI, Big Query public datasets, universities, and government websites.

I would also suggest that you explore the actual mathematics behind the SVM model. Although you are not necessarily going to need it in order to use the SVM algorithm, it is still very handy to know what is actually going on behind the scenes while your algorithm is finding decision boundaries.

If you wish to keep learning about SVMs, you can go to the second part of this series, Understanding SVM Hyperparameters.

Cássia Sampaio, Author

Data Scientist, Research Software Engineer, and teacher. Cassia is passionate about transformative processes in data, technology and life. She graduated in Philosophy and Information Systems, with a Stricto Sensu Master's Degree in the field of Foundations Of Mathematics.

© 2013-2025 Stack Abuse. All rights reserved.
