Definitive Guide to Logistic Regression in Python

Introduction

Although sometimes confused with linear regression by novices - due to sharing the term "regression" - logistic regression is very different from linear regression. Linear regression predicts continuous values, such as 2, 2.45, or 6.77, which makes it a regression algorithm; logistic regression predicts discrete values, such as 0 or 1, or 1, 2, or 3, which makes it a classification algorithm. Yes, it's called regression, but it is a classification algorithm. More on that in a moment.

Therefore, if your data science problem involves continuous values, you can apply a regression algorithm (linear regression is one of them). Otherwise, if it involves classifying inputs, discrete values, or classes, you can apply a classification algorithm (logistic regression is one of them).

In this guide, we'll be performing logistic regression in Python with the Scikit-Learn library. We will also explain why the word "regression" is present in the name and how logistic regression works.

To do that, we will first load data that will be classified, visualized, and pre-processed. Then, we will build a logistic regression model that will understand that data. This model will then be evaluated, and employed to predict values based on new input.

Motivation

The company you work for has entered into a partnership with a Turkish agricultural farm. This partnership involves selling pumpkin seeds. Pumpkin seeds are very important for human nutrition, containing a good proportion of carbohydrates, fat, protein, calcium, potassium, phosphorus, magnesium, iron, and zinc.

In the data science team, your task is to tell the difference between the types of pumpkin seeds just by using data - or classifying the data according to seed type.

The Turkish farm works with two pumpkin seed types, one is called Çerçevelik and the other Ürgüp Sivrisi.

To classify the pumpkin seeds, your team has followed the 2021 paper "The use of machine learning methods in classification of pumpkin seeds (Cucurbita pepo L.)" by Koklu, Sarigil, and Ozbek, published in Genetic Resources and Crop Evolution - in this paper, there is a methodology for photographing the seeds and extracting their measurements from the images.

After completing the process described in the paper, the following measurements were extracted:

  • Area - the number of pixels within the borders of a pumpkin seed
  • Perimeter - the circumference in pixels of a pumpkin seed
  • Major Axis Length - the length in pixels of the long axis of a pumpkin seed
  • Minor Axis Length - the length in pixels of the short axis of a pumpkin seed
  • Eccentricity - the eccentricity of a pumpkin seed
  • Convex Area - the number of pixels of the smallest convex shell around the region formed by the pumpkin seed
  • Extent - the ratio of a pumpkin seed's area to the bounding box pixels
  • Equivalent Diameter - the diameter of a circle with the same area as the pumpkin seed, that is, the square root of the seed's area multiplied by four and divided by pi
  • Compactness - the proportion of the area of the pumpkin seed relative to the area of a circle with the same circumference
  • Solidity - the ratio of the pumpkin seed's area to its convex area, describing how convex the seed is
  • Roundness - the ovality of a pumpkin seed, without considering the distortions of its edges
  • Aspect Ratio - the aspect ratio of the pumpkin seeds

Those are the measurements you have to work with. Besides the measurements, there is also the Class label for the two types of pumpkin seeds.

To start classifying the seeds, let's import the data and begin to look at it.

Understanding the Dataset

Note: You can download the pumpkin dataset here.

After downloading the dataset, we can load it into a DataFrame structure using the pandas library. Since it is an Excel file, we will use the read_excel() method:

import pandas as pd

fpath = 'dataset/pumpkin_seeds_dataset.xlsx' # substitute with the path to your Excel file
df = pd.read_excel(fpath)

Once the data is loaded in, we can take a quick peek at the first 5 rows using the head() method:

df.head() 

This results in:

    Area 	Perimeter 	Major_Axis_Length 	Minor_Axis_Length 	Convex_Area 	Equiv_Diameter 	Eccentricity 	Solidity 	Extent 	Roundness 	Aspect_Ration 	Compactness 	Class
0 	56276 	888.242 	326.1485 			220.2388 			56831 			267.6805 		0.7376 			0.9902 		0.7453 	0.8963 		1.4809 			0.8207 			Çerçevelik
1 	76631 	1068.146 	417.1932 			234.2289 			77280 			312.3614 		0.8275 			0.9916 		0.7151 	0.8440 		1.7811 			0.7487 			Çerçevelik
2 	71623 	1082.987 	435.8328 			211.0457 			72663 			301.9822 		0.8749 			0.9857 		0.7400 	0.7674 		2.0651 			0.6929 			Çerçevelik
3 	66458 	992.051 	381.5638 			222.5322 			67118 			290.8899 		0.8123 			0.9902 		0.7396 	0.8486 		1.7146 			0.7624 			Çerçevelik
4 	66107 	998.146 	383.8883 			220.4545 			67117 			290.1207 		0.8187 			0.9850 		0.6752 	0.8338 		1.7413 			0.7557 			Çerçevelik

Here, we have all the measurements in their respective columns, our features, and also the Class column, our target, which is the last one in the dataframe. We can see how many measurements we have using the shape attribute:

df.shape 

The output is:

(2500, 13)

The shape result tells us that there are 2500 entries (or rows) in the dataset and 13 columns. Since we know there is one target column - this means we have 12 feature columns.

We can now explore the target variable, the pumpkin seed Class. Since we will predict that variable, it is interesting to see how many samples of each pumpkin seed we have. Usually, the smaller the difference between the number of instances in our classes, the more balanced our sample is and the better our predictions.

This inspection can be done by counting each seed sample with the value_counts() method:

df['Class'].value_counts() 

The above code displays:

Çerçevelik       1300
Ürgüp Sivrisi    1200
Name: Class, dtype: int64

We can see that there are 1300 samples of the Çerçevelik seed and 1200 samples of the Ürgüp Sivrisi seed. Notice that the difference between them is 100 samples, a very small difference, which is good for us and indicates there is no need to rebalance the number of samples.

Let's also look at the descriptive statistics of our features with the describe() method to see how well distributed the data is. We will also transpose the resulting table with T to make it easier to compare across statistics:

df.describe().T

The resulting table is:

                    count 	mean 			std 			min 		25% 			50% 			75% 			max
Area 				2500.0 	80658.220800 	13664.510228 	47939.0000 	70765.000000 	79076.00000 	89757.500000 	136574.0000
Perimeter 			2500.0 	1130.279015 	109.256418 		868.4850 	1048.829750 	1123.67200 		1203.340500 	1559.4500
Major_Axis_Length 	2500.0 	456.601840 		56.235704 		320.8446 	414.957850 		449.49660 		492.737650 		661.9113
Minor_Axis_Length 	2500.0 	225.794921 		23.297245 		152.1718 	211.245925 		224.70310 		240.672875 		305.8180
Convex_Area 		2500.0 	81508.084400 	13764.092788 	48366.0000 	71512.000000 	79872.00000 	90797.750000 	138384.0000
Equiv_Diameter 		2500.0 	319.334230 		26.891920 		247.0584 	300.167975 		317.30535 		338.057375 		417.0029
Eccentricity 		2500.0 	0.860879 		0.045167 		0.4921 		0.831700 		0.86370 		0.897025 		0.9481
Solidity 			2500.0 	0.989492 		0.003494 		0.9186 		0.988300 		0.99030 		0.991500 		0.9944
Extent 				2500.0 	0.693205 		0.060914 		0.4680 		0.658900 		0.71305 		0.740225 		0.8296
Roundness 			2500.0 	0.791533 		0.055924 		0.5546 		0.751900 		0.79775 		0.834325 		0.9396
Aspect_Ration 		2500.0 	2.041702 		0.315997 		1.1487 		1.801050 		1.98420 		2.262075 		3.1444
Compactness 		2500.0 	0.704121 		0.053067 		0.5608 		0.663475 		0.70770 		0.743500 		0.9049

By looking at the table, when comparing the mean and standard deviation (std) columns, it can be seen that several features have a sizable standard deviation relative to their mean. That indicates that the data values aren't tightly concentrated around the mean, but rather spread out around it - in other words, they have high variability.

Also, when looking at the minimum (min) and maximum (max) columns, some features, such as Area and Convex_Area, have big differences between their minimum and maximum values. This means that those columns contain both very small and very large data values - in other words, a high amplitude, or range, between data values.

With high variability, high amplitude, and features measured in different units, most of our data would benefit from having the same scale for all features - that is, from being scaled. The standardization we will use centers the data around the mean and scales it to unit variance.
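Concretely, the standardization we will apply later with Scikit-Learn's StandardScaler transforms each value by subtracting the column mean and dividing by the column standard deviation:

$$
z = \frac{x - \mu}{\sigma}
$$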

This scenario probably also indicates that there are outliers and extreme values in data. So, it is best to have some outlier treatment besides scaling the data.

There are some machine learning algorithms, for instance, tree-based algorithms such as Random Forest Classification, that aren't affected by high data variance, outliers, and extreme values. Logistic regression is different, it is based on a function that categorizes our values, and the parameters of that function can be affected by values that are out of the general data trend and have high variance.

Advice: If you'd like to read more about feature scaling - read our "Feature Scaling Data with Scikit-Learn for Machine Learning"!

We will understand more about logistic regression in a bit when we get to implement it. For now, we can keep exploring our data.

Note: There is a popular saying in Computer Science: "Garbage in, garbage out" (GIGO), which is well suited for machine learning. When we feed a model garbage data - measurements that don't describe the phenomenon in themselves, or data that wasn't understood and prepared according to the kind of algorithm or model - it will likely generate incorrect output that won't work on a day to day basis.
This is one of the reasons why exploring and understanding the data, as well as how the chosen model works, are so important. By doing that, we can avoid putting garbage into our model - putting value into it instead, and getting value out.

Visualizing the Data

Up until now, with the descriptive statistics, we have a somewhat abstract snapshot of some qualities of the data. Another important step is to visualize it and confirm our hypothesis of high variance, amplitude, and outliers. To see if what we have observed so far shows in the data, we can plot some graphs.

It is also interesting to see how the features are relating to the two classes that will be predicted. To do that, let's import the seaborn package and use the pairplot graph to look at each feature distribution, and each class separation per feature:

import seaborn as sns

# hue='Class' color codes the points according to both classes in the data
sns.pairplot(data=df, hue='Class')

Note: The above code might take a while to run, since the pairplot combines scatterplots of every pair of features (with 12 features, that is a 12 by 12 grid of subplots), and also displays the feature distributions along the diagonal.

Looking at the pairplot, we can see that in most cases the points of the Çerçevelik class are clearly separated from the points of the Ürgüp Sivrisi class. Either the points of one class are to the right while the others are to the left, or some are up while the others are down. If we were to use some kind of curve or line to separate classes, this separation would make the job easier; if the classes were mixed, classification would be a harder task.

In the Eccentricity, Compactness and Aspect_Ration columns, some points that are "isolated" or deviating from the general data trend - outliers - are easily spotted as well.

When looking at the diagonal from the upper left to the bottom right of the chart, notice the data distributions are also color-coded according to our classes. The distribution shapes and the distance between both curves are other indicators of how separable they are - the further from each other, the better. In most cases, they aren't superimposed, which implies that they are easier to separate, also contributing to our task.

In sequence, we can also plot the boxplots of all variables with the sns.boxplot() method. Most times, it is helpful to orient the boxplots horizontally, so their shapes match the distribution shapes; we can do that with the orient argument:

# h is from horizontal
sns.boxplot(data=df, orient='h') 

In the plot above, notice that Area and Convex_Area have such a high magnitude when compared to the magnitudes of the other columns, that they squish the other boxplots. To be able to look at all boxplots, we can scale the features and plot them again.

Before doing that, let's consider whether there are features whose values are intimately related to the values of other features - for instance, features whose values get bigger when other feature values get bigger, having a positive correlation; or features that do the opposite, getting smaller while the other values get bigger, having a negative correlation.

This is important to look at because having strong relationships in data might mean that some columns were derived from other columns or have a similar meaning to our model. When that happens, the model results might be overestimated and we want results that are closer to reality. If there are strong correlations, it also means that we can reduce the number of features, and use fewer columns making the model more parsimonious.

Note: The default correlation calculated with the corr() method is the Pearson correlation coefficient. This coefficient is indicated when data is quantitative, normally distributed, doesn't have outliers, and has a linear relationship.

Another choice would be to calculate Spearman's correlation coefficient. Spearman's coefficient is used when data is ordinal or non-linear, has any distribution, and contains outliers. Notice that our data doesn't totally fit Pearson's or Spearman's assumptions (there are also other correlation methods, such as Kendall's). Since our data is quantitative and it is important for us to measure its linear relationships, we will use Pearson's coefficient.

Let's take a look at the correlations between variables, and then we can move on to pre-processing the data. We will calculate the correlations with the corr() method and visualize them with Seaborn's heatmap(). The heatmap's default size tends to be small, so we will import matplotlib (the general visualization engine/library that Seaborn is built on top of) and change the size with figsize:

import matplotlib.pyplot as plt
plt.figure(figsize=(15, 10))

# numeric_only=True skips the non-numeric Class column (required on newer pandas versions)
correlations = df.corr(numeric_only=True)
sns.heatmap(correlations, annot=True) # annot=True displays the correlation values

In this heatmap, the values closer to 1 or -1 are the values we need to pay attention to. The first case denotes a high positive correlation and the second, a high negative correlation. Values that are not above 0.8 or below -0.8 will generally not be a problem for our logistic regression model.

When there are high correlations, such as the 0.99 correlation between Aspect_Ration and Compactness, this means that we can choose to use only Aspect_Ration or only Compactness, instead of both of them (since they'd be almost equal predictors of each other). The same holds for Eccentricity and Compactness with a -0.98 correlation, for Area and Perimeter with a 0.94 correlation, and some other columns. See the small sketch below for how such a reduction could look.
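For instance, if we later decide to keep only one column from each highly correlated pair, a minimal sketch could look like this (the specific columns dropped here are just one possible choice, to be compared against the baseline model we build first):

# Hypothetical example: drop one column from each highly correlated pair
reduced_df = df.drop(columns=['Compactness', 'Perimeter'])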

Pre-processing the Data

Since we have already explored the data for a while, we can start pre-processing it. For now, let's use all of the features for the class prediction. After obtaining a first model, a baseline, we can then remove some of the highly correlated columns and compare it to the baseline.

The feature columns will be our X data and the class column, our y target data:

y = df['Class']
X = df.drop(columns=['Class'])

Turning Categorical Features into Numeric Features

Regarding our Class column - its values aren't numbers, which means we also need to transform them. There are many ways to do this transformation; here, we will use the replace() method to replace Çerçevelik with 0 and Ürgüp Sivrisi with 1.

y = y.replace('Çerçevelik', 0).replace('Ürgüp Sivrisi', 1)

Keep the mapping in mind! When reading results from your model, you'll want to convert these numbers back - at least in your head, or back into the class names for other users - as sketched below.
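One convenient way to keep that mapping around, for instance, is a small dictionary so predictions can be translated back into seed names later (the class_names dict here is just an illustrative helper, not part of the original code):

# Hypothetical helper: translate numeric predictions back into seed names
class_names = {0: 'Çerçevelik', 1: 'Ürgüp Sivrisi'}

# After predicting, e.g.: [class_names[label] for label in y_pred]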

Dividing Data into Train and Test Sets

In our exploration, we've noted that the features needed scaling. If we did the scaling now, or in an automatic fashion, we would scale values with the whole of X and y. In that case, we'd introduce data leakage, as the values of the soon-to-be test set would have impacted the scaling. Data leakage is a common cause of irreproducible results and illusory high performance of ML models.

Thinking about the scaling shows that we need to first split X and y data further into train and test sets and then to fit a scaler on the training set, and to transform both the train and test sets (without ever having the test set impact the scaler that does this). For this, we will use Scikit-Learn's train_test_split() method:

from sklearn.model_selection import train_test_split
SEED = 42 # defining a SEED to maintain the same results

X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size=.25, 
                                                    random_state=SEED)

Setting test_size=.25 ensures we are using 25% of the data for testing and 75% for training. This could be omitted, since it is the default split, but the Pythonic way to write code advises that "explicit is better than implicit".

Note: The sentence "explicit is better than implicit" is a reference to The Zen of Python, or PEP 20. It lays out some suggestions for writing Python code. If those suggestions are followed, the code is considered Pythonic. You can learn more about it here.

After splitting the data into train and test sets, it is a good practice to look at how many records are in each set. That can be done with the shape attribute:

X_train.shape, X_test.shape, y_train.shape, y_test.shape

This displays:

((1875, 12), (625, 12), (1875,), (625,))

We can see that after the split, we have 1875 records for training and 625 for testing.

Scaling Data

Once we have our train and test sets ready, we can proceed to scale the data with the Scikit-Learn StandardScaler object (or other scalers provided by the library). To avoid leakage, the scaler is fitted to the X_train data and the train values are then used to scale - or transform - both the train and test data:

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

This is equivalent to calling:

scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

In the version we used, the first two lines are collapsed into a single fit_transform() call, which fits the scaler on the training set and transforms it in one go. We can now reproduce the boxplot graphs to see the difference after scaling the data.

Since scaling returns a NumPy array without column names, prior to plotting we can organize the train data into a DataFrame with column names again to facilitate the visualization:

column_names = df.columns[:12] # selecting all columns apart from Class from the original data
X_train = pd.DataFrame(X_train, columns=column_names)

sns.boxplot(data=X_train, orient='h')

We can finally see all of our boxplots! Notice that all of them have outliers, and the features that present a distribution further from normal (that have curves either skewed to the left or right), such as Solidity, Extent, Aspect_Ration, and Compactness, are the same that had higher correlations.

Removing Outliers with IQR Method

We already know that logistic regression can be impacted by outliers. One of the ways of treating them is to use a method called Interquartile Range or IQR. The initial step of the IQR method is to divide our train data into four parts, called quartiles. The first quartile, Q1, amounts to 25% of data, the second, Q2, to 50%, the third, Q3, to 75%, and the last one, Q4, to 100%. The boxes in the boxplot are defined by the IQR method and are a visual representation of it.

Considering a horizontal boxplot, the vertical line on the left marks 25% of the data, the vertical line in the middle marks 50% of the data (the median), and the vertical line on the right marks 75% of the data. The more even in size the two boxes defined by those vertical lines are - or the more centered the median line is - the closer our data is to the normal distribution, or the less skewed it is, which is helpful for our analysis.

Besides the IQR box, there are also horizontal lines on both sides of it. Those lines mark the minimum and maximum distribution values defined by

$$
Minimum = Q1 - 1.5*IQR
$$

and

$$
Maximum = Q3 + 1.5*IQR
$$

The IQR is exactly the difference between Q3 and Q1 (Q3 - Q1), and it covers the most central 50% of the data. That is why, when using the IQR, we end up filtering out the outliers at the data extremities, beyond the minimum and maximum points. Box plots give us a sneak peek of what the result of the IQR method will be.

We can use Pandas quantile() method to find our quantiles, and iqr from the scipy.stats package to obtain the interquartile data range for each column:

from scipy.stats import iqr

Q1 = X_train.quantile(q=.25)
Q3 = X_train.quantile(q=.75)

IQR = X_train.apply(iqr)

Now that we have Q1, Q3, and IQR, we can filter out the values that fall far from the median - the outliers:

# Boolean masks flagging every value below the lower bound or above the upper bound
below_minimum = X_train < (Q1 - 1.5 * IQR)
above_maximum = X_train > (Q3 + 1.5 * IQR)

# The tilde (~) is a negation operator, so the filter keeps only the rows
# that have no value below the minimum or above the maximum - this is our IQR filter
iqr_filter = ~(below_minimum | above_maximum).any(axis=1)

# We can now select the IQR-filtered rows in X_train
X_train = X_train[iqr_filter]

After filtering our training rows, we can see how many of them are still in the data with shape:

X_train.shape

This results in:

(1714, 12)

We can see that the number of rows went from 1875 to 1714 after filtering. This means that 161 rows contained outliers - roughly 8.6% of the training data.
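As a quick sanity check, we can compute that proportion directly:

removed_rows = 1875 - 1714
print(f'{removed_rows} rows removed ({removed_rows / 1875:.1%} of the training data)')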

Note: It is advised that the filtering of outliers, removal of NaN values, and other actions which involve filtering and cleansing data stay below or up to 10% of data. Try thinking of other solutions if your filtering or removal exceeds 10% of your data.

After removing outliers, we are almost ready to include data in the model. For the model fitting, we will use train data. X_train is filtered, but what about y_train?

y_train.shape

This outputs:

(1875,)

Notice that y_train still has 1875 rows. We need to match the number of y_train rows to the number of X_train rows - and not just arbitrarily. We need to remove the y-values of the exact pumpkin seed instances that we removed, and those are scattered through the training set. When we rebuilt X_train as a DataFrame after scaling, it got a fresh 0-to-1874 index that lines up position by position with y_train, and the outlier filtering left gaps in that index. We can therefore use the remaining X_train index values to select the corresponding positions in y_train:

y_train = y_train.iloc[X_train.index]

After doing that, we can look at the y_train shape again:

y_train.shape

Which outputs:

(1714,)

Now, y_train also has 1714 rows and they are the same as the X_train rows. We are finally ready to create our logistic regression model!

Implementing the Logistic Regression Model

The hard part is done! Preprocessing is usually more difficult than model development, when it comes to using libraries like Scikit-Learn, which have streamlined the application of ML models to just a couple of lines.

First, we import the LogisticRegression class and instantiate it, creating a LogisticRegression object:

from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression(random_state=SEED)

Second, we fit our train data to the logreg model with the fit() method, and predict our test data with the predict() method, storing the results as y_pred:

# X_train is a DataFrame with column names, while X_test is a bare NumPy array;
# feeding only the values avoids a feature-name mismatch between fitting and predicting
logreg.fit(X_train.values, y_train)
y_pred = logreg.predict(X_test)

We have already made predictions with our model! Let's look at the first 3 rows of X_test, the data we have just predicted on (wrapping the scaled array in a DataFrame only so the column names show up):

pd.DataFrame(X_test, columns=column_names)[:3]

The code above outputs:

       Area          Perimeter     Major_Axis_Length    Minor_Axis_Length    Convex_Area   Equiv_Diameter       Eccentricity  Solidity      Extent        Roundness     Aspect_Ration        Compactness
0      -1.098308     -0.936518     -0.607941            -1.132551            -1.082768     -1.122359            0.458911      -1.078259     0.562847      -0.176041     0.236617             -0.360134
1      -0.501526     -0.468936     -0.387303            -0.376176            -0.507652     -0.475015            0.125764      0.258195      0.211703      0.094213      -0.122270            0.019480
2      0.012372      -0.209168     -0.354107            0.465095              0.003871      0.054384            -0.453911     0.432515      0.794735      0.647084      -0.617427            0.571137

And at the first 3 predictions in y_pred to see the results:

y_pred[:3] 

This results in:

array([0, 0, 0])

For those three rows, our predictions were that they were seeds of the first class, Çerçevelik.

With logistic regression, instead of predicting the final class, such as 0, we can also predict the probability the row has of pertaining to the 0 class. This is what actually happens when logistic regression classifies data, and the predict() method then passes this prediction through a threshold to return a "hard" class. To predict the probability of pertaining to a class, predict_proba() is used:

y_pred_proba = logreg.predict_proba(X_test)

Let's also take a look at the first 3 values of the y probabilities predictions:

y_pred_proba[:3] 

Which outputs:

        # class 0   class 1   
array([[0.54726628, 0.45273372],
       [0.56324527, 0.43675473],
       [0.86233349, 0.13766651]])

Now, instead of three zeros, we have one column for each class. In the column on the left, starting with 0.54726628, are the probabilities of the data pertaining to class 0; and in the column on the right, starting with 0.45273372, are the probabilities of it pertaining to class 1.

Note: This difference in classification is also known as hard and soft prediction. Hard prediction boxes the prediction into a class, while soft prediction outputs the probability of the instance belonging to each class.

There is more information on how the predicted output was made. It wasn't actually 0, but a 55% chance of class 0 and a 45% chance of class 1. This shows that, of the first three X_test data points - all predicted as class 0 - only the third is a really clear-cut prediction, with an 86% probability of class 0; the first two sit much closer to the 50% threshold.

When communicating findings using ML methods - it's typically best to return a soft class, and the associated probability as the "confidence" of that classification.
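For example, since predict() applies a 0.5 probability threshold by default in this binary case, we could reproduce (or later adjust) the hard predictions from the probabilities ourselves - a small sketch, assuming the default cutoff:

import numpy as np

threshold = 0.5
# The probability of class 1 is in the second column of predict_proba's output
hard_predictions = (y_pred_proba[:, 1] >= threshold).astype(int)

# This should match the y_pred obtained from predict()
print(np.array_equal(hard_predictions, y_pred))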

We will talk more about how that is calculated when we go deeper into the model. At this time, we can proceed to the next step.

Evaluating the Model with Classification Reports

The third step is to see how the model performs on test data. We can import Scikit-Learn classification_report() and pass our y_test and y_pred as arguments. After that, we can print out its response.

The classification report contains the most used classification metrics, such as precision, recall, f1-score, and accuracy.

  1. Precision: to understand how many of the values predicted as positive by our classifier were actually positive. Precision divides those true positive values by everything that was predicted as positive:

$$
precision = \frac{\text{true positive}}{\text{true positive} + \text{false positive}}
$$

  2. Recall: to understand how many of the actual positives were identified by our classifier. Recall is calculated by dividing the true positives by everything that should have been predicted as positive:

$$
recall = \frac{\text{true positive}}{\text{true positive} + \text{false negative}}
$$

  3. F1 score: the balanced, or harmonic, mean of precision and recall. The lowest value is 0 and the highest is 1. When the f1-score is equal to 1, it means all predictions were correct - this is a very hard score to obtain on real data:

$$
\text{f1-score} = 2* \frac{\text{precision} * \text{recall}}{\text{precision} + \text{recall}}
$$

  4. Accuracy: describes how many predictions our classifier got right out of all predictions made. The lowest accuracy value is 0 and the highest is 1. That value is usually multiplied by 100 to obtain a percentage:

$$
accuracy = \frac{\text{number of correct predictions}}{\text{total number of predictions}}
$$

Note: It is extremely hard to obtain 100% accuracy on any real data; if that happens, be aware that some leakage or something else might be wrong. There is no consensus on an ideal accuracy value, and it is also context-dependent, but a value of 70% or above - meaning the classifier will make mistakes on at most 30% of the data - tends to be sufficient for most models.

from sklearn.metrics import classification_report
cr = classification_report(y_test, y_pred)
print(cr)

We can then look at the classification report output:

                precision    recall  f1-score   support

           0       0.83      0.91      0.87       316
           1       0.90      0.81      0.85       309

    accuracy                           0.86       625
   macro avg       0.86      0.86      0.86       625
weighted avg       0.86      0.86      0.86       625

This is our result. Notice that precision, recall, f1-score, and accuracy are all very high, above 80%, which is a strong result - though those numbers may be somewhat inflated by the highly correlated features and might not hold up on new data.

The model's accuracy is 86%, meaning that it gets the classification wrong 14% of the time. We have that overall information, but it would be interesting to know if the 14% mistakes happen regarding the classification of class 0 or class 1. To identify which classes are misidentified as which, and in which frequency - we can compute and plot a confusion matrix of our model's predictions.

Evaluating the Model with a Confusion Matrix

Let's calculate and then plot the confusion matrix. After doing that, we can understand each part of it. To plot the confusion matrix, we'll use Scikit-Learn confusion_matrix(), which we'll import from the metrics module.

The confusion matrix is easier to visualize using a Seaborn heatmap(). So, after generating it, we will pass our confusion matrix as an argument for the heatmap:

from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt='d')

In the confusion matrix, each cell shows how many samples the model got right or wrong for each class. The values that were actually positive and predicted as positive are called true positives, and the ones that were predicted as positive but weren't actually positive are called false positives. The same nomenclature of true negatives and false negatives is used for negative values.

By looking at the confusion matrix plot, we can see that we have 287 values that were 0 and predicted as 0 - or true positives for class 0 (the Çerçevelik seeds). We also have 250 true positives for class 1 (Ürgüp Sivrisi seeds). The true positives are always located in the matrix diagonal that goes from the upper left to the lower right.

We also have 29 values that were supposed to be 0, but predicted as 1 (false positives) and 59 values that were 1 and predicted as 0 (false negatives). With those numbers, we can understand that the error that the model makes the most is that it predicts false negatives. So, it can mostly end up classifying an Ürgüp Sivrisi seed as a Çerçevelik seed.

This kind of error is also reflected in the 81% recall of class 1. Notice that the metrics are connected, and the difference in recall comes partly from having 100 fewer samples of the Ürgüp Sivrisi class - one of the implications of having even slightly fewer samples than the other class. To further improve recall, you can either experiment with class weights or gather more Ürgüp Sivrisi samples, as sketched below.
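A minimal sketch of the class-weight idea, for instance - refitting the same kind of model with balanced weights (the weighted_logreg and y_pred_weighted names are just for illustration):

# Hypothetical variation: weigh mistakes on each class inversely to its frequency
weighted_logreg = LogisticRegression(class_weight='balanced', random_state=SEED)
weighted_logreg.fit(X_train.values, y_train)
y_pred_weighted = weighted_logreg.predict(X_test)

print(classification_report(y_test, y_pred_weighted))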

So far, we have executed most of the data science traditional steps and used the logistic regression model as a black box.

Note: If you want to go further, use Cross Validation (CV) and Grid Search to look for, respectively, the model that generalizes the most regarding data, and the best model parameters that are chosen before training, or hyperparameters.

Ideally, with CV and Grid Search, you could also implement a concatenated way to do data pre-processing steps, data split, modeling, and evaluation - which is made easy with Scikit-Learn pipelines.
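A rough sketch of what such a pipeline could look like - purely as an illustration of the idea, with an arbitrary example grid for the regularization strength C:

from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# The pipeline re-fits the scaler inside each cross-validation fold, avoiding leakage
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LogisticRegression(random_state=SEED))
])

param_grid = {'model__C': [0.01, 0.1, 1, 10]} # arbitrary example values

grid_search = GridSearchCV(pipeline, param_grid, cv=5)
grid_search.fit(X, y)
print(grid_search.best_params_, grid_search.best_score_)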

Now it's the time to open the black box and look inside of it, to go deeper into understanding how logistic regression works.

Going Deeper into How Logistic Regression Really Works

The word regression is not there by accident. To understand what logistic regression does, we can recall what its sibling, linear regression, does to the data. The linear regression formula is the following:

$$
y = b_0 + b_1 * x_1 + b_2 * x_2 + b_3 * x_3 + \ldots + b_n * x_n
$$

In which b0 is the regression intercept, b1 through bn are the coefficients, and x1 through xn are the feature values.

That equation resulted in a straight line that was used to predict new values. Recalling the introduction, the difference now is that we won't predict new values, but a class. So that straight line needs to change. With logistic regression, we introduce a non-linearity and the prediction is now made using a curve instead of a line:

Observe that while the linear regression line keeps going and is made of continuous infinite values, the logistic regression curve can be divided in the middle and has extremes in 0 and 1 values. That "S" shape is the reason it classifies data - the points that are closer or fall on the highest extremity belong to class 1, while the points that are in the lower quadrant or closer to 0, belong to class 0. The middle of the "S" is the middle between 0 and 1, 0.5 - it is the threshold for the logistic regression points.
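To see that "S" shape for yourself, you can plot the logistic (sigmoid) function directly - a small sketch using NumPy and Matplotlib:

import numpy as np
import matplotlib.pyplot as plt

z = np.linspace(-10, 10, 200)
sigmoid = 1 / (1 + np.exp(-z)) # the logistic function

plt.plot(z, sigmoid)
plt.axhline(0.5, linestyle='--', color='gray') # the 0.5 decision threshold
plt.xlabel('linear combination (z)')
plt.ylabel('probability')
plt.show()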

We already understand the visual difference between logistic and linear regression, but what about the formula? Logistic regression starts from the same linear combination used in linear regression:

$$
y = b_0 + b_1 * x_1 + b_2 * x_2 + b_3 * x_3 + \ldots + b_n * x_n
$$

Instead of outputting that value directly, it passes it through the logistic (sigmoid) function, so the result can be written as:

$$
y_{prob} = \frac{1}{1 + e^{-(b_0 + b_1 * x_1 + b_2 * x_2 + b_3 * x_3 + \ldots + b_n * x_n)}}
$$

Or even be written as:

$$
y_{prob} = \frac{e^{(b_0 + b_1 * x_1 + b_2 * x_2 + b_3 * x_3 + \ldots + b_n * x_n)}}{1 + e^{(b_0 + b_1 * x_1 + b_2 * x_2 + b_3 * x_3 + \ldots + b_n * x_n)}}
$$

In the equations above, we have the probability of the input belonging to a class, instead of its raw value. The fraction has 1 as its numerator and 1 plus a positive quantity as its denominator, so the denominator is always greater than 1 - this means that the whole fraction can never be bigger than 1 (nor smaller than 0).

And what is that exponential term? It is e, the base of the natural logarithm (approximately 2.718282), raised to the power of the linear regression result:

$$
e^{(b_0 + b_1 * x_1 + b_2 * x_2 + b_3 * x_3 + \ldots + b_n * x_n)}
$$

Another way of writing it would be:

$$
ln \left( \frac{p}{1-p} \right) = {(b_0 + b_1 * x_1 + b_2 * x_2 + b_3 * x_3 + \ldots + b_n * x_n)}
$$

In that last equation, ln is the natural logarithm (base e) and p is the probability, so the logarithm of the odds (the probability divided by its complement) is equal to the linear regression result.

In other words, with the linear regression result and the natural logarithm, we can arrive at the probability of an input belonging or not to a given class.

The whole logistic regression derivation process is the following:

$$
p(X) = \frac{e^{(b_0 + b_1 * x_1 + b_2 * x_2 + b_3 * x_3 + \ldots + b_n * x_n)}}{1 + e^{(b_0 + b_1 * x_1 + b_2 * x_2 + b_3 * x_3 + \ldots + b_n * x_n)}}
$$

$$
p(1 + e^{(b_0 + b_1 * x_1 + b_2 * x_2 + b_3 * x_3 + \ldots + b_n * x_n)}) = e^{(b_0 + b_1 * x_1 + b_2 *x_2 + b_3 * x_3 + \ldots + b_n * x_n)}
$$

$$
p + p*e^{(b_0 + b_1 * x_1 + b_2 * x_2 + b_3 * x_3 + \ldots + b_n * x_n)} = e^{(b_0 + b_1 * x_1 + b_2 *x_2 + b_3 * x_3 + \ldots + b_n * x_n)}
$$

$$
p = e^{(b_0 + b_1 * x_1 + b_2 * x_2 + b_3 * x_3 + \ldots + b_n * x_n)} - p * e^{(b_0 + b_1 * x_1 + b_2 * x_2 + b_3 * x_3 + \ldots + b_n * x_n)}
$$

$$
\frac{p}{1-p} = e^{(b_0 + b_1 * x_1 + b_2 *x_2 + b_3 * x_3 + \ldots + b_n * x_n)}
$$

$$
ln \left( \frac{p}{1-p} \right) = (b_0 + b_1 * x_1 + b_2 *x_2 + b_3 * x_3 + \ldots + b_n * x_n)
$$

This means that the logistic regression model also has coefficients and an intercept value: it takes a linear combination of the features and adds a non-linear component to it through the exponential function with base e.

We can see the values of the coefficients and intercept of our model, the same way as we did for linear regression, using coef_ and intercept_ properties:

logreg.coef_

Which displays the coefficients of each of the 12 features:

array([[ 1.43726172, -1.03136968,  0.24099522, -0.61180768,  1.36538261,
        -1.45321951, -1.22826034,  0.98766966,  0.0438686 , -0.78687889,
         1.9601197 , -1.77226097]])

And the intercept:

logreg.intercept_

That results in:

array([0.08735782])

With the coefficients and intercept values, we can calculate the predicted probabilities of our data. Let's get the first X_test values again, as an example:

X_test[:1]

This returns the first row of X_test as a NumPy array:

array([[-1.09830823, -0.93651823, -0.60794138, -1.13255059, -1.0827684 ,
        -1.12235877,  0.45891056, -1.07825898,  0.56284738, -0.17604099,
         0.23661678, -0.36013424]])

Following the initial equation:

$$
p(X) = \frac{e^{(b_0 + b_1 * x_1 + b_2 * x_2 + b_3 * x_3 + \ldots + b_n * x_n)}}{1 + e^{(b_0 + b_1 * x_1 + b_2 * x_2 + b_3 * x_3 + \ldots + b_n * x_n)}}
$$

In Python, we have:

import math

# Reconstructing the linear combination b0 + b1*x1 + ... + b12*x12 term by term
lin_reg = logreg.intercept_[0] + \
((logreg.coef_[0][0]* X_test[:1][0][0])+ \
(logreg.coef_[0][1]* X_test[:1][0][1])+ \
(logreg.coef_[0][2]* X_test[:1][0][2])+ \
(logreg.coef_[0][3]* X_test[:1][0][3])+ \
(logreg.coef_[0][4]* X_test[:1][0][4])+ \
(logreg.coef_[0][5]* X_test[:1][0][5])+ \
(logreg.coef_[0][6]* X_test[:1][0][6])+ \
(logreg.coef_[0][7]* X_test[:1][0][7])+ \
(logreg.coef_[0][8]* X_test[:1][0][8])+ \
(logreg.coef_[0][9]* X_test[:1][0][9])+ \
(logreg.coef_[0][10]* X_test[:1][0][10])+ \
(logreg.coef_[0][11]* X_test[:1][0][11]))

px = math.exp(lin_reg)/(1 +(math.exp(lin_reg)))
px

This results in:

0.45273372469369133
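The same sum of products can be written more compactly as a dot product, which gives an equivalent result (the _compact names are just for illustration):

import numpy as np

# Equivalent, more compact computation of the same linear combination
lin_reg_compact = logreg.intercept_[0] + np.dot(logreg.coef_[0], X_test[0])
px_compact = 1 / (1 + np.exp(-lin_reg_compact)) # same sigmoid, rewritten
px_compact # should match px above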

If we look again at the predict_proba result of the first X_test line, we have:

logreg.predict_proba(X_test[:1])

# Output: array([[0.54726628, 0.45273372]])

This means that the original logistic regression equation gives us the probability of the input belonging to class 1. To find the probability for class 0, we can simply compute:

1 - px

# Output: 0.5472662753063087

Notice that both px and 1 - px are identical to the predict_proba results. This is how logistic regression is calculated, and it is why regression is part of its name. But what about the term logistic?

The term logistic comes from logit, which is a function we have already seen:

$$
ln \left( \frac{p}{1-p} \right)
$$

We have just worked with its ingredients, px and 1 - px. This expression is the logit, also called log-odds, since it is equal to the logarithm of the odds, where p is a probability.
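We can check that numerically: the log-odds computed from px should recover the linear regression result from before (reusing px and the math module from the earlier snippet):

logit = math.log(px / (1 - px))
logit # approximately equal to lin_reg calculated above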

Conclusion

In this guide, we have studied one of the most fundamental machine learning classification algorithms, i.e. logistic regression.

Initially, we implemented logistic regression as a black box with Scikit-Learn's machine learning library, and later we understood it step by step to make clear why and where the terms regression and logistic come from.

We have also explored and studied the data, understanding that this is one of the most crucial parts of a data science analysis.

From here, I would advise you to play around with multiclass logistic regression, logistic regression for more than two classes - you can apply the same logistic regression algorithm for other datasets that have multiple classes, and interpret the results.

Note: A good collection of datasets is available here for you to play with.

I would also advise you to study L1 and L2 regularization. They are a way to "penalize" large coefficients in order to keep the model's complexity under control, so the algorithm can reach a better result. The Scikit-Learn implementation we used already applies L2 regularization by default. Another thing to look at is the different solvers, such as lbfgs, which optimize the logistic regression fitting.
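As a starting point for that exploration, here is a small sketch of how those options are exposed in Scikit-Learn (the C values, solver choices, and variable names are just examples):

# L2 regularization is the default; C controls its strength (smaller C = stronger penalty)
l2_model = LogisticRegression(penalty='l2', C=1.0, solver='lbfgs', random_state=SEED)

# L1 regularization requires a solver that supports it, such as 'liblinear' or 'saga'
l1_model = LogisticRegression(penalty='l1', C=1.0, solver='liblinear', random_state=SEED)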

It is also important to take a look at the statistical approach to logistic regression. It has assumptions about the behavior of data, and about other statistics which must hold to guarantee satisfactory results, such as:

  • the observations are independent;
  • there is no multicollinearity among explanatory variables;
  • there are no extreme outliers;
  • there is a linear relationship between explanatory variables and the logit of the response variable;
  • the sample size is sufficiently large.

Notice how many of those assumptions were already covered in our analysis and treatment of data.

I hope you keep exploring what logistic regression has to offer in all its different approaches!

