### Introduction

Sometimes confused with *linear regression* by novices - due to sharing the term *regression* - *logistic regression* is far different from *linear regression*. While linear regression predicts values such as 2, 2.45, 6.77 or **continuous values**, making it a **regression** algorithm, *logistic regression* predicts values such as 0 or 1, 1 or 2 or 3, which are **discrete values**, making it a **classification** algorithm. Yes, it's called *regression* but is a *classification* algorithm. More on that in a moment.

Therefore, if your data science problem involves continuous values, you can apply a **regression** algorithm (linear regression is one of them). Otherwise, if it involves classifying inputs, discrete values, or classes, you can apply a **classification** algorithm (logistic regression is one of them).

In this guide, we'll be performing logistic regression in Python with the Scikit-Learn library. We will also explain why the word

"regression"is present in the name and how logistic regression works.

To do that, we will first load data that will be classified, visualized, and pre-processed. Then, we will build a logistic regression model that will understand that data. This model will then be evaluated, and employed to predict values based on new input.

### Motivation

The company you work for did a partnership with a Turkish agricultural farm. This partnership involves selling pumpkin seeds. Pumpkin seeds are very important for human nutrition. They contain a good proportion of carbohydrates, fat, protein, calcium, potassium, phosphorus, magnesium, iron, and zinc.

In the data science team, your task is to tell the difference between the types of pumpkin seeds just by using data - or **classifying** the data according to seed type.

The Turkish farm works with two pumpkin seed types, one is called *Çerçevelik* and the other *Ürgüp Sivrisi*.

To classify the pumpkin seeds, your team has followed the 2021 paper *"The use of machine learning methods in classification of pumpkin seeds (Cucurbita pepo L.). Genetic Resources and Crop Evolution"* from Koklu, Sarigil, and Ozbek - in this paper, there is a methodology for photographing and extracting the seeds measurements from the images.

After completing the process described in the paper, the following measurements were extracted:

**Area**- the number of pixels within the borders of a pumpkin seed**Perimeter**- the circumference in pixels of a pumpkin seed**Major Axis Length**- also the circumference in pixels of a pumpkin seed**Minor Axis Length**- the small axis distance of a pumpkin seed**Eccentricity**- the eccentricity of a pumpkin seed**Convex Area**- the number of pixels of the smallest convex shell at the region formed by the pumpkin seed**Extent**- the ratio of a pumpkin seed area to the bounding box pixels**Equivalent Diameter**- the square root of the multiplication of the area of the pumpkin seed by four divided by pi**Compactness**- the proportion of the area of the pumpkin seed relative to the area of the circle with the same circumference**Solidity**- the convex and convex condition of the pumpkin seeds**Roundness**- the ovality of pumpkin seeds without considering its edges distortions**Aspect Ratio**- the aspect ratio of the pumpkin seeds

Those are the measurements you have to work with. Besides the measurements, there is also the **Class** label for the two types of pumpkin seeds.

To start classifying the seeds, let's import the data and begin to look at it.

### Understanding the Dataset

**Note:** You can download the pumpkin dataset here.

After downloading the dataset, we can load it into a dataframe structure using the `pandas`

library. Since it is an excel file, we will use the `read_excel()`

method:

```
import pandas as pd
fpath = 'dataset/pumpkin_seeds_dataset.xlsx' # substitute with the path to your Excel file
df = pd.read_excel(fpath)
```

Once the data is loaded in, we can take a quick peek at the first 5 rows using the `head()`

method:

```
df.head()
```

This results in:

```
Area Perimeter Major_Axis_Length Minor_Axis_Length Convex_Area Equiv_Diameter Eccentricity Solidity Extent Roundness Aspect_Ration Compactness Class
0 56276 888.242 326.1485 220.2388 56831 267.6805 0.7376 0.9902 0.7453 0.8963 1.4809 0.8207 Çerçevelik
1 76631 1068.146 417.1932 234.2289 77280 312.3614 0.8275 0.9916 0.7151 0.8440 1.7811 0.7487 Çerçevelik
2 71623 1082.987 435.8328 211.0457 72663 301.9822 0.8749 0.9857 0.7400 0.7674 2.0651 0.6929 Çerçevelik
3 66458 992.051 381.5638 222.5322 67118 290.8899 0.8123 0.9902 0.7396 0.8486 1.7146 0.7624 Çerçevelik
4 66107 998.146 383.8883 220.4545 67117 290.1207 0.8187 0.9850 0.6752 0.8338 1.7413 0.7557 Çerçevelik
```

Here, we have all the measurements in their respective columns, our **features**, and also the *Class* column, our **target**, which is the last one in the dataframe. We can see how many measurements we have using the `shape`

attribute:

```
df.shape
```

The output is:

```
(2500, 13)
```

The shape result tells us that there are 2500 entries (or rows) in the dataset and 13 columns. Since we know there is one target column - this means we have 12 feature columns.

We can now explore the target variable, the pumpkin seed `Class`

. Since we will predict that variable, it is interesting to see how many samples of each pumpkin seed we have. Usually, the smaller the difference between the number of instances in our classes, the more balanced our sample is and the better our predictions.

This inspection can be done by counting each seed sample with the `value_counts()`

method:

```
df['Class'].value_counts()
```

The above code displays:

```
Çerçevelik 1300
Ürgüp Sivrisi 1200
Name: Class, dtype: int64
```

We can see that there are 1300 samples of the *Çerçevelik* seed and 1200 samples of the *Ürgüp Sivrisi* seed. Notice that the difference between them is 100 samples, a very small difference, which is good for us and indicates there is no need to rebalance the number of samples.

Let's also look at the descriptive statistics of our features with the `describe()`

method to see how well distributed the data is. We will also transpose the resulting table with `T`

to make it easier to compare across statistics:

```
df.describe().T
```

The resulting table is:

```
count mean std min 25% 50% 75% max
Area 2500.0 80658.220800 13664.510228 47939.0000 70765.000000 79076.00000 89757.500000 136574.0000
Perimeter 2500.0 1130.279015 109.256418 868.4850 1048.829750 1123.67200 1203.340500 1559.4500
Major_Axis_Length 2500.0 456.601840 56.235704 320.8446 414.957850 449.49660 492.737650 661.9113
Minor_Axis_Length 2500.0 225.794921 23.297245 152.1718 211.245925 224.70310 240.672875 305.8180
Convex_Area 2500.0 81508.084400 13764.092788 48366.0000 71512.000000 79872.00000 90797.750000 138384.0000
Equiv_Diameter 2500.0 319.334230 26.891920 247.0584 300.167975 317.30535 338.057375 417.0029
Eccentricity 2500.0 0.860879 0.045167 0.4921 0.831700 0.86370 0.897025 0.9481
Solidity 2500.0 0.989492 0.003494 0.9186 0.988300 0.99030 0.991500 0.9944
Extent 2500.0 0.693205 0.060914 0.4680 0.658900 0.71305 0.740225 0.8296
Roundness 2500.0 0.791533 0.055924 0.5546 0.751900 0.79775 0.834325 0.9396
Aspect_Ration 2500.0 2.041702 0.315997 1.1487 1.801050 1.98420 2.262075 3.1444
Compactness 2500.0 0.704121 0.053067 0.5608 0.663475 0.70770 0.743500 0.9049
```

By looking at the table, when comparing the **mean** and **standard deviation** (`std`

) columns, it can be seen that most features have a mean that is far from the standard deviation. That indicates that the data values aren't concentrated around the mean value, but more scattered around it - in other words, they have **high variability**.

Also, when looking at the **minimum** (`min`

) and **maximum** (`max`

) columns, some features, such as `Area`

, and `Convex_Area`

, have big differences between minimum and maximum values. This means that those columns have very small data and also very large data values, or **higher amplitude** between data values.

With high variability, high amplitude, and features with different measurement units, most of our data would benefit from having the same scale for all features or being **scaled**. Data scaling will center data around the mean and reduce its variance.

This scenario probably also indicates that there are outliers and extreme values in data. So, it is best to have some **outlier treatment** besides scaling the data.

There are some machine learning algorithms, for instance, tree-based algorithms such as

Random Forest Classification, that aren't affected by high data variance, outliers, and extreme values.Logistic regressionis different, it is based on a function that categorizes our values, and the parameters of that function can be affected by values that are out of the general data trend and have high variance.

**Advice:** If you'd like to read more about feature scaling - read our *"Feature Scaling Data with Scikit-Learn for Machine Learning"*!

We will understand more about logistic regression in a bit when we get to implement it. For now, we can keep exploring our data.

**Note:** There is a popular saying in Computer Science: *"Garbage in, garbage out" (GIGO)*, that is well suited for machine learning. This means that when we have garbage data - measurements that don't describe the phenomena in themselves, data that wasn't understood and well prepared according to the kind of algorithm or model, will likely generate an incorrect output that won't work on a day to day basis.

This is one of the reasons why exploring, understanding data, and how the chosen model works are so important. By doing that, we can avoid putting garbage in our model - putting value in it instead, and getting value out.

### Visualizing the Data

Up until now, with the descriptive statistics, we have a somewhat abstract snapshot of some qualities of the data. Another important step is to visualize it and confirm our hypothesis of high variance, amplitude, and outliers. To see if what we have observed so far shows in the data, we can plot some graphs.

It is also interesting to see how the features are relating to the two classes that will be predicted. To do that, let's import the `seaborn`

package and use the `pairplot`

graph to look at each feature distribution, and each class separation per feature:

```
import seaborn as sns
# hue='Class' color codes the points according to both classes in the data
sns.pairplot(data=df, hue='Class')
```

**Note:** The above code might take a while to run, since the pairplot combines scatterplots of all the features (it can), and also displays the feature distributions.

Looking at the pairplot, we can see that in most cases the points of the `Çerçevelik`

class are clearly separated from the points of the `Ürgüp Sivrisi`

class. Either the points of one class are to the right when the others are to the left, or some are up while the others are down. If we were to use some kind of curve or line to separate classes, this shows it is easier to separate them, if they were mixed, classification would be a harder task.

In the `Eccentricity`

, `Compactness`

and `Aspect_Ration`

columns, some points that are "isolated" or deviating from the general data trend - outliers - are easily spotted as well.

When looking at the diagonal from the upper left to the bottom right of the chart, notice the data distributions are also color-coded according to our classes. The distribution shapes and the distance between both curves are other indicators of how separable they are - the further from each other, the better. In most cases, they aren't superimposed, which implies that they are easier to separate, also contributing to our task.

In sequence, we can also plot the boxplots of all variables with the `sns.boxplot()`

method. Most times, it is helpful to orient the boxplots horizontally, so the shapes of the boxplots are the same as the distribution shapes, we can do that with the `orient`

argument:

```
# h is from horizontal
sns.boxplot(data=df, orient='h')
```

In the plot above, notice that `Area`

and `Convex_Area`

have such a high magnitude when compared to the magnitudes of the other columns, that they squish the other boxplots. To be able to look at all boxplots, we can scale the features and plot them again.

Before doing that, let's just understand that if there are values of features that are intimately related to other values, for instance - if there are values that also get bigger when other feature values get bigger, having a **positive correlation**; or if there are values that do the opposite, get smaller while other values get smaller, having a **negative correlation**.

This is important to look at because having strong relationships in data might mean that some columns were derived from other columns or have a similar meaning to our model. When that happens, the model results might be overestimated and we want results that are closer to reality. If there are strong correlations, it also means that we can reduce the number of features, and use fewer columns making the model more **parsimonious**.

**Note:** The default correlation calculated with the `corr()`

method is the *Pearson correlation coefficient*. This coefficient is indicated when data is quantitative, normally distributed, doesn't have outliers, and has a linear relationship.

Another choice would be to calculate *Spearman's correlation coefficient*. Spearman's coefficient is used when data is ordinal, non-linear, have any distribution, and has outliers. Notice that our data doesn't totally fit into Pearson or Spearman's assumptions (there are also more correlation methods, such as Kendall's). Since our data is quantitative and it is important for us to measure its linear relationship, we will use Pearson's coefficient.

Let's take a look at the correlations between variables and then we can move to pre-process the data. We will calculate the correlations with the `corr()`

method and visualize them with Seaborn's `heatmap()`

. The heatmap standard size tends to be small, so we will import `matplotlib`

(general visualization engine/library that Seaborn is built on top of) and change the size with `figsize`

:

```
import matplotlib.pyplot as plt
plt.figure(figsize=(15, 10))
correlations = df.corr()
sns.heatmap(correlations, annot=True) # annot=True displays correlation values
```

In this heatmap, the values closer to 1 or -1 are the values we need to pay attention to. The first case denotes a high positive correlation and the second, a high negative correlation. Both values, if not above 0.8 or -0.8 will be beneficial to our logistic regression model.

When there are high correlations such as the one of `0.99`

between `Aspec_Ration`

and `Compactness`

, this means that we can choose to use only `Aspec_Ration`

or only `Compactness`

, instead of both of them (since they'd almost equal *predictors* of each other). The same holds for `Eccentricity`

and `Compactness`

with a `-0.98`

correlation, for `Area`

and `Perimeter`

with a `0.94`

correlation, and some other columns.

### Pre-processing the Data

Since we have already explored the data for a while, we can start pre-processing it. For now, let's use all of the features for the class prediction. After obtaining a first model, a baseline, we can then remove some of the highly correlated columns and compare it to the baseline.

The feature columns will be our `X`

data and the class column, our `y`

target data:

```
y = df['Class']
X = df.drop(columns=['Class'], axis=1)
```

#### Turning Categorical Features into Numeric Features

Regarding our `Class`

column - its values aren't numbers, this means we also need to transform them. There are many ways to do this transformation; here, we will use the `replace()`

method and replace `Çerçevelik`

to `0`

and `Ürgüp Sivrisi`

to `1`

.

```
y = y.replace('Çerçevelik', 0).replace('Ürgüp Sivrisi', 1)
```

Keep the mapping in mind! When reading results from your model, you'll want to convert these back at least in your mind, or back into the classname for other users.

#### Dividing Data into Train and Test Sets

In our exploration, we've noted that the features needed scaling. If we did the scaling now, or in an automatic fashion, we would scale values with the whole of `X`

and `y`

. In that case, we'd introduce *data leakage*, as the values of the soon-to-be test set would have impacted the scaling. Data leakage is a common cause of irreproducible results and illusory high performance of ML models.

Thinking about the scaling shows that we need to first split `X`

and `y`

data further into train and test sets and then to *fit* a scaler on the training set, and to *transform* both the train and test sets (without ever having the test set impact the scaler that does this). For this, we will use Scikit-Learn's `train_test_split()`

method:

```
from sklearn.model_selection import train_test_split
SEED = 42 # defining a SEED to maintain the same results
X_train, X_test, y_train, y_test = train_test_split(X, y,
test_size=.25,
random_state=SEED)
```

Setting `test_size=.25`

is ensuring we are using 25% of the data for testing and 75% for training. This could be omitted, once it is the default split, but the *Pythonic* way to write code advises that being "explicit is better than implicit".

**Note:** The sentence "explicit is better than implicit" is a reference to *The Zen of Python*, or PEP20. It lays out some suggestions for writing Python code. If those suggestions are followed, the code is considered *Pythonic*. You can know more about it here.

After splitting the data into train and test sets, it is a good practice to look at how many records are in each set. That can be done with the `shape`

attribute:

```
X_train.shape, X_test.shape, y_train.shape, y_test.shape
```

This displays:

```
((1875, 12), (625, 12), (1875,), (625,))
```

We can see that after the split, we have 1875 records for training and 625 for testing.

#### Scaling Data

Once we have our train and test sets ready, we can proceed to scale the data with the Scikit-Learn `StandardScaler`

object (or other scalers provided by the library). To avoid leakage, the scaler is fitted to the `X_train`

data and the train values are then used to scale - or transform - both the train and test data:

```
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
```

Since you'll typically call:

```
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
```

The first two lines can be collapsed with a singular `fit_transform()`

call, which fits the scaler on the set, and transforms it in one go. We can now reproduce the boxplot graphs to see the difference after scaling data.

Considering the scaling removes column names, prior to plotting, we can organize train data into a dataframe with column names again to facilitate the visualization:

```
column_names = df.columns[:12] # selecting all columns apart from Class from the original data
X_train = pd.DataFrame(X_train, columns=column_names)
sns.boxplot(data=X_train, orient='h')
```

We can finally see all of our boxplots! Notice that all of them have outliers, and the features that present a distribution further from normal (that have curves either skewed to the left or right), such as `Solidity`

, `Extent`

, `Aspect_Ration`

, and `Compactness`

, are the same that had higher correlations.

#### Removing Outliers with IQR Method

We already know that logistic regression can be impacted by outliers. One of the ways of treating them is to use a method called **Interquartile Range** or **IQR**. The initial step of the IQR method is to divide our train data into four parts, called quartiles. The first quartile, **Q1**, amounts to 25% of data, the second, **Q2**, to 50%, the third, **Q3**, to 75%, and the last one, **Q4**, to 100%. The boxes in the boxplot are defined by the IQR method and are a visual representation of it.

Considering a horizontal boxplot, the vertical line on the left marks 25% of the data, the vertical line in the middle, 50% of the data (or the median), and the last vertical line on the right, 75% of the data. The more even in size both squares defined by the vertical lines are - or the more the median vertical line is in the middle - means that our data is closer to the normal distribution or less skewed, which is helpful for our analysis.

Besides the IQR box, there are also horizontal lines on both sides of it. Those lines mark the minimum and maximum distribution values defined by

$$

Minimum = Q1 - 1.5*IQR

$$

and

$$

Maximum = Q3 + 1.5*IQR

$$

IQR is exactly the difference between Q3 and Q1 (or Q3 - Q1) and it is the most central point of data. That is why when finding the IQR, we end up filtering the outliers in the data extremities, or in the minimum and maximum points. Box plots give us a sneak peek of what the result of the IQR method will be.

We can use Pandas `quantile()`

method to find our quantiles, and `iqr`

from the `scipy.stats`

package to obtain the interquartile data range for each column:

```
from scipy.stats import iqr
Q1 = X_train.quantile(q=.25)
Q3 = X_train.quantile(q=.75)
IQR = X_train.apply(iqr)
```

Now we have Q1, Q3, and IQR, we can filter out the values closer to the median:

```
# Calculating minimum and maximum values
minimum = X_train < (Q1-1.5*IQR)
maximum = X_train > (Q3+1.5*IQR)
# The tilde (~) is a reverse operator, and it will select any row that is not below the minimum or above the maximum area, this is our IQR filter
filter = ~(minimum | maximum).any(axis=1)
# We can now select the IQR rows in X_train
X_train = X_train[filter]
```

After filtering our training rows, we can see how many of them are still in the data with `shape`

:

```
X_train.shape
```

This results in:

```
(1714, 12)
```

We can see that the number of rows went from 1875 to 1714 after filtering. This means that 161 rows contained outliers or 8.5% of the data.

**Note:** It is advised that the filtering of outliers, removal of NaN values, and other actions which involve filtering and cleansing data stay below or up to 10% of data. Try thinking of other solutions if your filtering or removal exceeds 10% of your data.

After removing outliers, we are almost ready to include data in the model. For the model fitting, we will use train data. `X_train`

is filtered, but what about `y_train`

?

```
y_train.shape
```

This outputs:

```
(1875,)
```

Notice that `y_train`

still has 1875 rows. We need to match the number of `y_train`

rows to the number of `X_train`

rows and not just arbitrarily. We need to remove the y-values of the instances of pumpkin seeds that we removed, which are likely scattered through the `y_train`

set. The filtered `X_train`

still has its original indices and the index has gaps where we removed outliers! We can then use the index of the `X_train`

DataFrame to search for the corresponding values in `y_train`

:

```
y_train = y_train.iloc[X_train.index]
```

After doing that, we can look at the `y_train`

shape again:

```
y_train.shape
```

Which outputs:

```
(1714,)
```

## Free eBook: Git Essentials

Check out our hands-on, practical guide to learning Git, with best-practices, industry-accepted standards, and included cheat sheet. Stop Googling Git commands and actually *learn* it!

Now, `y_train`

also has 1714 rows and they are the same as the `X_train`

rows. We are finally ready to create our logistic regression model!

### Implementing the Logistic Regression Model

The hard part is done! Preprocessing is usually more difficult than model development, when it comes to using libraries like Scikit-Learn, which have streamlined the application of ML models to just a couple of lines.

First, we import the `LogisticRegression`

class and instantiate it, creating a `LogisticRegression`

object:

```
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression(random_state=SEED)
```

Second, we fit our train data to the `logreg`

model with the `fit()`

method, and predict our test data with the `predict()`

method, storing the results as `y_pred`

:

```
# When fitting a DataFrame, rather than a bare NumPy array
# to avoid exceptions, we'll feed only the values without column names
logreg.fit(X_train.values, y_train)
y_pred = logreg.predict(X_test)
```

We have already made predictions with our model! Let's look at the first 3 rows in `X_train`

to see what data we have used:

```
X_train[:3]
```

The code above outputs:

```
Area Perimeter Major_Axis_Length Minor_Axis_Length Convex_Area Equiv_Diameter Eccentricity Solidity Extent Roundness Aspect_Ration Compactness
0 -1.098308 -0.936518 -0.607941 -1.132551 -1.082768 -1.122359 0.458911 -1.078259 0.562847 -0.176041 0.236617 -0.360134
1 -0.501526 -0.468936 -0.387303 -0.376176 -0.507652 -0.475015 0.125764 0.258195 0.211703 0.094213 -0.122270 0.019480
2 0.012372 -0.209168 -0.354107 0.465095 0.003871 0.054384 -0.453911 0.432515 0.794735 0.647084 -0.617427 0.571137
```

And at the first 3 predictions in `y_pred`

to see the results:

```
y_pred[:3]
```

This results in:

```
array([0, 0, 0])
```

For those three rows, our predictions were that they were seeds of the first class, `Çerçevelik`

.

With *logistic regression*, instead of predicting the final class, such as `0`

, we can also predict the probability the row has of pertaining to the `0`

class. This is what actually happens when logistic regression classifies data, and the `predict()`

method then passes this prediction through a threshold to return a "hard" class. To predict the probability of pertaining to a class, `predict_proba()`

is used:

```
y_pred_proba = logreg.predict_proba(X_test)
```

Let's also take a look at the first 3 values of the y probabilities predictions:

```
y_pred_proba[:3]
```

Which outputs:

```
# class 0 class 1
array([[0.54726628, 0.45273372],
[0.56324527, 0.43675473],
[0.86233349, 0.13766651]])
```

Now, instead of three zeros, we have one column for each class. In the column to the left, starting with `0.54726628`

, are the probabilities of the data pertaining to the class `0`

; and in the right column, starting with `0.45273372`

, are the probability of it pertaining to the class `1`

.

**Note:** This difference in classification is also known as *hard* and *soft* prediction. Hard prediction boxes the prediction into a class, while soft predictions outputs the *probability* of the instance belonging to a class.

There is more information on how the predicted output was made. It wasn't actually `0`

, but a 55% chance of class `0`

, and a 45% chance of class `1`

. This surfaces how the first three `X_test`

data points, pertaining to class `0`

, are really clear only regarding the third data point, with a 86% probability - and not so much for the first two data points.

When communicating findings using ML methods - it's typically best to return a soft class, and the associated probability as the

"confidence"of that classification.

We will talk more about how that is calculated when we go deeper into the model. At this time, we can proceed to the next step.

### Evaluating the Model with Classification Reports

The third step is to see how the model performs on test data. We can import Scikit-Learn `classification_report()`

and pass our `y_test`

and `y_pred`

as arguments. After that, we can print out its response.

The classification report contains the most used classification metrics, such as **precision**, **recall**, **f1-score**, and **accuracy**.

**Precision**: to understand what correct prediction values were considered correct by our classifier. Precision will divide those true positives values by anything that was predicted as a positive:

$$

precision = \frac{\text{true positive}}{\text{true positive} + \text{false positive}}

$$

**Recall**: to understand how many of the true positives were identified by our classifier. The recall is calculated by dividing the true positives by anything that should have been predicted as positive:

$$

recall = \frac{\text{true positive}}{\text{true positive} + \text{false negative}}

$$

**F1 score**: is the balanced or*harmonic mean*of precision and recall. The lowest value is 0 and the highest is 1. When`f1-score`

is equal to 1, it means all classes were correctly predicted - this is a very hard score to obtain with real data:

$$

\text{f1-score} = 2* \frac{\text{precision} * \text{recall}}{\text{precision} + \text{recall}}

$$

**Accuracy**: describes how many predictions our classifier got right. The lowest accuracy value is 0 and the highest is 1. That value is usually multiplied by 100 to obtain a percentage:

$$

accuracy = \frac{\text{number of correct predictions}}{\text{total number of predictions}}

$$

**Note:** It is extremely hard to obtain 100% accuracy on any real data, if that happens, be aware that some leakage or something wrong might be happening - there is no consensus on an ideal accuracy value and it is also context-dependent. A value of 70%, which means the classifier will make mistakes on 30% of the data, or above 70% tends to be sufficient for most models.

```
from sklearn.metrics import classification_report
cr = classification_report(y_test, y_pred)
print(cr)
```

We can then look at the classification report output:

```
precision recall f1-score support
0 0.83 0.91 0.87 316
1 0.90 0.81 0.85 309
accuracy 0.86 625
macro avg 0.86 0.86 0.86 625
weighted avg 0.86 0.86 0.86 625
```

This is our result. Notice that `precision`

, `recall`

, `f1-score`

, and `accuracy`

metrics are all very high, above 80%, which is ideal - but those results were probably influenced by high correlations, and won't sustain in the long run.

The model's accuracy is 86%, meaning that it gets the classification wrong 14% of the time. We have that overall information, but it would be interesting to know if the 14% mistakes happen regarding the classification of class `0`

or class `1`

. To identify which classes are misidentified as which, and in which frequency - we can compute and plot a *confusion matrix* of our model's predictions.

### Evaluating the Model with a Confusion Matrix

Let's calculate and then plot the confusion matrix. After doing that, we can understand each part of it. To plot the confusion matrix, we'll use Scikit-Learn `confusion_matrix()`

, which we'll import from the `metrics`

module.

The confusion matrix is easier to visualize using a Seaborn `heatmap()`

. So, after generating it, we will pass our confusion matrix as an argument for the heatmap:

```
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt='d')
```

**Confusion Matrix**: the matrix shows how many samples the model got right or wrong for each class. The values that were correct and correctly predicted are called, and the ones that were predicted as positives but weren't positives are called**true positives**. The same nomenclature of**false positives**and**true negatives**is used for negative values;**false negatives**

By looking at the confusion matrix plot, we can see that we have `287`

values that were `0`

and predicted as `0`

- or *true positives* for class `0`

(the Çerçevelik seeds). We also have `250`

true positives for class `1`

(Ürgüp Sivrisi seeds). The true positives are always located in the matrix diagonal that goes from the upper left to the lower right.

We also have `29`

values that were supposed to be `0`

, but predicted as `1`

(*false positives*) and `59`

values that were `1`

and predicted as `0`

(*false negatives*). With those numbers, we can understand that the error that the model makes the most is that it predicts false negatives. So, it can mostly end up classifying an Ürgüp Sivrisi seed as a Çerçevelik seed.

This kind of error is also explained by the 81% recall of class `1`

. Notice that the metrics are connected. And the difference in the recall is coming from having 100 fewer samples of the Ürgüp Sivrisi class. This is one of the implications of having just a few samples less than the other class. To further improve recall, you can either experiment with class weights or use more Ürgüp Sivrisi samples.

So far, we have executed most of the data science traditional steps and used the logistic regression model as a black box.

**Note:** If you want to go further, use Cross Validation (CV) and Grid Search to look for, respectively, the model that generalizes the most regarding data, and the best model parameters that are chosen before training, or **hyperparameters**.

Ideally, with CV and Grid Search, you could also implement a concatenated way to do data pre-processing steps, data split, modeling, and evaluation - which is made easy with Scikit-Learn pipelines.

Now it's the time to open the black box and look inside of it, to go deeper into understanding how logistic regression works.

### Going Deeper into How Logistic Regression Really Works

The *regression* word is not there by accident, to understand what logistic regression does, we can remember what its sibling, linear regression does to the data. The linear regression formula was the following:

$$

y = b_0 + b_1 * x_1 + b_2 * x_2 + b_3 * x_3 + \ldots + b_n * x_n

$$

In which b_{0} was the regression intercept, b^{1} the coefficient and x^{1} the data.

That equation resulted in a straight line that was used to predict new values. Recalling the introduction, the difference now is that we won't predict new values, but a class. So that straight line needs to change. With logistic regression, we introduce a non-linearity and the prediction is now made using a curve instead of a line:

Observe that while the linear regression line keeps going and is made of continuous infinite values, the logistic regression curve can be divided in the middle and has extremes in 0 and 1 values. That "S" shape is the reason it classifies data - the points that are closer or fall on the highest extremity belong to class 1, while the points that are in the lower quadrant or closer to 0, belong to class 0. The middle of the "S" is the middle between 0 and 1, 0.5 - it is the threshold for the logistic regression points.

We already understand the visual difference between logistic and linear regression, but what about the formula? The formula for logistic regression is the following:

$$

y = b_0 + b_1 * x_1 + b_2 * x_2 + b_3 * x_3 + \ldots + b_n * x_n

$$

It can also be written as:

$$

y_{prob} = \frac{1}{1 + e^{(b_0 + b_1 * x_1 + b_2 * x_2 + b_3 * x_3 + \ldots + b_n * x_n)}}

$$

Or even be written as:

$$

y_{prob} = \frac{e^{(b_0 + b_1 * x_1 + b_2 * x_2 + b_3 * x_3 + \ldots + b_n * x_n)}}{1 + e^{(b_0 + b_1 * x_1 + b_2 * x_2 + b_3 * x_3 + \ldots + b_n * x_n)}}

$$

In the equation above, we have the probability of input, instead of its value. It has 1 as its numerator so it can result in a value between 0 and 1, and 1 plus a value in its denominator, so that its value is 1 and something - this means that the whole fraction result can't be bigger than 1.

And what is the value that is in the denominator? It is **e**, the base of the natural logarithm (approximately 2.718282), raised to the power of linear regression:

$$

e^{(b_0 + b_1 * x_1 + b_2 * x_2 + b_3 * x_3 + \ldots + b_n * x_n)}

$$

Another way of writing it would be:

$$

ln \left( \frac{p}{1-p} \right) = {(b_0 + b_1 * x_1 + b_2 * x_2 + b_3 * x_3 + \ldots + b_n * x_n)}

$$

In that last equation, **ln** is the natural logarithm (base e) and **p** is the probability, so the logarithm of the probability of the result is the same as the linear regression result.

In other words, with the linear regression result and the natural logarithm, we can arrive at the probability of an input pertaining or not to a designed class.

The whole logistic regression derivation process is the following:

$$

p{X} = \frac{e^{(b_0 + b_1 * x_1 + b_2 * x_2 + b_3 * x_3 + \ldots + b_n * x_n)}}{1 + e^{(b_0 + b_1 * x_1 + b_2 * x_2 + b_3 * x_3 + \ldots + b_n * x_n)}}

$$

$$

p(1 + e^{(b_0 + b_1 * x_1 + b_2 * x_2 + b_3 * x_3 + \ldots + b_n * x_n)}) = e^{(b_0 + b_1 * x_1 + b_2 *x_2 + b_3 * x_3 + \ldots + b_n * x_n)}

$$

$$

p + p*e^{(b_0 + b_1 * x_1 + b_2 * x_2 + b_3 * x_3 + \ldots + b_n * x_n)} = e^{(b_0 + b_1 * x_1 + b_2 *x_2 + b_3 * x_3 + \ldots + b_n * x_n)}

$$

$$

\frac{p}{1-p} = e^{(b_0 + b_1 * x_1 + b_2 *x_2 + b_3 * x_3 + \ldots + b_n * x_n)}

$$

$$

ln \left( \frac{p}{1-p} \right) = (b_0 + b_1 * x_1 + b_2 *x_2 + b_3 * x_3 + \ldots + b_n * x_n)

$$

This means that the logistic regression model also has coefficients and an intercept value. Because it uses a linear regression and adds a non-linear component to it with the natural logarithm (`e`

).

We can see the values of the coefficients and intercept of our model, the same way as we did for linear regression, using `coef_`

and `intercept_`

properties:

```
logreg.coef_
```

Which displays the coefficients of each of the 12 features:

```
array([[ 1.43726172, -1.03136968, 0.24099522, -0.61180768, 1.36538261,
-1.45321951, -1.22826034, 0.98766966, 0.0438686 , -0.78687889,
1.9601197 , -1.77226097]])
```

```
logreg.intercept_
```

That results in:

```
array([0.08735782])
```

With the coefficients and intercept values, we can calculate the predicted probabilities of our data. Let's get the first `X_test`

values again, as an example:

```
X_test[:1]
```

This returns the first row of `X_test`

as a NumPy array:

```
array([[-1.09830823, -0.93651823, -0.60794138, -1.13255059, -1.0827684 ,
-1.12235877, 0.45891056, -1.07825898, 0.56284738, -0.17604099,
0.23661678, -0.36013424]])
```

Following the initial equation:

$$

p{X} = \frac{e^{(b_0 + b_1 * x_1 + b_2 * x_2 + b_3 * x_3 + \ldots + b_n * x_n)}}{1 + e^{(b_0 + b_1 * x_1 + b_2 * x_2 + b_3 * x_3 + \ldots + b_n * x_n)}}

$$

In python, we have:

```
import math
lin_reg = logreg.intercept_[0] + \
((logreg.coef_[0][0]* X_test[:1][0][0])+ \
(logreg.coef_[0][1]* X_test[:1][0][1])+ \
(logreg.coef_[0][2]* X_test[:1][0][2])+ \
(logreg.coef_[0][3]* X_test[:1][0][3])+ \
(logreg.coef_[0][4]* X_test[:1][0][4])+ \
(logreg.coef_[0][5]* X_test[:1][0][5])+ \
(logreg.coef_[0][6]* X_test[:1][0][6])+ \
(logreg.coef_[0][7]* X_test[:1][0][7])+ \
(logreg.coef_[0][8]* X_test[:1][0][8])+ \
(logreg.coef_[0][9]* X_test[:1][0][9])+ \
(logreg.coef_[0][10]* X_test[:1][0][10])+ \
(logreg.coef_[0][11]* X_test[:1][0][11]))
px = math.exp(lin_reg)/(1 +(math.exp(lin_reg)))
px
```

This results in:

```
0.45273372469369133
```

If we look again at the `predict_proba`

result of the first `X_test`

line, we have:

```
logreg.predict_proba(X_test[:1])
# Output: array([[0.54726628, 0.45273372]])
```

This means that the original logistic regression equation gives us the probability of the input regarding class `1`

, to find out which probability is for class `0`

, we can simply:

```
1 - px
# Output: 0.5472662753063087
```

Notice that both `px`

and `1-px`

are identical to `predict_proba`

results. This is how logistic regression is calculated and why *regression* is part of its name. But what about the term *logistic*?

The term *logistic* comes from **logit**, which is a function we have already seen:

$$

ln \left( \frac{p}{1-p} \right)

$$

We have just calculated it with `px`

and `1-px`

. This is the logit, also called **log-odds** since it is equal to the logarithm of the odds where `p`

is a probability.

### Conclusion

In this guide, we have studied one of the most fundamental machine learning classification algorithms, i.e. *logistic regression*.

Initially, we implemented logistic regression as a black box with Scikit-Learn's machine learning library, and later we understood it step by step to have a clear why and where the terms regression and logistic come from.

We have also explored and studied the data, understanding that is one of the most crucial parts of a data science analysis.

From here, I would advise you to play around with **multiclass logistic regression**, logistic regression for more than two classes - you can apply the same logistic regression algorithm for other datasets that have multiple classes, and interpret the results.

**Note:** A good collection of datasets is available here for you to play with.

I would also advise you to study the L1 and L2 **regularizations**. They are a way to "penalize" the higher data in order for it to become closer to normal, holding out the model's complexity, so the algorithm can get to a better result. The Scikit-Learn implementation we used already has L2 regularization by default. Another thing to look at is the different **solvers**, such as `lbgs`

, which optimize the logistic regression algorithm performance.

It is also important to take a look at the **statistical** approach to logistic regression. It has **assumptions** about the behavior of data, and about other statistics which must hold to guarantee satisfactory results, such as:

- the observations are independent;
- there is no multicollinearity among explanatory variables;
- there are no extreme outliers;
- there is a linear relationship between explanatory variables and the logit of the response variable;
- the sample size is sufficiently large.

Notice how many of those assumptions were already covered in our analysis and treatment of data.

I hope you keep exploring what logistic regression has to offer in all its different approaches!