Introduction
The K-nearest Neighbors (KNN) algorithm is a type of supervised machine learning algorithm used for classification, regression as well as outlier detection. It is extremely easy to implement in its most basic form but can perform fairly complex tasks. It is a lazy learning algorithm since it doesn't have a specialized training phase. Rather, it uses all of the data for training while classifying (or regressing) a new data point or instance.
KNN is a non-parametric learning algorithm, which means that it doesn't assume anything about the underlying data. This is an extremely useful feature since most of the real-world data doesn't really follow any theoretical assumption e.g. linear separability, uniform distribution, etc.
In this guide, we will see how KNN can be implemented with Python's Scikit-Learn library. Before that, we'll first explore how we can use KNN and explain the theory behind it. After that, we'll take a look at the California Housing dataset we'll be using to illustrate the KNN algorithm and several of its variations. First, we'll see how to implement the KNN algorithm for regression, followed by implementations of KNN classification and outlier detection. In the end, we'll conclude with some of the pros and cons of the algorithm.
When Should You Use KNN?
Suppose you wanted to rent an apartment and recently found out your friend's neighbor might put her apartment up for rent in 2 weeks. Since the apartment isn't on a rental website yet, how could you try to estimate its rental value?
Let's say your friend pays $1,200 in rent. Your rent value might be around that number, but the apartments aren't exactly the same (orientation, area, furniture quality, etc.), so, it would be nice to have more data on other apartments.
By asking other neighbors and looking at apartments from the same building listed on a rental website, you find that the four closest neighboring apartments rent for $1,200, $1,210, $1,210, and $1,215. Those apartments are on the same block and floor as your friend's apartment.
Other apartments that are further away, on the same floor, but in a different block have rents of $1,400, $1,430, $1,500, and $1,470. It seems they are more expensive due to having more light from the sun in the evening.
Considering the apartment's proximity, it seems your estimated rent would be around $1,210. That is the general idea of what the K-Nearest Neighbors (KNN) algorithm does! It classifies or regresses new data based on its proximity to already existing data.
Translate the Example into Theory
When the estimated value is a continuous number, such as the rent value, KNN is used for regression. But we could also divide apartments into categories based on the minimum and maximum rent, for instance. When the value is discrete, making it a category, KNN is used for classification.
There is also the possibility of estimating which neighbors are so different from others that they will probably stop paying rent. This is the same as detecting which data points are so far away that they don't fit into any value or category; when that happens, KNN is used for outlier detection.
In our example, we also already knew the rents of each apartment, which means our data was labeled. KNN uses previously labeled data, which makes it a supervised learning algorithm.
KNN is extremely easy to implement in its most basic form, and yet performs quite complex classification, regression, or outlier detection tasks.
Each time a new data point needs a value (regression) or a class (classification), KNN does all of its work at that moment, using the stored data directly - there is no specialized training phase beforehand. Because the computation is deferred until prediction time, it is called a lazy learning algorithm.
KNN also doesn't assume anything about the underlying data characteristics: it doesn't expect the data to fit some type of distribution, such as uniform, or to be linearly separable. This means it is a non-parametric learning algorithm. This is an extremely useful property, since most real-world data doesn't really follow any theoretical assumption.
Visualizing Different Uses of the KNN
As we've seen, the intuition behind the KNN algorithm is one of the most direct of all the supervised machine learning algorithms. The algorithm first calculates the distance of a new data point to all other training data points.
Note: The distance can be measured in different ways. You can use a Minkowski, Euclidean, Manhattan, Mahalanobis or Hamming formula, to name a few metrics. With high dimensional data, Euclidean distance often starts failing (high dimensionality is... weird), and Manhattan distance is used instead.
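To make those metrics a bit more concrete, here is a minimal sketch - with made-up points - of how the Euclidean and Manhattan distances are computed; Scikit-Learn's KNN estimators also accept a metric argument (e.g. metric='manhattan') if you want to swap the distance function:

import numpy as np

# Two hypothetical points with three features each
a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 0.0, 3.5])

# Euclidean distance: square root of the sum of squared differences
euclidean = np.sqrt(np.sum((a - b) ** 2))
# Manhattan distance: sum of absolute differences
manhattan = np.sum(np.abs(a - b))

print(euclidean)  # ~3.64
print(manhattan)  # 5.5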
After calculating the distance, KNN selects a number of nearest data points - 2, 3, 10, or really, any integer. This number of points (2, 3, 10, etc.) is the K in K-Nearest Neighbors!
In the final step, if it is a regression task, KNN will calculate the (weighted) average of the values of the K nearest points to form the prediction. If it is a classification task, the new data point will be assigned to the class to which the majority of the selected K nearest points belong.
Let's visualize the algorithm in action with the help of a simple example. Consider a dataset with two variables and a K of 3.
When performing regression, the task is to find the value of a new data point based on the weighted average of the values of the 3 nearest points.
KNN with K = 3, when used for regression:
The KNN algorithm will start by calculating the distance of the new point from all the points. It then finds the 3 points with the least distance to the new point. This is shown in the second figure above, in which the three nearest points, 47, 58, and 79, have been encircled. After that, it calculates the weighted average of 47, 58 and 79 - in this case the weights are equal to 1 - we are considering all points as equals, but we could also assign different weights based on distance. After calculating the weighted average, the new point's value is 61.33.
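As a quick sanity check of that arithmetic, here is a small sketch in plain NumPy - the neighbor values come from the figure, while the distances in the weighted variant are made up for illustration:

import numpy as np

neighbor_values = np.array([47, 58, 79])

# Equal weights: a plain mean of the 3 nearest values
print(neighbor_values.mean())  # 61.33...

# Hypothetical distances to the new point - closer neighbors get larger weights
distances = np.array([1.0, 2.0, 4.0])
print(np.average(neighbor_values, weights=1 / distances))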
And when performing a classification, the KNN task is to classify a new data point into the "Purple" or "Red" class.
KNN with K = 3, when used for classification:
The KNN algorithm will start in the same way as before, by calculating the distance of the new point from all the points and finding the 3 nearest points with the least distance to the new point. Then, instead of calculating a number, it assigns the new point to the class to which the majority of the three nearest points belong - the red class. Therefore, the new data point will be classified as "Red".
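The classification step can be sketched just as simply - a majority vote over the labels of the K nearest points (the labels below are made up to mirror the figure):

from collections import Counter

# Hypothetical labels of the 3 nearest neighbors
nearest_labels = ["Red", "Red", "Purple"]

# The predicted class is the most common label among the K neighbors
predicted_class = Counter(nearest_labels).most_common(1)[0][0]
print(predicted_class)  # Red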
The outlier detection process is different from both of the above; we will talk more about it when implementing it after the regression and classification implementations.
Note: The code provided in this guide has been executed and tested in a Jupyter notebook.
The Scikit-Learn California Housing Dataset
We are going to use the California housing dataset to illustrate how the KNN algorithm works. The dataset was derived from the 1990 U.S. census. One row of the dataset represents the census of one block group.
In this section, we'll go over the details of the California Housing Dataset, so you can gain an intuitive understanding of the data we'll be working with. It's very important to get to know your data before you start working on it.
A block group is the smallest geographical unit for which the U.S. Census Bureau publishes sample data. Besides block group, another term used is household: a household is a group of people residing within a home.
The dataset consists of nine attributes:
- MedInc - median income in block group
- HouseAge - median house age in a block group
- AveRooms - the average number of rooms (provided per household)
- AveBedrms - the average number of bedrooms (provided per household)
- Population - block group population
- AveOccup - the average number of household members
- Latitude - block group latitude
- Longitude - block group longitude
- MedHouseVal - median house value for California districts (hundreds of thousands of dollars)
The dataset is already part of the Scikit-Learn library; we only need to import it and load it as a dataframe:
from sklearn.datasets import fetch_california_housing
# as_frame=True loads the data in a dataframe format, with other metadata besides it
california_housing = fetch_california_housing(as_frame=True)
# Select only the dataframe part and assign it to the df variable
df = california_housing.frame
Importing the data directly from Scikit-Learn brings in more than just the columns and numbers - it returns a Bunch object that also contains the data description, which is why we've extracted just the frame. Further details of the dataset are available in the Scikit-Learn documentation.
Let's import Pandas and take a peek at the first few rows of data:
import pandas as pd
df.head()
Executing the code will display the first five rows of our dataset:
MedInc HouseAge AveRooms AveBedrms Population AveOccup Latitude Longitude MedHouseVal
0 8.3252 41.0 6.984127 1.023810 322.0 2.555556 37.88 -122.23 4.526
1 8.3014 21.0 6.238137 0.971880 2401.0 2.109842 37.86 -122.22 3.585
2 7.2574 52.0 8.288136 1.073446 496.0 2.802260 37.85 -122.24 3.521
3 5.6431 52.0 5.817352 1.073059 558.0 2.547945 37.85 -122.25 3.413
4 3.8462 52.0 6.281853 1.081081 565.0 2.181467 37.85 -122.25 3.422
In this guide, we will use MedInc, HouseAge, AveRooms, AveBedrms, Population, AveOccup, Latitude, and Longitude to predict MedHouseVal - much like in our motivating rental example.
Let's now jump right into the implementation of the KNN algorithm for regression.
Regression with K-Nearest Neighbors with Scikit-Learn
So far, we have gotten to know our dataset and can now proceed to the other steps of the KNN workflow.
Preprocessing Data for KNN Regression
The preprocessing is where the first differences between the regression and classification tasks appear. Since this section is all about regression, we'll prepare our dataset accordingly.
For regression, we want to predict the median house value. To do so, we will assign MedHouseVal to y and all other columns to X, just by dropping MedHouseVal:
y = df['MedHouseVal']
X = df.drop(['MedHouseVal'], axis = 1)
By looking at our variables' descriptions, we can see that we have differences in measurements. To avoid guessing, let's use the describe() method to check:
# .T transposes the results, transforming rows into columns
X.describe().T
This results in:
count mean std min 25% 50% 75% max
MedInc 20640.0 3.870671 1.899822 0.499900 2.563400 3.534800 4.743250 15.000100
HouseAge 20640.0 28.639486 12.585558 1.000000 18.000000 29.000000 37.000000 52.000000
AveRooms 20640.0 5.429000 2.474173 0.846154 4.440716 5.229129 6.052381 141.909091
AveBedrms 20640.0 1.096675 0.473911 0.333333 1.006079 1.048780 1.099526 34.066667
Population 20640.0 1425.476744 1132.462122 3.000000 787.000000 1166.000000 1725.000000 35682.000000
AveOccup 20640.0 3.070655 10.386050 0.692308 2.429741 2.818116 3.282261 1243.333333
Latitude 20640.0 35.631861 2.135952 32.540000 33.930000 34.260000 37.710000 41.950000
Longitude 20640.0 -119.569704 2.003532 -124.350000 -121.800000 -118.490000 -118.010000 -114.310000
Here, we can see that the mean value of MedInc is approximately 3.87 and the mean value of HouseAge is about 28.64, making it 7.4 times larger than MedInc. Other features also have differences in mean and standard deviation - to see that, look at the mean and std values and observe how distant they are from each other. For MedInc, std is approximately 1.9; for HouseAge, std is 12.59; and the same applies to the other features.
We're using an algorithm based on distance and distance-based algorithms suffer greatly from data that isn't on the same scale, such as this data. The scale of the points may (and in practice, almost always does) distort the real distance between values.
To perform Feature Scaling, we will use Scikit-Learn's StandardScaler class later. If we applied the scaling right now (before the train-test split), the calculation would include test data, effectively leaking test data information into the rest of the pipeline. This kind of data leakage is unfortunately commonly overlooked, resulting in irreproducible or illusory findings.
Advice: If you'd like to learn more about feature scaling - read our "Feature Scaling Data with Scikit-Learn for Machine Learning in Python"
Splitting Data into Train and Test Sets
To be able to scale our data without leakage, but also to evaluate our results and to avoid over-fitting, we'll divide our dataset into train and test splits.
A straightforward way to create train and test splits is the train_test_split method from Scikit-Learn. The split doesn't cut the data linearly at some point, but samples X% and Y% randomly. To make this process reproducible (so the method always samples the same data points), we'll set the random_state argument to a certain SEED:
from sklearn.model_selection import train_test_split
SEED = 42
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=SEED)
This piece of code samples 75% of the data for training and 25% of the data for testing. By changing test_size to 0.3, for instance, you could train with 70% of the data and test with 30%.
By using 75% of the data for training and 25% for testing, out of 20640 records, the training set contains 15480 and the test set contains 5160. We can inspect those numbers quickly by printing the lengths of the full dataset and of split data:
len(X) # 20640
len(X_train) # 15480
len(X_test) # 5160
Great! We can now fit the data scaler on the X_train set, and scale both X_train and X_test without leaking any data from X_test into X_train.
Advice: If you'd like to learn more about the train_test_split() method, the importance of a train-test-validation split, and how to separate out validation sets, read our "Scikit-Learn's train_test_split() - Training, Testing and Validation Sets".
Feature Scaling for KNN Regression
By importing StandardScaler, instantiating it, fitting it on our train data (preventing leakage), and then transforming both the train and test datasets, we can perform feature scaling:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
# Fit only on X_train
scaler.fit(X_train)
# Scale both X_train and X_test
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
Note: Since you'll oftentimes call scaler.fit(X_train) followed by scaler.transform(X_train) - you can call a single scaler.fit_transform(X_train) followed by scaler.transform(X_test) to make the call shorter!
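Alternatively, you can chain the scaler and the model in a Scikit-Learn Pipeline, so the fit-on-train/transform-both bookkeeping is handled for you. This is just a sketch of that approach (not how the rest of this guide proceeds), and it assumes the unscaled X_train/X_test splits from the previous step:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsRegressor

# The pipeline fits the scaler on the training data only and applies the same
# transformation to anything passed to predict() or score()
knn_pipeline = make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=5))
knn_pipeline.fit(X_train, y_train)
print(knn_pipeline.score(X_test, y_test))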
Now our data is scaled! The scaler keeps only the data values, not the column names, when applied to a DataFrame. Let's organize the data into a DataFrame again with column names and use describe() to observe the changes in mean and std:
col_names=['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'Population', 'AveOccup', 'Latitude', 'Longitude']
scaled_df = pd.DataFrame(X_train, columns=col_names)
scaled_df.describe().T
This will give us:
count mean std min 25% 50% 75% max
MedInc 15480.0 2.074711e-16 1.000032 -1.774632 -0.688854 -0.175663 0.464450 5.842113
HouseAge 15480.0 -1.232434e-16 1.000032 -2.188261 -0.840224 0.032036 0.666407 1.855852
AveRooms 15480.0 -1.620294e-16 1.000032 -1.877586 -0.407008 -0.083940 0.257082 56.357392
AveBedrms 15480.0 7.435912e-17 1.000032 -1.740123 -0.205765 -0.108332 0.007435 55.925392
Population 15480.0 -8.996536e-17 1.000032 -1.246395 -0.558886 -0.227928 0.262056 29.971725
AveOccup 15480.0 1.055716e-17 1.000032 -0.201946 -0.056581 -0.024172 0.014501 103.737365
Latitude 15480.0 7.890329e-16 1.000032 -1.451215 -0.799820 -0.645172 0.971601 2.953905
Longitude 15480.0 2.206676e-15 1.000032 -2.380303 -1.106817 0.536231 0.785934 2.633738
Observe how all the standard deviations are now 1 and the means are close to 0. This is what puts our features on the same scale! Let's train and evaluate a KNN-based regressor.
Training and Predicting KNN Regression
Scikit-Learn's intuitive and stable API makes training regressors and classifiers very straightforward. Let's import the KNeighborsRegressor class from the sklearn.neighbors module, instantiate it, and fit it to our train data:
from sklearn.neighbors import KNeighborsRegressor
regressor = KNeighborsRegressor(n_neighbors=5)
regressor.fit(X_train, y_train)
In the above code, n_neighbors is the value of K, that is, the number of neighbors the algorithm will take into consideration when choosing a new median house value. 5 is the default value for KNeighborsRegressor(). There is no ideal value for K; it is selected after testing and evaluation. However, to start out, 5 is a commonly used value for KNN and was thus set as the default.
The final step is to make predictions on our test data. To do so, execute the following script:
y_pred = regressor.predict(X_test)
We can now evaluate how well our model generalizes to new data that we have labels (ground truth) for - the test set!
Evaluating the Algorithm for KNN Regression
The most commonly used regression metrics for evaluating the algorithm are mean absolute error (MAE), mean squared error (MSE), root mean squared error (RMSE), and coefficient of determination (R2):
- Mean Absolute Error (MAE): We subtract the predicted values from the actual values to obtain the errors, sum the absolute values of those errors, and take their mean. This metric gives a notion of the overall error for each prediction of the model; the smaller (closer to 0) the better:
$$
mae = (\frac{1}{n})\sum_{i=1}^{n}\left | Actual - Predicted \right |
$$
Note: You may also encounter the y and ŷ (read as y-hat) notation in the equations. The y refers to the actual values and the ŷ to the predicted values.
- Mean Squared Error (MSE): It is similar to the MAE metric, but it squares the errors instead of taking their absolute values. As with MAE, the smaller, or closer to 0, the better. Squaring makes large errors count even more. One thing to pay close attention to is that MSE is usually a hard metric to interpret due to the size of its values and the fact that they aren't on the same scale as the data.
$$
mse = \frac{1}{n}\sum_{i=1}^{n}(Actual_i - Predicted_i)^2
$$
- Root Mean Squared Error (RMSE): Tries to solve the interpretation problem raised with MSE by taking the square root of its final value, scaling it back to the same units as the data. It is easier to interpret and useful when we need to display or report the actual magnitude of the error. It shows how much the predictions typically deviate, so, if we have an RMSE of 4.35, our model's predictions tend to be off by about 4.35 above or below the actual value. The closer to 0, the better as well.
$$
rmse = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(Actual_i - Predicted_i)^2}
$$
The mean_absolute_error() and mean_squared_error() functions of sklearn.metrics can be used to calculate these metrics, as can be seen in the following snippet:
from sklearn.metrics import mean_absolute_error, mean_squared_error
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
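# Note: in newer Scikit-Learn versions (1.4+), the squared parameter below is deprecated
# in favor of sklearn.metrics.root_mean_squared_error(y_test, y_pred)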
rmse = mean_squared_error(y_test, y_pred, squared=False)
print(f'mae: {mae}')
print(f'mse: {mse}')
print(f'rmse: {rmse}')
The output of the above script looks like this:
mae: 0.4460739527131783
mse: 0.4316907430948294
rmse: 0.6570317671884894
The R2 can be calculated directly with the score() method:
regressor.score(X_test, y_test)
Which outputs:
0.6737569252627673
The results show that our KNN model's mean absolute error is around 0.44 and its mean squared error around 0.43. Also, the RMSE shows that our predictions typically land about 0.65 above or below the actual value. How good is that?
Let's check what the prices look like:
y.describe()
count 20640.000000
mean 2.068558
std 1.153956
min 0.149990
25% 1.196000
50% 1.797000
75% 2.647250
max 5.000010
Name: MedHouseVal, dtype: float64
The mean is 2.06 and the standard deviation from the mean is 1.15, so our MAE of ~0.44 isn't really stellar, but isn't too bad.
With R2, the closer to 1 (or 100%) we get, the better. R2 tells us how much of the variation in the data, i.e. the data variance, is explained by the KNN model.
$$
R^2 = 1 - \frac{\sum(Actual - Predicted)^2}{\sum(Actual - Actual \ Mean)^2}
$$
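If you'd like to check that score() really computes this quantity, a small sketch applying the formula directly and comparing it with Scikit-Learn's r2_score could look like this:

import numpy as np
from sklearn.metrics import r2_score

# R2 from the formula: 1 - (sum of squared residuals / total sum of squares)
ss_res = np.sum((y_test - y_pred) ** 2)
ss_tot = np.sum((y_test - y_test.mean()) ** 2)
print(1 - ss_res / ss_tot)

# The same value through Scikit-Learn
print(r2_score(y_test, y_pred))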
With a value of 0.67, we can see that our model explains 67% of the data variance. That is already more than 50%, which is OK, but not very good. Is there any way we could do better?
We have used a predetermined K with a value of 5, so we are using 5 neighbors to predict our targets, which is not necessarily the best number. To understand which K would be ideal, we can analyze our algorithm's errors and choose the K that minimizes the loss.
Finding the Best K for KNN Regression
Ideally, you would pick the metric that best fits your context - but it is usually informative to test all metrics. Whenever you can test all of them, do it. Here, we will show how to choose the best K using only the mean absolute error, but you can change it to any other metric and compare the results.
To do this, we will create a for loop and run models that use from 1 to 39 neighbors. At each iteration, we will calculate the MAE and plot the number of neighbors, K, along with the MAE result:
error = []
# Calculating MAE error for K values between 1 and 39
for i in range(1, 40):
knn = KNeighborsRegressor(n_neighbors=i)
knn.fit(X_train, y_train)
pred_i = knn.predict(X_test)
mae = mean_absolute_error(y_test, pred_i)
error.append(mae)
Now, let's plot the errors:
import matplotlib.pyplot as plt
plt.figure(figsize=(12, 6))
plt.plot(range(1, 40), error, color='red',
linestyle='dashed', marker='o',
markerfacecolor='blue', markersize=10)
plt.title('K Value MAE')
plt.xlabel('K Value')
plt.ylabel('Mean Absolute Error')
Looking at the plot, it seems the lowest MAE value is when K is 12. Let's get a closer look at the plot to be sure, by plotting less data:
plt.figure(figsize=(12, 6))
plt.plot(range(1, 15), error[:14], color='red',
linestyle='dashed', marker='o',
markerfacecolor='blue', markersize=10)
plt.title('K Value MAE')
plt.xlabel('K Value')
plt.ylabel('Mean Absolute Error')
You can also obtain the lowest error and the index of that point using the built-in min() function (works on lists), or convert the list into a NumPy array and get its argmin() (the index of the element with the lowest value):
import numpy as np
print(min(error)) # 0.43631325936692505
print(np.array(error).argmin()) # 11
We started counting neighbors at 1, while arrays are 0-based, so index 11 corresponds to 12 neighbors!
This means that we need 12 neighbors to be able to predict a point with the lowest MAE error. We can execute the model and metrics again with 12 neighbors to compare results:
knn_reg12 = KNeighborsRegressor(n_neighbors=12)
knn_reg12.fit(X_train, y_train)
y_pred12 = knn_reg12.predict(X_test)
r2 = knn_reg12.score(X_test, y_test)
mae12 = mean_absolute_error(y_test, y_pred12)
mse12 = mean_squared_error(y_test, y_pred12)
rmse12 = mean_squared_error(y_test, y_pred12, squared=False)
print(f'r2: {r2}, \nmae: {mae12} \nmse: {mse12} \nrmse: {rmse12}')
The following code outputs:
r2: 0.6887495617137436,
mae: 0.43631325936692505
mse: 0.4118522151025172
rmse: 0.6417571309323467
With 12 neighbors, our KNN model now explains 69% of the variance in the data, and the errors have dropped slightly - from 0.44 to 0.43 for MAE, from 0.43 to 0.41 for MSE, and from 0.65 to 0.64 for RMSE. It is not a very large improvement, but it is an improvement nonetheless.
Note: Going further in this analysis, doing an Exploratory Data Analysis (EDA) along with residual analysis may help to select features and achieve better results.
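Also worth mentioning: choosing K by looking at test-set errors, as we did above for simplicity, lets the test data influence the model choice. A cross-validated search over the training data is a common alternative - here is a sketch using GridSearchCV, assuming the X_train and y_train from earlier:

from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor

# Search K from 1 to 39 with 5-fold cross-validation on the training set only
param_grid = {'n_neighbors': list(range(1, 40))}
grid = GridSearchCV(KNeighborsRegressor(),
                    param_grid,
                    scoring='neg_mean_absolute_error',
                    cv=5)
grid.fit(X_train, y_train)

print(grid.best_params_)   # the K with the lowest cross-validated MAE
print(-grid.best_score_)   # the corresponding MAE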
We have already seen how to use KNN for regression - but what if we wanted to classify a point instead of predicting its value? Now, we can look at how to use KNN for classification.
Classification using K-Nearest Neighbors with Scikit-Learn
In this task, instead of predicting a continuous value, we want to predict the class to which these block groups belong. To do that, we can divide the median house value for districts into groups with different house value ranges or bins.
When you want to use a continuous value for classification, you can usually bin the data. In this way, you can predict groups, instead of values.
Preprocessing Data for Classification
Let's create the data bins to transform our continuous values into categories:
# Creating 4 categories and assigning them to a MedHouseValCat column
df["MedHouseValCat"] = pd.qcut(df["MedHouseVal"], 4, retbins=False, labels=[1, 2, 3, 4])
Then, we can split our dataset into its attributes and labels:
y = df['MedHouseValCat']
X = df.drop(['MedHouseVal', 'MedHouseValCat'], axis = 1)
Since we have used the MedHouseVal column to create the bins, we need to drop both the MedHouseVal column and the MedHouseValCat column from X. This way, the DataFrame will contain the first 8 columns of the dataset (i.e. attributes, features) while our y will contain only the MedHouseValCat assigned label.
Note: You can also select columns using .iloc[] instead of dropping them. When dropping, just be aware you need to assign the y values before assigning the X values, because you can't assign a dropped column of a DataFrame to another object in memory.
Splitting Data into Train and Test Sets
As it has been done with regression, we will also divide the dataset into training and test splits. Since we have different data, we need to repeat this process:
from sklearn.model_selection import train_test_split
SEED = 42
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=SEED)
We will use the standard Scikit-Learn value of 75% train data and 25% test data again. This means we will have the same train and test number of records as in the regression before.
Feature Scaling for Classification
Since we are dealing with the same unprocessed dataset and its varying measure units, we will perform feature scaling again, in the same way as we did for our regression data:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
Training and Predicting for Classification
After binning, splitting, and scaling the data, we can finally fit a classifier on it. For the prediction, we will use 5 neighbors again as a baseline. You can also instantiate the KNeighborsClassifier class without any arguments and it will automatically use 5 neighbors. Here, instead of importing KNeighborsRegressor, we will import the KNeighborsClassifier class:
from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier()
classifier.fit(X_train, y_train)
After fitting the KNeighborsClassifier, we can predict the classes of the test data:
y_pred = classifier.predict(X_test)
Time to evaluate the predictions! Would predicting classes be a better approach than predicting values in this case? Let's evaluate the algorithm to see what happens.
Evaluating KNN for Classification
For evaluating the KNN classifier, we can also use the score method, but it computes a different metric since we are scoring a classifier and not a regressor. The basic metric for classification is accuracy - it describes how many predictions our classifier got right. The lowest accuracy value is 0 and the highest is 1. We usually multiply that value by 100 to obtain a percentage.
$$
accuracy = \frac{\text{number of correct predictions}}{\text{total number of predictions}}
$$
Note: It is extremely hard to obtain 100% accuracy on any real data; if that happens, be aware that some leakage or something else wrong might be happening. There is no consensus on an ideal accuracy value and it is also context-dependent: depending on the cost of error (how bad it is if we trust the classifier and it turns out to be wrong), an acceptable error rate might be 5%, 10% or even 30%.
Let's score our classifier:
acc = classifier.score(X_test, y_test)
print(acc) # 0.6191860465116279
By looking at the resulting score, we can deduce that our classifier got ~62% of our classes right. This already helps in the analysis, although by only knowing what the classifier got right, it is difficult to improve it.
There are 4 classes in our dataset - what if our classifier got 90% of classes 1, 2, and 3 right, but only 30% of class 4 right?
A systematic failure on one class and a balanced failure spread across classes can both yield a 62% accuracy score. Accuracy isn't a really good metric for thorough evaluation - but it does serve as a good proxy. More often than not, with balanced datasets, a 62% accuracy is relatively evenly spread. Also, more often than not, datasets aren't balanced, so we're back at square one with accuracy being an insufficient metric.
We can look deeper into the results using other metrics to be able to determine that. This step is also different from the regression, here we will use:
- Confusion Matrix: To know how much we got right or wrong for each class. The values that were positive and correctly predicted are called true positives; the ones that were predicted as positive but weren't positive are called false positives. The same nomenclature of true negatives and false negatives is used for negative values;
- Precision: To understand what proportion of the values predicted as positive were actually correct. Precision divides the true positives by everything that was predicted as positive;
$$
precision = \frac{\text{true positive}}{\text{true positive} + \text{false positive}}
$$
- Recall: To understand how many of the actual positives were identified by our classifier. Recall is calculated by dividing the true positives by everything that should have been predicted as positive.
$$
recall = \frac{\text{true positive}}{\text{true positive} + \text{false negative}}
$$
- F1 score: The balanced or harmonic mean of precision and recall. The lowest value is 0 and the highest is 1. When the f1-score is equal to 1, it means all classes were correctly predicted - this is a very hard score to obtain with real data (exceptions almost always exist).
$$
\text{f1-score} = 2* \frac{\text{precision} * \text{recall}}{\text{precision} + \text{recall}}
$$
Note: A weighted F1 score also exists - it's an F1 that doesn't apply the same weight to all classes. The weight is typically dictated by each class's support - how many instances "support" the F1 score (the proportion of labels belonging to a certain class). The lower the support (the fewer instances of a class), the less that class contributes to the weighted F1, because its score is more unreliable.
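For instance, Scikit-Learn's f1_score lets you choose the averaging strategy explicitly - a quick sketch of the per-class, macro, and weighted variants, reusing the y_test and y_pred from our classifier:

from sklearn.metrics import f1_score

# One F1 value per class, in label order
print(f1_score(y_test, y_pred, average=None))
# Unweighted mean of the per-class F1 scores
print(f1_score(y_test, y_pred, average='macro'))
# Mean of the per-class F1 scores, weighted by each class's support
print(f1_score(y_test, y_pred, average='weighted'))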
The confusion_matrix() and classification_report() functions of the sklearn.metrics module can be used to calculate and display all of these metrics. The confusion_matrix is better visualized using a heatmap. The classification report already gives us accuracy, precision, recall, and f1-score, but you could also import each of these metrics from sklearn.metrics.
To obtain metrics, execute the following snippet:
from sklearn.metrics import classification_report, confusion_matrix
# Importing Seaborn to plot the confusion matrix as a heatmap
import seaborn as sns

# Adding class names for better interpretation
classes_names = ['class 1','class 2','class 3', 'class 4']
cm = pd.DataFrame(confusion_matrix(y_test, y_pred),
                  columns=classes_names, index=classes_names)
# Seaborn's heatmap to better visualize the confusion matrix
sns.heatmap(cm, annot=True, fmt='d');

print(classification_report(y_test, y_pred))
The output of the above script looks like this:
precision recall f1-score support
1 0.75 0.78 0.76 1292
2 0.49 0.56 0.53 1283
3 0.51 0.51 0.51 1292
4 0.76 0.62 0.69 1293
accuracy 0.62 5160
macro avg 0.63 0.62 0.62 5160
weighted avg 0.63 0.62 0.62 5160
The results show that KNN was able to classify all 5160 records in the test set with 62% accuracy, which is well above the 25% we would expect from random guessing with 4 balanced classes. The supports are fairly equal (even distribution of classes in the dataset), so the weighted F1 and unweighted F1 are going to be roughly the same.
We can also see the result of the metrics for each of the 4 classes. From that, we are able to notice that class 2 had the lowest precision, lowest recall, and lowest f1-score. Class 3 is right behind class 2 with the next lowest scores, and then we have class 1 with the best scores, followed by class 4.
By looking at the confusion matrix, we can see that:
- class 1 was mostly mistaken for class 2, in 238 cases
- class 2 was mistaken for class 1 in 256 entries, and for class 3 in 260 cases
- class 3 was mostly mistaken for class 2, in 374 entries, and for class 4, in 193 cases
- class 4 was wrongly classified as class 3 in 339 entries, and as class 2 in 130 cases
Also, notice that the diagonal displays the true positive values; looking at it, it is plain to see that class 2 and class 3 have the fewest correctly predicted values.
With those results, we could go deeper into the analysis by further inspecting them to figure out why that happened, and also by examining whether 4 classes are the best way to bin the data. Perhaps values from class 2 and class 3 were too close to each other, making it hard to tell them apart.
Always try to test the data with a different number of bins to see what happens.
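For example, a quick way to experiment with a different binning - say, 3 quantile-based categories instead of 4 - could look like the sketch below (the MedHouseValCat3 column name is just illustrative); you would then repeat the split, scaling, and training steps with the new labels:

# Hypothetical alternative binning: 3 quantile-based categories instead of 4
df["MedHouseValCat3"] = pd.qcut(df["MedHouseVal"], 3, labels=[1, 2, 3])

y3 = df["MedHouseValCat3"]
X3 = df.drop(["MedHouseVal", "MedHouseValCat", "MedHouseValCat3"], axis=1)
# ...then split, scale, fit a KNeighborsClassifier and compare the classification reports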
Besides the arbitrary number of data bins, there is also another arbitrary number that we have chosen: the number of neighbors, K. The same technique we applied to the regression task can be applied to classification when determining the K that maximizes or minimizes a metric value.
Finding the Best K for KNN Classification
Let's repeat what has been done for regression and plot the graph of K values and the corresponding metric for the test set. You can also choose which metric better fits your context; here, we will choose f1-score.
In this way, we will plot the f1-score for the predicted values of the test set for all the K values between 1 and 39.
First, we import f1_score from sklearn.metrics and then calculate its value for all the predictions of a K-Nearest Neighbors classifier, where K ranges from 1 to 39:
from sklearn.metrics import f1_score
f1s = []
# Calculating f1 score for K values between 1 and 39
for i in range(1, 40):
knn = KNeighborsClassifier(n_neighbors=i)
knn.fit(X_train, y_train)
pred_i = knn.predict(X_test)
# using average='weighted' to calculate a weighted average for the 4 classes
f1s.append(f1_score(y_test, pred_i, average='weighted'))
The next step is to plot the f1_score values against the K values. The difference from the regression is that instead of choosing the K value that minimizes the error, this time we will choose the value that maximizes the f1-score.
Execute the following script to create the plot:
plt.figure(figsize=(12, 6))
plt.plot(range(1, 40), f1s, color='red', linestyle='dashed', marker='o',
markerfacecolor='blue', markersize=10)
plt.title('F1 Score K Value')
plt.xlabel('K Value')
plt.ylabel('F1 Score')
The output graph looks like this:
From the output, we can see that the f1-score is highest when the value of K is 15. Let's retrain our classifier with 15 neighbors and see what it does to our classification report results:
classifier15 = KNeighborsClassifier(n_neighbors=15)
classifier15.fit(X_train, y_train)
y_pred15 = classifier15.predict(X_test)
print(classification_report(y_test, y_pred15))
This outputs:
precision recall f1-score support
1 0.77 0.79 0.78 1292
2 0.52 0.58 0.55 1283
3 0.51 0.53 0.52 1292
4 0.77 0.64 0.70 1293
accuracy 0.63 5160
macro avg 0.64 0.63 0.64 5160
weighted avg 0.64 0.63 0.64 5160
Notice that our metrics have improved with 15 neighbors: we have 63% accuracy and higher precision, recall, and f1-scores, but we still need to look further at the bins to try to understand why the f1-score for classes 2 and 3 is still low.
Besides using KNN for regression, to determine block values, and for classification, to determine block classes - we can also use KNN to detect which mean block values are different from most of the others - the ones that don't follow what most of the data is doing. In other words, we can use KNN for detecting outliers.
Implementing KNN for Outlier Detection with Scikit-Learn
Outlier detection uses a method that differs from what we did previously for regression and classification.
Here, we will see how far each of the neighbors is from a data point. Let's use the default 5 neighbors. For a data point, we will calculate the distance to each of its K nearest neighbors. To do that, we will import another KNN algorithm from Scikit-Learn which is not specific to either regression or classification, called simply NearestNeighbors.
After importing, we will instantiate a NearestNeighbors class with 5 neighbors - you could also instantiate it with 12 neighbors to identify outliers in our regression example, or with 15 to do the same for the classification example. We will then fit our train data and use the kneighbors() method to find the calculated distances for each data point and the neighbor indexes:
from sklearn.neighbors import NearestNeighbors
nbrs = NearestNeighbors(n_neighbors = 5)
nbrs.fit(X_train)
# Distances and indexes of the 5 neighbors
distances, indexes = nbrs.kneighbors(X_train)
Now we have 5 distances for each data point - the distance between itself and its 5 nearest neighbors - along with the indexes that identify them. Since NearestNeighbors was fit on the same data we query, each point's closest neighbor is the point itself, which is why the first distance in each row is 0. Let's take a peek at the first three results and the shape of the array to visualize this better.
To look at the first three rows of distances and their shape, execute:
distances[:3], distances[:3].shape
(array([[0. , 0.12998939, 0.15157687, 0.16543705, 0.17750354],
[0. , 0.25535314, 0.37100754, 0.39090243, 0.40619693],
[0. , 0.27149697, 0.28024623, 0.28112326, 0.30420656]]),
(3, 5))
Observe that there are 3 rows with 5 distances each. We can also look at the neighbors' indexes:
indexes[:3], indexes[:3].shape
This results in:
(array([[ 0, 8608, 12831, 8298, 2482],
[ 1, 4966, 5786, 8568, 6759],
[ 2, 13326, 13936, 3618, 9756]]),
(3, 5))
In the output above, we can see the indexes of each of the 5 neighbors. Now, we can calculate the mean of the 5 distances for each point and plot a graph with the row position on the X-axis and the mean distance on the Y-axis:
dist_means = distances.mean(axis=1)
plt.plot(dist_means)
plt.title('Mean of the 5 neighbors distances for each data point')
plt.xlabel('Count')
plt.ylabel('Mean Distances')
Notice that there is a long stretch of the graph in which the mean distances have fairly uniform values. The Y-axis value above which the means stop being uniform is the cut-off point we need to identify in order to flag the outlier values. In this case, it is where the mean distance is 3. Let's plot the graph again with a horizontal dotted line to be able to spot it:
dist_means = distances.mean(axis=1)
plt.plot(dist_means)
plt.title('Mean of the 5 neighbors distances for each data point with cut-off line')
plt.xlabel('Count')
plt.ylabel('Mean Distances')
plt.axhline(y = 3, color = 'r', linestyle = '--')
This line marks the cut-off above which points are considered outliers: all points with a mean distance above 3 are our outliers. We can find the indexes of those points using np.where(), which, given just the condition, returns the indexes for which the mean-above-3 condition is True:
import numpy as np
# Visually determine cutoff values > 3
outlier_index = np.where(dist_means > 3)
outlier_index
The above code outputs:
(array([ 564, 2167, 2415, 2902, 6607, 8047, 8243, 9029, 11892,
12127, 12226, 12353, 13534, 13795, 14292, 14707]),)
Now we have our outlier point indexes. Let's locate them in the dataframe:
# Filter outlier values
outlier_values = df.iloc[outlier_index]
outlier_values
This results in:
MedInc HouseAge AveRooms AveBedrms Population AveOccup Latitude Longitude MedHouseVal
564 4.8711 27.0 5.082811 0.944793 1499.0 1.880803 37.75 -122.24 2.86600
2167 2.8359 30.0 4.948357 1.001565 1660.0 2.597809 36.78 -119.83 0.80300
2415 2.8250 32.0 4.784232 0.979253 761.0 3.157676 36.59 -119.44 0.67600
2902 1.1875 48.0 5.492063 1.460317 129.0 2.047619 35.38 -119.02 0.63800
6607 3.5164 47.0 5.970639 1.074266 1700.0 2.936097 34.18 -118.14 2.26500
8047 2.7260 29.0 3.707547 1.078616 2515.0 1.977201 33.84 -118.17 2.08700
8243 2.0769 17.0 3.941667 1.211111 1300.0 3.611111 33.78 -118.18 1.00000
9029 6.8300 28.0 6.748744 1.080402 487.0 2.447236 34.05 -118.78 5.00001
11892 2.6071 45.0 4.225806 0.903226 89.0 2.870968 33.99 -117.35 1.12500
12127 4.1482 7.0 5.674957 1.106998 5595.0 3.235975 33.92 -117.25 1.24600
12226 2.8125 18.0 4.962500 1.112500 239.0 2.987500 33.63 -116.92 1.43800
12353 3.1493 24.0 7.307323 1.460984 1721.0 2.066026 33.81 -116.54 1.99400
13534 3.7949 13.0 5.832258 1.072581 2189.0 3.530645 34.17 -117.33 1.06300
13795 1.7567 8.0 4.485173 1.120264 3220.0 2.652389 34.59 -117.42 0.69500
14292 2.6250 50.0 4.742236 1.049689 728.0 2.260870 32.74 -117.13 2.03200
14707 3.7167 17.0 5.034130 1.051195 549.0 1.873720 32.80 -117.05 1.80400
Our outlier detection is finished. This is how we spot each data point that deviates from the general data trend. We can see that there are 16 points in our train data that should be further looked at, investigated, maybe treated, or even removed from our data (if they were erroneously input) to improve results. Those points might have resulted from typing errors, inconsistencies in the mean block values, or even both.
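One caveat worth keeping in mind: the values in outlier_index are positions within X_train, and since train_test_split shuffles the data, those positions don't generally correspond to the same rows of df. To map the flagged points back to the original rows reliably, you could keep the training split's original row labels before scaling - a sketch, using hypothetical X_train_raw and train_index names:

import numpy as np
from sklearn.model_selection import train_test_split

# Keep the original row labels of the training split before scaling
X_train_raw, X_test_raw, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=SEED)
train_index = X_train_raw.index

X_train = scaler.fit_transform(X_train_raw)

# ...recompute distances and dist_means on this X_train as above, then:
outlier_positions = np.where(dist_means > 3)[0]
outlier_values = df.loc[train_index[outlier_positions]]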
Pros and Cons of KNN
In this section, we'll present some of the pros and cons of using the KNN algorithm.
Pros
- It is easy to implement
- It is a lazy learning algorithm and therefore doesn't require a training phase on the data (it only consults the K nearest neighbors when predicting). This makes training KNN much faster than algorithms that need to fit a model to the whole dataset, such as Support Vector Machines, linear regression, etc.
- Since KNN requires no training before making predictions, new data can be added seamlessly
- There are only two parameters required to work with KNN, i.e. the value of K and the distance function
Cons
- The KNN algorithm doesn't work well with high dimensional data, because with a large number of dimensions the distance between points gets "weird", and the distance metrics we use don't hold up
- The KNN algorithm can also be slow at prediction time on large datasets, since the distance from a new point to every stored point has to be computed for each prediction
- Finally, the KNN algorithm doesn't work well with categorical features, since it is difficult to compute meaningful distances between categorical features
Conclusion
KNN is a simple yet powerful algorithm. It can be used for many tasks such as regression, classification, or outlier detection.
KNN has been widely used to find document similarity and pattern recognition. It has also been employed for developing recommender systems and for dimensionality reduction and pre-processing steps for computer vision - particularly face recognition tasks.
In this guide - we've gone through regression, classification and outlier detection using Scikit-Learn's implementation of the K-Nearest Neighbor algorithm.