HomeArticles

Guide to the K-Nearest Neighbors Algorithm in Python and Scikit-Learn

Introduction

The K-nearest Neighbors (KNN) algorithm is a type of supervised machine learning algorithm used for classification, regression as well as outlier detection. It is extremely easy to implement in its most basic form but can perform fairly complex tasks. It is a lazy learning algorithm since it doesn't have a specialized training phase. Rather, it uses all of the data for training while classifying (or regressing) a new data point or instance.

KNN is a non-parametric learning algorithm, which means that it doesn't assume anything about the underlying data. This is an extremely useful feature since most of the real-world data doesn't really follow any theoretical assumption e.g. linear separability, uniform distribution, etc.

In this guide, we will see how KNN can be implemented with Python's Scikit-Learn library. Before that we'll first explore how we can use KNN and explain the theory behind it. After that, we'll take a look at the California Housing dataset we'll be using to illustrate the KNN algorithm and several of its variations. First of all, we'll take a look at how to implement the KNN algorithm for the regression, followed by implementations of the KNN classification and the outlier detection. In the end, we'll conclude with some of the pros and cons of the algorithm.

When Should You Use KNN?

Suppose you wanted to rent an apartment and recently found out your friend's neighbor might put her apartment up for rent in 2 weeks. Since the apartment isn't on a rental website yet, how could you try to estimate its rental value?

Let's say your friend pays $1,200 in rent. Your rent value might be around that number, but the apartments aren't exactly the same (orientation, area, furniture quality, etc.), so, it would be nice to have more data on other apartments.

By asking other neighbors and looking at the apartments from the same building that were listed on a rental website, the closest three neighboring apartment rents are $1,200, $1,210, $1,210, and $1,215. Those apartments are on the same block and floor as your friend's apartment.

Other apartments that are further away, on the same floor, but in a different block have rents of $1,400, $1,430, $1,500, and $1,470. It seems they are more expensive due to having more light from the sun in the evening.

Considering the apartment's proximity, it seems your estimated rent would be around $1,210. That is the general idea of what the K-Nearest Neighbors (KNN) algorithm does! It classifies or regresses new data based on its proximity to already existing data.

Translate the Example into Theory

When the estimated value is a continuous number, such as the rent value, KNN is used for regression. But we could also divide apartments into categories based on the minimum and maximum rent, for instance. When the value is discrete, making it a category, KNN is used for classification.

There is also the possibility of estimating which neighbors are so different from others that they will probably stop paying rent. This is the same as detecting which data points are so far away that they don't fit into any value or category, when that happens, KNN is used for outlier detection.

In our example, we also already knew the rents of each apartment, which means our data was labeled. KNN uses previously labeled data, which makes it a supervised learning algorithm.

KNN is extremely easy to implement in its most basic form, and yet performs quite complex classification, regression, or outlier detection tasks.

Each time there is a new point added to the data, KNN uses just one part of the data for deciding the value (regression) or class (classification) of that added point. Since it doesn't have to look at all the points again, this makes it a lazy learning algorithm.

KNN also doesn't assume anything about the underlying data characteristics, it doesn't expect the data to fit into some type of distribution, such as uniform, or to be linearly separable. This means it is a non-parametric learning algorithm. This is an extremely useful feature since most of the real-world data doesn't really follow any theoretical assumption.

Visualizing Different Uses of the KNN

As it has been shown, the intuition behind the KNN algorithm is one of the most direct of all the supervised machine learning algorithms. The algorithm first calculates the distance of a new data point to all other training data points.

Note: The distance can be measured in different ways. You can use a Minkowski, Euclidean, Manhattan, Mahalanobis or Hamming formula, to name a few metrics. With high dimensional data, Euclidean distance often starts failing (high dimensionality is... weird), and Manhattan distance is used instead.

After calculating the distance, KNN selects a number of nearest data points - 2, 3, 10, or really, any integer. This number of points (2, 3, 10, etc.) is the K in K-Nearest Neighbors!

In the final step, if it is a regression task, KNN will calculate the average weighted sum of the K-nearest points for the prediction. If it is a classification task, the new data point will be assigned to the class to which the majority of the selected K-nearest points belong.

Let's visualize the algorithm in action with the help of a simple example. Consider a dataset with two variables and a K of 3.

When performing regression, the task is to find the value of a new data point, based on the average weighted sum of the 3 nearest points.

KNN with K = 3, when used for regression:

The KNN algorithm will start by calculating the distance of the new point from all the points. It then finds the 3 points with the least distance to the new point. This is shown in the second figure above, in which the three nearest points, 47, 58, and 79 have been encircled. After that, it calculates the weighted sum of 47, 58 and 79 - in this case the weights are equal to 1 - we are considering all points as equals, but we could also assign different weights based on distance. After calculating the weighted sum, the new point value is 61,33.

And when performing a classification, the KNN task is to classify a new data point, into the "Purple" or "Red" class.

KNN with K = 3, when used for classification:

The KNN algorithm will start in the same way as before, by calculating the distance of the new point from all the points, finding the 3 nearest points with the least distance to the new point, and then, instead of calculating a number, it assigns the new point to the class to which majority of the three nearest points belong, the red class. Therefore the new data point will be classified as "Red".

The outlier detection process is different from both above, we will talk more about it when implementing it after the regression and classification implementations.

Note: The code provided in this tutorial has been executed and tested with the following Jupyter notebook.

The Scikit-Learn California Housing Dataset

We are going to use the California housing dataset to illustrate how the KNN algorithm works. The dataset was derived from the 1990 U.S. census. One row of the dataset represents the census of one block group.

In this section, we'll go over the details of the California Housing Dataset, so you can gain an intuitive understanding of the data we'll be working with. It's very important to get to know your data before you start working on it.

A block group is the smallest geographical unit for which the U.S. Census Bureau publishes sample data. Besides block group, another term used is household, a household is a group of people residing within a home.

The dataset consists of nine attributes:

MedInc - median income in block group
HouseAge - median house age in a block group
AveRooms - the average number of rooms (provided per household)
AveBedrms - the average number of bedrooms (provided per household)
Population - block group population
AveOccup - the average number of household members
Latitude - block group latitude
Longitude - block group longitude
MedHouseVal - median house value for California districts (hundreds of thousands of dollars)

The dataset is already part of the Scikit-Learn library, we only need to import it and load it as a dataframe:

from sklearn.datasets import fetch_california_housing
# as_frame=True loads the data in a dataframe format, with other metadata besides it
california_housing = fetch_california_housing(as_frame=True)
# Select only the dataframe part and assign it to the df variable
df = california_housing.frame

Importing the data directly from Scikit-Learn, imports more than only the columns and numbers and includes the data description as a Bunch object - so we've just extracted the frame. Further details of the dataset are available here.

Let's import Pandas and take a peek at the first few rows of data:

import pandas as pd
df.head()

Executing the code will display the first five rows of our dataset:

    MedInc  HouseAge  AveRooms  AveBedrms  Population  AveOccup   Latitude  Longitude  MedHouseVal
0 	8.3252 	41.0 	  6.984127 	1.023810   322.0 	   2.555556   37.88 	-122.23    4.526
1 	8.3014 	21.0 	  6.238137 	0.971880   2401.0 	   2.109842   37.86 	-122.22    3.585
2 	7.2574 	52.0 	  8.288136 	1.073446   496.0 	   2.802260   37.85 	-122.24    3.521
3 	5.6431 	52.0 	  5.817352 	1.073059   558.0 	   2.547945   37.85 	-122.25    3.413
4 	3.8462 	52.0 	  6.281853 	1.081081   565.0 	   2.181467   37.85 	-122.25    3.422

In this guide, we will use MedInc, HouseAge, AveRooms, AveBedrms, Population, AveOccup, Latitude, Longitude to predict MedHouseVal. Something similar to our motivation narrative.

Let's now jump right into the implementation of the KNN algorithm for the regression.

Regression with K-Nearest Neighbors with Scikit-Learn

So far, we got to know our dataset and now can proceed to other steps in the KNN algorithm.

Preprocessing Data for KNN Regression

The preprocessing is where the first differences between the regression and classification tasks appear. Since this section is all about regression, we'll prepare our dataset accordingly.

For the regression, we need to predict another median house value. To do so, we will assign MedHouseVal to y and all other columns to X just by dropping MedHouseVal:

y = df['MedHouseVal']
X = df.drop(['MedHouseVal'], axis = 1)

By looking at our variables’ descriptions, we can see that we have differences in measurements. To avoid guessing, let's use the describe() method to check:

# .T transposes the results, transforming rows into columns
X.describe().T

This results in:

            count 	  mean 		   std 			min 		25% 		50% 		75% 		max
MedInc 		20640.0   3.870671 	   1.899822 	0.499900 	2.563400 	3.534800 	4.743250 	15.000100
HouseAge 	20640.0   28.639486    12.585558 	1.000000 	18.000000 	29.000000 	37.000000 	52.000000
AveRooms 	20640.0   5.429000 	   2.474173 	0.846154 	4.440716 	5.229129 	6.052381 	141.909091
AveBedrms 	20640.0   1.096675 	   0.473911 	0.333333 	1.006079 	1.048780 	1.099526 	34.066667
Population 	20640.0   1425.476744  1132.462122 	3.000000 	787.000000 	1166.000000 1725.000000 35682.000000
AveOccup 	20640.0   3.070655 	   10.386050 	0.692308 	2.429741 	2.818116 	3.282261 	1243.333333
Latitude 	20640.0   35.631861    2.135952 	32.540000 	33.930000 	34.260000 	37.710000 	41.950000
Longitude 	20640.0   -119.569704  2.003532    -124.350000 -121.800000 	-118.490000 -118.010000 -114.310000

Here, we can see that the mean value of MedInc is approximately 3.87 and the mean value of HouseAge is about 28.64, making it 7.4 times larger than MedInc. Other features also have differences in mean and standard deviation - to see that, look at the mean and std values and observe how they are distant from each other. For MedInc std is approximately 1.9, for HouseAge, std is 12.59 and the same applies to the other features.

We're using an algorithm based on distance and distance-based algorithms suffer greatly from data that isn't on the same scale, such as this data. The scale of the points may (and in practice, almost always does) distort the real distance between values.

To perform Feature Scaling, we will use Scikit-Learn's StandardScaler class later. If we apply the scaling right now (before a train-test split), the calculation would include test data, effectively leaking test data information into the rest of the pipeline. This sort of data leakage is unfortunately commonly skipped, resulting in irreproducible or illusory findings.

Advice: If you'd like to learn more about feature scaling - read our "Feature Scaling Data with Scikit-Learn for Machine Learning in Python"

Splitting Data into Train and Test Sets

To be able to scale our data without leakage, but also to evaluate our results and to avoid over-fitting, we'll divide our dataset into train and test splits.

A straightforward way to create train and test splits is the train_test_split method from Scikit-Learn. The split doesn't linearly split at some point, but samples X% and Y% randomly. To make this process reproducible (to make the method always sample the same data points), we'll set the random_state argument to a certain SEED:

from sklearn.model_selection import train_test_split

SEED = 42
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=SEED)

This piece of code samples 75% of the data for training and 25% of the data for testing. By changing the test_size to 0.3, for instance, you could train with 70% of the data and test with 30%.

By using 75% of the data for training and 25% for testing, out of 20640 records, the training set contains 15480 and the test set contains 5160. We can inspect those numbers quickly by printing the lengths of the full dataset and of split data:

len(X)       # 20640
len(X_train) # 15480
len(X_test)  # 5160

Great! We can now fit the data scaler on the X_train set, and scale both X_train and X_test without leaking any data from X_test into X_train.

Advice: If you'd like to learn more about the train_test_split() method the importance of a train-test-validation split, as well as how to separate out validation sets as well, read our "Scikit-Learn's train_test_split() - Training, Testing and Validation Sets".

Feature Scaling for KNN Regression

By importing StandardScaler, instantiating it, fitting it according to our train data (preventing leakage), and transforming both train and test datasets, we can perform feature scaling:

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
# Fit only on X_train
scaler.fit(X_train)

# Scale both X_train and X_test
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

Note: Since you'll oftentimes call scaler.fit(X_train) followed by scaler.transform(X_train) - you can call a single scaler.fit_transform(X_train) followed by scaler.transform(X_test) to make the call shorter!

Now our data is scaled! The scaler maintains only the data points, and not the column names, when applied on a DataFrame. Let's organize the data into a DataFrame again with column names and use describe() to observe the changes in mean and std:

col_names=['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'Population', 'AveOccup', 'Latitude', 'Longitude']
scaled_df = pd.DataFrame(X_train, columns=col_names)
scaled_df.describe().T

This will give us:

            count 		mean 			std 		min 		25% 		50% 		75% 		max
MedInc 		15480.0 	2.074711e-16 	1.000032 	-1.774632 	-0.688854 	-0.175663 	0.464450 	5.842113
HouseAge 	15480.0 	-1.232434e-16 	1.000032 	-2.188261 	-0.840224 	0.032036 	0.666407 	1.855852
AveRooms 	15480.0 	-1.620294e-16 	1.000032 	-1.877586 	-0.407008 	-0.083940 	0.257082 	56.357392
AveBedrms 	15480.0 	7.435912e-17 	1.000032 	-1.740123 	-0.205765 	-0.108332 	0.007435 	55.925392
Population 	15480.0 	-8.996536e-17 	1.000032 	-1.246395 	-0.558886 	-0.227928 	0.262056 	29.971725
AveOccup 	15480.0 	1.055716e-17 	1.000032 	-0.201946 	-0.056581 	-0.024172 	0.014501 	103.737365
Latitude 	15480.0 	7.890329e-16 	1.000032 	-1.451215 	-0.799820 	-0.645172 	0.971601 	2.953905
Longitude 	15480.0 	2.206676e-15 	1.000032 	-2.380303 	-1.106817 	0.536231 	0.785934 	2.633738

Observe how all standard deviations are now 1 and the means have become smaller. This is what makes our data more uniform! Let's train and evaluate a KNN-based regressor.

Training and Predicting KNN Regression

Scikit-Learn's intuitive and stable API makes training regressors and classifiers very straightforward. Let's import the KNeighborsRegressor class from the sklearn.neighbors module, instantiate it, and fit it to our train data:

from sklearn.neighbors import KNeighborsRegressor
regressor = KNeighborsRegressor(n_neighbors=5)
regressor.fit(X_train, y_train)

In the above code, the n_neighbors is the value for K, or the number of neighbors the algorithm will take into consideration for choosing a new median house value. 5 is the default value for KNeighborsRegressor(). There is no ideal value for K and it is selected after testing and evaluation, however, to start out, 5 is a commonly used value for KNN and was thus set as the default value.

The final step is to make predictions on our test data. To do so, execute the following script:

y_pred = regressor.predict(X_test)

We can now evaluate how well our model generalizes to new data that we have labels (ground truth) for - the test set!

Evaluating the Algorithm for KNN Regression

The most commonly used regression metrics for evaluating the algorithm are mean absolute error (MAE), mean squared error (MSE), root mean squared error (RMSE), and coefficient of determination (R²):

Mean Absolute Error (MAE): When we subtract the predicted values from the actual values, obtain the errors, sum the absolute values of those errors and get their mean. This metric gives a notion of the overall error for each prediction of the model, the smaller (closer to 0) the better:

$$
mae = (\frac{1}{n})\sum_{i=1}^{n}\left | Actual - Predicted \right |
$$

Note: You may also encounter the y and ŷ (read as y-hat) notation in the equations. The y refers to the actual values and the ŷ to the predicted values.

Mean Squared Error (MSE): It is similar to the MAE metric, but it squares the absolute values of the errors. Also, as with MAE, the smaller, or closer to 0, the better. The MSE value is squared so as to make large errors even larger. One thing to pay close attention to, is that it is usually a hard metric to interpret due to the size of its values and of the fact that they aren't on the same scale as the data.

$$
mse = \sum_{i=1}^{D}(Actual - Predicted)^2
$$

Root Mean Squared Error (RMSE): Tries to solve the interpretation problem raised with the MSE by getting the square root of its final value, so as to scale it back to the same units of the data. It is easier to interpret and good when we need to display or show the actual value of the data with the error. It shows how much the data may vary, so, if we have an RMSE of 4.35, our model can make an error either because it added 4.35 to the actual value, or needed 4.35 to get to the actual value. The closer to 0, the better as well.

$$
rmse = \sqrt{ \sum_{i=1}^{D}(Actual - Predicted)^2}
$$

The mean_absolute_error() and mean_squared_error() methods of sklearn.metrics can be used to calculate these metrics as can be seen in the following snippet:

from sklearn.metrics import mean_absolute_error, mean_squared_error

mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = mean_squared_error(y_test, y_pred, squared=False)

print(f'mae: {mae}')
print(f'mse: {mse}')
print(f'rmse: {rmse}')

The output of the above script looks like this:

mae: 0.4460739527131783 
mse: 0.4316907430948294 
rmse: 0.6570317671884894

The R² can be calculated directly with the score() method:

regressor.score(X_test, y_test)

Which outputs:

0.6737569252627673

The results show that our KNN algorithm overall error and mean error are around 0.44, and 0.43. Also, the RMSE shows that we can go above or below the actual value of data by adding 0.65 or subtracting 0.65. How good is that?

Let's check what the prices look like:

y.describe()

count    20640.000000
mean         2.068558
std          1.153956
min          0.149990
25%          1.196000
50%          1.797000
75%          2.647250
max          5.000010
Name: MedHouseVal, dtype: float64

The mean is 2.06 and the standard deviation from the mean is 1.15 so our score of ~0.44 isn't really stellar, but isn't too bad.

With the R², the closest to 1 we get (or 100), the better. The R² tells how much of the changes in data, or data variance are being understood or explained by KNN.

$$
R^2 = 1 - \frac{\sum(Actual - Predicted)^2}{\sum(Actual - Actual \ Mean)^2}
$$

With a value of 0.67, we can see that our model explains 67% of the data variance. It is already more than 50%, which is ok, but not very good. Is there any way we could do better?

We have used a predetermined K with a value of 5, so, we are using 5 neighbors to predict our targets which is not necessarily the best number. To understand which would be an ideal number of Ks, we can analyze our algorithm errors and choose the K that minimizes the loss.

Finding the Best K for KNN Regression

Ideally, you would see which metric fits more into your context - but it is usually interesting to test all metrics. Whenever you can test all of them, do it. Here, we will show how to choose the best K using only the mean absolute error, but you can change it to any other metric and compare the results.

To do this, we will create a for loop and run models that have from 1 to X neighbors. At each interaction, we will calculate the MAE and plot the number of Ks along with the MAE result:

error = []

# Calculating MAE error for K values between 1 and 39
for i in range(1, 40):
    knn = KNeighborsRegressor(n_neighbors=i)
    knn.fit(X_train, y_train)
    pred_i = knn.predict(X_test)
    mae = mean_absolute_error(y_test, pred_i)
    error.append(mae)

Now, let's plot the errors:

import matplotlib.pyplot as plt 

plt.figure(figsize=(12, 6))
plt.plot(range(1, 40), error, color='red', 
         linestyle='dashed', marker='o',
         markerfacecolor='blue', markersize=10)
         
plt.title('K Value MAE')
plt.xlabel('K Value')
plt.ylabel('Mean Absolute Error')

Looking at the plot, it seems the lowest MAE value is when K is 12. Let's get a closer look at the plot to be sure by plotting less data:

plt.figure(figsize=(12, 6))
plt.plot(range(1, 15), error[:14], color='red', 
         linestyle='dashed', marker='o',
         markerfacecolor='blue', markersize=10)
plt.title('K Value MAE')
plt.xlabel('K Value')
plt.ylabel('Mean Absolute Error')

Free eBook: Git Essentials

Check out our hands-on, practical guide to learning Git, with best-practices, industry-accepted standards, and included cheat sheet. Stop Googling Git commands and actually learn it!

You can also obtain the lowest error and the index of that point using the built-in min() function (works on lists) or convert the list into a NumPy array and get the argmin() (index of the element with the lowest value):

import numpy as np 

print(min(error))               # 0.43631325936692505
print(np.array(error).argmin()) # 11

We started counting neighbors on 1, while arrays are 0-based, so the 11th index is 12 neighbors!

This means that we need 12 neighbors to be able to predict a point with the lowest MAE error. We can execute the model and metrics again with 12 neighbors to compare results:

knn_reg12 = KNeighborsRegressor(n_neighbors=12)
knn_reg12.fit(X_train, y_train)
y_pred12 = knn_reg12.predict(X_test)
r2 = knn_reg12.score(X_test, y_test) 

mae12 = mean_absolute_error(y_test, y_pred12)
mse12 = mean_squared_error(y_test, y_pred12)
rmse12 = mean_squared_error(y_test, y_pred12, squared=False)
print(f'r2: {r2}, \nmae: {mae12} \nmse: {mse12} \nrmse: {rmse12}')

The following code outputs:

r2: 0.6887495617137436, 
mae: 0.43631325936692505 
mse: 0.4118522151025172 
rmse: 0.6417571309323467

With 12 neighbors our KNN model now explains 69% of the variance in the data, and has lost a little less, going from 0.44 to 0.43, 0.43 to 0.41, and 0.65 to 0.64 with the respective metrics. It is not a very large improvement, but it is an improvement nonetheless.

Note: Going further in this analysis, doing an Exploratory Data Analysis (EDA) along with residual analysis may help to select features and achieve better results.

We have already seen how to use KNN for regression - but what if we wanted to classify a point instead of predicting its value? Now, we can look at how to use KNN for classification.

Classification using K-Nearest Neighbors with Scikit-Learn

In this task, instead of predicting a continuous value, we want to predict the class to which these block groups belong. To do that, we can divide the median house value for districts into groups with different house value ranges or bins.

When you want to use a continuous value for classification, you can usually bin the data. In this way, you can predict groups, instead of values.

Preprocessing Data for Classification

Let's create the data bins to transform our continuous values into categories:

# Creating 4 categories and assigning them to a MedHouseValCat column
df["MedHouseValCat"] = pd.qcut(df["MedHouseVal"], 4, retbins=False, labels=[1, 2, 3, 4])

Then, we can split our dataset into its attributes and labels:

y = df['MedHouseValCat']
X = df.drop(['MedHouseVal', 'MedHouseValCat'], axis = 1)

Since we have used the MedHouseVal column to create bins, we need to drop the MedHouseVal column and MedHouseValCat columns from X. This way, the DataFrame will contain the first 8 columns of the dataset (i.e. attributes, features) while our y will contain only the MedHouseValCat assigned label.

Note: You can also select columns using .iloc() instead of dropping them. When dropping, just be aware you need to assign y values before assigning X values, because you can't assign a dropped column of a DataFrame to another object in memory.

Splitting Data into Train and Test Sets

As it has been done with regression, we will also divide the dataset into training and test splits. Since we have different data, we need to repeat this process:

from sklearn.model_selection import train_test_split

SEED = 42
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=SEED)

We will use the standard Scikit-Learn value of 75% train data and 25% test data again. This means we will have the same train and test number of records as in the regression before.

Feature Scaling for Classification

Since we are dealing with the same unprocessed dataset and its varying measure units, we will perform feature scaling again, in the same way as we did for our regression data:

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(X_train)

X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

Training and Predicting for Classification

After binning, splitting, and scaling the data, we can finally fit a classifier on it. For the prediction, we will use 5 neighbors again as a baseline. You can also instantiate the KNeighbors_ class without any arguments and it will automatically use 5 neighbors. Here, instead of importing the KNeighborsRegressor, we will import the KNeighborsClassifier, class:

from sklearn.neighbors import KNeighborsClassifier

classifier = KNeighborsClassifier()
classifier.fit(X_train, y_train)

After fitting the KNeighborsClassifier, we can predict the classes of the test data:

y_pred = classifier.predict(X_test)

Time to evaluate the predictions! Would predicting classes be a better approach than predicting values in this case? Let's evaluate the algorithm to see what happens.

Evaluating KNN for Classification

For evaluating the KNN classifier, we can also use the score method, but it executes a different metric since we are scoring a classifier and not a regressor. The basic metric for classification is accuracy - it describes how many predictions our classifier got right. The lowest accuracy value is 0 and the highest is 1. We usually multiply that value by 100 to obtain a percentage.

$$
accuracy = \frac{\text{number of correct predictions}}{\text{total number of predictions}}
$$

Note: It is extremely hard to obtain 100% accuracy on any real data, if that happens, be aware that some leakage or something wrong might be happening - there is no consensus on an ideal accuracy value and it is also context-dependent. Depending on the cost of error (how bad it is if we trust the classifier and it turns out to be wrong), an acceptable error rate might be 5%, 10% or even 30%.

Let's score our classifier:

acc =  classifier.score(X_test, y_test)
print(acc) # 0.6191860465116279

By looking at the resulting score, we can deduce that our classifier got ~62% of our classes right. This already helps in the analysis, although by only knowing what the classifier got right, it is difficult to improve it.

There are 4 classes in our dataset - what if our classifier got 90% of classes 1, 2, and 3 right, but only 30% of class 4 right?

A systemic failure of some class, as opposed to a balanced failure shared between classes can both yield a 62% accuracy score. Accuracy isn't a really good metric for actual evaluation - but does serve as a good proxy. More often than not, with balanced datasets, a 62% accuracy is relatively evenly spread. Also, more often than not, datasets aren't balanced, so we're back at square one with accuracy being an insufficient metric.

We can look deeper into the results using other metrics to be able to determine that. This step is also different from the regression, here we will use:

Confusion Matrix: To know how much we got right or wrong for each class. The values that were correct and correctly predicted are called true positives the ones that were predicted as positives but weren't positives are called false positives. The same nomenclature of true negatives and false negatives is used for negative values;
Precision: To understand what correct prediction values were considered correct by our classifier. Precision will divide those true positives values by anything that was predicted as a positive;

$$
precision = \frac{\text{true positive}}{\text{true positive} + \text{false positive}}
$$

Recall: to understand how many of the true positives were identified by our classifier. The recall is calculated by dividing the true positives by anything that should have been predicted as positive.

$$
recall = \frac{\text{true positive}}{\text{true positive} + \text{false negative}}
$$

F1 score: Is the balanced or harmonic mean of precision and recall. The lowest value is 0 and the highest is 1. When f1-score is equal to 1, it means all classes were correctly predicted - this is a very hard score to obtain with real data (exceptions almost always exist).

$$
\text{f1-score} = 2* \frac{\text{precision} * \text{recall}}{\text{precision} + \text{recall}}
$$

Note: A weighted F1 score also exists, and it's just an F1 that doesn't apply the same weight to all classes. The weight is typically dictated by the classes support - how many instances "support" the F1 score (the proportion of labels belonging to a certain class). The lower the support (the fewer instances of a class), the lower the weighted F1 for that class, because it's more unreliable.

The confusion_matrix() and classification_report() methods of the sklearn.metrics module can be used to calculate and display all these metrics. The confusion_matrix is better visualized using a heatmap. The classification report already gives us accuracy, precision, recall, and f1-score, but you could also import each of these metrics from sklearn.metrics.

To obtain metrics, execute the following snippet:

from sklearn.metrics import classification_report, confusion_matrix
#importing Seaborn's to use the heatmap 
import seaborn as sns

# Adding classes names for better interpretation
classes_names = ['class 1','class 2','class 3', 'class 4']
cm = pd.DataFrame(confusion_matrix(yc_test, yc_pred), 
                  columns=classes_names, index = classes_names)
                  
# Seaborn's heatmap to better visualize the confusion matrix
sns.heatmap(cm, annot=True, fmt='d');

print(classification_report(y_test, y_pred))

The output of the above script looks like this:

              precision    recall  f1-score   support

           1       0.75      0.78      0.76      1292
           2       0.49      0.56      0.53      1283
           3       0.51      0.51      0.51      1292
           4       0.76      0.62      0.69      1293

    accuracy                           0.62      5160
   macro avg       0.63      0.62      0.62      5160
weighted avg       0.63      0.62      0.62      5160

The results show that KNN was able to classify all the 5160 records in the test set with 62% accuracy, which is above average. The supports are fairly equal (even distribution of classes in the dataset), so the weighted F1 and unweighted F1 are going to be roughly the same.

We can also see the result of the metrics for each of the 4 classes. From that, we are able to notice that class 2 had the lowest precision, lowest recall, and lowest f1-score. Class 3 is right behind class 2 for having the lowest scores, and then, we have class 1 with the best scores followed by class 4.

By looking at the confusion matrix, we can see that:

class 1 was mostly mistaken for class 2 in 238 cases
class 2 for class 1 in 256 entries, and for class 3 in 260 cases
class 3 was mostly mistaken by class 2, 374 entries, and class 4, in 193 cases
class 4 was wrongly classified as class 3 for 339 entries, and as class 2 in 130 cases.

Also, notice that the diagonal displays the true positive values, when looking at it, it is plain to see that class 2 and class 3 have the least correctly predicted values.

With those results, we could go deeper into the analysis by further inspecting them to figure out why that happened, and also understanding if 4 classes are the best way to bin the data. Perhaps values from class 2 and class 3 were too close to each other, so it became hard to tell them apart.

Always try to test the data with a different number of bins to see what happens.

Besides the arbitrary number of data bins, there is also another arbitrary number that we have chosen, the number of K neighbors. The same technique we applied to the regression task can be applied to the classification when determining the number of Ks that maximize or minimize a metric value.

Finding the Best K for KNN Classification

Let's repeat what has been done for regression and plot the graph of K values and the corresponding metric for the test set. You can also choose which metric better fits your context, here, we will choose f1-score.

In this way, we will plot the f1-score for the predicted values of the test set for all the K values between 1 and 40.

First, we import the f1_score from sklearn.metrics and then calculate its value for all the predictions of a K-Nearest Neighbors classifier, where K ranges from 1 to 40:

from sklearn.metrics import f1_score

f1s = []

# Calculating f1 score for K values between 1 and 40
for i in range(1, 40):
    knn = KNeighborsClassifier(n_neighbors=i)
    knn.fit(X_train, y_train)
    pred_i = knn.predict(X_test)
    # using average='weighted' to calculate a weighted average for the 4 classes 
    f1s.append(f1_score(y_test, pred_i, average='weighted'))

The next step is to plot the f1_score values against K values. The difference from the regression is that instead of choosing the K value that minimizes the error, this time we will choose the value that maximizes the f1-score.

Execute the following script to create the plot:

plt.figure(figsize=(12, 6))
plt.plot(range(1, 40), f1s, color='red', linestyle='dashed', marker='o',
         markerfacecolor='blue', markersize=10)
plt.title('F1 Score K Value')
plt.xlabel('K Value')
plt.ylabel('F1 Score')

The output graph looks like this:

From the output, we can see that the f1-score is the highest when the value of the K is 15. Let's retrain our classifier with 15 neighbors and see what it does to our classification report results:

classifier15 = KNeighborsClassifier(n_neighbors=15)
classifier15.fit(X_train, y_train)
y_pred15 = classifier15.predict(X_test)
print(classification_report(y_test, y_pred15))

This outputs:

              precision    recall  f1-score   support

           1       0.77      0.79      0.78      1292
           2       0.52      0.58      0.55      1283
           3       0.51      0.53      0.52      1292
           4       0.77      0.64      0.70      1293

    accuracy                           0.63      5160
   macro avg       0.64      0.63      0.64      5160
weighted avg       0.64      0.63      0.64      5160

Notice that our metrics have improved with 15 neighbors, we have 63% accuracy and higher precision, recall, and f1-scores, but we still need to further look at the bins to try to understand why the f1-score for classes 2 and 3 is still low.

Besides using KNN for regression and determining block values and for classification, to determine block classes - we can also use KNN for detecting which mean blocks values are different from most - the ones that don't follow what most of the data is doing. In other words, we can use KNN for detecting outliers.

Implementing KNN for Outlier Detection with Scikit-Learn

Outlier detection uses another method that differs from what we had done previously for regression and classification.

Here, we will see how far each of the neighbors is from a data point. Let's use the default 5 neighbors. For a data point, we will calculate the distance to each of the K-nearest neighbors. To do that, we will import another KNN algorithm from Scikit-learn which is not specific for either regression or classification called simply NearestNeighbors.

After importing, we will instantiate a NearestNeighbors class with 5 neighbors - you can also instantiate it with 12 neighbors to identify outliers in our regression example or with 15, to do the same for the classification example. We will then fit our train data and use the kneighbors() method to find our calculated distances for each data point and neighbors indexes:

from sklearn.neighbors import NearestNeighbors

nbrs = NearestNeighbors(n_neighbors = 5)
nbrs.fit(X_train)
# Distances and indexes of the 5 neighbors 
distances, indexes = nbrs.kneighbors(X_train)

Now we have 5 distances for each data point - the distance between itself and its 5 neighbors, and an index that identifies them. Let's take a peek at the first three results and the shape of the array to visualize this better.

To look at the first three distances shape, execute:

distances[:3], distances.shape

(array([[0.        , 0.12998939, 0.15157687, 0.16543705, 0.17750354],
        [0.        , 0.25535314, 0.37100754, 0.39090243, 0.40619693],
        [0.        , 0.27149697, 0.28024623, 0.28112326, 0.30420656]]),
 (3, 5))

Observe that there are 3 rows with 5 distances each. We can also look and the neighbors' indexes:

indexes[:3], indexes[:3].shape

This results in:

(array([[    0,  8608, 12831,  8298,  2482],
        [    1,  4966,  5786,  8568,  6759],
        [    2, 13326, 13936,  3618,  9756]]),
 (3, 5))

In the output above, we can see the indexes of each of the 5 neighbors. Now, we can continue to calculate the mean of the 5 distances and plot a graph that counts each row on the X-axis and displays each mean distance on the Y-axis:

dist_means = distances.mean(axis=1)
plt.plot(dist_means)
plt.title('Mean of the 5 neighbors distances for each data point')
plt.xlabel('Count')
plt.ylabel('Mean Distances')

Notice that there is a part of the graph in which the mean distances have uniform values. That Y-axis point in which the means aren't too high or too low is exactly the point we need to identify to cut off the outlier values.

In this case, it is where the mean distance is 3. Let's plot the graph again with a horizontal dotted line to be able to spot it:

dist_means = distances.mean(axis=1)
plt.plot(dist_means)
plt.title('Mean of the 5 neighbors distances for each data point with cut-off line')
plt.xlabel('Count')
plt.ylabel('Mean Distances')
plt.axhline(y = 3, color = 'r', linestyle = '--')

This line marks the mean distance for which above it all values vary. This means that all points with a mean distance above 3 are our outliers. We can find out the indexes of those points using np.where(). This method will output either True or False for each index in regards to the mean above 3 condition:

import numpy as np

# Visually determine cutoff values > 3
outlier_index = np.where(dist_means > 3)
outlier_index

The above code outputs:

(array([  564,  2167,  2415,  2902,  6607,  8047,  8243,  9029, 11892,
        12127, 12226, 12353, 13534, 13795, 14292, 14707]),)

Now we have our outlier point indexes. Let's locate them in the dataframe:

# Filter outlier values
outlier_values = df.iloc[outlier_index]
outlier_values

This results in:

        MedInc 	HouseAge AveRooms 	AveBedrms 	Population 	AveOccup 	Latitude 	Longitude 	MedHouseVal
564 	4.8711 	27.0 	 5.082811 	0.944793 	1499.0 	    1.880803 	37.75 		-122.24 	2.86600
2167 	2.8359 	30.0 	 4.948357 	1.001565 	1660.0 	    2.597809 	36.78 		-119.83 	0.80300
2415 	2.8250 	32.0 	 4.784232 	0.979253 	761.0 	    3.157676 	36.59 		-119.44 	0.67600
2902 	1.1875 	48.0 	 5.492063 	1.460317 	129.0 	    2.047619 	35.38 		-119.02 	0.63800
6607 	3.5164 	47.0 	 5.970639 	1.074266 	1700.0 	    2.936097 	34.18 		-118.14 	2.26500
8047 	2.7260 	29.0 	 3.707547 	1.078616 	2515.0 	    1.977201 	33.84 		-118.17 	2.08700
8243 	2.0769 	17.0 	 3.941667 	1.211111 	1300.0 	    3.611111 	33.78 		-118.18 	1.00000
9029 	6.8300 	28.0 	 6.748744 	1.080402 	487.0 		2.447236 	34.05 		-118.78 	5.00001
11892 	2.6071 	45.0 	 4.225806 	0.903226 	89.0 		2.870968 	33.99 		-117.35 	1.12500
12127 	4.1482 	7.0 	 5.674957 	1.106998 	5595.0 		3.235975 	33.92 		-117.25 	1.24600
12226 	2.8125 	18.0 	 4.962500 	1.112500 	239.0 		2.987500 	33.63 		-116.92 	1.43800
12353 	3.1493 	24.0 	 7.307323 	1.460984 	1721.0 		2.066026 	33.81 		-116.54 	1.99400
13534 	3.7949 	13.0 	 5.832258 	1.072581 	2189.0 		3.530645 	34.17 		-117.33 	1.06300
13795 	1.7567 	8.0 	 4.485173 	1.120264 	3220.0 		2.652389 	34.59 		-117.42 	0.69500
14292 	2.6250 	50.0 	 4.742236 	1.049689 	728.0 		2.260870 	32.74 		-117.13 	2.03200
14707 	3.7167 	17.0 	 5.034130 	1.051195 	549.0 		1.873720 	32.80 		-117.05 	1.80400

Our outlier detection is finished. This is how we spot each data point that deviates from the general data trend. We can see that there are 16 points in our train data that should be further looked at, investigated, maybe treated, or even removed from our data (if they were erroneously input) to improve results. Those points might have resulted from typing errors, mean block values inconsistencies, or even both.

Pros and Cons of KNN

In this section, we'll present some of the pros and cons of using the KNN algorithm.

Pros

It is easy to implement
It is a lazy learning algorithm and therefore doesn't require training on all data points (only using the K-Nearest neighbors to predict). This makes the KNN algorithm much faster than other algorithms that require training with the whole dataset such as Support Vector Machines, linear regression, etc.
Since KNN requires no training before making predictions, new data can be added seamlessly
There are only two parameters required to work with KNN, i.e. the value of K and the distance function

Cons

The KNN algorithm doesn't work well with high dimensional data because with a large number of dimensions, the distance between points gets "weird", and the distance metrics we use don't hold up
Finally, the KNN algorithm doesn't work well with categorical features since it is difficult to find the distance between dimensions with categorical features

Conclusion

KNN is a simple yet powerful algorithm. It can be used for many tasks such as regression, classification, or outlier detection.

KNN has been widely used to find document similarity and pattern recognition. It has also been employed for developing recommender systems and for dimensionality reduction and pre-processing steps for computer vision - particularly face recognition tasks.

In this guide - we've gone through regression, classification and outlier detection using Scikit-Learn's implementation of the K-Nearest Neighbor algorithm.

# python # machine learning # scikit-learn # algorithms

Last Updated: November 16th, 2023

Was this article helpful?

Improve your dev skills!

Get tutorials, guides, and dev jobs in your inbox.

No spam ever. Unsubscribe at any time. Read our Privacy Policy.

Cássia SampaioAuthor

Data Scientist, Research Software Engineer, and teacher. Cassia is passionate about transformative processes in data, technology and life. She is graduated in Philosophy and Information Systems, with a Strictu Sensu Master's Degree in the field of Foundations Of Mathematics.

David Landup

Editor

Dimitrije Stamenic

Editor

Free

Monitor with Ping Bot

# monitoring

# uptime

# observability

Reliable monitoring for your app, databases, infrastructure, and the vendors they rely on. Ping Bot is a powerful uptime and performance monitoring tool that helps notify you and resolve issues before they affect your customers.

Learn more

Free

Vendor Alerts with Ping Bot

# monitoring

# uptime

# observability

Get detailed incident alerts about the status of your favorite vendors. Don't learn about downtime from your customers, be the first to know with Ping Bot.