Kernel Density Estimation in Python Using Scikit-Learn

Introduction

This article is an introduction to kernel density estimation using Python's machine learning library scikit-learn.

Kernel density estimation (KDE) is a non-parametric method for estimating the probability density function of a given random variable. It is also referred to by its traditional name, the Parzen-Rosenblatt Window method, after its discoverers.

Given a sample of independent, identically distributed (i.i.d) observations \((x_1,x_2,\ldots,x_n)\) of a random variable from an unknown source distribution, the kernel density estimate, is given by:

$$
p(x) = \frac{1}{nh} \Sigma_{j=1}^{n}K(\frac{x-x_j}{h})
$$

where \(K(a)\) is the kernel function and \(h\) is the smoothing parameter, also called the bandwidth. Various kernels are discussed later in this article, but just to understand the math, let's take a look at a simple example.

Example Computation

Suppose we have the sample points [-2,-1,0,1,2], with a linear kernel given by: \(K(a)= 1-\frac{|a|}{h}\) and \(h=10\).

x j = [ 2 1 0 1 2 ] | 0 x j | = [ 2 1 0 1 2 ] | 0 x j h | = [ 0.2 0.1 0 0.1 0.2 ] K ( | 0 x j h | ) = [ 0.8 0.9 1 0.9 0.8 ]

Plug the above in the formula for \(p(x)\):

$$
p(0) = \frac{1}{(5)(10)} ( 0.8+0.9+1+0.9+0.8 ) = 0.088
$$

Kernel Density Estimation Using Python

While there are several ways of computing the kernel density estimate in Python, we'll use the popular machine learning library scikit-learn for this purpose. Import the following libraries in your code:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import KernelDensity
from sklearn.model_selection import GridSearchCV

Synthetic Data

To demonstrate kernel density estimation, synthetic data is generated from two different types of distributions. One is an asymmetric log-normal distribution and the other one is a Gaussian distribution. The following function returns 2000 data points:

def generate_data(seed=17):
    # Fix the seed to reproduce the results
    rand = np.random.RandomState(seed)
    x = []
    dat = rand.lognormal(0, 0.3, 1000)
    x = np.concatenate((x, dat))
    dat = rand.normal(3, 1, 1000)
    x = np.concatenate((x, dat))
    return x

The code below stores the points in x_train. We can either make a scatter plot of these points along the y-axis or we can generate a histogram of these points.

x_train = generate_data()[:, np.newaxis]
fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(10, 5))
plt.subplot(121)
plt.scatter(np.arange(len(x_train)), x_train, c='red')
plt.xlabel('Sample no.')
plt.ylabel('Value')
plt.title('Scatter plot')
plt.subplot(122)
plt.hist(x_train, bins=50)
plt.title('Histogram')
fig.subplots_adjust(wspace=.3)
plt.show()

scatter plot and histogram estimation

Using Scikit-Learn's KernelDensity

To find the shape of the estimated density function, we can generate a set of points equidistant from each other and estimate the kernel density at each point. The test points are given by:

x_test = np.linspace(-1, 7, 2000)[:, np.newaxis]

Now we will create a KernelDensity object and use the fit() method to find the score of each sample as shown in the code below. The KernelDensity() method uses two default parameters, i.e. kernel=gaussian and bandwidth=1.

model = KernelDensity()
model.fit(x_train)
log_dens = model.score_samples(x_test)

The shape of the distribution can be viewed by plotting the density score for each point, as given below:

plt.fill(x_test, np.exp(log_dens), c='cyan')
plt.show()

density score

Understanding the Bandwidth Parameter

The previous example is not a very impressive estimate of the density function, attributed mainly to the default parameters. Let's experiment with different values of bandwidth to see how it affects density estimation.

bandwidths = [0.01, 0.05, 0.1, 0.5, 1, 4]
fig, ax = plt.subplots(nrows=2, ncols=3, figsize=(10, 7))
plt_ind = np.arange(6) + 231

for b, ind in zip(bandwidths, plt_ind):
    kde_model = KernelDensity(kernel='gaussian', bandwidth=b)
    kde_model.fit(x_train)
    score = kde_model.score_samples(x_test)
    plt.subplot(ind)
    plt.fill(x_test, np.exp(score), c='cyan')
    plt.title("h="+str(b))

fig.subplots_adjust(hspace=0.5, wspace=.3)
plt.show()

different bandwidth estimation

We can clearly see that increasing the bandwidth results in a smoother estimate. Very small bandwidth values result in spiky and jittery curves, while very high values result in a very generalized smooth curve that misses out on important details. It is important to select a balanced value for this parameter.

Tuning the Bandwidth Parameter

The scikit-learn library allows the tuning of the bandwidth parameter via cross-validation and returns the parameter value that maximizes the log-likelihood of data. The function we can use to achieve this is GridSearchCV(), which requires different values of the bandwidth parameter.

bandwidth = np.arange(0.05, 2, .05)
kde = KernelDensity(kernel='gaussian')
grid = GridSearchCV(kde, {'bandwidth': bandwidth})
grid.fit(x_train)

The best model can be retrieved by using the best_estimator_ field of the GridSearchCV object.

Let's look at the optimal kernel density estimate using the Gaussian kernel and print the value of bandwidth as well:

kde = grid.best_estimator_
log_dens = kde.score_samples(x_test)
plt.fill(x_test, np.exp(log_dens), c='green')
plt.title('Optimal estimate with Gaussian kernel')
plt.show()
print("optimal bandwidth: " + "{:.2f}".format(kde.bandwidth))

gaussian best estimator

optimal bandwidth: 0.15

Now, this density estimate seems to model the data very well. The first half of the plot is in agreement with the log-normal distribution and the second half of the plot models the normal distribution quite well.

Different Kernels for Density Estimation

scikit-learn allows kernel density estimation using different kernel functions:

  1. kernel ='cosine':      \(K(a;h) \propto \cos (\frac{\pi a}{2h}) \text { if } |a| < h \)
  2. kernel = 'epanechnikov':   \(K(a;h) \propto 1 - \frac{a^2}{h^2}\)
  3. kernel = 'exponential':     \(K(a;h) \propto \exp (-\frac{|a|}{h})\)
  4. kernel = 'gaussian':     \(K(a;h) \propto \exp(-\frac{a^2}{2h^2})\)
  5. kernel = 'linear':      \(K(a;h) \propto 1 - \frac{|a|}{h} \text { if } |a| < h \)
  6. kernel = 'tophat':      \(K(a;h) \propto 1 \text { if } |a| < h \)

A simple way to understand the way these kernels work is to plot them. This means building a model using a sample of only one value, for example, 0. Next, estimate the density of all points around zero and plot the density along the y-axis. The code below shows the entire process:

kernels = ['cosine', 'epanechnikov', 'exponential', 'gaussian', 'linear', 'tophat']
fig, ax = plt.subplots(nrows=2, ncols=3, figsize=(10, 7))
plt_ind = np.arange(6) + 231

for k, ind in zip(kernels, plt_ind):
    kde_model = KernelDensity(kernel=k)
    kde_model.fit([[0]])
    score = kde_model.score_samples(np.arange(-2, 2, 0.1)[:, None])
    plt.subplot(ind)
    plt.fill(np.arange(-2, 2, 0.1)[:, None], np.exp(score), c='blue')
    plt.title(k)

fig.subplots_adjust(hspace=0.5, wspace=.3)
plt.show()

different kernels

Experimenting With Different Kernels

Let's experiment with different kernels and see how they estimate the probability density function for our synthetic data.

We can use GridSearchCV(), as before, to find the optimal bandwidth value. However, for cosine, linear, and tophat kernels GridSearchCV() might give a runtime warning due to some scores resulting in -inf values. One possible way to address this issue is to write a custom scoring function for GridSearchCV().

In the code below, -inf scores for test points are omitted in the my_scores() custom scoring function and a mean value is returned. This is not necessarily the best scheme to handle -inf score values and some other strategy can be adopted, depending upon the data in question.

def my_scores(estimator, X):
    scores = estimator.score_samples(X)
    # Remove -inf
    scores = scores[scores != float('-inf')]
    # Return the mean values
    return np.mean(scores)

kernels = ['cosine', 'epanechnikov', 'exponential', 'gaussian', 'linear', 'tophat']
fig, ax = plt.subplots(nrows=2, ncols=3, figsize=(10, 7))
plt_ind = np.arange(6) + 231
h_vals = np.arange(0.05, 1, .1)

for k, ind in zip(kernels, plt_ind):
    grid = GridSearchCV(KernelDensity(kernel=k),
                        {'bandwidth': h_vals},
                        scoring=my_scores)
    grid.fit(x_train)
    kde = grid.best_estimator_
    log_dens = kde.score_samples(x_test)
    plt.subplot(ind)
    plt.fill(x_test, np.exp(log_dens), c='cyan')
    plt.title(k + " h=" + "{:.2f}".format(kde.bandwidth))

fig.subplots_adjust(hspace=.5, wspace=.3)
plt.show()

different kernels custom scoring

The Final Optimized Model

The above example shows how different kernels estimate the density in different ways. One final step is to set up GridSearchCV() so that it not only discovers the optimum bandwidth, but also the optimal kernel for our example data. Here is the final code that also plots the final density estimate and its tuned parameters in the plot title:

grid = GridSearchCV(KernelDensity(),
                    {'bandwidth': h_vals, 'kernel': kernels},
                    scoring=my_scores)
grid.fit(x_train)
best_kde = grid.best_estimator_
log_dens = best_kde.score_samples(x_test)
plt.fill(x_test, np.exp(log_dens), c='green')
plt.title("Best Kernel: " + best_kde.kernel+" h="+"{:.2f}".format(best_kde.bandwidth))
plt.show()

optimized kernel density model

Conclusion

Kernel density estimation using scikit-learn's library sklearn.neighbors has been discussed in this article. The examples are given for univariate data, however it can also be applied to data with multiple dimensions.

While being an intuitive and simple way for density estimation for unknown source distributions, a data scientist should use it with caution as the curse of dimensionality can slow it down considerably.

Author image
Pakistan
I am an educator and I love mathematics and data science!