Introduction
This article is an introduction to kernel density estimation using Python's machine learning library scikit-learn.
Kernel density estimation (KDE) is a non-parametric method for estimating the probability density function of a given random variable. It is also referred to by its traditional name, the Parzen-Rosenblatt Window method, after its discoverers.
Given a sample of independent, identically distributed (i.i.d.) observations \((x_1,x_2,\ldots,x_n)\) of a random variable from an unknown source distribution, the kernel density estimate is given by:
$$
p(x) = \frac{1}{nh} \sum_{j=1}^{n} K\left(\frac{x-x_j}{h}\right)
$$
where \(K(a)\) is the kernel function and \(h\) is the smoothing parameter, also called the bandwidth. Various kernels are discussed later in this article, but just to understand the math, let's take a look at a simple example.
Example Computation
Suppose we have the sample points [-2, -1, 0, 1, 2], a linear kernel given by \(K(a) = 1 - |a|\) for \(|a| \le 1\) (and 0 otherwise), and \(h = 10\).
Plugging these values into the formula for \(p(x)\), the scaled distances \(\frac{0 - x_j}{h}\) are \(0.2, 0.1, 0, -0.1, -0.2\), which give kernel values of \(0.8, 0.9, 1, 0.9, 0.8\):
$$
p(0) = \frac{1}{(5)(10)} ( 0.8+0.9+1+0.9+0.8 ) = 0.088
$$
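As a quick check, the same computation can be reproduced in a few lines of NumPy. This is just a minimal sketch of the formula above; the helper kernel() is the linear kernel from the example and is not part of any library:

import numpy as np

samples = np.array([-2, -1, 0, 1, 2])
h = 10

def kernel(a):
    # Linear (triangular) kernel: 1 - |a| for |a| <= 1, else 0
    return np.where(np.abs(a) <= 1, 1 - np.abs(a), 0)

# p(0) = 1/(n*h) * sum_j K((0 - x_j)/h)
p_0 = np.sum(kernel((0 - samples) / h)) / (len(samples) * h)
print(p_0)  # 0.088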
Kernel Density Estimation Using Python
While there are several ways of computing the kernel density estimate in Python, we'll use the popular machine learning library scikit-learn for this purpose. Import the following libraries in your code:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import KernelDensity
from sklearn.model_selection import GridSearchCV
Synthetic Data
To demonstrate kernel density estimation, synthetic data is generated from two different types of distributions. One is an asymmetric log-normal distribution and the other is a Gaussian distribution. The following function returns 2000 data points:
def generate_data(seed=17):
    # Fix the seed to reproduce the results
    rand = np.random.RandomState(seed)
    x = []
    dat = rand.lognormal(0, 0.3, 1000)
    x = np.concatenate((x, dat))
    dat = rand.normal(3, 1, 1000)
    x = np.concatenate((x, dat))
    return x
The code below stores the points in x_train. We can either make a scatter plot of these points along the y-axis or we can generate a histogram of these points.
x_train = generate_data()[:, np.newaxis]
fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(10, 5))
plt.subplot(121)
plt.scatter(np.arange(len(x_train)), x_train, c='red')
plt.xlabel('Sample no.')
plt.ylabel('Value')
plt.title('Scatter plot')
plt.subplot(122)
plt.hist(x_train, bins=50)
plt.title('Histogram')
fig.subplots_adjust(wspace=.3)
plt.show()
Using Scikit-Learn's KernelDensity
To find the shape of the estimated density function, we can generate a set of points equidistant from each other and estimate the kernel density at each point. The test points are given by:
x_test = np.linspace(-1, 7, 2000)[:, np.newaxis]
Now we will create a KernelDensity object, fit it to the training data, and use the score_samples() method to compute the log-density of each test point, as shown in the code below. The KernelDensity() constructor uses two default parameters, i.e. kernel='gaussian' and bandwidth=1.0.
model = KernelDensity()
model.fit(x_train)
log_dens = model.score_samples(x_test)
The shape of the distribution can be viewed by plotting the estimated density at each test point. Since score_samples() returns the log of the density, we apply np.exp() before plotting:
plt.fill(x_test, np.exp(log_dens), c='cyan')
plt.show()
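As a quick sanity check (a minimal sketch using the trapezoidal rule), the estimated density should integrate to approximately 1 over the range of the test points:

# Numerically integrate the estimated density over the test range
print(np.trapz(np.exp(log_dens), x_test[:, 0]))  # should be close to 1.0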
Understanding the Bandwidth Parameter
The previous example is not a very impressive estimate of the density function, which is mainly due to the default parameters. Let's experiment with different values of bandwidth to see how they affect density estimation.
bandwidths = [0.01, 0.05, 0.1, 0.5, 1, 4]
fig, ax = plt.subplots(nrows=2, ncols=3, figsize=(10, 7))
plt_ind = np.arange(6) + 231

for b, ind in zip(bandwidths, plt_ind):
    kde_model = KernelDensity(kernel='gaussian', bandwidth=b)
    kde_model.fit(x_train)
    score = kde_model.score_samples(x_test)

    plt.subplot(ind)
    plt.fill(x_test, np.exp(score), c='cyan')
    plt.title("h=" + str(b))

fig.subplots_adjust(hspace=0.5, wspace=.3)
plt.show()
We can clearly see that increasing the bandwidth results in a smoother estimate. Very small bandwidth values result in spiky and jittery curves, while very high values result in a very generalized smooth curve that misses out on important details. It is important to select a balanced value for this parameter.
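Before turning to cross-validation, it is worth mentioning the classical rule-of-thumb bandwidths for Gaussian kernels, such as Silverman's rule. The sketch below computes it by hand with NumPy (recent scikit-learn versions also accept bandwidth='silverman' directly, but that is version-dependent):

n = len(x_train)
sigma = np.std(x_train)
iqr = np.subtract(*np.percentile(x_train, [75, 25]))
# Silverman's rule of thumb: h = 0.9 * min(std, IQR/1.34) * n^(-1/5)
h_silverman = 0.9 * min(sigma, iqr / 1.34) * n ** (-1 / 5)
print("Silverman's rule of thumb bandwidth: {:.2f}".format(h_silverman))

Because this rule assumes roughly Gaussian data, it is only a rough starting point for our bimodal sample; cross-validation, discussed next, tunes the bandwidth directly on the data.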
Tuning the Bandwidth Parameter
The scikit-learn library allows tuning the bandwidth parameter via cross-validation and returns the parameter value that maximizes the log-likelihood of the data. The function we can use for this is GridSearchCV(), which requires a grid of candidate values for the bandwidth parameter.
bandwidth = np.arange(0.05, 2, .05)
kde = KernelDensity(kernel='gaussian')
grid = GridSearchCV(kde, {'bandwidth': bandwidth})
grid.fit(x_train)
The best model can be retrieved by using the best_estimator_ field of the GridSearchCV object.
Let's look at the optimal kernel density estimate using the Gaussian kernel and print the value of bandwidth as well:
kde = grid.best_estimator_
log_dens = kde.score_samples(x_test)
plt.fill(x_test, np.exp(log_dens), c='green')
plt.title('Optimal estimate with Gaussian kernel')
plt.show()
print("optimal bandwidth: " + "{:.2f}".format(kde.bandwidth))
optimal bandwidth: 0.15
Now, this density estimate seems to model the data very well. The first half of the plot is in agreement with the log-normal distribution and the second half of the plot models the normal distribution quite well.
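Since the data is synthetic, we can back up this observation by overlaying the true density, which is an equal-weight mixture of the log-normal and normal components used in generate_data(). The sketch below assumes SciPy is available; it is not used elsewhere in this article:

from scipy.stats import lognorm, norm

# True mixture density: 0.5 * LogNormal(0, 0.3) + 0.5 * Normal(3, 1)
true_dens = 0.5 * lognorm(s=0.3).pdf(x_test[:, 0]) + 0.5 * norm(loc=3, scale=1).pdf(x_test[:, 0])

plt.fill(x_test, np.exp(log_dens), color='green', alpha=0.5, label='KDE')
plt.plot(x_test[:, 0], true_dens, color='black', label='True density')
plt.legend()
plt.show()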
Different Kernels for Density Estimation
scikit-learn allows kernel density estimation using different kernel functions:

kernel = 'cosine': \(K(a;h) \propto \cos (\frac{\pi a}{2h}) \text { if } |a| < h \)
kernel = 'epanechnikov': \(K(a;h) \propto 1 - \frac{a^2}{h^2} \text { if } |a| < h \)
kernel = 'exponential': \(K(a;h) \propto \exp (-\frac{|a|}{h})\)
kernel = 'gaussian': \(K(a;h) \propto \exp(-\frac{a^2}{2h^2})\)
kernel = 'linear': \(K(a;h) \propto 1 - \frac{|a|}{h} \text { if } |a| < h \)
kernel = 'tophat': \(K(a;h) \propto 1 \text { if } |a| < h \)
A simple way to understand how these kernels work is to plot them. This means building a model from a sample containing only a single value, for example, 0. Next, estimate the density at all points around zero and plot the density along the y-axis. The code below shows the entire process:
kernels = ['cosine', 'epanechnikov', 'exponential', 'gaussian', 'linear', 'tophat']
fig, ax = plt.subplots(nrows=2, ncols=3, figsize=(10, 7))
plt_ind = np.arange(6) + 231

for k, ind in zip(kernels, plt_ind):
    kde_model = KernelDensity(kernel=k)
    kde_model.fit([[0]])
    score = kde_model.score_samples(np.arange(-2, 2, 0.1)[:, None])

    plt.subplot(ind)
    plt.fill(np.arange(-2, 2, 0.1)[:, None], np.exp(score), c='blue')
    plt.title(k)

fig.subplots_adjust(hspace=0.5, wspace=.3)
plt.show()
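To connect these kernel plots back to the formula at the start of the article, the Gaussian kernel estimate can also be computed by hand with NumPy and compared with score_samples(). This is just a sanity-check sketch for the one-dimensional case, with an arbitrarily chosen bandwidth and query point:

h = 0.5
kde_check = KernelDensity(kernel='gaussian', bandwidth=h).fit(x_train)

# Evaluate p(x) = 1/(n*h) * sum_j K((x - x_j)/h) with a standard normal kernel K
x0 = 2.0
a = (x0 - x_train[:, 0]) / h
p_manual = np.sum(np.exp(-a ** 2 / 2) / np.sqrt(2 * np.pi)) / (len(x_train) * h)

print(p_manual, np.exp(kde_check.score_samples([[x0]]))[0])  # the two values should agree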
Experimenting With Different Kernels
Let's experiment with different kernels and see how they estimate the probability density function for our synthetic data.
We can use GridSearchCV(), as before, to find the optimal bandwidth value. However, for the cosine, linear, and tophat kernels GridSearchCV() might give a runtime warning due to some scores resulting in -inf values. One possible way to address this issue is to write a custom scoring function for GridSearchCV().
In the code below, -inf scores for test points are omitted in the my_scores() custom scoring function and a mean value is returned. This is not necessarily the best scheme to handle -inf score values; some other strategy can be adopted, depending upon the data in question.
def my_scores(estimator, X):
    scores = estimator.score_samples(X)
    # Remove -inf
    scores = scores[scores != float('-inf')]
    # Return the mean values
    return np.mean(scores)
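To see where the -inf scores come from, consider a tophat kernel with a very small bandwidth: any test point farther than the bandwidth from every training point has zero estimated density, and its log-density is therefore -inf. A quick illustrative check (the bandwidth of 0.05 is chosen arbitrarily):

kde_tophat = KernelDensity(kernel='tophat', bandwidth=0.05).fit(x_train)
# Count the test points whose log-density is -inf
print("Test points with -inf score:", np.isinf(kde_tophat.score_samples(x_test)).sum())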
kernels = ['cosine', 'epanechnikov', 'exponential', 'gaussian', 'linear', 'tophat']
fig, ax = plt.subplots(nrows=2, ncols=3, figsize=(10, 7))
plt_ind = np.arange(6) + 231
h_vals = np.arange(0.05, 1, .1)

for k, ind in zip(kernels, plt_ind):
    grid = GridSearchCV(KernelDensity(kernel=k),
                        {'bandwidth': h_vals},
                        scoring=my_scores)
    grid.fit(x_train)

    kde = grid.best_estimator_
    log_dens = kde.score_samples(x_test)

    plt.subplot(ind)
    plt.fill(x_test, np.exp(log_dens), c='cyan')
    plt.title(k + " h=" + "{:.2f}".format(kde.bandwidth))

fig.subplots_adjust(hspace=.5, wspace=.3)
plt.show()
The Final Optimized Model
The above example shows how different kernels estimate the density in different ways. One final step is to set up GridSearchCV() so that it discovers not only the optimal bandwidth, but also the optimal kernel for our example data. Here is the final code, which also plots the final density estimate and its tuned parameters in the plot title:
grid = GridSearchCV(KernelDensity(),
                    {'bandwidth': h_vals, 'kernel': kernels},
                    scoring=my_scores)
grid.fit(x_train)
best_kde = grid.best_estimator_
log_dens = best_kde.score_samples(x_test)
plt.fill(x_test, np.exp(log_dens), c='green')
plt.title("Best Kernel: " + best_kde.kernel+" h="+"{:.2f}".format(best_kde.bandwidth))
plt.show()
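As a side note, a fitted KernelDensity model can also generate new points that follow the estimated density via its sample() method. In scikit-learn this is implemented only for the 'gaussian' and 'tophat' kernels, so the sketch below assumes the grid search selected one of those (otherwise it raises NotImplementedError):

# Draw new synthetic points from the fitted density (gaussian/tophat kernels only)
new_points = best_kde.sample(1000, random_state=0)
plt.hist(new_points[:, 0], bins=50)
plt.title('Samples drawn from the fitted KDE')
plt.show()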
Conclusion
This article discussed kernel density estimation using scikit-learn's sklearn.neighbors module. The examples used univariate data; however, the method can also be applied to data with multiple dimensions.
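As a brief sketch of the multi-dimensional case (using made-up two-dimensional Gaussian data, not the data from the example above), KernelDensity simply accepts samples as an (n_samples, n_features) array:

# Hypothetical 2D example: fit a KDE to correlated Gaussian data
rand = np.random.RandomState(42)
x_2d = rand.multivariate_normal([0, 0], [[1, 0.5], [0.5, 1]], size=500)
kde_2d = KernelDensity(kernel='gaussian', bandwidth=0.5).fit(x_2d)
print(kde_2d.score_samples([[0, 0], [2, 2]]))  # log-density at two query points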
While kernel density estimation is an intuitive and simple way to estimate the density of an unknown source distribution, data scientists should use it with caution, as the curse of dimensionality can slow it down considerably.