Calculating Variance and Standard Deviation in Python

Introduction

Two closely related statistical measures will allow us to get an idea of the spread or dispersion of our data. The first measure is the variance, which measures how far from their mean the individual observations in our data are. The second is the standard deviation, which is the square root of the variance and measures the amount of variation or dispersion of a dataset.

In this tutorial, we'll learn how to calculate the variance and the standard deviation in Python. We'll first code a Python function for each measure and later, we'll learn how to use the Python statistics module to accomplish the same task quickly.

With this knowledge, we'll be able to take a first look at our datasets and get a quick idea of the general dispersion of our data.

Calculating the Variance

In statistics, the variance is a measure of how far individual (numeric) values in a dataset are from the mean or average value. The variance is often used to quantify spread or dispersion. Spread is a characteristic of a sample or population that describes how much variability there is in it.

A high variance tells us that the values in our dataset are far from their mean. So, our data will have high levels of variability. On the other hand, a low variance tells us that the values are quite close to the mean. In this case, the data will have low levels of variability.

To calculate the variance in a dataset, we first need to find the difference between each individual value and the mean. The variance is the average of the squares of those differences. We can express the variance with the following math expression:

$$
\sigma^2 = \frac{1}{n}{\sum_{i=0}^{n-1}{(x_i - \mu)^2}}
$$

In this equation, xᵢ stands for the individual values or observations in a dataset, μ stands for the mean or average of those values, and n is the number of values in the dataset.

The term xᵢ - μ is called the deviation from the mean. So, the variance is the mean of the squared deviations. That's why we denote it as σ².

Say we have a dataset [3, 5, 2, 7, 1, 3]. To find its variance, we need to calculate the mean which is:

$$
(3 + 5 + 2 + 7 + 1 + 3) / 6 = 3.5
$$

Then, we need to calculate the sum of the squared deviations from the mean over all the observations. Here's how:

$$
(3 - 3.5)^2 + (5 - 3.5)^2 + (2 - 3.5)^2 + (7 - 3.5)^2 + (1 - 3.5)^2 + (3 - 3.5)^2 = 23.5
$$

To find the variance, we just need to divide this result by the number of observations like this:

$$
23.5 / 6 = 3.916666667
$$
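
The hand calculation above can be double-checked with a few lines of Python. This is only a sketch of the three steps. The variable names (`data`, `squared_deviations`, `total`, and `population_variance`) are chosen here for illustration:

```python
# Dataset from the worked example above
data = [3, 5, 2, 7, 1, 3]

# Step 1: the mean of the observations
mean = sum(data) / len(data)

# Step 2: the sum of the squared deviations from the mean
squared_deviations = [(x - mean) ** 2 for x in data]
total = sum(squared_deviations)

# Step 3: divide by the number of observations
population_variance = total / len(data)

print(mean)                           # 3.5
print(total)                          # 23.5
print(round(population_variance, 9))  # 3.916666667
```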

That's all. The variance of our data is 3.916666667. The variance can be difficult to interpret, mainly because of its unusual units.

For example, if the observations in our dataset are measured in pounds, then the variance will be measured in square pounds. So, we can say that the average squared distance of the observations from the mean 3.5 is 3.916666667 square pounds. Fortunately, the standard deviation fixes this problem, but that's the topic of a later section.

If we apply the concept of variance to a dataset, then we can distinguish between the sample variance and the population variance. The population variance is the variance that we saw before, and we can calculate it using the data from the full population and the expression for σ².

The sample variance is denoted as S², and we can calculate it using a sample from a given population and the following expression:

$$
S^2 = \frac{1}{n}{\sum_{i=0}^{n-1}{(x_i - X)^2}}
$$

This expression is quite similar to the expression for calculating σ², but in this case, xᵢ represents individual observations in the sample and X is the mean of the sample.

S² is commonly used to estimate the variance of a population (σ²) using a sample of data. However, S² systematically underestimates the population variance. For that reason, it's referred to as a biased estimator of the population variance.

When we have a large sample, S² can be an adequate estimator of σ². For small samples, it tends to be too low. Fortunately, there is another simple statistic that we can use to better estimate σ². Here's its equation:

$$
S^2_{n-1} = \frac{1}{n-1}{\sum_{i=0}^{n-1}{(x_i - X)^2}}
$$

This looks quite similar to the previous expression. It's still based on the squared deviations from the sample mean, but in this case, we divide by n - 1 instead of by n. This is called Bessel's correction. With Bessel's correction, S²ₙ₋₁ is an unbiased estimator of the population variance. So, in practice, we'll use this equation to estimate the variance of a population using a sample of data. Note that S²ₙ₋₁ is also known as the variance with n - 1 degrees of freedom.
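
The effect of Bessel's correction can be checked on a tiny example. The sketch below assumes that samples of size 2 are drawn with replacement from the six-value dataset used earlier, treated here as the whole population. It enumerates every possible sample and averages both estimators. The helper `var()` and the use of `fractions.Fraction` for exact arithmetic are choices made for this illustration only:

```python
from fractions import Fraction
from itertools import product


def var(data, ddof=0):
    """Variance with divisor n - ddof, computed exactly with fractions."""
    n = len(data)
    mean = Fraction(sum(data), n)
    return sum((x - mean) ** 2 for x in data) / (n - ddof)


# Treat the earlier dataset as the full population (an assumption of this sketch)
population = [3, 5, 2, 7, 1, 3]

# Every ordered sample of size 2, drawn with replacement (36 samples)
samples = list(product(population, repeat=2))

# Average of each estimator over all possible samples
mean_biased = sum(var(s) for s in samples) / len(samples)
mean_unbiased = sum(var(s, ddof=1) for s in samples) / len(samples)

print(var(population))  # 47/12, the population variance
print(mean_biased)      # 47/24, too low on average
print(mean_unbiased)    # 47/12, matches the population variance
```

Averaged over all samples, dividing by n recovers only half of the population variance for samples of size 2, while dividing by n - 1 recovers it exactly.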

Now that we've learned how to calculate the variance using its math expression, it's time to get into action and calculate the variance using Python.

Coding a variance() Function in Python

To calculate the variance, we're going to code a Python function called variance(). This function will take some data and return its variance. Inside variance(), we're going to calculate the mean of the data and the squared deviations from the mean. Finally, we're going to calculate the variance by finding the average of the squared deviations.

Here's a possible implementation for variance():

>>> def variance(data):
...     # Number of observations
...     n = len(data)
...     # Mean of the data
...     mean = sum(data) / n
...     # Square deviations
...     deviations = [(x - mean) ** 2 for x in data]
...     # Variance
...     variance = sum(deviations) / n
...     return variance
...

>>> variance([4, 8, 6, 5, 3, 2, 8, 9, 2, 5])
5.76

We first calculate the number of observations (n) in our data using the built-in function len(). Then, we calculate the mean of the data, dividing the total sum of the observations by the number of observations.

The next step is to calculate the square deviations from the mean. To do that, we use a list comprehension that creates a list of square deviations using the expression (x - mean) ** 2 where x stands for every observation in our data.

Finally, we calculate the variance by summing the squared deviations and dividing that sum by the number of observations n.

In this case, variance() will calculate the population variance because we're using n instead of n - 1 to calculate the mean of the deviations. If we're working with a sample and we want to estimate the variance of the population, then we'll need to update the expression variance = sum(deviations) / n to variance = sum(deviations) / (n - 1).

We can refactor our function to make it more concise and efficient. Here's an example:

>>> def variance(data, ddof=0):
...     n = len(data)
...     mean = sum(data) / n
...     return sum((x - mean) ** 2 for x in data) / (n - ddof)
...

>>> variance([4, 8, 6, 5, 3, 2, 8, 9, 2, 5])
5.76

>>> variance([4, 8, 6, 5, 3, 2, 8, 9, 2, 5], ddof=1)
6.4

In this case, we remove some intermediate steps and temporary variables like deviations and variance. We also turn the list comprehension into a generator expression, which is much more efficient in terms of memory consumption.
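
The memory difference between the two forms can be observed with sys.getsizeof(). The following is a rough sketch that assumes CPython, where getsizeof() reports the size of the container object only and not the numbers that a list refers to. The names `as_list` and `as_generator` are illustrative:

```python
import sys

n_items = 10_000

# A list comprehension stores a reference to every element up front
as_list = [x ** 2 for x in range(n_items)]

# A generator expression produces elements one at a time, on demand
as_generator = (x ** 2 for x in range(n_items))

list_bytes = sys.getsizeof(as_list)
generator_bytes = sys.getsizeof(as_generator)
print(list_bytes > generator_bytes)  # True

# Both forms feed sum() the same values
print(sum(as_list) == sum(as_generator))  # True
```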

Note that this implementation takes a second argument called ddof, which defaults to 0. This argument is the number that we subtract from n in the divisor, so the variance is calculated with n - ddof degrees of freedom. For example, ddof=0 will allow us to calculate the variance of a population. Meanwhile, ddof=1 will allow us to estimate the population variance using a sample of data.
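
One edge case is worth keeping in mind: when the number of observations isn't greater than ddof, the divisor n - ddof is zero or negative. The following sketch adds a guard for that situation. The name `checked_variance()` and the choice of raising ValueError are assumptions made for this example, not part of the original function:

```python
def checked_variance(data, ddof=0):
    """Variance with divisor n - ddof; reject inputs that are too short."""
    data = list(data)  # Accept any iterable, since we need len() and two passes
    n = len(data)
    if n - ddof <= 0:
        raise ValueError(f"need more than {ddof} observations, got {n}")
    mean = sum(data) / n
    return sum((x - mean) ** 2 for x in data) / (n - ddof)


# A single observation can't be used to estimate a sample variance
try:
    checked_variance([5], ddof=1)
except ValueError as error:
    print("ValueError:", error)
```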

Using Python's pvariance() and variance()

Python includes a standard module called statistics that provides some functions for calculating basic statistics of data. In this case, statistics.pvariance() and statistics.variance() are the functions that we can use to calculate the variance of a population and of a sample, respectively.

Here's how Python's pvariance() works:

>>> import statistics

>>> statistics.pvariance([4, 8, 6, 5, 3, 2, 8, 9, 2, 5])
5.760000000000001

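We just need to import the statistics module and then call pvariance() with our data as an argument. That will return the variance of the population. The tiny difference from the 5.76 that our own function returned is a floating-point rounding artifact, and the exact trailing digits may vary between Python versions.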

On the other hand, we can use Python's variance() to calculate the variance of a sample and use it to estimate the variance of the entire population. That's because variance() uses n - 1 instead of n to calculate the variance. Here's how it works:

>>> import statistics

>>> statistics.variance([4, 8, 6, 5, 3, 2, 8, 9, 2, 5])
6.4

This is the sample variance with n - 1 degrees of freedom, S²ₙ₋₁. So, the result of using Python's variance() should be an unbiased estimate of the population variance σ², provided that the observations are representative of the entire population.
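
As a quick sanity check, we can compare our own variance() with the statistics functions. The sketch below uses math.isclose() rather than == because the two implementations may differ in the last floating-point digit. It also shows that statistics.variance() raises StatisticsError when it receives fewer than two data points:

```python
import math
import statistics


def variance(data, ddof=0):
    # Same implementation as in the previous section
    n = len(data)
    mean = sum(data) / n
    return sum((x - mean) ** 2 for x in data) / (n - ddof)


data = [4, 8, 6, 5, 3, 2, 8, 9, 2, 5]

# Compare with a tolerance instead of exact equality
population_match = math.isclose(variance(data), statistics.pvariance(data))
sample_match = math.isclose(variance(data, ddof=1), statistics.variance(data))
print(population_match, sample_match)  # True True

# The sample variance is undefined for a single observation
try:
    statistics.variance([5])
except statistics.StatisticsError as error:
    print("StatisticsError:", error)
```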

Calculating the Standard Deviation

The standard deviation measures the amount of variation or dispersion of a set of numeric values. Standard deviation is the square root of the variance σ² and is denoted as σ. So, if we want to calculate the standard deviation, then all we have to do is take the square root of the variance as follows:

$$
\sigma = \sqrt{\sigma^2}
$$

Again, we need to distinguish between the population standard deviation, which is the square root of the population variance (σ²), and the sample standard deviation, which is the square root of the sample variance (S²). We'll denote the sample standard deviation as S:

$$
S = \sqrt{S^2}
$$

Low values of standard deviation tell us that individual values are closer to the mean. High values, on the other hand, tell us that individual observations are far away from the mean of the data.

Values that are within one standard deviation of the mean can be thought of as fairly typical, whereas values that are three or more standard deviations away from the mean can be considered much more atypical. They're also known as outliers.
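
To see this rule of thumb in action, we can express each value as a number of standard deviations away from the mean. The sketch below uses statistics.mean() and statistics.pstdev() from the standard library. The function name `deviations_in_std_units()` is made up for this example, and the function assumes that the data isn't constant, because the standard deviation would be zero otherwise:

```python
import statistics


def deviations_in_std_units(data):
    """Return (value - mean) / population standard deviation for each value."""
    mean = statistics.mean(data)
    std_dev = statistics.pstdev(data)  # Assumed to be non-zero
    return [(x - mean) / std_dev for x in data]


data = [4, 8, 6, 5, 3, 2, 8, 9, 2, 5]
scores = deviations_in_std_units(data)

# Values within one standard deviation of the mean
typical = [x for x, z in zip(data, scores) if abs(z) <= 1]

# Values three or more standard deviations away from the mean
atypical = [x for x, z in zip(data, scores) if abs(z) >= 3]

print(typical)   # [4, 6, 5, 3, 5]
print(atypical)  # []
```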

Unlike the variance, the standard deviation is expressed in the same units as the original observations. Therefore, the standard deviation is a more meaningful and easier-to-understand statistic. Returning to our example, if the observations are expressed in pounds, then the standard deviation will be expressed in pounds as well.

If we're trying to estimate the standard deviation of the population using a sample of data, then we'll be better served using n - 1 degrees of freedom. Here's a math expression that we typically use to estimate the population standard deviation:

$$
S_{n-1} = \sqrt{\frac{1}{n-1}{\sum_{i=0}^{n-1}{(x_i - X)^2}}}
$$

Note that this is the square root of the sample variance with n - 1 degrees of freedom. This is equivalent to saying:

$$
S_{n-1} = \sqrt{S^2_{n-1}}
$$

Now that we know how to calculate the standard deviation using its math expression, we can take a look at how to calculate this statistic using Python.
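
Before writing a reusable function, it may help to run the n - 1 calculation once by hand. This sketch reuses the dataset [3, 5, 2, 7, 1, 3] from the variance section and treats it as a sample. The variable names are illustrative:

```python
import math
import statistics

# Dataset from the variance section, treated here as a sample
data = [3, 5, 2, 7, 1, 3]
n = len(data)

# Sample mean: 21 / 6 = 3.5
sample_mean = sum(data) / n

# Sum of the squared deviations from the sample mean: 23.5
sum_of_squares = sum((x - sample_mean) ** 2 for x in data)

# Divide by n - 1 instead of n: 23.5 / 5 = 4.7
sample_variance = sum_of_squares / (n - 1)

# The sample standard deviation is the square root of that variance
sample_stdev = math.sqrt(sample_variance)

print(sample_variance)         # 4.7
print(round(sample_stdev, 4))  # 2.1679

# Cross-check against the standard library
print(math.isclose(sample_stdev, statistics.stdev(data)))  # True
```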

Coding a stdev() Function in Python

To calculate the standard deviation of a dataset, we're going to rely on our variance() function. We're also going to use the sqrt() function from the math module of the Python standard library. Here's a function called stdev() that takes the data from a population and returns its standard deviation:

>>> import math

>>> # We rely on our previous implementation for the variance
>>> def variance(data, ddof=0):
...     n = len(data)
...     mean = sum(data) / n
...     return sum((x - mean) ** 2 for x in data) / (n - ddof)
...

>>> def stdev(data):
...     var = variance(data)
...     std_dev = math.sqrt(var)
...     return std_dev
...

>>> stdev([4, 8, 6, 5, 3, 2, 8, 9, 2, 5])
2.4

Our stdev() function takes some data and returns the population standard deviation. To do that, we rely on our previous variance() function to calculate the variance and then we use math.sqrt() to take the square root of the variance.

If we want to use stdev() to estimate the population standard deviation using a sample of data, then we just need to calculate the variance with n - 1 degrees of freedom as we saw before. Here's a more generic stdev() that allows us to pass in degrees of freedom as well:

>>> def stdev(data, ddof=0):
...     return math.sqrt(variance(data, ddof))
...

>>> stdev([4, 8, 6, 5, 3, 2, 8, 9, 2, 5])
2.4

>>> stdev([4, 8, 6, 5, 3, 2, 8, 9, 2, 5], ddof=1)
2.5298221281347035

With this new implementation, we can use ddof=0 to calculate the standard deviation of a population, or we can use ddof=1 to estimate the standard deviation of a population using a sample of data.

Using Python's pstdev() and stdev()

The Python statistics module also provides two functions to calculate the standard deviation: pstdev() and stdev(). The first function takes the data of an entire population and returns its standard deviation. The second function takes data from a sample and returns an estimate of the population standard deviation.

Here's how these functions work:

>>> import statistics

>>> statistics.pstdev([4, 8, 6, 5, 3, 2, 8, 9, 2, 5])
2.4000000000000004

>>> statistics.stdev([4, 8, 6, 5, 3, 2, 8, 9, 2, 5])
2.5298221281347035

We first need to import the statistics module. Then, we can call statistics.pstdev() with data from a population to get its standard deviation.

If we don't have the data for the entire population, which is a common scenario, then we can use a sample of data and use statistics.stdev() to estimate the population standard deviation.
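
Because the two functions differ only in the divisor (n versus n - 1), their results are related by a factor of the square root of n / (n - 1). The following sketch checks that relationship on our example data. The variable names are illustrative:

```python
import math
import statistics

data = [4, 8, 6, 5, 3, 2, 8, 9, 2, 5]
n = len(data)

population_sd = statistics.pstdev(data)  # Divides by n
sample_sd = statistics.stdev(data)       # Divides by n - 1

# Rescaling the population result reproduces the sample result
rescaled = population_sd * math.sqrt(n / (n - 1))
print(math.isclose(sample_sd, rescaled))  # True

# The sample estimate is always the larger of the two for non-constant data
print(sample_sd > population_sd)  # True
```

The gap between the two results shrinks as n grows, which is why the choice matters most for small samples.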

Conclusion

The variance and the standard deviation are commonly used to measure the variability or dispersion of a dataset. These statistical measures complement the use of the mean, the median, and the mode when we're describing our data.

In this tutorial, we've learned how to calculate the variance and the standard deviation of a dataset using Python. We first learned, step-by-step, how to create our own functions to compute them, and later we learned how to use the Python statistics module as a quick way to approach their calculation.

Leodanis is an industrial engineer who loves Python and software development. He is a self-taught Python programmer with 5+ years of experience building desktop applications with PyQt.