## Calculating Variance and Standard Deviation in Python

### Introduction

Two closely related statistical measures will allow us to get an idea of the spread or dispersion of our data. The first measure is the *variance*, which measures how far from their mean the individual observations in our data are. The second is the *standard deviation*, which is the square root of the variance and measures the amount of variation or dispersion of a dataset.

In this tutorial, we'll learn how to calculate the variance and the standard deviation in Python. We'll first code a Python function for each measure and later, we'll learn how to use the Python `statistics` module to accomplish the same task quickly.

With this knowledge, we'll be able to take a first look at our datasets and get a quick idea of the general dispersion of our data.

### Calculating the Variance

In statistics, the **variance** is a measure of how far individual (numeric) values in a dataset are from the mean or average value. The variance is often used to quantify spread or dispersion. Spread is a characteristic of a sample or population that describes how much variability there is in it.

A high variance tells us that the values in our dataset are far from their mean. So, our data will have high levels of variability. On the other hand, a low variance tells us that the values are quite close to the mean. In this case, the data will have low levels of variability.
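As a quick illustration of high versus low variance, the sketch below compares two made-up datasets that share the same mean but differ in spread. The names `tight` and `wide` and their values are assumptions chosen for this example, and `statistics.pvariance()` is the standard-library function covered later in this tutorial.

```python
import statistics

# Two hypothetical datasets with the same mean (10) but different spread
tight = [9, 10, 10, 11]
wide = [1, 5, 15, 19]

tight_variance = statistics.pvariance(tight)
wide_variance = statistics.pvariance(wide)

print(statistics.mean(tight), statistics.mean(wide))
print(tight_variance)  # small: the values sit close to the mean
print(wide_variance)   # large: the values sit far from the mean
```

Both datasets have the same average, so the mean alone can't tell them apart. The variance can.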

To calculate the variance in a dataset, we first need to find the difference between each individual value and the mean. The variance is the average of the squares of those differences. We can express the variance with the following math expression:

$$

\sigma^2 = \frac{1}{n}{\sum_{i=0}^{n-1}{(x_i - \mu)^2}}

$$

In this equation, **x<sub>i</sub>** stands for individual values or observations in a dataset. **μ** stands for the mean or average of those values. **n** is the number of values in the dataset.

The term **x<sub>i</sub> - μ** is called the **deviation from the mean**. So, the variance is the mean of square deviations. That's why we denoted it as **σ<sup>2</sup>**.

Say we have a dataset [3, 5, 2, 7, 1, 3]. To find its variance, we need to calculate the mean which is:

$$

(3 + 5 + 2 + 7 + 1 + 3) / 6 = 3.5

$$

Then, we need to calculate the sum of the square deviation from the mean of all the observations. Here's how:

$$

(3 - 3.5)^2 + (5 - 3.5)^2 + (2 - 3.5)^2 + (7 - 3.5)^2 + (1 - 3.5)^2 + (3 - 3.5)^2 = 23.5

$$

To find the variance, we just need to divide this result by the number of observations like this:

$$

23.5 / 6 = 3.916666667

$$

That's all. The variance of our data is *3.916666667*. The variance can be difficult to understand and interpret, particularly because of its units.

For example, if the observations in our dataset are measured in pounds, then the variance will be measured in square pounds. So, we can say that the average squared deviation of the observations from the mean 3.5 is *3.916666667* square pounds. Fortunately, the standard deviation fixes this problem, but that's the topic of a later section.
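The hand calculation above can be double-checked with a few lines of Python. This is only a minimal sketch that mirrors the three steps (mean, sum of square deviations, division by the number of observations). The variable names are chosen for illustration.

```python
data = [3, 5, 2, 7, 1, 3]
n = len(data)

# Step 1: mean of the observations
mean = sum(data) / n

# Step 2: sum of the square deviations from the mean
total = sum((x - mean) ** 2 for x in data)

# Step 3: divide by the number of observations
population_variance = total / n

print(mean)   # 3.5
print(total)  # 23.5
print(round(population_variance, 9))  # 3.916666667
```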

If we apply the concept of variance to a dataset, then we can distinguish between the **sample variance** and the **population variance**. The population variance is the variance that we saw before and we can calculate it using the data from the full population and the expression for **σ<sup>2</sup>**.

The sample variance is denoted as **S<sup>2</sup>** and we can calculate it using a sample from a given population and the following expression:

$$

S^2 = \frac{1}{n}{\sum_{i=0}^{n-1}{(x_i - X)^2}}

$$

This expression is quite similar to the expression for calculating **σ<sup>2</sup>** but in this case, **x<sub>i</sub>** represents individual observations in the sample and **X** is the mean of the sample.

**S<sup>2</sup>** is commonly used to estimate the variance of a population (**σ<sup>2</sup>**) using a sample of data. However, **S<sup>2</sup>** systematically underestimates the population variance. For that reason, it's referred to as a **biased estimator** of the population variance.

When we have a large sample, **S<sup>2</sup>** can be an adequate estimator of **σ<sup>2</sup>**. For small samples, it tends to be too low. Fortunately, there is another simple statistic that we can use to better estimate **σ<sup>2</sup>**. Here's its equation:

$$

S^2_{n-1} = \frac{1}{n-1}{\sum_{i=0}^{n-1}{(x_i - X)^2}}

$$

This looks quite similar to the previous expression. It still sums the squared deviations from the mean, but in this case, we divide by **n - 1** instead of by **n**. This is called Bessel's correction. With Bessel's correction, **S<sup>2</sup><sub>n-1</sub>** is an unbiased estimator of the population variance. So, in practice, we'll use this equation to estimate the variance of a population using a sample of data. Note that **S<sup>2</sup><sub>n-1</sub>** is also known as the variance with **n - 1** degrees of freedom.
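To see the bias in action, here's a small simulation sketch. It assumes a normally distributed population with a known variance of 4.0, draws many small samples, and averages both estimators. The constants, the seed, and the helper `sum_of_square_deviations()` are illustrative choices, not part of the tutorial's own code.

```python
import random

random.seed(42)  # fixed seed so repeated runs give the same numbers

TRUE_SIGMA = 2.0   # population standard deviation, so the variance is 4.0
SAMPLE_SIZE = 5    # deliberately small, where the bias is most visible
TRIALS = 20_000

def sum_of_square_deviations(sample):
    mean = sum(sample) / len(sample)
    return sum((x - mean) ** 2 for x in sample)

biased_total = 0.0
unbiased_total = 0.0
for _ in range(TRIALS):
    sample = [random.gauss(0, TRUE_SIGMA) for _ in range(SAMPLE_SIZE)]
    ss = sum_of_square_deviations(sample)
    biased_total += ss / SAMPLE_SIZE          # divide by n
    unbiased_total += ss / (SAMPLE_SIZE - 1)  # divide by n - 1

avg_biased = biased_total / TRIALS
avg_unbiased = unbiased_total / TRIALS
print(f"average with n:     {avg_biased:.3f}")
print(f"average with n - 1: {avg_unbiased:.3f}")
```

Under these assumptions, the average obtained with **n** should land noticeably below 4.0, while the average obtained with **n - 1** should land close to it.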

Now that we've learned how to calculate the variance using its math expression, it's time to get into action and calculate the variance using Python.

#### Coding a variance() Function in Python

To calculate the variance, we're going to code a Python function called `variance()`. This function will take some data and return its variance. Inside `variance()`, we're going to calculate the mean of the data and the square deviations from the mean. Finally, we're going to calculate the variance by finding the average of the square deviations.

Here's a possible implementation for `variance()`:

```
>>> def variance(data):
...     # Number of observations
...     n = len(data)
...     # Mean of the data
...     mean = sum(data) / n
...     # Square deviations
...     deviations = [(x - mean) ** 2 for x in data]
...     # Variance
...     variance = sum(deviations) / n
...     return variance
...
>>> variance([4, 8, 6, 5, 3, 2, 8, 9, 2, 5])
5.76
```

We first calculate the number of observations (`n`) in our data using the built-in function `len()`. Then, we calculate the mean of the data, dividing the total sum of the observations by the number of observations.

The next step is to calculate the square deviations from the mean. To do that, we use a `list` comprehension that creates a `list` of square deviations using the expression `(x - mean) ** 2` where `x` stands for every observation in our data.

Finally, we calculate the variance by summing the square deviations and dividing the total by the number of observations `n`.

In this case, `variance()` will calculate the population variance because we're using **n** instead of **n - 1** to calculate the mean of the deviations. If we're working with a sample and we want to estimate the variance of the population, then we'll need to update the expression `variance = sum(deviations) / n` to `variance = sum(deviations) / (n - 1)`.

We can refactor our function to make it more concise and efficient. Here's an example:

```
>>> def variance(data, ddof=0):
...     n = len(data)
...     mean = sum(data) / n
...     return sum((x - mean) ** 2 for x in data) / (n - ddof)
...
>>> variance([4, 8, 6, 5, 3, 2, 8, 9, 2, 5])
5.76
>>> variance([4, 8, 6, 5, 3, 2, 8, 9, 2, 5], ddof=1)
6.4
```

In this case, we remove some intermediate steps and temporary variables like `deviations` and `variance`. We also turn the `list` comprehension into a generator expression, which is much more efficient in terms of memory consumption.
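The memory difference can be observed with `sys.getsizeof()`. The sketch below is only a rough illustration: the input `range(100_000)` is an arbitrary choice, and `sys.getsizeof()` reports the size of the container object itself, not of the numbers it refers to, so the real gap is even larger than the printed figures suggest.

```python
import sys

data = range(100_000)  # arbitrary illustrative input
mean = sum(data) / len(data)

squares_list = [(x - mean) ** 2 for x in data]  # every value stored at once
squares_gen = ((x - mean) ** 2 for x in data)   # values produced on demand

list_size = sys.getsizeof(squares_list)
gen_size = sys.getsizeof(squares_gen)
print(f"list:      {list_size} bytes")
print(f"generator: {gen_size} bytes")

# Both produce the same values in the same order, so the sums agree
same_total = sum(squares_list) == sum(squares_gen)
print(same_total)  # True
```

The generator stays small regardless of how many observations we feed it, because it only holds the state needed to produce the next value.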

Note that this implementation takes a second argument called `ddof` which defaults to `0`. This argument sets how many degrees of freedom to subtract from **n** in the divisor when calculating the variance. For example, `ddof=0` will allow us to calculate the variance of a population. Meanwhile, `ddof=1` will allow us to estimate the population variance using a sample of data.
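One thing the compact version doesn't handle is input that's too short: an empty dataset, or a single observation with `ddof=1`, makes the divisor zero. Here's one possible way to guard against that. The name `safe_variance()` and the error message are hypothetical choices for this sketch, not part of the tutorial's function.

```python
def safe_variance(data, ddof=0):
    """Variance with an explicit length check (hypothetical helper)."""
    data = list(data)  # accept any iterable, including generators
    n = len(data)
    if n - ddof <= 0:
        raise ValueError(f"need more than {ddof} observation(s), got {n}")
    mean = sum(data) / n
    return sum((x - mean) ** 2 for x in data) / (n - ddof)

print(safe_variance([4, 8, 6, 5, 3, 2, 8, 9, 2, 5], ddof=1))

try:
    safe_variance([7], ddof=1)
except ValueError as error:
    print(f"Rejected: {error}")
```

Raising `ValueError` here is a design choice: it gives the caller a clearer message than the `ZeroDivisionError` that the unguarded division would otherwise produce.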

#### Using Python's pvariance() and variance()

Python includes a standard module called `statistics` that provides some functions for calculating basic statistics of data. In this case, `statistics.pvariance()` and `statistics.variance()` are the functions that we can use to calculate the variance of a population and of a sample respectively.

Here's how Python's `pvariance()` works:

```
>>> import statistics
>>> statistics.pvariance([4, 8, 6, 5, 3, 2, 8, 9, 2, 5])
5.760000000000001
```

We just need to import the `statistics` module and then call `pvariance()` with our data as an argument. That will return the variance of the population.
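As a sanity check, we can compare our hand-written function against `statistics.pvariance()`. Since floating-point results may differ in the last digit, this sketch uses `math.isclose()` rather than `==`. The `variance()` function is repeated here only to keep the snippet self-contained.

```python
import math
import statistics

def variance(data, ddof=0):
    # Same implementation as the refactored function from the previous section
    n = len(data)
    mean = sum(data) / n
    return sum((x - mean) ** 2 for x in data) / (n - ddof)

data = [4, 8, 6, 5, 3, 2, 8, 9, 2, 5]

# Compare with a tolerance instead of exact equality
matches_population = math.isclose(variance(data), statistics.pvariance(data))
print(matches_population)  # True
```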

On the other hand, we can use Python's `variance()` to calculate the variance of a sample and use it to estimate the variance of the entire population. That's because `variance()` uses **n - 1** instead of **n** to calculate the variance. Here's how it works:

```
>>> import statistics
>>> statistics.variance([4, 8, 6, 5, 3, 2, 8, 9, 2, 5])
6.4
```

This is the sample variance with **n - 1** degrees of freedom, **S<sup>2</sup><sub>n-1</sub>**. So, the result of using Python's `variance()` should be an unbiased estimate of the population variance **σ<sup>2</sup>**, provided that the observations are representative of the entire population.

### Calculating the Standard Deviation

The **standard deviation** measures the amount of variation or dispersion of a set of numeric values. Standard deviation is the square root of variance **σ<sup>2</sup>** and is denoted as **σ**. So, if we want to calculate the standard deviation, then all we have to do is take the square root of the variance as follows:

$$

\sigma = \sqrt{\sigma^2}

$$

Again, we need to distinguish between the population standard deviation, which is the square root of the population variance (**σ<sup>2</sup>**), and the sample standard deviation, which is the square root of the sample variance (**S<sup>2</sup>**). We'll denote the sample standard deviation as **S**:

$$

S = \sqrt{S^2}

$$

Low values of standard deviation tell us that individual values are closer to the mean. High values, on the other hand, tell us that individual observations are far away from the mean of the data.

Values that are within one standard deviation of the mean can be thought of as fairly typical, whereas values that are three or more standard deviations away from the mean can be considered much more atypical. They're also known as **outliers**.
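The three-standard-deviations rule of thumb can be sketched as a small filter. The function name `find_outliers()`, the default threshold, and the sample data below are all made up for illustration, and the rule itself is only a heuristic that works best on reasonably large, roughly bell-shaped datasets.

```python
import statistics

def find_outliers(data, threshold=3.0):
    """Return values at least `threshold` standard deviations from the mean."""
    mean = statistics.mean(data)
    sd = statistics.pstdev(data)
    if sd == 0:
        return []  # all values are identical, so nothing stands out
    return [x for x in data if abs(x - mean) / sd >= threshold]

# Hypothetical measurements: nineteen values near 50 and one extreme value
measurements = [48, 52, 50, 49, 51, 47, 53, 50, 49, 51,
                50, 52, 48, 50, 51, 49, 50, 52, 48, 120]

outliers = find_outliers(measurements)
print(outliers)  # [120]
```

Note that in very small datasets no value can be three standard deviations away from the mean, because a single extreme value also inflates the standard deviation itself.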

Unlike variance, the standard deviation is expressed in the same units as the original observations. Therefore, the standard deviation is a more meaningful and easier-to-understand statistic. Returning to our example, if the observations are expressed in pounds, then the standard deviation will be expressed in pounds as well.

If we're trying to estimate the standard deviation of the population using a sample of data, then we'll be better served using **n - 1** degrees of freedom. Here's a math expression that we typically use to estimate the population standard deviation:

$$

S_{n-1} = \sqrt{\frac{\sum_{i=0}^{n-1}{(x_i - X)^2}}{n-1}}

$$

Note that this is the square root of the sample variance with **n - 1** degrees of freedom. This is equivalent to saying:

$$

S_{n-1} = \sqrt{S^2_{n-1}}

$$
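As a worked example, we can apply both expressions to the earlier dataset [3, 5, 2, 7, 1, 3], whose sum of square deviations is 23.5. Treating the data as a full population gives:

$$
\sigma = \sqrt{\frac{23.5}{6}} \approx 1.979
$$

Treating the same six values as a sample and dividing by **n - 1** gives a slightly larger estimate:

$$
S_{n-1} = \sqrt{\frac{23.5}{5}} = \sqrt{4.7} \approx 2.168
$$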

Once we know how to calculate the standard deviation using its math expression, we can take a look at how we can calculate this statistic using Python.

#### Coding a stdev() Function in Python

To calculate the standard deviation of a dataset, we're going to rely on our `variance()` function. We're also going to use the `sqrt()` function from the `math` module of the Python standard library. Here's a function called `stdev()` that takes the data from a population and returns its standard deviation:

```
>>> import math
>>> # We rely on our previous implementation for the variance
>>> def variance(data, ddof=0):
...     n = len(data)
...     mean = sum(data) / n
...     return sum((x - mean) ** 2 for x in data) / (n - ddof)
...
>>> def stdev(data):
...     var = variance(data)
...     std_dev = math.sqrt(var)
...     return std_dev
...
>>> stdev([4, 8, 6, 5, 3, 2, 8, 9, 2, 5])
2.4
```

Our `stdev()` function takes some `data` and returns the population standard deviation. To do that, we rely on our previous `variance()` function to calculate the variance and then we use `math.sqrt()` to take the square root of the variance.

If we want to use `stdev()` to estimate the population standard deviation using a sample of data, then we just need to calculate the variance with **n - 1** degrees of freedom as we saw before. Here's a more generic `stdev()` that allows us to pass in degrees of freedom as well:

```
>>> def stdev(data, ddof=0):
...     return math.sqrt(variance(data, ddof))
...
>>> stdev([4, 8, 6, 5, 3, 2, 8, 9, 2, 5])
2.4
>>> stdev([4, 8, 6, 5, 3, 2, 8, 9, 2, 5], ddof=1)
2.5298221281347035
```

With this new implementation, we can use `ddof=0` to calculate the standard deviation of a population, or we can use `ddof=1` to estimate the standard deviation of a population using a sample of data.

#### Using Python's pstdev() and stdev()

The Python `statistics` module also provides functions to calculate the standard deviation. We can find `pstdev()` and `stdev()`. The first function takes the data of an entire population and returns its standard deviation. The second function takes data from a sample and returns an estimation of the population standard deviation.

Here's how these functions work:

```
>>> import statistics
>>> statistics.pstdev([4, 8, 6, 5, 3, 2, 8, 9, 2, 5])
2.4000000000000004
>>> statistics.stdev([4, 8, 6, 5, 3, 2, 8, 9, 2, 5])
2.5298221281347035
```

We first need to import the `statistics` module. Then, we can call `statistics.pstdev()` with data from a population to get its standard deviation.

If we don't have the data for the entire population, which is a common scenario, then we can use a sample of data and use `statistics.stdev()` to estimate the population standard deviation.
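To tie both measures together, the following sketch checks that each standard deviation function in `statistics` agrees with the square root of its variance counterpart. It uses `math.isclose()` because floating-point results may differ in the last digit. The variable names are illustrative.

```python
import math
import statistics

data = [4, 8, 6, 5, 3, 2, 8, 9, 2, 5]

# Population versions: pstdev() should match sqrt(pvariance())
population_ok = math.isclose(
    statistics.pstdev(data), math.sqrt(statistics.pvariance(data))
)

# Sample versions: stdev() should match sqrt(variance())
sample_ok = math.isclose(
    statistics.stdev(data), math.sqrt(statistics.variance(data))
)

print(population_ok, sample_ok)  # True True
```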

### Conclusion

The **variance** and the **standard deviation** are commonly used to measure the variability or dispersion of a dataset. These statistical measures complement the use of the mean, the median, and the mode when we're describing our data.

In this tutorial, we've learned how to calculate the variance and the standard deviation of a dataset using Python. We first learned, step-by-step, how to create our own functions to compute them, and later we learned how to use the Python `statistics` module as a quick way to approach their calculation.