Introduction
Working with variables in data analysis always raises the question: how are the variables dependent, linked, and varying against each other? Covariance and correlation help answer it.
Covariance captures how two variables vary together. We use covariance to measure how much two variables change with each other. Correlation reveals how strongly the variables are related. We use correlation to determine how strongly linked two variables are to each other.
In this article, we'll learn how to calculate the covariance and correlation in Python.
Covariance and Correlation - In Simple Terms
Both covariance and correlation are about the relationship between the variables. Covariance defines the directional association between the variables. Covariance values range from -inf to +inf where a positive value denotes that both the variables move in the same direction and a negative value denotes that both the variables move in opposite directions.
Correlation is a standardized statistical measure that expresses the extent to which two variables are linearly related (meaning how much they change together at a constant rate). The strength and directional association of the relationship between two variables are defined by correlation and it ranges from -1 to +1. Similar to covariance, a positive value denotes that both variables move in the same direction whereas a negative value tells us that they move in opposite directions.
Both covariance and correlation are vital tools used in data exploration for feature selection and multivariate analyses. For example, an investor looking to spread the risk of a portfolio might check the covariance between stocks: a high positive covariance suggests that their prices tend to rise and fall at the same time, so holding both does little to spread risk. However, the direction of movement is not enough on its own. The investor would then use the correlation metric to determine how strongly linked those stock prices are to each other.
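To illustrate why the standardized measure is needed, here is a minimal sketch with made-up measurements (the data and the helper names `sample_covariance` and `pearson_correlation` are assumptions for this example, not part of the article's script). Changing the unit of one variable changes the covariance, while the correlation stays the same:

```python
def mean(values):
    return sum(values) / len(values)

def sample_covariance(x, y):
    # Sum of products of deviations, divided by N - 1
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)

def pearson_correlation(x, y):
    # Covariance term divided by the product of the spread terms
    mx, my = mean(x), mean(y)
    numerator = sum((a - mx) * (b - my) for a, b in zip(x, y))
    spread_x = sum((a - mx) ** 2 for a in x)
    spread_y = sum((b - my) ** 2 for b in y)
    return numerator / (spread_x * spread_y) ** 0.5

# Hypothetical data: the same heights expressed in metres and in centimetres
heights_m = [1.50, 1.60, 1.70, 1.80, 1.90]
heights_cm = [h * 100 for h in heights_m]
weights_kg = [52.0, 58.0, 65.0, 71.0, 80.0]

cov_m = sample_covariance(heights_m, weights_kg)
cov_cm = sample_covariance(heights_cm, weights_kg)
cor_m = pearson_correlation(heights_m, weights_kg)
cor_cm = pearson_correlation(heights_cm, weights_kg)

print("Covariance (m vs kg):  ", cov_m)
print("Covariance (cm vs kg): ", cov_cm)   # about 100 times larger
print("Correlation (m vs kg): ", cor_m)
print("Correlation (cm vs kg):", cor_cm)   # unchanged by the rescaling
```

The covariance grows by the same factor as the unit change, which is why its magnitude is hard to interpret on its own.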
Setup for Python Code - Retrieving Sample Data
With the basics covered in the previous section, let's move ahead and calculate covariance in Python. For this example, we will be working with the well-known Iris dataset. To be specific, we're only working with the setosa species, so this will be just a sample of the dataset about some lovely purple flowers!
Let's have a look at the dataset, on which we will be performing the analysis:
We are about to pick two columns for our analysis: sepal_length and sepal_width.
In a new Python file (you can name it covariance_correlation.py), let's begin by creating two lists with values for the sepal_length and sepal_width properties of the flower:
with open('iris_setosa.csv', 'r') as f:
    g = f.readlines()
    # Skip the header row, split each line on commas, and convert the first two fields to floats
    sep_length = [float(x.split(',')[0]) for x in g[1:]]
    sep_width = [float(x.split(',')[1]) for x in g[1:]]
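As an alternative to splitting lines by hand, the same two lists could be built with the standard library's csv module. This is only a sketch: the helper name `load_columns` is hypothetical, and it assumes the file has a header row containing the column names sepal_length and sepal_width. A few illustrative rows are read from an in-memory string so the example runs without the file:

```python
import csv
import io

def load_columns(file_obj, first="sepal_length", second="sepal_width"):
    # DictReader uses the header row to name each field
    reader = csv.DictReader(file_obj)
    xs, ys = [], []
    for row in reader:
        xs.append(float(row[first]))
        ys.append(float(row[second]))
    return xs, ys

# Illustrative rows in the assumed layout (not the full dataset)
sample_csv = io.StringIO(
    "sepal_length,sepal_width,petal_length,petal_width,species\n"
    "5.1,3.5,1.4,0.2,setosa\n"
    "4.9,3.0,1.4,0.2,setosa\n"
    "4.7,3.2,1.3,0.2,setosa\n"
)

sample_length, sample_width = load_columns(sample_csv)
print(sample_length)
print(sample_width)
```

With the real file, the same helper would be called on the object returned by open('iris_setosa.csv', newline=''), provided the header matches the assumption above.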
In data science, it always helps to visualize the data you're working on. Here's a Seaborn regression plot (Scatter Plot + linear regression fit) of these setosa properties on different axes:
Visually, the data points lie close to the regression line, which suggests a high correlation. Let's see if our observations match up with their covariance and correlation values.
Calculating Covariance in Python
The following formula computes the covariance of a sample:
$$\mathrm{cov}(x, y) = \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{N - 1}$$
In the above formula,
- xi, yi - are individual elements of the x and y series
- x̄, ȳ - are the mathematical means of the x and y series
- N - is the number of elements in the series
The denominator is N for a whole dataset and N - 1 in the case of a sample. As our dataset is a small sample of the entire Iris dataset, we use N - 1.
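To make the difference between the two denominators concrete, here is a small worked sketch on toy data. The function name `covariance_with_denominator` and its `sample` flag are assumptions made for this illustration. For the values below, the sum of the products of deviations is 10, so dividing by N = 4 gives 2.5 and dividing by N - 1 = 3 gives roughly 3.33:

```python
def covariance_with_denominator(x, y, sample=True):
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of the products of the deviations from each mean
    numerator = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # N - 1 for a sample, N for a whole dataset
    return numerator / (n - 1 if sample else n)

x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 6.0, 8.0]

population_cov = covariance_with_denominator(x, y, sample=False)
sample_cov = covariance_with_denominator(x, y, sample=True)

print("Whole dataset (N):", population_cov)
print("Sample (N - 1):   ", sample_cov)
```

The gap between the two results shrinks as N grows, so the choice matters most for small datasets.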
With the math formula mentioned above as our reference, let's create this function in pure Python:
def covariance(x, y):
    # Finding the mean of the series x and y
    mean_x = sum(x)/float(len(x))
    mean_y = sum(y)/float(len(y))
    # Subtracting mean from the individual elements
    sub_x = [i - mean_x for i in x]
    sub_y = [i - mean_y for i in y]
    numerator = sum([sub_x[i]*sub_y[i] for i in range(len(sub_x))])
    denominator = len(x)-1
    cov = numerator/denominator
    return cov

with open('iris_setosa.csv', 'r') as f:
    ...

cov_func = covariance(sep_length, sep_width)
print("Covariance from the custom function:", cov_func)
We first find the mean values of our datasets. We then use a list comprehension to iterate over every element in our two series of data and subtract the mean from each value. A for loop could have been used as well, if that's your preference.
We then take those intermediate values of the two series and multiply them with each other in another list comprehension. We sum the result of that list and store it as the numerator. The denominator is a lot easier to calculate; be sure to decrease it by 1 when you're finding the covariance for sample data!
We then return the value of the numerator divided by the denominator, which is the covariance.
Running our script would give us this output:
Covariance from the custom function: 0.09921632653061219
The positive value denotes that both the variables move in the same direction.
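The explanation above mentions that a for loop could replace the list comprehensions. A possible version is sketched below; the name `covariance_loop` and the input checks are additions for this illustration and are not part of the original function:

```python
def covariance_loop(x, y):
    # Guard against inputs the formula cannot handle
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    if len(x) < 2:
        raise ValueError("at least two observations are required")

    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)

    # Accumulate the products of the deviations pair by pair
    numerator = 0.0
    for x_i, y_i in zip(x, y):
        numerator += (x_i - mean_x) * (y_i - mean_y)

    # Sample covariance, so the denominator is N - 1
    return numerator / (len(x) - 1)

toy_x = [2.0, 4.0, 6.0]
toy_y = [1.0, 3.0, 8.0]
toy_cov = covariance_loop(toy_x, toy_y)
print("Covariance from the loop version:", toy_cov)
```

Pairing the values with zip avoids indexing both lists by position, and the length check makes mismatched inputs fail loudly instead of silently dropping values.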
Calculating Correlation in Python
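The most widely used formula to compute the correlation coefficient is Pearson's "r":
$$r = \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{N} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{N} (y_i - \bar{y})^2}}$$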
In the above formula,
- xi, yi - are individual elements of the x and y series
- The numerator corresponds to the covariance
- The denominators correspond to the individual standard deviations of x and y
Seems like we've discussed everything we need to get the correlation in this series of articles!
Let's calculate the correlation now:
def correlation(x, y):
    # Finding the mean of the series x and y
    mean_x = sum(x)/float(len(x))
    mean_y = sum(y)/float(len(y))
    # Subtracting mean from the individual elements
    sub_x = [i-mean_x for i in x]
    sub_y = [i-mean_y for i in y]
    # Sum of the products of deviations (the covariance without its 1/(N-1) factor)
    numerator = sum([sub_x[i]*sub_y[i] for i in range(len(sub_x))])
    # Sums of squared deviations (the squared standard deviations without their 1/(N-1) factors)
    std_deviation_x = sum([sub_x[i]**2.0 for i in range(len(sub_x))])
    std_deviation_y = sum([sub_y[i]**2.0 for i in range(len(sub_y))])
    # Raising to the power 0.5 to find the square root; the 1/(N-1) factors cancel in the ratio
    denominator = (std_deviation_x*std_deviation_y)**0.5 # short but equivalent to (std_deviation_x**0.5) * (std_deviation_y**0.5)
    cor = numerator/denominator
    return cor

with open('iris_setosa.csv', 'r') as f:
    ...

cor_func = correlation(sep_length, sep_width)
print("Correlation from the custom function:", cor_func)
As this value needs the covariance of the two variables, our function pretty much works out that value once again. Once the covariance term is computed, we then calculate the standard deviation term for each variable. From there, the correlation is simply the covariance divided by the product of the two standard deviations. The function skips the division by N - 1 in both the numerator and the denominator because those factors cancel out in the ratio.
Running this code we get the following output, confirming that these properties have a positive (sign of the value, either +, -, or none if 0) and strong (the value is close to 1) relationship:
Correlation from the custom function: 0.7425466856651597
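The relationship between the two measures can also be checked directly. The sketch below, on toy data, divides a sample covariance by the product of the two sample standard deviations (using statistics.stdev from the standard library) and compares the result with the sum-of-squares shortcut used in the function above. The helper names are assumptions for this example:

```python
from statistics import stdev

def sample_covariance(x, y):
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    products = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return products / (len(x) - 1)

def correlation_from_covariance(x, y):
    # Covariance divided by the product of the sample standard deviations
    return sample_covariance(x, y) / (stdev(x) * stdev(y))

def correlation_shortcut(x, y):
    # Same ratio with the 1/(N - 1) factors left out of both parts
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    numerator = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    squares_x = sum((a - mean_x) ** 2 for a in x)
    squares_y = sum((b - mean_y) ** 2 for b in y)
    return numerator / (squares_x * squares_y) ** 0.5

toy_x = [1.0, 2.0, 3.0, 4.0, 5.0]
toy_y = [2.0, 1.0, 4.0, 3.0, 5.0]

r_from_parts = correlation_from_covariance(toy_x, toy_y)
r_shortcut = correlation_shortcut(toy_x, toy_y)
print("From covariance and standard deviations:", r_from_parts)
print("From the sum-of-squares shortcut:       ", r_shortcut)
```

Both approaches agree up to floating-point rounding (about 0.8 for this toy data), which is why the shortcut form is safe to use.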
Conclusion
In this article, we took a detailed look at two statistical instruments: covariance and correlation. We've learned what their values mean for our data, how they are represented mathematically, and how to implement them in Python. Both of these measures can be very helpful in determining relationships between two variables.