Calculate Mean Across Multiple DataFrames in Pandas

Introduction

The Pandas library offers many functions that simplify data manipulation and analysis. One of them is the mean() method, which calculates the average of the values in a DataFrame. But what if you're working with multiple DataFrames? In this Byte, we'll explore how to calculate the mean across multiple DataFrames.

Why Calculate Mean Across Multiple DataFrames?

There are numerous scenarios where you might have multiple DataFrames and need to calculate the mean across all of them. For example, you might have data spread across multiple DataFrames due to the size of the data, different data sources, or because the data is segmented for easier manipulation or storage in files. In these cases, calculating the mean across all the DataFrames gives you a single summary of the whole dataset instead of one figure per piece.

Calculating Mean in a Single DataFrame

Before we get into calculating the mean across multiple DataFrames, let's first look at how to calculate the mean in a single DataFrame. Here's how we'd do it:

import pandas as pd

# Create a DataFrame
df = pd.DataFrame({
   'A': [1, 2, 3, 4, 5],
   'B': [2, 3, 4, 5, 6],
   'C': [3, 4, 5, 6, 7]
})

# Calculate mean
mean = df.mean()

print(mean)

When you run this code, you'll get the following output:

A    3.0
B    4.0
C    5.0
dtype: float64

In this simple example, the mean() method calculates the mean of each column in the DataFrame and returns the results as a Series indexed by column name.
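The column-wise result above is the default behavior. As a small sketch using the same example data, passing axis=1 to mean() averages across the columns instead, producing one value per row:

```python
import pandas as pd

# Same example data as above
df = pd.DataFrame({
    'A': [1, 2, 3, 4, 5],
    'B': [2, 3, 4, 5, 6],
    'C': [3, 4, 5, 6, 7]
})

# axis=1 averages across the columns, giving one value per row
row_mean = df.mean(axis=1)

print(row_mean)
```

Each row here holds three consecutive integers, so the row means come out as the middle value of each row.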

Extending to Multiple DataFrames

Now that we know how to calculate the mean in a single DataFrame, let's extend this to multiple DataFrames. The easiest approach is to concatenate the DataFrames with the pd.concat() function and then calculate the mean of the result.

# Create two more DataFrames
df1 = pd.DataFrame({
   'A': [6, 7, 8, 9, 10],
   'B': [7, 8, 9, 10, 11],
   'C': [8, 9, 10, 11, 12]
})

df2 = pd.DataFrame({
   'A': [11, 12, 13, 14, 15],
   'B': [12, 13, 14, 15, 16],
   'C': [13, 14, 15, 16, 17]
})

# Concatenate DataFrames
df_concat = pd.concat([df, df1, df2])

# Calculate mean
mean_concat = df_concat.mean()

print(mean_concat)

The output will be:

A     8.0
B     9.0
C    10.0
dtype: float64

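First we concatenate the three DataFrames using pd.concat(). We then calculate the mean of the new concatenated DataFrame using the mean() method. Because every row of every DataFrame ends up in the combined DataFrame, each row contributes equally to the result.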

Note: The pd.concat() function concatenates along the vertical axis (axis=0) by default, stacking the rows of each DataFrame on top of one another. If your DataFrames have the same columns, this is typically what you want.
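Vertical concatenation keeps each DataFrame's original index labels, so the combined index contains repeats. If what you need is an element-wise mean (the average of the values at the same position in every DataFrame) rather than one mean per column, one possible approach is to group the stacked rows by those repeated labels. The sketch below uses small made-up DataFrames (df_a, df_b, df_c) and assumes they all share the same index and columns:

```python
import pandas as pd

# Hypothetical DataFrames with identical index labels and columns
df_a = pd.DataFrame({'A': [1, 2], 'B': [10, 20]})
df_b = pd.DataFrame({'A': [3, 4], 'B': [30, 40]})
df_c = pd.DataFrame({'A': [5, 6], 'B': [50, 60]})

# Stacking keeps the original labels, so the index is 0, 1, 0, 1, 0, 1
stacked = pd.concat([df_a, df_b, df_c])

# Grouping by the index label averages the matching rows of each DataFrame
elementwise_mean = stacked.groupby(level=0).mean()

print(elementwise_mean)
```

The result has the same shape as each input DataFrame, with every cell holding the average of the three corresponding cells.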

However, if your DataFrames have different columns, you might want to concatenate along the horizontal axis. You can do this by setting the axis parameter to 1: pd.concat([df1, df2], axis=1). This places the DataFrames side by side, aligning rows on their index labels, which is useful when you just want the columns in a common DataFrame to run analysis on, like with the mean() method.
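As a rough illustration of horizontal concatenation, the sketch below combines two made-up DataFrames (left and right) that have different columns and different lengths. Where one DataFrame has no row for an index label, Pandas fills in NaN, and mean() skips those missing values by default:

```python
import pandas as pd

# Hypothetical DataFrames with different columns and different lengths
left = pd.DataFrame({'A': [1, 2, 3]})
right = pd.DataFrame({'B': [10, 20]})

# axis=1 places the columns side by side, aligning rows on the index;
# 'B' has no value for index label 2, so that cell becomes NaN
wide = pd.concat([left, right], axis=1)

# mean() ignores NaN by default, so 'B' is averaged over two values only
col_means = wide.mean()

print(wide)
print(col_means)
```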

Use Cases

Calculating the mean across multiple DataFrames in Pandas can help in a variety of scenarios. Let's see a few possible use-cases.

One of the most common scenarios is when you're dealing with a large dataset that's been split into multiple DataFrames for easier handling. In such cases, calculating the mean across these DataFrames gives you a single average for the whole dataset rather than one per piece.
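When the pieces are too large to concatenate comfortably, the same overall mean can be built from running totals instead. This is only a sketch with made-up chunks of unequal size; it also shows why simply averaging the per-chunk means can give a different answer:

```python
import pandas as pd

# Hypothetical chunks of one dataset, deliberately of unequal size
chunks = [
    pd.DataFrame({'value': [1, 2, 3, 4]}),
    pd.DataFrame({'value': [5, 6]}),
]

# Accumulate the sum and the number of non-missing values per chunk
total = sum(chunk['value'].sum() for chunk in chunks)
count = sum(chunk['value'].count() for chunk in chunks)
overall_mean = total / count

# Averaging the chunk means weights each chunk equally, not each row
mean_of_means = sum(chunk['value'].mean() for chunk in chunks) / len(chunks)

print(overall_mean)
print(mean_of_means)
```

The sum-and-count result matches what pd.concat(chunks)['value'].mean() would return, while the mean of means only agrees with it when every chunk has the same number of rows.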

Consider the case of a data analyst working with sales data from a multinational company. The data is split by region, each represented by a separate DataFrame. To get a global perspective on average sales, the analyst would need to calculate the mean across all these DataFrames.

import pandas as pd

# Assume we have three DataFrames for sales data in three different regions
df1 = pd.DataFrame({'sales': [100, 200, 300]})
df2 = pd.DataFrame({'sales': [400, 500, 600]})
df3 = pd.DataFrame({'sales': [700, 800, 900]})

# Calculate the mean across all DataFrames
mean_sales = pd.concat([df1, df2, df3]).mean()
print(mean_sales)

Output:

sales    500.0
dtype: float64
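If the analyst also wants the average for each region alongside the global figure, one option is to label each DataFrame with the keys parameter of pd.concat() and group by that label. The region names below are made up for illustration:

```python
import pandas as pd

# Same sales figures as above, with hypothetical region names
north = pd.DataFrame({'sales': [100, 200, 300]})
south = pd.DataFrame({'sales': [400, 500, 600]})
west = pd.DataFrame({'sales': [700, 800, 900]})

# keys adds an outer index level recording which DataFrame each row came from
combined = pd.concat(
    [north, south, west],
    keys=['north', 'south', 'west'],
    names=['region', 'row'],
)

# Mean per region, plus the mean over all rows
region_means = combined.groupby(level='region').mean()
overall_mean = combined['sales'].mean()

print(region_means)
print(overall_mean)
```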

Another use-case could be time-series analysis, where you might have data split across multiple DataFrames, each representing a different time period. Calculating the mean across these DataFrames gives you a baseline against which the individual periods can be compared.
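A minimal sketch of the time-series case, assuming two made-up DataFrames (jan and feb) that each cover one month and are indexed by date:

```python
import pandas as pd

# Hypothetical measurements, one DataFrame per month, indexed by date
jan = pd.DataFrame(
    {'value': [10, 20]},
    index=pd.to_datetime(['2023-01-05', '2023-01-20']),
)
feb = pd.DataFrame(
    {'value': [30, 50]},
    index=pd.to_datetime(['2023-02-03', '2023-02-18']),
)

# Combine the periods into one chronologically ordered DataFrame
series = pd.concat([jan, feb]).sort_index()

# Mean over the whole time span, and mean per calendar month
overall_mean = series['value'].mean()
monthly_mean = series.groupby(series.index.to_period('M')).mean()

print(overall_mean)
print(monthly_mean)
```

Comparing the monthly means against the overall mean is one simple way to see whether a given period sits above or below the average.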

Conclusion

In this Byte, we calculated the mean across multiple DataFrames in Pandas. We started by understanding the calculation of mean in a single DataFrame, then extended this concept to multiple DataFrames. We also pointed out some use-cases where this technique would be particularly useful, like when dealing with split datasets or conducting time-series analysis.

Last Updated: September 21st, 2023

© 2013-2024 Stack Abuse. All rights reserved.
