Efficient Data Manipulation with the apply() Function in Pandas

Introduction

The apply() function is a powerful tool in Pandas, the Python library for data analysis and manipulation. It is a valuable addition to any analyst's toolkit, as it can be combined with other Pandas functions and with custom functions to perform complex data transformations.

In this article, we will learn how to leverage the apply() function in Pandas for efficient and flexible data manipulation. We will examine code examples and understand the advantages and disadvantages of this function in various scenarios.

Overview of the apply() Function

The apply() function lets you run a function of your choice over your data. You can define your own function for a specific task, such as a string manipulation or a custom calculation, and then pass it to apply() whenever you need it instead of rewriting the logic each time. You can also pass a lambda function, which is a short anonymous function created on the fly. Finally, apply() works with Python's built-in functions. Both Pandas Series and DataFrames provide an apply() method.
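As a minimal sketch of the three kinds of callables described above (a named function, a lambda, and a Python built-in), consider the following. The names Series and the shout() helper are hypothetical and used only for illustration:

```python
import pandas as pd

# Hypothetical sample data, used only for illustration
names = pd.Series(["oslo", "lima", "cairo"])

def shout(text):
    """Custom named function: upper-case a string and append '!'."""
    return text.upper() + "!"

shouted = names.apply(shout)                  # named custom function
initials = names.apply(lambda text: text[0])  # anonymous lambda function
lengths = names.apply(len)                    # Python built-in function

print(shouted.tolist())   # ['OSLO!', 'LIMA!', 'CAIRO!']
print(initials.tolist())  # ['o', 'l', 'c']
print(lengths.tolist())   # [4, 4, 5]
```

In each case, apply() calls the function once per element and collects the results in a new Series.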

Using Apply() on a Series

Let's define a series of average monthly temperatures (in Celsius) for a city in a year by importing the Pandas package and using the Series() class:

import pandas as pd

city_temps = pd.Series([1, 4, 8, 9, 14, 25, 31, 35, 32, 25, 11, 2])

We can create a custom function to convert these temperatures to the Fahrenheit scale:

def celsius_to_fahrenheit(celsius):
    fahrenheit = (celsius * 9/5) + 32
    return fahrenheit

The function above takes a numeric temperature value in Celsius and converts it to Fahrenheit. Now, we can transform each element of our Series:

temp_fahrenheit = city_temps.apply(celsius_to_fahrenheit)
print(temp_fahrenheit)

The output is a Series containing the transformed values of the city temperatures:

0     33.8
1     39.2
2     46.4
3     48.2
4     57.2
5     77.0
6     87.8
7     95.0
8     89.6
9     77.0
10    51.8
11    35.6
dtype: float64

Using apply() on a DataFrame

Using apply() provides flexibility for adding or manipulating columns in Pandas DataFrames. Consider a DataFrame with mean monthly temperatures during a year for two cities:

data = {'City 1': [1, 4, 8, 9, 14, 25, 31, 35, 32, 25, 11, 2],
        'City 2': [19, 23, 30, 31, 35, 40, 45, 39, 30, 25, 15, 10]
       }
# Create the DataFrame
df = pd.DataFrame(data)
print(df)

The DataFrame appears as follows:

    City 1  City 2
0        1      19
1        4      23
2        8      30
3        9      31
4       14      35
5       25      40
6       31      45
7       35      39
8       32      30
9       25      25
10      11      15
11       2      10

We can call apply() on the City 1 column of the df DataFrame and store the result in a new column, City 1 Fahrenheit:

df['City 1 Fahrenheit'] = df['City 1'].apply(celsius_to_fahrenheit)

# Print the DataFrame
print(df)

The transformation was successful:

    City 1  City 2  City 1 Fahrenheit
0        1      19               33.8
1        4      23               39.2
2        8      30               46.4
3        9      31               48.2
4       14      35               57.2
5       25      40               77.0
6       31      45               87.8
7       35      39               95.0
8       32      30               89.6
9       25      25               77.0
10      11      15               51.8
11       2      10               35.6

An important parameter of DataFrame.apply() is axis, which controls what gets passed to the function. With axis=0 (the default), the function receives each column as a Series; with axis=1, it receives each row as a Series. Our previous example did not need axis at all: df['City 1'] is a Series, and calling apply() on a Series simply applies the function to each element.
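To make the effect of axis concrete, here is a small sketch on a two-row excerpt of the temperature data. The spread() helper is a hypothetical name introduced for illustration:

```python
import pandas as pd

# Two-row excerpt of the article's data, to keep the output short
df = pd.DataFrame({"City 1": [1, 4], "City 2": [19, 23]})

def spread(values):
    """Difference between the largest and smallest value in a Series."""
    return values.max() - values.min()

# axis=0 (default): spread() receives each column, one result per column
column_spread = df.apply(spread, axis=0)

# axis=1: spread() receives each row, one result per row
row_spread = df.apply(spread, axis=1)

print(column_spread.to_dict())  # {'City 1': 3, 'City 2': 4}
print(row_spread.tolist())      # [18, 19]
```

The same function produces a result indexed by column names with axis=0 and a result indexed by row labels with axis=1.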

Let's explicitly pass axis=1 to apply our custom function to both columns of the DataFrame, one row at a time. This creates two new columns:

data = {'City 1': [1, 4, 8, 9, 14, 25, 31, 35, 32, 25, 11, 2],
        'City 2': [19, 23, 30, 31, 35, 40, 45, 39, 30, 25, 15, 10]
       }

df = pd.DataFrame(data)

df[['City 1 Fahrenheit', 'City 2 Fahrenheit']] = df.apply(
    lambda row: pd.Series(
        [celsius_to_fahrenheit(row['City 1']),
         celsius_to_fahrenheit(row['City 2'])]),
    axis=1
)

print(df)

We used a lambda expression that takes each row as input (because of axis=1) and returns a Series object with the two Fahrenheit values. Since the lambda returns a Series for every row, apply() combines the results into a DataFrame, whose two columns are assigned to the 'City 1 Fahrenheit' and 'City 2 Fahrenheit' columns of df.

Determining the apply() Function's Return Type

The return type of apply() depends on what the applied function returns. When the function returns a single value, the result is a Series that keeps the index of the original data. When it returns a Series, as in the previous example, DataFrame.apply() combines the results into a DataFrame. When it returns a list-like object such as a list, the default result is a Series whose elements are those lists. Setting the result_type parameter to "expand" changes this last case: each list-like result is expanded into columns, so apply() returns a DataFrame. This is particularly helpful when we need to reshape the data or apply functions that generate multiple values.

Let's modify our previous example to see how we can return a DataFrame using the result_type parameter.

data = {'City 1': [1, 4, 8, 9, 14, 25, 31, 35, 32, 25, 11, 2],
        'City 2': [19, 23, 30, 31, 35, 40, 45, 39, 30, 25, 15, 10]
       }

df = pd.DataFrame(data)

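# The lambda returns a plain list; result_type='expand' turns it into two columns
df[['City 1 (Fahrenheit)', 'City 2 (Fahrenheit)']] = df.apply(
    lambda row: [celsius_to_fahrenheit(value) for value in row],
    axis=1,
    result_type='expand'
)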

print(df)

The resulting values match those of the previous example:

    City 1  City 2  City 1 (Fahrenheit)  City 2 (Fahrenheit)
0        1      19                 33.8                 66.2
1        4      23                 39.2                 73.4
2        8      30                 46.4                 86.0
3        9      31                 48.2                 87.8
4       14      35                 57.2                 95.0
5       25      40                 77.0                104.0
6       31      45                 87.8                113.0
7       35      39                 95.0                102.2
8       32      30                 89.6                 86.0
9       25      25                 77.0                 77.0
10      11      15                 51.8                 59.0
11       2      10                 35.6                 50.0

With result_type='expand', creating several new columns at once becomes simpler, as we no longer need to build a Series ourselves to get the output back as a DataFrame.
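The difference that result_type makes is easiest to see with a function that returns a plain list. The following sketch uses a two-row excerpt of the data and a hypothetical helper, to_fahrenheit_list():

```python
import pandas as pd

# Two-row excerpt of the article's data
df = pd.DataFrame({"City 1": [1, 4], "City 2": [19, 23]})

def to_fahrenheit_list(row):
    """Return the row's temperatures in Fahrenheit as a plain Python list."""
    return [value * 9 / 5 + 32 for value in row]

# Default: each returned list becomes a single element of a Series
as_series = df.apply(to_fahrenheit_list, axis=1)

# result_type='expand': each returned list is spread across columns
as_frame = df.apply(to_fahrenheit_list, axis=1, result_type="expand")

print(type(as_series).__name__)  # Series
print(type(as_frame).__name__)   # DataFrame
print(as_frame.shape)            # (2, 2)
```

Note that the expanded DataFrame has integer column labels (0 and 1), so it is usually assigned to explicitly named columns, as in the example above.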

Performance Considerations

Although apply() is a powerful tool for simplifying data transformations, it is better suited to some tasks than to others. Beyond the custom transformations and row- or column-wise operations shown earlier, it can be used to combine columns for data cleaning or feature engineering. It is also useful for handling missing values, as we can define and apply custom imputation methods.
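As a rough sketch of these two use cases, the example below fills missing values with a custom imputation function and then combines two columns into a new feature. The data, the fill_with_median() helper, and the Difference column are hypothetical and chosen only for illustration:

```python
import pandas as pd

# Hypothetical readings with one missing value in each column
df = pd.DataFrame({"City 1": [1.0, None, 8.0],
                   "City 2": [19.0, 23.0, None]})

def fill_with_median(column):
    """Custom imputation: replace missing values with the column's median."""
    return column.fillna(column.median())

# axis=0 (default): fill_with_median() is called once per column
filled = df.apply(fill_with_median)

# Feature engineering: combine both columns into one value per row
filled["Difference"] = filled.apply(
    lambda row: row["City 2"] - row["City 1"], axis=1
)

print(filled)
```

Median imputation is only one possible choice here; any function that takes a column and returns a filled column could be passed in the same way.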

It is worth considering alternatives to apply() if you need optimal performance on large datasets or computationally intensive tasks, because apply() calls a Python function once per element, row, or column. You should use built-in vectorized operations wherever possible. For instance, use the sum() method instead of apply(sum). Similarly, built-in string methods like str.replace() and conditional methods like where() can replace element-wise custom functions, and the built-in aggregation and groupby methods handle aggregations and group-wise operations efficiently. Additionally, consider Python packages such as Swifter or Dask, which can use parallel processing to work with larger datasets. Depending on your dataset size and available computational resources, these alternatives can significantly improve performance compared to using apply() alone.
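The sketch below places a few apply() calls next to vectorized equivalents on a small excerpt of the data. The labels Series is hypothetical, and no timings are shown, since the speed difference depends on the data size and environment:

```python
import pandas as pd

df = pd.DataFrame({"City 1": [1, 4, 8], "City 2": [19, 23, 30]})

# Aggregation: the built-in sum() method instead of apply(sum)
totals_apply = df.apply(sum)
totals_vectorized = df.sum()

# Arithmetic: operate on the whole column instead of element by element
fahrenheit_apply = df["City 1"].apply(lambda c: c * 9 / 5 + 32)
fahrenheit_vectorized = df["City 1"] * 9 / 5 + 32

# Conditional logic: where() keeps values that satisfy the condition
# and replaces all others (here with 0)
warm_only = df["City 2"].where(df["City 2"] >= 20, 0)

# String cleanup: str.replace() instead of apply() with a lambda
labels = pd.Series(["City_1", "City_2"])  # hypothetical labels
clean_labels = labels.str.replace("_", " ", regex=False)

print(totals_vectorized.to_dict())  # {'City 1': 13, 'City 2': 72}
print(warm_only.tolist())           # [0, 23, 30]
print(clean_labels.tolist())        # ['City 1', 'City 2']
```

Both variants of each pair produce the same values; the vectorized versions avoid a Python-level function call for every element.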

Final Thoughts

In summary, the apply() function serves as a valuable resource for data manipulation, especially in routine tasks that require repetitive code. This function allows for seamless integration of custom or built-in functions with Pandas Series and DataFrames. Identifying opportunities to utilize the apply() function can significantly improve your productivity in daily tasks.

Last Updated: July 5th, 2023

© 2013-2024 Stack Abuse. All rights reserved.