The Power of Indexing: Boosting Data Wrangling Efficiency with Pandas

Introduction

Pandas is the most widely used Python library for data manipulation, and it allows us to access and manipulate data efficiently.

By understanding and utilizing indexing techniques effectively in Pandas, we can significantly improve the speed and efficiency of our data-wrangling tasks.

In this article, we'll explore various indexing techniques in Pandas, and we'll see how to leverage them for faster data wrangling.

Introducing Indexing in Pandas

The Pandas library provides two primary objects: Series and DataFrames.

A Pandas Series is a one-dimensional labeled array, capable of holding any kind of data type.

A Pandas DataFrame is a table, similar to a spreadsheet, capable of storing any kind of data and is built with rows and columns.

To be more precise, a Pandas DataFrame can also be seen as an ordered collection of Pandas Series.

So, both Series and DataFrames have an index, which provides a way to uniquely identify and access every single element.

In this article, we'll demonstrate some indexing techniques in Pandas to enhance your daily data manipulation tasks.

Coding Indexing Techniques in Pandas

Now, let's explore some indexing techniques using actual Python code.

Integer-Based Indexing

We'll begin with the integer-based method that enables us to select rows and columns in a data frame.

But first, let's understand how we can create a data frame in Pandas:

import pandas as pd

# Create tabular data
data = {
    'A': [1, 2, 3, 4, 5],
    'B': [6, 7, 8, 9, 10],
    'C': [11, 12, 13, 14, 15]
}

# Create data frame
df = pd.DataFrame(data)

# Show data frame
print(df)

This will produce:

   A   B   C
0  1   6  11
1  2   7  12
2  3   8  13
3  4   9  14
4  5  10  15

As we can see, the data for a Pandas data frame are created in the same manner we create a dictionary in Python. In fact, the names of the columns are the keys and the numbers in the lists are the values. Column names and values are separated by a colon, exactly like keys and values in dictionaries. Lastly, they're housed within curly brackets.

The integer-based methodology uses the method iloc[] for indexing a data frame. For example, if we want to index two rows, we can type the following:

# Select rows from index 1 to 3 using .iloc[]
sliced_rows = df.iloc[1:3]

# Print sliced rows
print(sliced_rows)

And we get:

    A  B   C
1   2  7  12
2   3  8  13

Note: Remember that in Python we start counting from 0, iloc[1:3] selects the second and the third row.

Now, iloc[] can also select columns like so:

# Select columns 0 to 2 using .iloc[]
sliced_cols = df.iloc[:, 0:2]

# Print sliced columns
print(sliced_cols)

And we get:

   A   B
0  1   6
1  2   7
2  3   8
3  4   9
4  5  10

So, in this case, the colon inside the square brackets means that we want to take all the values in the rows. Then, after the comma, we specify which columns we want to get (remembering that we start counting from 0).

Another way to slice indexes with integers is by using the loc[] method. For example, like so:

# Select rows with index labels using .loc[]
sliced_rows = df.loc[1:3]

# Show sliced rows
print(sliced_rows)

And we get:

   A  B   C
1  2  7  12
2  3  8  13
3  4  9  14

Note: Taking a deep look at both loc[] and iloc[] methods, we can see that in .loc[], the start and end labels are both inclusive, while iloc[] includes the start index and excludes the end index.

Also, we want to add that the loc[] method gives us the possibility to slice a Pandas DataFrame with renamed indexes. Let's see what we mean with an example:

import pandas as pd

# Create tabular data
data = {
    'A': [1, 2, 3, 4, 5],
    'B': [6, 7, 8, 9, 10],
    'C': [11, 12, 13, 14, 15]
}

# Rename indexes
df = pd.DataFrame(data, index=['Row_1', 'Row_2', 'Row_3', 'Row_4', 'Row_5'])

# Select rows with index labels from 'Row_2' to 'Row_4'
sliced_rows = df.loc['Row_2':'Row_4']

# Print sliced rows
print(sliced_rows)

And we get:

       A  B   C
Row_2  2  7  12
Row_3  3  8  13
Row_4  4  9  14

So, as we can see, now the indexes are no longer integers: they are strings and the loc[] method can be used to slice the data frame as we did.

Boolean Indexing

Boolean indexing involves selecting rows or columns based on a condition expressed as a boolean. The data frame (or the series) will be filtered to include only the rows or columns that satisfy the given condition.

For example, suppose we have a data frame with all numeric values. We want to filter the data frame by indexing a column so that it shows us only the values greater than two. We can do it like so:

import pandas as pd

data = {
    'A': [1, 2, 3, 4, 5],
    'B': [6, 7, 8, 9, 10],
    'C': [11, 12, 13, 14, 15]
}

# Create data frame
df = pd.DataFrame(data)

# Select rows where column 'A' is greater than 2
condition = df['A'] > 2

# Create filtered data frame
filtered_rows = df[condition]

# Print new data frame
print(filtered_rows)

And we get:

Free eBook: Git Essentials

Check out our hands-on, practical guide to learning Git, with best-practices, industry-accepted standards, and included cheat sheet. Stop Googling Git commands and actually learn it!

   A   B   C
2  3   8  13
3  4   9  14
4  5  10  15

So, with condition = df['A'] > 2, we've created a Pandas series that gets the values greater than two in column A. Then, with filtered_rows = df[condition], we've created the filtered dataframe that shows only the rows that match the condition we imposed on column A.

Of course, we can index a dataframe so that it matches different conditions, even for different columns. For example, say we want to add a condition on column A and on column B. We can do it like so:

# Imposing the condition on two columns
condition = (df['A'] > 2) & (df['B'] < 10)

# Create filtered dataframe
filtered_rows = df[condition]

# Print filtered dataframe
print(filtered_rows)

And we get:

   A  B   C
2  3  8  13
3  4  9  14

So, to add multiple conditions, we use the operator &.

Also, we can even slice an entire data frame. For example, say that we just want to see the columns that have values greater than eight. We can do it like so:

# Imposing the condition on the entire data frame
condition = (df > 8).all()

# Index for the imposed condition
filtered_cols = df.loc[:, condition]

# Print result
print(filtered_cols)

And we get:

    C
0  11
1  12
2  13
3  14
4  15

And so, only column C matches the imposed condition.
So, with the method all(), we're imposing a condition on the entire data frame.

Setting New Indexes and Resetting to Old Ones

There are situations in which we may take a column of a Pandas data frame and use it as an index for the entire data frame. For example, in cases where this kind of manipulation may result in faster slicing of the indexes.

For example, consider we have a data frame that stores data related to countries, cities, and their respective populations. We may want to set the city column as the index of the data frame. We can do it like so:

import pandas as pd

data = {
    'City': ['New York', 'Los Angeles', 'Chicago', 'Houston'],
    'Country': ['USA', 'USA', 'USA', 'USA'],
    'Population': [8623000, 4000000, 2716000, 2302000]
}

df = pd.DataFrame(data)

# Set 'City' as the new index
df.set_index(['City'], inplace=True)

# Print indexed data frame
print(df)

And we have:

            Country  Population
City                           
New York        USA     8623000
Los Angeles     USA     4000000
Chicago         USA     2716000
Houston         USA     2302000

Note that we used a similar method before, specifically at the end of the paragraph "Integer-Based Indexing". That method was used to rename the indexes: we had numbers in the beginning and we renamed them as strings.

In this last case, a column has become the index of the data frame. This means that we can filter it using loc[] as we did before:

# Slice for indexed column
sliced_rows = df.loc['New York':'Chicago']

# Print indexed data frame                   
print(sliced_rows)

And the result is:

            Country  Population
City                           
New York        USA     8623000
Los Angeles     USA     4000000
Chicago         USA     2716000

Note: When we index a column as we did, the column name "drops down," meaning it's no longer at the same level as the names of the other columns, as we can see. In these cases, the indexed column ("City", in this case) can't be accessed as we do with columns in Pandas anymore, until we restore it as a column.

So, if we want to restore the classical indexing method, restoring the indexed column(s) as column(s), we can type the following:

# Create a new data frame with restored indexes
df_reset = df.reset_index()

# Print new, restored data frame
print(df_reset)

And we get:

          City Country  Population
0     New York     USA     8623000
1  Los Angeles     USA     4000000
2      Chicago     USA     2716000
3      Houston     USA     2302000

So, in this case, we've created a new DataFrame called df_reset with the method reset_index(), which has restored the indexes, as we can see.

Sorting Indexes

Pandas also gives us the possibility to sort indexes in descending order (the ascending order is the standard one) by using the sort_index() method like so:

import pandas as pd

data = {
    'B': [6, 7, 8, 9, 10],
    'A': [1, 2, 3, 4, 5],
    'C': [11, 12, 13, 14, 15]
}

df = pd.DataFrame(data)

# Sort the DataFrame based on the index in ascending order
df_sorted = df.sort_index(ascending=False)

# Print sorted df
print(df_sorted)

And this results in:

    B  A   C
4  10  5  15
3   9  4  14
2   8  3  13
1   7  2  12
0   6  1  11

This methodology can even be used when we rename indexes or when we index a column. For example, say we want to rename the indexes and sort them in descending order:

import pandas as pd

data = {
    'B': [6, 7, 8, 9, 10],
    'A': [1, 2, 3, 4, 5],
    'C': [11, 12, 13, 14, 15]
}

df = pd.DataFrame(data, index=["row 1", "row 2", "row 3", "row 4", "row 5"])

# Sort the data frame based on the index in ascending order
df_sorted = df.sort_index(ascending=False)

# Print sorted df
print(df_sorted)

And we have:

        B  A   C
row 5  10  5  15
row 4   9  4  14
row 3   8  3  13
row 2   7  2  12
row 1   6  1  11

So, to achieve this result, we use the sort_index() and pass the ascending=False parameter to it.

Conclusions

In this article, we've shown different methodologies to index Pandas data frames.

Some methodologies yield results similar to others, so the choice has to be made keeping in mind the exact result we want to achieve when we're manipulating our data.

Last Updated: August 1st, 2023
Was this article helpful?

Improve your dev skills!

Get tutorials, guides, and dev jobs in your inbox.

No spam ever. Unsubscribe at any time. Read our Privacy Policy.

Federico TrottaAuthor

I'm a Mechanical Engineer who's worked in the industrial field for 5 years.

Thanks to my curiosity, I discovered programming and AI. Having a burning passion for writing, I couldn't avoid starting to write about these topics, so I decided to change my career to become a Technical Writer.

My purpose is to educate people on Python programming, Machine Learning, and Data Science, through writing.

Want to contact me? Here are my contacts: https://bio.link/federicotrotta

Project

Building Your First Convolutional Neural Network With Keras

# python# artificial intelligence# machine learning# tensorflow

Most resources start with pristine datasets, start at importing and finish at validation. There's much more to know. Why was a class predicted? Where was...

David Landup
David Landup
Details
Course

Data Visualization in Python with Matplotlib and Pandas

# python# pandas# matplotlib

Data Visualization in Python with Matplotlib and Pandas is a course designed to take absolute beginners to Pandas and Matplotlib, with basic Python knowledge, and...

David Landup
David Landup
Details

© 2013-2024 Stack Abuse. All rights reserved.

AboutDisclosurePrivacyTerms