How to Select Columns in Pandas Based on a String Prefix
Introduction
Pandas is a powerful Python library for working with and analyzing data. One operation that you might need to perform when working with data in Pandas is selecting columns based on their string prefix. This can be useful when you have a large DataFrame and you want to focus on specific columns that share a common prefix.
In this Byte, we'll explore a few methods to achieve this, including creating a series to select columns and using DataFrame.loc
.
Select All Columns Starting with a Given String
Let's start with a simple DataFrame:
import pandas as pd
data = {
'item1': [1, 2, 3],
'item2': [4, 5, 6],
'stuff1': [7, 8, 9],
'stuff2': [10, 11, 12]
}
df = pd.DataFrame(data)
print(df)
Output:
item1 item2 stuff1 stuff2
0 1 4 7 10
1 2 5 8 11
2 3 6 9 12
To select columns that start with 'item', you can use list comprehension:
selected_columns = [column for column in df.columns if column.startswith('item')]
print(df[selected_columns])
Output:
item1 item2
0 1 4
1 2 5
2 3 6
Creating a Series to Select Columns
Another approach to select columns based on their string prefix is to create a Series object from the DataFrame columns, and then use the str.startswith()
method. This method returns a boolean Series where a True
value means that the column name starts with the specified string.
selected_columns = pd.Series(df.columns).str.startswith('item')
print(df.loc[:, selected_columns])
Output:
item1 item2
0 1 4
1 2 5
2 3 6
Using DataFrame.loc to Select Columns
The DataFrame.loc
method is primarily label-based, but may also be used with a boolean array. The ix
indexer for DataFrame is deprecated now, as it has a number of problems. .loc
will raise a KeyError
when the items are not found.
Consider the following example:
selected_columns = df.columns[df.columns.str.startswith('item')]
print(df.loc[:, selected_columns])
Output:
item1 item2
0 1 4
1 2 5
2 3 6
Here, we first create a boolean array that is True
for columns starting with 'item'. Then, we use this array to select the corresponding columns from the DataFrame using the .loc
indexer. This method is more efficient than the previous ones, especially for large DataFrames, as it avoids creating an intermediate list or Series.
Applying DataFrame.filter() for Column Selection
The filter()
function in pandas DataFrame provides a flexible and efficient way to select columns based on their names. It is especially useful when dealing with large datasets with many columns.
The filter()
function allows us to select columns based on their labels. We can use the like
parameter to specify a string pattern that matches the column names. However, if we want to select columns based on a string prefix, we can use the regex
parameter.
Here's an example:
import pandas as pd
# Create a DataFrame
df = pd.DataFrame({
'product_id': [101, 102, 103, 104],
'product_name': ['apple', 'banana', 'cherry', 'date'],
'product_price': [1.2, 0.5, 0.75, 1.3],
'product_weight': [150, 120, 50, 60]
})
# Select columns that start with 'product'
df_filtered = df.filter(regex='^product')
print(df_filtered)
This will output:
product_id product_name product_price product_weight
0 101 apple 1.20 150
1 102 banana 0.50 120
2 103 cherry 0.75 50
3 104 date 1.30 60
In the above code, the ^
symbol is a regular expression that matches the start of a string. Therefore, '^product'
will match all column names that start with 'product'.
Next: The filter()
function returns a new DataFrame that shares the data with the original DataFrame. So, any modifications to the new DataFrame will not affect the original DataFrame.
Conclusion
In this Byte, we explored different ways to select columns in a pandas DataFrame based on a string prefix. We learned how to create a Series and use it to select columns, how to use the DataFrame.loc
function, and how to apply the DataFrame.filter()
function. Of course, each of these methods has its own advantages and use cases. The choice of method depends on the specific requirements of your data analysis task.