Introduction
In the previous article, we studied how we can use filter methods for feature selection for machine learning algorithms. Filter methods are handy when you want to select a generic set of features for all the machine learning models.
However, in some scenarios, you may want to use a specific machine learning algorithm to train your model. In such cases, features selected through filter methods may not be the most optimal set of features for that specific algorithm. There is another category of feature selection methods that select the most optimal features for the specified algorithm. Such methods are called wrapper methods.
Wrapper Methods for Feature Selection
Wrapper methods search over combinations of features and select the combination that produces the best result for a specific machine learning algorithm; apart from the exhaustive variant, they rely on greedy search rather than testing every possible subset. A downside to this approach is that evaluating many combinations of features can be computationally very expensive, particularly if the feature set is very large.
As mentioned earlier, wrapper methods can find the best set of features for a specific algorithm; the downside is that this set of features may not be optimal for every other machine learning algorithm.
Wrapper methods for feature selection can be divided into three categories: Step forward feature selection, Step backwards feature selection and Exhaustive feature selection. In this article, we will see how we can implement these feature selection approaches in Python.
Step Forward Feature Selection
In the first phase of the step forward feature selection, the performance of the classifier is evaluated with respect to each feature. The feature that performs the best is selected out of all the features.
In the second step, the first feature is tried in combination with all the other features. The combination of two features that yields the best algorithm performance is selected. The process continues until the specified number of features has been selected.
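Before moving to the library implementation, here is a minimal sketch of this greedy loop written against a generic scikit-learn estimator and cross_val_score; the forward_selection helper and its stopping criterion are illustrative, not part of any library.
from sklearn.model_selection import cross_val_score

def forward_selection(estimator, X, y, n_features, scoring='roc_auc', cv=4):
    # X is a pandas DataFrame; the selected subset grows by one column per round
    selected = []
    remaining = list(X.columns)
    while len(selected) < n_features and remaining:
        # Evaluate every candidate feature added to the current subset
        scores = {}
        for feature in remaining:
            subset = selected + [feature]
            scores[feature] = cross_val_score(
                estimator, X[subset], y, scoring=scoring, cv=cv).mean()
        # Keep the single feature that improves performance the most
        best = max(scores, key=scores.get)
        selected.append(best)
        remaining.remove(best)
    return selected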
Let's implement step forward feature selection in Python. We will be using the BNP Paribas Cardif Claims Management dataset for this section as we did in our previous article.
To implement step forward feature selection, we would normally need to convert categorical feature values into numeric ones. However, for the sake of simplicity, we will remove all the non-numeric columns from our data. We will also remove the correlated columns, as we did in the previous article, so that we have a small feature set to process.
Data Preprocessing
The following script imports the dataset and the required libraries, removes the non-numeric columns from the dataset, and then divides the dataset into training and test sets. Finally, all the columns with a correlation greater than 0.8 are removed. Take a look at the previous article for a detailed explanation of this script:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import VarianceThreshold
paribas_data = pd.read_csv(r"E:\Datasets\paribas_data.csv", nrows=20000)
paribas_data.shape
num_colums = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
numerical_columns = list(paribas_data.select_dtypes(include=num_colums).columns)
paribas_data = paribas_data[numerical_columns]
paribas_data.shape
train_features, test_features, train_labels, test_labels = train_test_split(
    paribas_data.drop(labels=['target', 'ID'], axis=1),
    paribas_data['target'],
    test_size=0.2,
    random_state=41)
correlated_features = set()
correlation_matrix = paribas_data.corr()
for i in range(len(correlation_matrix.columns)):
    for j in range(i):
        if abs(correlation_matrix.iloc[i, j]) > 0.8:
            colname = correlation_matrix.columns[i]
            correlated_features.add(colname)
train_features.drop(labels=correlated_features, axis=1, inplace=True)
test_features.drop(labels=correlated_features, axis=1, inplace=True)
train_features.shape, test_features.shape
Implementing Step Forward Feature Selection in Python
To select the optimal features, we will be using the SequentialFeatureSelector class from the mlxtend library. The library can be installed by executing the following command at the Anaconda command prompt:
conda install -c conda-forge mlxtend
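If you are not using Anaconda, the package can also be installed with pip:
pip install mlxtend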
We will use the Random Forest Classifier as the estimator. The evaluation criterion will be ROC-AUC. The following script selects the 15 features from our dataset that yield the best performance for the random forest classifier:
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from sklearn.metrics import roc_auc_score
from mlxtend.feature_selection import SequentialFeatureSelector
feature_selector = SequentialFeatureSelector(RandomForestClassifier(n_jobs=-1),
                                             k_features=15,
                                             forward=True,
                                             verbose=2,
                                             scoring='roc_auc',
                                             cv=4)
In the script above, we pass the RandomForestClassifier as the estimator to the SequentialFeatureSelector. The k_features parameter specifies the number of features to select; you can set any number of features here. The forward parameter, if set to True, performs step forward feature selection. The verbose parameter is used for logging the progress of the feature selector, the scoring parameter defines the performance evaluation criterion, and finally, cv refers to the number of cross-validation folds.
Having created our feature selector, we now need to call its fit method and pass it the training features and labels, as shown below:
features = feature_selector.fit(np.array(train_features.fillna(0)), train_labels)
Depending upon your system hardware, the above script can take some time to execute. Once the above script finishes executing, you can execute the following script to see the 15 selected features:
filtered_features = train_features.columns[list(features.k_feature_idx_)]
filtered_features
In the output, you should see the following features:
Index(['v4', 'v10', 'v14', 'v15', 'v18', 'v20', 'v23', 'v34', 'v38', 'v42',
'v50', 'v51', 'v69', 'v72', 'v129'],
dtype='object')
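Beyond the indices of the selected columns, the fitted selector also exposes its cross-validated score and the intermediate results of each step; according to the mlxtend documentation these are available through the k_score_ and subsets_ attributes, which you can inspect as in the sketch below:
import pandas as pd

# Cross-validated ROC-AUC of the final 15-feature subset
print(features.k_score_)

# Per-step results: a dict keyed by subset size, holding the chosen
# feature indices and their average cross-validation score
step_results = pd.DataFrame.from_dict(features.subsets_, orient='index')
print(step_results[['feature_idx', 'avg_score']])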
Now to see the classification performance of the random forest algorithm using these 15 features, execute the following script:
clf = RandomForestClassifier(n_estimators=100, random_state=41, max_depth=3)
clf.fit(train_features[filtered_features].fillna(0), train_labels)
train_pred = clf.predict_proba(train_features[filtered_features].fillna(0))
print('ROC-AUC on training set: {}'.format(roc_auc_score(train_labels, train_pred[:, 1])))
test_pred = clf.predict_proba(test_features[filtered_features].fillna(0))
print('ROC-AUC on test set: {}'.format(roc_auc_score(test_labels, test_pred[:, 1])))
In the script above, we train our random forest classifier on the 15 features selected through step forward feature selection and then evaluate its performance on the training and test sets. In the output, you should see the following results:
ROC-AUC on training set: 0.7072327148174093
ROC-AUC on test set: 0.7096973252804142
You can see that the ROC-AUC scores on the training and test sets are pretty similar, which means that our model is not overfitting.
Step Backwards Feature Selection
Step backwards feature selection, as the name suggests, is the exact opposite of the step forward feature selection we studied in the last section. In the first step, each feature is removed in turn from the full feature set and the performance of the classifier is evaluated on the remaining features; the subset that yields the best performance is retained. In the second step, one more feature is again removed in a round-robin fashion from the retained subset, and the performance of each resulting combination is evaluated. This process continues until only the specified number of features remain in the dataset.
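As with the forward case, the idea can be sketched in a few lines against a generic scikit-learn estimator; the backward_selection helper below is our own illustration, not a library function.
from sklearn.model_selection import cross_val_score

def backward_selection(estimator, X, y, n_features, scoring='roc_auc', cv=4):
    # Start from the full feature set and drop one feature per round
    selected = list(X.columns)
    while len(selected) > n_features:
        # Evaluate every subset obtained by removing a single feature
        scores = {}
        for feature in selected:
            subset = [f for f in selected if f != feature]
            scores[feature] = cross_val_score(
                estimator, X[subset], y, scoring=scoring, cv=cv).mean()
        # Drop the feature whose removal hurts performance the least
        selected.remove(max(scores, key=scores.get))
    return selected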
Step Backwards Feature Selection in Python
In this section, we will implement step backwards feature selection on the BNP Paribas Cardif Claims Management dataset. The preprocessing steps will remain the same as in the previous section. The only change will be in the forward parameter of the SequentialFeatureSelector class; in the case of step backwards feature selection, we will set this parameter to False. Execute the following script:
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from sklearn.metrics import roc_auc_score
from mlxtend.feature_selection import SequentialFeatureSelector
feature_selector = SequentialFeatureSelector(RandomForestClassifier(n_jobs=-1),
                                             k_features=15,
                                             forward=False,
                                             verbose=2,
                                             scoring='roc_auc',
                                             cv=4)
features = feature_selector.fit(np.array(train_features.fillna(0)), train_labels)
To see the features selected as a result of step backwards elimination, execute the following script:
filtered_features = train_features.columns[list(features.k_feature_idx_)]
filtered_features
The output looks like this:
Index(['v7', 'v8', 'v10', 'v17', 'v34', 'v38', 'v45', 'v50', 'v51', 'v61',
'v94', 'v99', 'v119', 'v120', 'v129'],
dtype='object')
Finally, let's evaluate the performance of our random forest classifier on the features selected as a result of step backwards feature selection. Execute the following script:
clf = RandomForestClassifier(n_estimators=100, random_state=41, max_depth=3)
clf.fit(train_features[filtered_features].fillna(0), train_labels)
train_pred = clf.predict_proba(train_features[filtered_features].fillna(0))
print('ROC-AUC on training set: {}'.format(roc_auc_score(train_labels, train_pred[:, 1])))
test_pred = clf.predict_proba(test_features[filtered_features].fillna(0))
print('ROC-AUC on test set: {}'.format(roc_auc_score(test_labels, test_pred[:, 1])))
The output looks like this:
ROC-AUC on training set: 0.7095207938140247
ROC-AUC on test set: 0.7114624676445211
You can see that the performance achieved on the training set is similar to that achieved using step forward feature selection. However, on the test set, backward feature selection performed slightly better.
Exhaustive Feature Selection
In exhaustive feature selection, the performance of a machine learning algorithm is evaluated against all possible combinations of the features in the dataset, and the feature subset that yields the best performance is selected. Unlike the greedy step forward and step backward methods, exhaustive search is a brute-force approach: it tries every allowed combination of features and keeps the best one.
A downside to exhaustive feature selection is that it can be much slower than the step forward and step backward methods, since it evaluates all feature combinations.
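To get a feel for how quickly the search space grows, the short snippet below counts the subsets that would have to be evaluated for a given number of features; the feature counts used are arbitrary examples:
from math import comb

# Number of feature subsets containing between min_features and max_features columns
def n_subsets(n_features, min_features, max_features):
    return sum(comb(n_features, k) for k in range(min_features, max_features + 1))

print(n_subsets(30, 2, 4))   # 31,900 subsets for just 30 features
print(n_subsets(30, 1, 30))  # more than a billion subsets if all sizes are allowed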
Exhaustive Feature Selection in Python
In this section, we will implement exhaustive feature selection on the BNP Paribas Cardif Claims Management dataset. The preprocessing steps will remain the same as for step forward feature selection.
To implement exhaustive feature selection, we will be using the ExhaustiveFeatureSelector class from the mlxtend.feature_selection module. The class has min_features and max_features parameters, which can be used to specify the minimum and the maximum number of features in the combination.
Execute the following script:
from mlxtend.feature_selection import ExhaustiveFeatureSelector
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from sklearn.metrics import roc_auc_score
feature_selector = ExhaustiveFeatureSelector(RandomForestClassifier(n_jobs=-1),
                                             min_features=2,
                                             max_features=4,
                                             scoring='roc_auc',
                                             print_progress=True,
                                             cv=2)
Having created our feature selector, we now need to call its fit method and pass it the training features and labels, as shown below:
features = feature_selector.fit(np.array(train_features.fillna(0)), train_labels)
Note that the above script can take quite a bit of time to execute. To see the features selected as a result of the exhaustive search, execute the following script:
filtered_features = train_features.columns[list(features.best_idx_)]
filtered_features
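The cross-validated ROC-AUC of the winning combination can also be inspected; per the mlxtend documentation, the fitted ExhaustiveFeatureSelector stores it in the best_score_ attribute:
# Cross-validated ROC-AUC of the best feature combination found by the exhaustive search
print(features.best_score_)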
Finally, to see the performance of the random forest classifier on the features selected as a result of exhaustive feature selection, execute the following script:
clf = RandomForestClassifier(n_estimators=100, random_state=41, max_depth=3)
clf.fit(train_features[filtered_features].fillna(0), train_labels)
train_pred = clf.predict_proba(train_features[filtered_features].fillna(0))
print('ROC-AUC on training set: {}'.format(roc_auc_score(train_labels, train_pred[:, 1])))
test_pred = clf.predict_proba(test_features[filtered_features].fillna(0))
print('ROC-AUC on test set: {}'.format(roc_auc_score(test_labels, test_pred[:, 1])))
Conclusion
Wrapper methods are some of the most important techniques for selecting features tailored to a specific machine learning algorithm. In this article, we studied different types of wrapper methods along with their practical implementation: step forward, step backwards, and exhaustive feature selection.
As a rule of thumb, exhaustive feature selection is a reasonable choice when the dataset and feature set are small; for larger datasets, the step forward or step backward feature selection methods should be preferred.