Get Feature Importance from XGBRegressor with XGBoost
So - you've trained a sparkling regressor using XGBoost! Which features are the most important in the regression? The first step in opening up the black box that a machine learning model can be is to inspect its features and their importance in the regression.
Let's quickly train a mock XGBRegressor on a toy dataset:
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

X, y = datasets.load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y)

scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.shape)
The shape of X_train_scaled is:

(331, 10)
10 features to learn from and plug into the regression formula. Let's fit the model:
import xgboost as xgb

xbg_reg = xgb.XGBRegressor().fit(X_train_scaled, y_train)
Great! Now, to access the feature importance scores, you'll get the underlying booster of the model via get_booster(), whose handy get_score() method lets you get the importance scores.
As per the documentation, you can pass in an argument that defines which type of importance score you want to calculate:

- 'weight' - the number of times a feature is used to split the data across all trees.
- 'gain' - the average gain across all splits the feature is used in.
- 'cover' - the average coverage across all splits the feature is used in.
- 'total_gain' - the total gain across all splits the feature is used in.
- 'total_cover' - the total coverage across all splits the feature is used in.
Depending on which importance type you want to inspect, you'll get the importance scores back as a dictionary:
xbg_reg.get_booster().get_score(importance_type='gain')
This results in a dictionary of features and their scores:
{'f0': 269.0863952636719,
'f1': 289.7273254394531,
'f2': 1493.409912109375,
'f3': 708.8233642578125,
'f4': 397.26751708984375,
'f5': 336.8326110839844,
'f6': 586.3340454101562,
'f7': 680.273193359375,
'f8': 3906.28857421875,
'f9': 531.477783203125}
Dictionaries are easily convertible into Pandas DataFrames, which are in turn easy to visualize using the underlying integration with Matplotlib:
import pandas as pd
f_importance = xbg_reg.get_booster().get_score(importance_type='gain')
importance_df = pd.DataFrame.from_dict(data=f_importance,
orient='index')
And now, to plot them:
importance_df.plot.bar()
This results in a bar plot of the importance scores. Recall that we've fit the regressor with 10 features - the importance of each is displayed in the graph.
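The f0…f9 keys are just positional placeholders for the columns of the training matrix. A small sketch, assuming the feature ordering that load_diabetes() exposes, to map them back to the dataset's actual feature names:

```python
from sklearn import datasets

# load_diabetes() exposes column names in the same order as the data matrix
feature_names = datasets.load_diabetes().feature_names
name_map = {f"f{i}": name for i, name in enumerate(feature_names)}

print(name_map)
```

With this mapping, the top-scoring 'f8' from the example above resolves to the dataset's 's5' column, making the plot's labels far more readable.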
If you'd like to read more about Pandas' plotting capabilities in more detail, read our "Guide to Data Visualization in Python with Pandas"!