Get Feature Importance from XGBRegressor with XGBoost

So - you've trained a sparkling regressor using XGBoost! Which features matter most to its predictions? A first step in opening up the black box that a machine learning model can be is to inspect the features and their importance scores.

Let's quickly train a mock XGBRegressor on a toy dataset:

from sklearn import datasets
from sklearn.model_selection import train_test_split

X, y = datasets.load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y)

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.shape)

The shape of X_train_scaled is:

(331, 10)

That's 331 samples, with 10 features each for the model to learn from. Let's fit the model:

import xgboost as xgb

xbg_reg = xgb.XGBRegressor().fit(X_train_scaled, y_train)

Great! Now, to access the feature importance scores, get the underlying booster of the model via get_booster(), and call its handy get_score() method.

As per the documentation, you can pass in an importance_type argument, which defines which type of importance score you want to calculate (the default is 'weight'):

  • ‘weight’ - the number of times a feature is used to split the data across all trees.
  • ‘gain’ - the average gain across all splits the feature is used in.
  • ‘cover’ - the average coverage across all splits the feature is used in.
  • ‘total_gain’ - the total gain across all splits the feature is used in.
  • ‘total_cover’ - the total coverage across all splits the feature is used in.
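The five types are closely related: each average is the matching total divided by the number of splits. Here is a minimal sketch of that bookkeeping in plain Python. It doesn't read anything from a trained booster; the `splits` list and its numbers are made up purely for illustration:

```python
from collections import defaultdict

# Hypothetical split records: (feature, gain of the split, cover of the split).
# These numbers are invented for illustration, not taken from a real model.
splits = [
    ("f0", 4.0, 50.0),
    ("f0", 6.0, 10.0),
    ("f0", 5.0, 30.0),
    ("f2", 40.0, 80.0),
]

weight = defaultdict(int)         # number of splits per feature
total_gain = defaultdict(float)   # summed gain per feature
total_cover = defaultdict(float)  # summed cover per feature

for feature, split_gain, split_cover in splits:
    weight[feature] += 1
    total_gain[feature] += split_gain
    total_cover[feature] += split_cover

# The averaged types are the totals divided by the split count.
gain = {f: total_gain[f] / weight[f] for f in weight}
cover = {f: total_cover[f] / weight[f] for f in weight}

for f in sorted(weight):
    print(f, weight[f], total_gain[f], gain[f], total_cover[f], cover[f])
```

In this toy data, f0 leads on 'weight' (three splits against one), while f2 leads on 'gain' - a reminder that the importance types can rank the same features differently.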

Whichever importance type you choose to inspect, you'll get the scores back as a dictionary:

xbg_reg.get_booster().get_score(importance_type='gain')

This results in a dictionary of features and their scores (your exact values will differ between runs, since train_test_split() shuffles the data randomly):

{'f0': 269.0863952636719,
 'f1': 289.7273254394531,
 'f2': 1493.409912109375,
 'f3': 708.8233642578125,
 'f4': 397.26751708984375,
 'f5': 336.8326110839844,
 'f6': 586.3340454101562,
 'f7': 680.273193359375,
 'f8': 3906.28857421875,
 'f9': 531.477783203125}
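The keys f0 to f9 are positional placeholders, since the model was fit on a plain NumPy array, and raw gain values are hard to compare at a glance. As a rough sketch, assuming the scores printed above and the column order of Scikit-Learn's diabetes dataset, you could attach the names and express each score as a share of the total. The helper name `rank_scores` is hypothetical, not part of XGBoost:

```python
# Gain scores copied from the output above.
scores = {
    "f0": 269.0863952636719, "f1": 289.7273254394531,
    "f2": 1493.409912109375, "f3": 708.8233642578125,
    "f4": 397.26751708984375, "f5": 336.8326110839844,
    "f6": 586.3340454101562, "f7": 680.273193359375,
    "f8": 3906.28857421875, "f9": 531.477783203125,
}

# Column order of the diabetes dataset (datasets.load_diabetes().feature_names).
feature_names = ["age", "sex", "bmi", "bp", "s1", "s2", "s3", "s4", "s5", "s6"]

def rank_scores(scores, feature_names):
    """Return (name, share) pairs sorted by descending share of the total score."""
    total = sum(scores.values())
    # get_score() leaves out features never used in a split, so default to 0.
    named = {name: scores.get(f"f{i}", 0.0) for i, name in enumerate(feature_names)}
    shares = [(name, score / total) for name, score in named.items()]
    return sorted(shares, key=lambda pair: pair[1], reverse=True)

ranked = rank_scores(scores, feature_names)
for name, share in ranked:
    print(f"{name:>4}  {share:6.1%}")
```

For this particular run, s5 and bmi come out on top.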

Dictionaries are easily convertible into Pandas DataFrames, which are in turn easy to visualize using the underlying integration with Matplotlib:

import pandas as pd
f_importance = xbg_reg.get_booster().get_score(importance_type='gain')

importance_df = pd.DataFrame.from_dict(data=f_importance, 
                                       orient='index')

And now, to plot them:

importance_df.plot.bar()

This results in a bar chart with one bar per feature.

Recall that we've fit the regressor with 10 features - the importance of each is displayed in the graph.
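If you'd rather have the bars ordered from most to least important, one option is to sort the scores before plotting. This is a small sketch, assuming the gain scores shown earlier (copied in so the snippet runs on its own); the variable name `ranked` is just for illustration:

```python
import pandas as pd

# Gain scores from earlier in the article, copied so the snippet is self-contained.
f_importance = {
    "f0": 269.0863952636719, "f1": 289.7273254394531,
    "f2": 1493.409912109375, "f3": 708.8233642578125,
    "f4": 397.26751708984375, "f5": 336.8326110839844,
    "f6": 586.3340454101562, "f7": 680.273193359375,
    "f8": 3906.28857421875, "f9": 531.477783203125,
}

# A Series keeps the feature keys as the index; the name labels the values.
ranked = pd.Series(f_importance, name="gain").sort_values(ascending=False)
print(ranked.head(3))
```

Calling `ranked.plot.bar()` afterwards would draw the same chart, with the bars in descending order.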

If you'd like to explore Pandas' plotting capabilities in more detail, read our "Guide to Data Visualization in Python with Pandas"!

Author: David Landup

© 2013-2024 Stack Abuse. All rights reserved.
