RandomizedSearchCV with XGBoost in Scikit-Learn Pipeline

RandomizedSearchCV and GridSearchCV let you tune hyperparameters with Scikit-Learn. The former evaluates a fixed number of randomly sampled configurations (set by n_iter), while the latter exhaustively evaluates every combination in the grid.

XGBoost is a widely used gradient boosting library (the name is short for eXtreme Gradient Boosting), whose regressors and classifiers often outperform more traditional implementations on tabular data.

It plays well with Scikit-Learn and its models can in most cases be used in place of Scikit-Learn models.

In this Byte, you'll find an end-to-end example of a Scikit-Learn pipeline that scales the data, fits XGBoost's XGBRegressor, and then tunes its hyperparameters with Scikit-Learn's RandomizedSearchCV.

First, let's establish a baseline with an untuned pipeline:

import sklearn
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.pipeline import Pipeline

from xgboost import XGBRegressor

X, y = datasets.fetch_california_housing(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y)

pipeline = Pipeline([('scaler', MinMaxScaler()), ('regressor', XGBRegressor())])
pipeline.fit(X_train, y_train)

Let's score it:

pipeline.score(X_test, y_test)
# 0.7752702631435812

Awesome, an R2 score of 0.77! Let's perform hyperparameter tuning on the pipeline. You can set the parameters of any step by prefixing the parameter name with the step's name and a double underscore (__), such as regressor__max_depth. If you're unsure of the names, you can always list all of the configurable parameters:

pipeline.get_params().keys()
# dict_keys([..., 'scaler__copy', 'scaler__feature_range', 'regressor__base_score', 'regressor__booster', 'regressor__colsample_bylevel', ...
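
To make the step-name prefix concrete, here's a minimal sketch that lists and sets nested parameters through a pipeline. It assumes Scikit-Learn's GradientBoostingRegressor as a stand-in for XGBRegressor, purely so that it runs without XGBoost installed; the demo_pipeline name is hypothetical, and the same calls should apply to the pipeline above.

```python
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Stand-in pipeline (assumption): GradientBoostingRegressor replaces XGBRegressor
demo_pipeline = Pipeline([
    ('scaler', MinMaxScaler()),
    ('regressor', GradientBoostingRegressor())
])

# Keep only the parameters that belong to the 'regressor' step
regressor_params = [name for name in demo_pipeline.get_params()
                    if name.startswith('regressor__')]
print(regressor_params[:5])

# <step name>__<parameter name> reaches into the step and updates it in place
demo_pipeline.set_params(regressor__max_depth=6, scaler__feature_range=(0, 1))
print(demo_pipeline.named_steps['regressor'].max_depth)
# 6
```

RandomizedSearchCV relies on the same naming scheme: each key in the hyperparameter grid is passed to the pipeline's set_params() for every candidate it tries.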

Let's make a hyperparameter grid and initialize the search:

hyperparameter_grid = {
    'regressor__n_estimators': [100, 500, 1000, 2000],
    'regressor__max_depth': [3, 6, 9, 12],
    'regressor__learning_rate': [0.01, 0.03, 0.05, 0.1]
}

random_cv = sklearn.model_selection.RandomizedSearchCV(
    estimator=pipeline,
    param_distributions=hyperparameter_grid,
    cv=3,        # 3-fold cross-validation
    n_iter=5,    # number of randomly sampled configurations
    scoring='neg_root_mean_squared_error',
    n_jobs=-1,   # use all available CPU cores
    verbose=5,
    return_train_score=True,
    random_state=42
)

random_cv.fit(X_train, y_train)

The search will take a bit of time, and results in something along the lines of:

Fitting 3 folds for each of 5 candidates, totalling 15 fits

RandomizedSearchCV(cv=3,
                   estimator=Pipeline(steps=[('scaler', MinMaxScaler()),
                                             ('regressor', XGBRegressor())]),
                   n_iter=5, n_jobs=-1,
                   param_distributions={'regressor__learning_rate': [0.01, 0.03,
                                                                     0.05, 0.1],
                                        'regressor__max_depth': [3, 6, 9, 12],
                                        'regressor__n_estimators': [100, 500,
                                                                    1000,
                                                                    2000]},
                   random_state=42, return_train_score=True,
                   scoring='neg_root_mean_squared_error', verbose=5)
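
The "15 fits" in the log is just candidates multiplied by folds. As a rough sketch of why the randomized search is so much cheaper here, the snippet below counts the fits an exhaustive search over the same grid would need. It uses only the standard library, and the cv_folds and n_iter values are taken from the log above.

```python
from itertools import product

# Same values as the hyperparameter grid above
hyperparameter_grid = {
    'regressor__n_estimators': [100, 500, 1000, 2000],
    'regressor__max_depth': [3, 6, 9, 12],
    'regressor__learning_rate': [0.01, 0.03, 0.05, 0.1]
}

cv_folds = 3   # folds reported in the search log
n_iter = 5     # candidates sampled by the randomized search

# Every combination an exhaustive grid search would have to evaluate
all_combinations = list(product(*hyperparameter_grid.values()))

grid_fits = len(all_combinations) * cv_folds
random_fits = n_iter * cv_folds

print(f'Combinations in the grid: {len(all_combinations)}')   # 64
print(f'Fits for an exhaustive search: {grid_fits}')          # 192
print(f'Fits for the randomized search: {random_fits}')       # 15
```

With the default refit=True, either search adds one final fit of the best configuration on the full training set, which the log doesn't count.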

Once the search is done, you can get the best configured pipeline with:

best_pipe = random_cv.best_estimator_

Or inspect it:

print(best_pipe)
# Pipeline(steps=[('scaler', MinMaxScaler()),
#                ('regressor', XGBRegressor(max_depth=9, n_estimators=1000))])
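
Besides best_estimator_, a fitted search also exposes best_params_, best_score_ and cv_results_. Here's a small sketch of reading them. To keep it fast and runnable without XGBoost, it assumes a synthetic dataset, a deliberately tiny grid, and GradientBoostingRegressor as a stand-in regressor; all of the demo_ names are hypothetical.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Small synthetic dataset and stand-in regressor (assumptions) to keep this fast
X_demo, y_demo = make_regression(n_samples=200, n_features=5,
                                 noise=10.0, random_state=0)

demo_pipeline = Pipeline([
    ('scaler', MinMaxScaler()),
    ('regressor', GradientBoostingRegressor(random_state=0))
])

demo_grid = {
    'regressor__n_estimators': [10, 20, 40],
    'regressor__max_depth': [2, 3]
}

demo_search = RandomizedSearchCV(
    estimator=demo_pipeline,
    param_distributions=demo_grid,
    n_iter=4,
    cv=3,
    scoring='neg_root_mean_squared_error',
    random_state=42
)
demo_search.fit(X_demo, y_demo)

# Winning configuration and its mean cross-validated score (negative RMSE)
print(demo_search.best_params_)
print(demo_search.best_score_)

# Every sampled candidate, ordered by rank (1 is best)
results = demo_search.cv_results_
for i in results['rank_test_score'].argsort():
    print(results['rank_test_score'][i],
          round(results['mean_test_score'][i], 3),
          results['params'][i])
```

Because the scoring is a negated error, scores closer to zero are better.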

Finally, let's score it for reference:

best_pipe.score(X_test, y_test)
# 0.8309779703673673

Great, we've increased the score from 0.77 to 0.83 through a simple, short search!

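Note: To compare against the baseline, score() the best pipeline rather than the random_cv object. Calling random_cv.score() uses the metric passed as scoring (here, the negative RMSE), not the regressor's default R2.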
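
The difference between the two score() calls can be checked directly. The sketch below assumes a synthetic dataset and GradientBoostingRegressor as a stand-in for XGBRegressor (so it runs quickly and without XGBoost); the demo_ names are hypothetical.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic data and stand-in regressor (assumptions) to keep the search short
X_demo, y_demo = make_regression(n_samples=300, n_features=5,
                                 noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

demo_search = RandomizedSearchCV(
    estimator=Pipeline([('scaler', MinMaxScaler()),
                        ('regressor', GradientBoostingRegressor(random_state=0))]),
    param_distributions={'regressor__n_estimators': [10, 20, 40]},
    n_iter=2,
    cv=3,
    scoring='neg_root_mean_squared_error',
    random_state=42
)
demo_search.fit(X_tr, y_tr)

best_demo_pipe = demo_search.best_estimator_
predictions = best_demo_pipe.predict(X_te)

search_score = demo_search.score(X_te, y_te)       # uses `scoring`: negative RMSE
pipeline_score = best_demo_pipe.score(X_te, y_te)  # regressor default: R2

rmse = mean_squared_error(y_te, predictions) ** 0.5
print(search_score, -rmse)                          # the two values match
print(pipeline_score, r2_score(y_te, predictions))  # the two values match
```

Both numbers describe the same refit model; they're just different metrics, so only the R2 value is comparable to the baseline score from earlier.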

David Landup, Author

Entrepreneur, Software and Machine Learning Engineer, with a deep fascination towards the application of Computation and Deep Learning in Life Sciences (Bioinformatics, Drug Discovery, Genomics), Neuroscience (Computational Neuroscience), robotics and BCIs.

Great passion for accessible education and promotion of reason, science, humanism, and progress.

© 2013-2024 Stack Abuse. All rights reserved.