Optimizing Models: Cross-Validation and Hyperparameter Tuning Guide

Introduction

Machine learning is a type of artificial intelligence that allows models to learn from data by identifying patterns in existing datasets and using them to make predictions on unseen or unknown data.

Model Generalization is a crucial trait for ML models trained and deployed in production: training should produce a model that fits the dataset correctly, avoiding both overfitting and underfitting.

Model Generalization: Importance

Model Generalization refers to the ability of a model to accurately predict outputs for unseen or real-time data. Generalization is vital for production models because it enables them to handle dynamic, real-time data and makes them less prone to noise and errors; businesses rely on these predictions for important decision-making.

Risks of Overfitting and Underfitting

Overfitting and underfitting are two primary risks to model generalization.

Overfitting occurs when the model learns the training data too well. It memorizes too many specifics of the training dataset and fails to generalize to the testing or validation datasets. As a result, its performance on real-time data will be poor, since no general patterns have been identified. This usually happens when the model is trained on too little data or is too complex for the problem.

Underfitting occurs when the model doesn't learn enough from the data and fails to capture the underlying patterns in the training set. This happens when the model is too simple, with too few parameters, or when it isn't trained enough to capture the structure of the data.

To improve model training and selection, let's discuss some crucial data- and modeling-related aspects that can be controlled for better generalization. These techniques include cross-validation, hyperparameter tuning, and ensemble methods, which can be used to choose the best model for deployment in the production environment.

Cross-Validation: Introduction

Cross-Validation (CV) is a technique used in machine learning to assess the generalization capability and performance of a model. It involves creating multiple subsets of the dataset, called folds, and iteratively training and evaluating the model on a different combination of training and testing folds each time.

The main goal of CV is to check the model's performance on unseen data by using a portion of the dataset as the test set during the training process.
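For a quick illustration of this idea, scikit-learn's cross_val_score helper runs the train/evaluate cycle across folds in a single call. The snippet below is a minimal sketch; the choice of LogisticRegression as the estimator is an assumption for illustration only.

from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn import datasets

# Load a sample dataset
X, y = datasets.load_iris(return_X_y=True)

# Evaluate the model with 5-fold cross-validation
model = LogisticRegression(max_iter=1000)
scores = cross_val_score(model, X, y, cv=5)

print("Accuracy per fold:", scores)
print("Mean accuracy:", scores.mean())

Each entry in scores is the accuracy on one held-out fold, so the spread of the values gives a sense of how stable the model's performance is.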

Let's explore some popular cross-validation techniques.

Popular CV Techniques:

  1. K-Fold

K-Fold is a popular cross-validation technique in which the dataset is split into k folds (subsets) of equal size. In each iteration, one fold is used for testing while the remaining k-1 folds form the training dataset, so every fold serves as the test set exactly once.

from sklearn.model_selection import KFold
from sklearn import datasets

# Load the X and y data from the iris dataset
iris = datasets.load_iris()
X = iris.data
y = iris.target

# Perform K-fold split
k_folds = 5
kf = KFold(n_splits=k_folds, shuffle=True)

# Iterate over splits and perform model training
for train_idx, test_idx in kf.split(X):
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]

    # Further training etc
    # ...

In the above code, we have taken features and labels in the form of X and y arrays and split them into training and testing datasets for 5 different iterations.
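As a minimal sketch of what the training step inside the loop might look like, the code below fits and scores a classifier on each fold; the DecisionTreeClassifier is an assumed choice, not part of the original example.

from sklearn.tree import DecisionTreeClassifier

# Continue from the KFold split above
fold_scores = []
for train_idx, test_idx in kf.split(X):
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]

    # Train a fresh model on this fold's training split
    model = DecisionTreeClassifier(random_state=42)
    model.fit(X_train, y_train)

    # Evaluate on the held-out fold
    fold_scores.append(model.score(X_test, y_test))

print("Accuracy per fold:", fold_scores)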

This technique is useful when the dataset is small and computational resources are available.

  2. Holdout Method

One of the simplest cross-validation techniques, the holdout method splits the dataset into a training set and a testing set using a predefined ratio such as 70/30 or 80/20.

from sklearn.model_selection import train_test_split
from sklearn import datasets

# Loading features and labels
iris = datasets.load_iris()
X = iris.data
y = iris.target

# Perform the split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In the above code, we have split the dataset into a single training set and a single testing set.
This technique is useful when the dataset is large and fewer computational resources are available.
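Continuing from the split above, a model can then be trained on the training portion and evaluated once on the held-out test set; the DecisionTreeClassifier here is again an assumed example model.

from sklearn.tree import DecisionTreeClassifier

# Train on the training portion of the holdout split
model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)

# Evaluate once on the held-out test set
print("Holdout accuracy:", model.score(X_test, y_test))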

  3. Stratified K-Fold

Stratified K-Fold is an extended version of the K-Fold technique that considers the class imbalance present in the dataset. It ensures that every fold of the dataset contains the same proportion of each class, thereby maintaining the class distribution in both the training and testing datasets.

from sklearn.model_selection import StratifiedKFold
from sklearn import datasets

# Loading features and labels
iris = datasets.load_iris()
X = iris.data
y = iris.target

# Perform the splits
k = 5 # Number of folds
skf = StratifiedKFold(n_splits=k, shuffle=True)

# Iterate over splits and perform model training
for train_idx, test_idx in skf.split(X, y):  # y is required for stratification
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]

    # Further training etc
    # ...

The implementation is similar to K-Fold, except that the labels are passed to split() so the class proportions can be preserved. This method is well suited for classification tasks with an imbalanced class distribution.

Combine Hyperparameter Tuning with CV

Hyperparameter tuning is a process of selecting the optimal values for hyperparameters of the machine learning model. The values are determined after iterating through different combinations of hyperparameter values with a model and comparing the metrics/evaluation results.

When coupled with cross-validation techniques, this results in training more robust ML models. Some of the popular hyperparameter tuning techniques are discussed below.

Grid Search Cross-Validation

Grid Search Cross-Validation is a popular tuning technique that chooses the best set of hyperparameters for a model by iterating and evaluating through all possible combinations of given parameters.

A hyperparameter grid, in the form of a Python dictionary mapping parameter names to lists of candidate values, must be passed as input.

# Define hyperparameter grid for RandomForest Classifier
param_grid = { 'n_estimators': [50, 100, 200], 
              'max_depth': [None, 5, 10], 
              'min_samples_split': [2, 5, 10] }

In the above example, we have defined the parameter grid with n_estimators, max_depth, and min_samples_split, along with a list of supported values for each.

Grid Search CV trains the model on every possible combination of parameters from the grid and evaluates each combination using cross-validation. The best model and parameters are then returned based on the highest score obtained.

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Create the Random Forest classifier
rf_classifier = RandomForestClassifier()

# Perform grid search cross-validation
grid_search = GridSearchCV(estimator=rf_classifier,
                           param_grid=param_grid,
                           cv=5)
# Fit the model
grid_search.fit(X, y)

# Print the best hyperparameters and the corresponding mean cross-validated score
print("Best Hyperparameters: ", grid_search.best_params_)
print("Best Score: ", grid_search.best_score_)

In the above code, a random forest classifier model is initialized and passed as input along with a parameter grid to Grid Search CV. The cv parameter defines the number of cross-validation folds to be created for model training and evaluation.

Finally, a model is trained by calling the fit method and passing the features and labels. The best set of hyperparameters and corresponding scores can be accessed using the best_params_ and best_score_ properties. Since the model is fit for all different combinations of hyperparameters, this process is expensive in terms of computational power required and total execution time taken.

Randomized Search CV

Randomized Search CV is a modified version of Grid Search CV. Unlike Grid Search, which exhaustively trains the model on every combination in param_grid, Randomized Search samples a predefined number of random combinations from the hyperparameter space.

Randomized Search CV aims to find the best model parameters by training and evaluating the model for the specified number of iterations.

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Define hyperparameter grid for RandomForest Classifier
param_grid = { 'n_estimators': [50, 100, 200],
              'max_depth': [None, 5, 10],
              'min_samples_split': [2, 5, 10] }

# Create the Random Forest classifier
rf_classifier = RandomForestClassifier()

# Perform random search cross-validation
random_search = RandomizedSearchCV(estimator=rf_classifier,
                                   param_distributions=param_grid,
                                   cv=5,
                                   n_iter=10)

random_search.fit(X, y)

# Print the best hyperparameters and the corresponding mean cross-validated score
print("Best Hyperparameters:", random_search.best_params_)
print("Best Score:", random_search.best_score_)

In the above code, we specify an additional n_iter argument to define the number of iterations RandomizedSearchCV can go through to find the optimal set of hyperparameters. This method is more efficient than GridSearchCV in terms of computational power required and total time taken.
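Because Randomized Search only samples from the hyperparameter space, param_distributions can also be specified as statistical distributions rather than fixed lists. The sketch below uses scipy.stats for this; the specific ranges are assumptions for illustration.

from scipy.stats import randint
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Sample integer hyperparameters from ranges instead of fixed lists
param_distributions = {'n_estimators': randint(50, 200),
                       'min_samples_split': randint(2, 10)}

random_search = RandomizedSearchCV(estimator=RandomForestClassifier(),
                                   param_distributions=param_distributions,
                                   n_iter=10,
                                   cv=5,
                                   random_state=42)
random_search.fit(X, y)

print("Best Hyperparameters:", random_search.best_params_)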

Nested Cross-Validation

Cross-validation can be used both for hyperparameter tuning and for estimating the generalization performance of a model. However, using the same cross-validation splits for both purposes at once leads to an optimistically biased performance estimate, especially when the dataset is small: the hyperparameters can overfit those particular splits, and the same splits are then used to score the chosen model. To prevent this bias, nested cross-validation is used, which adds an outer CV loop dedicated to model evaluation and selection.

This method involves two levels of cross-validation:

  • An inner CV for parameter search and an outer CV for best model selection.

  • The outer CV loop defines the dataset splits that the inner CV loop uses to find the best set of hyperparameters by performing GridSearchCV or RandomizedSearchCV.

  • Then the best scores, parameters, and models are stored and used for training a final model on the entire dataset.

In this way, nested CV provides a reliable estimate of model performance while avoiding overfitting.

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

# Define model
rf = RandomForestClassifier()

# Define hyperparameter grid for RandomForest Classifier
param_grid = {'n_estimators': [50, 100, 200],
              'max_depth': [None, 5, 10],
              'min_samples_split': [2, 5, 10]}

# Define cross-validation loops
outer_cv = KFold(n_splits=3, shuffle=True, random_state=0)
inner_cv = KFold(n_splits=5, shuffle=True, random_state=0)

# Define inner CV for parameter search
model = GridSearchCV(
    estimator=rf, param_grid=param_grid, cv=inner_cv, n_jobs=-1
)

# Define outer CV for model selection and evaluation
scores = cross_val_score(model, X, y,
                        scoring='accuracy',
                        cv=outer_cv, n_jobs=-1)

print('Accuracy: %.3f (%.3f)' % (scores.mean(), scores.std()))

With this setup, we can estimate performance and choose the best model and parameters without the optimistic bias of tuning and evaluating on the same splits.

Choosing the Best Model

After performing model training and hyperparameter optimization, we can use the following techniques to select a CV model for production.

Retrain the Whole Dataset with the Best Parameters

After obtaining the best model and best parameters from the GridSearchCV results, we can retrain the same model-hyperparameter combination on the entire dataset, including the portions previously held out for validation and testing.

This works well when the dataset is small and the best model-parameter combination performs consistently well across all CV splits.
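As a minimal sketch, assuming the grid_search object from the Grid Search example above, the best hyperparameters can be reused to fit a fresh model on the full dataset:

from sklearn.ensemble import RandomForestClassifier

# Reuse the best hyperparameters found during the search
best_params = grid_search.best_params_

# Train a fresh model on the entire dataset
final_model = RandomForestClassifier(**best_params)
final_model.fit(X, y)

Note that GridSearchCV with its default refit=True already refits the best estimator on the data passed to fit, exposed as grid_search.best_estimator_, so explicit retraining is mainly needed when additional data becomes available.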

Bagging

Bagging is an ensemble technique in which multiple independent models are trained on different subsets of the data created through bootstrap sampling. Bootstrap sampling randomly selects observations from the original dataset with replacement, so an observation may appear multiple times in a given subset while others might not be selected at all.

Each model is trained independently and generates its own predictions. Finally, these predictions are combined using majority voting (for classification tasks) or averaging (for regression tasks).

The Random Forest algorithm works on the principle of bagging for model training and inference.
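scikit-learn also exposes bagging directly through BaggingClassifier. The sketch below is a minimal example; by default it bags decision trees, and the number of estimators chosen here is an assumption.

from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score

# Bag 50 decision trees (the default base estimator),
# each trained on a bootstrap sample of the data
bagging = BaggingClassifier(n_estimators=50, random_state=42)

scores = cross_val_score(bagging, X, y, cv=5)
print("Mean CV accuracy:", scores.mean())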

Boosting

Boosting is a technique where a sequence of models called weak learners are trained, with each subsequent model focusing on correcting the mistakes of the previous model.

The training data is weighted, and more emphasis is placed on misclassified instances. In the end, the predictions of all the weak learners are combined, typically through weighted majority voting (classification) or a weighted sum/average (regression).

Boosting has become a go-to technique for classification and regression tasks on tabular data because it efficiently corrects the previous models' mistakes over the course of training.

The popular XGBoost and CatBoost libraries use the boosting technique to train models.
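As a minimal sketch of the idea, scikit-learn's GradientBoostingClassifier implements this sequential, error-correcting training scheme; the hyperparameter values below are assumptions for illustration.

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# Each shallow tree focuses on the errors of the previous ones
boosting = GradientBoostingClassifier(n_estimators=100,
                                      learning_rate=0.1,
                                      max_depth=3,
                                      random_state=42)

scores = cross_val_score(boosting, X, y, cv=5)
print("Mean CV accuracy:", scores.mean())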

Stacking Techniques

Stacking is an advanced ensemble technique that involves training multiple base models on training data. These base models then generate individual predictions for the validation dataset.

Instead of combining these model outputs directly, stacking introduces a meta-learner that takes generated predictions as input and predicts the final output. This method learns from the strengths and weaknesses of individual base models and improves the final score.
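scikit-learn's StackingClassifier follows this pattern. The sketch below is a minimal example; the base models and the logistic regression meta-learner are assumed choices.

from sklearn.ensemble import StackingClassifier, RandomForestClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Base models generate predictions; the meta-learner combines them
stacking = StackingClassifier(
    estimators=[('rf', RandomForestClassifier(random_state=42)),
                ('svc', SVC(random_state=42))],
    final_estimator=LogisticRegression(max_iter=1000)
)

scores = cross_val_score(stacking, X, y, cv=5)
print("Mean CV accuracy:", scores.mean())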

Blending Techniques

Blending is an ensemble technique in which multiple base models are trained on the training data and their predictions are combined using a predefined blending function. This blending function can be a simple average or a weighted average of the generated predictions.

The final predictions are generated by applying the defined blending function to the combined model predictions. This is a relatively direct ensemble method and is easier to implement than stacking.
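A minimal sketch of blending, assuming two base models and a simple average of their predicted class probabilities on a held-out set:

import numpy as np
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Hold out a portion of the data for evaluating the blend
X_train, X_hold, y_train, y_hold = train_test_split(X, y, test_size=0.3, random_state=42)

# Train the base models
rf = RandomForestClassifier(random_state=42).fit(X_train, y_train)
gb = GradientBoostingClassifier(random_state=42).fit(X_train, y_train)

# Blend by averaging the predicted class probabilities
blended_proba = (rf.predict_proba(X_hold) + gb.predict_proba(X_hold)) / 2
blended_pred = rf.classes_[np.argmax(blended_proba, axis=1)]

print("Blended accuracy:", accuracy_score(y_hold, blended_pred))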

Choosing CV Models for Production

We have explored some significant cross-validation techniques to train models robustly, along with various ensemble methods that help in selecting the best model for production purposes.

However, before choosing any single technique, it is crucial to consider the use case, the size of the dataset, and the computational resources available.

Then, we will be able to choose the appropriate algorithm and CV technique for training and inference of production-ready models.
