Cross Validation and Grid Search for Model Selection in Python

Introduction

A typical machine learning process involves training different models on the dataset and selecting the one with best performance. However, evaluating the performance of algorithm is not always a straight forward task. There are several factors that can help you determine which algorithm performance best. One such factor is the performance on cross validation set and another other factor is the choice of parameters for an algorithm.

In this article we will explore these two factors in detail. We will first study what cross validation is, why it is necessary, and how to perform it via Python's Scikit-Learn library. We will then move on to the Grid Search algorithm and see how it can be used to automatically select the best parameters for an algorithm.

Cross Validation

Normally in a machine learning process, data is divided into training and test sets; the training set is then used to train the model and the test set is used to evaluate the performance of a model. However, this approach may lead to variance problems. In simpler words, a variance problem refers to the scenario where our accuracy obtained on one test is very different to accuracy obtained on another test set using the same algorithm.

The solution to this problem is to use K-Fold Cross-Validation for performance evaluation where K is any number. The process of K-Fold Cross-Validation is straightforward. You divide the data into K folds. Out of the K folds, K-1 sets are used for training while the remaining set is used for testing. The algorithm is trained and tested K times, each time a new set is used as testing set while remaining sets are used for training. Finally, the result of the K-Fold Cross-Validation is the average of the results obtained on each set.

Suppose we want to perform 5-fold cross validation. To do so, the data is divided into 5 sets, for instance we name them SET A, SET B, SET C, SET D, and SET E. The algorithm is trained and tested K times. In the first fold, SET A to SET D are used as training set and SET E is used as testing set as shown in the figure below:

Cross validation

In the second fold, SET A, SET B, SET C, and SET E are used for training and SET D is used as testing. The process continues until every set is at least used once for training and once for testing. The final result is the average of results obtained using all folds. This way we can get rid of the variance. Using standard deviation of the results obtained from each fold we can in fact find the variance in the overall result.

Cross Validation with Scikit-Learn

In this section we will use cross validation to evaluate the performance of Random Forest Algorithm for classification. The problem that we are going to solve is to predict the quality of wine based on 12 attributes. The details of the dataset are available at the following link:

https://archive.ics.uci.edu/ml/datasets/wine+quality

We are only using the data for red wine in this article.

Follow these steps to implement cross validation using Scikit-Learn:

1. Importing Required Libraries

The following code imports a few of the required libraries:

import pandas as pd  
import numpy as np  

2. Importing the Dataset

Download the dataset, which is available online at this link:

https://www.kaggle.com/piyushgoyal443/red-wine-dataset

Once we have downloaded it, we placed the file in the "Datasets" folder of our "D" drive for the sake of this article. The dataset name is "winequality-red.csv". Note that you'll need to change the file path to match the location in which you saved the file on your computer.

Execute the following command to import the dataset:

dataset = pd.read_csv(r"D:/Datasets/winequality-red.csv", sep=';')  

The dataset was semi-colon separated, therefore we have passed the ";" attribute to the "sep" parameter so pandas is able to properly parse the file.

3. Data Analysis

Execute the following script to get an overview of the data:

dataset.head()  

The output looks like this:

fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol quality
0 7.4 0.70 0.00 1.9 0.076 11.0 34.0 0.9978 3.51 0.56 9.4 5
1 7.8 0.88 0.00 2.6 0.098 25.0 67.0 0.9968 3.20 0.68 9.8 5
2 7.8 0.76 0.04 2.3 0.092 15.0 54.0 0.9970 3.26 0.65 9.8 5
3 11.2 0.28 0.56 1.9 0.075 17.0 60.0 0.9980 3.16 0.58 9.8 6
4 7.4 0.70 0.00 1.9 0.076 11.0 34.0 0.9978 3.51 0.56 9.4 5

4. Data Preprocessing

Execute the following script to divide data into label and feature sets.

X = dataset.iloc[:, 0:11].values  
y = dataset.iloc[:, 11].values  

Since we are using cross validation, we don't need to divide our data into training and test sets. We want all of the data in the training set so that we can apply cross validation on that. The simplest way to do this is to set the value for the test_size parameter to 0. This will return all the data in the training set as follows:

from sklearn.model_selection import train_test_split  
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0, random_state=0)  

5. Scaling the Data

If you look at the dataset you'll notice that it is not scaled well. For instance the "volatile acidity" and "citric acid" column have values between 0 and 1, while most of the rest of the columns have higher values. Therefore, before training the algorithm, we will need to scale our data down.

Here we will use the StandardScalar class.

from sklearn.preprocessing import StandardScaler  
feature_scaler = StandardScaler()  
train_features = feature_scaler.fit_transform(train_features)  
test_features = feature_scaler.transform(test_features)  

6. Training and Cross Validation

The first step in the training and cross validation phase is simple. You just have to import the algorithm class from the sklearn library as shown below:

from sklearn.ensemble import RandomForestClassifier  
classifier = RandomForestClassifier(n_estimators=300, random_state=0)  

Next, to implement cross validation, the cross_val_score method of the sklearn.model_selection library can be used. The cross_val_score returns the accuracy for all the folds. Values for 4 parameters are required to be passed to the cross_val_score class. The first parameter is estimator which basically specifies the algorithm that you want to use for cross validation. The second and third parameters, X and y, contain the X_train and y_train data i.e. features and labels. Finally the number of folds is passed to the cv parameter as shown in the following code:

from sklearn.model_selection import cross_val_score  
all_accuracies = cross_val_score(estimator=classifier, X=X_train, y=y_train, cv=5)  

Once you've executed this, let's simply print the accuracies returned for five folds by the cross_val_score method by calling print on all_accuracies.

print(all_accuracies)  

Output:

[ 0.72360248  0.68535826  0.70716511  0.68553459  0.68454259 ]

To find the average of all the accuracies, simple use the mean() method of the object returned by cross_val_score method as shown below:

print(all_accuracies.mean())  

The mean value is 0.6972, or 69.72%.

Finally let's find the standard deviation of the data to see degree of variance in the results obtained by our model. To do so, call the std() method on the all_accuracies object.

print(all_accuracies.std())  

The result is: 0.01572 which is 1.57%. This is extremely low, which means that our model has a very low variance, which is actually very good since that means that the prediction that we obtained on one test set is not by chance. Rather, the model will perform more or less similar on all test sets.

Grid Search for Parameter Selection

A machine learning model has two types of parameters. The first type of parameters are the parameters that are learned through a machine learning model while the second type of parameters are the hyper parameter that we pass to the machine learning model.

In the last section, while predicting the quality of wine, we used the Random Forest algorithm. The number of estimators we used for the algorithm was 300. Similarly in KNN algorithm we have to specify the value of K and for SVM algorithm we have to specify the type of Kernel. These estimators - the K value and Kernel - are all types of hyper parameters.

Normally we randomly set the value for these hyper parameters and see what parameters result in best performance. However randomly selecting the parameters for the algorithm can be exhaustive.

Also, it is not easy to compare performance of different algorithms by randomly setting the hyper parameters because one algorithm may perform better than the other with different set of parameters. And if the parameters are changed, the algorithm may perform worse than the other algorithms.

Therefore, instead of randomly selecting the values of the parameters, a better approach would be to develop an algorithm which automatically finds the best parameters for a particular model. Grid Search is one such algorithm.

Grid Search with Scikit-Learn

Let's implement the grid search algorithm with the help of an example. The script in this section should be run after the script that we created in the last section.

To implement the Grid Search algorithm we need to import GridSearchCV class from the sklearn.model_selection library.

The first step you need to perform is to create a dictionary of all the parameters and their corresponding set of values that you want to test for best performance. The name of the dictionary items corresponds to the parameter name and the value corresponds to the list of values for the parameter.

Let's create a dictionary of parameters and their corresponding values for our Random Forest algorithm. Details of all the parameters for the random forest algorithm are available in the Scikit-Learn docs.

To do this, execute the following code:

grid_param = {  
    'n_estimators': [100, 300, 500, 800, 1000],
    'criterion': ['gini', 'entropy'],
    'bootstrap': [True, False]
}

Take a careful look at the above code. Here we create grid_param dictionary with three parameters n_estimators, criterion, and bootstrap. The parameter values that we want to try out are passed in the list. For instance, in the above script we want to find which value (out of 100, 300, 500, 800, and 1000) provides the highest accuracy.

Similarly, we want to find which value results in the highest performance for the criterion parameter: "gini" or "entropy"? The Grid Search algorithm basically tries all possible combinations of parameter values and returns the combination with the highest accuracy. For instance, in the above case the algorithm will check 20 combinations (5 x 2 x 2 = 20).

The Grid Search algorithm can be very slow, owing to the potentially huge number of combinations to test. Furthermore, cross validation further increases the execution time and complexity.

Once the parameter dictionary is created, the next step is to create an instance of the GridSearchCV class. You need to pass values for the estimator parameter, which basically is the algorithm that you want to execute. The param_grid parameter takes the parameter dictionary that we just created as parameter, the scoring parameter takes the performance metrics, the cv parameter corresponds to number of folds, which is 5 in our case, and finally the n_jobs parameter refers to the number of CPU's that you want to use for execution. A value of -1 for n_jobs parameter means that use all available computing power. This can be handy if you have large number amount of data.

Take a look at the following code:

gd_sr = GridSearchCV(estimator=classifier,  
                     param_grid=grid_param,
                     scoring='accuracy',
                     cv=5,
                     n_jobs=-1)

Once the GridSearchCV class is initialized, the last step is to call the fit method of the class and pass it the training and test set, as shown in the following code:

gd_sr.fit(X_train, y_train)  

This method can take some time to execute because we have 20 combinations of parameters and a 5-fold cross validation. Therefore the algorithm will execute a total of 100 times.

Once the method completes execution, the next step is to check the parameters that return the highest accuracy. To do so, print the sr.best_params_ attribute of the GridSearchCV object, as shown below:

best_parameters = gd_sr.best_params_  
print(best_parameters)  

Output:

{'bootstrap': True, 'criterion': 'gini', 'n_estimators': 1000}

The result shows that the highest accuracy is achieved when the n_estimators are 1000, bootstrap is True and criterion is "gini".

Note: It would be a good idea to add more number of estimators and see if performance further increases since the highest allowed value of n_estimators was chosen.

The last and final step of Grid Search algorithm is to find the accuracy obtained using the best parameters. Previously we had a mean accuracy of 69.72% with 300 n_estimators.

To find the best accuracy achieved, execute the following code:

best_result = gd_sr.best_score_  
print(best_result)  

The accuracy achieved is: 0.6985 of 69.85% which is only slightly better than 69.72%. To improve this further, it would be good to test values for other parameters of Random Forest algorithm, such as max_features, max_depth, max_leaf_nodes, etc. to see if the accuracy further improves or not.

Conclusion

In this article we studied two very commonly used techniques for performance evaluation and model selection of an algorithm. K-Fold Cross-Validation can be used to evaluate performance of a model by handling the variance problem of the result set. Furthermore, to identify the best algorithm and best parameters, we can use the Grid Search algorithm.

Author image
About Usman Malik
Paris (France) Twitter
Programmer | Blogger | Data Science Enthusiast | PhD To Be | Arsenal FC for Life