Cross Validation in Machine Learning
For humans, generalization is the most natural thing in the world. We classify on the fly: we would certainly recognize a bird even if we had never seen one quite like it before. For a Machine Learning model, however, this is not at all obvious.
A model trained on a dataset can predict the labels of the examples it has already seen, but there is no guarantee that it will answer correctly on examples it has never encountered. This is why verifying the algorithm's ability to generalize is an important task that requires a lot of attention when building a model. Cross-validation is the tool used for this.
Cross-validation is commonly used in Machine Learning to compare different models and select the most appropriate one for a specific problem. It is easy to understand, easy to implement, and typically gives a less biased estimate of performance than a single train/test evaluation.
In this article, we will detail the notion of validating a predictive model, the possible methods and techniques, and how and when to use them.
Why use cross-validation?
Training a predictive model and testing it on the same data is a mistake: a model that simply repeats the labels of the samples it just saw would have a perfect score.
In this case, the model may overfit the data: instead of learning the underlying structure of the data, it "memorizes" the peculiarities of that particular dataset, and the problem surfaces when the model is deployed on data it has never seen before.
That's why it's very important to test the stability of a Machine Learning model by evaluating its performance with previously unseen data. Based on the results of this test, one can judge whether overfitting or underfitting has occurred, or whether the model is capable of generalizing.
There are validation techniques to evaluate a model's performance on different splits of data in order to mitigate issues like these, and to select the appropriate model for the predictive modeling problem at hand. This is called cross-validation.
Cross-validation has three main objectives:
1. Evaluate the ability of a predictive model to generalize and produce correct predictions with any data.
2. Tune the hyperparameters of a model in order to achieve the best possible results. This involves testing the model's performance with different configurations and choosing the best one based on the results of cross-validation (a short sketch follows this list).
3. Compare multiple models to select the most appropriate one for the modeling problem, since the choice of prediction algorithm depends on the type and characteristics of the dataset in question.
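As a minimal illustration of the second objective, here is a hedged sketch using scikit-learn's GridSearchCV; the dataset, classifier, and parameter grid are arbitrary choices for the example, not part of the discussion above.
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
# Illustrative dataset and hyperparameter grid
X, y = load_iris(return_X_y=True)
param_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}
# Every combination in the grid is scored with 5-fold cross-validation
search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)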
Although there are several cross-validation methods, they share fundamental principles and each is suited to specific situations. Examples include K-Fold, Shuffle Split, and Leave-One-Out.
Cross-validation techniques for Machine Learning
There are several ways to evaluate Machine Learning models. Here are some examples:
Training and testing sets
This is the simplest and most common technique. You may not think of it as a validation method, but you almost certainly use it all the time.
The idea is to split the dataset into two parts: a training set and a testing set. Usually, the training set takes about 70–80% of the data and the testing set the remaining 20–30%, but you can choose the split that best suits your dataset. The testing set is never used to train the model, only to evaluate its performance.
We first train our model with the first group and then use the testing group to give it a score and evaluate its performance with data it hasn't seen before.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
This method is very effective, especially with large datasets.
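As a hedged continuation of this example (the dataset and classifier below are illustrative placeholders), the two splits are then used to fit the model and score it on the held-out data:
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
# Illustrative dataset and model; any estimator with fit/score works the same way
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)         # train only on the training split
print(model.score(X_test, y_test))  # evaluate on data the model has never seen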
K-Fold
In this technique, the dataset is divided into k folds of almost equal size. The first fold is used as the test set, the model is trained on the other k−1 folds, and the error rate is then computed by evaluating the fitted model on the test fold.
In the second iteration, the second fold is used as the test set and the remaining k−1 folds are used for training, and the error rate is calculated again. The process is repeated until each of the k folds has served as the test set.
The final score to assign to the model can be found by averaging the performance obtained on the k folds.
import numpy as np
from sklearn.model_selection import KFold
X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])
y = np.array([1, 2, 3, 4])
kf = KFold(n_splits=2)
for train_index, test_index in kf.split(X):
print("TRAIN:", train_index, "TEST:", test_index)
X_train, X_test = X[train_index], X[test_index]
y_train,y_test = y[train_index], y[test_index]
TRAIN: [2 3] TEST: [0 1]
TRAIN: [0 1] TEST: [2 3]
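In practice, scikit-learn's cross_val_score can run this loop for us and return one score per fold; the classifier and dataset in this sketch are illustrative choices:
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)
# One accuracy score per fold; their mean is the final cross-validation score
scores = cross_val_score(model, X, y, cv=5)
print(scores, scores.mean())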
A weakness of plain K-Fold on classification problems is that a fold can end up with a class distribution quite different from that of the full dataset. To avoid this problem, we can use stratification. Stratified K-Fold is a variation of K-Fold that returns stratified folds: the percentage of samples of each class is preserved in every fold, so the training and test sets have similar class distributions, which is crucial for classification models.
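A minimal sketch with scikit-learn's StratifiedKFold, using toy arrays in the same spirit as above (the labels are chosen so that each class appears in every fold):
import numpy as np
from sklearn.model_selection import StratifiedKFold
X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])
y = np.array([0, 1, 0, 1])
skf = StratifiedKFold(n_splits=2)
for train_index, test_index in skf.split(X, y):
    # Each fold keeps the same proportion of class 0 and class 1 as y
    print("TRAIN:", train_index, "TEST:", test_index)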
Leave One Out
Leave One Out is the extreme case of K-Fold in which every fold contains a single observation. First, one observation is set aside as test data and the model is trained on all the others. Then the second observation is selected as test data, and the model is trained on the remaining data. This process is repeated for all observations: each time, the model is trained and we check whether it correctly predicts the observation that was left out.
import numpy as np
from sklearn.model_selection import LeaveOneOut
X = np.array([[1, 2], [3, 4]])
y = np.array([1, 2])
loo = LeaveOneOut()
for train_index, test_index in loo.split(X):
print("TRAIN:", train_index, "TEST:", test_index)
X_train, X_test = X[train_index], X[test_index]
y_train, y_test = y[train_index], y[test_index]
print(X_train, X_test, y_train, y_test)
TRAIN: [1] TEST: [0]
[[3 4]] [[1 2]] [2] [1]
TRAIN: [0] TEST: [1]
[[1 2]] [[3 4]] [1] [2]
The problem with Leave One Out is that it greatly increases computation time, since we train as many models as there are points in the dataset. It is therefore not recommended when working with a large volume of data, where the cost quickly becomes prohibitive.
A generalization of this method is Leave P Out. It creates every possible training/test split in which p samples form the test set; p is the number of elements to leave aside for testing at each iteration.
At the code level, it is enough to use the LeavePOut class of scikit-learn's model_selection module instead of LeaveOneOut and follow the same process.
Unlike K-Fold, Leave P Out enumerates every possible test set of size p, so the test sets overlap as soon as p > 1, and the number of splits grows combinatorially with the dataset size.
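A minimal sketch with scikit-learn's LeavePOut (p = 2 here is an arbitrary choice):
import numpy as np
from sklearn.model_selection import LeavePOut
X = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])
y = np.array([1, 2, 3, 4])
lpo = LeavePOut(p=2)
for train_index, test_index in lpo.split(X):
    # Every possible pair of samples serves once as the test set
    print("TRAIN:", train_index, "TEST:", test_index)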
Monte Carlo (Shuffle-Split)
Monte Carlo (Shuffle-Split) cross-validation is a highly flexible strategy: the dataset is randomly split into a training set and a test set, and this random split is repeated a chosen number of times. It also lets us decide what percentage of the dataset to use for training and what percentage for testing. If the two percentages do not add up to 100%, the remaining part of the dataset is simply left unused in that split.
import numpy as np
from sklearn.model_selection import ShuffleSplit
X = np.array([[1, 2], [3, 4], [5, 6], [7, 8], [3, 4], [5, 6]])
y = np.array([1, 2, 1, 2, 1, 2])
rs = ShuffleSplit(n_splits=5, train_size=0.5, test_size=0.25, random_state=0)
for train_index, test_index in rs.split(X):
    print("TRAIN:", train_index, "TEST:", test_index)
TRAIN: [1 3 0] TEST: [5 2]
TRAIN: [4 0 2] TEST: [1 3]
TRAIN: [1 2 4] TEST: [3 5]
TRAIN: [3 4 1] TEST: [5 2]
TRAIN: [3 5 1] TEST: [2 4]
StratifiedShuffleSplit is a combination of StratifiedKFold and ShuffleSplit: it returns stratified random splits, built by preserving the percentage of samples of each class.
This method guarantees that each split keeps the class proportions of the full dataset, but because the splits are drawn at random, it does not guarantee that they are all different from one iteration to the next, and repeated splits are quite likely with small datasets. It is therefore advisable to use this technique with large datasets.
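A minimal sketch with scikit-learn's StratifiedShuffleSplit (the toy data and split sizes are illustrative):
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit
X = np.array([[1, 2], [3, 4], [5, 6], [7, 8], [1, 2], [3, 4], [5, 6], [7, 8]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])
sss = StratifiedShuffleSplit(n_splits=3, test_size=0.5, random_state=0)
for train_index, test_index in sss.split(X, y):
    # Each random split keeps the 50/50 class balance of y
    print("TRAIN:", train_index, "TEST:", test_index)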
Time Series:
Traditional cross-validation techniques do not work on sequential data such as time series because random data points cannot be chosen and assigned to the test or training set, as it does not make sense to use future values to predict past values.
There are mainly two ways to proceed.
The rolling method:
In this method, we start with a small subset of the data as the training set, predict the data points that come next, and check the accuracy of those predictions. The points that were just used for testing are then added to the training set, and the following data points are predicted, and so on.
import numpy as np
from sklearn.model_selection import TimeSeriesSplit
X = np.array([[1, 2], [3, 4], [1, 2], [3, 4], [1, 2], [3, 4]])
y = np.array([1, 2, 3, 4, 5, 6])
tscv = TimeSeriesSplit(n_splits=5)
for train_index, test_index in tscv.split(X):
print("TRAIN:", train_index, "TEST:", test_index)
X_train, X_test = X[train_index], X[test_index]
y_train, y_test = y[train_index], y[test_index]
TRAIN: [0] TEST: [1]
TRAIN: [0 1] TEST: [2]
TRAIN: [0 1 2] TEST: [3]
TRAIN: [0 1 2 3] TEST: [4]
TRAIN: [0 1 2 3 4] TEST: [5]
The blocking method:
The idea is to cut the data into blocks and, at each iteration, split one block into two folds so that the validation fold always comes after the training fold in time. In the first iteration we might train the candidate model on data from January and February and validate on data from March; in the next iteration we train on April and May and validate on June, and so on until the end of the series. In this way, the blocks used across iterations do not overlap and the time dependency is respected.
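Scikit-Learn does not ship a blocked time-series splitter, so the following is a hand-rolled sketch under that assumption; the helper name blocked_time_series_split, the block layout, and the 80/20 split inside each block are all illustrative choices:
import numpy as np
def blocked_time_series_split(n_samples, n_splits, train_ratio=0.8):
    # Hypothetical helper: cut the series into contiguous, non-overlapping
    # blocks and, inside each block, train on the first part and validate
    # on the part that comes right after it.
    indices = np.arange(n_samples)
    block_size = n_samples // n_splits
    for i in range(n_splits):
        block = indices[i * block_size:(i + 1) * block_size]
        cut = int(len(block) * train_ratio)
        yield block[:cut], block[cut:]
for train_index, test_index in blocked_time_series_split(12, n_splits=3):
    print("TRAIN:", train_index, "TEST:", test_index)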
Cross-validation for Deep Learning:
For Deep Learning models, we generally avoid techniques that require training the model many times, as that would be very costly in terms of computation time.
The most commonly used method, in this case, is to divide the dataset into three parts (a short sketch follows the list):
- Training set: a portion of the dataset for training the model.
- Validation set: a portion of the dataset for validation during training.
- Test set: a portion of the dataset for the final validation of the model.
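A hedged sketch of this three-way split, done with two successive calls to train_test_split; the dataset and the roughly 70/15/15 proportions are illustrative:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
X, y = load_iris(return_X_y=True)
# First set aside the final test set, then split the rest into train/validation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.15 / 0.85)
print(len(X_train), len(X_val), len(X_test))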
When working with simple or fast-to-train neural networks, we can use other methods such as K-Fold. In this case, there is no difference between Machine Learning and Deep Learning.
In summary, cross-validation is a technique for evaluating a model and testing its performance. It is a powerful tool that is very easy to use. The most important things are to choose sensible split proportions for your dataset and to pick the right technique for each use case.