Machine Learning Series (24): Cross-Validation and the Bias-Variance Trade-off
Earlier, we learned that a test dataset can be used to detect and correct over-fitting, but doing this repeatedly is itself problematic. Every time we check the model against the test data and find that it performs poorly, we adjust the model's parameters to improve it. Since the parameters end up being tuned to this particular test set, the final model is likely to over-fit the test set.

The test set is precious: it stands in for brand-new data that the model has never encountered. A truly good machine learning model must predict well on brand-new data, so the test set generally does not participate in building or training the model and is used only for the final evaluation after training is complete.

Therefore, the simple split into a training set and a test set used in our earlier studies is not appropriate. The solution is to divide the data into a training set, a validation set, and a test set. The validation set now does what the test set used to do, namely tuning the hyperparameters, while the test set is used only to evaluate the final performance of the model. Of course, because of a few extreme samples, the model may still over-fit the validation set, and that is why we use cross-validation.

In 3-fold cross-validation, for example, the training data is divided into three parts; each time, two parts are used for training and the remaining part is used for validation and parameter tuning. This yields three models, and the average of their validation results is taken as the score for that parameter setting, which is much more reliable than using a single validation set. Next we will use the kNN algorithm with cross-validation on the handwritten digits dataset and see how well it works. First, let's look at the situation without cross-validation:
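The original code is not shown here, so the following is a minimal sketch of such a search, assuming scikit-learn's bundled handwritten digits dataset (load_digits), a KNeighborsClassifier, and an illustrative grid over k and the Minkowski exponent p:

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Load the handwritten digits dataset and hold out a test split.
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Grid search WITHOUT cross-validation: every (k, p) candidate is scored
# directly on the held-out split, so that split quietly becomes part of
# the tuning loop.
best_score, best_k, best_p = 0.0, -1, -1
for k in range(2, 11):
    for p in range(1, 6):
        knn = KNeighborsClassifier(n_neighbors=k, p=p)
        knn.fit(X_train, y_train)
        score = knn.score(X_test, y_test)
        if score > best_score:
            best_score, best_k, best_p = score, k, p

print("best k =", best_k, ", best p =", best_p, ", best score =", best_score)
```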

Search results:

Without cross-validation, the best parameters found are k=2 and p=2, with an accuracy of 99.2%.

Using cross-validation:
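scikit-learn's cross_val_score helper does exactly this; a sketch, reusing X_train and y_train from the split above and mirroring the three-part split with cv=3:

```python
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Score a single parameter setting with 3-fold cross-validation on the
# training data only; one validation score is returned per fold.
knn = KNeighborsClassifier(n_neighbors=3)
scores = cross_val_score(knn, X_train, y_train, cv=3)
print(scores, scores.mean())
```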

Searching for the best parameters with cross-validation:
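A sketch of the same grid search as before, except that each (k, p) pair is now ranked by the mean of its cross-validation scores rather than by the test set:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Grid search WITH cross-validation: the test set is never touched.
best_score, best_k, best_p = 0.0, -1, -1
for k in range(2, 11):
    for p in range(1, 6):
        knn = KNeighborsClassifier(n_neighbors=k, p=p)
        scores = cross_val_score(knn, X_train, y_train, cv=3)
        if np.mean(scores) > best_score:
            best_score, best_k, best_p = np.mean(scores), k, p

print("best k =", best_k, ", best p =", best_p, ", best CV score =", best_score)
```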

The best parameters found by cross-validation:

Accuracy of the best cross-validated model on the test set:
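For this last step the model is re-fit with the parameters chosen by cross-validation and scored once on the untouched test set; a sketch, using the best_k and best_p found above:

```python
from sklearn.neighbors import KNeighborsClassifier

# Refit with the cross-validated parameters and evaluate ONCE on the test set.
best_knn = KNeighborsClassifier(n_neighbors=best_k, p=best_p)
best_knn.fit(X_train, y_train)
print("test accuracy:", best_knn.score(X_test, y_test))
```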

In general, the best cross-validation score of the optimal model is slightly lower than the best score found without cross-validation, because without cross-validation the parameter search over-fits the data it is scored against.

In fact, cross-validation does not have to split the data into three parts; more folds can be used. Three was just an example, and the general technique is called k-fold cross-validation. k-fold cross-validation trains k models, so the search is roughly k times slower overall, but the parameters it selects are more reliable. The extreme case is leave-one-out cross-validation (LOO-CV), where k equals the number of samples in the training set. This removes the randomness of the split entirely and comes closest to the model's true performance, but the amount of computation is enormous.
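In scikit-learn this whole search is usually written with GridSearchCV, whose cv argument sets the number of folds; passing LeaveOneOut() instead of an integer gives LOO-CV. A sketch with an illustrative grid:

```python
from sklearn.model_selection import GridSearchCV  # LeaveOneOut is also available for LOO-CV
from sklearn.neighbors import KNeighborsClassifier

# k-fold cross-validated grid search; cv=5 means 5 folds (roughly 5x the training cost).
param_grid = {"n_neighbors": range(2, 11), "p": range(1, 6)}
grid = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)
print("test accuracy:", grid.score(X_test, y_test))
```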

A model's error can be divided into irreducible error, bias, and variance. The irreducible error exists objectively, for example the noise in the data itself, and no algorithm can do anything about it. Bias and variance, however, can be reduced. Bias usually arises when the assumptions of the model are wrong for the problem, for example applying linear regression to nonlinear data, and high bias is usually associated with under-fitting. High variance means that a small perturbation of the data changes the model a great deal, usually because the model is too complex, for example high-degree polynomial regression, and high variance is usually associated with over-fitting.
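For squared-error loss this split of the error is the standard bias-variance decomposition; writing the true function as $f$, the learned model as $\hat{f}$, and the irreducible noise variance as $\sigma^2$:

$$
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
= \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
+ \underbrace{\mathbb{E}\Big[\big(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\big)^2\Big]}_{\text{variance}}
+ \sigma^2
$$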

Some algorithms are high-variance algorithms, such as kNN; non-parametric methods are usually high-variance because they make no assumptions about the data. Some algorithms are high-bias algorithms, such as linear regression; parametric methods are usually high-bias because they make strong assumptions about the data. Bias and variance are usually in tension: reducing bias tends to increase variance, and reducing variance tends to increase bias, but in most cases the parameters of an algorithm can be adjusted to strike a reasonable balance between the two. The main challenge in machine learning comes from variance, that is, from solving the over-fitting problem. The usual ways to reduce high variance include lowering model complexity, reducing data dimensionality and noise, adding more training samples, using a validation set, and model regularization.

Among them, model regularization is a very common and important way to reduce over-fitting in machine learning; it will be introduced in the next chapter.