- There are several ways to estimate the test error properly. Some methods make a mathematical adjustment to the training error rate to estimate the test error rate; others, such as cross-validation, hold out a subset of the training observations from the fitting process and evaluate the model on them.
- These notes cover leave-one-out and k-fold cross-validation, the bias-variance trade-off of cross-validation, and the bootstrap.
Cross-Validation
The Validation Set Approach
Randomly split the data in half ten times; for each split, fit polynomials of different degrees on the training half and estimate their test error on the validation half.
The result is shown in a two-panel comparison: the approach on the left, the resulting test error estimates on the right.
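A minimal sketch of the validation set approach on synthetic data (the quadratic data-generating process and variable names below are assumptions for illustration, not from the notes). It repeats the random split ten times and reports how the validation MSE estimates change from split to split:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 392
x = rng.uniform(-2, 2, n)
y = 1 + 2 * x - 0.5 * x**2 + rng.normal(0, 1, n)   # toy data with a quadratic trend

degrees = range(1, 6)
for split in range(10):                             # ten different random splits
    idx = rng.permutation(n)
    train, valid = idx[: n // 2], idx[n // 2:]
    mses = []
    for d in degrees:
        coefs = np.polyfit(x[train], y[train], d)       # fit on the training half
        pred = np.polyval(coefs, x[valid])              # predict on the held-out half
        mses.append(np.mean((y[valid] - pred) ** 2))    # validation MSE
    best = degrees[int(np.argmin(mses))]
    print(f"split {split}: best degree = {best}, MSEs = {np.round(mses, 3)}")
```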
- Drawbacks:
- As shown in the results panel, the estimate of the test error is highly variable: it depends strongly on which observations end up in the training set and which in the validation set
- In the validation set approach, only a subset of the observations is used to fit the model, so the validation error tends to overestimate the test error rate that would be obtained by training on the whole dataset.
LOOCV
- The approach: the model is fit $n$ times, each time leaving out a single observation $i$ and computing the error on that held-out point.
- The result is shown in a two-panel comparison: the approach on the left, the resulting test error estimates on the right.
The estimate of the test error rate is: $CV_{(n)} = \frac{1}{n}\sum^n_{i=1}MSE_i$, where $MSE_i = (y_i-\hat{y}_i)^2$ is the error on the single held-out observation $i$.
A shortcut for LOOCV when the model is fit by least squares (linear or polynomial regression):
- $CV_{(n)} = \frac{1}{n}\sum^n_{i=1}(\frac{y_i-\hat{y}_i}{1-h_i})^2$
- where $h_i$ is the leverage, which will be described later in the chapter on linear regression
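A minimal sketch on synthetic data (the data-generating process and variable names are assumptions for illustration), assuming an ordinary least-squares linear fit: it computes LOOCV both by brute force ($n$ refits) and via the $(1-h_i)$ shortcut, and the two values agree.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
x = rng.uniform(-2, 2, n)
y = 1 + 2 * x + rng.normal(0, 1, n)
X = np.column_stack([np.ones(n), x])          # design matrix with intercept

# Brute-force LOOCV: refit n times, each time leaving one observation out.
errs = []
for i in range(n):
    mask = np.arange(n) != i
    beta = np.linalg.lstsq(X[mask], y[mask], rcond=None)[0]
    errs.append((y[i] - X[i] @ beta) ** 2)
cv_brute = np.mean(errs)

# Shortcut: a single fit on all the data, then rescale residuals by (1 - h_i).
beta = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ beta
H = X @ np.linalg.inv(X.T @ X) @ X.T          # hat matrix; the h_i are its diagonal
h = np.diag(H)
cv_shortcut = np.mean((resid / (1 - h)) ** 2)

print(cv_brute, cv_shortcut)                  # the two values agree
```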
k-Fold Cross-Validation
- The approach: randomly divide the observations into $k$ folds; each fold in turn serves as the validation set while the model is fit on the remaining $k-1$ folds, giving $CV_{(k)} = \frac{1}{k}\sum^k_{i=1}MSE_i$.
- The result is shown in a two-panel comparison: the approach on the left, the resulting test error estimates on the right.
- The 10-fold CV was run nine separate times, each with a different random split of the data into ten parts; the resulting estimates show little variability
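A minimal sketch of 10-fold CV on synthetic data (again a toy data-generating process assumed for illustration), repeating the procedure with nine different random splits to show that the resulting error curves vary only a little:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
x = rng.uniform(-2, 2, n)
y = 1 + 2 * x - 0.5 * x**2 + rng.normal(0, 1, n)

def kfold_mse(x, y, degree, k=10, seed=0):
    """Estimate test MSE for a polynomial of the given degree via k-fold CV."""
    idx = np.random.default_rng(seed).permutation(len(x))
    folds = np.array_split(idx, k)
    mses = []
    for fold in folds:
        train = np.setdiff1d(idx, fold)                # all observations outside the fold
        coefs = np.polyfit(x[train], y[train], degree)
        pred = np.polyval(coefs, x[fold])
        mses.append(np.mean((y[fold] - pred) ** 2))
    return np.mean(mses)

for seed in range(9):                                  # nine different random splits
    curve = [kfold_mse(x, y, d, k=10, seed=seed) for d in range(1, 6)]
    print([f"{m:.3f}" for m in curve])
```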
Comparison between these approaches
- Sometimes we care only about the location of the minimum of the estimated test error curve (i.e., which level of model flexibility minimizes the test error), not about the value of the test error itself; for that purpose both LOOCV and k-fold CV usually identify the right level of flexibility.
Bias-Variance Trade-Off for k-Fold Cross-Validation
- k-fold CV often gives more accurate estimates of the test error rate than LOOCV does
Bias:
- The validation set approach uses only part of the dataset to fit the model, so its estimate of the test error rate can be strongly biased.
- LOOCV gives approximately unbiased estimates of the test error, since almost all the observations are used to fit the model.
- k-fold CV provides an intermediate level of bias in estimating the test error rate.
Variance:
- LOOCV averages the outputs of $n$ fitted models that are highly correlated with each other, resulting in higher variance
- k-fold CV averages the outputs of $k$ fitted models that are less correlated with each other
- The insight: the mean of many highly correlated quantities has higher variance than the mean of quantities that are less correlated
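- To make this concrete (a standard variance identity, not from the notes): if $Z_1,\dots,Z_m$ each have variance $\sigma^2$ and pairwise correlation $\rho$, then $Var(\frac{1}{m}\sum^m_{i=1}Z_i) = \frac{\sigma^2}{m} + \frac{m-1}{m}\rho\sigma^2$, which increases with $\rho$; averaging the $n$ highly correlated LOOCV errors therefore reduces variance less than averaging the $k$ less correlated k-fold errors.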
The Bootstrap
An example:
Suppose we invest a fraction $\alpha$ of our money in asset $X$ and the rest in asset $Y$, and want to minimize the variance of the return: $Var(\alpha X+(1-\alpha)Y)$
The value $\alpha = \frac{\sigma^2_Y-\sigma_{XY}}{\sigma^2_X+\sigma^2_Y-2\sigma_{XY}}$ minimizes this variance
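- To see where this comes from: $Var(\alpha X+(1-\alpha)Y) = \alpha^2\sigma^2_X + (1-\alpha)^2\sigma^2_Y + 2\alpha(1-\alpha)\sigma_{XY}$; differentiating with respect to $\alpha$ and setting the derivative to zero gives the expression above.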
- Simulate the data 1,000 times and compute an estimate $\hat{\alpha}_r$ from each simulated data set; the average of these estimates is very close to the true $\alpha$
- The standard deviation (standard error) of these estimates is $SE(\hat{\alpha}) = \sqrt{\frac{1}{1000-1}\sum^{1000}_{r=1}(\hat{\alpha}_r-\bar{\alpha})^2}$
- The result of the simulation is shown in orange in the corresponding figure
Bootstrap
- A method that repeatedly draws samples, with replacement, from the original data set
- The method: draw $B$ bootstrap data sets $Z^{*1},\dots,Z^{*B}$ by sampling $n$ observations with replacement from the original data set, and compute an estimate $\hat{\alpha}^{*r}$ from each
- And the bootstrap estimate of the standard error is: $SE_B(\hat{\alpha}) = \sqrt{\frac{1}{B-1}\sum^B_{r=1}\left(\hat{\alpha}^{*r}-\frac{1}{B}\sum^B_{r'=1}\hat{\alpha}^{*r'}\right)^2}$
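A minimal sketch of the bootstrap for this example on synthetic two-asset returns (the sample size, the covariance values, and the helper name `alpha_hat` are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100
cov = np.array([[1.0, 0.5], [0.5, 1.25]])            # assumed true covariance of (X, Y)
data = rng.multivariate_normal([0, 0], cov, size=n)  # the observed sample

def alpha_hat(sample):
    """Plug-in estimate of alpha = (var_Y - cov_XY) / (var_X + var_Y - 2 cov_XY)."""
    c = np.cov(sample, rowvar=False)
    return (c[1, 1] - c[0, 1]) / (c[0, 0] + c[1, 1] - 2 * c[0, 1])

B = 1000
estimates = []
for _ in range(B):
    idx = rng.integers(0, n, size=n)                 # sample n rows with replacement
    estimates.append(alpha_hat(data[idx]))
estimates = np.array(estimates)

# Bootstrap standard error, matching the SE_B formula above.
se_boot = np.sqrt(np.sum((estimates - estimates.mean()) ** 2) / (B - 1))
print(f"alpha_hat = {alpha_hat(data):.3f}, bootstrap SE = {se_boot:.3f}")
```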