[ ISLR ] Chapter 05 - Use R to Resample

  • As described in Chapter 5, various methods can be used to estimate the test error rate
  • These notes cover the validation set approach, LOOCV, k-fold CV, and the bootstrap

The Validation Set Approach

Divide the Subset

library(ISLR)
set.seed(1)
train=sample(392,196)

Fit the Model

lm.fit=lm(mpg~horsepower,data=Auto,subset=train)

Get MSE

attach(Auto)
mean((mpg-predict(lm.fit,Auto))[-train]^2) # predict on all of Auto, then keep only the held-out validation observations via [-train]
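The line above computes the validation-set estimate of the test error: the model is fit on the training indices only, and the squared prediction errors are averaged over the held-out observations. In ISLR's notation:

```latex
% Validation-set MSE: \hat{f} is fit on the training half,
% errors are averaged over the held-out (validation) half
\mathrm{MSE}_{\text{val}} \;=\; \frac{1}{|\text{val}|} \sum_{i \in \text{val}} \bigl( y_i - \hat{f}(x_i) \bigr)^2
```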

Fit the Polynomial Model

lm.fit2 = lm(mpg~poly(horsepower,2),data = Auto,subset=train)
mean((mpg-predict(lm.fit2,Auto))[-train]^2)

lm.fit3 = lm(mpg~poly(horsepower,3),data = Auto,subset=train)
mean((mpg-predict(lm.fit3,Auto))[-train]^2)

Alter the Training Set

set.seed(2)
train = sample(392,196)

lm.fit=lm(mpg~horsepower,data=Auto,subset=train)
mean((mpg-predict(lm.fit,Auto))[-train]^2)
lm.fit2 = lm(mpg~poly(horsepower,2),data = Auto,subset=train)
mean((mpg-predict(lm.fit2,Auto))[-train]^2)
lm.fit3 = lm(mpg~poly(horsepower,3),data = Auto,subset=train)
mean((mpg-predict(lm.fit3,Auto))[-train]^2)

LOOCV

Perform Regression

  • Perform linear regression with glm() (rather than lm()) so the fit can be passed to cv.glm() later
glm.fit = glm(mpg~horsepower,data=Auto)

Get LOOCV Statistic

library(boot)
cv.err = cv.glm(Auto,glm.fit)
cv.err$delta
  • The test error estimated above is about 24.23
  • delta holds two values: the first is the standard CV estimate, the second is a bias-corrected version
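Both delta values are built on the basic LOOCV average over the n single-observation test errors. For a least-squares fit like this one, ISLR (Eqs. 5.1 and 5.2) also gives a shortcut that needs only one fit on the full data, where $h_i$ is the leverage of observation $i$:

```latex
% LOOCV estimate: average of the n leave-one-out test errors
\mathrm{CV}_{(n)} \;=\; \frac{1}{n} \sum_{i=1}^{n} \mathrm{MSE}_i
% Least-squares shortcut: a single fit suffices (h_i = leverage of obs. i)
\mathrm{CV}_{(n)} \;=\; \frac{1}{n} \sum_{i=1}^{n} \left( \frac{y_i - \hat{y}_i}{1 - h_i} \right)^2
```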
cv.error = rep(0,5)
for (i in 1:5){
  glm.fit = glm(mpg~poly(horsepower,i),data=Auto)
  cv.error[i] = cv.glm(Auto,glm.fit)$delta[1]
}
cv.error
  • The result is: 24.23151 19.24821 19.33498 19.42443 19.03321

k-Fold CV

set.seed(17)
cv.error.10 = rep(0,10)
for (i in 1:10){
  glm.fit = glm(mpg~poly(horsepower,i),data=Auto)
  cv.error.10[i] = cv.glm(Auto,glm.fit,K=10)$delta[1]
}
cv.error.10
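With K = 10, cv.glm() splits the data into ten folds, holds out each fold in turn, and averages the resulting test errors (ISLR Eq. 5.3):

```latex
% k-fold CV: average the MSE over the k held-out folds
\mathrm{CV}_{(k)} \;=\; \frac{1}{k} \sum_{i=1}^{k} \mathrm{MSE}_i
```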

The Bootstrap

Procedure

  1. Write a function that computes the statistic of interest

  2. Call boot() from the boot library to run the bootstrap

Write the function:

alpha.fn = function(data,index){
  X = data$X[index]
  Y = data$Y[index]
  return((var(Y)-cov(X,Y))/(var(X)+var(Y)-2*cov(X,Y)))
}

alpha.fn(Portfolio,1:100) # estimate alpha from all 100 observations
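alpha.fn() implements the minimum-variance allocation between two assets X and Y from ISLR Section 5.2: the fraction $\alpha$ invested in X that minimizes $\mathrm{Var}(\alpha X + (1-\alpha)Y)$ is (ISLR Eq. 5.6)

```latex
% Minimum-variance two-asset allocation (variances and covariance
% are estimated from the sampled observations in alpha.fn)
\alpha \;=\; \frac{\sigma_Y^2 - \sigma_{XY}}{\sigma_X^2 + \sigma_Y^2 - 2\,\sigma_{XY}}
```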

Use Sample() to Resample

set.seed(1)
alpha.fn(Portfolio,sample(100,100,replace=T)) # draw 100 indices from 1:100 with replacement, i.e. one bootstrap sample

Use boot() to Simplify the Process

boot(Portfolio,alpha.fn,R=1000)
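The standard error that boot() reports is the sample standard deviation of the R = 1000 bootstrap replicates of $\hat{\alpha}$ (ISLR Eq. 5.8):

```latex
% Bootstrap SE: spread of the B replicate estimates around their mean
\mathrm{SE}_B(\hat{\alpha}) \;=\;
\sqrt{ \frac{1}{B-1} \sum_{r=1}^{B}
  \Bigl( \hat{\alpha}^{*r} - \frac{1}{B} \sum_{r'=1}^{B} \hat{\alpha}^{*r'} \Bigr)^2 }
```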

Estimate the Accuracy of a Linear Regression

Write Function

boot.fn = function(data,index){
  return(coef(lm(mpg~horsepower,data=data,subset=index)))
}

All Observations:

boot.fn(Auto,1:392)

Random Sampling:

boot.fn(Auto,sample(392,392,replace=T))
boot.fn(Auto,sample(392,392,replace=T))
Bootstrap:

boot(Auto,boot.fn,1000)

Compare the Result

  • The usual formulas for $SE(\hat{\beta}_0)$ and $SE(\hat{\beta}_1)$ rely on model assumptions (a correct linear fit and a constant error variance $\sigma^2$), while the bootstrap does not.
  • The bootstrap standard errors are therefore likely the more trustworthy estimates here.
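For reference, these are the standard-formula variances being compared against (ISLR Eq. 3.8); they assume the linear model is correct and that the errors share a common variance $\sigma^2$:

```latex
% Standard (formula-based) squared standard errors for simple linear regression
\mathrm{SE}(\hat{\beta}_1)^2 = \frac{\sigma^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2},
\qquad
\mathrm{SE}(\hat{\beta}_0)^2 = \sigma^2 \left[ \frac{1}{n} + \frac{\bar{x}^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \right]
```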