- The last chapter introduced linear regression as a basic tool for regression analysis. This chapter introduces LDA and the logit model to solve classification problems.
- These notes cover: Why Not the Linear Probability Model?, Logistic Regression (the Logit Model), Linear Discriminant Analysis, and a Comparison of Classification Methods.
Why Not Linear Regression?
- It is hard to extend the model to a qualitative response with more than two levels.
- The predicted values can fall outside the [0, 1] range of a probability.
Logistic Regression
The logistic model:
Definition:
$p(X) = \frac{e^{\beta_0+\beta_1X}}{1+e^{\beta_0+\beta_1X}}$, which can be extended to multiple predictors
Manipulating the above model, we get: $\log\left(\frac{p(X)}{1-p(X)}\right) = \beta_0 + \beta_1X$
- The quantity $\frac{p(X)}{1-p(X)}$ is called the odds
- The left side of the equation is called log-odds or logit
- Interpretation: The rate of change in p(X) per unit change in X depends on the current value of X
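As a quick check on these formulas, here is a minimal NumPy sketch; the coefficients beta0 and beta1 are made-up values, not estimates from any dataset. It shows that p(X) stays within (0, 1) while the log-odds are linear in X.

```python
# Minimal sketch of the logistic model for one predictor.
# beta0 and beta1 are hypothetical values chosen only for illustration.
import numpy as np

beta0, beta1 = -6.0, 0.005          # made-up coefficients
X = np.linspace(0, 2000, 5)

p = np.exp(beta0 + beta1 * X) / (1 + np.exp(beta0 + beta1 * X))   # p(X)
log_odds = np.log(p / (1 - p))                                    # the logit

print(np.round(p, 4))                              # values stay within (0, 1)
print(np.allclose(log_odds, beta0 + beta1 * X))    # True: log-odds are linear in X
```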
Estimating coefficients
- Use maximum likelihood.
The intuition: find the $\beta_0$ and $\beta_1$ that make the observations in the sample most likely under the model.
The likelihood function: $\ell(\beta_0,\beta_1) = \prod_{i:y_i =1}p(x_i)\prod_{i':y_{i'}=0}(1-p(x_{i'}))$
- p-values and hypothesis tests can be carried out as in linear regression, but $R^2$ is no longer meaningful
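A minimal sketch of maximum-likelihood fitting on synthetic data, using statsmodels (assumed available) so that coefficient estimates and p-values are reported; the "true" coefficients below are made up.

```python
# Sketch: fitting logistic regression by maximum likelihood on synthetic data.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.normal(size=500)
p_true = 1 / (1 + np.exp(-(-1.0 + 2.0 * x)))   # made-up true beta0 = -1, beta1 = 2
y = rng.binomial(1, p_true)

X = sm.add_constant(x)                 # intercept column for beta0
res = sm.Logit(y, X).fit(disp=0)       # maximizes l(beta0, beta1) numerically
print(res.params)                      # estimated beta0, beta1
print(res.pvalues)                     # p-values for the coefficient tests
```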
Multiple Logistic Regression
- A valid way to mitigate omitted-variable bias or confounding in statistical learning
For > 2 Response Classes
- Model both $Pr(Y=Class_1|X)$ and $Pr(Y = Class_2 | X)$
- The remaining probability is $Pr(Y=Class_3|X) = 1-Pr(Y=Class_1|X)-Pr(Y = Class_2 | X)$
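For example (with made-up probabilities): if $Pr(Y=Class_1|X=x) = 0.25$ and $Pr(Y=Class_2|X=x) = 0.6$, then $Pr(Y=Class_3|X=x) = 1 - 0.25 - 0.6 = 0.15$.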
Linear Discriminant Analysis
- Bayes’ theorem: $Pr(Y=k|X=x) = \frac{\pi_kf_k(x)}{\sum_{l=1}^{K}\pi_lf_l(x)}$
- We refer to $p_k(x)$, the left-hand side of the equation, as the posterior.
- We refer to $\pi_k$ as the prior.
- We refer to $f_k(x)$ as the density function: the density of X given that the observation belongs to class k.
Some insights:
- In the theorem, the prior is easy to estimate from the sample: simply count the observations in class k and divide by the total number of observations. Mathematically, $\hat{\pi}_k = \frac{n_k}{n}$ (see the sketch after this list).
- The density, however, is much harder to estimate. In LDA, we estimate it by making assumptions about its form.
- The objective of LDA is to separate the classes as far as possible: the idea is to find the direction (combination of predictors) that maximizes the distance between the class means while minimizing the variance within each class.
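A minimal sketch of the prior estimate $\hat{\pi}_k = n_k/n$, using made-up class labels:

```python
# Sketch: estimating the priors pi_k as the class proportions in the sample.
import numpy as np

y = np.array(["no", "no", "no", "yes", "no", "yes"])    # made-up class labels
classes, counts = np.unique(y, return_counts=True)
priors = counts / len(y)                                # hat(pi)_k = n_k / n
print(dict(zip(classes, priors)))                       # {'no': 0.667, 'yes': 0.333}
```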
LDA For p = 1 (one predictor)
Estimate the density
- $f_k(x) = \frac{1}{\sqrt{2\pi}\sigma_k}e^{-\frac{(x-\mu_k)^2}{2\sigma_k^2}}$
- Then, taking the log of $Pr(Y=k|X=x) = \frac{\pi_kf_k(x)}{\sum_{l=1}^{K}\pi_lf_l(x)}$ and assuming a common variance $\sigma_1^2 = \dots = \sigma_K^2 = \sigma^2$, we can show that this is equivalent to assigning the observation to the class for which $\delta_k(x) = x\cdot\frac{\mu_k}{\sigma^2} - \frac{\mu_k^2}{2\sigma^2}+\log(\pi_k)$ is largest
Estimate $\mu_k$ and $\sigma_k$
- For each class: $\hat{\mu}_k = \frac{1}{n_k}\sum_{i:y_i=k}x_i$, where $n_k$ is the number of observations in class k
- Then we assume that the variance of the predictor is the same for every class: $\hat{\sigma}^2 = \frac{1}{n-K}\sum^K_{k=1}\sum_{i:y_i=k}(x_i-\hat{\mu}_k)^2$
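Putting the p = 1 pieces together, here is a minimal sketch on synthetic two-class data: it computes $\hat{\mu}_k$, the pooled $\hat{\sigma}^2$, and the priors, then classifies a new point by the largest $\delta_k(x)$.

```python
# Sketch: one-predictor LDA "by hand" on synthetic two-class data.
import numpy as np

rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-1.0, 1.0, 50),    # class 0 sample (made up)
                    rng.normal(+1.5, 1.0, 50)])   # class 1 sample (made up)
y = np.array([0] * 50 + [1] * 50)
n, K = len(x), 2

mu = np.array([x[y == k].mean() for k in range(K)])                        # hat(mu)_k
sigma2 = sum(((x[y == k] - mu[k]) ** 2).sum() for k in range(K)) / (n - K)  # pooled variance
pi = np.array([(y == k).mean() for k in range(K)])                          # hat(pi)_k

def delta(x_new):
    # delta_k(x) = x * mu_k / sigma^2 - mu_k^2 / (2 sigma^2) + log(pi_k)
    return x_new * mu / sigma2 - mu ** 2 / (2 * sigma2) + np.log(pi)

print(np.argmax(delta(0.8)))      # class assigned to a new observation x = 0.8
```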
LDA For p > 1
Assumption:
Within each class, the observations are drawn from the same multivariate normal distribution.
The covariance matrix of that multivariate normal density is the same for all classes.
The density:
- $f(x) = \frac{1}{(2\pi)^{p/2}|\Sigma|^{1/2}}e^{-\frac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu)}$
- Thus the discriminant is $\delta_k(x) = x^T\Sigma^{-1}\mu_k-\frac{1}{2}\mu_k^T\Sigma^{-1}\mu_k+\log\pi_k$
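A minimal sketch of this discriminant with made-up estimates of $\Sigma$, the class means, and the priors (p = 2, K = 2):

```python
# Sketch: delta_k(x) = x^T Sigma^{-1} mu_k - (1/2) mu_k^T Sigma^{-1} mu_k + log(pi_k)
import numpy as np

Sigma = np.array([[1.0, 0.3],
                  [0.3, 2.0]])                        # shared covariance matrix (made up)
Sigma_inv = np.linalg.inv(Sigma)
mus = [np.array([0.0, 0.0]), np.array([2.0, 1.0])]    # class means (made up)
pis = [0.7, 0.3]                                      # priors (made up)

def delta(x, mu_k, pi_k):
    return x @ Sigma_inv @ mu_k - 0.5 * mu_k @ Sigma_inv @ mu_k + np.log(pi_k)

x_new = np.array([1.5, 0.5])
scores = [delta(x_new, mu_k, pi_k) for mu_k, pi_k in zip(mus, pis)]
print(int(np.argmax(scores)))     # assign x_new to the class with the largest delta_k
```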
Worth noticing:
- Training error will usually be lower than test error. Watch out for overfitting when the ratio of the number of predictors p to the number of observations n is large.
- The classification threshold should be adjusted to meet the goal at hand instead of simply minimizing the overall error rate
Adjusting the threshold
Confusion Matrix (of the Default example)
predicted/true | no | yes | total |
---|---|---|---|
no | 9644 | 252 | 9896 |
yes | 23 | 81 | 104 |
total | 9667 | 333 | 10000 |
Overall error rate $= \frac{23+252}{10000} = 2.75\%$
However: among those who actually default, $\frac{252}{333} = 75.7\%$ are missed (classified as non-defaulters), which is unacceptable
- The sensitivity in this example: the fraction of true defaulters that are identified
- The specificity in this example: the fraction of non-defaulters that are correctly identified
Together, sensitivity and specificity describe how often the classifier makes the right decision within each of the two classes.
The low sensitivity comes from the Bayes-classifier threshold: we classify an observation to the default class if $Pr(default = Yes|X=x) > 0.5$; however, we can lower this threshold to 0.2 if we want to increase the sensitivity at the cost of the specificity (see the sketch below).
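A minimal sketch of the threshold trade-off on synthetic data; the probabilities below stand in for $Pr(default = Yes|X=x)$ from any fitted classifier, and none of the numbers refer to the real Default data.

```python
# Sketch: lowering the threshold from 0.5 to 0.2 trades specificity for sensitivity.
import numpy as np

rng = np.random.default_rng(2)
y_true = rng.binomial(1, 0.03, 10_000)                          # rare positive class (synthetic)
proba = np.clip(0.03 + 0.5 * y_true + rng.normal(0, 0.15, 10_000), 0, 1)

def rates(threshold):
    y_pred = (proba > threshold).astype(int)
    sensitivity = (y_pred[y_true == 1] == 1).mean()   # true positives that are caught
    specificity = (y_pred[y_true == 0] == 0).mean()   # true negatives kept as negatives
    return round(sensitivity, 3), round(specificity, 3)

print(rates(0.5))    # high specificity, modest sensitivity
print(rates(0.2))    # sensitivity rises, specificity falls
```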
ROC Curve
The ROC curve is a graphic for simultaneously displaying the two types of errors for all possible thresholds.
It only applies to two-class classifiers; the dashed diagonal line represents a random classifier, and an ideal classifier sits at the point (0, 1). The tables below summarize the underlying quantities, and a sketch of tracing the curve follows them.
true/predicted | - or null | + or non-null | total |
---|---|---|---|
- or null | TN | FP | N |
+ or non-null | FN | TP | P |
total | N* | P* |
NAME | DEFINITION | SYNONYMS |
---|---|---|
False Pos. rate | FP/N | Type I error, 1 - specificity |
True Pos. rate | TP/P | 1 - Type II error, power, sensitivity, recall |
Pos. Pred. value | TP/P* | precision, 1 - false discovery proportion |
Neg. Pred. value | TN/N* | |
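A minimal sketch of tracing an ROC curve with scikit-learn and matplotlib (both assumed available); the labels and scores here are synthetic.

```python
# Sketch: ROC curve -- true positive rate vs. false positive rate over all thresholds.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve

rng = np.random.default_rng(3)
y_true = rng.binomial(1, 0.3, 1000)                         # synthetic labels
scores = rng.normal(loc=y_true.astype(float), scale=1.0)    # higher scores for positives

fpr, tpr, thresholds = roc_curve(y_true, scores)            # FP/N and TP/P at each threshold

plt.plot(fpr, tpr, label="classifier")
plt.plot([0, 1], [0, 1], linestyle="--", label="random classifier")
plt.xlabel("False positive rate (1 - specificity)")
plt.ylabel("True positive rate (sensitivity)")
plt.legend()
plt.show()
```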
Quadratic Discriminant Analysis
QDA assumes that the observations from each class are drawn from a Gaussian distribution, but unlike LDA, it allows each class to have its own covariance matrix; classification again plugs the parameter estimates into Bayes' theorem.
Thus QDA involves plugging in estimates for $\Sigma_k$, $\mu_k$ and $\pi_k$
- QDA estimates $K \cdot \frac{p(p+1)}{2}$ covariance parameters, which makes it much more flexible than LDA
- LDA may be better if there are few observations, but if the training set is large, the variance of the classifier is not a major concern, so QDA can be used
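A minimal sketch comparing LDA and QDA with scikit-learn on synthetic data whose two classes have different covariance matrices, the setting where QDA's extra flexibility should help.

```python
# Sketch: LDA vs. QDA when the classes have different covariance matrices.
import numpy as np
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                           QuadraticDiscriminantAnalysis)

rng = np.random.default_rng(4)
X0 = rng.multivariate_normal([0, 0], [[1.0, 0.0], [0.0, 1.0]], 500)   # class 0 (made up)
X1 = rng.multivariate_normal([1, 1], [[2.0, 0.8], [0.8, 0.5]], 500)   # class 1 (made up)
X = np.vstack([X0, X1])
y = np.array([0] * 500 + [1] * 500)

lda = LinearDiscriminantAnalysis().fit(X, y)      # one shared covariance matrix
qda = QuadraticDiscriminantAnalysis().fit(X, y)   # one covariance matrix per class
print(lda.score(X, y), qda.score(X, y))           # training accuracy of each model
```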
Comparison of Classification Methods
As for logistic regression and LDA: both yield decision boundaries that are linear in X; the only difference is that logistic regression estimates the parameters by maximum likelihood, while LDA assumes a normal distribution and estimates the class means and the common variance.
If the decision boundary is highly non-linear, KNN can outperform these linear methods.
- QDA serves as a compromise between the non-parametric KNN method and the linear LDA and logistic regression approaches
- The flexibility level of KNN (the choice of the number of neighbours) should be chosen carefully