Linear Regression
Simple Linear Regression
Given data points $(x_1, y_1), \ldots, (x_n, y_n)$, we want to find a line $Y = \hat{\beta}_0 + \hat{\beta}_1 X$ that best fits the data.
We define the following quantities:
- Residual: $e_i = y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i)$
- Residual sum of squares (RSS): $RSS = \sum_{i=1}^n e_i^2.$
- Mean squared error (MSE): $MSE = \frac{1}{n} \sum_{i=1}^n e_i^2 = \frac{1}{n}RSS$
We need to find $\hat{\beta}_0, \hat{\beta}_1$ that minimize the MSE (equivalently, the RSS).
Least Squares Estimators
The above problem has an analytic solution. The least squares estimates for $\beta_0$ and $\beta_1$ are:
\[\hat{\beta}_1 = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2}, \quad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x},\]where $\bar{x}$ and $\bar{y}$ are the means of $x_i$ and $y_i$.
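As a quick illustration, here is a minimal NumPy sketch of these closed-form estimates on a small hypothetical data set (the arrays `x` and `y` are made up purely for demonstration):

```python
import numpy as np

# Hypothetical toy data: x is the predictor, y the response.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

x_bar, y_bar = x.mean(), y.mean()

# Closed-form least squares estimates from the formulas above.
beta1_hat = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
beta0_hat = y_bar - beta1_hat * x_bar

# Residuals, RSS, and MSE as defined earlier.
residuals = y - (beta0_hat + beta1_hat * x)
rss = np.sum(residuals ** 2)
mse = rss / len(x)
```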
Model Interpretation
Assume the true relationship is:
\[Y = \beta_0 + \beta_1 X + \epsilon,\]where $\epsilon$ is random noise with mean zero, independent of $X$. The least squares estimators $\hat{\beta}_0$ and $\hat{\beta}_1$ are unbiased estimators of $\beta_0$ and $\beta_1$; that is, $\mathbb{E}[\hat{\beta}_j] = \beta_j$.
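A small simulation (with assumed true values $\beta_0 = 1$, $\beta_1 = 2$ and Gaussian noise, chosen only for illustration) makes the unbiasedness concrete: averaging the estimates over many simulated data sets recovers the true coefficients.

```python
import numpy as np

rng = np.random.default_rng(0)
beta0_true, beta1_true, sigma = 1.0, 2.0, 1.0    # assumed true parameters
n, n_reps = 50, 5000

x = rng.uniform(0, 10, size=n)                   # fixed design across replications
estimates = np.empty((n_reps, 2))
for r in range(n_reps):
    y = beta0_true + beta1_true * x + rng.normal(0.0, sigma, size=n)
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    b0 = y.mean() - b1 * x.mean()
    estimates[r] = (b0, b1)

print(estimates.mean(axis=0))   # close to (1.0, 2.0), illustrating E[beta_hat_j] = beta_j
```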
Standard Errors
The standard errors quantify the variability of the estimators:
\[SE(\hat{\beta}_1)^2 = \frac{\sigma^2}{\sum_{i=1}^n (x_i - \bar{x})^2}, \quad SE(\hat{\beta}_0)^2 = \sigma^2 \left( \frac{1}{n} + \frac{\bar{x}^2}{\sum_{i=1}^n (x_i - \bar{x})^2} \right),\]where $\sigma^2$ is the variance of the error $\epsilon$. In practice, $\sigma$ is unknown and estimated by the residual standard error (RSE): $RSE = \sqrt{\frac{RSS}{n-2}}$.
Confidence Intervals and Hypothesis Testing
- The 95% confidence intervals for $\beta_0$ and $\beta_1$ are approximately: \[\hat{\beta}_j \pm 2 \cdot SE(\hat{\beta}_j), \quad j = 0, 1.\]
- To test the null hypothesis $H_0: \beta_1 = 0$ (no relationship between $X$ and $Y$), use the t-statistic: \[t = \frac{\hat{\beta}_1 - 0}{SE(\hat{\beta}_1)},\]which follows a $t$-distribution with $n-2$ degrees of freedom under $H_0$.
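The sketch below (a hypothetical helper, not a library routine) pulls the standard-error, confidence-interval, and t-statistic formulas together for simple linear regression; it uses the rough $\pm 2 \cdot SE$ rule rather than an exact $t$ quantile.

```python
import numpy as np

def simple_ols_inference(x, y):
    """Estimates, standard errors, ~95% confidence intervals, and the
    t-statistic for H0: beta1 = 0 in simple linear regression."""
    n = len(x)
    x_bar, y_bar = x.mean(), y.mean()
    sxx = np.sum((x - x_bar) ** 2)

    beta1 = np.sum((x - x_bar) * (y - y_bar)) / sxx
    beta0 = y_bar - beta1 * x_bar

    rss = np.sum((y - beta0 - beta1 * x) ** 2)
    rse = np.sqrt(rss / (n - 2))                               # estimate of sigma

    se_beta1 = np.sqrt(rse ** 2 / sxx)
    se_beta0 = np.sqrt(rse ** 2 * (1.0 / n + x_bar ** 2 / sxx))

    ci_beta0 = (beta0 - 2 * se_beta0, beta0 + 2 * se_beta0)    # ~95% CI
    ci_beta1 = (beta1 - 2 * se_beta1, beta1 + 2 * se_beta1)    # ~95% CI
    t_stat = beta1 / se_beta1                                  # test of beta1 = 0
    return beta0, beta1, se_beta0, se_beta1, ci_beta0, ci_beta1, t_stat
```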
Model Assessment
Residual Standard Error (RSE)
The RSE estimates the standard deviation of the error $\epsilon$ (more precisely, $RSE^2 = \frac{RSS}{n-2}$ is an unbiased estimator of $\sigma^2$), representing the average deviation of observed values from the regression line. It is calculated as:
\[RSE = \sqrt{\frac{RSS}{n-2}}.\]While RSE measures the model’s lack of fit, its interpretation depends on the scale of $Y$.
$R^2$ Statistic
The $R^2$ statistic measures the proportion of variance in $Y$ explained by $X$:
\[R^2 = 1 - \frac{RSS}{TSS},\]where $TSS = \sum_{i=1}^n (y_i - \bar{y})^2$ is the total sum of squares. It satisfies:
- $0 \leq R^2 \leq 1$, with higher values indicating better fit.
- $R^2 \approx 1$ suggests that most variability in $Y$ is explained by $X$.
$R^2$ also estimates $1 - \frac{\text{Var}(\epsilon)}{\text{Var}(Y)}$, further reinforcing its interpretation as the proportion of variance explained.
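Both assessment measures are easy to compute from a fit; the helper below is a minimal sketch, assuming `y` holds the observed responses and `y_hat` the fitted values from a simple linear regression.

```python
import numpy as np

def assess_fit(y, y_hat):
    """Return (RSE, R^2); RSE uses n - 2 degrees of freedom as above."""
    n = len(y)
    rss = np.sum((y - y_hat) ** 2)
    tss = np.sum((y - y.mean()) ** 2)
    rse = np.sqrt(rss / (n - 2))
    r2 = 1.0 - rss / tss
    return rse, r2
```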
Multiple Linear Regression
Suppose that we have predictors $X_1, \ldots, X_p$. Then the multiple linear regression model is:
\[Y = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p + \epsilon.\]Remarks:
- In a simple linear regression of $Y$ on $X_1$ alone, we might find $\hat{\beta}_1 \neq 0$ (indicating strong marginal dependence). In the multiple regression, however, $\hat{\beta}_1$ can be close to zero because of correlations between $X_1$ and the other predictors.
- The RSE for multiple linear regression is computed as: \[RSE = \sqrt{\frac{RSS}{n - p - 1}}.\]
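A minimal sketch of fitting the multiple regression model with NumPy (the function name `fit_multiple_ols` is just for illustration):

```python
import numpy as np

def fit_multiple_ols(X, y):
    """Least squares fit of y = beta0 + beta1*X1 + ... + betap*Xp + eps.
    X has shape (n, p); returns the coefficients (intercept first) and the RSE."""
    n, p = X.shape
    X_design = np.column_stack([np.ones(n), X])        # prepend intercept column
    beta_hat, _, _, _ = np.linalg.lstsq(X_design, y, rcond=None)
    rss = np.sum((y - X_design @ beta_hat) ** 2)
    rse = np.sqrt(rss / (n - p - 1))                   # n - p - 1 degrees of freedom
    return beta_hat, rse
```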
Collinearity of Predictors and Ridge Regression
When predictors are correlated, e.g., $X_1$ and $X_2$, the design matrix $X$ becomes nearly rank-deficient (so $X^\top X$ is nearly singular). In this case:
- The least squares estimators $\hat{\beta}_1$ and $\hat{\beta}_2$ become highly sensitive to the observed data.
- The estimators will have large standard errors, leading to a lack of robustness in the model.
Solution: Ridge Regression. By adding $\alpha \|\beta\|_2^2$ (a scaled squared $L^2$ norm of the coefficients) to the loss function, the model penalizes large coefficients, reducing variance and increasing stability.
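A minimal sketch of ridge regression via its closed-form solution (centering the data so the intercept is not penalized; `alpha` is the penalty weight, and the function name is hypothetical):

```python
import numpy as np

def ridge_fit(X, y, alpha=1.0):
    """Minimize RSS + alpha * ||beta||_2^2 over the slopes, leaving the
    intercept unpenalized (handled by centering)."""
    n, p = X.shape
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    beta = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(p), Xc.T @ yc)
    intercept = y_mean - x_mean @ beta
    return intercept, beta
```

Larger values of `alpha` shrink the coefficients more, trading a little bias for a large reduction in variance when predictors are collinear.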
Effect of Doubling the Data
If the data set is accidentally duplicated (every observation appears twice), the least squares coefficient estimates remain unchanged. However:
- The error terms are no longer independent: each duplicated pair shares the same $\epsilon_i$.
- With $2n$ samples, the reported standard errors of the coefficients shrink by a factor of $\sqrt{2}$ relative to their true values, leading to falsely narrow confidence intervals.
In general, when the error terms are correlated, the computed standard errors underestimate the true ones, creating a false sense of confidence in the model.
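The effect is easy to see numerically: applying the usual standard-error formula to a simulated data set and to its duplicated copy (the helper `slope_se` below is hypothetical) gives standard errors that differ by roughly a factor of $\sqrt{2}$.

```python
import numpy as np

def slope_se(x, y):
    """Standard error of the slope under the usual independent-error formula."""
    n = len(x)
    x_bar = x.mean()
    sxx = np.sum((x - x_bar) ** 2)
    b1 = np.sum((x - x_bar) * (y - y.mean())) / sxx
    b0 = y.mean() - b1 * x_bar
    rse = np.sqrt(np.sum((y - b0 - b1 * x) ** 2) / (n - 2))
    return np.sqrt(rse ** 2 / sxx)

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 100)
y = 1.0 + 2.0 * x + rng.normal(0.0, 1.0, 100)

se_original = slope_se(x, y)
se_doubled = slope_se(np.concatenate([x, x]), np.concatenate([y, y]))
print(se_original / se_doubled)   # roughly sqrt(2): duplication understates the SE
```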