Principal Component Analysis (PCA)

PCA from a Probabilistic Perspective

Let $\mathbf{X} = (X_1, \ldots, X_k)$ be a random vector, assumed without loss of generality to have zero mean, and let $\Sigma = \mathbb{E}(\mathbf{X}\mathbf{X}^T)$ be its covariance matrix, where $\Sigma_{ij} = \text{Cov}(X_i, X_j)$.

For any unit vector $\mathbf{v} = (c_1, \ldots, c_k)$, the variance of the projection $\mathbf{v}^T \mathbf{X}$ can be expressed as:

\[\text{Var}(\mathbf{v}^T \mathbf{X}) = \mathbf{v}^T \Sigma \mathbf{v}.\]
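
To make this identity concrete, here is a minimal NumPy sketch that compares the empirical variance of a projection with the closed form $\mathbf{v}^T \Sigma \mathbf{v}$; the covariance matrix `Sigma`, the direction `v`, and the sample size are arbitrary choices for illustration, not anything prescribed by the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative 3-dimensional example: a zero-mean Gaussian with a
# hand-picked covariance matrix Sigma (values chosen arbitrarily).
Sigma = np.array([[4.0, 1.5, 0.5],
                  [1.5, 2.0, 0.3],
                  [0.5, 0.3, 1.0]])
samples = rng.multivariate_normal(mean=np.zeros(3), cov=Sigma, size=200_000)

# An arbitrary direction, normalized to a unit vector.
v = np.array([1.0, 2.0, -1.0])
v /= np.linalg.norm(v)

# Empirical variance of the projection v^T X vs. the closed form v^T Sigma v.
projections = samples @ v
print(projections.var())   # approximately equal to ...
print(v @ Sigma @ v)       # ... the exact value v^T Sigma v
```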

One way to interpret this is to view the vector space spanned by the random variables $X_1, \ldots, X_k$ as an inner product space, in which covariance plays the role of a symmetric, positive semi-definite bilinear form. In the basis $X_1, \ldots, X_k$, this bilinear form is represented by the matrix $\Sigma$.

We seek a unit vector $\mathbf{v}$ that maximizes $\text{Var}(\mathbf{v}^T \mathbf{X})$. This means finding the direction in which the projection retains the most variance, which can be interpreted as preserving the most information in the data.

Let $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_k \geq 0$ be the eigenvalues of $\Sigma$, and let $\mathbf{v}_1, \ldots, \mathbf{v}_k$ be the corresponding orthonormal eigenvectors. A standard result in linear algebra states that, over all unit vectors $\mathbf{v}$, the maximum of $\text{Var}(\mathbf{v}^T \mathbf{X})$ is attained at the eigenvector $\mathbf{v}_1$ corresponding to the largest eigenvalue, and the maximum value is $\lambda_1$.

Thus, $\mathbf{v}_1$ is called the first principal component, representing the direction along which the data has the greatest variance.
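
The claim that no unit vector beats $\mathbf{v}_1$ can be checked numerically. The sketch below reuses the same illustrative $\Sigma$ as before, eigendecomposes it, and compares $\lambda_1$ against $\mathbf{v}^T \Sigma \mathbf{v}$ for a batch of random unit vectors.

```python
import numpy as np

rng = np.random.default_rng(1)

# Same illustrative covariance matrix as in the previous sketch.
Sigma = np.array([[4.0, 1.5, 0.5],
                  [1.5, 2.0, 0.3],
                  [0.5, 0.3, 1.0]])

# eigh handles symmetric matrices and returns eigenvalues in ascending
# order, so the last column is the eigenvector of the largest eigenvalue.
eigvals, eigvecs = np.linalg.eigh(Sigma)
lambda_1 = eigvals[-1]
v1 = eigvecs[:, -1]          # first principal component direction

# v^T Sigma v for 1000 random unit vectors; none should exceed lambda_1.
dirs = rng.normal(size=(1000, 3))
dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
variances = (dirs @ Sigma * dirs).sum(axis=1)
print(lambda_1, variances.max())   # the maximum stays below lambda_1
```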


PCA from a Linear Algebra Perspective

Consider the $n\times k$ centered data matrix:

\[X = \begin{bmatrix} x_1^{(1)} & \ldots & x_k^{(1)} \\ \vdots & \ddots & \vdots \\ x_1^{(n)} & \ldots & x_k^{(n)} \end{bmatrix}\]

where each row represents an observation and each column represents a feature. We also require that the data be centered, i.e., $\sum_{l=1}^{n} x_i^{(l)} = 0$ for every feature $i$. We aim to find a unit vector $\mathbf{v} = (c_1, \ldots, c_k)$ such that projecting the data onto $\mathbf{v}$ loses as little information as possible. This is equivalent to maximizing the squared length of the projected data, which leads to the optimization problem:

\[\underset{\|\mathbf{v}\|=1}{\text{argmax}} \quad \|X\mathbf{v}\|^2.\]

Expanding the norm, we obtain:

\[\|X\mathbf{v}\|^2 = \mathbf{v}^T (X^T X) \mathbf{v}.\]

The matrix $X^T X$ is symmetric and positive semi-definite. Let $\lambda_1 \geq \cdots \geq \lambda_k \geq 0$ be its eigenvalues, with corresponding orthonormal eigenvectors $\mathbf{v}_1, \ldots, \mathbf{v}_k$. The same standard result from linear algebra states that, among unit vectors, the eigenvector $\mathbf{v}_1$ corresponding to the largest eigenvalue maximizes $\|X\mathbf{v}\|^2$, with maximum value $\lambda_1$.
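
A short sketch of this construction, with a randomly generated data matrix standing in for real observations (its size and distribution are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical n x k data matrix: n observations, k features.
n, k = 500, 4
X = rng.normal(size=(n, k)) @ rng.normal(size=(k, k))
X -= X.mean(axis=0)                        # center every feature column

# Eigendecomposition of the symmetric PSD matrix X^T X (ascending order).
eigvals, eigvecs = np.linalg.eigh(X.T @ X)
v1 = eigvecs[:, -1]                        # maximizer of ||X v||^2

# The maximum of ||X v||^2 over unit vectors equals the largest eigenvalue.
print(np.linalg.norm(X @ v1) ** 2)
print(eigvals[-1])
```

Equivalently, $\mathbf{v}_1$ is the first right singular vector of $X$, so `np.linalg.svd(X)` would return the same direction (up to sign).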

Notably, the $(i,j)$-th entry of $X^T X$ is:

\[(X^T X)_{ij} = \sum_{l=1}^{n} x_i^{(l)} x_j^{(l)}.\]

Since the columns of $X$ are centered, this is, up to a scaling factor, the sample estimator of $\text{Cov}(X_i, X_j)$. Thus, the probabilistic and linear algebraic perspectives of PCA are fundamentally the same: both rely on finding the eigenvectors of the covariance matrix (or its empirical counterpart $X^T X$).
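
The scaling can be made explicit: dividing $X^T X$ by $n-1$ recovers the usual unbiased sample covariance, which is what `np.cov` computes. A quick check, using the same kind of synthetic data as in the previous sketch:

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic centered data matrix, as in the previous sketch.
n, k = 500, 4
X = rng.normal(size=(n, k)) @ rng.normal(size=(k, k))
X -= X.mean(axis=0)

# For centered data, X^T X / (n - 1) is the unbiased sample covariance;
# np.cov with rowvar=False treats columns as features and uses the same scaling.
print(np.allclose(X.T @ X / (n - 1), np.cov(X, rowvar=False)))   # True
```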

For projection onto an $l$-dimensional subspace, we pick the first $l$ principal components $\mathbf{v}_1, \ldots, \mathbf{v}_l$, and the projected $n \times l$ data matrix is \(X\begin{bmatrix}\mathbf{v}_1 & \cdots & \mathbf{v}_l\end{bmatrix}\).
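
As a final sketch (again with synthetic data whose size is arbitrary), the $l$ retained components are simply the first $l$ columns of the eigenvector matrix once it is sorted by descending eigenvalue, and the projection is a single matrix product:

```python
import numpy as np

rng = np.random.default_rng(4)

# Synthetic centered data matrix with k = 6 features; keep l = 2 components.
n, k, l = 500, 6, 2
X = rng.normal(size=(n, k)) @ rng.normal(size=(k, k))
X -= X.mean(axis=0)

# Orthonormal eigenvectors of X^T X, reordered so eigenvalues descend.
eigvals, eigvecs = np.linalg.eigh(X.T @ X)
V = eigvecs[:, ::-1]

# Project onto the first l principal components: an n x l matrix.
X_proj = X @ V[:, :l]
print(X_proj.shape)   # (500, 2)
```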