Principal components analysis
Why change dimensions?
Working with a large number of variables can pose several practical and theoretical problems:
Complicated visualization: it is impossible to visually represent data beyond 3 dimensions.
Difficult class separation: in classification problems, the separation between groups may be hidden in a combination of variables rather than in the variables taken individually.
High computational cost: complex models can become difficult to fit when the number of variables is large.
Strong correlations: redundant variables make models unstable or difficult to interpret.
The natural question to ask is therefore: can we reduce the dimension of the dataset without losing too much information?
Reducing the dimension does not simply mean removing variables. Doing so could discard information that is useful to the model. A better approach is to construct new variables, obtained as linear combinations of the initial variables, which summarize the essential information in the dataset. One possible method for doing this is Principal Component Analysis (PCA).
Principal Component Analysis
PCA is an unsupervised method (there is no response variable to explain) that reduces the dimension of a dataset while retaining as much information as possible. It is used when there are \(n\) observations of \(p\) continuous numerical variables with \(p\) too “large” to allow effective modeling or visualization. The method was introduced by Hotelling in 1933 in Analysis of a Complex of Statistical Variables into Principal Components.
Mathematical formulation
Let \(X = \left( X_{1}, \dots, X_p \right)^{\top}\) be a random vector composed of \(p\) variables, centered and having variance-covariance matrix \(\Sigma\). Let \(\alpha_{1} = \left( \alpha_{11}, \dots, \alpha_{1p} \right)^{\top}\) be a vector of coefficients. We are looking for a linear combination
\[Y_{1} = \alpha_{1}^{\top} X = \sum_{k = 1}^{p} \alpha_{1k}X_k,\]
such that the variance of \(Y_{1}\) is maximized. The idea is simple: we want to combine the \(p\) variables into a single one while “capturing” as much of the variability as possible.
First, we must add a constraint on \(\alpha_{1}\): otherwise it would suffice to let the coefficients \(\alpha_{1k}\) grow without bound, giving \(\mathrm{Var}(Y_{1}) = +\infty\), which is trivially maximal. We therefore constrain \(\alpha_{1}\) to have norm equal to \(1\).
This amounts to calculating: \[\max_{\alpha_1^\top \alpha_1 = 1} \mathrm{Var}(Y_1) = \max_{\alpha_1^\top \alpha_1 = 1} \alpha_1^\top \Sigma \alpha_{1}.\]
This problem is solved using Lagrange multipliers: differentiating the Lagrangian \(\alpha_1^\top \Sigma \alpha_1 - \lambda \left( \alpha_1^\top \alpha_1 - 1 \right)\) with respect to \(\alpha_1\) and setting the gradient to zero leads to the equation \[\Sigma \alpha_1 = \lambda_{1} \alpha_{1},\] so \(\alpha_1\) must be a unit-norm eigenvector of \(\Sigma\). Since \(\mathrm{Var}(Y_1) = \alpha_1^\top \Sigma \alpha_1 = \lambda_1\), the variance is maximized by taking \(\lambda_{1}\) to be the largest eigenvalue of \(\Sigma\) and \(\alpha_{1}\) the associated eigenvector.
This defines the first principal component. The following components are constructed by imposing that each new direction is orthogonal to the previous ones (equivalently, that each \(Y_k\) is uncorrelated with the previous components), which amounts to finding the subsequent eigenvectors: \[\Sigma \alpha_k = \lambda_k \alpha_k, \quad \text{with}\quad \lambda_{1} \geq \lambda_2 \geq \dots \geq \lambda_p.\] The principal components are therefore given by \[Y_k = \alpha_k^\top X, \quad\text{with } \alpha_k \text{ being the eigenvector associated with } \lambda_k.\]
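As a quick numerical check, here is a minimal sketch (assuming NumPy; the \(2 \times 2\) matrix `Sigma` is an arbitrary example) verifying that the unit-norm eigenvector associated with the largest eigenvalue of \(\Sigma\) does give the maximal-variance linear combination.

```python
import numpy as np

Sigma = np.array([[4.0, 2.0],
                  [2.0, 3.0]])                       # example covariance matrix

eigenvalues, eigenvectors = np.linalg.eigh(Sigma)    # eigh handles symmetric matrices
order = np.argsort(eigenvalues)[::-1]                # sort eigenvalues in descending order
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

alpha_1 = eigenvectors[:, 0]                         # first principal direction, norm 1
print(alpha_1 @ Sigma @ alpha_1)                     # Var(Y_1) = lambda_1 (about 5.56)

# Any other unit-norm vector yields a smaller (or equal) variance
rng = np.random.default_rng(0)
u = rng.normal(size=2)
u /= np.linalg.norm(u)
print(u @ Sigma @ u)                                 # <= lambda_1
```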
It is possible to have a more compact representation of PCA using matrices. Let \(A = \left( \alpha_{1}, \dots, \alpha_p \right) \in \mathbb{R}^{p \times p}\) be the matrix whose columns are the eigenvectors. We have \(Y = A^{\top} X\) and the covariance matrix of the principal components is written as \[\mathrm{Var}(Y) = A^\top \Sigma A = \Lambda,\] where \(\Lambda = \mathrm{diag}\left( \lambda_{1}, \dots, \lambda_p \right)\).
An overall measure of the variation present in the data is given by the trace of the matrix \(\Sigma\): \[\text{tr}(\Sigma) = \text{tr}(\Lambda) = \sum_{i = 1}^{p} \lambda_i = \sum_{k = 1}^{p} \mathrm{Var}(Y_k).\]
The proportion of variation explained by the principal component \(Y_k\) is therefore given by the ratio between the \(k\)-th eigenvalue and the sum of the eigenvalues: \[\frac{\lambda_k}{\lambda_{1} + \cdots + \lambda_p} = \frac{\mathrm{Var}(Y_k)} {\text{tr}(\Sigma)}.\]
Similarly, the first \(m\) components explain \[100\% \times \frac{\sum_{k = 1}^{m} \lambda_k}{\sum_{k = 1}^{p} \lambda_k} = 100\% \times \frac{\sum_{k = 1}^{m} \mathrm{Var}(Y_k)} {\sum_{k = 1}^{p} \mathrm{Var}(Y_k)}\] of the variability in the variables.
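As a small illustration, the sketch below (assuming NumPy; the eigenvalues are the rounded ones of the example matrix used above) computes these proportions and their cumulative sums.

```python
import numpy as np

eigenvalues = np.array([5.56, 1.44])           # lambda_1, lambda_2 from the example above
explained = eigenvalues / eigenvalues.sum()    # lambda_k / tr(Sigma)
cumulative = np.cumsum(explained)              # variance explained by the first m components
print(explained.round(3))                      # [0.794 0.206]
print(cumulative.round(3))                     # [0.794 1.   ]
```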
PCA in practice
Estimation of the variance-covariance matrix
In practice, the variance-covariance matrix \(\Sigma\) is unknown. To perform PCA, it is necessary to estimate \(\Sigma\) from a random sample \(X_{1}, \dots, X_n\) of independent realizations of \(X\). An (unbiased) estimator of \(\Sigma\) is given by \[\widehat{\Sigma} = \frac{1}{n - 1} \sum_{i = 1}^{n} \left( X_i - \overline{X} \right)\left( X_i - \overline{X} \right)^{\top},\] where \(\overline{X}\) is the empirical mean of the sample.
The matrix \(\widehat{\Sigma}\) thus obtained is symmetric with real coefficients and therefore diagonalizable. It admits a spectral decomposition of the form \[\widehat{\Sigma} = \widehat{A} \widehat{\Lambda} \widehat{A}^{\top},\] where \(\widehat{A}\) is an orthogonal matrix whose columns are the estimators of the eigenvectors of \(\Sigma\) and \(\widehat{\Lambda}\) is a diagonal matrix containing the estimators of the eigenvalues of \(\Sigma\), assumed to be ordered in descending order.
The principal components are obtained by projecting the centered observations onto the basis of eigenvectors: \[Y_i = \widehat{A}^{\top} \left( X_i - \overline{X} \right).\]
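Putting these steps together, here is a minimal end-to-end sketch (assuming NumPy; the data are simulated purely for illustration): center the sample, estimate \(\Sigma\), diagonalize \(\widehat{\Sigma}\), and project the observations.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                        # n = 100 observations, p = 3 variables

X_centered = X - X.mean(axis=0)                      # center each variable
Sigma_hat = X_centered.T @ X_centered / (X.shape[0] - 1)   # unbiased estimator of Sigma

eigenvalues, A_hat = np.linalg.eigh(Sigma_hat)       # spectral decomposition
order = np.argsort(eigenvalues)[::-1]                # descending eigenvalue order
eigenvalues, A_hat = eigenvalues[order], A_hat[:, order]

Y = X_centered @ A_hat                               # row i holds Y_i = A_hat^T (X_i - X_bar)
print(np.cov(Y, rowvar=False).round(3))              # approximately diag(lambda_1, ..., lambda_p)
```

Note that statistical software often obtains the same quantities through a singular value decomposition of the centered data matrix, which is numerically more stable than forming \(\widehat{\Sigma}\) explicitly.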
Some remarks
Choosing the number of components
A key issue in PCA is choosing how many principal components to retain. Retaining too many does not reduce the dimension, and retaining too few can result in the loss of relevant information. Here are the main rules of thumb used:
80% rule: Retain the minimum number of components necessary to explain at least 80% of the total variance. This threshold is arbitrary, but it often provides a good intuition.
Kaiser’s rule: If PCA is performed using the correlation matrix, then the average eigenvalue is 1. It is recommended to keep only those components with an eigenvalue greater than the average eigenvalue, i.e., \(1\).
Jolliffe’s rule: A stricter variant of Kaiser’s rule, which suggests keeping components with an eigenvalue greater than \(0.7\) for PCA performed using the correlation matrix.
Cattell’s rule (or elbow rule): Plot the eigenvalues \(\lambda_k\) as a function of their rank \(k\) and look for a breakpoint in the decline. Beyond this point, additional components explain little additional variance.
These rules are decision-making tools, but the choice of the number of components also depends on the context, the objectives of the analysis, and ease of interpretation.
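To make these rules concrete, here is a minimal sketch (assuming NumPy and Matplotlib; the eigenvalues are hypothetical values chosen only for illustration, as if they came from a correlation-matrix PCA with \(p = 5\)) that applies the 80%, Kaiser, and Jolliffe rules and draws the scree plot used for Cattell’s rule.

```python
import numpy as np
import matplotlib.pyplot as plt

eigenvalues = np.array([2.5, 1.2, 0.75, 0.35, 0.2])  # hypothetical, sorted, sum = p = 5

cumulative = np.cumsum(eigenvalues) / eigenvalues.sum()
m_80 = int(np.argmax(cumulative >= 0.80)) + 1        # 80% rule: 3 components here
m_kaiser = int((eigenvalues > 1.0).sum())            # Kaiser's rule: 2 components
m_jolliffe = int((eigenvalues > 0.7).sum())          # Jolliffe's rule: 3 components
print(m_80, m_kaiser, m_jolliffe)

# Cattell's rule: plot the eigenvalues against their rank and look for the elbow
plt.plot(np.arange(1, len(eigenvalues) + 1), eigenvalues, "o-")
plt.xlabel("Rank k")
plt.ylabel("Eigenvalue")
plt.title("Scree plot")
plt.show()
```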
