Principal Component Analysis (PCA)

By Arpit Tripathi, Founder

Principal component analysis is an unsupervised dimensionality reduction technique that projects data onto a new set of orthogonal axes, the principal components, ordered so that each captures the maximum remaining variance. Keeping the top components compresses data while preserving most of its structure.

What is Principal Component Analysis?

Principal component analysis (PCA) is an unsupervised technique for reducing the number of dimensions in a dataset while keeping as much of its variation as possible. It finds new axes, called principal components, that are linear combinations of the original features. The first principal component points in the direction of greatest variance in the data, the second points in the direction of greatest remaining variance while being orthogonal to the first, and so on. Projecting the data onto the first few components produces a lower-dimensional representation that retains most of the original structure.

PCA is used for compression, visualization (projecting high-dimensional data to two or three dimensions), noise reduction, and as a preprocessing step that decorrelates features before feeding them to another model. Because the data's coordinates along different principal components are uncorrelated, the transformed features can help models that struggle with collinear inputs.

  • PCA finds orthogonal directions (principal components) of maximal variance.
  • Keeping the top components compresses data while preserving most variance.
  • Common uses: visualization, noise reduction, decorrelation, and preprocessing.
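
The decorrelation property can be checked on a small example. The sketch below is illustrative only: the data is synthetic, and the names X_demo, pca_demo, and Z_demo are assumptions, not part of any particular workflow.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data (an assumption for illustration): two strongly correlated features.
rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = 0.9 * x1 + 0.1 * rng.normal(size=500)
X_demo = np.column_stack([x1, x2])

# Keep both components so the projection is a pure rotation of the centered data.
pca_demo = PCA(n_components=2)
Z_demo = pca_demo.fit_transform(X_demo)

# Correlation between the two original features.
corr_before = np.corrcoef(X_demo, rowvar=False)[0, 1]
# Correlation between the two projected coordinates.
corr_after = np.corrcoef(Z_demo, rowvar=False)[0, 1]

# Rows of components_ are the component directions; their Gram matrix
# is the identity when the directions are orthonormal.
gram = pca_demo.components_ @ pca_demo.components_.T

print("Correlation before PCA:", round(corr_before, 3))
print("Correlation after PCA:", round(corr_after, 3))
print("Components orthonormal:", np.allclose(gram, np.eye(2)))
```

The original features are almost perfectly correlated, while the projected coordinates are uncorrelated up to floating-point error.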

How PCA works mathematically

PCA begins by centering the data (subtracting the mean of each feature) and usually standardizing it so features with large scales do not dominate. It then finds the directions of maximum variance through the eigenvectors of the data's covariance matrix, or equivalently through the singular value decomposition (SVD) of the centered data matrix, which is what most modern implementations use for numerical stability.

The eigenvectors of the covariance matrix are the principal components, and each eigenvalue equals the variance captured along its component. Sorting components by eigenvalue from largest to smallest gives the ordering by importance. Projecting the centered data onto the top k eigenvectors yields the reduced k-dimensional representation.

C = (1/(n−1)) · XᵀX,   C vₖ = λₖ vₖ

For a centered data matrix X with n rows (one per sample), the covariance matrix C has eigenvectors vₖ (the principal components) and eigenvalues λₖ (the variance captured along each component).

explained variance ratioₖ = λₖ / Σⱼ λⱼ

The fraction of total variance captured by component k is its eigenvalue divided by the sum of all eigenvalues. Summing the top ratios tells you how much variance the reduced representation retains.

  • Center and typically standardize the features first.
  • Principal components are eigenvectors of the covariance matrix (or from the SVD).
  • Each eigenvalue is the variance captured by its component; sort descending.
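
The steps above can be sketched directly in NumPy. This is a minimal illustration, not how any particular library implements PCA: the data is randomly generated, and the variable names (X_raw, Xc, eigvals, svd_vals, and so on) are assumptions. It computes the components both ways and relies on the standard identity that the squared singular values of the centered matrix, divided by n − 1, equal the covariance eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(42)
# Hypothetical data: 200 samples, 3 features with different spreads,
# mixed by a random matrix so the features are correlated.
X_raw = rng.normal(size=(200, 3)) * np.array([3.0, 1.0, 0.3])
X_raw = X_raw @ rng.normal(size=(3, 3))

# Step 1: center each feature.
Xc = X_raw - X_raw.mean(axis=0)
n = Xc.shape[0]

# Step 2a: eigendecomposition of the covariance matrix C = Xc^T Xc / (n - 1).
C = (Xc.T @ Xc) / (n - 1)
eigvals, eigvecs = np.linalg.eigh(C)       # eigh returns ascending eigenvalues
order = np.argsort(eigvals)[::-1]          # re-sort to descending
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Step 2b: the same quantities from the SVD of the centered data.
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
svd_vals = S**2 / (n - 1)                  # squared singular values -> variances

# Step 3: explained variance ratio per component.
explained_ratio = eigvals / eigvals.sum()

# Step 4: project onto the top k components.
k = 2
Z = Xc @ eigvecs[:, :k]

print("Eigenvalues match SVD:", np.allclose(eigvals, svd_vals))
print("Explained variance ratio:", explained_ratio.round(3))
print("Projected shape:", Z.shape)
```

The eigenvectors from the two routes agree only up to sign, since a direction and its negation describe the same axis.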

Choosing the number of components

The explained variance ratio guides how many components to keep. A common approach is to retain enough components to reach a target cumulative explained variance, for example 95 percent. A scree plot, which graphs eigenvalues in descending order, helps spot an elbow where additional components add little. The right number trades compression against information loss for the task at hand.
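
One way to apply the cumulative-variance rule is sketched below. It assumes scikit-learn's bundled digits dataset purely as example data; the names X_std, pca_full, cumulative, and k_95 are illustrative.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Example data (an assumption): 64 pixel features per sample.
X_digits = load_digits().data
X_std = StandardScaler().fit_transform(X_digits)

# Fit with all components to see the full variance spectrum.
pca_full = PCA().fit(X_std)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

target = 0.95
# searchsorted gives the first index whose cumulative ratio reaches the target;
# adding 1 converts that index into a component count.
k_95 = int(np.searchsorted(cumulative, target) + 1)

print("Components needed for 95% variance:", k_95)
print("Variance retained with that many:", cumulative[k_95 - 1].round(3))

# The values a scree plot would show, largest first.
print("Leading eigenvalues:", pca_full.explained_variance_[:5].round(2))
```

Passing a float such as n_components=0.95 to PCA performs the same kind of selection internally, so the manual version is mainly useful when you want to inspect the whole curve before committing to a threshold.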

Standardization matters because PCA is sensitive to feature scale: a feature measured in large units would otherwise dominate the variance and the first component. When features are on comparable scales or already standardized, PCA reflects genuine structure rather than units.

  • Keep components until cumulative explained variance hits a target like 95 percent.
  • Use a scree plot to find the elbow where extra components add little.
  • Standardize first, because PCA is sensitive to differing feature scales.
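
The effect of scale is easy to see on a toy example. The sketch below uses made-up data: two independent features, hypothetically named height_m and income, that differ only in units and spread.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
# Hypothetical features: independent of each other, very different scales.
height_m = rng.normal(1.7, 0.1, size=1000)        # spread of about 0.1
income = rng.normal(50_000, 10_000, size=1000)    # spread of about 10,000
X_units = np.column_stack([height_m, income])

# PCA on the raw values versus PCA on standardized values.
pca_raw = PCA(n_components=2).fit(X_units)
pca_scaled = PCA(n_components=2).fit(StandardScaler().fit_transform(X_units))

raw_ratio = pca_raw.explained_variance_ratio_
scaled_ratio = pca_scaled.explained_variance_ratio_

print("Raw explained variance ratio:", raw_ratio.round(4))
print("Raw first component weights:", pca_raw.components_[0].round(4))
print("Standardized explained variance ratio:", scaled_ratio.round(4))
```

On the raw data the first component is almost entirely the income axis and accounts for nearly all the variance, simply because income has the larger numbers. After standardization the two independent features share the variance roughly equally.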

Using PCA and its limits

In scikit-learn, PCA is fit on training data and then used to transform both training and new data, with the explained_variance_ratio_ attribute reporting how much variance each component captures. The same fitted transform must be applied at inference time to keep the representation consistent.

PCA is linear, so it cannot capture nonlinear structure; for that, kernel PCA, t-SNE, or UMAP are alternatives. Principal components are also harder to interpret than original features because each is a mixture of all inputs. Finally, PCA is unsupervised, so the directions of highest variance are not guaranteed to be the most useful for a downstream prediction task.
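
The linearity limit can be illustrated with two concentric rings, a standard toy case for nonlinear structure. The sketch below is a rough demonstration under stated assumptions: the data is synthetic, gamma=10 is a hand-picked value rather than a tuned one, and the helper linear_accuracy is hypothetical. It uses the training accuracy of a linear classifier only as a probe of how separable the two rings are after each projection.

```python
from sklearn.datasets import make_circles
from sklearn.decomposition import PCA, KernelPCA
from sklearn.linear_model import LogisticRegression

# Synthetic data (an assumption): an inner ring and an outer ring,
# which no straight line can separate.
X_rings, y_rings = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)

# Linear PCA only rotates the plane, so the rings stay nested.
Z_linear = PCA(n_components=2).fit_transform(X_rings)

# Kernel PCA with an RBF kernel; gamma=10 is an illustrative choice.
Z_kernel = KernelPCA(n_components=2, kernel="rbf", gamma=10).fit_transform(X_rings)


def linear_accuracy(Z, y):
    """Training accuracy of a linear classifier, used as a rough separability probe."""
    return LogisticRegression().fit(Z, y).score(Z, y)


acc_linear = linear_accuracy(Z_linear, y_rings)
acc_kernel = linear_accuracy(Z_kernel, y_rings)

print("Linear separability after PCA:", round(acc_linear, 3))
print("Linear separability after kernel PCA:", round(acc_kernel, 3))
```

The ring labels are used here only to measure the result; neither PCA nor kernel PCA sees them during fitting.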

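```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Example data (an assumption for illustration): the bundled 64-feature digits set.
# Substitute your own feature matrix of shape (n_samples, n_features).
X = load_digits().data

# Standardize so that no feature dominates purely because of its scale.
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=0.95)  # keep enough components for 95% of variance
X_reduced = pca.fit_transform(X_scaled)

print("Original dims:", X_scaled.shape[1])
print("Reduced dims:", X_reduced.shape[1])
print("Cumulative variance kept:", np.sum(pca.explained_variance_ratio_).round(3))
```
Reducing dimensionality with scikit-learn PCA and inspecting explained variance.
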
  • Fit PCA on training data, then apply the same transform at inference.
  • PCA is linear; use kernel PCA, t-SNE, or UMAP for nonlinear structure.
  • Components mix all features, so they are less interpretable, and variance is not the same as predictive usefulness.
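
The fit-once, reuse-at-inference pattern might look like the following sketch. It assumes the bundled digits dataset as stand-in data and wraps the scaler and PCA in a scikit-learn pipeline so that both are fitted on the training split only; the names X_train, X_new, and reducer are illustrative.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data (an assumption): treat the held-out split as "new" data.
X_all = load_digits().data
X_train, X_new = train_test_split(X_all, test_size=0.25, random_state=0)

# Scaler and PCA are fitted together, on the training split only.
reducer = make_pipeline(StandardScaler(), PCA(n_components=0.95))
Z_train = reducer.fit_transform(X_train)

# At inference, reuse the fitted means, scales, and components without refitting.
Z_new = reducer.transform(X_new)

print("Training representation:", Z_train.shape)
print("New-data representation:", Z_new.shape)
```

Calling fit_transform again on the new data would learn different means and components, so the two representations would no longer be comparable.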

Key takeaways

  • PCA projects data onto orthogonal principal components ordered by the variance they capture.
  • The first component captures the most variance; each later one captures the most remaining variance while staying orthogonal.
  • Principal components are eigenvectors of the covariance matrix, and eigenvalues give the variance explained.
  • Standardize features first, since PCA is sensitive to scale, and choose component count by cumulative explained variance.
  • PCA is linear and unsupervised, so it may miss nonlinear structure and high-variance directions are not always the most predictive.

Frequently asked questions

What does PCA do?
PCA reduces the number of dimensions in data by projecting it onto new orthogonal axes called principal components, ordered by how much variance each captures. Keeping the top few components compresses the data while preserving most of its structure.

How does PCA find the principal components?
After centering the data, PCA finds the eigenvectors of the covariance matrix (or uses singular value decomposition). The eigenvector with the largest eigenvalue is the first component, the one with the next largest eigenvalue, orthogonal to the first, is the second, and so on by variance captured.

How many components should you keep?
Keep enough to reach a target cumulative explained variance, commonly around 95 percent, or use a scree plot to find the elbow where extra components add little. The choice trades compression against information loss for your task.

Why standardize features before PCA?
PCA is sensitive to feature scale, so a feature measured in large units would dominate the variance and the first component. Standardizing features to comparable scales ensures PCA captures genuine structure rather than differences in units.

What are the limitations of PCA?
PCA is linear, so it misses nonlinear structure that methods like kernel PCA, t-SNE, or UMAP can capture. Components mix all features, making them hard to interpret, and because PCA is unsupervised, high-variance directions are not always the most predictive.

Is PCA supervised or unsupervised?
PCA is unsupervised; it uses only the feature values and ignores any target labels. It finds directions of maximum variance in the inputs, which are not guaranteed to be the most useful directions for a downstream prediction task.
