Principal Component Analysis
(ns neandersolve.pca
  (:require
   [neandersolve
    [descriptive :as desc]
    [eigen :as eigen]]
   [uncomplicate.commons.core :refer [with-release]]
   [uncomplicate.fluokitten.core :refer [fmap!]]
   [uncomplicate.neanderthal
    [core :refer [axpy col copy entry entry! mm mrows mv! ncols raw rk! scal!
                  sum trans vctr]]
    [native :refer [fge]]
    [math :refer [sqrt]]
    [random :refer [rand-normal! rand-uniform!]]]))
High-dimensional data often contains redundant or correlated features. While each feature may carry information, the true patterns often lie in lower-dimensional subspaces. Principal Component Analysis (PCA) provides a mathematical framework for discovering these intrinsic patterns by transforming data into a new coordinate system aligned with directions of maximum variance.
Mathematical Foundation
For a dataset with \(n\) observations and \(p\) features, represented as an \(n \times p\) matrix \(\mathbf{X}\), PCA seeks a transformation that reveals the underlying structure. This transformation comes from the eigendecomposition of the covariance matrix, computed here from the column-centered \(\mathbf{X}\):
\[\mathbf{C} = \frac{1}{n-1}\mathbf{X}^T\mathbf{X}\]
The eigenvectors of \(\mathbf{C}\) form the principal components, while their corresponding eigenvalues indicate the amount of variance explained by each component.
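Before turning to code, a small worked instance (hand arithmetic, not library output) shows how the pieces fit together. For the covariance matrix
\[\mathbf{C} = \begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix},\]
the eigenpairs are
\[\mathbf{C}\begin{pmatrix}1\\1\end{pmatrix} = 3\begin{pmatrix}1\\1\end{pmatrix}, \qquad \mathbf{C}\begin{pmatrix}1\\-1\end{pmatrix} = 1\begin{pmatrix}1\\-1\end{pmatrix},\]
so the first principal component is \((1,1)^T/\sqrt{2}\), and by itself it explains \(3/(3+1) = 75\%\) of the total variance.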
Data Preprocessing
Before computing principal components, we must ensure our data is properly centered and optionally scaled. This preprocessing step is crucial for PCA’s effectiveness:
(defn center-data
  "Centers the data matrix by subtracting column means.
   Returns [centered-data column-means]."
  [X]
  (let [n (mrows X)
        means (entry! (vctr X (ncols X)) 0.0)
        centered (copy X)
        ones (entry! (vctr X n) 1.0)]
    ;; Compute means efficiently using matrix-vector multiplication
    (mv! (/ 1.0 n) (trans X) ones 0.0 means)
    ;; Center using rank-1 update with outer product of ones and means
    (rk! -1.0 ones means centered)
    [centered means]))
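As a quick check (a sketch; the shown values are hand-computed), centering a small 2x3 matrix whose rows are the observations [1 3 5] and [2 4 6]:

;; Column means are [1.5 3.5 5.5]; every centered entry becomes +/-0.5.
(let [[centered means] (center-data (fge 2 3 [1 2 3 4 5 6]))]
  [means centered])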
The centering operation ensures that each feature has zero mean, removing location effects that could bias our variance calculations. For features measured on different scales, we also provide scaling to unit variance:
(defn scale-data
  "Scales the data matrix to unit variance.
   Returns [scaled-data column-stds]."
  [X]
  (let [p (ncols X)
        stds (entry! (vctr X p) 0.0)
        scaled (copy X)]
    (loop [j 0]
      (when (< j p)
        (let [col-std (sqrt (desc/variance (col X j)))]
          (entry! stds j col-std)
          (scal! (/ 1.0 col-std) (col scaled j))
          (recur (inc j)))))
    [scaled stds]))
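A similar sketch for scaling, assuming desc/variance computes the sample variance (consistent with the \(1/(n-1)\) normalization used below): each column of the 2x3 matrix above has variance 0.5, so every column is divided by \(\sqrt{0.5} \approx 0.707\).

;; Each column's sample standard deviation is sqrt(0.5) ~ 0.707.
(let [[scaled stds] (scale-data (fge 2 3 [1 2 3 4 5 6]))]
  [stds scaled])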
Covariance Computation
The covariance matrix captures the relationships between features. For centered data \(\mathbf{X}\), we compute it efficiently through matrix multiplication:
\[\mathbf{C} = \frac{1}{n-1}\mathbf{X}^T\mathbf{X}\]
(defn compute-covariance
  "Computes the covariance matrix of centered data.
   Each row represents an observation, each column a variable."
  [X-centered]
  (let [n (mrows X-centered)
        covar (mm (trans X-centered) X-centered)]
    (scal! (/ 1.0 (dec n)) covar)))
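Continuing the hand-worked example: after centering, every column of the 2x3 matrix is \([-0.5, 0.5]\), so with \(n = 2\) each covariance entry is \((-0.5)^2 + (0.5)^2 = 0.5\):

;; All nine entries of the resulting 3x3 covariance matrix equal 0.5.
(let [[centered _] (center-data (fge 2 3 [1 2 3 4 5 6]))]
  (compute-covariance centered))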
Model Fitting
The core PCA algorithm combines preprocessing, covariance computation, and eigendecomposition into a unified workflow. This process reveals the principal components and their relative importance in explaining data variance:
- Optionally center the data: \(\mathbf{X}_c = \mathbf{X} - \mathbf{1}\boldsymbol{\mu}^T\)
- Optionally scale to unit variance: \(\mathbf{X}_s = \mathbf{X}_c\mathbf{D}^{-1}\)
- Compute covariance matrix: \(\mathbf{C} = \frac{1}{n-1}\mathbf{X}_s^T\mathbf{X}_s\)
- Perform eigendecomposition: \(\mathbf{C} = \Phi\mathbf{\Lambda}\Phi^T\)
(defn pca-fit
  "Fits PCA model to data matrix X.
   Returns map containing principal components, explained variance, etc."
  ([X]
   (pca-fit X true true))
  ([X center?]
   (pca-fit X center? false))
  ([X center? scale?]
   (let [[X-processed means] (if center?
                               (center-data X)
                               [X (entry! (vctr X (ncols X)) 0.0)])
         [X-final stds] (if scale?
                          (scale-data X-processed)
                          [X-processed (entry! (vctr X-processed (ncols X-processed)) 1.0)])
         cov-matrix (compute-covariance X-final)
         [eigenvals eigenvecs] (eigen/eigendecomposition cov-matrix)
         total-var (sum eigenvals)
         explained-var-ratio (fmap! #(/ % total-var) (copy eigenvals))]
     {:components eigenvecs
      :explained_variance eigenvals
      :explained_variance_ratio explained-var-ratio
      :means means
      :scale stds})))
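A minimal sketch of the returned structure, using the toy matrix from the examples below:

;; Fit with default centering and scaling; the model is a plain map
;; holding :components, :explained_variance, :explained_variance_ratio,
;; :means, and :scale, as constructed above.
(let [model (pca-fit (fge 2 3 [1 2 3 4 5 6]))]
  (keys model))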
Data Transformation
Once we have fitted a PCA model, we can transform new data into the principal component space. This transformation involves:
- Centering: \(\mathbf{X}_c = \mathbf{X} - \mathbf{1}\boldsymbol{\mu}^T\)
- Scaling: \(\mathbf{X}_s = \mathbf{X}_c\mathbf{D}^{-1}\)
- Projection: \(\mathbf{X}_{pca} = \mathbf{X}_s\Phi_k\)
where \(\Phi_k\) contains the first \(k\) principal components.
(defn transform
  "Transforms data using fitted PCA model.
   Optional n-components parameter for dimensionality reduction."
  ([X pca-model]
   (transform X pca-model (ncols (:components pca-model))))
  ([X pca-model n-components]
   (let [means (:means pca-model)
         scale (:scale pca-model)
         ;; Center the data
         X-centered (if (some? means)
                      (let [means-mat (raw X)]
                        (loop [i 0]
                          (when (< i (mrows X))
                            (loop [j 0]
                              (when (< j (ncols X))
                                (entry! means-mat i j (entry means j))
                                (recur (inc j))))
                            (recur (inc i))))
                        (axpy -1.0 means-mat X))
                      X)
         ;; Scale the centered data
         X-scaled (if (some? scale)
                    (let [scaled (copy X-centered)]
                      (loop [j 0]
                        (when (< j (ncols scaled))
                          (scal! (/ 1.0 (entry scale j)) (col scaled j))
                          (recur (inc j))))
                      scaled)
                    X-centered)
         ;; Select components for dimensionality reduction: copy the
         ;; leading n-components columns into a p x n-components matrix
         components (let [full-components (:components pca-model)
                          selected (fge (mrows full-components) n-components)]
                      (loop [i 0]
                        (when (< i (mrows full-components))
                          (loop [j 0]
                            (when (< j n-components)
                              (entry! selected i j (entry full-components i j))
                              (recur (inc j))))
                          (recur (inc i))))
                      selected)]
     ;; Transform the data using the principal components
     (mm X-scaled components))))
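Passing an explicit n-components keeps only the leading eigenvector columns, so the projection itself performs the dimensionality reduction. A minimal sketch with the toy matrix:

;; Keep only the first principal component: the result is 2x1
;; rather than 2x3.
(let [X (fge 2 3 [1 2 3 4 5 6])]
  (transform X (pca-fit X) 1))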
Inverse Transformation
To reconstruct data from its principal component representation, we reverse the transformation process:
- Back-projection: \(\mathbf{X}_s = \mathbf{X}_{pca}\Phi_k^T\)
- Unscaling: \(\mathbf{X}_c = \mathbf{X}_s\mathbf{D}\)
- Uncentering: \(\mathbf{X} = \mathbf{X}_c + \mathbf{1}\boldsymbol{\mu}^T\)
(defn inverse-transform
  "Reconstructs original data from transformed data."
  [X-transformed pca-model]
  (let [;; Project back to original feature space, using only the
        ;; components that correspond to the columns of X-transformed
        components (let [full-components (:components pca-model)
                         selected (fge (mrows full-components) (ncols X-transformed))]
                     (loop [i 0]
                       (when (< i (mrows full-components))
                         (loop [j 0]
                           (when (< j (ncols X-transformed))
                             (entry! selected i j (entry full-components i j))
                             (recur (inc j))))
                         (recur (inc i))))
                     selected)
        X-scaled (mm X-transformed (trans components))
        ;; Unscale the data
        X-unscaled (if (some? (:scale pca-model))
                     (let [scaled (copy X-scaled)]
                       (loop [j 0]
                         (when (< j (ncols scaled))
                           (scal! (entry (:scale pca-model) j) (col scaled j))
                           (recur (inc j))))
                       scaled)
                     X-scaled)
        ;; Uncenter the data
        result (if (some? (:means pca-model))
                 (let [means-mat (raw X-unscaled)]
                   (loop [i 0]
                     (when (< i (mrows X-unscaled))
                       (loop [j 0]
                         (when (< j (ncols X-unscaled))
                           (entry! means-mat i j (entry (:means pca-model) j))
                           (recur (inc j))))
                       (recur (inc i))))
                   (axpy 1.0 means-mat X-unscaled))
                 X-unscaled)]
    result))
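A useful sanity check: with all components retained, transforming and then inverse-transforming should reproduce the original matrix up to floating-point rounding, since centering, scaling, and the orthogonal projection are each reversed.

;; Round trip: X -> principal-component space -> reconstructed X.
(let [X (fge 2 3 [1 2 3 4 5 6])
      model (pca-fit X)]
  (inverse-transform (transform X model) model))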
Utility Functions
Explained Variance Analysis
The explained variance ratio helps determine the optimal number of components to retain. The cumulative explained variance shows how much of the total variance is captured by the first \(k\) components:
(defn explained-variance-cumsum
  "Computes cumulative explained variance ratio."
  [pca-model]
  (reductions + (:explained_variance_ratio pca-model)))
(defn n-components-for-variance
  "Determines number of components needed to explain desired variance ratio."
  [pca-model target-variance]
  (let [cumsum (explained-variance-cumsum pca-model)]
    (inc (count (take-while #(< % target-variance) cumsum)))))
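As a sketch (the numbers vary with the random draw; rand-uniform! fills a matrix with uniform samples in place), one might ask how many components are needed to cover 95% of the variance in random data:

;; Fit PCA on 100 random observations of 5 features, then find how
;; many components explain at least 95% of the variance.
(let [X (rand-uniform! (fge 100 5))
      model (pca-fit X)]
  {:cumulative (vec (explained-variance-cumsum model))
   :n-needed (n-components-for-variance model 0.95)})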
Example Usage
Transform a simple 2x3 matrix:
(transform (fge 2 3 [1 2 3 4 5 6])
           (pca-fit (fge 2 3 [1 2 3 4 5 6])))

#RealGEMatrix[float, mxn:2x3, layout:column]
   ▥       ↓       ↓       ↓       ┓
   →       1.22   -0.00   -0.00
   →      -1.22    0.00    0.00
   ┗                               ┛
Reconstruct the original data:
(inverse-transform (fge 2 3 [1.22 -1.22 0 0 0 0])
                   (pca-fit (fge 2 3 [1 2 3 4 5 6])))

#RealGEMatrix[float, mxn:2x3, layout:column]
   ▥       ↓       ↓       ↓       ┓
   →       1.00    3.00    5.00
   →       2.00    4.00    6.00
   ┗                               ┛
source: src/neandersolve/pca.clj