Least Angle Regression in Compressed Sensing


Abraham D. Smith

CyberOptics

2015

Linear Fit



(Figure: a sequence of progressively better linear fits; the Fraction of Variance Explained rises from 0.000 to 0.576 across the frames.)

Linear Fit

Initial Data:

Array of Responses \(Y\), shape \(n \times 1\)
Array of Predictors \(X\), shape \(n \times p\)

Goal:

Find a model \(M\), shape \(p \times 1\), such that \( Y \approx XM\)

Quality Measurement:

  • Residual \(R(M)=Y-XM\), shape \(n \times 1\)
  • Residual Sum of Squares \( RSS(M) = R^TR = \sum_k \left(Y_k - \sum_j X_{k,j}M_{j}\right)^2 = \|R\|_{L_2}^2\)
  • Fraction of Variance Explained \( \mathop{FVE}(M) = \frac{Y^TY - R^TR}{Y^TY} = \frac{ \|Y\|^2_{L_2} - \|R\|^2_{L_2}}{\|Y\|^2_{L_2}}\) (see the sketch below).
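A minimal NumPy sketch of these quality measures (the array names Y, X, M are placeholders, not from the slides):

import numpy as np

def rss(Y, X, M):
    # Residual R(M) = Y - X M, shape (n, 1); RSS is its squared L2 norm.
    R = Y - X @ M
    return float(R.T @ R)

def fve(Y, X, M):
    # Fraction of variance explained: (||Y||^2 - ||R||^2) / ||Y||^2.
    return (float(Y.T @ Y) - rss(Y, X, M)) / float(Y.T @ Y)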

Best Linear Fit!


Gauss–Markov: the least-squares \(M_f\) is the best (minimum-variance) linear unbiased estimator.
Best fit \(M_f\) should minimize error.
Remember calculus! Find a critical point:
\[0 = RSS'(M_f) = -2 X^T(Y-XM_f) = -2X^TY + 2X^TXM_f \]
so the best-fitting model \(M_f\) solves a linear equation \[ \left(X^TX\right)\,M_f = X^T Y\] \[M_f = \left(X^TX\right)^{-1}X^TY\]
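A minimal NumPy sketch of this closed-form fit (names are placeholders; this assumes \(X^TX\) is invertible):

import numpy as np

def best_linear_fit(X, Y):
    # Solve the normal equations (X^T X) M_f = X^T Y directly,
    # rather than forming the explicit inverse of X^T X.
    return np.linalg.solve(X.T @ X, X.T @ Y)

# np.linalg.lstsq(X, Y, rcond=None) is a more numerically stable alternative.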

Best Linear Fit?

Each line below drops another term from the model; the remaining coefficients shift only slightly at first, but eventually nothing is left:
\( y \approx 25.0012\,x_1 + 0.006\,x_2 + 8.992\,x_3 + \cdots + 2.036\,x_k + \cdots -387.345\,x_{10000}. \)
\( y \approx 25.0006\,x_1 \phantom{~ 0.006\,x_2 + }\, + 8.753\,x_3 + \cdots + 2.032\,x_k + \cdots -387.355\,x_{10000}. \)
\( y \approx 24.0882\,x_1 \phantom{~ 0.006\,x_2 + }\, + 8.733\,x_3 + \cdots \phantom{+ ~ 2.032\,x_k + }\, \cdots -388.392\,x_{10000}. \)
\( y \approx 22.0006\,x_1 \phantom{~ 0.006\,x_2 + \, + 8.733\,x_3 + \cdots + ~ 2.032\,x_k + \, \cdots} -383.392\,x_{10000}. \)
\( y \approx \phantom{22.0006\,x_1 ~ 0.006\,x_2 + \, + 8.733\,x_3 + \cdots + ~ 2.032\,x_k + \, \cdots} -360.341\,x_{10000}. \)
\( y \approx 0. \phantom{2.0006\,x_1 ~ 0.006\,x_2 + \, + 8.733\,x_3 + \cdots + ~ 2.032\,x_k + \, \cdots -360.341\,x_{10000}} \)
Practically, we must balance accuracy against simplicity.

New Goal

Minimize \(RSS(M)\) among all \(M\) of a given complexity.
How to measure complexity? Need a norm.
  • \(\|v\|_{L_2} = \sqrt{|v_1|^2 + \cdots + |v_n|^2}\)
  • \(\|v\|_{L_1} = |v_1| + \cdots + |v_n| \)
  • \(\|v\|_{L_0} = \) number of nonzero terms
(Curse of Dimensionality means \(L_1\) is almost \(L_0\)!)
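A quick NumPy illustration of the three complexity measures (the vector below is invented for the example):

import numpy as np

v = np.array([3.0, 0.0, -4.0])
l2 = np.sqrt(np.sum(np.abs(v) ** 2))   # same as np.linalg.norm(v, 2) -> 5.0
l1 = np.sum(np.abs(v))                 # same as np.linalg.norm(v, 1) -> 7.0
l0 = np.count_nonzero(v)               # number of nonzero terms      -> 2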

Aside: The curse of dimensionality

if \(n=2\), then \( (0.90)^n = 0.81 \)
if \(n=3\), then \( (0.90)^n =0.729 \)
if \(n=10000\), then \( (0.90)^n =2.66\times 10^{-458} \)

Lasso LARS

Fix a budget for \(\|M\|_{L_1}\), then minimize \(RSS(M)\). Equivalently, with a Lagrange-multiplier penalty \(t\):
\( E_t(M) = \frac12 RSS(M) + t \|M\|_{L_1} \)
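A minimal sketch of this penalized objective in NumPy (Y, X, M, t are placeholder names):

import numpy as np

def lasso_objective(Y, X, M, t):
    # E_t(M) = 1/2 * RSS(M) + t * ||M||_{L1}
    R = Y - X @ M
    return 0.5 * float(R.T @ R) + t * float(np.sum(np.abs(M)))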

LARS

set model M = [0,0,...,0]
set residual = Y
set active_vars = { }
while ( active_vars is not everything ) {
  set i = index of the predictor most correlated with residual
  put i in active_vars
  set Mi = least-squares fit of residual on the active_vars
  set t = largest step before some inactive predictor is as correlated as the active ones
  set M = M + t*Mi
  set residual = Y - X.M
  break if RSS(M) is good enough
}
return M
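The pseudocode above is only a sketch; in practice a library routine computes the entire path. A hedged example using scikit-learn's lars_path (the toy data below is invented for illustration):

import numpy as np
from sklearn.linear_model import lars_path

# Toy data: n = 50 observations, p = 200 predictors, only 3 truly active.
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 200))
true_M = np.zeros(200)
true_M[[3, 40, 117]] = [25.0, 9.0, 2.0]
Y = X @ true_M + 0.1 * rng.standard_normal(50)

# Full lasso/LARS trajectory: coefs[:, k] is the model after the k-th step.
alphas, active, coefs = lars_path(X, Y, method="lasso")
print(active[:3])                   # indices entering the active set, in order
print(coefs[:, -1][[3, 40, 117]])   # final coefficients on the truly active predictors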

Coefficients of the model \(M(t)\) versus the budget \(t\); at \(t = 4.5\) the path reaches the full least-squares fit, \(M(4.5) = M_f\).


(Figure: the sequence of LARS models; the Fraction of Variance Explained rises from 0.000 to 0.576 across the frames.)

Compressed Sensing

A long-standing problem: signal reconstruction from incomplete data.
\[ f \overset{\mathcal{F}}\longrightarrow \hat{f} \overset{\text{sensors}}\longrightarrow \hat{f}|_\Omega \overset{\mathcal{F}^{-1}}{\longrightarrow}f?? \]
(Figure: reconstructions of the sample image as frequency data is discarded, from the original and 0% loss up through 90% loss and total loss.)
\[ f \overset{\mathcal{F}}\longrightarrow \hat{f} \overset{\text{sensors}}\longrightarrow \hat{f}|_\Omega \overset{\mathcal{F}^{-1}}{\longrightarrow}f?? \] Step 1 is physics. Step 2 is engineering. Step 3 is mathematics.

Initial Data:

partial freq data \(\hat{f}_\Omega\), shape \(n \times 1\)
partial Transform \(\mathcal{F}_\Omega\), shape \(n \times p\)

Goal:

Find \(g\) with \(\hat{f}_\Omega \approx \mathcal{F}_\Omega g\).

Optimize

\( \min \| g\|_{L_p} \) s.t. \(\hat{g}_\Omega=\hat{f}_\Omega\)

Initial Data:

response data \(Y\), shape \(n \times 1\)
predictor data \(X\), shape \(n \times p\)

Goal:

Find \(M\) with \(Y \approx XM\).

Optimize

\( \min RSS(M) \) s.t. \(\|M\|_{L_p}=\text{const}\)

Lagrangian

\[ E_t(g) = t \|\hat{f}_\Omega- \mathcal{F}_\Omega g\|^2_{L_2} +\|g\|_{L_1} \]
\[ E_t(M) = \textstyle{\frac12}\|Y-XM\|^2_{L_2} + t \|M\|_{L_1}\]
Look familiar? LARS applies to either!
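A small end-to-end sketch of this recovery, with a random real-valued sensing matrix standing in for the partial transform \(\mathcal{F}_\Omega\) (scikit-learn's LARS solvers do not accept complex data; all names and sizes below are illustrative):

import numpy as np
from sklearn.linear_model import LassoLars

# Sparse signal: p = 512 samples containing K = 10 spikes.
rng = np.random.default_rng(1)
p, K, n = 512, 10, 120
f = np.zeros(p)
f[rng.choice(p, size=K, replace=False)] = rng.standard_normal(K)

# n = 120 random measurements stand in for the partial transform F_Omega.
F_omega = rng.standard_normal((n, p)) / np.sqrt(n)
f_hat_omega = F_omega @ f

# Recover g by minimizing 1/2 ||f_hat_omega - F_omega g||^2 + t ||g||_1.
solver = LassoLars(alpha=1e-4, fit_intercept=False)
solver.fit(F_omega, f_hat_omega)
g = solver.coef_
print(np.max(np.abs(g - f)))   # small when n is on the order of C*K*log(p) or more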

When does it work?

  • Discrete signals of length \(p\) comprised of \(K\) spikes
  • Guaranteed for a good choice of \(\Omega\) with \( | \Omega| \gtrsim 2\cdot K \)
  • Highly likely, asymptotically, for random \(\Omega\) with \( | \Omega| \gtrsim C\cdot K \cdot\log(p)\)
  • (Shockingly better than the Nyquist rate; see the rough count below!)
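For a rough sense of scale (a back-of-the-envelope count, taking the constant \(C\) to be about 1 purely for illustration): with \(p = 512\) and \(K = 10\),
\[ C \cdot K \cdot \log(p) \approx 10 \cdot \ln(512) \approx 62 \ll 512 = p, \]
so a few dozen well-chosen frequency samples can suffice where classical sampling needs all \(p\).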

How fast does it work?

  • Computing the complete trajectory costs the same order of work as the best linear fit!
  • Can stop early!

Sample image from http://statweb.stanford.edu/~candes/papers/ExactRecovery.pdf

Thanks!

Abraham D. Smith

References
  • E. Candès, J. Romberg, and T. Tao, Robust Uncertainty Principles: Exact Signal Reconstruction from Highly Incomplete Frequency Information.
  • T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning.