---
title: Dimensionality Reduction
subtitle: CMPUT 466/551
author: Ping Jin ([email protected]) and Prof. Russell Greiner ([email protected])
job: 
framework: io2012
highlighter: highlight.js
hitheme: tomorrow
widgets: 
mode: selfcontained
---
- Introduction to Dimensionality Reduction
  - General notations
  - Motivations
  - Feature selection and feature extraction
    - Feature Selection
    - Feature Extraction
- Linear Regression and Least Squares (Review)
- Subset Selection
- Shrinkage Methods
- Beyond LASSO
---&triple w1:50% w2:50%
*** left
- $\mathbf{X}$: columnwise centered $N \times p$ matrix
  - $N$: # samples, $p$: # features
  - An intercept column $\mathbf{1}$ is added to $\mathbf{X}$, making $\mathbf{X}$ an $N \times (p+1)$ matrix
- $\mathbf{y}$: $N \times 1$ vector of labels (classification) or continuous values (regression)
*** right
*** down
- Linear Regression
  - Assumption: the regression function $E(Y|X)$ is linear $$f(X) = X^T\beta$$
  - $\beta$: $(p+1) \times 1$ vector of coefficients
---&twocolportion w1:53% w2:47%
- Dimensionality Reduction is about transforming data of high dimensionality into data of much lower dimensionality
- Computational efficiency: fewer dimensions require less computation
- Accuracy: lower risk of overfitting
*** left
- Categories
  - Feature Selection: chooses a subset of features from the original feature set
  - Feature Extraction: transforms the original features into new ones, linearly or non-linearly
    - e.g. PCA, ICA, etc.
*** right
 
--- &twocolportion w1:50% w2:50%
*** left
- Easier to interpret
- Reduces cost: computation, budget, etc.

*** right
- More flexible. Feature selection is a special case of linear feature extraction

- Response: level of prostate-specific antigen (lpsa)
- Initial Feature Set: $$\{lcavol, lweight, age, lbph, svi, lcp, gleason, pgg45\}$$
- Task: predict $lpsa$ from measurements of the features
- Feature selection
  - Cost: measuring features costs money
  - Interpretation: doctors can see which features are important
- fMRI data are 4D images, with one dimension being time
- Each image is ~$50 \times 50 \times 50$ (spatial) $\times 200$ (time points) $= 25M$ dimensions
- Feature extraction
  - Individual voxel-times are not important
  - Cost is not correlated with #features
- Feature extraction offers more flexibility in transforming features, which potentially results in better accuracy
- Wrapper methods
  - search the space of feature subsets
  - use the cross-validation accuracy w.r.t. a specific classifier as the measure of utility for a candidate subset
  - e.g. see how it works for a feature set {1, 2, 3} in the figure below
    - $1$, $2$, and $3$ represent the 1st, 2nd and 3rd feature respectively
- Embedded methods
  - exploit the structure of specific classes of learning models to guide the feature selection process
  - embedded as part of the model construction process
  - e.g. LASSO
- Filter methods
  - use general rules/criteria to measure the feature selection results, independent of the classifier
  - e.g. mutual information
| | Wrapper | Filter | Embedded |
|---|---|---|---|
| Computational Speed | Low | High | Mid |
| Chance of Overfitting | High | Low | Mid |
| Classifier-Independent | No | Yes | No |
- Wrapper methods have the strongest learning/representation capability among the three
  - often fit the training dataset better than the other two
  - prone to overfitting for small datasets
  - require more data to reliably get a near-optimal approximation
--- &twocolportion w1:50% w2:48%
*** left
- A graphical explanation
- Each data sample has three features
- Original features are transformed into new ones
- Often use only the new features with the largest variance
- Example
  - For fMRI images, we usually have millions of dimensions. PCA can project the data from millions of dimensions down to only thousands, or even fewer (a sketch follows below)
- Other feature extraction methods: ICA, Kernel PCA, etc.
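As an aside (not from the original slides), a minimal PCA sketch using scikit-learn; the data matrix, sample counts and number of retained components are made-up placeholders:

```python
# Minimal PCA sketch (assumes numpy and scikit-learn are available).
# X stands in for an N x p data matrix, e.g. flattened fMRI volumes.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5000))   # N = 100 samples, p = 5000 features (made up)

pca = PCA(n_components=50)         # keep the 50 directions of largest variance
Z = pca.fit_transform(X)           # N x 50 matrix of extracted features

print(Z.shape)                     # (100, 50)
print(pca.explained_variance_ratio_[:5])
```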
*** right
 - Introduction to Dimensionality Reduction
- Linear Regression and Least Squares (Review)
  - Least Squares Fit
  - Gauss Markov
  - Bias-Variance tradeoff
  - Problems
- Subset Selection
- Shrinkage Methods
- Beyond LASSO
--- &twocolportion w1:58% w2:38%
*** left
The least squares estimate: $$\hat{\beta}^{ls} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$$
Is it good to be unbiased?
*** right
- Prediction accuracy: unbiased, but higher variance than many biased estimators (leading to higher MSE); overfits noise and is sensitive to outliers
- Interpretation: $\hat{\beta}$ involves all of the features. Better to have a SIMPLER linear model that involves only a few features
- Recall that $\hat{\beta}^{ls} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$
  - $(\mathbf{X}^T\mathbf{X})$ may not be invertible, and then there is no closed form solution (a short numpy sketch follows below)
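A minimal numpy sketch of the closed-form estimate above, on made-up synthetic data; the pseudo-inverse line illustrates one workaround when $\mathbf{X}^T\mathbf{X}$ is singular (ridge regression, later, is another remedy):

```python
# Least squares sketch: closed-form estimate, and a singular X^T X case.
import numpy as np

rng = np.random.default_rng(0)
N, p = 50, 5
X = rng.normal(size=(N, p))
beta_true = np.array([1.0, 0.0, -2.0, 0.5, 0.0])
y = X @ beta_true + 0.1 * rng.normal(size=N)

# beta^ls = (X^T X)^{-1} X^T y
beta_ls = np.linalg.solve(X.T @ X, X.T @ y)

# With a duplicated column, X^T X is singular and the inverse does not exist;
# the pseudo-inverse picks one of the infinitely many least squares solutions.
X_dup = np.hstack([X, X[:, :1]])
beta_pinv = np.linalg.pinv(X_dup) @ y

print(beta_ls.round(2))
print(beta_pinv.round(2))
```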
- Introduction to Dimensionality Reduction
- Linear Regression and Least Squares (Review)
- Subset Selection
  - Best-subset selection
  - Forward stepwise selection
  - Forward stagewise selection
  - Problems
- Shrinkage Methods
- Beyond LASSO
- Best subset regression finds, for each $k \in \{0, 1, 2, \ldots, p\}$, the subset of features of size $k$ that gives the smallest RSS (a brute-force sketch follows below)
- Then cross validation is used to choose the best $k$
- An efficient algorithm, the leaps and bounds procedure (Furnival and Wilson, 1974), makes this feasible for $p$ as large as 30 or 40
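A brute-force sketch of best-subset selection on made-up data (no leaps and bounds, so only practical for small $p$):

```python
# Best-subset sketch: for each size k, keep the subset with the smallest RSS.
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)
N, p = 60, 6
X = rng.normal(size=(N, p))
y = X[:, 0] - 2 * X[:, 2] + 0.1 * rng.normal(size=N)

def rss(cols):
    Xs = X[:, list(cols)]
    beta, *_ = np.linalg.lstsq(Xs, y, rcond=None)
    return np.sum((y - Xs @ beta) ** 2)

best = {k: min(combinations(range(p), k), key=rss) for k in range(1, p + 1)}
print(best)   # in this toy example best[2] should be (0, 2)
# cross validation (not shown) would then pick the best k
```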
- Instead of searching all possible subsets, we can seek a good path through them
- Forward-Stepwise Selection builds a model sequentially, adding one variable at a time: a sequential greedy algorithm
- Initialization
  - Active set $\mathcal{A} = \emptyset$, $\mathbf{r} = \mathbf{y}$, $\beta = 0$
- At each step, it
  - identifies the best variable (the one with the highest correlation with the residual error) $$k = argmax_{j}(|correlation(\mathbf{x}_j, \mathbf{r})|)$$ $$\mathcal{A} = \mathcal{A} \cup \{k\}$$
  - then updates the least squares fit $\beta$, $\mathbf{r}$ to include all the active variables (a sketch follows below)
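A rough numpy sketch of this loop on made-up data (absolute correlation as the selection criterion, refitting least squares on the active set at each step):

```python
# Forward-stepwise selection sketch.
import numpy as np

rng = np.random.default_rng(0)
N, p = 100, 8
X = rng.normal(size=(N, p))
y = 2 * X[:, 1] - X[:, 4] + 0.1 * rng.normal(size=N)

active, r = [], y.copy()
for _ in range(p):
    # best remaining variable: highest |correlation| with the residual
    corrs = [0 if j in active else abs(np.corrcoef(X[:, j], r)[0, 1]) for j in range(p)]
    active.append(int(np.argmax(corrs)))
    # refit least squares on all active variables and update the residual
    beta_a, *_ = np.linalg.lstsq(X[:, active], y, rcond=None)
    r = y - X[:, active] @ beta_a

print(active)   # features 1 and 4 should enter first
```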
--- &twocolportion w1:55% w2:43%
*** left
- Initialize the fit vector $\mathbf{f} = 0$
- For each time step
  - Compute the correlation vector $$\mathbf{c} = (\mathbf{c}_1, \ldots, \mathbf{c}_p)$$
    - $\mathbf{c}_j$ represents the correlation between $\mathbf{x}_j$ and the residual error
  - $k = argmax_{j \in \{1,2,\ldots,p\}} |\mathbf{c}_j|$
  - Coefficients and fit vector are updated $$\mathbf{f} \gets \mathbf{f} + \alpha \cdot sign(\mathbf{c}_k)\, \mathbf{x}_k$$ $$\beta_k \gets \beta_k + \alpha \cdot sign(\mathbf{c}_k)$$ where $\alpha$ is the learning rate (a sketch follows below)
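And a matching numpy sketch of the stagewise updates (standardized columns, a small step size $\alpha$, and a fixed number of iterations; all values made up):

```python
# Forward-stagewise sketch: many tiny steps in the most-correlated direction.
import numpy as np

rng = np.random.default_rng(0)
N, p = 100, 8
X = rng.normal(size=(N, p))
X = (X - X.mean(0)) / X.std(0)               # standardize columns
y = 2 * X[:, 1] - X[:, 4] + 0.1 * rng.normal(size=N)

beta, f, alpha = np.zeros(p), np.zeros(N), 0.01
for _ in range(2000):                         # can take (far) more than p steps
    c = X.T @ (y - f)                         # proportional to correlations with residual
    k = int(np.argmax(np.abs(c)))
    beta[k] += alpha * np.sign(c[k])
    f += alpha * np.sign(c[k]) * X[:, k]

print(beta.round(2))                          # slowly approaches the least squares fit
```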
***right
---&twocolportion w1:40% w2:55%
*** left
- Forward-STEPWISE selection:
  - the algorithm stops in $p$ steps
- Forward-STAGEWISE selection:
  - is a slow fitting algorithm; at each time step only $\beta_k$ is updated, so the algorithm can take more than $p$ steps to stop
*** right
- $N = 300$ observations, $p = 31$ features
- averaged over 50 simulations
- More interpretable result
- More compact model
- It is a discrete process, and thus has high variance and is very sensitive to changes in the dataset
- If the dataset changes a little, the feature selection result may be very different
- Introduction to Dimensionality Reduction
- Linear Regression and Least Squares (Review)
- Subset Selection
- Shrinkage Methods
  - Ridge Regression
    - Formulations and closed form solution
    - Singular value decomposition
    - Degree of Freedom
  - LASSO
- Beyond LASSO
- Least squares with quadratic constraints $$\hat{\beta}^{ridge}= argmin_{\beta}\sum_{i=1}^N\Big(y_i - \beta_0 - \sum_{j=1}^p\mathbf{x}_{ij}\beta_j\Big)^2, \quad s.t. \quad \sum_{j = 1}^p \beta_j^2 \leq t$$
- Its Lagrange form $$\hat{\beta}^{ridge} = argmin_{\beta}\sum_{i=1}^N\Big(y_i - \beta_0 - \sum_{j=1}^p\mathbf{x}_{ij}\beta_j\Big)^2 + \lambda \sum_{j = 1}^p\beta_j^2$$
- The $l_2$-regularization can be viewed as a Gaussian prior on the coefficients, and the solution as the posterior mean
- Solution $$\hat{\beta}^{ridge} = (\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^T\mathbf{y}$$ (a sketch follows below)
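A quick numpy/scikit-learn check of this closed form on made-up centered data (sklearn's `Ridge` with `fit_intercept=False` minimizes the same Lagrange-form objective):

```python
# Ridge closed form vs scikit-learn, on synthetic data.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
N, p, lam = 80, 10, 5.0
X = rng.normal(size=(N, p))
y = X @ rng.normal(size=p) + 0.1 * rng.normal(size=N)

beta_closed = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
beta_sklearn = Ridge(alpha=lam, fit_intercept=False).fit(X, y).coef_

print(np.allclose(beta_closed, beta_sklearn, atol=1e-6))   # True
```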
---&triple w1:50% w2:50%
*** left
- $N = 30$
- $\mathbf{x}_1 \sim N(0, 1)$
- $\beta \sim (U(-0.5,0.5), U(-0.5,0.5))$

*** right
- $\mathbf{y} = (\mathbf{x}_1, \mathbf{x}_1^2) \times \beta$
- $\mathbf{X} = (\mathbf{x}_1, \mathbf{x}^2_1, ..., \mathbf{x}^8_1)$
- Dataset available: {$\mathbf{X}$, $\mathbf{y}$}
*** down
---&twocolportion w1:50% w2:50%
SVD offers some additional insight into the nature of ridge regression.
*** left
- The SVD of $\mathbf{X}$: $$\mathbf{X} = \mathbf{UDV}^T$$
- $\mathbf{U}$: $N \times p$ orthogonal matrix with columns spanning the column space of $\mathbf{X}$
  - $\mathbf{u}_j$ is the $j$th column of $\mathbf{U}$
- $\mathbf{V}$: $p \times p$ orthogonal matrix with columns spanning the row space of $\mathbf{X}$
  - $\mathbf{v}_j$ is the $j$th column of $\mathbf{V}$
- $\mathbf{D}$: $p \times p$ diagonal matrix with diagonal entries $d_1 \geq d_2 \geq ... \geq d_p \geq 0$, the singular values of $\mathbf{X}$
*** right
 ---&triple w1:50% w2:50%
*** left
- For least squares $$\begin{equation}\begin{split} \mathbf{X}\hat{\beta}^{ls} &= \mathbf{X(X^TX)^{-1}X^Ty}\\ &=\mathbf{UU^Ty} =\sum_{j=1}^p\mathbf{u}_j \mathbf{u}_j^T\mathbf{y} \end{split}\end{equation}$$
*** right
- For ridge regression $$\begin{equation}\begin{split} \mathbf{X}\hat{\beta}^{ridge} &= \mathbf{X(X^TX + \lambda I)^{-1}X^Ty}\\ &=\sum_{j=1}^p\mathbf{u}_j\frac{d_j^2}{d_j^2 + \lambda} \mathbf{u}_j^T\mathbf{y} \end{split}\end{equation}$$
*** down
- Compared with the least squares solution, we have an additional shrinkage factor $$\frac{d_j^2}{d_j^2 + \lambda}:$$ the smaller $d_j$ and the larger $\lambda$, the more shrinkage we get
---&twocolportion w1:55% w2:45%
*** left
- $N = 100$, $p = 10$
*** right
--- &twocolportion w1:50% w2:48%
*** left
- The number of degrees of freedom is the number of values in the final calculation of a statistic that are free to vary. The degrees of freedom of the ridge estimate depend on $\lambda$, and so are written $df(\lambda)$
- Computation (a sketch follows below) $$\begin{equation}\begin{split} df(\lambda) &= tr[\mathbf{X(X^TX + \lambda I)^{-1}X^T}]\\ &=\sum_{j=1}^p \frac{d_j^2}{d_j^2 + \lambda} \end{split}\end{equation}$$
- [larger $\lambda$] $\rightarrow$ [smaller $df(\lambda)$] $\rightarrow$ [more constrained model]
- The red line gives the best $df(\lambda)$ identified from cross validation w.r.t. RSS
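A small numpy sketch computing $df(\lambda)$ both from the trace above and from the singular values, on made-up data:

```python
# Effective degrees of freedom of ridge, computed two equivalent ways.
import numpy as np

rng = np.random.default_rng(0)
N, p, lam = 100, 10, 3.0
X = rng.normal(size=(N, p))

H = X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)   # ridge "hat" matrix
d = np.linalg.svd(X, compute_uv=False)                    # singular values of X

print(np.trace(H))                  # df(lambda) via the trace
print(np.sum(d**2 / (d**2 + lam)))  # df(lambda) via the singular values (same value)
```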
*** right
- w.r.t. Least Squares
  - $(\mathbf{X^TX + \lambda I})$ is always invertible, and thus the closed form solution always exists
  - Ridge regression controls complexity with the regularization term via $\lambda$, which makes it less prone to overfitting than the least squares fit
  - Possibly higher prediction accuracy, as the ridge estimates trade a little bias for less variance
- w.r.t. Subset Selection Methods
  - Ridge regression is a continuous shrinkage method that has less variance than subset selection methods

- Compactness:
  - the model keeps all $p$ features, so it is not compact
- Computational efficiency:
  - though we have a closed form solution, computing matrix inversions takes time and memory
  - it takes longer to predict for future samples with more features
- Interpretation:
  - offers little interpretation
- Introduction to Dimensionality Reduction
- Linear Regression and Least Squares (Review)
- Subset Selection
- Shrinkage Methods
  - Ridge Regression
  - LASSO
    - Formulations
    - Comparisons with ridge regression and subset selection
    - Solution of LASSO
    - Viewed as approximation for $l_0$-regularization
- Beyond LASSO
- Formulations
  - Least squares with constraints $$\hat{\beta}^{LASSO}= argmin_{\beta}\sum_{i=1}^N\Big(y_i - \beta_0 - \sum_{j=1}^p\mathbf{x}_{ij}\beta_j\Big)^2, \quad s.t. \sum_{j = 1}^p |\beta_j| \leq t$$
  - Its Lagrange form $$\hat{\beta}^{LASSO} = argmin_{\beta}\sum_{i=1}^N\Big(y_i - \beta_0 - \sum_{j=1}^p\mathbf{x}_{ij}\beta_j\Big)^2 + \lambda \sum_{j = 1}^p|\beta_j|$$
- The $l_1$-regularization can be viewed as a Laplace prior on the coefficients (a sketch follows below)
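A minimal scikit-learn sketch of the Lagrange form on made-up data; note sklearn's `Lasso` scales the squared error by $1/(2N)$, so its `alpha` is a rescaled version of the $\lambda$ above:

```python
# LASSO sketch: l1 regularization drives some coefficients exactly to zero.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
N, p = 100, 20
X = rng.normal(size=(N, p))
y = 3 * X[:, 0] - 2 * X[:, 5] + 0.1 * rng.normal(size=N)

lasso = Lasso(alpha=0.1).fit(X, y)
print(np.flatnonzero(lasso.coef_))   # indices of the selected (non-zero) features
```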
---&twocol
- $s = \frac{t}{\sum_{j=1}^p |\hat{\beta}_j|}$, where $\hat{\beta}$ is the least squares estimate
- red lines represent the $s$ and $df(\lambda)$ with the best cross validation error
*** left
*** right
- Introduction to Dimensionality Reduction
- Linear Regression and Least Squares (Review)
- Subset Selection
- Shrinkage Methods
  - Ridge Regression
  - LASSO
    - Formulations
    - Comparisons with ridge regression and subset selection
      - Orthonormal inputs
      - Non-orthonormal inputs
    - Solution of LASSO
    - Viewed as approximation for $l_0$-regularization
- Beyond LASSO
- Orthonormal Input $\mathbf{X}$ (a sketch follows below)
  - Best subset: [Hard thresholding] keeps the top $M$ largest coefficients of $\hat{\beta}^{ls}$
  - Ridge: [Pure shrinkage] does proportional shrinkage of $\hat{\beta}^{ls}$
  - LASSO: [Soft thresholding] translates each coefficient of $\hat{\beta}^{ls}$ by $\lambda$ towards 0, truncating at 0
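For orthonormal inputs the three estimators act coordinate-wise on $\hat{\beta}^{ls}$; a toy numpy sketch with made-up numbers (ridge shrinks by $1/(1+\lambda)$ in this case):

```python
# Hard thresholding (best subset), proportional shrinkage (ridge),
# and soft thresholding (LASSO) applied to a toy beta^ls.
import numpy as np

beta_ls = np.array([3.0, -1.5, 0.8, -0.2])
lam, M = 1.0, 2

rank = np.argsort(-np.abs(beta_ls)).argsort()                     # 0 = largest |coef|
hard = np.where(rank < M, beta_ls, 0.0)                           # keep the top M
ridge = beta_ls / (1.0 + lam)                                     # shrink proportionally
soft = np.sign(beta_ls) * np.maximum(np.abs(beta_ls) - lam, 0.0)  # translate towards 0

print(hard, ridge, soft, sep="\n")
```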
---&twocolportion w1:60% w2:40%
- Non-orthonormal Input $\mathbf{X}$
*** left
*** right
- Solid blue area: the constraints
  - left: $|\beta_1| + |\beta_2| \leq t$
  - right: $\beta_1^2 + \beta_2^2 \leq t^2$
- $\hat{\beta}$: least squares fit
- we want to find the point that is nearest to $\hat{\beta}$ within the blue region

| | Convex | Smooth | Sparse |
|---|---|---|---|
| Best subset ($l_0$) | No | No | Yes |
| Ridge ($l_2$) | Yes | Yes | No |
| LASSO ($l_1$) | Yes | No | Yes |
--- &twocolportion w1:48% w2:48%
- Subset selection corresponds to $q = 0$
- LASSO corresponds to $q = 1$: Laplace prior, $density = (\frac{1}{\tau})exp(\frac{-|\beta|}{\tau}), \; \tau = \sigma/\lambda$
- Ridge regression corresponds to $q = 2$: Gaussian prior, $\beta \sim N(0, \tau \mathbf{I})$, $\lambda = \frac{\sigma^2}{\tau^2}$
*** left
*** right
- Introduction to Dimensionality Reduction
- Linear Regression and Least Squares (Review)
- Subset Selection
- Shrinkage Methods
  - Ridge Regression
  - LASSO
    - Formulations
    - Comparisons with ridge regression and subset selection
    - Solution of LASSO
    - Viewed as approximation for $l_0$-regularization
- Beyond LASSO
- Formulation $$min_{\beta}\; \frac{1}{2}(\mathbf{X}\beta - \mathbf{y})^T (\mathbf{X}\beta - \mathbf{y}) + \lambda |\beta|_1$$ is equivalent to the quadratic program $$min_{\beta, \xi}\; \frac{1}{2}(\mathbf{X}\beta - \mathbf{y})^T (\mathbf{X}\beta - \mathbf{y}) + \lambda \mathbf{1}^T\xi, \quad s.t. \; -\xi_j \leq \beta_j \leq \xi_j \;\; \forall j$$ (a sketch follows below)
- Note that QP can only solve LASSO for a given $\lambda$
  - Next: a method for solving for all $\lambda$ (LAR)
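A sketch of this QP written with cvxpy (an assumed dependency, not mentioned in the slides); it returns the solution for one fixed $\lambda$:

```python
# LASSO as a QP with auxiliary variables xi bounding |beta_j|.
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
N, p, lam = 50, 10, 1.0
X = rng.normal(size=(N, p))
y = X @ np.array([2.0, 0, 0, -1, 0, 0, 0, 0, 0, 0.5]) + 0.1 * rng.normal(size=N)

beta = cp.Variable(p)
xi = cp.Variable(p)                       # xi_j >= |beta_j|
objective = 0.5 * cp.sum_squares(X @ beta - y) + lam * cp.sum(xi)
constraints = [beta <= xi, -xi <= beta]
cp.Problem(cp.Minimize(objective), constraints).solve()

print(np.round(beta.value, 3))            # sparse solution for this lambda
```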
---&twocolportion w1:44% w2:44%
*** left
- Standardize all predictors
- $\mathbf{r}_0 = \mathbf{y} - \bar{\mathbf{y}}$; $\beta = \mathbf{0}$
- $k = argmax_{j} |corr(\mathbf{x}_j, \mathbf{r}_0)|$, $\mathcal{A}_1 = \{k\}$
- For time step $t = 1,2,\ldots,min(N-1,p)$
  - Move $\beta_{\mathcal{A}_t}$ in the joint least squares direction for $\mathcal{A}_t$, until some other $k \not\in \mathcal{A}_t$ has as much correlation with the current residual
  - $\mathcal{A}_{t+1} = \mathcal{A}_{t} \cup \{k\}$
*** right
- $\beta$: $p \times 1$ coefficient vector
- $\mathcal{A}_t$: active set, the set of indices of the features included in the model at time step $t$; $\bar{\mathcal{A}_t} = \{1,2,\ldots,p\} - \mathcal{A}_t$
- $\beta_{\mathcal{A}_t}$: $|\mathcal{A}_t| \times 1$ vector of coefficients w.r.t. $\mathcal{A}_t$
  - contains the $\beta_j, \; j \in \mathcal{A}_t$
- $\mathbf{X}_{\mathcal{A}_t}$: $[\mathbf{x}_j]_{j\in \mathcal{A}_t}$ (a path-tracing sketch follows below)
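A sketch of tracing the LAR path with scikit-learn's `lars_path` on made-up data (`method='lasso'` would add the LASSO modification discussed later):

```python
# LAR path: order in which features enter, and coefficients along the path.
import numpy as np
from sklearn.linear_model import lars_path

rng = np.random.default_rng(0)
N, p = 100, 8
X = rng.normal(size=(N, p))
y = 2 * X[:, 1] - X[:, 4] + 0.1 * rng.normal(size=N)

alphas, active, coefs = lars_path(X, y, method='lar')
print(active)        # indices in the order they join the active set
print(coefs.shape)   # (p, number of steps + 1)
```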
---&twocolportion w1:50% w2:48%
*** left
*** right
- Example Setting:
  - $N = 2$, $p = 2$
  - Columnwise-standardized data matrix $\mathbf{X}$
    - s.t. $mean(\mathbf{x}_j) = 0$, $std(\mathbf{x}_j) = 1$, so $|\mathbf{x}_1| = |\mathbf{x}_2| = ... = |\mathbf{x}_p|$
    - the two standardized column vectors $\mathbf{x}_1$ and $\mathbf{x}_2$ are shown in the left figure
  - $\mathcal{A}_0 = \emptyset$, which means that we have not chosen any feature yet; $\beta = (0, 0)^T$
  - The $N\times 1$ fit vector $\mathbf{f}_0 = \mathbf{X}\beta = \mathbf{0}$
---&twocolportion w1:55% w2:45%
*** left
*** right
- $k = argmax_{j} |corr(\mathbf{x}_j, \mathbf{r}_0)| = 1$
- $\mathcal{A}_1 = \mathcal{A}_0 \cup \{1\} = \{1\}$
---&twocolportion w1:55% w2:45%
*** left
*** right
- $\mathbf{r}_1 = \mathbf{y} - \mathbf{X}_{\mathcal{A}_1} \beta_{\mathcal{A}_1}$ is the residual error at the beginning of time step 1
- $\delta_1 = (\mathbf{X}^T_{\mathcal{A}_1}\mathbf{X}_{\mathcal{A}_1})^{-1}\mathbf{X}^T_{\mathcal{A}_1}\mathbf{r}_1$ is the least squares estimate of the coefficients of the features in $\mathcal{A}_1 = \{1\}$ w.r.t. the residual error $\mathbf{r}_1$
  - $\delta_1$ is the direction along which the coefficients $\beta_{\mathcal{A}_1}$ change
- $\mathbf{u}_1 = \mathbf{X}_{\mathcal{A}_1} \delta_1$
  - as $\beta_{\mathcal{A}_1}$ changes along $\delta_1$, the fit $\mathbf{f}$ changes along $\mathbf{u}_1$
---&twocolportion w1:55% w2:45%
*** left
*** right
- $\mathbf{r}_1$
- $\delta_1 = (\mathbf{X}^T_{\mathcal{A}_1}\mathbf{X}_{\mathcal{A}_1})^{-1}\mathbf{X}^T_{\mathcal{A}_1}\mathbf{r}_1$
- $\mathbf{u}_1 = \mathbf{X}_{\mathcal{A}_1} \delta_1$
- $\mathbf{y}$, $\hat{\beta} = \mathbf{(X^T X)^{-1}X^Ty}$, $\hat{\mathbf{y}} = \mathbf{X} \hat{\beta}$
---&twocolportion w1:55% w2:45%
*** left
*** right
- As $\beta_{\mathcal{A}_1}$ moves along $\delta_1$, the correlation between the feature in $\mathcal{A}_1 = \{1\}$ and the residual error decreases
- $\mathbf{f}_1$, initialized as $\mathbf{f}_0$, moves along $\mathbf{u}_1$
- Eventually, the correlation between feature $k = 2$ and the residual error catches up; $\mathcal{A}_2 = \mathcal{A}_1 \cup \{2\} = \{1, 2\}$
- Note that the fit $\mathbf{f}_1$ approaches the least squares fit $\mathbf{f}_1^{ls}$, but does not reach it
---&twocolportion w1:55% w2:45%
*** left
*** right
- The joint least squares direction, along which $\beta_{\mathcal{A}_2}$ moves, is $\delta_2 = (\mathbf{X}^T_{\mathcal{A}_2}\mathbf{X}_{\mathcal{A}_2})^{-1}\mathbf{X}^T_{\mathcal{A}_2}\mathbf{r}_2$
  - $\mathbf{r}_2 = \mathbf{y} - \mathbf{X}_{\mathcal{A}_2} \beta_{\mathcal{A}_2}$
- The direction our fit moves along is $\mathbf{u}_2 = \mathbf{X}_{\mathcal{A}_2} \delta_2$
- Note that $\mathbf{u}_2$ is the bisector of $\mathbf{x}_1$ and $\mathbf{x}_2$
  - In general, $\mathbf{u}_t$ is the "bisector" of (has the same angle with) all $\mathbf{x}_j,\; j \in \mathcal{A}_t$
---&twocolportion w1:55% w2:45%
*** left
*** right
- As $\beta_{\mathcal{A}_2}$ moves along $\delta_2$, the fit $\mathbf{f}_2$, initialized as $\mathbf{f}_1$,
  - moves along $\mathbf{u}_2$
  - approaches $\mathbf{f}_2^{ls}$
- As we only have $p = 2$ features, finally $\mathbf{f}_2 = \mathbf{f}^{ls}_2$
- LAR solves the subset selection problem for all $t$, s.t. $|\beta| \leq t$
- The LAR algorithm ends in $min(p, N-1)$ steps
When the blue-line coefficient crosses zero, LAR and LASSO become different.
LASSO modification: during the search procedure, if a non-zero coefficient hits zero, drop this variable from the active set and recompute the joint least squares direction.
- At a certain time point, we know that all $\mathbf{x}_j, \; j \in \mathcal{A}$, share the same absolute value of correlation with the residual error. That is, $$\mathbf{x}_j^T(\mathbf{y} - \mathbf{X}\beta) = \gamma \cdot s_j, \quad \forall j \in \mathcal{A}$$ where $s_j = sign(\mathbf{x}_j^T(\mathbf{y} - \mathbf{X}\beta)) \in \{-1,1\}$ and $\gamma$ is the common value
  - We also know that $|\mathbf{x}_j^T(\mathbf{y} - \mathbf{X}\beta)| \leq \gamma, \quad \forall j \not\in \mathcal{A}$
- Consider LASSO for a fixed $\lambda$. Let $\mathcal{B}$ be the set of indices of non-zero coefficients. Differentiating the objective function w.r.t. the coefficients in $\mathcal{B}$ and setting the gradient to zero gives $$\mathbf{x}_j^T(\mathbf{y} - \mathbf{X}\beta) = \lambda \cdot sign(\beta_j), \quad \forall j \in \mathcal{B}$$
- The two conditions are identical only if $sign(\beta_j)$ matches $s_j$. In $\mathcal{A}$, we allow a $\beta_j$ with $sign(\beta_j) \neq sign(\mathbf{x}_j^T(\mathbf{y} - \mathbf{X}\beta))$, while this is forbidden in $\mathcal{B}$
- LAR requires that $$|\mathbf{x}_k^T(\mathbf{y} - \mathbf{X}\beta)| \leq \gamma, \quad \forall k \not\in \mathcal{A}$$
  - $\mathcal{A}$: set of the indices of the features with non-zero coefficients in LAR; $\gamma = |\mathbf{x}_j^T(\mathbf{y} - \mathbf{X}\beta)|,\; \forall j \in \mathcal{A}$
- For LASSO, the stationarity conditions require that $$|\mathbf{x}_k^T(\mathbf{y} - \mathbf{X}\beta)| \leq \lambda, \quad \forall k \not\in \mathcal{B}$$
  - $\mathcal{B}$: set of the indices of the features with non-zero coefficients in LASSO
  - $\lambda$: regularization parameter
- So LAR agrees with LASSO for variables with zero coefficients too
- Introduction to Dimensionality Reduction
- Linear Regression and Least Squares (Review)
- Subset Selection
- Shrinkage Methods
  - Ridge Regression
  - LASSO
    - Formulations
    - Comparisons with ridge regression and subset selection
    - Solution of LASSO
    - Viewed as approximation for $l_0$-regularization
- Beyond LASSO
- Actually $$|\beta|_0 = \lim_{q \to 0}\sum_{j = 1}^p|\beta_j|^q = card(\{\beta_j \mid \beta_j \neq 0\})$$
- LASSO: approximated objective function ($l_1$-norm), with exact optimization
- Subset selection: exact objective function, with approximated optimization (greedy strategy)
- Introduction to Dimensionality Reduction
- Linear Regression and Least Squares (Review)
- Subset Selection
- Shrinkage Methods
- Beyond LASSO
  - Elastic-Net
  - Fused LASSO
  - Group LASSO
  - $l_1/l_p$ norm
  - Graph-guided LASSO
- LASSO tends to rather arbitrarily select one of a group of highly correlated variables (see how LAR works). Sometimes it is better to select ALL the relevant variables in a group
- LASSO selects at most $N$ variables when $p > N$, which may be undesirable when $p \gg N$
- The performance of Ridge dominates that of LASSO when $N > p$ and variables are correlated
- LASSO does not take into account prior information about the structure over the input or output variables
--- &triple w1:41% w2:55%
- LASSO tends to rather arbitrarily select one of a group of highly correlated variables (see how LAR works). Sometimes it is better to select ALL the relevant variables in a group
- LASSO selects at most $N$ variables when $p > N$, which may be undesirable when $p \gg N$
- The performance of Ridge dominates that of LASSO when $N > p$ and variables are correlated
*** left
- Penalty Term $$\lambda \sum_{j = 1}^p (\alpha \beta_j^2 + (1-\alpha)|\beta_j|)$$ which is a compromise between ridge regression and LASSO, with $\alpha \in [0,1]$
*** right
--- &triple w1:41% w2:55%
- selects variables like LASSO, and shrinks together the coefficients of correlated predictors like ridge
- has considerable computational advantages over the $l_q$ penalties
  - see 18.4 of [Elements of Statistical Learning]
*** left
- Penalty Term $$\lambda \sum_{j = 1}^p (\alpha \beta_j^2 + (1-\alpha)|\beta_j|)$$ which is a compromise between ridge regression and LASSO, with $\alpha \in [0,1]$
*** right
- Two independent "hidden" factors $\mathbf{z}_1$ and $\mathbf{z}_2$ $$\mathbf{z}_1 \sim U(0, 20),\quad \mathbf{z}_2 \sim U(0, 20)$$
- Generate the response vector $\mathbf{y} = \mathbf{z}_1 + 0.1\mathbf{z}_2 + N(0,1)$
- Suppose the observed features are $$\mathbf{x}_1 = \mathbf{z}_1 + \epsilon_1,\quad \mathbf{x}_2 = -\mathbf{z}_1 + \epsilon_2,\quad \mathbf{x}_3 = \mathbf{z}_1 + \epsilon_3$$ $$\mathbf{x}_4 = \mathbf{z}_2 + \epsilon_4,\quad \mathbf{x}_5 = -\mathbf{z}_2 + \epsilon_5,\quad \mathbf{x}_6 = \mathbf{z}_2 + \epsilon_6$$ where the $\epsilon_j$ are $i.i.d.$ random noise
- Fit the model on data $(\mathbf{X}, \mathbf{y})$
- A good model should identify that only $\mathbf{x}_1, \mathbf{x}_2, \mathbf{x}_3$ are important (a simulation sketch follows below)
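A sketch of this simulation with scikit-learn (the noise scale and the regularization settings are made-up choices): LASSO tends to keep only one of the correlated trio, while the Elastic Net tends to keep the whole group (signs reflecting $\mathbf{x}_2 = -\mathbf{z}_1 + \epsilon_2$).

```python
# Simulation sketch: correlated features x1, x2, x3 all carry the z1 signal.
import numpy as np
from sklearn.linear_model import Lasso, ElasticNet

rng = np.random.default_rng(0)
N = 100
z1, z2 = rng.uniform(0, 20, N), rng.uniform(0, 20, N)
y = z1 + 0.1 * z2 + rng.normal(size=N)

eps = rng.normal(scale=0.1, size=(N, 6))          # assumed small i.i.d. noise
X = np.column_stack([z1, -z1, z1, z2, -z2, z2]) + eps

print(np.round(Lasso(alpha=0.5).fit(X, y).coef_, 2))
print(np.round(ElasticNet(alpha=0.5, l1_ratio=0.5).fit(X, y).coef_, 2))
```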


- LASSO tends to rather arbitrarily select one of a group of highly correlated variables (see how LAR works). Sometimes it is better to select ALL the relevant variables in a group
- LASSO selects at most $N$ variables when $p > N$, which may be undesirable when $p \gg N$
- The performance of Ridge dominates that of LASSO when $N > p$ and variables are correlated
- LASSO does not take into account prior information about the structure over the input or output variables
Fused LASSO can solve the last problem above (prior information about the structure over the input variables).
- Intuition
  - Fused LASSO is designed for problems whose features can be ordered in some meaningful way, where "adjacent" features should have similar importance
  - Fused LASSO penalizes the $L_1$-norm of both the coefficients and their successive differences
- Example
  - Classification with fMRI data: each voxel has about 200 measurements over time. The coefficients for adjacent voxels should be similar
- Formulation (a sketch follows below) $$\hat{\beta} = argmin_{\beta}|\mathbf{X\beta - y}|_2^2$$ $$s.t. \; \sum_{j = 1}^p|\beta_j| \leq s_1 \quad and \quad \sum_{j = 2}^p |\beta_j - \beta_{j-1}| \leq s_2$$
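A sketch of fused LASSO in its Lagrangian form with cvxpy (an assumed dependency); the penalty weights and the piecewise-constant true coefficients are made up:

```python
# Fused LASSO sketch: l1 penalty on the coefficients plus l1 penalty on their
# successive differences encourages sparse, piecewise-constant solutions.
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
N, p = 60, 40
beta_true = np.zeros(p)
beta_true[10:20] = 2.0                        # a block of equal coefficients
X = rng.normal(size=(N, p))
y = X @ beta_true + 0.5 * rng.normal(size=N)

beta = cp.Variable(p)
lam1, lam2 = 1.0, 5.0
objective = (cp.sum_squares(X @ beta - y)
             + lam1 * cp.norm(beta, 1)
             + lam2 * cp.norm(beta[1:] - beta[:-1], 1))
cp.Problem(cp.Minimize(objective)).solve()

print(np.round(beta.value, 2))                # roughly zero outside the block
```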


- $p = 100$. Black lines are the true coefficients
  - (a) Univariate regression coefficients (red), and a soft-thresholded version of them (green)
  - (b) LASSO solution (red), $s_1 = 35.6,\; s_2 = \infty$
  - (c) Fusion estimate, $s_1 = \infty, \; s_2 = 26$
  - (d) Fused LASSO, $s_1 = \sum |\beta_j|,\; s_2 = \sum |\beta_j - \beta_{j-1}|$
- LASSO tends to rather arbitrarily select one of a group of highly correlated variables (see how LAR works). Sometimes it is better to select ALL the relevant variables in a group
- LASSO selects at most $N$ variables when $p > N$, which may be undesirable when $p \gg N$
- The performance of Ridge dominates that of LASSO when $N > p$ and variables are correlated
- LASSO does not take into account prior information about the structure over the input or output variables
Group LASSO can solve the 1st problem (selecting all of the relevant variables in a correlated group).
- Differences in the way Group LASSO and Elastic Net solve the 1st problem
  - Group LASSO: prior information about the group structure is needed
  - Elastic Net: prior information is not needed; the algorithm detects the groups itself
- Intuition
  - Features are divided into $L$ groups
  - Features within the same group should share similar coefficients
- Example
  - Binary dummy variables from one single discrete variable, e.g. $stage\_cancer \in \{1,2,3\}$ can be translated into three binary dummy variables $(stage1, stage2, stage3)$
- Formulation (a sketch follows below) $$obj = \left|\mathbf{y} - \sum_{l = 1}^L \mathbf{X}_l \beta_l \right|_2^2 + \lambda_1 \sum_{l = 1}^L\left|\beta_l\right|_2 + \lambda_2 \left|\beta\right|_1$$
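A sketch of this objective with cvxpy (an assumed dependency), using a single shared design matrix split into made-up contiguous groups of columns:

```python
# (Sparse) group LASSO sketch: an l2 penalty per group plus an overall l1 penalty.
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
N, p, group_size = 100, 20, 5                  # four contiguous groups of five features
starts = range(0, p, group_size)

X = rng.normal(size=(N, p))
beta_true = np.zeros(p)
beta_true[5:10] = 1.5                          # only the second group is relevant
y = X @ beta_true + 0.3 * rng.normal(size=N)

beta = cp.Variable(p)
lam1, lam2 = 5.0, 1.0
group_pen = sum(cp.norm(beta[s:s + group_size], 2) for s in starts)
objective = cp.sum_squares(X @ beta - y) + lam1 * group_pen + lam2 * cp.norm(beta, 1)
cp.Problem(cp.Minimize(objective)).solve()

# per-group norms of the estimate: only the second group should be clearly non-zero
print([round(float(np.linalg.norm(beta.value[s:s + group_size])), 2) for s in starts])
```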
- Generate $n = 200$ observations with $p = 100$ features, divided equally into ten blocks
- The numbers of non-zero coefficients in the blocks are:

Group structures are not discovered by LASSO.

- LASSO tends to rather arbitrarily select one of a group of highly correlated variables (see how LAR works). Sometimes it is better to select ALL the relevant variables in a group
- LASSO selects at most $N$ variables when $p > N$, which may be undesirable when $p \gg N$
- The performance of Ridge dominates that of LASSO when $N > p$ and variables are correlated
- LASSO does not take into account prior information about the structure over the input or output variables
- Applies to multi-task learning, where the goal is to estimate predictive models for several related tasks.
- Examples
  - Example 1: recognize speech of different speakers, or handwriting of different writers
  - Example 2: learn to control a robot for grasping different objects
  - Example 3: learn to control a robot for driving in different landscapes
- Assumptions about the tasks
  - sufficiently different that learning a specific model for each task results in improved performance
  - similar enough that they share some common underlying representation that should make simultaneous learning beneficial
  - different tasks share a subset of relevant features selected from a large common space of features
- Formulation (a sketch follows below)
  - $\mathbf{X}_l$: $N \times p$ input matrix for task $l = 1..L$
    - $L$ is the total number of tasks
  - $\beta$: $p \times L$ coefficient matrix
  - $\mathbf{y}$: $N \times L$ output matrix
  - objective function $$obj = \sum_{l= 1}^L J(\beta_{:l}, \mathbf{X}_l, \mathbf{y}_{:l}) + \lambda \sum_{j = 1}^p |\beta_{j:}|_2$$ where $J$ is some loss function and $\sum_{j = 1}^p |\beta_{j:}|_2$ is the $l_1$ norm of the vector $(|\beta_{1:}|_2, |\beta_{2:}|_2, ..., |\beta_{p:}|_2)$
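A sketch with scikit-learn's `MultiTaskLasso`, which implements an $l_1/l_2$ penalty of this form but assumes all tasks share the same design matrix $\mathbf{X}$ (a simplification of the per-task $\mathbf{X}_l$ above); data are made up:

```python
# l1/l2 multi-task sketch: the row-wise l2 penalty selects the same features
# for all tasks (non-zero rows of the coefficient matrix).
import numpy as np
from sklearn.linear_model import MultiTaskLasso

rng = np.random.default_rng(0)
N, p, L = 100, 15, 3
X = rng.normal(size=(N, p))
B = np.zeros((p, L))
B[[0, 3, 7], :] = rng.normal(size=(3, L))      # only features 0, 3, 7 matter, in every task
Y = X @ B + 0.1 * rng.normal(size=(N, L))

mtl = MultiTaskLasso(alpha=0.1).fit(X, Y)
shared = np.flatnonzero(np.linalg.norm(mtl.coef_, axis=0))   # coef_ is (L, p)
print(shared)                                                # ideally [0 3 7]
```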

- Dataset: handwritten words dataset collected by Rob Kassel
  - contains writings from more than 180 different writers
  - for each writer, the number of examples of each letter is between 4 and 30
  - the letters are originally represented as $8 \times 16$ images
- Task: build binary classifiers that discriminate between pairs of letters; specifically, concentrate on the pairs of letters that are the most difficult to distinguish when written by hand
- Experiment: learned classifications of 9 pairs of letters for 40 different writers
- Candidate methods
  - Pooled $l_1$: a single classifier is trained on all data, regardless of writer
  - Independent $l_1$ regularization: a classifier is trained for each writer
  - $l_1/l_1$-regularization: $$obj = \sum_{l= 1}^L J(\beta_{:l}, \mathbf{X}_l, \mathbf{y}_{:l}) + \lambda \sum_{l = 1}^L |\beta_{:l}|_1$$
    - simply adds up the $l_1$-regularized objective functions of the $L$ tasks
  - $l_1/l_2$-regularization: $$obj = \sum_{l= 1}^L J(\beta_{:l}, \mathbf{X}_l, \mathbf{y}_{:l}) + \lambda \sum_{j = 1}^p |\beta_{j:}|_2$$

- Within a cell, the first row contains results for feature selection; the second row uses random projections to obtain a common subspace (details omitted, see the paper Multi-task feature selection)
- Bold: best of $l_1/l_2$, $l_1/l_1$, $sp.\,l_1$ or pooled $l_1$; Boxed: best of cell
- LASSO tends to rather arbitrarily select one of a group of highly correlated variables (see how LAR works). Sometimes it is better to select ALL the relevant variables in a group
- LASSO selects at most $N$ variables when $p > N$, which may be undesirable when $p \gg N$
- The performance of Ridge dominates that of LASSO when $N > p$ and variables are correlated
- LASSO does not take into account prior information about the structure over the input or output variables
Graph-Guided Fused LASSO (GFLASSO) solves the last problem (prior information about the structure over the output variables).
- More general than $l_1/l_p$, as arbitrary graphical structures over the output variables can be encoded as priors in GFLASSO
- Example
---&twocol
*** left
- (a) The true regression coefficients
- (c) $l_1/l_2$-regularized multi-task regression
*** right
- (b) LASSO
- (d) GFLASSO
- Feature selection vs feature extraction
  - Feature selection: can save cost and is easier to interpret
  - Feature extraction: more general, often leads to better performance
- Linear models: Least Squares, Subset Selection, Ridge, LASSO
  - Least Squares is unbiased, but can have high variance (as it includes all features)
  - Ridge ($l_2$ regularization): constrains parameter values, to reduce variance
  - Subset Selection, LASSO: find a subset of features (to reduce variance)
    - LASSO uses $l_1$ regularization
- LAR is like LASSO ($l_1$ regularization), but
  - their behaviors become different when a coefficient hits zero
  - modification: drop the feature when its coefficient hits zero
- QP solves LASSO for a single $\lambda$, while LAR can solve LASSO for all $\lambda$
- Bayesian prior interpretation for Subset Selection, Ridge and LASSO
- Beyond LASSO (all use $l_1$ regularization)
  - Elastic Net: both $l_1$ and $l_2$
  - Fused LASSO: coefficients of adjacent features are similar
  - Group LASSO: features share similar coefficients within groups
  - $l_1/l_2$: similarities across tasks in multi-task learning
  - GFLASSO: incorporates structure on output variables
- More on feature extraction methods:
- http://www.cs.otago.ac.nz/cosc453/student_tutorials/principal_components.pdf
- Imola K. Fodor, A survey of Dimensionality Reduction techniques
- Christopher J. C. Burges, Dimensionality Reduction: A Guided Tour
- Mutual-info-based feature selection:
- Gavin Brown, Adam Pocock, Ming-Jie Zhao, Mikel Luján; Conditional Likelihood Maximisation: A Unifying Framework for Information Theoretic Feature Selection
- Howard Hua Yang, John Moody. Feature Selection Based on Joint Mutual Information
- Hanchuan Peng, Fuhui Long, and Chris Ding. Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy
- Beyond LASSO
- ELEN E6898 Sparse Signal Modeling
--- &vcenter
- Trevor Hastie, Robert Tibshirani and Jerome Friedman. Elements of Statistical Learning [p7, p15, p16, p18, p19, p21-22, p26-27, p29-30, p33, p35-37, p42-p43, p50-p54, p56, p59]
- Temporal Sequence of FMRI scans (single slice): from http://www.midwest-medical.net/mri.sagittal.head.jpg [p8]
- Three Dimensional Image of Brain Activation from http://www.fmrib.ox.ac.uk/fmri_intro/brief.html [p8]
- http://en.wikipedia.org/wiki/Feature_selection [p10-12]
- http://en.wikipedia.org/wiki/Singular_value_decomposition [p27]
- http://en.wikipedia.org/wiki/Normal_distribution [p38]
- http://en.wikipedia.org/wiki/Laplacian_distribution [p38]
- http://webdocs.cs.ualberta.ca/~mahdavif/ReadingGroup/Papers/larS.pdf [p20]
- Bradley Efron, Trevor Hastie, Iain Johnstone and Robert Tibshirani. Least Angle Regression [p20]
- http://www.stanford.edu/~hastie/TALKS/larstalk.pdf
- Kevin P. Murphy. Machine Learning: A Probabilistic Perspective [p59]
- Prof. Schuurmans' notes on LASSO [p40]
- Conditional Likelihood Maximisation: A Unifying Framework for Information Theoretic Feature Selection [p8]
- Hui Zou and Trevor Hastie. Regularization and Variable Selection via the Elastic Net [p59-62]
- http://www.stanford.edu/~hastie/TALKS/enet_talk.pdf [p59-62]
- Robert Tibshirani and Michael Saunders, Sparsity and smoothness via the fused LASSO [P63-p65]
- Jerome Friedman Trevor Hastie and Robert Tibshirani. A note on the group LASSO and a sparse group LASSO [p66-68]
- Guillaume Obozinski, Ben Taskar, and Michael Jordan. Multi-task feature selection [p69-70, p72-p75]
- Xi Chen, Seyoung Kim, Qihang Lin, Jaime G. Carbonell, Eric P. Xing. Graph-Structured Multi-task Regression and an Efficient Optimization Method for General Fused LASSO [p76-77]