<!DOCTYPE html>
<html>
<head>
  <title>Dimensionality Reduction</title>
  <meta charset="utf-8">
  <meta name="description" content="Dimensionality Reduction">
  <meta name="author" content="Ping Jin (pjin1@ualberta.ca) and Prof. Russell Greiner (rgreiner@ualberta.ca)">
  <meta name="generator" content="slidify" />
  <meta name="apple-mobile-web-app-capable" content="yes">
  <meta http-equiv="X-UA-Compatible" content="chrome=1">
  <link rel="stylesheet" href="libraries/frameworks/io2012/css/default.css" media="all" >
  <link rel="stylesheet" href="libraries/frameworks/io2012/phone.css" 
    media="only screen and (max-device-width: 480px)" >
  <link rel="stylesheet" href="libraries/frameworks/io2012/css/slidify.css" >
  <link rel="stylesheet" href="libraries/highlighters/highlight.js/css/tomorrow.css" />
  <base target="_blank"> <!-- This amazingness opens all links in a new tab. -->
  <script data-main="libraries/frameworks/io2012/js/slides" 
    src="libraries/frameworks/io2012/js/require-1.0.8.min.js">
  </script>
  
    <link rel="stylesheet" href = "assets/css/ribbons.css">
  
</head>
<body style="opacity: 0">
  <slides class="layout-widescreen">
    
    <!-- LOGO SLIDE -->
    <!-- END LOGO SLIDE -->
    

    <!-- TITLE SLIDE -->
    <!-- Should I move this to a Local Layout File? -->
    <slide class="title-slide segue nobackground">
      <hgroup class="auto-fadein">
        <h1>Dimensionality Reduction</h1>
        <h2>CMPUT 466/551</h2>
        <p>Ping Jin (pjin1@ualberta.ca) and Prof. Russell Greiner (rgreiner@ualberta.ca)<br/></p>
      </hgroup>
          </slide>

    <!-- SLIDES -->
      <slide class="" id="slide-1" style="background:;">
  <hgroup>
    <h2>Outline</h2>
  </hgroup>
  <article>
    <ul>
<li><h3>Introduction to Dimensionality Reduction</h3></li>
<li><h3>Linear Regression and Least Squares (Review)</h3></li>
<li><h3>Subset Selection</h3></li>
<li><h3>Shrinkage Methods</h3></li>
<li><h3>Beyond LASSO</h3></li>
</ul>

  </article>
  <!-- Presenter Notes -->
</slide>

      <slide class="" id="slide-2" style="background:;">
  <hgroup>
    <h2>Part 1: Introduction to Dimensionality Reduction</h2>
  </hgroup>
  <article>
    <ol>
<li><b>Introduction to Dimensionality Reduction</b>

<ul>
<li><b>General notations</b></li>
<li><b>Motivations</b></li>
<li><b>Feature selection and feature extraction</b></li>
<li><b>Feature Selection</b></li>
<li><b>Feature Extraction</b></li>
</ul></li>
<li>Linear Regression and Least Squares (Review)</li>
<li>Subset Selection</li>
<li>Shrinkage Methods</li>
<li>Beyond LASSO<br></li>
</ol>

  </article>
  <!-- Presenter Notes -->
</slide>

      <slide class="" id="slide-3" style="background:;">
  <hgroup>
    <h2>General Notations</h2>
  </hgroup>
  <article>
    <h3>Dataset</h3>

  
<div class='left' style='float:left;width:50%'>
 <ul>
<li>\(\mathbf{X}\): columnwise centered \(N \times p\) matrix

<ul>
<li>\(N:\) # samples, \(p:\) # features</li>
<li>When an intercept column \(\mathbf{1}\) is added to \(\mathbf{X}\), it becomes an \(N \times (p+1)\) matrix</li>
</ul></li>
<li>\(\mathbf{y}\): \(N \times 1\) vector of labels (classification) or continuous values (regression)</li>
</ul>


</div>    
<div class='right' style='float:right;width:50%'>
 <p><center><img src="assets/img/xy.png" alt="x" title="xy"></center></p>


</div>
<div style='float:left;width:100%;' class='centered'>
  <h3>Basic Model</h3>

<ul>
<li>Linear Regression

<ul>
<li>Assumption: the regression function \(E(Y|X)\) is linear
\[f(X) = X^T\beta\]</li>
<li>\(\beta\): \((p+1) \times 1\) vector of coefficients</li>
</ul></li>
</ul>

</div>
  </article>
  <!-- Presenter Notes -->
</slide>

      <slide class="" id="slide-4" style="background:;">
  <hgroup>
    <h2>Motivations</h2>
  </hgroup>
  <article>
    <ul>
<li>Dimensionality Reduction is about transforming data with high dimensionality into data of much lower dimensionality

<ul>
<li><b>Computational efficiency</b>: fewer dimensions require less computation</li>
<li><b>Accuracy</b>: lower risk of overfitting</li>
</ul></li>
</ul>

  
<div class='left' style='float:left;width:53%'>
 <ul>
<li><b>Categories</b>

<ul>
<li>Feature Selection:<br>

<ul>
<li>chooses a subset of features from the original feature set</li>
</ul></li>
<li>Feature Extraction:

<ul>
<li>transforms the original features into new ones, linearly or non-linearly</li>
<li>e.g. PCA, ICA, etc.</li>
</ul></li>
</ul></li>
</ul>


</div>    
<div class='right' style='float:right;width:47%'>
 <p><br></p>

<p><center><img src="assets/img/fs.gif" alt="fs" title="fs"></center></p>

<p><center><img src="assets/img/fe.gif" alt="fe" title="fe"></center></p>

</div>
  </article>
  <!-- Presenter Notes -->
</slide>

      <slide class="" id="slide-5" style="background:;">
  <hgroup>
    <h2>Feature Selection and Feature Extraction</h2>
  </hgroup>
  <article>
      
<div class='left' style='float:left;width:50%'>
 <h3>Feature Selection</h3>

<ul>
<li>Easier to interpret</li>
<li>Reduces cost: computation, budget, etc.</li>
</ul>

<p><br>
<br>
<br>
<center><img src="assets/img/fs2.gif" alt="fs2" title="fs2"></center></p>


</div>    
<div class='right' style='float:right;width:50%'>
 <h3>Feature Extraction</h3>

<ul>
<li>More flexible. Feature selection is a special case of linear feature extraction</li>
</ul>

<p><br>
<br>
<br></p>

<p><center><img src="assets/img/fe2.gif" alt="fe2" title="fe2"></center></p>

</div>
  </article>
  <!-- Presenter Notes -->
</slide>

      <slide class="" id="slide-6" style="background:;">
  <hgroup>
    <h2>Feature Selection and Feature Extraction</h2>
  </hgroup>
  <article>
    <h3>Example 1: Prostate Cancer</h3>

<ul>
<li><b>Response</b>: level of prostate-specific antigen (lpsa). </li>
<li><b>Initial Feature Set</b>:
\[\{lcavol, lweight, age, lbph, svi, lcp, gleason, pgg45\}.\]</li>
<li><b>Task</b>:

<ul>
<li>predict \(lpsa\) from measurements of features</li>
</ul></li>
</ul>

<p>Feature selection</p>

<ul>
<li>Cost: measuring features costs money</li>
<li>Interpretation: Doctors can see which features are important</li>
</ul>

  </article>
  <!-- Presenter Notes -->
</slide>

      <slide class="" id="slide-7" style="background:;">
  <hgroup>
    <h2>Feature Selection and Feature Extraction</h2>
  </hgroup>
  <article>
    <h3>Example 2: classification with fMRI data</h3>

<ul>
<li><p>fMRI data are 4D images, with one dimension being time. </p></li>
<li><p>Each image is ~ \(50 \times 50 \times 50\) (spatial) \(\times 200\) (time points) \(= 25M\) dimensions</p></li>
</ul>

<p>Feature extraction </p>

<ul>
<li>Individual voxel-times are not important </li>
<li>Cost is not correlated with #features</li>
<li>Feature extraction offers more flexibility in transforming features, which potentially results in better accuracy</li>
</ul>

<p><center><img src="assets/img/fMRI2.png" alt="fMRI2" title="fmri2"></center></p>

  </article>
  <!-- Presenter Notes -->
</slide>

      <slide class="" id="slide-8" style="background:;">
  <hgroup>
    <h2>Feature Selection Methods</h2>
  </hgroup>
  <article>
    <h3>Wrapper Methods</h3>

<ul>
<li>Search the space of feature subsets</li>
<li>Use the cross validation accuracy w.r.t. a specific classifier as the measure of utility for a candidate subset</li>
<li>e.g. see how it works for a feature set {1, 2, 3} in the figure below

<ul>
<li>\(1\), \(2\), and \(3\) denote the 1st, 2nd and 3rd features respectively</li>
</ul></li>
</ul>

<p><center><img src="assets/img/wrapper.png" alt="wrapper" title="wrapper"></center></p>

  </article>
  <!-- Presenter Notes -->
</slide>

      <slide class="" id="slide-9" style="background:;">
  <hgroup>
    <h2>Feature Selection Methods</h2>
  </hgroup>
  <article>
    <h3>Embedded Methods</h3>

<ul>
<li>exploit the structure of specific classes of learning models to guide the feature selection process</li>
<li>embedded as part of the model construction process

<ul>
<li>e.g. LASSO. </li>
</ul></li>
</ul>

<p><center><img src="assets/img/embedded.png" alt="embedded" title="embedded"></center></p>

  </article>
  <!-- Presenter Notes -->
</slide>

      <slide class="" id="slide-10" style="background:;">
  <hgroup>
    <h2>Feature Selection Methods</h2>
  </hgroup>
  <article>
    <h3>Filter Methods</h3>

<ul>
<li>use general rules/criteria to score the feature selection results, independently of any classifier</li>
<li>e.g. mutual information</li>
</ul>

<p><center><img src="assets/img/filter.png" alt="filter" title="filter"></center></p>

  </article>
  <!-- Presenter Notes -->
</slide>
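
      <slide class="" style="background:;">
  <hgroup>
    <h2>Feature Selection Methods</h2>
  </hgroup>
  <article>
    <h3>Filter Methods: a mutual information sketch</h3>

<p>A minimal sketch of a filter criterion, assuming discrete features and labels. The toy data and the names <code>mutual_information</code>, <code>features</code> and <code>ranking</code> are hypothetical and only illustrate how features could be ranked without training any classifier.</p>

```python
from collections import Counter
from math import log2


def mutual_information(feature, labels):
    """Mutual information (in bits) between two discrete sequences."""
    n = len(feature)
    joint = Counter(zip(feature, labels))
    px = Counter(feature)
    py = Counter(labels)
    mi = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        # p(x, y) * log2( p(x, y) / (p(x) * p(y)) ), written with raw counts
        mi += p_xy * log2(p_xy * n * n / (px[x] * py[y]))
    return mi


# Hypothetical toy data: three binary features and one binary label.
labels = [0, 0, 1, 1, 0, 1, 0, 1]
features = {
    "copy_of_label": [0, 0, 1, 1, 0, 1, 0, 1],
    "noisy": [0, 1, 1, 1, 0, 1, 0, 0],
    "unrelated": [0, 1, 0, 1, 1, 0, 0, 1],
}

scores = {name: mutual_information(col, labels) for name, col in features.items()}
# Rank features from most to least informative about the label.
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

<p>A filter would keep the top-ranked features and pass only those to whatever model is trained afterwards.</p>

  </article>
  <!-- Presenter Notes -->
</slide>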

      <slide class="" id="slide-11" style="background:;">
  <hgroup>
    <h2>Feature Selection</h2>
  </hgroup>
  <article>
    <h3>Comparison</h3>

<table><thead>
<tr>
<th></th>
<th align="center">Wrapper</th>
<th align="right">Filter</th>
<th align="right">Embedded</th>
</tr>
</thead><tbody>
<tr>
<td>Computational Speed</td>
<td align="center">Low</td>
<td align="right">High</td>
<td align="right">Mid</td>
</tr>
<tr>
<td>Chance of Overfitting</td>
<td align="center">High</td>
<td align="right">Low</td>
<td align="right">Mid</td>
</tr>
<tr>
<td>Classifier-Independent</td>
<td align="center">No</td>
<td align="right">Yes</td>
<td align="right">No</td>
</tr>
</tbody></table>

<ul>
<li>Wrapper methods have the strongest learning/representation capability among the three

<ul>
<li>often fit training dataset better than the other two</li>
<li>prone to overfitting for small datasets</li>
<li>require more data to reliably get a near-optimal approximation. </li>
</ul></li>
</ul>

  </article>
  <!-- Presenter Notes -->
</slide>

      <slide class="" id="slide-12" style="background:;">
  <hgroup>
    <h2>Feature Extraction</h2>
  </hgroup>
  <article>
      
<div class='left' style='float:left;width:50%'>
 <h3>Principal Component Analysis</h3>

<ul>
<li><b>A graphical explanation</b>

<ul>
<li>Each data sample has three features</li>
<li>Original features are transformed into new ones</li>
<li>Often use only the new features with largest variance</li>
</ul></li>
<li><b>Example</b>

<ul>
<li>For fMRI images, we usually have millions of dimensions. PCA can project the data from millions of dimensions down to only thousands of dimensions, or even fewer</li>
</ul></li>
<li>Other feature extraction methods: ICA, Kernel PCA, etc.</li>
</ul>


</div>    
<div class='right' style='float:right;width:48%'>
 <p><center><img src="assets/img/pca.png" alt="alt text" title="Principal component analysis"></center></p>

<p><center><img src="assets/img/pca_var.png" alt="alt text" title="Principal component analysis"></center></p>

</div>
  </article>
  <!-- Presenter Notes -->
</slide>

      <slide class="" id="slide-13" style="background:;">
  <hgroup>
    <h2>Part 2: Linear Regression and Least Squares (Review)</h2>
  </hgroup>
  <article>
    <ol>
<li>Introduction to Dimensionality Reduction</li>
<li><b>Linear Regression and Least Squares (Review)</b>

<ul>
<li><b>Least Squares Fit</b></li>
<li><b>Gauss-Markov Theorem</b></li>
<li><b>Bias-Variance tradeoff</b></li>
<li><b>Problems</b></li>
</ul></li>
<li>Subset Selection</li>
<li>Shrinkage Methods</li>
<li>Beyond LASSO</li>
</ol>

  </article>
  <!-- Presenter Notes -->
</slide>

      <slide class="" id="slide-14" style="background:;">
  <hgroup>
    <h2>Linear Regression and Least Squares (Review)</h2>
  </hgroup>
  <article>
      
<div class='left' style='float:left;width:58%'>
 <h3>Least Squares Fit</h3>

<p>\[
\begin{equation}
\begin{split}
RSS(\beta) &= (\mathbf{y} - \mathbf{X}\beta)^T(\mathbf{y} - \mathbf{X}\beta)\\
\frac{\partial RSS}{\partial \beta} &= -2 \mathbf{X}^T(\mathbf{y} - \mathbf{X}\beta) = 0
\quad \Rightarrow \quad \hat{\beta}^{ls} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}
\end{split}
\end{equation}
\]</p>

<h3>Gauss Markov Theorem</h3>

<p>The least squares estimates \(\hat{\beta}^{ls}\) of the parameters β have the smallest variance among all linear unbiased estimates.</p>

<h3>Question</h3>

<p>Is it good to be unbiased?</p>


</div>    
<div class='right' style='float:right;width:38%'>
 <p><img src="assets/img/lr.png" alt="Linear regression" title="Linear regression"></p>

<p><img src="assets/img/ls.png" alt="Least Squares" title="Least squares"></p>

</div>
  </article>
  <!-- Presenter Notes -->
</slide>
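
      <slide class="" style="background:;">
  <hgroup>
    <h2>Linear Regression and Least Squares (Review)</h2>
  </hgroup>
  <article>
    <h3>Least Squares Fit: a small sketch</h3>

<p>A minimal sketch of the least squares estimate, assuming \(\mathbf{X}^T\mathbf{X}\) is invertible. Instead of forming the inverse, it solves the normal equations \(\mathbf{X}^T\mathbf{X}\beta = \mathbf{X}^T\mathbf{y}\) with a small hand-written Gaussian elimination. The helper names and the toy data are hypothetical.</p>

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix [A | b]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def least_squares(X, y):
    """Return beta that solves the normal equations X^T X beta = X^T y."""
    p = len(X[0])
    XtX = [[sum(row[i] * row[j] for row in X) for j in range(p)] for i in range(p)]
    Xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(p)]
    return solve(XtX, Xty)


# Hypothetical data generated exactly by y = 1 + 2 * x; the first column is the intercept.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [1.0, 3.0, 5.0, 7.0]
beta = least_squares(X, y)
print(beta)
```

<p>Because the toy data are noise-free, the fit should recover an intercept close to 1 and a slope close to 2.</p>

  </article>
  <!-- Presenter Notes -->
</slide>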

      <slide class="" id="slide-15" style="background:;">
  <hgroup>
    <h2>Linear Regression and Least Squares (Review)</h2>
  </hgroup>
  <article>
    <h3>Bias-Variance tradeoff</h3>

<p>\[
\begin{equation}
\begin{split}
MSE(\hat{\mathbf{y}}) &= E[(\hat{\mathbf{y}} - Y)^2]\\
&= Var(\hat{\mathbf{y}}) + [E[\hat{\mathbf{y}}] - Y]^2
\end{split}
\end{equation}
\]</p>

<p>where \(Y = X^T\beta\). We can accept a small increase in bias in exchange for a much larger reduction in variance.</p>

<h3>Problems of Least Squares</h3>

<ul>
<li><b>Prediction accuracy</b>: unbiased, but with higher variance than many biased estimators (leading to higher MSE); it overfits noise and is sensitive to outliers</li>
<li><b>Interpretation</b>:  \(\hat{\beta}\) involves all of the features.
A simpler linear model that involves only a few features is preferable.</li>
<li>Recall that \(\hat{\beta}^{ls} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}\)

<ul>
<li>\((\mathbf{X}^T\mathbf{X})\) may <b>not be invertible</b>, in which case this closed-form solution does not exist</li>
</ul></li>
</ul>

  </article>
  <!-- Presenter Notes -->
</slide>

      <slide class="" id="slide-16" style="background:;">
  <hgroup>
    <h2>Part 3: Subset Selection Methods</h2>
  </hgroup>
  <article>
    <ol>
<li>Introduction to Dimensionality Reduction</li>
<li>Linear Regression and Least Squares (Review)</li>
<li><b>Subset Selection</b>

<ul>
<li><b>Best-subset selection</b></li>
<li><b>Forward stepwise selection</b></li>
<li><b>Forward stagewise selection</b></li>
<li><b>Problems</b></li>
</ul></li>
<li>Shrinkage Methods</li>
<li>Beyond LASSO</li>
</ol>

  </article>
  <!-- Presenter Notes -->
</slide>

      <slide class="" id="slide-17" style="background:;">
  <hgroup>
    <h2>Subset Selection Methods</h2>
  </hgroup>
  <article>
    <h3>Best-subset selection</h3>

<ul>
<li>Best subset regression finds, for each \(k \in \{0, 1, 2, \ldots, p\}\), the subset of features of size \(k\) that gives the smallest RSS. </li>
<li>Then cross validation is utilized to choose the best \(k\)</li>
<li>An efficient algorithm, the leaps and bounds procedure (Furnival and Wilson, 1974), makes this feasible for \(p\) as large as 30 or 40.</li>
</ul>

<p><center><img src="assets/img/best_sub.png" alt="best_sub" title="best_sub"></center></p>

  </article>
  <!-- Presenter Notes -->
</slide>
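
      <slide class="" style="background:;">
  <hgroup>
    <h2>Subset Selection Methods</h2>
  </hgroup>
  <article>
    <h3>Best-subset selection: a brute-force sketch</h3>

<p>A minimal sketch of the exhaustive search, assuming centered data (no intercept) and a small \(p\). It simply enumerates every subset and is not the leaps and bounds procedure; the cross validation step that picks \(k\) is also left out. All function names and the toy data are hypothetical.</p>

```python
from itertools import combinations


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def rss_of_subset(X, y, subset):
    """RSS of the least squares fit that uses only the columns in `subset`."""
    if not subset:
        return sum(yi * yi for yi in y)  # empty model predicts 0 for centered y
    cols = [[row[j] for j in subset] for row in X]
    k = len(subset)
    XtX = [[sum(r[a] * r[c] for r in cols) for c in range(k)] for a in range(k)]
    Xty = [sum(r[a] * yi for r, yi in zip(cols, y)) for a in range(k)]
    beta = solve(XtX, Xty)
    return sum((yi - sum(bj * v for bj, v in zip(beta, r))) ** 2
               for r, yi in zip(cols, y))


def best_subsets(X, y):
    """For each size k, return the column subset with the smallest RSS."""
    p = len(X[0])
    return {
        k: min(combinations(range(p), k), key=lambda s: rss_of_subset(X, y, s))
        for k in range(p + 1)
    }


# Hypothetical toy data with orthogonal columns: y = 3 * x0 + 1 * x1 + 0 * x2.
X = [[1.0, 1.0, 1.0], [-1.0, 1.0, -1.0], [1.0, -1.0, -1.0], [-1.0, -1.0, 1.0]]
y = [4.0, -2.0, 2.0, -4.0]
best = best_subsets(X, y)
print(best)
```

<p>The search visits all \(2^p\) subsets, which is why it is only practical for small \(p\) without a smarter procedure.</p>

  </article>
  <!-- Presenter Notes -->
</slide>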

      <slide class="" id="slide-18" style="background:;">
  <hgroup>
    <h2>Subset Selection Methods</h2>
  </hgroup>
  <article>
    <h3>Forward-STEPWISE selection</h3>

<p>Instead of searching all possible subsets, we can seek a good path through them. </p>

<ul>
<li>a <b>sequential greedy</b> algorithm.</li>
</ul>

<p><em>Forward-Stepwise Selection</em> builds a model sequentially, adding one variable at a time. </p>

<ul>
<li>Initialization

<ul>
<li>Active set \(\mathcal{A} = \emptyset\), \(\mathbf{r} = \mathbf{y}\), \(\beta = 0\)</li>
</ul></li>
<li>At each step, it

<ul>
<li>identifies the best variable (with the highest correlation with the residual error)
\[\mathbf{k} = argmax_{j}(|correlation(\mathbf{x}_j, \mathbf{r})|)\]</li>
<li>\(\mathcal{A} = \mathcal{A} \cup \{\mathbf{k}\}\)</li>
<li>then updates the least squares fit \(\beta\), \(\mathbf{r}\) to include all the active variables</li>
</ul></li>
</ul>

  </article>
  <!-- Presenter Notes -->
</slide>
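
      <slide class="" style="background:;">
  <hgroup>
    <h2>Subset Selection Methods</h2>
  </hgroup>
  <article>
    <h3>Forward-STEPWISE selection: a sketch</h3>

<p>A minimal sketch of the greedy loop, assuming centered data and no intercept. At each step it adds the inactive feature most correlated with the current residual and then refits least squares on the active set. The helper names, the <code>n_steps</code> parameter and the toy data are hypothetical.</p>

```python
from math import sqrt


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def correlation(u, v):
    """Pearson correlation; returns 0.0 when either vector is constant."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    norm = sqrt(sum((a - mu) ** 2 for a in u) * sum((b - mv) ** 2 for b in v))
    return cov / norm if norm > 0 else 0.0


def fit_active(X, y, active):
    """Least squares fit on the active columns; returns (beta, residual)."""
    cols = [[row[j] for j in active] for row in X]
    k = len(active)
    XtX = [[sum(r[a] * r[c] for r in cols) for c in range(k)] for a in range(k)]
    Xty = [sum(r[a] * yi for r, yi in zip(cols, y)) for a in range(k)]
    beta = solve(XtX, Xty)
    residual = [yi - sum(bj * v for bj, v in zip(beta, r)) for r, yi in zip(cols, y)]
    return beta, residual


def forward_stepwise(X, y, n_steps):
    """Return the order in which features enter and the final coefficients."""
    p = len(X[0])
    active, beta, residual = [], [], list(y)
    for _ in range(n_steps):
        candidates = [j for j in range(p) if j not in active]
        k = max(candidates,
                key=lambda j: abs(correlation([row[j] for row in X], residual)))
        active.append(k)
        beta, residual = fit_active(X, y, active)  # refit on all active variables
    return active, beta


# Hypothetical toy data with orthogonal columns: y = 3 * x0 + 1 * x1 + 0 * x2.
X = [[1.0, 1.0, 1.0], [-1.0, 1.0, -1.0], [1.0, -1.0, -1.0], [-1.0, -1.0, 1.0]]
y = [4.0, -2.0, 2.0, -4.0]
active, beta = forward_stepwise(X, y, n_steps=2)
print(active, beta)
```

<p>The returned coefficients are listed in the order in which the features entered the active set.</p>

  </article>
  <!-- Presenter Notes -->
</slide>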

      <slide class="" id="slide-19" style="background:;">
  <hgroup>
    <h2>Subset Selection Methods</h2>
  </hgroup>
  <article>
    <h3>Forward-STAGEWISE Regression</h3>

  
<div class='left' style='float:left;width:55%'>
 <ul>
<li>Initialize the fit vector \(\mathbf{f} = 0\)</li>
<li>For each time step

<ul>
<li>Compute the correlation vector 
\[\mathbf{c} = (\mathbf{c}_1, \ldots, \mathbf{c}_p)\]

<ul>
<li>\(\mathbf{c}_j\) represents the correlation between \(\mathbf{x}_j\) and the residual error</li>
</ul></li>
<li>\(k = argmax_{j \in \{1,2,\ldots,p\}} |\mathbf{c}_j|\)</li>
<li>Coefficients and fit vector are updated
\[\mathbf{f} \gets \mathbf{f} + \alpha \cdot sign(\mathbf{c}_k) \mathbf{x}_k\]
\[\beta_k \gets \beta_k + \alpha \cdot sign(\mathbf{c}_k)\] 
where \(\alpha\) is the learning rate</li>
</ul></li>
</ul>


</div>    
<div class='right' style='float:right;width:43%'>
 <p><img src="assets/img/stagewise.png" alt="Stagewise" title="Stagewise"></p>

</div>
  </article>
  <!-- Presenter Notes -->
</slide>
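
      <slide class="" style="background:;">
  <hgroup>
    <h2>Subset Selection Methods</h2>
  </hgroup>
  <article>
    <h3>Forward-STAGEWISE Regression: a sketch</h3>

<p>A minimal sketch of the update rule, assuming centered columns of equal norm so that the inner product with the residual ranks features in the same order as the correlation. The parameter names <code>alpha</code> and <code>n_steps</code>, the early stop, and the toy data are hypothetical.</p>

```python
def forward_stagewise(X, y, alpha=0.01, n_steps=1000):
    """Repeatedly nudge the coefficient of the feature most correlated with the residual."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    fit = [0.0] * n
    for _ in range(n_steps):
        residual = [yi - fi for yi, fi in zip(y, fit)]
        # c_j: inner product of feature j with the residual
        # (proportional to the correlation when columns are standardized)
        c = [sum(X[i][j] * residual[i] for i in range(n)) for j in range(p)]
        k = max(range(p), key=lambda j: abs(c[j]))
        if c[k] == 0.0:
            break  # residual is orthogonal to every feature
        step = alpha if c[k] > 0 else -alpha
        beta[k] += step  # only beta_k changes in this step
        for i in range(n):
            fit[i] += step * X[i][k]
    return beta


# Hypothetical toy data with orthogonal columns: y = 3 * x0 + 1 * x1 + 0 * x2.
X = [[1.0, 1.0, 1.0], [-1.0, 1.0, -1.0], [1.0, -1.0, -1.0], [-1.0, -1.0, 1.0]]
y = [4.0, -2.0, 2.0, -4.0]
beta = forward_stagewise(X, y, alpha=0.01, n_steps=1000)
print(beta)
```

<p>With a small learning rate the coefficients creep towards the least squares values over several hundred steps, which illustrates why stagewise fitting is slow compared with stepwise selection.</p>

  </article>
  <!-- Presenter Notes -->
</slide>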

      <slide class="" id="slide-20" style="background:;">
  <hgroup>
    <h2>Subset Selection Methods</h2>
  </hgroup>
  <article>
    <h3>Comparison</h3>

  
<div class='left' style='float:left;width:40%'>
 <ul>
<li>Forward-STEPWISE selection: 

<ul>
<li>algorithm stops in \(p\) steps</li>
</ul></li>
<li>Forward-STAGEWISE selection: 

<ul>
<li>is a slow-fitting algorithm: at each time step only \(\beta_k\) is updated, so it can take many more than \(p\) steps to finish</li>
</ul></li>
</ul>


</div>    
<div class='right' style='float:right;width:55%'>
 <p><center><img src="assets/img/comp1.png" alt="comp1" title="comp"></center></p>

<ul>
<li>\(N = 300\) Observations, \(p = 31\) features</li>
<li>averaged over 50 simulations</li>
</ul>

</div>
  </article>
  <!-- Presenter Notes -->
</slide>

      <slide class="" id="slide-21" style="background:;">
  <hgroup>
    <h2>Summary of Subset Selection Methods</h2>
  </hgroup>
  <article>
    <h3>Advantages w.r.t Least Squares</h3>

<ul>
<li>More interpretable result</li>
<li>More compact model</li>
</ul>

<h3>Disadvantages w.r.t. Continuous Process</h3>

<ul>
<li>It is a discrete process, and thus has high variance and is very sensitive to changes in the dataset

<ul>
<li>If the dataset changes a little, the feature selection result may be very different<br></li>
</ul></li>
</ul>

  </article>
  <!-- Presenter Notes -->
</slide>

      <slide class="" id="slide-22" style="background:;">
  <hgroup>
    <h2>Part 4: Shrinkage Methods</h2>
  </hgroup>
  <article>
    <ol>
<li>Introduction to Dimensionality Reduction</li>
<li>Linear Regression and Least Squares (Review)</li>
<li>Subset Selection</li>
<li><b>Shrinkage Methods</b>

<ul>
<li><b>Ridge Regression</b>

<ul>
<li><b>Formulations and closed form solution</b></li>
<li><b>Singular value decomposition</b></li>
<li><b>Degrees of Freedom</b></li>
</ul></li>
<li>LASSO</li>
</ul></li>
<li>Beyond LASSO</li>
</ol>

  </article>
  <!-- Presenter Notes -->
</slide>

      <slide class="" id="slide-23" style="background:;">
  <hgroup>
    <h2>Ridge Regression</h2>
  </hgroup>
  <article>
    <ul>
<li><b>Least squares with quadratic constraints</b>
\[
\begin{equation}
\hat{\beta}^{ridge}= argmin_{\beta}\sum_{i=1}^N(y_i - \beta_0 - \sum_{j=1}^p\mathbf{x}_{ij}\beta_j)^2, \quad s.t. \quad \sum_{j = 1}^p \beta_j^2 \leq t
\end{equation}
\]</li>
<li><b>Its Lagrange form</b>
\[
\hat{\beta}^{ridge} = argmin_{\beta}\sum_{i=1}^N(y_i - \beta_0 - \sum_{j=1}^p\mathbf{x_{ij}}\beta_j)^2 + \lambda \sum_{j = 1}^p\beta_j^2
\]</li>
<li><p>The \(l_2\)-regularization can be viewed as a Gaussian prior on the coefficients; the ridge solution is then the posterior mean</p></li>
<li><p><b>Solution</b></p></li>
</ul>

<p>\[
\begin{equation}
\begin{split}
&RSS(\beta) = (\mathbf{y} - \mathbf{X}\beta)^T(\mathbf{y} - \mathbf{X}\beta) + \lambda \beta^T\beta\\
&\partial RSS(\beta)/ \partial \beta = 0  \quad \Rightarrow\quad \hat{\beta}^{ridge} = (\mathbf{X}^T\mathbf{X} + \lambda \mathbf{I})^{-1}\mathbf{X}^T\mathbf{y}
\end{split}
\end{equation}
\]    </p>

  </article>
  <!-- Presenter Notes -->
</slide>
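
      <slide class="" style="background:;">
  <hgroup>
    <h2>Ridge Regression</h2>
  </hgroup>
  <article>
    <h3>Closed form solution: a sketch</h3>

<p>A minimal sketch of the ridge estimate, assuming centered \(\mathbf{X}\) and \(\mathbf{y}\) so that the unpenalized intercept \(\beta_0\) can be dropped. It solves \((\mathbf{X}^T\mathbf{X} + \lambda \mathbf{I})\beta = \mathbf{X}^T\mathbf{y}\) by Gaussian elimination rather than forming the inverse. Function names and the toy data are hypothetical.</p>

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def ridge(X, y, lam):
    """Return beta solving (X^T X + lam * I) beta = X^T y."""
    p = len(X[0])
    A = [[sum(r[a] * r[c] for r in X) + (lam if a == c else 0.0)
          for c in range(p)] for a in range(p)]
    Xty = [sum(r[a] * yi for r, yi in zip(X, y)) for a in range(p)]
    return solve(A, Xty)


# Hypothetical toy data with orthogonal columns: y = 3 * x0 + 1 * x1 + 0 * x2.
X = [[1.0, 1.0, 1.0], [-1.0, 1.0, -1.0], [1.0, -1.0, -1.0], [-1.0, -1.0, 1.0]]
y = [4.0, -2.0, 2.0, -4.0]
beta_ls = ridge(X, y, lam=0.0)     # lam = 0 reduces to least squares
beta_ridge = ridge(X, y, lam=4.0)  # every coefficient is pulled towards zero
print(beta_ls, beta_ridge)
```

<p>In this toy example \(\mathbf{X}^T\mathbf{X} = 4\mathbf{I}\), so choosing \(\lambda = 4\) halves every least squares coefficient.</p>

  </article>
  <!-- Presenter Notes -->
</slide>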

      <slide class="" id="slide-24" style="background:;">
  <hgroup>
    <h2>Ridge Regression</h2>
  </hgroup>
  <article>
    <h3>Simulation Experiment</h3>

  
<div class='left' style='float:left;width:50%'>
 <ul>
<li>\(N = 30\)</li>
<li>\(\mathbf{x}_1 \sim N(0, 1)\)</li>
<li>\(\beta \sim (U(-0.5,0.5), U(-0.5,0.5))\)</li>
</ul>


</div>    
<div class='right' style='float:right;width:50%'>
 <ul>
<li>\(\mathbf{y} = (\mathbf{x}_1, \mathbf{x}_1^2) \times \beta\)</li>
<li>\(\mathbf{X} = (\mathbf{x}_1, \mathbf{x}^2_1, ..., \mathbf{x}^8_1)\)</li>
<li>Dataset available: {\(\mathbf{X}\), \(\mathbf{y}\)}</li>
</ul>


</div>
<div style='float:left;width:100%;' class='centered'>
  <p><center><img src="assets/img/lst.png" alt="lst" title="lst"></center></p>

</div>
  </article>
  <!-- Presenter Notes -->
</slide>

      <slide class="" id="slide-25" style="background:;">
  <hgroup>
    <h2>Ridge Regression</h2>
  </hgroup>
  <article>
    <h3>Singular Value Decomposition (SVD)</h3>

<p>SVD offers some additional insight into the nature of ridge regression. </p>

  
<div class='left' style='float:left;width:50%'>
 <ul>
<li><b>The SVD of</b> \(\mathbf{X}\):
\[\mathbf{X} = \mathbf{UDV}^T\]

<ul>
<li>\(\mathbf{U}\): \(N \times p\) <b>orthogonal</b> matrix with columns spanning the column space of \(\mathbf{X}\). 

<ul>
<li>\(\mathbf{u}_j\) is the \(j\)th column of \(\mathbf{U}\)</li>
</ul></li>
<li>\(\mathbf{V}\): \(p \times p\) <b>orthogonal</b> matrix with columns spanning the row space of \(\mathbf{X}\). 

<ul>
<li>\(\mathbf{v}_j\) is the \(j\)th column of \(\mathbf{V}\)<br></li>
</ul></li>
<li>\(\mathbf{D}\): \(p \times p\) <b>diagonal</b> matrix with diagonal entries \(d_1 \geq d_2 \geq ... \geq d_p \geq 0\) being the singular values of \(\mathbf{X}\)</li>
</ul></li>
</ul>


</div>    
<div class='right' style='float:right;width:50%'>
 <p><center><img src="assets/img/svd2.png" alt="svd2" title="svd2"></center></p>

<p><center><img src="assets/img/svd.gif" alt="svd" title="svd"></center></p>

</div>
  </article>
  <!-- Presenter Notes -->
</slide>

      <slide class="" id="slide-26" style="background:;">
  <hgroup>
    <h2>Ridge Regression</h2>
  </hgroup>
  <article>
    <h3>Singular Value Decomposition (SVD)</h3>

  
<div class='left' style='float:left;width:50%'>
 <ul>
<li><b>For least squares</b>
\[
\begin{equation}
\begin{split}
\mathbf{X}\hat{\beta}^{ls} &= \mathbf{X(X^TX)^{-1}X^Ty}\\
&=\mathbf{UU^Ty} =\sum_{j=1}^p\mathbf{u}_j \mathbf{u}_j^T\mathbf{y}
\end{split}
\end{equation}
\]</li>
</ul>


</div>    
<div class='right' style='float:right;width:50%'>
 <ul>
<li><b>For ridge regression</b>
\[
\begin{equation}
\begin{split}
\mathbf{X}\hat{\beta}^{ridge} &= \mathbf{X(X^TX + \lambda I)^{-1}X^Ty}\\
&=\sum_{j=1}^p\mathbf{u}_j\frac{d_j^2}{d_j^2 + \lambda} \mathbf{u}_j^T\mathbf{y}
\end{split}
\end{equation}
\]</li>
</ul>


</div>
<div style='float:left;width:100%;' class='centered'>
  <ul>
<li>Compared with the least squares solution, each term carries an additional shrinkage factor
\[\frac{d_j^2}{d_j^2 + \lambda}.\]
The smaller \(d_j\) is and the larger \(\lambda\) is, the more shrinkage we have. </li>
</ul>

</div>
  </article>
  <!-- Presenter Notes -->
</slide>
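
      <slide class="" style="background:;">
  <hgroup>
    <h2>Ridge Regression</h2>
  </hgroup>
  <article>
    <h3>Shrinkage factors: a numeric sketch</h3>

<p>A small numeric illustration of the factor \(d_j^2 / (d_j^2 + \lambda)\). The singular values below are made up for illustration and do not come from any dataset in these slides.</p>

```python
def shrinkage_factors(singular_values, lam):
    """Ridge shrinkage factor d_j^2 / (d_j^2 + lam) for each singular value d_j."""
    return [d * d / (d * d + lam) for d in singular_values]


# Hypothetical singular values, sorted from largest to smallest.
singular_values = [10.0, 3.0, 1.0, 0.1]

factors = {lam: shrinkage_factors(singular_values, lam) for lam in (0.0, 1.0, 10.0)}
for lam, values in factors.items():
    print(lam, [round(v, 3) for v in values])
```

<p>Directions with small singular values are shrunk the most, and every factor decreases as \(\lambda\) grows; with \(\lambda = 0\) all factors equal 1 and the least squares fit is recovered.</p>

  </article>
  <!-- Presenter Notes -->
</slide>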

      <slide class="" id="slide-27" style="background:;">
  <hgroup>
    <h2>Ridge Regression</h2>
  </hgroup>
  <article>
    <h3>Singular Value Decomposition (SVD)</h3>

  
<div class='left' style='float:left;width:55%'>
 <ul>
<li>\(N = 100\), \(p = 10\)</li>
</ul>

<p><img src="assets/img/ls_pc.png" alt="ls_pc" title="ls_pc"></p>

<p><img src="assets/img/rr_pc.png" alt="rr_pc" title="rr_pc"></p>


</div>    
<div class='right' style='float:right;width:45%'>
 <p><center><img src="assets/img/shrink.png" alt="shrink" title="shrink"></center></p>

</div>
  </article>
  <!-- Presenter Notes -->
</slide>

      <slide class="" id="slide-28" style="background:;">
  <hgroup>
    <h2>Ridge Regression</h2>
  </hgroup>
  <article>
    <h3>Degrees of Freedom</h3>

  
<div class='left' style='float:left;width:50%'>
 <ul>
<li>The number of degrees of freedom is the number of values in the final calculation of a statistic that are free to vary. The effective degrees of freedom of the ridge fit depend on \(\lambda\) and are therefore written \(df(\lambda)\).</li>
<li>Computation
\[
\begin{equation}
\begin{split}
df(\lambda) &= tr[\mathbf{X(X^TX + \lambda I)^{-1}X^T}]\\
&=\sum_{j=1}^p \frac{d_j^2}{d_j^2 + \lambda} 
\end{split}
\end{equation}
\]</li>
<li>[larger \(\lambda\)] \(\rightarrow\) [smaller \(df(\lambda)\)] \(\rightarrow\) [more constrained model]</li>
<li>The red line gives the best \(df(\lambda)\) identified from cross validation w.r.t RSS</li>
</ul>


</div>    
<div class='right' style='float:right;width:48%'>
 <p><center><img src="assets/img/df.png" alt="df" title="df"></center></p>

</div>
  </article>
  <!-- Presenter Notes -->
</slide>
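
      <slide class="" style="background:;">
  <hgroup>
    <h2>Ridge Regression</h2>
  </hgroup>
  <article>
    <h3>Degrees of freedom: a sketch</h3>

<p>A minimal sketch of \(df(\lambda)\) computed from the singular values, plus a bisection search for the \(\lambda\) that yields a requested \(df\). The search is an added convenience that relies on \(df(\lambda)\) decreasing in \(\lambda\); the function names and singular values are hypothetical.</p>

```python
def effective_df(singular_values, lam):
    """df(lam) = sum_j d_j^2 / (d_j^2 + lam)."""
    return sum(d * d / (d * d + lam) for d in singular_values)


def lambda_for_df(singular_values, target_df, tol=1e-10):
    """Bisection for lam with df(lam) close to target_df, using that df decreases in lam."""
    if not (target_df > 0) or target_df > len(singular_values):
        raise ValueError("target_df must lie in (0, p]")
    lo, hi = 0.0, 1.0
    while effective_df(singular_values, hi) > target_df:
        hi *= 2.0  # grow the bracket until df(hi) drops to the target or below
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if effective_df(singular_values, mid) > target_df:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Hypothetical singular values: three directions with d_j = 2, so d_j^2 = 4.
singular_values = [2.0, 2.0, 2.0]
df_at_zero = effective_df(singular_values, 0.0)
df_at_four = effective_df(singular_values, 4.0)
lam_half = lambda_for_df(singular_values, 1.5)
print(df_at_zero, df_at_four, lam_half)
```

<p>With \(\lambda = 0\) the degrees of freedom equal \(p\); here each factor becomes \(1/2\) when \(\lambda = d_j^2 = 4\), so the search for \(df = 1.5\) should land near \(\lambda = 4\).</p>

  </article>
  <!-- Presenter Notes -->
</slide>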

      <slide class="" id="slide-29" style="background:;">
  <hgroup>
    <h2>Ridge Regression</h2>
  </hgroup>
  <article>
    <h4>Advantages</h4>

<ul>
<li> w.r.t. Least Squares

<ul>
<li>For \(\lambda > 0\), \((\mathbf{X^TX + \lambda I})\) is always invertible, so the closed-form solution always exists</li>
<li>Ridge regression controls model complexity through the regularization term weighted by \(\lambda\), which makes it less prone to overfitting than the least squares fit</li>
<li>Possibly higher prediction accuracy, as the ridge estimates trade a little bias for less variance</li>
</ul></li>
<li> w.r.t. Subset Selection Methods

<ul>
<li>Ridge regression is a continuous shrinkage method that has less variance than subset selection methods</li>
</ul></li>
</ul>

  </article>
  <!-- Presenter Notes -->
</slide>

      <slide class="" id="slide-30" style="background:;">
  <hgroup>
    <h2>Ridge Regression</h2>
  </hgroup>
  <article>
    <h4>Disadvantages w.r.t. Subset Selection Methods</h4>

<ul>
<li>Compactness: 

<ul>
<li>Computational efficiency: 

<ul>
<li>though we have a closed form solution, computing matrix inversions takes time and memory</li>
<li>it takes longer to predict for future samples with more features</li>
</ul></li>
</ul></li>
<li>Interpretation<br>

<ul>
<li>offers little interpretability, since all features stay in the model</li>
</ul></li>
</ul>

  </article>
  <!-- Presenter Notes -->
</slide>

      <slide class="" id="slide-31" style="background:;">
  <hgroup>
    <h2>Part 4: Shrinkage Methods - LASSO</h2>
  </hgroup>
  <article>
    <ol>
<li>Introduction to Dimensionality Reduction</li>
<li>Linear Regression and Least Squares (Review)</li>
<li>Subset Selection</li>
<li><b>Shrinkage Methods</b>

<ul>
<li>Ridge Regression</li>
<li><b>LASSO</b>

<ul>
<li><b>Formulations</b></li>
<li><b>Comparisons with ridge regression and subset selection</b></li>
<li><b>Solution of LASSO</b></li>
<li><b>Viewed as approximation for \(l_0\)-regularization</b></li>
</ul></li>
</ul></li>
<li>Beyond LASSO</li>
</ol>

  </article>
  <!-- Presenter Notes -->
</slide>

      <slide class="" id="slide-32" style="background:;">
  <hgroup>
    <h2>LASSO</h2>
  </hgroup>
  <article>
    <h3>Linear regression with \(l_1\)-regularization</h3>

<ul>
<li><p><b>Formulations</b></p>

<ul>
<li><b>Least squares with constraints</b>
\[
\begin{equation}
\hat{\beta}^{LASSO}= argmin_{\beta}\sum_{i=1}^N(y_i - \beta_0 - \sum_{j=1}^p\mathbf{x_{ij}}\beta_j)^2, \quad s.t. \sum_{j = 1}^p |\beta_j| \leq t
\end{equation}
\]</li>
<li><b>Its Lagrange form</b>
\[
\hat{\beta}^{LASSO} = argmin_{\beta}\sum_{i=1}^N(y_i - \beta_0 - \sum_{j=1}^p\mathbf{x_{ij}}\beta_j)^2 + \lambda \sum_{j = 1}^p|\beta_j|
\]</li>
<li>The \(l_1\)-regularization can be viewed as a Laplace prior on the coefficients</li>
</ul></li>
</ul>

  </article>
  <!-- Presenter Notes -->
</slide>

      <slide class="" id="slide-33" style="background:;">
  <hgroup>
    <h2>LASSO</h2>
  </hgroup>
  <article>
    <ul>
<li>\(s = \frac{t}{\sum_{j=1}^p |\hat{\beta}_j|}\), where \(\hat{\beta}\) is the least squares estimate</li>
<li>red lines represent the \(s\) and \(df(\lambda)\) with the best cross validation error</li>
</ul>


<div style='float:left;width:48%;' class='centered'>
  <p><center><img src="assets/img/lasso.png" alt="LASSO" title="LASSO"></center></p>


</div>
<div style='float:right;width:48%;'>
  <p><center><img src="assets/img/df.png" alt="df" title="df"></center></p>

</div>
  </article>
  <!-- Presenter Notes -->
</slide>

      <slide class="" id="slide-34" style="background:;">
  <hgroup>
    <h2>LASSO</h2>
  </hgroup>
  <article>
    <ul>
<li>Introduction to Dimensionality Reduction</li>
<li>Linear Regression and Least Squares (Review)</li>
<li>Subset Selection</li>
<li><b>Shrinkage Methods</b>

<ul>
<li>Ridge Regression</li>
<li><b>LASSO</b>

<ul>
<li>Formulations</li>
<li><b>Comparisons with ridge regression and subset selection</b>

<ul>
<li><b>Orthonormal inputs</b></li>
<li><b>Non-orthonormal inputs</b></li>
</ul></li>
<li>Solution of LASSO</li>
<li>Viewed as approximation for \(l_0\)-regularization</li>
</ul></li>
</ul></li>
<li>Beyond LASSO</li>
</ul>

  </article>
  <!-- Presenter Notes -->
</slide>

      <slide class="" id="slide-35" style="background:;">
  <hgroup>
    <h2>LASSO</h2>
  </hgroup>
  <article>
    <h3>Comparison</h3>

<ul>
<li><b>Orthonormal Input \(\mathbf{X}\)</b>

<ul>
<li><b>Best subset</b>: [Hard thresholding] keeps the \(M\) largest coefficients (in absolute value) of \(\hat{\beta}^{ls}\)</li>
<li><b>Ridge</b>: [Pure shrinkage] does proportional shrinkage of \(\hat{\beta}^{ls}\)</li>
<li><b>LASSO</b>: [Soft thresholding] translates each coefficient of \(\hat{\beta}^{ls}\) by \(\lambda\) towards 0, truncating at 0 </li>
</ul></li>
</ul>

<p><center><img src="assets/img/comp2.png" alt="comp2" title="comp2"></center></p>

  </article>
  <!-- Presenter Notes -->
</slide>
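
      <slide class="" style="background:;">
  <hgroup>
    <h2>LASSO</h2>
  </hgroup>
  <article>
    <h3>Comparison: a sketch for orthonormal inputs</h3>

<p>A minimal sketch of the three rules for orthonormal \(\mathbf{X}\), applied to a made-up least squares estimate. The ridge rule assumes the proportional factor \(1/(1+\lambda)\) that arises for orthonormal inputs; function names and numbers are hypothetical.</p>

```python
def hard_threshold(beta_ls, m):
    """Best subset: keep the m coefficients largest in absolute value, zero the rest."""
    order = sorted(range(len(beta_ls)), key=lambda j: abs(beta_ls[j]), reverse=True)
    keep = set(order[:m])
    return [b if j in keep else 0.0 for j, b in enumerate(beta_ls)]


def ridge_shrink(beta_ls, lam):
    """Ridge: proportional shrinkage of every coefficient."""
    return [b / (1.0 + lam) for b in beta_ls]


def soft_threshold(beta_ls, lam):
    """LASSO: move each coefficient towards 0 by lam, truncating at 0."""
    out = []
    for b in beta_ls:
        magnitude = max(abs(b) - lam, 0.0)
        out.append(magnitude if b > 0 else (-magnitude if magnitude > 0 else 0.0))
    return out


# Hypothetical least squares coefficients.
beta_ls = [3.0, -1.5, 0.4, -0.2]
subset = hard_threshold(beta_ls, m=2)
ridge = ridge_shrink(beta_ls, lam=1.0)
lasso = soft_threshold(beta_ls, lam=0.5)
print(subset, ridge, lasso)
```

<p>Only hard and soft thresholding produce exact zeros; ridge shrinks every coefficient but keeps all of them non-zero.</p>

  </article>
  <!-- Presenter Notes -->
</slide>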

      <slide class="" id="slide-36" style="background:;">
  <hgroup>
    <h2>LASSO</h2>
  </hgroup>
  <article>
    <h3>Comparison</h3>

<ul>
<li><b>Non-orthonormal Input \(\mathbf{X}\)</b></li>
</ul>

  
<div class='left' style='float:left;width:60%'>
 <p><center><img src="assets/img/comp3.png" alt="comp3" title="comp3"></center></p>


</div>    
<div class='right' style='float:right;width:40%'>
 <ul>
<li><b>Solid blue area</b>: the constraints

<ul>
<li>left: \(|\beta_1| + |\beta_2| \leq t\)</li>
<li>right: \(\beta_1^2 + \beta_2^2 \leq t^2\)</li>
</ul></li>
<li><b>\(\hat{\beta}\)</b>: least squares fit</li>
<li>want to find the point within the blue region that has the smallest RSS, i.e. the point the RSS contours around \(\hat{\beta}\) touch first</li>
</ul>

</div>
  </article>
  <!-- Presenter Notes -->
</slide>

      <slide class="" id="slide-37" style="background:;">
  <hgroup>
    <h2>LASSO</h2>
  </hgroup>
  <article>
    <h3>Other unit circles for different \(q\)-norms</h3>

<p><center><img src="assets/img/unit_circle.png" alt="uc" title="uc"></center></p>

<table><thead>
<tr>
<th></th>
<th>Convex</th>
<th>Smooth</th>
<th>Sparse</th>
</tr>
</thead><tbody>
<tr>
<td>\(q<1\)</td>
<td>No</td>
<td>No</td>
<td>Yes</td>
</tr>
<tr>
<td>\(q>1\)</td>
<td>Yes</td>
<td>Yes</td>
<td>No</td>
</tr>
<tr>
<td>\(q = 1\)</td>
<td>Yes</td>
<td>No</td>
<td>Yes</td>
</tr>
</tbody></table>

<p>Here \(q = 0\) is the pure variable selection procedure, as it is counting the <b>number of non-zero coefficients</b>.</p>

  </article>
  <!-- Presenter Notes -->
</slide>
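
      <slide class="" style="background:;">
  <hgroup>
    <h2>LASSO</h2>
  </hgroup>
  <article>
    <h3>Penalties for different \(q\): a sketch</h3>

<p>A small numeric illustration of the penalty \(\sum_j |\beta_j|^q\), comparing a sparse and a dense coefficient vector of roughly the same Euclidean length. The vectors and the function name <code>lq_penalty</code> are hypothetical.</p>

```python
from math import sqrt


def lq_penalty(beta, q):
    """Sum of |beta_j|^q; for q = 0 this counts the non-zero coefficients."""
    if q == 0:
        return float(sum(1 for b in beta if b != 0))
    return sum(abs(b) ** q for b in beta)


# Two hypothetical coefficient vectors with (almost exactly) the same l2 norm.
sparse = [1.0, 0.0]
dense = [sqrt(0.5), sqrt(0.5)]

penalties = {
    q: (lq_penalty(sparse, q), lq_penalty(dense, q)) for q in (0, 0.5, 1, 2)
}
for q, (p_sparse, p_dense) in penalties.items():
    print(q, round(p_sparse, 3), round(p_dense, 3))
```

<p>For \(q \leq 1\) the sparse vector pays a smaller penalty than the dense one, while for \(q = 2\) the two penalties are essentially equal, which matches the Sparse column of the table.</p>

  </article>
  <!-- Presenter Notes -->
</slide>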

      <slide class="" id="slide-38" style="background:;">
  <hgroup>
    <h2>LASSO</h2>
  </hgroup>
  <article>
    <h3>Regularizations as priors</h3>

<p>\(|\beta_j|^q\) can be viewed as the log-prior density for \(\beta_j\) (up to sign and scale). The three methods below are Bayes estimates with different priors:</p>

<ul>
<li><b>Subset selection</b>: corresponds to \(q = 0\)</li>
<li><b>LASSO</b>: corresponds to \(q = 1\), Laplace prior, \(density = (\frac{1}{2\tau})\exp(\frac{-|\beta|}{\tau}), \tau = \sigma/\lambda\)</li>
<li><b>Ridge regression</b>: corresponds to \(q = 2\), Gaussian prior, \(\beta \sim N(0, \tau^2 \mathbf{I})\), \(\lambda = \frac{\sigma^2}{\tau^2}\)</li>
</ul>

  
<div class='left' style='float:left;width:48%'>
 <p><center><img src="assets/img/laplace.png" alt="laplace" title="laplace"></center></p>


</div>    
<div class='right' style='float:right;width:48%'>
 <p><center><img src="assets/img/gauss.png" alt="gauss" title="gauss"></center></p>

</div>
  </article>
  <!-- Presenter Notes -->
</slide>

      <slide class="" id="slide-39" style="background:;">
  <hgroup>
    <h2>LASSO</h2>
  </hgroup>
  <article>
    <ul>
<li>Introduction to Dimensionality Reduction</li>
<li>Linear Regression and Least Squares (Review)</li>
<li>Subset Selection</li>
<li><b>Shrinkage Methods</b>

<ul>
<li>Ridge Regression</li>
<li><b>LASSO</b>

<ul>
<li>Formulations</li>
<li>Comparisons with ridge regression and subset selection</li>
<li><b>Solution of LASSO</b></li>
<li>Viewed as approximation for \(l_0\)-regularization</li>
</ul></li>
</ul></li>
<li>Beyond LASSO</li>
</ul>

  </article>
  <!-- Presenter Notes -->
</slide>

      <slide class="" id="slide-40" style="background:;">
  <hgroup>
    <h2>LASSO</h2>
  </hgroup>
  <article>
    <h3>Quadratic Programming</h3>

<ul>
<li>Formulation
\[
\min_{\beta}\{ \frac{1}{2}(\mathbf{X}\beta - \mathbf{y})^T (\mathbf{X}\beta - \mathbf{y}) + \lambda \|\beta\|_1\}
\]
is equivalent to 
\[
\min_{\beta, \xi}\{ \frac{1}{2}(\mathbf{X}\beta - \mathbf{y})^T (\mathbf{X}\beta - \mathbf{y}) + \lambda \mathbf{1}^T\xi\}
\]</li>
</ul>

<p>\[
\begin{aligned}
\text{s.t.}\quad &\beta_j \leq \xi_j,\\
&\beta_j \geq -\xi_j, \quad j = 1, \ldots, p
\end{aligned}
\]</p>

<ul>
<li>Note that QP can only solve LASSO for a given \(\lambda\). 

<ul>
<li>Next: method for solving for all \(\lambda\) (LAR)</li>
</ul></li>
</ul>

  </article>
  <!-- Presenter Notes -->
</slide>

      <slide class="" id="slide-41" style="background:;">
  <hgroup>
    <h2>LAR Algorithm</h2>
  </hgroup>
  <article>
      
<div class='left' style='float:left;width:44%'>
 <h3>Algorithm</h3>

<ul>
<li>Standardize all predictors; </li>
<li>\(\mathbf{r}_0 = \mathbf{y} - \bar{\mathbf{y}}\); \(\beta = \mathbf{0}\);</li>
<li>\(k = argmax_{j} |corr(\mathbf{x}_j, \mathbf{r}_0)|\), \(\mathcal{A}_1 = \{k\}\)<br></li>
<li>For time step \(t = 1,2,\ldots,\min(N-1,p)\)

<ul>
<li>Move \(\beta_{\mathcal{A}_t}\) in the joint least squares direction for \(\mathcal{A}_t\), until some other \(k \not\in \mathcal{A}_t\) has as much correlation with the current residual</li>
<li>\(\mathcal{A}_{t+1} = \mathcal{A}_{t} \cup \{k\}\)</li>
</ul></li>
</ul>


</div>    
<div class='right' style='float:right;width:44%'>
 <h3>Notations</h3>

<ul>
<li>\(\beta\): \(p \times 1\) coefficient vector</li>
<li>\(\mathcal{A}_t\): <i>active set</i>, the set of indices of the features included in the model at time step \(t\).

<ul>
<li>\(\bar{\mathcal{A}_t} = \{1,2,...,p\} - \mathcal{A}_t\)</li>
</ul></li>
<li>\(\beta_{\mathcal{A}_t}\): \(|\mathcal{A}_t| \times 1\) vector of coefficients, w.r.t \(\mathcal{A}_t\)

<ul>
<li>Contains the \(\beta_j, \quad j \in \mathcal{A}_t\)</li>
</ul></li>
<li>\(\mathbf{X}_{\mathcal{A}_t}\): \([ \mathbf{x}_j ]_{j\in \mathcal{A}_t}\)</li>
</ul>

</div>
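<div style='float:left;width:100%;'>
<p>The steps above can be sketched in plain Python. This is a minimal illustration rather than a reference implementation: the helper names <code>dot</code>, <code>solve</code> and <code>lar</code> are hypothetical, the predictors are assumed to be already centred with equal norm, the active columns are assumed to be linearly independent, and the toy data are made up. The step length is found by solving for the first point at which an inactive predictor ties with the active ones in absolute correlation.</p>

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def solve(A, b):
    """Solve A z = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [v - f * w for v, w in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]


def lar(cols, y):
    """cols: list of predictor columns; y: response. Returns the list of
    coefficient vectors at the start and after every LAR step."""
    p = len(cols)
    beta = [0.0] * p
    r = list(y)                                   # current residual
    c = [dot(x, r) for x in cols]
    active = [max(range(p), key=lambda j: abs(c[j]))]
    path = [list(beta)]
    while True:
        # Joint least squares direction for the active set w.r.t. the residual.
        G = [[dot(cols[i], cols[j]) for j in active] for i in active]
        delta = solve(G, [dot(cols[j], r) for j in active])
        u = [sum(d * cols[j][i] for d, j in zip(delta, active))
             for i in range(len(y))]
        C = abs(dot(cols[active[0]], r))          # common absolute correlation
        # Smallest step in (0, 1] at which an inactive predictor catches up.
        alpha, entering = 1.0, None
        for k in range(p):
            if k in active:
                continue
            ck, ak = dot(cols[k], r), dot(cols[k], u)
            for num, den in ((C - ck, C - ak), (C + ck, C + ak)):
                if den > 1e-12 and 1e-12 < num / den < alpha:
                    alpha, entering = num / den, k
        for d, j in zip(delta, active):
            beta[j] += alpha * d
        r = [ri - alpha * ui for ri, ui in zip(r, u)]
        path.append(list(beta))
        if entering is None:                      # reached the least squares fit
            return path
        active.append(entering)


# Toy data (made up): two unit-norm predictors in N = 2 dimensions.
cols = [[1.0, 0.0], [0.6, 0.8]]
y = [3.0, 1.0]
for step, beta in enumerate(lar(cols, y)):
    print(step, [round(b, 4) for b in beta])
```

<p>On this toy problem the first predictor enters first, the coefficients reach roughly \((1, 0)\) when the second predictor ties in correlation, and the final step ends at the full least squares solution, roughly \((2.25, 1.25)\).</p>
</div>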
  </article>
  <!-- Presenter Notes -->
</slide>

      <slide class="" id="slide-42" style="background:;">
  <hgroup>
    <h2>LAR - Example</h2>
  </hgroup>
  <article>
      
<div class='left' style='float:left;width:50%'>
 <p><center><img src="assets/img/lars1.png" alt="lars1" title="lars1"></center></p>


</div>    
<div class='right' style='float:right;width:48%'>
 <ul>
<li>Example Setting:

<ul>
<li>\(N = 2\), \(p = 2\)</li>
</ul></li>
<li>Columnwise-standardized data matrix \(\mathbf{X}\) 

<ul>
<li>s.t. \(mean\{\mathbf{x}_j\} = 0\), \(std\{\mathbf{x}_j\} = 1\)</li>
<li>\(\rightarrow \|\mathbf{x}_1\| = \|\mathbf{x}_2\| = ... = \|\mathbf{x}_p\|\)</li>
<li>Two standardized column vectors \(\mathbf{x}_1\) and \(\mathbf{x}_2\) are shown in the left figure<br></li>
</ul></li>
<li>\(\mathcal{A}_0 = \emptyset\), which means that we have not chosen any feature yet<br></li>
<li>\(\beta = (0, 0)^T\)</li>
<li>The \(N\times 1\) fit vector \(\mathbf{f}_0 = \mathbf{X} \beta = \mathbf{0}\)</li>
</ul>

</div>
  </article>
  <!-- Presenter Notes -->
</slide>

      <slide class="" id="slide-43" style="background:;">
  <hgroup>
    <h2>LAR - Example</h2>
  </hgroup>
  <article>
      
<div class='left' style='float:left;width:55%'>
 <p><center><img src="assets/img/lars2.png" alt="lars2" title="lars2"></center></p>


</div>    
<div class='right' style='float:right;width:45%'>
 <ul>
<li>\(k = argmax_{j} |corr(\mathbf{x}_j, \mathbf{r}_0)| = 1\)</li>
<li>\(\mathcal{A}_1 = \mathcal{A}_0 \cup \{1\} = \{1\}\)</li>
</ul>

</div>
  </article>
  <!-- Presenter Notes -->
</slide>

      <slide class="" id="slide-44" style="background:;">
  <hgroup>
    <h2>LAR - Example</h2>
  </hgroup>
  <article>
      
<div class='left' style='float:left;width:55%'>
 <p><center><img src="assets/img/lars3.png" alt="lars3" title="lars3"></center></p>


</div>    
<div class='right' style='float:right;width:45%'>
 <h3>Explanations</h3>

<ul>
<li>\(\mathbf{r}_1 = \mathbf{y} - \mathbf{X}_{\mathcal{A}_1} \beta_{\mathcal{A}_1}\) is the residual error at the beginning of time \(1\)</li>
<li>\(\delta_1 = \mathbf{(X^T_{\mathcal{A}_1} X_{\mathcal{A}_1})^{-1}X^T_{\mathcal{A}_1}r_1}\) is the least squares estimate of the coefficients of the features in \(\mathcal{A}_1 = \{1\}\), obtained by regressing the residual \(\mathbf{r}_1\) on \(\mathbf{X}_{\mathcal{A}_1}\)

<ul>
<li>\(\delta_1\) is the direction along which the coefficients \(\beta_{\mathcal{A}_1}\) change</li>
</ul></li>
<li>\(\mathbf{u}_1 = \mathbf{X}_{\mathcal{A}_1} \delta_1\)

<ul>
<li>As \(\beta_{\mathcal{A}_1}\) changes along \(\delta_1\), the fit \(\mathbf{f}\) changes along \(\mathbf{u}_1\)</li>
</ul></li>
</ul>

</div>
  </article>
  <!-- Presenter Notes -->
</slide>

      <slide class="" id="slide-45" style="background:;">
  <hgroup>
    <h2>LAR - Example</h2>
  </hgroup>
  <article>
      
<div class='left' style='float:left;width:55%'>
 <p><center><img src="assets/img/lars3.png" alt="lars3" title="lars3"></center></p>


</div>    
<div class='right' style='float:right;width:45%'>
 <h3>Comparison</h3>

<ul>
<li>\(\mathbf{r}_1\)

<ul>
<li>\(\delta_1 = \mathbf{(X^T_{\mathcal{A}_1} X_{\mathcal{A}_1})^{-1}X^T_{\mathcal{A}_1}r_1}\)</li>
<li>\(\mathbf{u}_1 = \mathbf{X}_{\mathcal{A}_1} \delta_1\)</li>
</ul></li>
<li>\(\mathbf{y}\)

<ul>
<li>\(\hat{\beta} = \mathbf{(X^T X)^{-1}X^Ty}\)</li>
<li>\(\hat{\mathbf{y}} = \mathbf{X} \hat{\beta}\)</li>
</ul></li>
</ul>

</div>
  </article>
  <!-- Presenter Notes -->
</slide>

      <slide class="" id="slide-46" style="background:;">
  <hgroup>
    <h2>LAR - Example</h2>
  </hgroup>
  <article>
      
<div class='left' style='float:left;width:55%'>
 <p><center><img src="assets/img/lars4.png" alt="lars4" title="lars4"></center></p>


</div>    
<div class='right' style='float:right;width:45%'>
 <h3>Explanations</h3>

<ul>
<li>As \(\beta_{\mathcal{A}_1}\) moves along \(\delta_1\), the correlation between the feature in \(\mathcal{A}_1 = \{1\}\) and the residual decreases</li>
<li>\(\mathbf{f}_1\), initialized as \(\mathbf{f}_0\), moves along \(\mathbf{u}_1\)</li>
<li>Eventually, the correlation between feature \(k = 2\) and the residual catches up</li>
<li>\(\mathcal{A}_2 = \mathcal{A}_1 \cup \{2\} = \{1, 2\}\)</li>
<li>Note that the fit \(\mathbf{f}_1\) approaches the least squares fit \(\mathbf{f}_1^{ls}\), but does not reach it</li>
</ul>

</div>
  </article>
  <!-- Presenter Notes -->
</slide>

      <slide class="" id="slide-47" style="background:;">
  <hgroup>
    <h2>LAR - Example</h2>
  </hgroup>
  <article>
      
<div class='left' style='float:left;width:55%'>
 <p><center><img src="assets/img/lars5.png" alt="lars5" title="lars5"></center></p>


</div>    
<div class='right' style='float:right;width:45%'>
 <h3>Explanations</h3>

<ul>
<li>The joint least squares direction, which \(\beta_{\mathcal{A}_2}\) moves along, is \(\delta_2 = \mathbf{(X^T_{\mathcal{A}_2} X_{\mathcal{A}_2})^{-1}X^T_{\mathcal{A}_2}r_2}\)

<ul>
<li>\(\mathbf{r}_2 = \mathbf{y} - \mathbf{X}_{\mathcal{A}_2} \beta_{\mathcal{A}_2}\)</li>
</ul></li>
<li>The direction our fit moves along is \(\mathbf{u}_2 = \mathbf{X}_{\mathcal{A}_2} \delta_2\)

<ul>
<li>Note that \(\mathbf{u}_2\) is the bisector of \(\mathbf{x}_1\) and \(\mathbf{x}_2\)</li>
<li>In general, \(\mathbf{u}_t\) is the &quot;bisector&quot; of (has the same angle with) all \(\mathbf{x}_j,\quad j \in \mathcal{A}_t\)</li>
</ul></li>
</ul>

</div>
  </article>
  <!-- Presenter Notes -->
</slide>

      <slide class="" id="slide-48" style="background:;">
  <hgroup>
    <h2>LAR - Example</h2>
  </hgroup>
  <article>
      
<div class='left' style='float:left;width:55%'>
 <p><center><img src="assets/img/lars6.png" alt="lars6" title="lars6"></center></p>


</div>    
<div class='right' style='float:right;width:45%'>
 <h3>Explanations</h3>

<ul>
<li>As \(\beta_{\mathcal{A}_2}\) moves along \(\delta_2\), the fit \(\mathbf{f}_2\), initialized as \(\mathbf{f}_1\),

<ul>
<li>moves along \(\mathbf{u}_2\)</li>
<li>approaches \(\mathbf{f}_2^{ls}\)</li>
</ul></li>
<li>As we only have \(p = 2\) features, finally

<ul>
<li>\(\mathbf{f}_2 = \mathbf{f}^{ls}_2\)</li>
</ul></li>
</ul>

</div>
  </article>
  <!-- Presenter Notes -->
</slide>

      <slide class="" id="slide-49" style="background:;">
  <hgroup>
    <h2>LAR</h2>
  </hgroup>
  <article>
    <h3>More Comments</h3>

<ul>
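<li>With the LASSO modification (described on the following slides), LAR computes the whole solution path, i.e. the solutions for all bounds \(t\) in the constraint \(\|\beta\|_1 \leq t\)</li>
<li>The LAR algorithm ends after \(\min(p, N-1)\) steps</li>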
</ul>

<p><center><img src="assets/img/lar_corr.png" alt="corr" title="cor"></center></p>

  </article>
  <!-- Presenter Notes -->
</slide>

      <slide class="" id="slide-50" style="background:;">
  <hgroup>
    <h2>LAR</h2>
  </hgroup>
  <article>
    <h3>Result compared with LASSO</h3>

<h3>Observations</h3>

<p>When the coefficient shown by the blue line crosses zero, the LAR and LASSO paths diverge.</p>

<p><center><img src="assets/img/comp4.png" alt="comp4" title="comp4"></center></p>

  </article>
  <!-- Presenter Notes -->
</slide>

      <slide class="" id="slide-51" style="background:;">
  <hgroup>
    <h2>LAR</h2>
  </hgroup>
  <article>
    <h3>Result compared with LASSO</h3>

<h3>Modification for LASSO</h3>

<p>During the search, if a non-zero coefficient hits zero, drop the corresponding variable from \(\mathcal{A}_t\) and recompute the direction \(\delta_t\).</p>

<p><center><img src="assets/img/comp4.png" alt="comp4" title="comp4"></center></p>

  </article>
  <!-- Presenter Notes -->
</slide>

      <slide class="" id="slide-52" style="background:;">
  <hgroup>
    <h2>LAR</h2>
  </hgroup>
  <article>
    <h3>Some heuristic analysis</h3>

<ul>
<li><p>At any given time point, all \(\mathbf{x}_j\) with \(j \in \mathcal{A}\) share the same absolute correlation with the residual. That is,
\[\mathbf{x}_j^T(\mathbf{y} - \mathbf{X}\beta) = \gamma \cdot s_j, \quad \forall j \ \in \mathcal{A}\]
where \(s_j = sign(\mathbf{x}_j^T(\mathbf{y} - \mathbf{X}\beta)) \in \{-1,1\}\) and \(\gamma\) is the common value. </p>

<ul>
<li>We also know that \(|\mathbf{x}_j^T(\mathbf{y} - \mathbf{X}\beta)| \leq \gamma, \quad \forall j \not\in \mathcal{A}\)</li>
</ul></li>
<li><p>Consider LASSO for a fixed \(\lambda\). Let \(\mathcal{B}\) be the set of indices of non-zero coefficients, then we differentiate the objective function w.r.t. those coefficients in \(\mathcal{B}\) and set the gradient to zero. We have
\[\mathbf{x}_j^T(\mathbf{y} - \mathbf{X}\beta) = \lambda \cdot sign(\beta_j), \quad \forall j \in \mathcal{B}\]</p></li>
<li><p>The two conditions coincide only if \(sign(\beta_j)\) matches \(s_j\). LAR allows a \(\beta_j\) with \(j \in \mathcal{A}\) whose sign differs from \(sign(\mathbf{x}_j^T(\mathbf{y} - \mathbf{X}\beta))\), whereas the LASSO condition forbids this for \(j \in \mathcal{B}\). </p></li>
</ul>

  </article>
  <!-- Presenter Notes -->
</slide>

      <slide class="" id="slide-53" style="background:;">
  <hgroup>
    <h2>LAR</h2>
  </hgroup>
  <article>
    <h3>Some heuristic analysis</h3>

<ul>
<li> LAR requires that
\[|\mathbf{x}_k^T(\mathbf{y} - \mathbf{X}\beta)| \leq \gamma, \quad \forall k \not\in \mathcal{A}\]

<ul>
<li>\(\mathcal{A}\): set of the indices of the features with non-zero coefficients in LAR</li>
<li>\(\gamma = |\mathbf{x}_j^T(\mathbf{y} - \mathbf{X}\beta)|,\quad \forall j \in \mathcal{A}\)</li>
</ul></li>
<li>For LASSO, the stationarity conditions require that
\[
|\mathbf{x}_k^T(\mathbf{y} - \mathbf{X}\beta)| \leq \lambda, \quad \forall k \not\in \mathcal{B}
\]

<ul>
<li>\(\mathcal{B}\): set of the indices of the features with non-zero coefficients in LASSO</li>
<li>\(\lambda\): regularization parameter</li>
</ul></li>
<li>LAR agrees with LASSO for variables with zero coefficients too.</li>
</ul>
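<p>The LASSO conditions on these two slides can be checked numerically. The sketch below is only an illustration: it solves \(\min_\beta \frac{1}{2}\|\mathbf{y} - \mathbf{X}\beta\|_2^2 + \lambda\|\beta\|_1\) by cyclic coordinate descent with soft-thresholding (a stand-in solver chosen for brevity, not the QP or LAR approach from the slides), and then prints \(\mathbf{x}_j^T(\mathbf{y} - \mathbf{X}\beta)\) for each feature. The function names and the toy data are hypothetical.</p>

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def soft_threshold(z, lam):
    """Shrink z towards zero by lam, returning 0 inside [-lam, lam]."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def lasso_cd(cols, y, lam, sweeps=200):
    """Cyclic coordinate descent for 0.5 * ||y - X beta||^2 + lam * ||beta||_1.
    cols is the list of columns of X. Returns (beta, residual)."""
    p = len(cols)
    beta = [0.0] * p
    r = list(y)
    for _ in range(sweeps):
        for j in range(p):
            xj = cols[j]
            norm2 = dot(xj, xj)
            # Correlation of x_j with the residual that excludes feature j.
            rho = dot(xj, r) + norm2 * beta[j]
            new = soft_threshold(rho, lam) / norm2
            if new != beta[j]:
                r = [ri - (new - beta[j]) * xi for ri, xi in zip(r, xj)]
                beta[j] = new
    return beta, r


# Toy data (made up): two unit-norm predictors in N = 2 dimensions.
cols = [[1.0, 0.0], [0.6, 0.8]]
y = [3.0, 1.0]
for lam in (2.5, 1.0):
    beta, r = lasso_cd(cols, y, lam)
    corr = [dot(x, r) for x in cols]
    print(lam, [round(b, 4) for b in beta], [round(c, 4) for c in corr])
```

<p>For \(\lambda = 2.5\) only the first coefficient is non-zero: its correlation with the residual equals \(\lambda\), while that of the second feature stays below \(\lambda\). For \(\lambda = 1\) both coefficients are non-zero and both correlations equal \(\lambda\).</p>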

  </article>
  <!-- Presenter Notes -->
</slide>

      <slide class="" id="slide-54" style="background:;">
  <hgroup>
    <h2>LASSO</h2>
  </hgroup>
  <article>
    <ul>
<li>Introduction to Dimensionality Reduction</li>
<li>Linear Regression and Least Squares (Review)</li>
<li>Subset Selection</li>
<li><p><b>Shrinkage Methods</b></p>

<ul>
<li>Ridge Regression</li>
<li><p><b>LASSO</b></p>

<ul>
<li>Formulations</li>
<li>Comparisons with ridge regression and subset selection</li>
<li>Solution of LASSO</li>
<li><b>Viewed as approximation for \(l_0\)-regularization</b></li>
</ul></li>
</ul></li>
<li><p>Beyond LASSO</p></li>
</ul>

  </article>
  <!-- Presenter Notes -->
</slide>

      <slide class="" id="slide-55" style="background:;">
  <hgroup>
    <h2>Viewed as approximation for \(l_0\)-regularization</h2>
  </hgroup>
  <article>
    <h3>Pure variable selection</h3>

<p>\[
\hat{\beta}^{subset}= argmin_{\beta}\sum_{i=1}^N(y_i - \beta_0 - \sum_{j=1}^p x_{ij}\beta_j)^2, \quad s.t. \#nonzero \beta_j \leq t
\]</p>

<p>Actually \(\#nonzero \beta_j = \|\beta\|_0\), where</p>

<p>\[\|\beta\|_0 = \lim_{q \to 0}\sum_{j = 1}^p|\beta_j|^q = card(\quad \{\beta_j|\beta_j \neq 0\}\quad)\]</p>

<p><center><img src="assets/img/zeronorm.png" alt="zeronorm" title="zeronorm"></center></p>
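<p>A quick numerical check of the counting interpretation: as \(q\) shrinks towards zero, each term \(|\beta_j|^q\) tends to 1 for a non-zero coefficient and stays 0 for a zero coefficient, so the sum \(\sum_j |\beta_j|^q\) approaches the number of non-zero coefficients. The function name and the example vector below are made up for illustration.</p>

```python
def lq_sum(beta, q):
    """Sum of |beta_j|^q for q > 0 (no 1/q root is taken)."""
    return sum(abs(b) ** q for b in beta)


beta = [0.0, 2.0, -0.5, 0.0, 3.0]   # three non-zero coefficients
for q in (1.0, 0.5, 0.1, 0.01, 0.001):
    print(q, round(lq_sum(beta, q), 4))
```

<p>For \(q = 1\) the sum is the usual \(l_1\) norm (5.5 here); for small \(q\) it approaches 3, the number of non-zero entries.</p>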

  </article>
  <!-- Presenter Notes -->
</slide>

      <slide class="" id="slide-56" style="background:;">
  <hgroup>
    <h2>Viewed as approximation for \(l_0\)-regularization</h2>
  </hgroup>
  <article>
    <h3>Problem</h3>

<p>\(l_0\)-norm is not convex, which makes it very hard to optimize.</p>

<h3>Solutions</h3>

<ul>
<li><b>LASSO</b>: Approximated objective function (\(l_1\)-norm), with exact optimization</li>
<li><b>Subset selection</b>: Exact objective function, with approximated optimization (greedy strategy)</li>
</ul>

  </article>
  <!-- Presenter Notes -->
</slide>

      <slide class="" id="slide-57" style="background:;">
  <hgroup>
    <h2>Part 5: Beyond LASSO</h2>
  </hgroup>
  <article>
    <ol>
<li>Introduction to Dimensionality Reduction</li>
<li>Linear Regression and Least Squares (Review)</li>
<li>Subset Selection</li>
<li>Shrinkage Methods</li>
<li><b>Beyond LASSO</b>

<ul>
<li><b>Elastic-Net</b></li>
<li><b>Fused LASSO</b></li>
<li><b>Group LASSO</b><br></li>
<li><b>\(l_1\)-\(l_p\) norm</b></li>
<li><b>Graph-guided LASSO</b></li>
</ul></li>
</ol>

  </article>
  <!-- Presenter Notes -->
</slide>

      <slide class="" id="slide-58" style="background:;">
  <hgroup>
    <h2>Beyond LASSO</h2>
  </hgroup>
  <article>
    <h3>Problems with LASSO</h3>

<ol>
<li>LASSO tends to select, rather arbitrarily, just one variable from a group of highly correlated variables (see how LAR works). Sometimes it is better to select <b>ALL</b> the relevant variables in a group</li>
<li>When \(p > N\), LASSO selects at most \(N\) variables, which may be undesirable when \(p \gg N\)</li>
<li>When \(N > p\) and the variables are correlated, ridge regression tends to outperform LASSO in prediction</li>
<li>LASSO does not take into account prior information about the structure over input or output variables</li>
</ol>

  </article>
  <!-- Presenter Notes -->
</slide>

      <slide class="" id="slide-59" style="background:;">
  <hgroup>
    <h2>Beyond LASSO - Elastic Net</h2>
  </hgroup>
  <article>
    <h3>Problems solved by E-Net</h3>

<ol>
<li>LASSO tends to select, rather arbitrarily, just one variable from a group of highly correlated variables (see how LAR works). Sometimes it is better to select <b>ALL</b> the relevant variables in a group</li>
<li>When \(p > N\), LASSO selects at most \(N\) variables, which may be undesirable when \(p \gg N\)</li>
<li>When \(N > p\) and the variables are correlated, ridge regression tends to outperform LASSO in prediction</li>
</ol>

<h3>Elastic Net</h3>

  
<div class='left' style='float:left;width:41%'>
 <ul>
<li><b>Penalty Term</b>
\[\lambda \sum_{j = 1}^p (\alpha \beta_j^2 + (1-\alpha)|\beta_j|)\]
with \(\alpha \in [0,1]\); this is a compromise between the ridge and LASSO penalties.</li>
</ul>


</div>    
<div class='right' style='float:right;width:55%'>
 <p><center><img src="assets/img/enet.png" alt="enet" title="enet"></center></p>

</div>
<div style='float:left;width:100%;' class='centered'>
  
</div>
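<p>To make the role of \(\alpha\) concrete, the penalty term can be evaluated directly. This is a small sketch with a hypothetical function name and a made-up coefficient vector; it follows the parametrization on this slide, where \(\alpha = 0\) gives the LASSO penalty and \(\alpha = 1\) gives the ridge penalty.</p>

```python
def elastic_net_penalty(beta, lam, alpha):
    """lam * sum_j (alpha * beta_j^2 + (1 - alpha) * |beta_j|)."""
    return lam * sum(alpha * b * b + (1 - alpha) * abs(b) for b in beta)


beta = [2.0, -1.0, 0.0]
print(elastic_net_penalty(beta, 1.0, 0.0))   # LASSO penalty: sum of |beta_j|
print(elastic_net_penalty(beta, 1.0, 1.0))   # ridge penalty: sum of beta_j^2
print(elastic_net_penalty(beta, 1.0, 0.5))   # equal mix of the two
```

<p>For this vector the three values are 3.0, 5.0 and 4.0, so intermediate values of \(\alpha\) interpolate between the two penalties.</p>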
  </article>
  <!-- Presenter Notes -->
</slide>

      <slide class="" id="slide-60" style="background:;">
  <hgroup>
    <h2>Beyond LASSO - Elastic Net</h2>
  </hgroup>
  <article>
    <h3>More Advantages of E-Net</h3>

<ul>
<li>selects variables like LASSO, and shrinks together the coefficients of correlated predictors like ridge.</li>
<li>has considerable computational advantages over the \(l_q\) penalties. 

<ul>
<li>See Section 18.4 of <i>The Elements of Statistical Learning</i></li>
</ul></li>
</ul>

<h3>Elastic Net</h3>

  
<div class='left' style='float:left;width:41%'>
 <ul>
<li><b>Penalty Term</b>
\[\lambda \sum_{j = 1}^p (\alpha \beta_j^2 + (1-\alpha)|\beta_j|)\]
with \(\alpha \in [0,1]\); this is a compromise between the ridge and LASSO penalties.</li>
</ul>


</div>    
<div class='right' style='float:right;width:55%'>
 <p><center><img src="assets/img/enet.png" alt="enet" title="enet"></center></p>

</div>
<div style='float:left;width:100%;' class='centered'>
  
</div>
  </article>
  <!-- Presenter Notes -->
</slide>

      <slide class="" id="slide-61" style="background:;">
  <hgroup>
    <h2>Elastic Net - A simple illustration</h2>
  </hgroup>
  <article>
    <ul>
<li>Two independent “hidden” factors \(\mathbf{z}_1\) and \(\mathbf{z}_2\)
\[\mathbf{z}_1 \sim U(0, 20),\quad \mathbf{z}_2 \sim U(0, 20),\]</li>
<li>Generate the response vector \(\mathbf{y} = \mathbf{z}_1 + 0.1\mathbf{z}_2 + N(0,1)\)</li>
<li>Suppose the observed features are
\[\mathbf{x}_1 = \mathbf{z}_1 + \epsilon_1,\quad \mathbf{x}_2 = -\mathbf{z}_1 + \epsilon_2,\quad \mathbf{x}_3 = \mathbf{z}_1 + \epsilon_3\]
\[\mathbf{x}_4 = \mathbf{z}_2 + \epsilon_4,\quad \mathbf{x}_5 = -\mathbf{z}_2 + \epsilon_5,\quad \mathbf{x}_6 = \mathbf{z}_2 + \epsilon_6\]
where the \(\epsilon_i\) are \(i.i.d.\) random noise terms.</li>
<li>Fit the model on data \((\mathbf{X}, \mathbf{y})\)</li>
<li>A good model should identify that only \(\mathbf{x}_1, \mathbf{x}_2, \mathbf{x}_3\) are important</li>
</ul>
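<p>The setup above can be simulated in a few lines. The sample size (<code>n=100</code>) and the noise standard deviation (<code>noise_sd=0.25</code>) are assumptions made for this sketch, since the slide does not state them, and the function names are hypothetical. The code only generates the data and prints a few correlations, to show the two groups of nearly collinear features that make the problem hard for LASSO.</p>

```python
import random


def corr(a, b):
    """Sample Pearson correlation of two equal-length lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    sab = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    saa = sum((x - ma) ** 2 for x in a)
    sbb = sum((y - mb) ** 2 for y in b)
    return sab / (saa * sbb) ** 0.5


def make_data(n=100, noise_sd=0.25, seed=0):
    """Return (X, y), where X is the list of six feature columns x1..x6."""
    rng = random.Random(seed)
    z1 = [rng.uniform(0, 20) for _ in range(n)]
    z2 = [rng.uniform(0, 20) for _ in range(n)]
    y = [a + 0.1 * b + rng.gauss(0, 1) for a, b in zip(z1, z2)]
    signs = [1, -1, 1]                  # x1 = z1, x2 = -z1, x3 = z1 (plus noise)
    X = [[s * z + rng.gauss(0, noise_sd) for z in zs]
         for zs in (z1, z2) for s in signs]
    return X, y


X, y = make_data()
print("corr(x1, x3):", round(corr(X[0], X[2]), 3))
print("corr(x1, x2):", round(corr(X[0], X[1]), 3))
print("corr(x1, x4):", round(corr(X[0], X[3]), 3))
print("corr(x1, y): ", round(corr(X[0], y), 3))
```

<p>Within a group the features are almost perfectly (positively or negatively) correlated, while features from different groups are nearly uncorrelated.</p>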

  </article>
  <!-- Presenter Notes -->
</slide>

      <slide class="" id="slide-62" style="background:;">
  <hgroup>
    <h2>Elastic Net - A simple illustration</h2>
  </hgroup>
  <article>
    <p><center><img src="assets/img/enet_lasso.png" alt="enet_LASSO" title="enet_lasso"></center></p>

  </article>
  <!-- Presenter Notes -->
</slide>

      <slide class="" id="slide-63" style="background:;">
  <hgroup>
    <h2>Elastic Net - A simple illustration</h2>
  </hgroup>
  <article>
    <p><center><img src="assets/img/enet_re.png" alt="enet_LASSO" title="enet_re"></center></p>

  </article>
  <!-- Presenter Notes -->
</slide>

      <slide class="" id="slide-64" style="background:;">
  <hgroup>
    <h2>Beyond LASSO - Fused LASSO</h2>
  </hgroup>
  <article>
    <h3>Problems with LASSO</h3>

<ol>
<li>LASSO tends to select, rather arbitrarily, just one variable from a group of highly correlated variables (see how LAR works). Sometimes it is better to select <b>ALL</b> the relevant variables in a group</li>
<li>When \(p > N\), LASSO selects at most \(N\) variables, which may be undesirable when \(p \gg N\)</li>
<li>When \(N > p\) and the variables are correlated, ridge regression tends to outperform LASSO in prediction</li>
<li>LASSO does not take into account prior information about the structure over input or output variables</li>
</ol>

<p>Fused LASSO addresses the \(4th\) problem for a specific kind of prior information.</p>

  </article>
  <!-- Presenter Notes -->
</slide>

      <slide class="" id="slide-65" style="background:;">
  <hgroup>
    <h2>Beyond LASSO - Fused LASSO</h2>
  </hgroup>
  <article>
    <h3>Fused LASSO</h3>

<ul>
<li><b>Intuition</b>

<ul>
<li>Fused LASSO is designed for problems with features that can be ordered in some meaningful way, where &quot;adjacent features&quot; should have similar importance</li>
<li>Fused LASSO penalizes the \(L_1\)-norm of both the coefficients and their successive differences</li>
</ul></li>
<li><b>Example</b>

<ul>
<li>Classification with fMRI data: each voxel has about 200 measurements over time. The coefficients for adjacent voxels should be similar</li>
</ul></li>
<li><b>Formulation</b></li>
</ul>

<p>\[\hat{\beta} = argmin_{\beta}\{\|\mathbf{X}\beta - \mathbf{y}\|_2^2\}\]
\[s.t. \|\beta\|_1 \leq s_1 \quad and \quad \sum_{j = 2}^p |\beta_j - \beta_{j-1}| \leq s_2\]</p>
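<p>The two constrained quantities can be computed for any coefficient vector. The sketch below uses a hypothetical function name and made-up vectors: both have the same \(l_1\) norm, but the piecewise-constant one has a much smaller sum of successive differences, which is the kind of solution the second constraint favours.</p>

```python
def fused_lasso_terms(beta):
    """Return (sum_j |beta_j|, sum_{j >= 2} |beta_j - beta_{j-1}|)."""
    l1 = sum(abs(b) for b in beta)
    fusion = sum(abs(b - a) for a, b in zip(beta, beta[1:]))
    return l1, fusion


smooth = [0.0, 0.0, 2.0, 2.0, 2.0, 0.0]      # one block of equal coefficients
scattered = [2.0, 0.0, 2.0, 0.0, 2.0, 0.0]   # same l1 norm, no block structure
print(fused_lasso_terms(smooth))
print(fused_lasso_terms(scattered))
```

<p>Both vectors have \(\sum_j|\beta_j| = 6\), but the fusion term is 4 for the first and 10 for the second.</p>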

  </article>
  <!-- Presenter Notes -->
</slide>

      <slide class="" id="slide-66" style="background:;">
  <hgroup>
    <h2>Beyond LASSO - Fused LASSO</h2>
  </hgroup>
  <article>
    <h3>Fused LASSO</h3>

<p><center><img src="assets/img/nb_fs.png" alt="nb_fs" title="nb_fs"></center></p>

  </article>
  <!-- Presenter Notes -->
</slide>

      <slide class="" id="slide-67" style="background:;">
  <hgroup>
    <h2>Fused LASSO - Simulation results</h2>
  </hgroup>
  <article>
    <p><center><img src="assets/img/flasso.png" alt="fLASSO" title="fLASSO"></center></p>

<ul>
<li>\(p = 100\). Black lines are the true coefficients.</li>
<li>(a) Univariate regression coefficients (red) and a soft-thresholded version of them (green)</li>
<li>(b) LASSO solution (red), \(s_1 = 35.6,\quad s_2 = \infty\)</li>
<li>(c) Fusion estimate, \(s_1 = \infty, \quad s_2 = 26\)</li>
<li>(d) Fused LASSO, \(s_1 = \sum |\beta_j|,\quad s_2 = \sum |\beta_j - \beta_{j-1}|\)</li>
</ul>

  </article>
  <!-- Presenter Notes -->
</slide>

      <slide class="" id="slide-68" style="background:;">
  <hgroup>
    <h2>Beyond LASSO - Group LASSO</h2>
  </hgroup>
  <article>
    <h3>Problems with LASSO</h3>

<ol>
<li>LASSO tends to select, rather arbitrarily, just one variable from a group of highly correlated variables (see how LAR works). Sometimes it is better to select <b>ALL</b> the relevant variables in a group</li>
<li>When \(p > N\), LASSO selects at most \(N\) variables, which may be undesirable when \(p \gg N\)</li>
<li>When \(N > p\) and the variables are correlated, ridge regression tends to outperform LASSO in prediction</li>
<li>LASSO does not take into account prior information about the structure over input or output variables</li>
</ol>

<p>Group LASSO addresses the \(1st\) and the \(4th\) problems for a specific kind of prior information</p>

<ul>
<li>Differences in the way Group LASSO and Elastic Net solve the \(1st\) Problem

<ul>
<li>Group LASSO: Prior information about group structures is needed</li>
<li>Elastic Net: Prior information is not needed and the algorithm detects the group itself</li>
</ul></li>
</ul>

  </article>
  <!-- Presenter Notes -->
</slide>

      <slide class="" id="slide-69" style="background:;">
  <hgroup>
    <h2>Beyond LASSO - Group LASSO</h2>
  </hgroup>
  <article>
    <h3>Group LASSO</h3>

<ul>
<li><b>Intuition</b>

<ul>
<li>Features are divided into \(L\) groups</li>
<li>Features within the same group should share similar coefficients</li>
</ul></li>
<li><b>Example</b>

<ul>
<li>Binary dummy variables from one single discrete variable, e.g. \(stage\_cancer \in \{1,2,3\}\) can be translated into three binary dummy variables \((stage1, stage2, stage3)\) </li>
</ul></li>
<li><b>Formulations</b>
\[obj = \left\|\mathbf{y} - \sum_{l = 1}^L \mathbf{X}_l \beta_l \right\|_2^2 + \lambda_1 \sum_{l = 1}^L\left\|\beta_l\right\|_2 + \lambda_2 \left\|\beta\right\|_1\]</li>
</ul>
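<p>The penalty part of the objective above can be evaluated directly. In this sketch the function name and the example coefficients are made up; <code>groups</code> holds one list of coefficients per group \(\beta_l\). The group term charges each group its Euclidean norm, so a group of all zeros costs nothing, while the \(l_1\) term charges every individual coefficient.</p>

```python
def group_lasso_penalty(groups, lam1, lam2):
    """lam1 * sum_l ||beta_l||_2 + lam2 * ||beta||_1 for grouped coefficients."""
    group_term = sum(sum(b * b for b in g) ** 0.5 for g in groups)
    l1_term = sum(abs(b) for g in groups for b in g)
    return lam1 * group_term + lam2 * l1_term


groups = [[3.0, 4.0], [0.0, 0.0]]             # second group is switched off
print(group_lasso_penalty(groups, 1.0, 0.0))  # group term only
print(group_lasso_penalty(groups, 1.0, 0.5))  # group term plus l1 term
```

<p>Here the group term is 5 (the norm of the first group) and the \(l_1\) term is 7, so the two calls give 5.0 and 8.5.</p>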

  </article>
  <!-- Presenter Notes -->
</slide>

      <slide class="" id="slide-70" style="background:;">
  <hgroup>
    <h2>Group LASSO - Simulation Results</h2>
  </hgroup>
  <article>
    <ul>
<li>Generate \(n = 200\) observations with \(p = 100\) predictors, divided equally into ten blocks</li>
<li>The numbers of non-zero coefficients in the blocks are 
<center><img src="assets/img/blocks.png" alt="bl" title="block"></center></li>
<li>The coefficients are either -1 or +1, with the sign being chosen randomly.</li>
<li>The predictors are standard Gaussian with correlation 0.2 within a group and zero otherwise</li>
<li>Gaussian noise with standard deviation 4.0 was added to each observation</li>
</ul>

  </article>
  <!-- Presenter Notes -->
</slide>

      <slide class="" id="slide-71" style="background:;">
  <hgroup>
    <h2>Group LASSO - Simulation Results</h2>
  </hgroup>
  <article>
    <p><center><img src="assets/img/gl.png" alt="gl" title="gl"></center></p>

<p>Group structures are not discovered by LASSO.</p>

  </article>
  <!-- Presenter Notes -->
</slide>

      <slide class="" id="slide-72" style="background:;">
  <hgroup>
    <h2>Group LASSO - Simulation Results</h2>
  </hgroup>
  <article>
    <p><center><img src="assets/img/gl2.png" alt="gl2" title="gl2"></center></p>

  </article>
  <!-- Presenter Notes -->
</slide>

      <slide class="" id="slide-73" style="background:;">
  <hgroup>
    <h2>Beyond LASSO - \(l_1\)-\(l_p\) penalization</h2>
  </hgroup>
  <article>
    <h3>Problems with LASSO</h3>

<ol>
<li>LASSO tends to select, rather arbitrarily, just one variable from a group of highly correlated variables (see how LAR works). Sometimes it is better to select <b>ALL</b> the relevant variables in a group</li>
<li>When \(p > N\), LASSO selects at most \(N\) variables, which may be undesirable when \(p \gg N\)</li>
<li>When \(N > p\) and the variables are correlated, ridge regression tends to outperform LASSO in prediction</li>
<li>LASSO does not take into account prior information about the structure over input or output variables</li>
</ol>

<p>\(l_1\)-\(l_p\) penalization solves the \(4th\) problem by dealing with prior information of structures over output variables</p>

  </article>
  <!-- Presenter Notes -->
</slide>

      <slide class="" id="slide-74" style="background:;">
  <hgroup>
    <h2>Beyond LASSO - \(l_1\)-\(l_p\) penalization</h2>
  </hgroup>
  <article>
    <h3>\(l_1\)-\(l_p\) penalization</h3>

<ul>
<li><b>Applies to multi-task learning</b>, where the goal is to estimate predictive models for several related tasks. </li>
<li><b>Examples</b>

<ul>
<li><b>Example 1</b>: recognize speech of different speakers, or handwriting of different writers</li>
<li><b>Example 2</b>: learn to control a robot for grasping different objects</li>
<li><b>Example 3</b>: learn to control a robot for driving in different landscapes</li>
</ul></li>
<li><b>Assumptions about the tasks</b>

<ul>
<li>sufficiently <i>different</i> that learning a specific model for each task results in improved performance</li>
<li><i>similar</i> enough that they share some common underlying representation that should make simultaneous learning beneficial</li>
<li>different tasks share a subset of relevant features selected from a large common space of features</li>
</ul></li>
</ul>

  </article>
  <!-- Presenter Notes -->
</slide>

      <slide class="" id="slide-75" style="background:;">
  <hgroup>
    <h2>Beyond LASSO - \(l_1\)-\(l_p\) penalization</h2>
  </hgroup>
  <article>
    <h3>\(l_1\)-\(l_p\) penalization</h3>

<ul>
<li><b>Formulation</b>

<ul>
<li>\(\mathbf{X}_l\): \(N \times p\) input matrix for task \(l = 1..L\)

<ul>
<li>\(L\) is the total number of tasks</li>
</ul></li>
<li>\(\beta\): \(p \times L\) coefficient matrix</li>
<li>\(\mathbf{y}\): \(N \times L\) output matrix</li>
<li>objective function
\[obj = \sum_{l= 1}^L J(\beta_{:l}, \mathbf{X}_l, \mathbf{y}_{:l}) + \lambda \sum_{j = 1}^p \|\beta_{j:}\|_2\]
where \(J\) is some loss function and \(\sum_{j = 1}^p \|\beta_{j:}\|_2\) is the \(l_1\) norm of the vector of row norms \((\|\beta_{1:}\|_2, \|\beta_{2:}\|_2, ..., \|\beta_{p:}\|_2)\), with \(\beta_{j:}\) the coefficients of feature \(j\) across all \(L\) tasks.</li>
</ul></li>
</ul>
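<p>To see how the row-wise penalty differs from a plain \(l_1\) penalty, both can be computed for a small \(p \times L\) coefficient matrix. The function name and the matrix below are made up for illustration; each inner list is one row \(\beta_{j:}\), i.e. the coefficients of one feature across all tasks.</p>

```python
def mixed_norms(B):
    """Return (sum_j ||B[j]||_2, sum_j ||B[j]||_1) for a p x L matrix B."""
    l1_l2 = sum(sum(b * b for b in row) ** 0.5 for row in B)
    l1_l1 = sum(abs(b) for row in B for b in row)
    return l1_l2, l1_l1


# p = 3 features, L = 2 tasks; feature 2 is unused by every task.
B = [[3.0, 4.0],
     [0.0, 0.0],
     [1.0, 0.0]]
print(mixed_norms(B))
```

<p>The \(l_1/l_2\) value is 6 and the \(l_1/l_1\) value is 8. A feature that is already used by one task is cheaper to use in another task under the \(l_1/l_2\) penalty, which is what encourages tasks to share features.</p>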

  </article>
  <!-- Presenter Notes -->
</slide>

      <slide class="" id="slide-76" style="background:;">
  <hgroup>
    <h2>Beyond LASSO - \(l_1\)-\(l_p\) penalization</h2>
  </hgroup>
  <article>
    <h3>\(l_1\)-\(l_p\) penalization - Coefficient matrix</h3>

<p><center><img src="assets/img/l1lp.png" alt="l1lp" title="l1lp"></center></p>

  </article>
  <!-- Presenter Notes -->
</slide>

      <slide class="" id="slide-77" style="background:;">
  <hgroup>
    <h2>\(l_1\)-\(l_p\) penalization - Experiment Result</h2>
  </hgroup>
  <article>
    <ul>
<li><b>Dataset</b>: handwritten words dataset collected by Rob Kassel

<ul>
<li>Contains writings from more than 180 different writers.</li>
<li>For each writer, there are between 4 and 30 samples of each letter</li>
<li>The letters are originally represented as \(8 \times 16\) pixel images</li>
</ul></li>
<li><b>Task</b>: build binary classifiers that discriminate between pairs of letters, concentrating specifically on the pairs of letters that are the most difficult to distinguish when written by hand.<br></li>
<li><b>Experiment</b>: learned classifications of 9 pairs of letters for 40 different writers</li>
</ul>

<p><center><img src="assets/img/write.png" alt="write" title="write"></center></p>

  </article>
  <!-- Presenter Notes -->
</slide>

      <slide class="" id="slide-78" style="background:;">
  <hgroup>
    <h2>\(l_1\)-\(l_p\) penalization - Experiment Result</h2>
  </hgroup>
  <article>
    <ul>
<li><b>Candidate methods</b>

<ul>
<li>Pooled \(l_1\): a single classifier is trained on all data, regardless of writer</li>
<li>Independent \(l_1\) regularization: a separate classifier is trained for each writer</li>
<li>\(l_1/l_1\)-regularization:
\[obj = \sum_{l= 1}^L J(\beta_{:l}, \mathbf{X}_l, \mathbf{y}_{:l}) + \lambda \sum_{l = 1}^L \|\beta_{:l}\|_1\]

<ul>
<li>simply sums the \(l_1\)-regularized objective functions of the \(L\) tasks</li>
</ul></li>
<li>\(l_1/l_2\)-regularization:
\[obj = \sum_{l= 1}^L J(\beta_{:l}, \mathbf{X}_l, \mathbf{y}_{:l}) + \lambda \sum_{j = 1}^p \|\beta_{j:}\|_2\]<br></li>
</ul></li>
</ul>

<hr>

  </article>
  <!-- Presenter Notes -->
</slide>

      <slide class="" id="slide-79" style="background:;">
  <hgroup>
    <h2>\(l_1\)-\(l_p\) penalization - Experiment Result</h2>
  </hgroup>
  <article>
    <p><center><img src="assets/img/l1lp_re.png" alt="l1lp_re" title="l1lp_re"></center></p>

<ul>
<li>Within a cell,  the first row contains results for feature selection, the second row uses random projections to obtain a common subspace (details omitted, see paper: Multi-task feature selection)</li>
<li>Bold: best of \(l_1/l_2\),\(l_1/l_1\), \(sp.l_1\) or pooled \(l_1\), Boxed : best of cell</li>
</ul>

  </article>
  <!-- Presenter Notes -->
</slide>

      <slide class="" id="slide-80" style="background:;">
  <hgroup>
    <h2>Beyond LASSO - Graph-Guided Fused LASSO</h2>
  </hgroup>
  <article>
    <h3>Problems with LASSO</h3>

<ol>
<li>LASSO tends to select, rather arbitrarily, just one variable from a group of highly correlated variables (see how LAR works). Sometimes it is better to select <b>ALL</b> the relevant variables in a group</li>
<li>When \(p > N\), LASSO selects at most \(N\) variables, which may be undesirable when \(p \gg N\)</li>
<li>When \(N > p\) and the variables are correlated, ridge regression tends to outperform LASSO in prediction</li>
<li>LASSO does not take into account prior information about the structure over input or output variables</li>
</ol>

<p>Graph-Guided Fused LASSO (GFLASSO) solves the \(4th\) problem by dealing with prior information of structures over output variables</p>

<ul>
<li>More general than \(l_1/l_p\), as arbitrary graphical structures over the output variables can be encoded as priors in GFLASSO</li>
</ul>

  </article>
  <!-- Presenter Notes -->
</slide>

      <slide class="" id="slide-81" style="background:;">
  <hgroup>
    <h2>Beyond LASSO - GFLASSO</h2>
  </hgroup>
  <article>
    <h3>Graph-Guided Fused LASSO</h3>

<ul>
<li><b>Example</b>
<center><img src="assets/img/gflasso.png" alt="gfLASSO" title="gfLASSO"></center></li>
<li><b>Formulation</b>
Graph-Guided Fused LASSO applies to multi-task settings with \(L\) tasks, whose outputs are the nodes of a graph with edge set \(E\):
\[obj = \sum_{l= 1}^L loss(\beta_{:l}, \mathbf{X}_l, \mathbf{y}_{:l}) + \lambda \|\beta\|_1+\gamma \sum_{(a,b)\in E} \tau(r_{ab}) \sum_{j = 1}^p |\beta_{ja} - \operatorname{sign}(r_{ab})\beta_{jb}|\]
where \(r_{ab} \in \mathbb{R}\) denotes the weight of edge \((a,b)\) and \(\tau(r)\) can be any user-specified positive, monotonically increasing function of \(|r|\)

<ul>
<li>e.g. \(\tau(r) = |r|\).</li>
</ul></li>
</ul>
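<p>To make the two penalty terms concrete, the following is a minimal sketch that evaluates them for a small coefficient matrix. It assumes \(\tau(r) = |r|\) and leaves out the loss term; the function name <code>gflasso_penalty</code>, the edge dictionary, and the example numbers are illustrative and not taken from the paper.</p>

<pre><code class="python">import math


def gflasso_penalty(beta, edges, lam, gamma, tau=abs):
    """Evaluate the l1 term plus the graph-guided fusion term.

    beta[j][k] is the coefficient of feature j for task k.
    edges maps a pair of task indices (a, b) to the edge weight r_ab.
    """
    # lam * ||beta||_1: plain LASSO penalty over all entries
    l1 = sum(abs(b) for row in beta for b in row)

    # gamma * sum over edges of tau(r_ab) * sum_j |beta_ja - sign(r_ab) * beta_jb|
    fusion = 0.0
    for (a, b), r in edges.items():
        s = 0.0 if r == 0 else math.copysign(1.0, r)
        fusion += tau(r) * sum(abs(row[a] - s * row[b]) for row in beta)

    return lam * l1 + gamma * fusion


# Hypothetical example: 2 features, 3 tasks.
beta = [[1.0, 1.0, -1.0],
        [0.0, 2.0, 0.0]]
# Tasks 0 and 1 are positively related, tasks 1 and 2 negatively related.
edges = {(0, 1): 0.5, (1, 2): -1.0}

# l1 term = 5.0, fusion term = 0.5 * 2.0 + 1.0 * 2.0 = 3.0
print(gflasso_penalty(beta, edges, lam=1.0, gamma=2.0))  # 11.0
</code></pre>

<p>Note that a negative edge weight flips the sign inside the fusion term, so negatively related outputs are encouraged to have coefficients of equal magnitude and opposite sign.</p>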

  </article>
  <!-- Presenter Notes -->
</slide>

      <slide class="" id="slide-82" style="background:;">
  <hgroup>
    <h2>Beyond LASSO - GFLASSO</h2>
  </hgroup>
  <article>
    <h3>GFLASSO</h3>

<p><center><img src="assets/img/gflasso_re.png" alt="gfLASSO_re" title="gfLASSO_re"></center></p>


<div style='float:left;width:48%;' class='centered'>
  <ul>
<li>(a) The true regression coefficients</li>
<li>(c) \(l_1/l_2\)-regularized multi-task regression</li>
</ul>


</div>
<div style='float:right;width:48%;'>
  <ul>
<li>(b) LASSO</li>
<li>(d) GFLASSO</li>
</ul>

</div>
  </article>
  <!-- Presenter Notes -->
</slide>

      <slide class="" id="slide-83" style="background:;">
  <hgroup>
    <h2>Summary</h2>
  </hgroup>
  <article>
    <h3>Outline</h3>

<ul>
<li><h3>Introduction to Dimensionality Reduction</h3></li>
<li><h3>Linear Regression and Least Squares (Review)</h3></li>
<li><h3>Subset Selection</h3></li>
<li><h3>Shrinkage Methods</h3></li>
<li><h3>Beyond LASSO</h3></li>
</ul>

  </article>
  <!-- Presenter Notes -->
</slide>

      <slide class="" id="slide-84" style="background:;">
  <hgroup>
    <h2>Summary</h2>
  </hgroup>
  <article>
    <ul>
<li>Feature selection vs feature extraction

<ul>
<li>Feature selection: can save cost and is easier to interpret</li>
<li>Feature extraction: more general, often leads to better performance</li>
</ul></li>
<li>Linear models: Least Squares, Subset Selection, Ridge, LASSO:

<ul>
<li>Least Squares is unbiased, but can have high variance (as it includes all features)</li>
<li>Ridge (\(l_2\) regularization): constrains parameter values to reduce variance</li>
<li>Subset Selection, LASSO: find a subset of features (to reduce variance)</li>
<li>LASSO uses \(l_1\) regularization </li>
</ul></li>
<li>LAR is like LASSO (\(l_1\) regularization), but 

<ul>
<li>Their behaviors differ when a coefficient hits zero</li>
<li>LASSO modification of LAR: drop a feature when its coefficient hits zero</li>
</ul></li>
</ul>
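<p>The difference between shrinking and selecting is easiest to see in the special case of orthonormal inputs (an assumption made here only for illustration). There, Ridge rescales each least-squares coefficient \(\hat\beta_j\) to \(\hat\beta_j/(1+\lambda)\), while LASSO soft-thresholds it to \(\operatorname{sign}(\hat\beta_j)(|\hat\beta_j| - \lambda)_+\). A minimal sketch, with made-up coefficient values; the exact threshold depends on how the objective is scaled:</p>

<pre><code class="python">import math


def ridge_shrink(b, lam):
    """Ridge with orthonormal inputs: proportional shrinkage."""
    return b / (1.0 + lam)


def lasso_shrink(b, lam):
    """LASSO with orthonormal inputs: soft-thresholding."""
    magnitude = max(abs(b) - lam, 0.0)
    # Return a clean 0.0 (not -0.0) when the coefficient is removed.
    return math.copysign(magnitude, b) if magnitude else 0.0


# Hypothetical least-squares coefficients.
ols = [3.0, -1.5, 0.4, -0.2]
lam = 0.5

ridge = [ridge_shrink(b, lam) for b in ols]
lasso = [lasso_shrink(b, lam) for b in ols]

print("ridge:", ridge)  # every coefficient shrinks, none becomes zero
print("lasso:", lasso)  # [2.5, -1.0, 0.0, 0.0]
</code></pre>

<p>Ridge keeps all four features with smaller coefficients, whereas LASSO sets the two small coefficients exactly to zero, which is what makes it a feature selection method.</p>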

  </article>
  <!-- Presenter Notes -->
</slide>

      <slide class="" id="slide-85" style="background:;">
  <hgroup>
    <h2>Summary II</h2>
  </hgroup>
  <article>
    <ul>
<li>QP solves LASSO for a single \(\lambda\), while LAR can solve LASSO for all \(\lambda\)</li>
<li>Bayesian prior interpretation for Subset Selection, Ridge and LASSO</li>
<li>Beyond LASSO (all use \(l_1\) regularization)

<ul>
<li>Elastic Net: combines \(l_1\) and \(l_2\) penalties</li>
<li>fused LASSO: coefficients of adjacent features are encouraged to be similar</li>
<li>group LASSO: features within a group are selected or dropped together</li>
<li>\(l_1/l_2\): exploits similarities across tasks in multi-task learning</li>
<li>GFLASSO: incorporates structure over the output variables</li>
</ul></li>
</ul>
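<p>As a compact reminder of how these variants differ, the sketch below evaluates three of the penalties on the same coefficient vector. It uses commonly cited forms (naive Elastic Net, fused LASSO with first differences, group LASSO weighted by the square root of the group size); the function names, group layout, and numbers are illustrative assumptions, and the scaling conventions may differ from those on the earlier slides.</p>

<pre><code class="python">import math


def elastic_net(beta, lam1, lam2):
    """lam1 * ||beta||_1 + lam2 * ||beta||_2^2"""
    return lam1 * sum(abs(b) for b in beta) + lam2 * sum(b * b for b in beta)


def fused_lasso(beta, lam1, lam2):
    """lam1 * ||beta||_1 + lam2 * sum_j |beta_j - beta_(j-1)|"""
    jumps = sum(abs(b1 - b0) for b0, b1 in zip(beta, beta[1:]))
    return lam1 * sum(abs(b) for b in beta) + lam2 * jumps


def group_lasso(beta, groups, lam):
    """lam * sum_g sqrt(size of g) * ||beta_g||_2"""
    total = 0.0
    for idx in groups:
        norm = math.sqrt(sum(beta[j] ** 2 for j in idx))
        total += math.sqrt(len(idx)) * norm
    return lam * total


# Hypothetical coefficients: the first group is active, the second is not.
beta = [3.0, 4.0, 0.0, 0.0]
groups = [[0, 1], [2, 3]]

print(elastic_net(beta, 1.0, 1.0))     # 7.0 + 25.0 = 32.0
print(fused_lasso(beta, 1.0, 1.0))     # 7.0 + (1.0 + 4.0 + 0.0) = 12.0
print(group_lasso(beta, groups, 1.0))  # sqrt(2) * 5.0, the inactive group adds 0
</code></pre>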

  </article>
  <!-- Presenter Notes -->
</slide>

      <slide class="" id="slide-86" style="background:;">
  <hgroup>
    <h2>More on the topics skipped here</h2>
  </hgroup>
  <article>
    <ul>
<li>More on feature extraction methods: 

<ul>
<li><a href="http://www.cs.otago.ac.nz/cosc453/student_tutorials/principal_components.pdf">http://www.cs.otago.ac.nz/cosc453/student_tutorials/principal_components.pdf</a></li>
<li>Imola K. Fodor, A survey of Dimensionality Reduction techniques</li>
<li>Christopher J. C. Burges, Dimensionality Reduction: A Guided Tour</li>
</ul></li>
<li>Mutual-info-based feature selection: 

<ul>
<li>Gavin Brown, Adam Pocock, Ming-Jie Zhao, Mikel Luján; Conditional Likelihood Maximisation: A Unifying Framework for Information Theoretic Feature Selection</li>
<li>Howard Hua Yang, John Moody. Feature Selection Based on Joint Mutual Information</li>
<li>Hanchuan Peng, Fuhui Long, and Chris Ding. Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy</li>
</ul></li>
<li>Beyond LASSO

<ul>
<li><a href="http://webdocs.cs.ualberta.ca/%7Emahdavif/ReadingGroup/">http://webdocs.cs.ualberta.ca/~mahdavif/ReadingGroup/</a></li>
</ul></li>
<li>ELEN E6898 Sparse Signal Modeling 

<ul>
<li><a href="https://sites.google.com/site/eecs6898sparse2011/home">https://sites.google.com/site/eecs6898sparse2011/home</a></li>
</ul></li>
</ul>

  </article>
  <!-- Presenter Notes -->
</slide>

      <slide class="" id="slide-87">
<hgroup>
  <h2>Sparse Models</h2>
</hgroup>
<article class = 'flexbox vcenter'>
<h3>Thank You!</h3>

</article>
<!-- Presenter Notes -->
</slide>
      <slide class="" id="slide-88" style="background:;">
  <hgroup>
    <h2>Reference</h2>
  </hgroup>
  <article>
    <ul>
<li>Trevor Hastie, Robert Tibshirani and Jerome Friedman. Elements of Statistical Learning <font color = 'green'>[p7, p15, p16, p18, p19, p21-22, p26-27, p29-30, p33, p35-37, p42-p43, p50-p54, p56, p59]</font></li>
<li>Temporal Sequence of FMRI scans (single slice): from <a href="http://www.midwest-medical.net/mri.sagittal.head.jpg">http://www.midwest-medical.net/mri.sagittal.head.jpg</a> <font color = 'green'>[p8]</font></li>
<li>Three Dimensional Image of Brain Activation from <a href="http://www.fmrib.ox.ac.uk/fmri_intro/brief.html">http://www.fmrib.ox.ac.uk/fmri_intro/brief.html</a> <font color = 'green'>[p8]</font></li>
<li><a href="http://en.wikipedia.org/wiki/Feature_selection">http://en.wikipedia.org/wiki/Feature_selection</a> <font color = 'green'>[p10-12]</font></li>
<li><a href="http://en.wikipedia.org/wiki/Singular_value_decomposition">http://en.wikipedia.org/wiki/Singular_value_decomposition</a> <font color = 'green'>[p27]</font></li>
<li><a href="http://en.wikipedia.org/wiki/Normal_distribution">http://en.wikipedia.org/wiki/Normal_distribution</a> <font color = 'green'>[p38]</font></li>
<li><a href="http://en.wikipedia.org/wiki/Laplacian_distribution">http://en.wikipedia.org/wiki/Laplacian_distribution</a> <font color = 'green'>[p38]</font></li>
<li><a href="http://webdocs.cs.ualberta.ca/%7Emahdavif/ReadingGroup/Papers/larS.pdf">http://webdocs.cs.ualberta.ca/~mahdavif/ReadingGroup/Papers/larS.pdf</a> <font color = 'green'>[p20]</font></li>
<li>Bradley Efron, Trevor Hastie, Iain Johnstone and Robert Tibshirani. Least Angle Regression <font color = 'green'>[p20]</font></li>
<li><a href="http://www.stanford.edu/%7Ehastie/TALKS/larstalk.pdf">http://www.stanford.edu/~hastie/TALKS/larstalk.pdf</a></li>
</ul>

  </article>
  <!-- Presenter Notes -->
</slide>

      <slide class="" id="slide-89" style="background:;">
  <hgroup>
    <h2>Reference</h2>
  </hgroup>
  <article>
    <ul>
<li>Kevin P. Murphy. Machine Learning: A Probabilistic Perspective <font color = 'green'>[p59]</font></li>
<li>Prof. Schuurmans&#39; notes on LASSO <font color = 'green'>[p40]</font></li>
<li>Gavin Brown, Adam Pocock, Ming-Jie Zhao, Mikel Luján. Conditional Likelihood Maximisation: A Unifying Framework for Information Theoretic Feature Selection <font color = 'green'>[p8]</font></li>
<li>Hui Zou and Trevor Hastie. Regularization and Variable Selection via the Elastic Net <font color = 'green'>[p59-62]</font></li>
<li><a href="http://www.stanford.edu/%7Ehastie/TALKS/enet_talk.pdf">http://www.stanford.edu/~hastie/TALKS/enet_talk.pdf</a> <font color = 'green'>[p59-62]</font></li>
<li>Robert Tibshirani and Michael Saunders. Sparsity and smoothness via the fused LASSO <font color = 'green'>[p63-p65]</font></li>
<li>Jerome Friedman, Trevor Hastie and Robert Tibshirani. A note on the group LASSO and a sparse group LASSO <font color = 'green'>[p66-68]</font></li>
<li>Guillaume Obozinski, Ben Taskar, and Michael Jordan. Multi-task feature selection <font color = 'green'>[p69-70, p72-p75]</font></li>
<li>Xi Chen, Seyoung Kim, Qihang Lin, Jaime G. Carbonell, Eric P. Xing. Graph-Structured Multi-task Regression and an Efficient Optimization Method for General Fused LASSO <font color = 'green'>[p76-77]</font></li>
</ul>

  </article>
  <!-- Presenter Notes -->
</slide>

    <slide class="backdrop"></slide>
  </slides>

  <!--[if IE]>
    <script 
      src="http://ajax.googleapis.com/ajax/libs/chrome-frame/1/CFInstall.min.js">  
    </script>
    <script>CFInstall.check({mode: 'overlay'});</script>
  <![endif]-->
</body>
<!-- Grab CDN jQuery, fall back to local if offline -->
<script src="http://ajax.aspnetcdn.com/ajax/jQuery/jquery-1.7.min.js"></script>
<script>window.jQuery || document.write('<script src="libraries/widgets/quiz/js/jquery-1.7.min.js"><\/script>')</script>
<!-- Load Javascripts for Widgets -->
<!-- MathJax: Fall back to local if CDN offline but local image fonts are not supported (saves >100MB) -->
<script type="text/x-mathjax-config">
  MathJax.Hub.Config({
    tex2jax: {
      inlineMath: [['$','$'], ['\\(','\\)']],
      processEscapes: true
    }
  });
</script>
<script type="text/javascript" src="http://cdn.mathjax.org/mathjax/2.0-latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML"></script>
<!-- <script src="https://c328740.ssl.cf1.rackcdn.com/mathjax/2.0-latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML">
</script> -->
<script>window.MathJax || document.write('<script type="text/x-mathjax-config">MathJax.Hub.Config({"HTML-CSS":{imageFont:null}});<\/script><script src="libraries/widgets/mathjax/MathJax.js?config=TeX-AMS-MML_HTMLorMML"><\/script>')
</script>
<!-- LOAD HIGHLIGHTER JS FILES -->
<script src="libraries/highlighters/highlight.js/highlight.pack.js"></script>
<script>hljs.initHighlightingOnLoad();</script>
<!-- DONE LOADING HIGHLIGHTER JS FILES -->
</html>