diff --git a/cp b/cp
new file mode 100644
index 0000000..295ce1a
Binary files /dev/null and b/cp differ
diff --git a/mlm b/mlm
new file mode 100644
index 0000000..dd84aff
Binary files /dev/null and b/mlm differ
diff --git a/model.r b/model.r
index 40201fd..9d6be05 100644
--- a/model.r
+++ b/model.r
@@ -622,3 +622,29 @@ mlm <- glmer(desert ~ CTA_counts + crime + vacant_counts +
print(paste('AIC mlm:', AIC(mlm)))
+
+summary(glm(desert ~ Birth.Rate +
+ General.Fertility.Rate +
+ Low.Birth.Weight +
+ Prenatal.Care.Beginning.in.First.Trimester +
+ Preterm.Births +
+ Teen.Birth.Rate +
+ Assault..Homicide. +
+ Breast.cancer.in.females +
+ Cancer..All.Sites. +
+ Colorectal.Cancer +
+ Diabetes.related +
+ Firearm.related +
+ Infant.Mortality.Rate +
+ Lung.Cancer +
+ Prostate.Cancer.in.Males +
+ Stroke..Cerebrovascular.Disease. +
+ Tuberculosis +
+ Below.Poverty.Level +
+ Crowded.Housing +
+ Dependency +
+ No.High.School.Diploma +
+ Per.Capita.Income +
+ Unemployment,
+ data = model_data_scale,
+ family = 'binomial'))
diff --git a/np b/np
new file mode 100644
index 0000000..a76dd55
Binary files /dev/null and b/np differ
diff --git a/paper.aux b/paper.aux
index fba9a62..b1d5010 100644
--- a/paper.aux
+++ b/paper.aux
@@ -24,7 +24,7 @@
\newlabel{ppresult}{{3}{8}}
\@writefile{lot}{\contentsline {table}{\numberline {4}{\ignorespaces Hierarchical Model Summary}}{8}}
\newlabel{mlm}{{4}{8}}
-\@writefile{lot}{\contentsline {table}{\numberline {5}{\ignorespaces Model AICs}}{9}}
-\newlabel{AICs}{{5}{9}}
-\@writefile{lot}{\contentsline {table}{\numberline {6}{\ignorespaces Model Cross Validated MSEs}}{9}}
-\newlabel{MSEs}{{6}{9}}
+\@writefile{lot}{\contentsline {table}{\numberline {5}{\ignorespaces Model AICs}}{8}}
+\newlabel{AICs}{{5}{8}}
+\@writefile{lot}{\contentsline {table}{\numberline {6}{\ignorespaces Model Cross Validated MSEs}}{8}}
+\newlabel{MSEs}{{6}{8}}
diff --git a/paper.log b/paper.log
index ba90791..d7d47b9 100644
--- a/paper.log
+++ b/paper.log
@@ -1,4 +1,4 @@
-This is pdfTeX, Version 3.14159265-2.6-1.40.17 (TeX Live 2016) (preloaded format=pdflatex 2016.5.22) 22 NOV 2016 15:25
+This is pdfTeX, Version 3.14159265-2.6-1.40.17 (TeX Live 2016) (preloaded format=pdflatex 2016.5.22) 22 NOV 2016 18:30
entering extended mode
restricted \write18 enabled.
file:line:error style messages enabled.
@@ -306,7 +306,7 @@ Underfull \hbox (badness 10000) in paragraph at lines 104--105
Here is how much of TeX's memory you used:
2630 strings out of 493014
36230 string characters out of 6133351
- 113228 words of memory out of 5000000
+ 115228 words of memory out of 5000000
6147 multiletter control sequences out of 15000+600000
9369 words of font info for 34 fonts, out of 8000000 for 9000
1141 hyphenation exceptions out of 8191
@@ -316,18 +316,20 @@ r/local/texlive/2016/texmf-dist/fonts/type1/public/amsfonts/cm/cmbx10.pfb>
-Output written on paper.pdf (11 pages, 7826959 bytes).
+cal/texlive/2016/texmf-dist/fonts/type1/public/amsfonts/cm/cmmi5.pfb>
+Output written on paper.pdf (11 pages, 7843038 bytes).
PDF statistics:
- 99 PDF objects out of 1000 (max. 8388607)
- 63 compressed objects within 1 object stream
+ 107 PDF objects out of 1000 (max. 8388607)
+ 69 compressed objects within 1 object stream
0 named destinations out of 1000 (max. 500000)
26 words of extra memory for PDF output out of 10000 (max. 10000000)
diff --git a/paper.pdf b/paper.pdf
index b948dbc..528e10e 100644
Binary files a/paper.pdf and b/paper.pdf differ
diff --git a/paper.tex b/paper.tex
index 05a1681..e3ae6dc 100644
--- a/paper.tex
+++ b/paper.tex
@@ -103,7 +103,6 @@ \subsubsection*{Neighborhood level data}
\paragraph{ Race by Community Area }
This file contains a record for every neighborhood in Chicago with the number of residents of each race who reside in that neighborhood. \\
-We tried to gather data on crimes and use that information in the model, however the available dataset for crimes in chicago is rather large ($>$1 GB) and we didn't have time to finish extracting features from that model. We hypothesized that food deserts were more likely to be in high crime areas.
\subsection*{Generalized Linear Models}
@@ -124,33 +123,35 @@ \subsubsection*{Complete Pooling}
To begin we have the simplest model: ordinary regression using only the block-level variables. This model pools together every neighborhood as if the neighborhood distinctions don't matter.
-$$ y_{ij} = \text{logit}^{-1}\left( \alpha + X_{B}\beta_{B} + \epsilon_{ij} \right) $$
+$$ y_{ij} = \text{logit}^{-1}\left( \alpha + X_{B}\beta_{B} \right) $$
-Where $\epsilon_{ij} \sim N(0, \sigma^2)$
+% Where $\epsilon_{ij} \sim N(0, \sigma^2)$
\subsubsection*{No Pooling}
The next model has a different but nonrandom intercept for each neighborhood, a fixed effect for that neighborhood. This would correspond to our belief that the neighborhoods are each different from the others.
-$$ y_{i} = \text{logit}^{-1}\left( \alpha + X_{B}\beta_{B} + \gamma_j + \epsilon_{ij} \right) $$
+$$ y_{i} = \text{logit}^{-1}\left( \alpha + X_{B}\beta_{B} + \gamma_j \right) $$
-Where $\epsilon_i \sim N(0, \sigma^2)$
+% Where $\epsilon_i \sim N(0, \sigma^2)$
\subsubsection*{Partial pooling}
The next model has a random intercept for each neighborhood which corresponds to partially pooling the data together. For every neighborhood we use some of the information in other neighborhoods to estimate its intercept. That is, the intercepts in the previous model are shrunk toward the common mean.
-$$ y_{i} = \text{logit}^{-1}\left( \alpha_{j[i]} + X_{B}\beta_{B} + \epsilon_i \right) $$
+$$ y_{i} = \text{logit}^{-1}\left( \alpha_{j[i]} + X_{B}\beta_{B} \right) $$
-Where $\epsilon_i \sim N(0, \sigma^2)$ and $\alpha_j \sim N(0, \sigma^2_\alpha)$
+Where % $\epsilon_i \sim N(0, \sigma^2)$ and
+$\alpha_j \sim N(\mu_\alpha, \sigma^2_\alpha)$
\subsubsection*{Hierarchical}
The final and most complicated model that was fit was a hierarchical model including the neighborhood level predictors in estimating the random intercept for each neighborhood.
-$$ y_{i} = \text{logit}^{-1}\left( \alpha_{j[i]} + X_{B}\beta_{B} + \epsilon_i \right) $$
+$$ y_{i} = \text{logit}^{-1}\left( \alpha_{j[i]} + X_{B}\beta_{B} \right) $$
-Where $\epsilon_i \sim N(0, \sigma^2)$ and $\alpha_j \sim N(X_N \beta_N, \sigma^2_\alpha)$
+Where % $\epsilon_i \sim N(0, \sigma^2)$ and
+$\alpha_j \sim N(X_N \beta_N, \sigma^2_\alpha)$
\subsection*{Model Comparison}
@@ -544,6 +545,8 @@ \subsection*{Model Comparison}
\end{tabular}
\end{table}
+The variance ratio $\frac{\sigma^2_\alpha}{\sigma^2_y}$ in the hierarchical model is approximately 2.97, which means that for a neighborhood of more than $1/2.97 \approx 0.33$ city blocks, the within-neighborhood model is more informative. This indicates that pooling and the hierarchical structure may not be completely effective.
+
\section*{Conclusions}
In terms of cross-validated accuracy, the hierarchical model was more accurate on average on new city blocks than the other 3 models, indicating support for the hierarchical structure of the data. However, the evidence was not as strong as the author would have liked. Consider the model summary in Table \ref{mlm}. We see that food deserts tend to be located in neighborhoods with higher incidences of all-site cancer. Perhaps surprisingly, in the presence of the other information, a block in a neighborhood with higher incidences of diabetes was less likely to be in a food desert. City blocks in neighborhoods that are more populous (TOTAL.POPULATION) are less likely to be food deserts. Finally, blocks in neighborhoods with higher rates of dependency (\% of the population younger than 18 or older than 64) are more likely to be in food deserts.
@@ -555,5 +558,6 @@ \section*{Conclusions}
\section*{Future Work}
Because we lack data on grocery stores outside the city limits, the food desert status of city blocks near the borders could be affected.
+Additionally, the model is not yet strong enough to be useful, and much more work would be needed to make it so.
\end{document}
\ No newline at end of file
diff --git a/pp b/pp
new file mode 100644
index 0000000..74ec993
Binary files /dev/null and b/pp differ
diff --git a/presentation.Rmd b/presentation.Rmd
new file mode 100644
index 0000000..7d50f9f
--- /dev/null
+++ b/presentation.Rmd
@@ -0,0 +1,133 @@
+---
+title: "Food Deserts in Chicago"
+author: "Daniel Berry"
+output: revealjs::revealjs_presentation
+
+---
+
+# Introduction
+
+## What is a Food Desert?
+- In general: a place where it is more difficult to access healthy food
+- Definition used for this work: a city block located more than 1 mile from a supermarket
+ - Supermarket defined as a grocery store larger than 10,000 sq ft
+ - Distance is great circle distance between center of grocery store and center of city block
+ - Other definitions exist that also cover rural areas and take into account car ownership (it is harder to travel without a car)
+
+## Where are food deserts in Chicago?
+
+```{r, out.width = "600px", echo = FALSE}
+knitr::include_graphics("deserts_plot.png")
+```
+
+# Chicago Demographics
+
+## Distribution of Black People
+
+```{r, out.width = "600px", echo = FALSE}
+knitr::include_graphics("pct_black_plot.png")
+```
+
+## Distribution of White People
+
+```{r, out.width = "600px", echo = FALSE}
+knitr::include_graphics("pct_white_plot.png")
+```
+
+## Income
+
+```{r, out.width = "600px", echo = FALSE}
+knitr::include_graphics("income_plot.png")
+```
+
+## Vacancy
+```{r, out.width = "600px", echo = FALSE}
+knitr::include_graphics("vacant_plot.png")
+```
+
+# Data
+
+## Source
+All data from the [Chicago Open Data Portal](https://data.cityofchicago.org). Several files:
+
+- Crimes 2001 - Present
+- 311 Service Requests - Vacant Buildings
+- CTA Ridership Avg Weekly Boardings Oct 2010
+- City Block Population
+- Public Health Statistics Selected Indicators
+- Census Data: Selected Socioeconomic Indicators
+- Race by Community Area
+
+# Models
+
+## Complete Pooling
+
+$$ y_{ij} = \text{logit}^{-1}\left( \alpha + X_{B}\beta_{B} \right) $$
+
+## No Pooling
+
+$$ y_{i} = \text{logit}^{-1}\left( \alpha + X_{B}\beta_{B} + \gamma_j \right) $$
+
+## Partial Pooling
+
+$$ y_{i} = \text{logit}^{-1}\left( \alpha_{j[i]} + X_{B}\beta_{B} \right) $$
+
+Where $\alpha_j \sim N(\mu_\alpha, \sigma^2_\alpha)$
+
+## Hierarchical
+
+$$ y_{i} = \text{logit}^{-1}\left( \alpha_{j[i]} + X_{B}\beta_{B} \right) $$
+Where $\alpha_j \sim N(X_N \beta_N, \sigma^2_\alpha)$
+
+# Model summaries
+
+## Complete Pooling
+
+```{r, echo = FALSE}
+library(lme4)
+load('cp')
+summary(cp)
+```
+
+## No Pooling
+
+```{r, echo = FALSE}
+load('np')
+summary(np)
+```
+
+## Partial Pooling
+
+```{r, echo = FALSE}
+load('pp')
+summary(pp)
+```
+
+## Hierarchical
+
+```{r, echo = FALSE}
+load('mlm')
+summary(mlm)
+```
+
+# Results
+
+## Was pooling effective?
+
+- Variance ratio $\sigma^2_\alpha / \sigma^2_y \approx 3$ indicates much higher variability between neighborhoods than within a neighborhood.
+
+- AICs for random intercept models were higher than no pooling model.
+
+- Cross-validated MSEs (or [Brier Scores](https://en.wikipedia.org/wiki/Brier_score)) quantify how accurate the models are on previously unseen city blocks within a neighborhood:
+ - Complete Pooling: 0.07382216
+ - No Pooling: 0.05328587
+ - Partial Pooling: 0.05329956
+ - Hierarchical: 0.05323632
+
+## Thoughts
+
+- Models are an improvement over just using neighborhood level variables, but I'm not convinced that the hierarchical model is an improvement over simpler no pooling or partial pooling.
+
+- Unfortunately, this project doesn't give us any information about the causes of food deserts that we didn't already know. More importantly, this project doesn't help resolve the issue at all.
+
+# Questions?
diff --git a/presentation.html b/presentation.html
new file mode 100644
index 0000000..b49c13a
--- /dev/null
+++ b/presentation.html
@@ -0,0 +1,497 @@
+
+
+
Variance ratio \(\sigma^2_\alpha / \sigma^2_y \approx 3\) indicates much higher variability between neighborhoods than within a neighborhood.
+
AICs for random intercept models were higher than no pooling model.
+
Cross-validated MSEs (or Brier Scores) quantify how accurate the models are on previously unseen city blocks within a neighborhood:
+
+
Complete Pooling: 0.07382216
+
No Pooling: 0.05328587
+
Partial Pooling: 0.05329956
+
Hierarchical: 0.05323632
+
+
+
+
Thoughts
+
+
Models are an improvement over just using neighborhood level variables, but I’m not convinced that the hierarchical model is an improvement over simpler no pooling or partial pooling.
+
Unfortunately, this project doesn’t give us any information about the causes of food deserts that we didn’t already know. More importantly, this project doesn’t help resolve the issue at all.