diff --git a/.gitignore b/.gitignore index d4e7113..0c10aba 100644 --- a/.gitignore +++ b/.gitignore @@ -1,8 +1,8 @@ inst/ -Meta .Rproj.user .Rhistory .RData *.el notes.org doc +/Meta/ diff --git a/R/FRD_lp.R b/R/FRD_lp.R index cca1145..79aae6e 100644 --- a/R/FRD_lp.R +++ b/R/FRD_lp.R @@ -128,10 +128,10 @@ FRDHonest <- function(formula, data, subset, weights, cutoff=0, M, #' class \code{"RDBW"} is a list containing the following components: #' #' \describe{ -#' \item{\code{hp}}{bandwidth for observations above cutoff} +#' \item{\code{hp}}{bandwidth for observations weakly above cutoff} #' -#' \item{\code{hm}}{bandwidth for observations below cutoff, equal to -#' \code{hp} unless \code{bw.equal==FALSE}} +#' \item{\code{hm}}{bandwidth for observations strictly below cutoff, equal +#' to \code{hp} unless \code{bw.equal==FALSE}} #' #' \item{\code{sigma2m}, \code{sigma2p}}{estimate of conditional variance #' just above and just below cutoff, \eqn{\sigma^2_+(0)} and diff --git a/R/RD_lp.R b/R/RD_lp.R index 6278c60..93b0bee 100644 --- a/R/RD_lp.R +++ b/R/RD_lp.R @@ -135,9 +135,9 @@ RDHonest <- function(formula, data, subset, weights, cutoff=0, M, #' class \code{"RDBW"} is a list containing the following components: #' #' \describe{ -#' \item{\code{hp}}{bandwidth for observations above cutoff} +#' \item{\code{hp}}{bandwidth for observations strictly above cutoff} #' -#' \item{\code{hm}}{bandwidth for observations below cutoff, equal to +#' \item{\code{hm}}{bandwidth for observations weakly below cutoff, equal to #' \code{hp} unless \code{bw.equal==FALSE}} #' #' \item{\code{sigma2m}, \code{sigma2p}}{estimate of conditional variance diff --git a/doc/RDHonest.R b/doc/RDHonest.R index 7381e7c..7233095 100644 --- a/doc/RDHonest.R +++ b/doc/RDHonest.R @@ -78,7 +78,7 @@ RDHonest(voteshare ~ margin, data=lee08, kern="uniform", M=M, sclass="H", opt.cr ## ----------------------------------------------------------------------------- -## Add variance estimate to the lee data so that the RDSmoothnessBound +## Add variance estimate to the Lee (2008) data so that the RDSmoothnessBound ## function doesn't have to compute them each time dl <- NPRPrelimVar.fit(dl, se.initial="nn") diff --git a/doc/RDHonest.Rmd b/doc/RDHonest.Rmd index 431249f..74e5347 100644 --- a/doc/RDHonest.Rmd +++ b/doc/RDHonest.Rmd @@ -47,10 +47,10 @@ In the sharp regression discontinuity model, we observe units $i=1,\dotsc,n$, with the outcome $y_i$ for the $i$th unit given by $$ y_i = f(x_i) + u_i, $$ where $f(x_i)$ is the expectation of $y_i$ conditional on the running variable $x_i$ and $u_i$ is the regression error. A unit is treated if and only if the -running variable $x_{i}$ lies above a known cutoff $c_{0}$. The parameter of -interest is given by the jump of $f$ at the cutoff, $$ \beta=\lim_{x\downarrow -c_{0}}f(x)-\lim_{x\uparrow c_{0}}f(x).$$ Let $\sigma^2(x_i)$ denote the -conditional variance of $u_i$. +running variable $x_{i}$ lies weakly above a known cutoff $x_{i}\geq c_{0}$. The +parameter of interest is given by the jump of $f$ at the cutoff, $$ +\beta=\lim_{x\downarrow c_{0}}f(x)-\lim_{x\uparrow c_{0}}f(x).$$ Let +$\sigma^2(x_i)$ denote the conditional variance of $u_i$. In the @lee08 dataset, the running variable corresponds to the margin of victory of a Democratic candidate in a US House election, and the treatment corresponds to @@ -63,7 +63,8 @@ occurred in 1947. The running variable is the year in which the individual turne 14, with the cutoff equal to 1947 so that the "treatment" is being subject to a higher minimum school-leaving age. The outcome is log earnings in 1998. -Some of the functions in the package require the data to be transformed into a custom `RDData` format. This can be accomplished with the `RDData` function: +Some of the functions in the package require the data to be transformed into a +custom `RDData` format. This can be accomplished with the `RDData` function: ```{r} library("RDHonest") @@ -241,6 +242,10 @@ variable is discrete, with $G$ support points: their construction makes no assumptions on the nature of the running variable (see Section 5.1 in @KoRo16 for more detailed discussion). +Note that units that lies exactly at the cutoff are considered treated, since +the definition of treatment is that the running variable + $x_i\geq c_0$. + As an example, consider the @oreopoulos06 data, in which the running variable is age in years: ```{r} ## Replicate Table 2, column (10) @@ -393,7 +398,7 @@ The package also implements lower-bound estimates for the smoothness constant $M$ for the Taylor and Hölder smoothness class, as described in the supplements to @KoRo16 and @ArKo16optimal ```{r} -## Add variance estimate to the lee data so that the RDSmoothnessBound +## Add variance estimate to the Lee (2008) data so that the RDSmoothnessBound ## function doesn't have to compute them each time dl <- NPRPrelimVar.fit(dl, se.initial="nn") @@ -443,8 +448,9 @@ different, but the worst-case bias and the point estimate are identical. ## Model -In a fuzzy RD design, the treatment $d_{i}$ is not entirely determined by -whether the running variable $x_{i}$ exceeds a cutoff. Instead, the cutoff +In a fuzzy RD design, units are assigned to treatment if their running variable +$x_{i}$ weakly exceeds a cutoff $x_i\geq c_{0}$. However, the actual treatment +$d_{i}$ does not perfectly comply with the treatment assignment. Instead, the cutoff induces a jump in the treatment probability. The resulting reduced-form and first-stage regressions are given by \begin{align*} @@ -454,8 +460,12 @@ See Section 3.3 in @ArKo16honest for a more detailed description. In the @battistin09 dataset, the treatment variable is an indicator for retirement, and the running variable is number of years since being eligible to -retire. The cutoff is $0$. (individuals exactly at the cutoff are dropped). -Similarly to the `RDData` function, the `FRDData` function transforms the data into an appropriate format: +retire. The cutoff is $0$. Individuals exactly at the cutoff are dropped from +the dataset. If there were individuals exactly at the cutoff, they are assumed +to be assigned to the treatment group. + +Similarly to the `RDData` function, the `FRDData` function transforms the data +into an appropriate format: ```{r} ## Assumes first column in the data frame corresponds to outcome, diff --git a/doc/RDHonest.pdf b/doc/RDHonest.pdf index 2216df5..ea5e604 100644 Binary files a/doc/RDHonest.pdf and b/doc/RDHonest.pdf differ diff --git a/doc/lpkernels.pdf b/doc/lpkernels.pdf index 4c1a6e1..f47c66b 100644 Binary files a/doc/lpkernels.pdf and b/doc/lpkernels.pdf differ diff --git a/doc/manual.pdf b/doc/manual.pdf index 89112f3..b35d2f4 100644 Binary files a/doc/manual.pdf and b/doc/manual.pdf differ diff --git a/man-roxygen/RDBW.R b/man-roxygen/RDBW.R index 4235f68..9022801 100644 --- a/man-roxygen/RDBW.R +++ b/man-roxygen/RDBW.R @@ -1,6 +1,6 @@ #' @param h bandwidth, a scalar parameter. For fuzzy or sharp RD, it can be a #' named vector of length two with names \code{"p"} and \code{"m"}, in which -#' case the bandwidth \code{h["m"]} is used for observations below the -#' cutoff, and the bandwidth \code{h["p"]} is used for observations above -#' the cutoff. If not supplied, optimal bandwidth is computed according to -#' criterion given by \code{opt.criterion}. +#' case the bandwidth \code{h["m"]} is used for observations strictly below +#' the cutoff, and the bandwidth \code{h["p"]} is used for observations +#' weakly above the cutoff. If not supplied, optimal bandwidth is computed +#' according to criterion given by \code{opt.criterion}. diff --git a/man/FRDHonest.Rd b/man/FRDHonest.Rd index 2f408a6..6a3259c 100644 --- a/man/FRDHonest.Rd +++ b/man/FRDHonest.Rd @@ -80,10 +80,10 @@ cutoff should be constrained to equal to each other.} \item{h}{bandwidth, a scalar parameter. For fuzzy or sharp RD, it can be a named vector of length two with names \code{"p"} and \code{"m"}, in which -case the bandwidth \code{h["m"]} is used for observations below the -cutoff, and the bandwidth \code{h["p"]} is used for observations above -the cutoff. If not supplied, optimal bandwidth is computed according to -criterion given by \code{opt.criterion}.} +case the bandwidth \code{h["m"]} is used for observations strictly below +the cutoff, and the bandwidth \code{h["p"]} is used for observations +weakly above the cutoff. If not supplied, optimal bandwidth is computed +according to criterion given by \code{opt.criterion}.} \item{se.method}{Vector with methods for estimating standard error of estimate. If \code{NULL}, standard errors are not computed. The elements of diff --git a/man/FRDOptBW.Rd b/man/FRDOptBW.Rd index f0ca4a0..b9eaaf6 100644 --- a/man/FRDOptBW.Rd +++ b/man/FRDOptBW.Rd @@ -125,10 +125,10 @@ Returns an object of class \code{"RDBW"}. The function \code{print} class \code{"RDBW"} is a list containing the following components: \describe{ - \item{\code{hp}}{bandwidth for observations above cutoff} + \item{\code{hp}}{bandwidth for observations weakly above cutoff} - \item{\code{hm}}{bandwidth for observations below cutoff, equal to - \code{hp} unless \code{bw.equal==FALSE}} + \item{\code{hm}}{bandwidth for observations strictly below cutoff, equal + to \code{hp} unless \code{bw.equal==FALSE}} \item{\code{sigma2m}, \code{sigma2p}}{estimate of conditional variance just above and just below cutoff, \eqn{\sigma^2_+(0)} and diff --git a/man/LPPHonest.Rd b/man/LPPHonest.Rd index 2f62f43..a1cb639 100644 --- a/man/LPPHonest.Rd +++ b/man/LPPHonest.Rd @@ -76,10 +76,10 @@ contain \code{NA}s. The default is set by the \code{na.action} setting of \item{h}{bandwidth, a scalar parameter. For fuzzy or sharp RD, it can be a named vector of length two with names \code{"p"} and \code{"m"}, in which -case the bandwidth \code{h["m"]} is used for observations below the -cutoff, and the bandwidth \code{h["p"]} is used for observations above -the cutoff. If not supplied, optimal bandwidth is computed according to -criterion given by \code{opt.criterion}.} +case the bandwidth \code{h["m"]} is used for observations strictly below +the cutoff, and the bandwidth \code{h["p"]} is used for observations +weakly above the cutoff. If not supplied, optimal bandwidth is computed +according to criterion given by \code{opt.criterion}.} \item{se.method}{Vector with methods for estimating standard error of estimate. If \code{NULL}, standard errors are not computed. The elements of diff --git a/man/NPRHonest.fit.Rd b/man/NPRHonest.fit.Rd index 371655f..9ca5f73 100644 --- a/man/NPRHonest.fit.Rd +++ b/man/NPRHonest.fit.Rd @@ -35,10 +35,10 @@ either be a string equal to \code{"triangular"} (\eqn{k(u)=(1-|u|)_{+}}), \item{h}{bandwidth, a scalar parameter. For fuzzy or sharp RD, it can be a named vector of length two with names \code{"p"} and \code{"m"}, in which -case the bandwidth \code{h["m"]} is used for observations below the -cutoff, and the bandwidth \code{h["p"]} is used for observations above -the cutoff. If not supplied, optimal bandwidth is computed according to -criterion given by \code{opt.criterion}.} +case the bandwidth \code{h["m"]} is used for observations strictly below +the cutoff, and the bandwidth \code{h["p"]} is used for observations +weakly above the cutoff. If not supplied, optimal bandwidth is computed +according to criterion given by \code{opt.criterion}.} \item{opt.criterion}{Optimality criterion that bandwidth is designed to optimize. The options are: diff --git a/man/NPRreg.fit.Rd b/man/NPRreg.fit.Rd index 4397e98..c39f79c 100644 --- a/man/NPRreg.fit.Rd +++ b/man/NPRreg.fit.Rd @@ -20,10 +20,10 @@ NPRreg.fit( \item{h}{bandwidth, a scalar parameter. For fuzzy or sharp RD, it can be a named vector of length two with names \code{"p"} and \code{"m"}, in which -case the bandwidth \code{h["m"]} is used for observations below the -cutoff, and the bandwidth \code{h["p"]} is used for observations above -the cutoff. If not supplied, optimal bandwidth is computed according to -criterion given by \code{opt.criterion}.} +case the bandwidth \code{h["m"]} is used for observations strictly below +the cutoff, and the bandwidth \code{h["p"]} is used for observations +weakly above the cutoff. If not supplied, optimal bandwidth is computed +according to criterion given by \code{opt.criterion}.} \item{kern}{specifies kernel function used in the local regression. It can either be a string equal to \code{"triangular"} (\eqn{k(u)=(1-|u|)_{+}}), diff --git a/man/RDHonest.Rd b/man/RDHonest.Rd index 9f945f5..3f9bf49 100644 --- a/man/RDHonest.Rd +++ b/man/RDHonest.Rd @@ -79,10 +79,10 @@ cutoff should be constrained to equal to each other.} \item{h}{bandwidth, a scalar parameter. For fuzzy or sharp RD, it can be a named vector of length two with names \code{"p"} and \code{"m"}, in which -case the bandwidth \code{h["m"]} is used for observations below the -cutoff, and the bandwidth \code{h["p"]} is used for observations above -the cutoff. If not supplied, optimal bandwidth is computed according to -criterion given by \code{opt.criterion}.} +case the bandwidth \code{h["m"]} is used for observations strictly below +the cutoff, and the bandwidth \code{h["p"]} is used for observations +weakly above the cutoff. If not supplied, optimal bandwidth is computed +according to criterion given by \code{opt.criterion}.} \item{se.method}{Vector with methods for estimating standard error of estimate. If \code{NULL}, standard errors are not computed. The elements of diff --git a/man/RDHonestBME.Rd b/man/RDHonestBME.Rd index 6c2b4bc..412e87a 100644 --- a/man/RDHonestBME.Rd +++ b/man/RDHonestBME.Rd @@ -42,10 +42,10 @@ contain \code{NA}s. The default is set by the \code{na.action} setting of \item{h}{bandwidth, a scalar parameter. For fuzzy or sharp RD, it can be a named vector of length two with names \code{"p"} and \code{"m"}, in which -case the bandwidth \code{h["m"]} is used for observations below the -cutoff, and the bandwidth \code{h["p"]} is used for observations above -the cutoff. If not supplied, optimal bandwidth is computed according to -criterion given by \code{opt.criterion}.} +case the bandwidth \code{h["m"]} is used for observations strictly below +the cutoff, and the bandwidth \code{h["p"]} is used for observations +weakly above the cutoff. If not supplied, optimal bandwidth is computed +according to criterion given by \code{opt.criterion}.} \item{alpha}{determines confidence level, \eqn{1-\alpha}{1-alpha}} diff --git a/man/RDOptBW.Rd b/man/RDOptBW.Rd index 32c307d..7160288 100644 --- a/man/RDOptBW.Rd +++ b/man/RDOptBW.Rd @@ -121,9 +121,9 @@ Returns an object of class \code{"RDBW"}. The function \code{print} class \code{"RDBW"} is a list containing the following components: \describe{ - \item{\code{hp}}{bandwidth for observations above cutoff} + \item{\code{hp}}{bandwidth for observations strictly above cutoff} - \item{\code{hm}}{bandwidth for observations below cutoff, equal to + \item{\code{hm}}{bandwidth for observations weakly below cutoff, equal to \code{hp} unless \code{bw.equal==FALSE}} \item{\code{sigma2m}, \code{sigma2p}}{estimate of conditional variance diff --git a/tests/testthat/test_rd.R b/tests/testthat/test_rd.R index 1696d95..689cd73 100644 --- a/tests/testthat/test_rd.R +++ b/tests/testthat/test_rd.R @@ -104,8 +104,8 @@ test_that("Honest inference in Lee and LM data", { expect_equal(r$maxbias, ff(r$hp, "uniform", "supplied.var")$maxbias) r <- es("triangular", "nn") - expect_equal(r$hm, 22.80882408) - expect_equal(unname(r$estimate+r$hl), 0.05476609) + expect_lt(abs(r$hm- 22.80882408), 5e-7) + expect_lt(unname(r$estimate+r$hl- 0.05476609), 1e-7) ## End replication ## Replicate 1511.06028v2 diff --git a/vignettes/RDHonest.Rmd b/vignettes/RDHonest.Rmd index 136ff91..74e5347 100644 --- a/vignettes/RDHonest.Rmd +++ b/vignettes/RDHonest.Rmd @@ -47,10 +47,10 @@ In the sharp regression discontinuity model, we observe units $i=1,\dotsc,n$, with the outcome $y_i$ for the $i$th unit given by $$ y_i = f(x_i) + u_i, $$ where $f(x_i)$ is the expectation of $y_i$ conditional on the running variable $x_i$ and $u_i$ is the regression error. A unit is treated if and only if the -running variable $x_{i}$ lies above a known cutoff $c_{0}$. The parameter of -interest is given by the jump of $f$ at the cutoff, $$ \beta=\lim_{x\downarrow -c_{0}}f(x)-\lim_{x\uparrow c_{0}}f(x).$$ Let $\sigma^2(x_i)$ denote the -conditional variance of $u_i$. +running variable $x_{i}$ lies weakly above a known cutoff $x_{i}\geq c_{0}$. The +parameter of interest is given by the jump of $f$ at the cutoff, $$ +\beta=\lim_{x\downarrow c_{0}}f(x)-\lim_{x\uparrow c_{0}}f(x).$$ Let +$\sigma^2(x_i)$ denote the conditional variance of $u_i$. In the @lee08 dataset, the running variable corresponds to the margin of victory of a Democratic candidate in a US House election, and the treatment corresponds to @@ -63,7 +63,8 @@ occurred in 1947. The running variable is the year in which the individual turne 14, with the cutoff equal to 1947 so that the "treatment" is being subject to a higher minimum school-leaving age. The outcome is log earnings in 1998. -Some of the functions in the package require the data to be transformed into a custom `RDData` format. This can be accomplished with the `RDData` function: +Some of the functions in the package require the data to be transformed into a +custom `RDData` format. This can be accomplished with the `RDData` function: ```{r} library("RDHonest") @@ -241,6 +242,10 @@ variable is discrete, with $G$ support points: their construction makes no assumptions on the nature of the running variable (see Section 5.1 in @KoRo16 for more detailed discussion). +Note that units that lies exactly at the cutoff are considered treated, since +the definition of treatment is that the running variable + $x_i\geq c_0$. + As an example, consider the @oreopoulos06 data, in which the running variable is age in years: ```{r} ## Replicate Table 2, column (10) @@ -443,8 +448,9 @@ different, but the worst-case bias and the point estimate are identical. ## Model -In a fuzzy RD design, the treatment $d_{i}$ is not entirely determined by -whether the running variable $x_{i}$ exceeds a cutoff. Instead, the cutoff +In a fuzzy RD design, units are assigned to treatment if their running variable +$x_{i}$ weakly exceeds a cutoff $x_i\geq c_{0}$. However, the actual treatment +$d_{i}$ does not perfectly comply with the treatment assignment. Instead, the cutoff induces a jump in the treatment probability. The resulting reduced-form and first-stage regressions are given by \begin{align*} @@ -454,8 +460,12 @@ See Section 3.3 in @ArKo16honest for a more detailed description. In the @battistin09 dataset, the treatment variable is an indicator for retirement, and the running variable is number of years since being eligible to -retire. The cutoff is $0$. (individuals exactly at the cutoff are dropped). -Similarly to the `RDData` function, the `FRDData` function transforms the data into an appropriate format: +retire. The cutoff is $0$. Individuals exactly at the cutoff are dropped from +the dataset. If there were individuals exactly at the cutoff, they are assumed +to be assigned to the treatment group. + +Similarly to the `RDData` function, the `FRDData` function transforms the data +into an appropriate format: ```{r} ## Assumes first column in the data frame corresponds to outcome,