Commit abba5b0

Merge pull request #132 from dlab-berkeley/kz
update to regressing A on Z instead of Z on A

2 parents c25e01b + 6d071b9

File tree

3 files changed: +28 −27 lines changed


6 Causal Inference/6-5 Instrumental Variables/Instrumental Variables Solutions.Rmd

Lines changed: 16 additions & 16 deletions
@@ -458,23 +458,23 @@ There is no empirical way to determine whether the "exclusion restriction" requi
 
 The "first stage" requirement (that $Z$ must have a causal effect on $A$), however, can be empirically tested, and as the name implies, doing so is indeed the first stage in implementing an instrumental variable analysis.
 
-To do so, we simply run a linear regression of the intended instrument $Z$ on the exposure $A$ (and any measured confounders $W$ that we have determined appropriate to control for):
+To do so, we simply regress the exposure $A$ on the intended instrument $Z$ (and any measured confounders $W$ that we have determined appropriate to control for) using a simple linear regression:
 
-$$Z = \beta_0 + \beta_1A + \epsilon$$
-If this regression results in a high correlation value, $Z$ is considered a **strong** instrument and we may proceed. If correlation is low, however, $Z$ is considered a **weak** instrument and may be a poor choice of instrument.
+$$A = \beta_0 + \beta_1Z + \beta_2W + \epsilon$$
+If this regression yields a large (and statistically significant) coefficient on $Z$, $Z$ is considered a **strong** instrument and we may proceed. If the coefficient is small, however, $Z$ is considered a **weak** instrument and may be a poor choice.
 
-If we decide to move forward with using $Z$ as an instrument, we save the predicted values of the instrument $\hat{Z}$ and the covariance of $Z$ and $A$ ($Cov(Z,A)$) for the next stage.
+If we decide to move forward with using $Z$ as an instrument, we save the predicted values of the treatment, $\hat{A}$, which are a function of $Z$, along with the covariance of $Z$ and $A$ ($Cov(Z,A)$), for the next stage.
 
 **\textcolor{blue}{Question 6:}** Consider, what are some potential concerns with using a weak instrument?
 
 **Solution:** There are many possible answers, but the primary concern is that $Z$ may not truly have a causal effect on $A$ (or at least, not a very strong one).
 
 ## Second Stage
 
-Now that we have the predicted values of the instrument $\hat{Z}$, we regress the outcome $Y$ on these values, like so:
+Now that we have the predicted values of the treatment $\hat{A}$, we regress the outcome $Y$ on these values (and any covariates included in the first stage), like so:
 
-$$Y = \beta_0 + \beta_1\hat{Z} + \epsilon$$
-We then retrieve the covariance between $Z$ and $Y$ ($Cov(Z,Y)$). The ratio between this and $Cov(Z,A)$ is then our 2SLS estimate of the coefficient on $A$ in the original model.
+$$Y = \beta_0 + \beta_1\hat{A} + \beta_2W + \epsilon$$
+We then retrieve the covariance between $Z$ and $Y$ ($Cov(Z,Y)$). The ratio between this and $Cov(Z,A)$ is then our 2SLS estimate of the coefficient on $A$ in the original model. *Note that this will differ slightly if you control for any $W$.*
 
 $$\hat{\beta}_1 = \frac{Cov(Z,Y)}{Cov(Z,A)}$$
 
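The covariance-ratio formula in the hunk above has a one-line justification; the following editorial sketch ignores $W$ for simplicity. Under the structural model $Y = \beta_0 + \beta_1 A + \epsilon$, the exclusion restriction gives $Cov(Z, \epsilon) = 0$, so:

```latex
Cov(Z, Y) = Cov(Z,\ \beta_0 + \beta_1 A + \epsilon)
          = \beta_1\, Cov(Z, A) + Cov(Z, \epsilon)
          = \beta_1\, Cov(Z, A)
\quad\Longrightarrow\quad
\beta_1 = \frac{Cov(Z, Y)}{Cov(Z, A)}
```

Dividing through by $Cov(Z,A)$ is only valid under the first-stage condition $Cov(Z,A) \neq 0$, which is why a weak instrument (small $Cov(Z,A)$) makes the estimate unstable.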
@@ -557,33 +557,33 @@ df <- df %>%
 head(df)
 summary(df)
 ```
-**\textcolor{blue}{Question 8:}** Use the `lm()` function to regress proximity $Z$ on AspiTyleCedrin use $A$ and sex assigned at birth $W$. Assign the predicted values to the variable name `Z_hat`. Use the `cov()` function to find $Cov(Z,A)$ and assign the result to the variable name `cov_za`.
+**\textcolor{blue}{Question 8:}** Use the `lm()` function to regress whether the individual took AspiTyleCedrin ($A$) on proximity to a pharmacy that sells AspiTyleCedrin ($Z$) and sex assigned at birth ($W$). Assign the predicted values to the variable name `A_hat`. Use the `cov()` function to find $Cov(Z,A)$ and assign the result to the variable name `cov_za`.
 
 ```{r}
 
 # 1. first stage
 # ----------
-lm_out1 <- lm(Z ~ A + W, # regress Z (instrument) on A + W
+lm_out1 <- lm(A ~ Z + W, # regress A (treatment) on Z (instrument) + W (covariates)
               data = df) # specify data
 
 # view model summary
 summary(lm_out1)
 
 
 # get fitted values (Z-hat)
-Z_hat <- lm_out1$fitted.values
+A_hat <- lm_out1$fitted.values
 
 # get the covariance of Z and A
 cov_za <- cov(df$Z, df$A)
 ```
 
-**\textcolor{blue}{Question 9:}** Use the `lm()` function to regress migraines $Y$ on your fitted values `Z_hat`. Use the `cov()` function to find $Cov(Z,Y)$ and assign the result to the variable name `cov_zy`.
+**\textcolor{blue}{Question 9:}** Use the `lm()` function to regress migraines $Y$ on your fitted values `A_hat`. Use the `cov()` function to find $Cov(Z,Y)$ and assign the result to the variable name `cov_zy`.
 
 ```{r}
 
 # 2. reduced form
 # ----------
-lm_out2 <- lm(Y ~ Z_hat, # regress Y (outcome) on fitted values from first stage
+lm_out2 <- lm(Y ~ A_hat + W, # regress Y (outcome) on fitted values from first stage, plus W
               data = df) # specify data
 
 # view model summary
@@ -595,7 +595,7 @@ cov_zy <- cov(df$Z, df$Y)
 
 **\textcolor{blue}{Question 10:}** Use your `cov_za` and `cov_zy` to estimate the coefficient $\beta_1$ in the following equation:
 
-$$Y = \beta_0 + \beta_1 A + \beta_2 W + \epsilon$$
+$$Y = \beta_0 + \beta_1\hat{A} + \beta_2 W + \epsilon$$
 Interpret your result in words.
 
 ```{r}
@@ -606,10 +606,10 @@ beta_hat <- cov_zy/cov_za # divide Cov(Z,Y) / Cov(Z,A)
 beta_hat
 ```
 
-> When controlling for sex assigned at birth, use of AspiTyleCedrin reduces migraines by approximately 3.8 per month.
+> When controlling for sex assigned at birth, use of AspiTyleCedrin reduces migraines by approximately 3.8 per month. *Note this is slightly different from the estimated OLS coefficient above, likely because the covariance ratio does not adjust for $W$.*
 
-The `AER` package also provides us with the `ivreg()` function which allows us to perform IV regression in one command:
+The `AER` package also provides us with the `ivreg()` function, which allows us to perform IV regression in one command (*note that the standard errors will be correctly adjusted when using `ivreg()`*):
 
 
 ```{r}
@@ -625,7 +625,7 @@ summary(lm_out3)
 
 **\textcolor{blue}{Question 11:}** Compare the estimate of the coefficient on $A$ in the output above to your previous answer.
 
-> The results are very similar. In this case the estimate using `ivreg()` is slightly larger, but if you repeat this with a difference seed, it might be smaller. So, they will both report similar estimates, which could be due to a rounding error.
+> The results are identical. However, it should be noted that manually plugging in the first-stage fitted values won't produce the correct standard errors. Instead, we should use `ivreg()`, which adjusts the standard errors because the fitted values are "generated regressors" (generated from the data rather than measured independently of other variables).
 
 \newpage
 
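The two-stage procedure in the diff above can be checked numerically. The sketch below (Python with simulated data; the variable names and coefficient values are illustrative, not the workshop's AspiTyleCedrin dataset) runs both stages without covariates and confirms that the second-stage slope equals the ratio $Cov(Z,Y)/Cov(Z,A)$ and recovers the true effect, while the naive OLS slope of $Y$ on $A$ is biased by the unmeasured confounder:

```python
import numpy as np

# Simulated data: U confounds A and Y; Z shifts A but affects Y only through A.
rng = np.random.default_rng(0)
n = 200_000
U = rng.normal(size=n)                      # unmeasured confounder
Z = rng.normal(size=n)                      # instrument
A = 0.8 * Z + U + rng.normal(size=n)        # first stage: Z has a causal effect on A
Y = 3.0 * A + 2.0 * U + rng.normal(size=n)  # true causal effect of A on Y is 3.0

# Naive OLS slope of Y on A is biased upward by U
naive = np.cov(A, Y)[0, 1] / np.var(A, ddof=1)

# Stage 1: regress A on Z (with intercept), keep the fitted values A_hat
X1 = np.column_stack([np.ones(n), Z])
A_hat = X1 @ np.linalg.lstsq(X1, A, rcond=None)[0]

# Stage 2: regress Y on A_hat; the slope is the 2SLS estimate
X2 = np.column_stack([np.ones(n), A_hat])
beta_2sls = np.linalg.lstsq(X2, Y, rcond=None)[0][1]

# The same estimate as the covariance ratio Cov(Z,Y)/Cov(Z,A)
ratio = np.cov(Z, Y)[0, 1] / np.cov(Z, A)[0, 1]

print(f"naive OLS: {naive:.3f}, 2SLS: {beta_2sls:.3f}, ratio: {ratio:.3f}")
```

With a covariate $W$, both design matrices would include a $W$ column, which is why the diff adds `W` to both `lm()` calls and notes that the plain covariance ratio then differs slightly.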
6 Causal Inference/6-5 Instrumental Variables/Instrumental Variables Student.Rmd

Lines changed: 12 additions & 11 deletions
@@ -446,25 +446,26 @@ There is no empirical way to determine whether the "exclusion restriction" requi
 
 ## First Stage
 
+
 The "first stage" requirement (that $Z$ must have a causal effect on $A$), however, can be empirically tested, and as the name implies, doing so is indeed the first stage in implementing an instrumental variable analysis.
 
-To do so, we simply run a linear regression of the intended instrument $Z$ on the exposure $A$ (and any measured confounders $W$ that we have determined appropriate to control for):
+To do so, we simply regress the exposure $A$ on the intended instrument $Z$ (and any measured confounders $W$ that we have determined appropriate to control for) using a simple linear regression:
 
-$$Z = \beta_0 + \beta_1A + \epsilon$$
-If this regression results in a high correlation value, $Z$ is considered a **strong** instrument and we may proceed. If correlation is low, however, $Z$ is considered a **weak** instrument and may be a poor choice of instrument.
+$$A = \beta_0 + \beta_1Z + \beta_2W + \epsilon$$
+If this regression yields a large (and statistically significant) coefficient on $Z$, $Z$ is considered a **strong** instrument and we may proceed. If the coefficient is small, however, $Z$ is considered a **weak** instrument and may be a poor choice.
 
-If we decide to move forward with using $Z$ as an instrument, we save the predicted values of the instrument $\hat{Z}$ and the covariance of $Z$ and $A$ ($Cov(Z,A)$) for the next stage.
+If we decide to move forward with using $Z$ as an instrument, we save the predicted values of the treatment, $\hat{A}$, which are a function of $Z$, along with the covariance of $Z$ and $A$ ($Cov(Z,A)$), for the next stage.
 
 **\textcolor{blue}{Question 6:}** Consider, what are some potential concerns with using a weak instrument?
 
 **Solution:** ...
 
 ## Second Stage
 
-Now that we have the predicted values of the instrument $\hat{Z}$, we regress the outcome $Y$ on these values, like so:
+Now that we have the predicted values of the treatment $\hat{A}$, we regress the outcome $Y$ on these values (and any covariates included in the first stage), like so:
 
-$$Y = \beta_0 + \beta_1\hat{Z} + \epsilon$$
-We then retrieve the covariance between $Z$ and $Y$ ($Cov(Z,Y)$). The ratio between this and $Cov(Z,A)$ is then our 2SLS estimate of the coefficient on $A$ in the original model.
+$$Y = \beta_0 + \beta_1\hat{A} + \beta_2W + \epsilon$$
+We then retrieve the covariance between $Z$ and $Y$ ($Cov(Z,Y)$). The ratio between this and $Cov(Z,A)$ is then our 2SLS estimate of the coefficient on $A$ in the original model. *Note that this will differ slightly if you control for any $W$.*
 
 $$\hat{\beta}_1 = \frac{Cov(Z,Y)}{Cov(Z,A)}$$
 
@@ -547,17 +548,17 @@ head(df)
 summary(df)
 ```
 
-**\textcolor{blue}{Question 8:}** Use the `lm()` function to regress proximity $Z$ on AspiTyleCedrin use $A$ and sex assigned at birth $W$. Assign the predicted values to the variable name `Z_hat`. Use the `cov()` function to find $Cov(Z,A)$ and assign the result to the variable name `cov_za`.
+**\textcolor{blue}{Question 8:}** Use the `lm()` function to regress whether the individual took AspiTyleCedrin ($A$) on proximity to a pharmacy that sells AspiTyleCedrin ($Z$) and sex assigned at birth ($W$). Assign the predicted values to the variable name `A_hat`. Use the `cov()` function to find $Cov(Z,A)$ and assign the result to the variable name `cov_za`.
 
 ```{r}
 lm_out1 <- lm(..., data = df)
 summary(lm_out1)
 
-Z_hat <- lm_out1$...
+A_hat <- lm_out1$...
 cov_za <- cov(..., ...)
 ```
 
-**\textcolor{blue}{Question 9:}** Use the `lm()` function to regress migraines $Y$ on your fitted values `Z_hat`. Use the `cov()` function to find $Cov(Z,Y)$ and assign the result to the variable name `cov_zy`.
+**\textcolor{blue}{Question 9:}** Use the `lm()` function to regress migraines $Y$ on your fitted values `A_hat`. Use the `cov()` function to find $Cov(Z,Y)$ and assign the result to the variable name `cov_zy`.
 
 ```{r}
 lm_out2 <- lm(..., data = df)
@@ -578,7 +579,7 @@ beta_hat
 > ...
 
 
-The `AER` package also provides us with the `ivreg()` function which allows us to perform IV regression in one command:
+The `AER` package also provides us with the `ivreg()` function, which allows us to perform IV regression in one command (*note that the standard errors will be correctly adjusted when using `ivreg()`*):
 
 ```{r}
 lm_out3 <- ivreg(..., data = df)