class: center, middle, inverse, title-slide .title[ # Quasi-Double Robustness of Linear Regression for Treatment Effects ] .author[ ### Carolina Caetano<sup>1</sup>, Brantly Callaway<sup>1</sup>, Tymon Słoczyński<sup>2</sup> <br> <sup>1</sup>University of Georgia, <sup>2</sup>Brandeis University ] .date[ ### April 22, 2023
Georgia Econometrics Workshop ] --- # Introduction `$$\newcommand{\E}{\mathbb{E}} \newcommand{\var}{\mathrm{var}} \newcommand{\cov}{\mathrm{cov}} \newcommand{\Var}{\mathrm{var}} \newcommand{\Cov}{\mathrm{cov}} \newcommand{\Corr}{\mathrm{corr}} \newcommand{\corr}{\mathrm{corr}} \newcommand{\L}{\mathrm{L}} \renewcommand{\P}{\mathrm{P}} \newcommand{\independent}{{\perp\!\!\!\perp}}$$` <style type="text/css"> .inverse { background-color: #BA0C2F; } .alert { font-weight:bold; color: #BA0C2F; } .alert-blue { font-weight: bold; color: blue; } .remark-slide-content { font-size: 23px; padding: 1em 4em 1em 4em; } .highlight-red { background-color:red; padding:0.1em 0.2em; } .assumption-box { background-color: rgba(222,222,222,.5); font-size: x-large; padding: 10px; border: 10px solid lightgray; margin: 10px; } .assumption-title { font-size: x-large; font-weight: bold; display: block; margin: 10px; text-decoration: underline; color: #BA0C2F; } </style> We consider a simple, observational setting where a researcher is interested in understanding the <span class="alert">causal effect of a binary treatment</span> under the assumption of <span class="alert">unconfoundedness</span> -- `$$(Y(1),Y(0)) \independent D | X$$` -- In this setting, (arguably) the natural target parameter is `\begin{align*} ATE &:= \E[Y(1) - Y(0)] = \E\Big[ ATE(X) \Big] \end{align*}` -- where `\begin{align*} ATE(X) := \E[Y(1)-Y(0)|X] = \E[Y|X,D=1] - \E[Y|X,D=0] \end{align*}` --- # Introduction By far the most commonly used estimation strategy in this setting is to estimate the following regression `$$Y_i = \alpha D_i + X_i'\beta + e_i$$` -- and `\(\alpha\)` is interpreted as the causal effect of the treatment (or, under <span class="alert">treatment effect heterogeneity</span>, as (hopefully) a weighted average of conditional average treatment effects) --- # Previous Work In particular, Angrist (1998) and Aronow and Samii (2016) provide conditions under which `$$\alpha = \E\left[ \frac{p(X)(1-p(X))}{\E[p(X)(1-p(X))]} ATE(X) \right]$$` where the first term gives the "weights" and `\(ATE(X) := \E[Y(1) - Y(0) | X]\)` is the conditional average treatment effect. -- These weights have the following properties: 1. Mean 1 2. Guaranteed to be non-negative (Blandhol et al. (2022) call this "weakly causal") 3. "Inherited" from the estimation strategy (notice that `\(p(X)(1-p(X))\)` is the conditional variance of the treatment `\(\implies\)` more weight on `\(ATE(X)\)` when this variance is high) --- name: angrist # But... To derive this result, previous papers invoke the assumption that the propensity score is linear in covariates, that is, `$$p(X) = \L(D|X)$$` where `\(\L(D|X)\)` is the (population) linear projection of `\(D\)` on `\(X\)`, i.e., `$$\L(D|X) := X'\E[XX']^{-1} \E[XD] = X'\gamma$$` [[Sketch Proof](#sketch-proof)] --- name: linear-pscore # Linearity of the Propensity Score And it's not clear whether or not this is a reasonable condition: -- * Angrist (1998) / MHE talks about the case where all covariates are discrete (fairly common in empirical work) and the model is fully saturated in the covariates (less common) - This is also a case where it would be straightforward to just recover `\(ATE\)` directly rather than being content with the weighted average. -- * Ishimaru (2021), Caetano, Callaway, Payne, Rodrigues (2022), Goldsmith-Pinkham, Hull, and Kolesar (2022) argue that (by construction) it is difficult for the propensity score to be linear in settings such as RD and DID. 
* [[DID Example](#did-example)] --- # Current Paper <span class="alert">Our question:</span> Are there alternative conditions that can "rationalize" using the regression to estimate a weighted average of conditional average treatment effects? * while allowing for treatment effect heterogeneity -- <span class="alert">Some high-level thoughts:</span> * High-level thought 1: It seems strange that the key condition for interpreting `\(\alpha\)` concerns the propensity score... why not linearity conditions on the outcome instead? --- # High-level thought 2: Double Robustness There is a large literature in statistics/econometrics on double robustness (Robins, Rotnitzky, and Zhao (1994), Słoczyński and Wooldridge (2018), many others) -- A typical result in this literature is that one can consistently estimate the `\(ATE\)` (or `\(ATT\)`, etc.) if *either*: (1) A model for the propensity score is correctly specified (2) Outcome regression models are correctly specified -- For condition (2), a leading example is: * Linearity of the model for untreated potential outcomes: `\(\E[Y|X,D=0] = \L_0(Y|X) := X'\beta_0\)` * Linearity of the model for treated potential outcomes: `\(\E[Y|X,D=1] = \L_1(Y|X) := X'\beta_1\)` -- Is this sort of condition relevant for interpreting `\(\alpha\)`? --- # Thought 3: Implicit Weighting / Outcome Modeling A number of papers have also noted that different estimation strategies often *implicitly* "balance" covariates or estimate an outcome model -- <span class="alert">Examples:</span> * Regression - Chattopadhyay and Zubizarreta (2022) interpret regressions like the one we consider through the lens of re-weighting observations and derive a number of interesting properties -- * Regression adjustment - Kline (2011) shows that regression adjustment estimators implicitly fit an inverse linear propensity score model -- * Entropy balancing - Zhao and Percival (2016) show that entropy balancing implicitly estimates an outcome regression model -- These results imply extra (and not obvious) conditions under which these estimation strategies can recover causal effect parameters. --- # Related Work Besides the work mentioned above that assumes linearity of the propensity score (see also Słoczyński (2020)) -- Several recent papers provide related results in different contexts: * Blandhol, Bonney, Mogstad, Torgovitsky (2022) -- Two stage least squares * Goldsmith-Pinkham, Hull, Kolesar (2022) -- Multiple treatments * Caetano, et al. (2022) -- Difference-in-differences * Ishimaru (2022) -- Continuous treatment --- # Outline <br> <br> <br> 1. Two decompositions of `\(\alpha\)` 2. Two main results on causally interpreting `\(\alpha\)` 3. Some extensions / discussion 4. Empirical Exercise --- # Decomposition 1 `\(\alpha\)`, the regression coefficient on `\(D\)`, can be decomposed as follows: `\begin{align*} \alpha = \E\Big[ w_0(D,X) (\L_1(Y|X) - \L_0(Y|X)) \Big] \end{align*}` where the weights `\(w_0(D,X)\)` are given by `\begin{align*} w_0(D,X) = \frac{D(1-\L(D|X))} {\E\Big[(D-\L(D|X))^2\Big]} \end{align*}` -- <span class="alert">Properties:</span> * `\(\E[w_0(D,X)] = 1\)` * `\(w_0(D,X)\)` can be negative for some values of `\(D\)` and `\(X\)`. 
* Since all of the terms in this decomposition are linear projections, it is straightforward to compute every term here --- # Decomposition 2 `\(\alpha\)`, the regression coefficient on `\(D\)`, can be decomposed as `\begin{align*} \alpha &= \E\left[w_0(D,X) \Big( \E[Y|X,D=1] - \E[Y|X,D=0] \Big) \right] \\ & + \E\left[w_0(D,X)\Big(\E[Y|X,D=0] - \L_0(Y|X) \Big) \right] \end{align*}` -- * It's not so easy to compute all the terms in this decomposition (particularly the conditional expectations) -- * This is a useful intermediate step for explaining our main results next -- * Under unconfoundedness, the first term is a weighted average of `\(ATE(X)\)` -- * The second term is a nuisance term (a kind of misspecification or nonlinearity bias) that we would like to be equal to 0 --- # Main Result 1 Under (i) unconfoundedness, (ii) overlap, and *either* (iii) linearity of the propensity score... `\(p(X) = \L(D|X)\)` *or* (iv) linearity of the model for untreated potential outcomes... `\(\E[Y|X,D=0] = \L_0(Y|X)\)`, we have `\begin{align*} \alpha &= \E\left[w_0(D,X) ATE(X) \right] \end{align*}` -- This is closely related to results in Goldsmith-Pinkham et al. (2022) (in particular) and also to results in Blandhol et al. (2022) and Caetano, et al. (2022). -- <span class="alert">Additional Comments:</span> * The weights are guaranteed to be non-negative under linearity of the propensity score -- * Under linearity of the model for untreated potential outcomes, the weights are estimable (and one can check for negative weights by checking whether `\(\L(D|X) \leq 1\)` uniformly in the treated group). -- * Interestingly (surprisingly??), condition (iv) requires *only* linearity of `\(\E[Y|X,D=0]\)` but not necessarily linearity of `\(\E[Y|X,D=1]\)`. --- # Main Result 2 Alternatively, using closely related arguments, one can show the following result: -- Under (i) unconfoundedness, (ii) overlap, and *either* (iii) linearity of the propensity score... `\(p(X) = \L(D|X)\)` *or* (iv) <span class="alert">linearity of the model for treated potential outcomes... `\(\E[Y|X,D=1] = \L_1(Y|X)\)`</span>, we have `\begin{align*} \alpha &= \E\left[w_1(D,X) ATE(X) \right] \end{align*}` where `\begin{align*} w_1(D,X) = \frac{(1-D)\L(D|X)} {\E\Big[(D-\L(D|X))^2\Big]} \end{align*}` --- # Main Result 2 (cont'd) <span class="alert">Additional Comments:</span> * In general, `\(w_1(D,X) \neq w_0(D,X)\)` ... and even `\(\E[w_1(D,X) | X] \neq \E[w_0(D,X)|X]\)` in general (implying that the two results, in general, put different weights on `\(ATE(X)\)`) -- * A sufficient condition for the weights to be non-negative is that `\(\L(D|X) \geq 0\)` uniformly in the untreated group -- * A sufficient condition for the weights to be equal is that `\(p(X) = \L(D|X)\)`. -- * If `\(p(X) \neq \L(D|X)\)`, and `\(\E[Y|X,D=d] = \L_d(Y|X)\)` for *both* `\(d=1\)` and `\(d=0\)`, then `\(\alpha\)` is equal to two different weighted averages of `\(ATE(X)\)` (in principle, both could have all positive weights) --- # Discussion <span class="alert">About negative weights:</span> -- * Much of the literature (and much empirical work) has emphasized the problem of negative weights, which are possible here (and often possible in under-specified regression models). 
* Still...in practice, negative weights are often small in magnitude, and there is much more variation in the positive weights * And one can come up with weighting schemes where the weights are all positive but that are still very poor weighting schemes -- * The above results make it easy to check (under our assumptions) whether or not there are negative weights. -- * This discussion matters more when there is more treatment effect heterogeneity --- # Discussion <span class="alert">Possible advantages of regressions?</span> One of the strange things about the "interpreting regressions" literature (in my view) is that the conditions imposed to interpret the regression coefficient (as a weighted average) would often allow you to target causal effect parameters directly: -- * <span class="alert">Ex.</span> If you know `\(p(X) = \L(D|X)\)`, then you could directly target `\(ATE\)` using propensity score re-weighting -- * <span class="alert">Ex.</span> If you know *either* (i) `\(p(X) = \L(D|X)\)` *or* (ii) `\(\E[Y|X,D=1] = \L_1(Y|X)\)` *and* `\(\E[Y|X,D=0] = \L_0(Y|X)\)`, then you could use doubly robust approaches to directly target `\(ATE\)` -- However, our results imply that you could use the regression to recover a weighted average of `\(ATE(X)\)` if *any* of (i) `\(p(X) = \L(D|X)\)` *or* (ii) `\(\E[Y|X,D=1] = \L_1(Y|X)\)` *or* (iii) `\(\E[Y|X,D=0] = \L_0(Y|X)\)` holds * Seems strange...but this is what we are currently thinking about --- # ATT Version Can show: under `\(Y(0) \independent D | X\)`, if (i) `\(p(X) = \L(D|X)\)` or (ii) `\(\E[Y|X,D=0] = \L_0(Y|X)\)`, then `\begin{align*} \alpha = \E\left[ \frac{(1-\L(D|X))}{\E[(1-\L(D|X))|D=1]} ATT(X) \Big| D=1 \right] \end{align*}` -- Alternatively, if (a) `\(p(X) = \L(D|X)\)` or (b) `\(\E[Y|X,D=1] = \L_1(Y|X)\)`, then `\begin{align*} \alpha = \E\left[ \frac{\frac{1-p(X)}{p(X)}\L(D|X)}{\E\left[\frac{1-p(X)}{p(X)}\L(D|X) \Big|D=1\right]} ATT(X) \Big| D=1 \right] \end{align*}` -- <span class="alert">Comments:</span> * If `\(p(X) = \L(D|X)\)`, then the weights in the two expressions are equal (though not in general), and the weights are guaranteed to be positive --- count: false # ATT Version Can show: under `\(Y(0) \independent D | X\)`, if (i) `\(p(X) = \L(D|X)\)` or (ii) `\(\E[Y|X,D=0] = \L_0(Y|X)\)`, then `\begin{align*} \alpha = \E\left[ \frac{(1-\L(D|X))}{\E[(1-\L(D|X))|D=1]} ATT(X) \Big| D=1 \right] \end{align*}` Alternatively, if (a) `\(p(X) = \L(D|X)\)` or (b) `\(\E[Y|X,D=1] = \L_1(Y|X)\)`, then `\begin{align*} \alpha = \E\left[ \frac{\frac{1-p(X)}{p(X)}\L(D|X)}{\E\left[\frac{1-p(X)}{p(X)}\L(D|X) \Big|D=1\right]} ATT(X) \Big| D=1 \right] \end{align*}` <span class="alert">Comments:</span> * The second result is (perhaps) surprising because, typically, in order to recover `\(ATT\)`, you would need either condition (a) or to be able to model untreated potential outcomes (which you cannot directly observe for the treated group). -- * Similar results can be derived for `\(ATU\)` --- # Empirical Example How much do the issues we have been discussing matter in practice? -- We address this with an (extremely) small-scale application. First, a couple of issues: * All of the results above are at the population level * Conditional expectations / the propensity score are hard to estimate nonparametrically in realistic applications --- # Empirical Example <span class="alert">Our idea 1: </span> Treat the data *as if* it were the data generating process. 
<span class="alert">Our idea 2: </span> Punt on unconfoundedness "really" holding, but instead target `\begin{align*} \theta = \E\Big[ \E[Y|X,D=1] - \E[Y|X,D=0] \Big] \end{align*}` -- * `\(\theta = ATE\)` under unconfoundedness, but can still illustrate the issues discussed above even if unconfoundedness doesn't hold -- * We choose `\(X\)` to be of the lowest possible dimension so that `\(p(X)\)` and `\(\E[Y|X,D=d]\)` are not linear by construction --- # mtcars Data ``` ## mpg cyl disp hp drat wt qsec vs am gear carb ## Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 ## Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 ## Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 ## Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 ## Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 ## Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1 ## Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4 ## Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2 ## Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2 ## Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4 ## Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4 ## Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3 ## Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3 ## Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3 ## Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4 ## Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4 ## Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4 ## Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1 ## Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2 ## Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1 ## Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1 ## Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2 ## AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2 ## Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4 ## Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2 ## Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1 ## Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2 ## Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2 ## Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4 ## Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6 ## Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8 ## Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2 ``` --- # Setup We will be interested in: * `\(Y\)` : `mpg` - a car's miles per gallon * `\(D\)` : `vs` - shape of the engine ( `\(D=1\)` for straight engines, `\(D=0\)` for V-shaped engines) * `\(X\)` : `gear` the number of gears that the car has * only takes values 3, 4, or 5 here --- # "Data Generating Process" <img src="dgp.png", width="100%"> -- <span class="alert">Some things to notice</span> * `\(p(X)\)` is not close to being linear * `\(\E[Y|X,D=0]\)` not close to being linear either * But `\(\E[Y|X,D=1]\)` is *very close* to being linear --- # Regression Components <img src="regression_components.png", width="100%"> <span class="alert">Notes:</span> * There is substantial variation in `\(ATE(X)\)` * `\(\implies\)` the weighting scheme matters tremendously --- count: false # Regression Components <img src="regression_components.png", width="100%"> <span class="alert">Notes:</span> * `\(ATE\)` weights come from `\(\P(X=x)\)`, i.e., the fraction of the population where `\(X=x\)`. * Note the `\(w_1\)` and `\(w_0\)` weights come from calculating `\(\E[w_d(D,X)|X]\)` and multiplying by `\(\P(X=x)\)` (this gives the "fraction" of weight for each `\(ATE(X)\)`.) 
--- count: false # Regression Components <img src="regression_components.png" width="100%"> <span class="alert">Notes:</span> * `\(p(X)\)` weights are the "variance" weights that arise under the assumption of linearity of the propensity score. It's mainly a coincidence that the `\(p(X)\)` weights are close to the `\(ATE\)` weights (there happens to be little variation in `\(p(X)(1-p(X))\)` across different values of `\(X\)`) --- count: false # Regression Components <img src="regression_components.png" width="100%"> <span class="alert">Notes:</span> * There are no negative weights in any case, but the weighting schemes are quite different from each other (mainly due to the highly nonlinear propensity score) --- # Treatment Effects <img src="treatment_effects.png" width="100%"> <span class="alert">Notes:</span> * `\(\alpha\)` is substantially larger than the `\(ATE\)` (a little more than 20% larger) * `\(\alpha\)` is also outside the "convex hull" of `\(ATT\)` and `\(ATU\)` --- count: false # Treatment Effects <img src="treatment_effects.png" width="100%"> <span class="alert">Notes:</span> * The columns labeled `\(w_d \, ATE(X)\)` apply the two regression weighting schemes to the conditional `\(ATE\)`'s -- * Because `\(p(X)\)` is not linear, these two quantities are not very close to each other. -- * Because `\(\E[Y|X,D=0]\)` is not linear either, `\(w_0 \, ATE(X)\)` is not close to `\(\alpha\)` (recall Main Result 1 pairs `\(w_0\)` with linearity of `\(\E[Y|X,D=0]\)`). * The difference is a misspecification bias ("level-dependence") that is hard to interpret. It is about 45% as large as the "interpretable" weighted average of conditional `\(ATE\)`'s. -- * Because `\(\E[Y|X,D=1]\)` is close to being linear, `\(w_1 \, ATE(X)\)` is close to `\(\alpha\)`. That provides a rationalization for interpreting `\(\alpha\)` as a weighted average of conditional `\(ATE\)`'s. --- # Discussion <span class="alert">Caveats to our "empirical exercise"</span> * There's a lot of treatment effect heterogeneity here. It's possible that this is an artifact of using a very small dataset (we could be confusing "noise" with heterogeneous effects). -- * On the other hand, linearity is presumably more likely to be close to holding in this very small-scale application -- * It "matters" that only one of the three linearity conditions is required here * In practice, it is hard to see how a researcher would know which of the three linearity conditions was more likely to hold. -- * Finally, there are no negative weights here `\(\implies\)` we can interpret `\(\alpha\)` as being "weakly causal" * Despite this, `\(\alpha\)` is still far from the `\(ATE\)` --- # Conclusion * We provide some new conditions under which the coefficient on a binary treatment can be interpreted as being "weakly causal" -- * We find it especially interesting that this result can hold under *any* of the three linearity conditions discussed above -- * This is a brand-new project for us, so we greatly appreciate any feedback --- count:false class: inverse, middle, center # Thanks! 
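--- count: false name: empirical-code # Appendix: Code Sketch for the Empirical Exercise

A minimal R sketch (not our replication code) of the computations behind the empirical exercise: treat the `mtcars` sample as the population, with `\(Y\)` = `mpg`, `\(D\)` = `vs`, `\(X\)` = `gear`, and compute `\(\alpha\)`, `\(\theta\)`, and the two weighted averages of `\(ATE(X)\)` from the main results. It assumes `gear` enters the regression linearly (not saturated), matching the setup slides.

```r
Y <- mtcars$mpg; D <- mtcars$vs; X <- mtcars$gear

# alpha: coefficient on D in the regression of Y on D and X
alpha <- coef(lm(Y ~ D + X))[["D"]]

# Conditional means, treating the sample as the DGP (X takes values 3, 4, 5)
xs <- sort(unique(X))
m1 <- sapply(xs, function(x) mean(Y[D == 1 & X == x]))  # E[Y|X=x,D=1]
m0 <- sapply(xs, function(x) mean(Y[D == 0 & X == x]))  # E[Y|X=x,D=0]
fx <- sapply(xs, function(x) mean(X == x))              # P(X = x)
ATEx  <- m1 - m0                                        # ATE(X)
theta <- sum(fx * ATEx)                                 # theta = E[ATE(X)]

# Linear projection L(D|X) and the implicit weights from the main results
L     <- fitted(lm(D ~ X))
denom <- mean((D - L)^2)
w0 <- D * (1 - L) / denom        # weights from Decomposition 1 / Main Result 1
w1 <- (1 - D) * L / denom        # weights from Main Result 2

# E[w_d(D,X)|X = x] * P(X = x): the share of weight on each ATE(X)
w0x <- sapply(xs, function(x) mean(w0[X == x])) * fx
w1x <- sapply(xs, function(x) mean(w1[X == x])) * fx

# w1_avg should be close to alpha (E[Y|X,D=1] is close to linear),
# while w0_avg should not be (E[Y|X,D=0] is highly nonlinear)
round(c(alpha = alpha, theta = theta,
        w0_avg = sum(w0x * ATEx), w1_avg = sum(w1x * ATEx)), 2)
```

Comparing `w1_avg` and `w0_avg` with `alpha` is the sample analogue of the discussion on the Treatment Effects slides. 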
--- count: false name: sketch-proof # Sketch Proof $$ `\begin{aligned} \small \alpha & \small = \E\left[ \frac{(D-\L(D|X))}{\E[(D-\L(D|X))^2]}Y\right] \hspace{500pt} \end{aligned}` $$ --- count:false # Sketch Proof $$ `\begin{aligned} \small \alpha & \small = \E\left[ \frac{(D-\L(D|X))}{\E[(D-\L(D|X))^2]}Y\right] \hspace{500pt}\\ & \small = \E\left[ \frac{(1-\L(D|X))}{\E[(D-\L(D|X))^2]}Y\Big| D=1 \right]p - \E\left[ \frac{\L(D|X)}{\E[(D-\L(D|X))^2]}Y\Big| D=0 \right](1-p) \end{aligned}` $$ --- count:false # Sketch Proof $$ `\begin{aligned} \small \alpha & \small = \E\left[ \frac{(D-\L(D|X))}{\E[(D-\L(D|X))^2]}Y\right] \hspace{500pt}\\ & \small = \E\left[ \frac{(1-\L(D|X))}{\E[(D-\L(D|X))^2]}Y\Big| D=1 \right]p - \E\left[ \frac{\L(D|X)}{\E[(D-\L(D|X))^2]}Y\Big| D=0 \right](1-p)\\ & \small = \E\left[ \frac{(1-\L(D|X))}{\E[(D-\L(D|X))^2]}\E[Y|X,D=1]\Big| D=1 \right]p - \E\left[ \frac{\L(D|X)}{\E[(D-\L(D|X))^2]}\E[Y|X,D=0]\Big| D=0 \right] (1-p) \end{aligned}` $$ --- count:false # Sketch Proof $$ `\begin{aligned} \small \alpha & \small = \E\left[ \frac{(D-\L(D|X))}{\E[(D-\L(D|X))^2]}Y\right] \hspace{500pt}\\ & \small = \E\left[ \frac{(1-\L(D|X))}{\E[(D-\L(D|X))^2]}Y\Big| D=1 \right]p - \E\left[ \frac{\L(D|X)}{\E[(D-\L(D|X))^2]}Y\Big| D=0 \right](1-p)\\ & \small = \E\left[ \frac{(1-\L(D|X))}{\E[(D-\L(D|X))^2]}\E[Y|X,D=1]\Big| D=1 \right]p - \E\left[ \frac{\L(D|X)}{\E[(D-\L(D|X))^2]}\E[Y|X,D=0]\Big| D=0 \right] (1-p)\\ & \small = \E\left[ \frac{p(X) (1-\L(D|X))}{\E[(D-\L(D|X))^2]}\E[Y|X,D=1] \right] - \E\left[ \frac{(1-p(X))\L(D|X)}{\E[(D-\L(D|X))^2]}\E[Y|X,D=0] \right] \end{aligned}` $$ --- count:false # Sketch Proof $$ `\begin{aligned} \small \alpha & \small = \E\left[ \frac{(D-\L(D|X))}{\E[(D-\L(D|X))^2]}Y\right] \hspace{500pt}\\ & \small = \E\left[ \frac{(1-\L(D|X))}{\E[(D-\L(D|X))^2]}Y\Big| D=1 \right]p - \E\left[ \frac{\L(D|X)}{\E[(D-\L(D|X))^2]}Y\Big| D=0 \right](1-p)\\ & \small = \E\left[ \frac{(1-\L(D|X))}{\E[(D-\L(D|X))^2]}\E[Y|X,D=1]\Big| D=1 \right]p - \E\left[ \frac{\L(D|X)}{\E[(D-\L(D|X))^2]}\E[Y|X,D=0]\Big| D=0 \right] (1-p)\\ & \small = \E\left[ \frac{p(X) (1-\L(D|X))}{\E[(D-\L(D|X))^2]}\E[Y|X,D=1] \right] - \E\left[ \frac{(1-p(X))\L(D|X)}{\E[(D-\L(D|X))^2]}\E[Y|X,D=0] \right]\\ & \small = \E\left[ \frac{p(X) (1-\L(D|X))}{\E[(D-\L(D|X))^2]}\Big(\E[Y|X,D=1]-\E[Y|X,D=0]\Big) \right] + \underbrace{\E\left[ \frac{(p(X)-\L(D|X))}{\E[(D-\L(D|X))^2]}\E[Y|X,D=0] \right]}_{\text{level-dependence}} \end{aligned}` $$ -- * Throughout, `\(p := \P(D=1)\)`. * The underbraced term is undesirable. * Blandhol et al. (2022) refer to this type of term as "level-dependence" * It is equal to 0 if `\(p(X) = \L(D|X)\)`. [[Back](#angrist)] --- count: false name: did-example # DID Example Suppose you have two time periods and run the following regression in first differences: `$$\Delta Y_{it} = \alpha D_{it} + \Delta X_{it}'\beta + \Delta e_{it}$$` In this setting, the condition equivalent to linearity of the propensity score is that `$$p(\Delta X) = \L(D|\Delta X)$$` But the "leading" case where this condition holds (discrete covariates with a saturated model) no longer applies * E.g., suppose `\(X_{it}\)` is binary * The propensity score is no longer linear by construction: `\(\Delta X\)` can take 3 possible values (-1, 0, 1), but there are 4 possible combinations of `\(X_{t-1}\)` and `\(X_t\)` ( `\(\Delta X = 0\)` pools `\((0,0)\)` and `\((1,1)\)`, which can have different treatment probabilities; see the simulated sketch on the next slide) [[Back](#linear-pscore)]
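 --- count: false # DID Example: Simulated Sketch

A minimal simulation of this point (the DGP below is made up purely for illustration): with binary `\(X\)`, the treatment probability depends on the pair `\((X_{t-1}, X_t)\)`, so `\(\Delta X = 0\)` pools two cells with very different probabilities and `\(p(\Delta X)\)` ends up nonlinear in `\(\Delta X\)`.

```r
set.seed(1)
n     <- 1e6
Xpre  <- rbinom(n, 1, 0.5)   # X_{t-1}, binary
Xpost <- rbinom(n, 1, 0.5)   # X_t, binary
# Hypothetical treatment probability: depends on the pair (X_{t-1}, X_t),
# not just on the difference; cells (0,0) and (1,1) get p = 0.2 and 0.9
pD <- 0.2 + 0.2 * Xpre + 0.2 * Xpost + 0.3 * Xpre * Xpost
D  <- rbinom(n, 1, pD)
dX <- Xpost - Xpre           # takes the values -1, 0, 1

tapply(D, dX, mean)          # p(dX): approx 0.4, 0.55, 0.4 -- nonlinear
predict(lm(D ~ dX),          # L(D|dX): approx constant at 0.475
        newdata = data.frame(dX = c(-1, 0, 1)))
```

Even though `\(\L(D|\Delta X)\)` is (approximately) constant here, `\(p(\Delta X)\)` is not; no linear function of `\(\Delta X\)` can match it. [[Back](#linear-pscore)]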