Material covered in class

Additional material

Preliminary Reading

I think that the following material should all be review from last semester.

Early in the semester, I would like for you to read chapters 1, 2, and 6, and appendices A and B as you have time. Please pay particular attention to the following sections:

2.5-2.25, 2.28, 6.1-6.7, A.1, A.3-A.6, A.10-A.11, A.20

and, at a minimum, additionally read:

all of chapter 1, the remaining sections of chapter 2, 6.8, A.7-A.9, A.21-A.23

Additional comments

Lecture Notes

Why Linear Regression?

Related Reading: H 2.18, H 2.11

We will spend a lot of time this semester studying properties of \(\beta\) defined as

\[ \beta = \underset{b}{\arg \min} \ \mathbb{E}[(Y - X'b)^2] \]

This has the solution (make sure you recall how to show this)

\[ \beta = \mathbb{E}[XX']^{-1}\mathbb{E}[XY] \]
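
As a quick refresher, here is one standard way to show this (assuming \(\mathbb{E}[XX']\) is invertible). Expand the objective:

\[ \mathbb{E}[(Y - X'b)^2] = \mathbb{E}[Y^2] - 2b'\mathbb{E}[XY] + b'\mathbb{E}[XX']b \] so the first-order condition with respect to \(b\) is \(-2\mathbb{E}[XY] + 2\mathbb{E}[XX']b = 0\), and solving for \(b\) gives the expression above.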

Here are some reasons to be interested in \(\beta\):

  1. \(X'\beta\) is the best linear predictor of \(Y\) given \(X\) (see H 2.18)

  2. If we additionally (somehow) know that \(\mathbb{E}[Y|X] = X'\beta\), then \(X'\beta\) is the best predictor of \(Y\) given \(X\) (see H 2.11)

Both of these are properties related to making predictions. This is interesting/useful in lots of contexts. For example, suppose that you work as an appraiser and want to predict how much a house will sell for. These results suggest that you could run a regression of the price that houses sell for on their characteristics (e.g., number of square feet, number of bathrooms, etc.). This would give you an estimate \(\hat{\beta}\). If you then wanted to predict the selling price of a house with characteristics \(x\), the above results suggest that \(x'\hat{\beta}\) should be a good prediction (relative to trying to use the same information in some other way to make a prediction).
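
To make the appraiser example concrete, here is a minimal sketch in Python. The data are simulated, and the "true" coefficients and house characteristics are made up purely for illustration; the point is just to compute \(\hat{\beta}\) as the sample analog of the formula above and form the prediction \(x'\hat{\beta}\).

```python
import numpy as np

# A minimal sketch of the appraiser example using simulated data. The
# "true" coefficients and the house characteristics below are made up
# purely for illustration.
rng = np.random.default_rng(0)
n = 500
sqft = rng.uniform(800, 3500, n)            # square feet
baths = rng.integers(1, 4, n)               # number of bathrooms
price = 50_000 + 120 * sqft + 15_000 * baths + rng.normal(0, 25_000, n)

# Regressors, with the intercept in the last position (matching the
# notation used later in these notes)
X = np.column_stack([sqft, baths, np.ones(n)])

# Sample analog of beta = E[XX']^{-1} E[XY]
beta_hat = np.linalg.solve(X.T @ X, X.T @ price)

# Prediction x'beta_hat for a house with characteristics x
x_new = np.array([2000, 2, 1])              # 2000 sqft, 2 baths, intercept
print(x_new @ beta_hat)
```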

These sorts of prediction problems are extremely common (you can especially imagine that this is important in numerous business/tech applications), and there have been major advancements in using data to make predictions over the past 20-30 years.

That said, most research in economics (and social sciences, business fields, etc.) is not so much interested in predictions, per se. For example, my secondary research interest is in labor economics. In labor economics, there is tons of work where the outcome is a person’s earnings. However, I have never seen any researcher who was primarily interested in taking a person’s characteristics (say, their years of education, demographic characteristics, etc.) and making predictions about what their earnings will be.

Instead, much research in labor economics concerns the effects of different policies (e.g., minimum wage policies) or other interventions (e.g., a person participating in a union, going to college, or losing their job) on earnings. Learning about “effects” is related to making predictions (and we’ll see that most of the main tools for prediction are also useful for evaluating effects of policies/interventions), but there are also some subtle and important differences.

Regression Derivatives

Related Reading: H 2.14

Following the textbook, we’ll use the shorthand notation \(m(x) := \mathbb{E}[Y|X=x]\). We will often be interested in the regression derivative. An example of a regression derivative is

\[ \frac{\partial \, \mathbb{E}[Y|X=x]}{\partial \, x_1} \] which is well-defined when \(x_1\) is continuously distributed. This derivative should be interpreted as how much \(Y\) changes, on average, when \(x_1\) increases by one unit, holding the other regressors constant.

You can also define a regression derivative when \(X_1\) is discrete. For example, suppose that \(X_1\) is binary (so it only takes the value 0 or 1), then the regression derivative is given by

\[ \mathbb{E}[Y|X_1=1, X_2=x_2, \ldots, X_k=x_k] - \mathbb{E}[Y|X_1=0, X_2=x_2, \ldots X_k=x_k] \]

You could similarly define a regression derivative for the case where \(X_1\) was discrete but took more possible values.

In order to unify notation, we write

\[ \begin{aligned} \nabla_1 m(x) := \begin{cases} \frac{\partial \, \mathbb{E}[Y|X=x]}{\partial x_1} & \textrm{ if $x_1$ is continuous} \\ \mathbb{E}[Y|X_1=1, X_2=x_2, \ldots, X_k=x_k] - \mathbb{E}[Y|X_1=0, X_2=x_2, \ldots X_k=x_k] & \textrm{ if $x_1$ is binary } \end{cases} \end{aligned} \] There is nothing unique about defining partial effects for just \(X_1\), and we can likewise define partial effects for \(X_2, \ldots, X_k\), for example, \(\nabla_2 m(x)\) is the partial effect of \(X_2\).

The regression derivatives above are also sometimes called the “partial effect” of \(x_1\) or the “marginal effect” of \(x_1\).
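
As a small illustration, here is a sketch of the two cases of this definition in Python. The CEF \(m\) below is a made-up function chosen purely for illustration; in practice, the CEF is unknown and must be estimated.

```python
# A sketch of the two cases of the regression derivative for a known CEF.
# The function m below is made up purely for illustration; in practice the
# CEF is unknown and must be estimated.
def m(x1, x2):
    return 1.0 + 2.0 * x1 + 0.5 * x2

def grad1_continuous(m, x1, x2, h=1e-6):
    # finite-difference approximation to the partial derivative in x1
    return (m(x1 + h, x2) - m(x1 - h, x2)) / (2 * h)

def grad1_binary(m, x2):
    # discrete analog: m(1, x2) - m(0, x2)
    return m(1, x2) - m(0, x2)

print(grad1_continuous(m, x1=1.0, x2=3.0))  # approximately 2.0
print(grad1_binary(m, x2=3.0))              # exactly 2.0
```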

Some Comments

  • First, partial effects hold other regressors constant. But they do not hold other variables that are not in the model constant.

  • Second, you should notice that \(\nabla_1 m(x)\) is a function of \(x\). If you plug in different values of \(x\), then the value of this function could change. For example, if you take \(X_1\) to be a binary variable indicating whether or not an individual attended college, \(Y\) to be their earnings, and \(X_2\) to be a person’s age, you could imagine that the partial effect of college differs depending on a person’s age.

  • Third, partial effects are really about averages rather than individual-level effects. Continuing the example of the return to going to college — you can easily imagine that, holding age constant, the effect of going to college on a person’s earnings may vary (perhaps tremendously) across different people. The regression derivative averages over all of these individual-level effects while holding age constant.

Notation: I’ll follow the convention in the book by writing

\[ \mathbb{E}[Y|X=x] = x_1\beta_1 + x_2 \beta_2 + \cdots + x_{k-1}\beta_{k-1} + \beta_k \] so that the “intercept” is in the last position. More specifically,

\[ X = \begin{pmatrix}X_1 \\ X_2 \\ \vdots \\ X_{k-1} \\ 1 \end{pmatrix} \qquad \qquad \beta = \begin{pmatrix} \beta_1 \\ \vdots \\ \beta_k \end{pmatrix} \]

Under the linear CEF model where \(m(x) = x'\beta\), \(\nabla_1 m(x) = \beta_1\) (except in specifications that include interaction terms, quadratic terms, etc.).
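
To see why the caveat matters, suppose the CEF includes an interaction term:

\[ m(x) = x_1 \beta_1 + x_2 \beta_2 + x_1 x_2 \beta_3 + \beta_4 \quad \Longrightarrow \quad \nabla_1 m(x) = \beta_1 + x_2 \beta_3 \] so that the partial effect of \(x_1\) varies with \(x_2\) rather than being the constant \(\beta_1\).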

Causal Effects

Related Reading: H 2.30

Now, let’s move to thinking about causal effects. I’ll talk briefly about how to think about this conceptually and then how this is related to regression derivatives and linear regression.

Notation: In cases (like in the current section) where we are interested in understanding the effect of a particular variable, I may denote it by \(D\) (which is common in many academic papers), while referring to all remaining regressors as \(X\) (I’ll probably also use the term “covariates” for these other regressors).

Binary Treatment

[I think most of the material in this section will be review, so I’ll cover it relatively quickly.]

Work on understanding the effect of a particular variable of interest on some outcome is typically called the “treatment effects literature”. This terminology originates from the biostatistics literature where a treatment could literally refer to a medical treatment. We’ll use the term treatment more broadly to refer to a policy or some intervention that we are interested in studying.

Let’s start with the case where the treatment is binary; that is \(D_i=1\) if a unit participates in the treatment and \(D_i=0\) if a unit does not participate in the treatment.

We’ll also define potential outcomes \(Y_i(1)\) and \(Y_i(0)\) — these are the outcomes that a unit would experience if it participated in the treatment or if it did not participate in the treatment, respectively. For any particular unit, the researcher only observes one of these potential outcomes; that is, for treated units, we observe their treated potential outcomes, and for untreated units, we observe their untreated potential outcomes. We can therefore write the observed outcome as

\[ Y_i = D_i Y_i(1) + (1-D_i)Y_i(0) \]

and, it is convenient to note that this can also be written as

\[ Y_i = Y_i(0) + D_i (Y_i(1) - Y_i(0)) (\#eq:observed-outcomes) \]

which follows just by re-arranging terms from the previous equation.

Target Parameters

In the context of a binary treatment, much research targets one of the following two parameters:

\[ \begin{aligned} ATE &:= \mathbb{E}[Y(1) - Y(0)] \\ ATT &:= \mathbb{E}[Y(1) - Y(0) | D=1] \end{aligned} \] \(ATE\) stands for “average treatment effect” and \(ATT\) stands for “average treatment effect on the treated”. \(ATE\) is the average difference between treated and untreated potential outcomes for the entire population. \(ATT\) is the average difference between treated and untreated potential outcomes among those that participate in the treatment.

It may seem like \(ATE\) is inherently more interesting than \(ATT\), but I don’t think this is necessarily the case. To give an example, suppose you are interested in studying the causal effect of job training on people’s earnings. Presumably, the effect of job training is exactly 0 for a large portion of the population. In this case, \(ATT\) is probably the more relevant parameter to aim to identify — it is the average effect of job training among those that actually participate.
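
Here is a small simulation in the spirit of the job training example (all numbers are made up). The effect is zero for most of the population, and units with positive effects are much more likely to participate, so \(ATT\) and \(ATE\) differ meaningfully. Because this is a simulation, we can play “oracle” and observe both potential outcomes, which a researcher cannot actually do.

```python
import numpy as np

# A sketch in the spirit of the job training example; all numbers are made
# up. The effect is zero for 80% of the population, and people with positive
# effects are much more likely to participate.
rng = np.random.default_rng(0)
n = 100_000
effect = np.where(rng.uniform(size=n) < 0.8, 0.0, rng.normal(2.0, 0.5, n))
y0 = rng.normal(10.0, 1.0, n)
y1 = y0 + effect
d = (rng.uniform(size=n) < 0.05 + 0.60 * (effect > 0)).astype(int)

# Playing "oracle" (observing both potential outcomes, which a researcher
# cannot actually do):
print("ATE:", (y1 - y0).mean())          # diluted by the zero-effect group
print("ATT:", (y1 - y0)[d == 1].mean())  # average effect among participants
```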

For much of the course, we will target identifying the \(ATT\) — at the beginning of the course, this is mainly to make the arguments more concise, and we could instead target \(ATE\). That said, there are some cases where we will explicitly target \(ATE\), and there will be some other cases (particularly when we discuss panel data) where identifying \(ATE\) would require different sorts of arguments than identifying \(ATT\).

Experiments

If we had access to an experiment (that is, if we could randomly assign units to either participate in the treatment or not), it would follow that

\[ (Y(1),Y(0)) \perp D (\#eq:random-assignment) \]

In words, if we can randomly assign treatment, then (by construction) potential outcomes are independent of participating in the treatment. More informally, there is “nothing special” about units that participate in the treatment relative to those that do not participate in the treatment (at least in terms of their potential outcomes).

Let’s think about identifying \(ATT\) under random assignment as in Equation @ref(eq:random-assignment). Notice that

\[ \begin{aligned} ATT &= \mathbb{E}[Y(1) - Y(0) | D=1] \\ &= \mathbb{E}[Y(1) | D=1] - \mathbb{E}[Y(0) | D=1] \\ &= \underbrace{\mathbb{E}[Y|D=1]}_{\textrm{Easy}} - \underbrace{\mathbb{E}[Y(0)|D=1]}_{\textrm{Hard}} \end{aligned} \]

The previous display indicates that \(ATT\) is equal to the average outcome actually experienced by the treated group relative to the average outcome among those in the treated group if they had not participated in the treatment. The first term is “easy” because those outcomes are observed outcomes. The second term is “hard” because we do not observe untreated potential outcomes for the treated group.

However, Equation @ref(eq:random-assignment) implies that \(\mathbb{E}[Y(0)|D=1] = \mathbb{E}[Y(0)|D=0]\). That is, because untreated potential outcomes are independent of treatment, the average untreated potential outcome among the treated group is the same as the average untreated potential outcome among the untreated group. This, therefore, implies that (given random assignment):

\[ ATT = \mathbb{E}[Y|D=1] - \mathbb{E}[Y|D=0] \] That is, we can recover the \(ATT\) by comparing the average outcomes among the treated group relative to the average outcomes among the untreated group.

Practice: Given the above expression for \(ATT\), what is the natural way to estimate \(ATT\)?
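
In case it is helpful, here is a sketch of the natural plug-in answer (simulated data; the parameter values are made up): replace the population expectations \(\mathbb{E}[Y|D=1]\) and \(\mathbb{E}[Y|D=0]\) with the corresponding sample averages.

```python
import numpy as np

# Plug-in estimator of ATT under random assignment (simulated data; the
# parameter values are made up): replace E[Y|D=1] and E[Y|D=0] with the
# corresponding sample averages.
rng = np.random.default_rng(0)
n = 10_000
d = rng.integers(0, 2, n)            # randomly assigned binary treatment
y0 = rng.normal(5.0, 1.0, n)
y1 = y0 + 1.0                        # treatment effect of 1.0 for everyone
y = np.where(d == 1, y1, y0)         # observed outcome

att_hat = y[d == 1].mean() - y[d == 0].mean()
print(att_hat)                       # close to 1.0
```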

Now, let’s think about how to estimate causal effects using a regression (and given random assignment) — this is going to be very simple, but I think it is worth explaining so that we can use the same sorts of procedures in more complicated cases below.

Let’s write an extremely simple model for untreated potential outcomes:

\[ Y_i(0) = \beta_0 + e_i (\#eq:y0-model-experiment) \]
By construction, we have that \(\mathbb{E}[e]=0\), but random assignment also implies that \(\mathbb{E}[e|D=d] = 0\) for \(d \in \{0,1\}\). To see this, notice that \(\mathbb{E}[Y(0)|D=d] = \beta_0 + \mathbb{E}[e|D=d]\). Recall that random assignment implies that \(\mathbb{E}[Y(0)|D=1]=\mathbb{E}[Y(0)|D=0]\), so it must be the case that \(\mathbb{E}[e|D=1]=\mathbb{E}[e|D=0]\); combined with \(\mathbb{E}[e]=0\), this implies that \(\mathbb{E}[e|D=1]=\mathbb{E}[e|D=0]=0\).

Let’s also make an additional assumption called treatment effect homogeneity. In math, we can write this as \(Y_i(1) - Y_i(0) = \alpha\). This means that the effect of participating in the treatment is the same for all units (and is equal to \(\alpha\)). This is probably a strong assumption; in my view, one would expect that the effect of participating in most any treatment could conceivably vary across units (especially in economics, social sciences, and most business applications). But let’s just make this assumption for now — we’ll talk about it much more in the future.

Next, notice that \[ \begin{aligned} Y_i &= Y_i(0) + D_i (Y_i(1) - Y_i(0)) \\ &= Y_i(0) + \alpha D_i \\ &= \beta_0 + \alpha D_i + e_i (\#eq:reg-experiment) \end{aligned} \] where the first equality comes from @ref(eq:observed-outcomes), the second equality holds by treatment effect homogeneity, and the last equality holds from @ref(eq:y0-model-experiment) and by rearranging terms. Moreover, because \(\mathbb{E}[e|D] = 0\), this suggests estimating \(\alpha\) (the causal effect of the treatment) by running a regression of \(Y\) on \(D\).

To conclude this discussion, it is interesting to notice that, given the regression in @ref(eq:reg-experiment), \[ \begin{aligned} \mathbb{E}[Y | D=1] &= \beta_0 + \alpha \\ \mathbb{E}[Y|D=0] &= \beta_0 \end{aligned} \] and subtracting the second equation from the first equation and re-arranging implies that \[ \alpha = \mathbb{E}[Y|D=1] - \mathbb{E}[Y|D=0] \] which further implies that \(\alpha=ATT\). This is interesting because we derived the regression in @ref(eq:reg-experiment) under the extra condition of treatment effect homogeneity. However, the fact that \(\alpha=ATT\) means that this regression is robust to treatment effect heterogeneity.
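
As a quick numerical check (a sketch with simulated data and made-up parameter values), the OLS coefficient on \(D\) from regressing \(Y\) on \(D\) and a constant coincides with the difference in sample means:

```python
import numpy as np

# A numerical check (simulated data, made-up parameter values) that the OLS
# coefficient on D from regressing Y on (D, 1) equals the difference in means.
rng = np.random.default_rng(0)
n = 10_000
d = rng.integers(0, 2, n)
y = 5.0 + 1.0 * d + rng.normal(0, 1.0, n)   # beta_0 = 5, alpha = 1

X = np.column_stack([d, np.ones(n)])        # intercept in the last position
alpha_hat, beta0_hat = np.linalg.solve(X.T @ X, X.T @ y)
diff_in_means = y[d == 1].mean() - y[d == 0].mean()
print(alpha_hat, diff_in_means)             # equal up to floating-point error
```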

Unconfoundedness

In most applications in economics, researchers do not have access to an experiment (or, alternatively, do not have the ability to randomly assign units to participate in the treatment or not). In cases with “observational” data (meaning: non-experimental data), one of the most common assumptions for thinking about causal effects is the following unconfoundedness assumption (you may also sometimes hear this called selection-on-observables, and the textbook refers to this as a conditional independence assumption):

\[ (Y(1), Y(0)) \perp D | X \] Unconfoundedness says that potential outcomes are independent of the treatment after conditioning on some covariates \(X\). Informally, unconfoundedness means that, among units with the same characteristics \(X\), the distribution of treated and untreated potential outcomes is the same among the treated and untreated group (though the distribution of \(X\) could differ across groups). If you want to assume unconfoundedness, it often needs to be rationalized (perhaps informally) on theoretical grounds.

Side Comment: Sometimes the assumption that \(Y(0) \perp D | X\) can be meaningfully weaker than what I have called unconfoundedness above. In particular, this assumption just implies that treated and untreated units with the same characteristics \(X\) have the same distribution of untreated potential outcomes (but would allow for treated units to, for example, have systematically better treated potential outcomes than untreated units). The assumption in this comment is strong enough to identify \(ATT\), but it is not strong enough to identify \(ATE\).

To connect this to running a regression, let’s make some additional assumptions. First, let’s assume a model for untreated potential outcomes: \[ Y_i(0) = X_i'\beta + e_i \] This is a linearity assumption for untreated potential outcomes. Notice that unconfoundedness implies that \(\mathbb{E}[Y(0) | X, D=1] = \mathbb{E}[Y(0) | X, D=0]\) which (given linearity) implies that \(\mathbb{E}[e|X,D=d] = 0\) for \(d \in \{0,1\}\). Next, let’s make the treatment effect homogeneity assumption that \(Y_i(1) - Y_i(0) = \alpha\). Then,

\[ \begin{aligned} Y_i &= Y_i(0) + D_i(Y_i(1) - Y_i(0)) \\ &= Y_i(0) + \alpha D_i \\ &= \alpha D_i + X_i'\beta + e_i \end{aligned} \] where the first equality holds by @ref(eq:observed-outcomes), the second equality holds by the treatment effect homogeneity condition, and the third equality holds by the model for untreated potential outcomes and by rearranging. This equation suggests estimating the causal effect of \(D\) on \(Y\) by running a regression of \(Y\) on \(D\) and \(X\) and interpreting the estimated coefficient on \(D\) as an estimate of the causal effect.
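
Here is a sketch illustrating the point (simulated data; the parameter values are made up): \(X\) shifts both treatment take-up and untreated potential outcomes, so the raw difference in means is biased, while the regression of \(Y\) on \(D\) and \(X\) recovers \(\alpha\).

```python
import numpy as np

# A sketch (simulated data, made-up parameter values): X shifts both
# treatment take-up and untreated potential outcomes, so the raw difference
# in means is biased, while regressing Y on (D, X) recovers alpha.
rng = np.random.default_rng(0)
n = 50_000
x = rng.normal(0.0, 1.0, n)
d = (rng.uniform(size=n) < 1 / (1 + np.exp(-2 * x))).astype(int)
alpha, beta = 1.0, 2.0
y = alpha * d + beta * x + rng.normal(0, 1.0, n)   # Y(0) = x*beta + e

print(y[d == 1].mean() - y[d == 0].mean())    # biased (confounded by x)

X = np.column_stack([d, x, np.ones(n)])       # include an intercept as well
print(np.linalg.solve(X.T @ X, X.T @ y)[0])   # close to alpha = 1.0
```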

Unlike in the earlier case of random assignment, this regression is not robust to violations of treatment effect homogeneity. Later in the semester, we will talk about exactly what this regression recovers in the presence of treatment effect heterogeneity, and we will also talk about some alternative methods that are more robust to violations of treatment effect homogeneity. It is also not robust to violations of the linear model for untreated potential outcomes. I am not totally sure about this, but my sense is that, in cases where unconfoundedness holds, the “empirical relevance” of violations of treatment effect homogeneity and linearity of untreated potential outcomes is relatively small.

Continuous Treatment

So far, we have talked about the case with a binary treatment. Next, let’s move to the case where the treatment can take on a continuum of values. I’ll talk here about the case where the treatment can take values in \(\mathcal{D} = \{0\} \cup [d_L,d_U]\). In other words, it is possible that some units do not participate in the treatment at all, but, otherwise, the treatment is continuous in the range from \(d_L\) to \(d_U\). I won’t cover intermediate cases such as a multi-valued discrete treatment, but the arguments would basically be a combination of the ones in this section with the ones in the previous section with binary treatment.

We use \(D_i\) to denote the actual amount of the treatment that unit \(i\) experiences. We’ll define potential outcomes using slightly extended notation from the previous section. In particular, let \(Y_i(d)\) denote the outcome that would occur for unit \(i\) if they were to experience dose \(d\). The observed outcome is given by

\[ \begin{aligned} Y_i &= Y_i(D_i) \\ &= Y_i(0) + (Y_i(D_i) - Y_i(0)) (\#eq:observed-outcomes-continuous-treatment) \end{aligned} \] In other words, we observe outcomes corresponding to the actual amount of the treatment for a particular unit. The second equality holds by adding and subtracting \(Y_i(0)\) and will be helpful in some derivations below. [As a side-comment, in cases where it is not possible to be untreated or where defining untreated potential outcomes is somehow “awkward”, the arguments below will follow with trivial modifications by replacing “untreated” with the smallest possible amount of the treatment.]

Let’s briefly talk about the sorts of parameters that you could be interested in for this case. One class of parameters is “level effects” such as \[ \begin{aligned} ATT(d) &:= \mathbb{E}[Y(d) - Y(0) | D=d] \\ ATE(d) &:= \mathbb{E}[Y(d) - Y(0)] \end{aligned} \] These are quite similar to \(ATT\) and \(ATE\) that we talked about in the case with a binary treatment. \(ATT(d)\) is the average difference between potential outcomes under dose \(d\) relative to untreated potential outcomes among those that actually experienced dose \(d\). \(ATE(d)\) is the overall average difference between potential outcomes under dose \(d\) relative to untreated potential outcomes.

When the treatment is continuous, it also makes sense to think about “slope effects” that are derivatives of the above parameters. For example, one could be interested in the average causal response

\[ ACR(d) := \frac{ \partial \, ATE(d) }{\partial \, d} \]

Side-Comment: Another interesting target parameter would be a derivative of \(ATT(d)\), though this is somewhat conceptually harder to think about. In particular, let’s expand the notation above to define

\[ ATT(d|d') = \mathbb{E}[Y(d) - Y(0) | D=d'] \] so that this is the average difference between potential outcomes under dose \(d\) relative to untreated potential outcomes among those that experienced dose \(d'\) — which breaks the connection between the dose for the potential outcomes and the dose being conditioned on.

Then, one can define the average causal response on the treated \[ ACRT(d|d') := \frac{\partial \, ATT(l|d')}{\partial \, l}\Big|_{l=d} \] This is the effect of a marginal increase in the treatment (relative to dose \(d\)) among those that actually experienced dose \(d'\).

At the cost of somewhat stronger assumptions (in some cases), we’ll mainly target \(ACR(d)\), mostly for simplicity.

Side Comment: \(ACR(d)\) is a functional parameter — you could plug in different values of \(d\) and \(ACR(d)\) could take a different value. Many times researchers would like to report a single number to summarize the causal effect of a treatment. In this case, a natural summary measure is

\[ ACR^O := \mathbb{E}[ACR(D) | D>0] \] which is just \(ACR(d)\) averaged over the distribution of the dose among units that experienced a positive dose.

Let’s start with the case where the amount (sometimes this is called the “dose”) of the treatment is randomly assigned. This implies that, for all \(d \in \mathcal{D}\),

\[ Y(d) \perp D \] In other words, potential outcomes are independent of the amount of the treatment. Let’s also make a treatment effect homogeneity assumption: for all \(d \in \mathcal{D}\), \(Y_i(d) - Y_i(0) = \alpha d\). Notice that this implies that \[ \begin{aligned} Y_i'(d) &:= \lim_{h \rightarrow 0} \frac{Y_i(d+h) - Y_i(d)}{h} \\ &= \lim_{h \rightarrow 0} \frac{\alpha (d+h) - \alpha d}{h} \\ &= \alpha. \end{aligned} \] where the second line uses the treatment effect homogeneity assumption, and the last line follows just from canceling terms. This means that \(\alpha\) should be interpreted as how much outcomes causally increase under a one-unit increase in the dose, and (under the assumptions we have made) this is constant across units and across different amounts of the dose.

As in the previous section, treatment effect homogeneity is likely to be very strong. As in the case with a binary treatment, it restricts treatment effects to be constant across units. In this case, it is additionally potentially restrictive in that it requires that the causal effect of more dose is the same regardless of the “starting dose.” As before, let us delay trying to relax this assumption and/or thinking about what potential issues it could cause and just go with it for now.

Finally, let’s use the same model for untreated potential outcomes as in @ref(eq:y0-model-experiment), where, by random assignment, it holds that \(\mathbb{E}[e|D=d] = 0\) for all \(d \in \mathcal{D}\).

Now, notice that

\[ \begin{aligned} Y_i &= Y_i(0) + (Y_i(D_i) - Y_i(0)) \\ &= Y_i(0) + \alpha D_i \\ &= \beta_0 + \alpha D_i + e_i \end{aligned} \] where the first equality uses @ref(eq:observed-outcomes-continuous-treatment), the second equality uses treatment effect homogeneity, and the third equality uses @ref(eq:y0-model-experiment) and re-arranges terms. This discussion suggests (in the case where the amount of the treatment is randomly assigned and under treatment effect homogeneity) running a regression of \(Y\) on \(D\) and interpreting the estimated coefficient \(\hat{\alpha}\) as the causal effect of a marginal increase in the dose.
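
Here is a sketch of this case (simulated data; the dose distribution, \(\beta_0\), and \(\alpha\) below are all made up), where the dose is randomly assigned and treatment effect homogeneity holds by construction:

```python
import numpy as np

# A sketch of a randomly assigned continuous dose with homogeneous effects
# Y_i(d) - Y_i(0) = alpha * d. The dose distribution, beta_0, and alpha are
# all made up: 30% of units are untreated, the rest get a dose in [1, 10].
rng = np.random.default_rng(0)
n = 20_000
d = np.where(rng.uniform(size=n) < 0.3, 0.0, rng.uniform(1.0, 10.0, n))
beta0, alpha = 2.0, 0.5
y = beta0 + alpha * d + rng.normal(0, 1.0, n)   # Y(0) = beta_0 + e

X = np.column_stack([d, np.ones(n)])
alpha_hat, beta0_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(alpha_hat)                                # close to alpha = 0.5
```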

As in the case of unconfoundedness above, treatment effect homogeneity matters in a potentially meaningful way here. We’ll come back to this issue in a few weeks. As in the previous case, my sense is that running the above regression would still be the leading approach to estimating causal effects in this setting, though it is not entirely clear to me how much using alternative approaches that are robust to treatment effect heterogeneity actually matters.

To conclude this section, let’s briefly consider the case of a continuous treatment under unconfoundedness. That is, let’s assume that, for all \(d \in \mathcal{D}\),

\[ Y(d) \perp D | X \]

Let’s make some assumptions that lead to using a regression to estimate the causal effect of a small increase in the dose. As in the case of a binary treatment under unconfoundedness, let’s assume that untreated potential outcomes are generated by the following linear model:

\[ Y_i(0) = X_i'\beta + e_i \] This is a linearity assumption for untreated potential outcomes. Notice that unconfoundedness (given linearity) implies that \(\mathbb{E}[e|X,D=d] = 0\) for all \(d \in \mathcal{D}\). Next, let’s make the treatment effect homogeneity assumption that, for all \(d \in \mathcal{D}\), \(Y_i(d) - Y_i(0) = \alpha d\). Then,

\[ \begin{aligned} Y_i &= Y_i(0) + (Y_i(D_i) - Y_i(0)) \\ &= Y_i(0) + \alpha D_i \\ &= \alpha D_i + X_i'\beta + e_i \end{aligned} \] which holds using similar arguments as we have used before and suggests estimating the causal effect of a marginal increase in the dose by running a regression of \(Y\) on \(D\) and \(X\).

As you would expect (given that this is the most complicated setup we have considered so far), this regression is not robust to violations of treatment effect homogeneity or misspecification of the model for untreated potential outcomes.