Project 1: Causal Effect of Job Training

This project considers the effect of participating in a job training program on earnings.

Part 1: Observational Data

The first part of the project involves trying to come up with a reasonable estimate of the effect of participating in job training using observational data. To start, we will estimate the average difference in earnings between workers who participated in job training and those who did not by regressing earnings in 1978 on an indicator for whether or not a person participated in job training.

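library(haven)         # read_dta()
library(dplyr)         # select()
library(modelsummary)  # modelsummary(), datasummary_balance()
jtrain_obs <- read_dta("jtrain_observational.dta")
modelsummary(lm(re78 ~ train, data=jtrain_obs), 
             gof_map=c("nobs", "r.squared"))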
(Intercept)   21.554
train         −15.205
Num.Obs.      2675
R2            0.061

Here we estimate that participants in job training have much lower earnings in 1978 than non-participants. Earnings are measured in thousands of dollars, so average earnings for non-participants are about $21,500, while job training participants earn about $15,000 less. This is a huge difference.

It is not reasonable to interpret this difference as the causal effect of job training. First, to rationalize this comparison as an estimate of the causal effect of job training on earnings, we would need there to be no confounding variables. That seems unlikely here because (presumably) individuals mainly choose to participate in job training when they are having trouble finding a good job or having other career troubles. We will see further evidence of this in the summary statistics below. Second, while it seems plausible that job training might not improve people’s earnings, it is very hard to imagine that it could cause a huge reduction in earnings. Thus, this estimate does not pass a simple sanity check.
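To see how selection into training can produce a negative naive estimate even when training helps, here is a small simulation sketch. Everything in it is hypothetical: the sample size, the enrollment rule, the true effect of 2, and the variable names (prior, later, p_train) are assumptions chosen for illustration, not features of the project data.

```r
set.seed(1)
n <- 5000

# Hypothetical data-generating process (NOT the project data).
# Prior earnings in $1000s, truncated at zero.
prior <- pmax(rnorm(n, mean = 15, sd = 8), 0)

# Assumed enrollment rule: low prior earnings make enrollment much more likely.
p_train <- plogis(2 - 0.4 * prior)
train <- rbinom(n, 1, p_train)

# Assumed true effect of training: +2 (i.e., +$2000).
true_effect <- 2
later <- 3 + 0.9 * prior + true_effect * train + rnorm(n, sd = 4)

# Naive comparison of means vs. a regression that conditions on the confounder.
naive <- unname(coef(lm(later ~ train))["train"])
adjusted <- unname(coef(lm(later ~ train + prior))["train"])
print(c(naive = naive, adjusted = adjusted))
```

In this made-up setting the naive coefficient comes out negative because trainees start from much lower earnings, while the regression that controls for prior earnings lands close to the assumed effect. Real data need not be this well behaved, since the confounders may be unobserved.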

Summary Statistics

Next, let us consider summary statistics for various person-level characteristics separately for those who did and did not participate in job training.

jtrain_obs$train_factor <- as.factor(ifelse(jtrain_obs$train==1, "train", "non-train"))
datasummary_balance(~ train_factor, 
                    data=select(jtrain_obs, train_factor, re78, re75, re74, unem75, unem74, age, educ, black, hisp, married),
                    title="Job Training Summary Statistics")
Job Training Summary Statistics

          non-train (N=2490)     train (N=185)
          Mean     Std. Dev.     Mean     Std. Dev.     Diff. in Means   Std. Error
re78      21.55    15.56         6.35     7.87          -15.20           0.66
re75      19.06    13.60         1.53     3.22          -17.53           0.36
re74      19.43    13.41         2.10     4.89          -17.33           0.45
unem75    0.10     0.30          0.60     0.49          0.50             0.04
unem74    0.09     0.28          0.71     0.46          0.62             0.03
age       34.85    10.44         25.82    7.16          -9.03            0.57
educ      12.12    3.08          10.35    2.01          -1.77            0.16
black     0.25     0.43          0.84     0.36          0.59             0.03
hisp      0.03     0.18          0.06     0.24          0.03             0.02
married   0.87     0.34          0.19     0.39          -0.68            0.03

From the summary statistics, a few things are immediately noticeable. First, in 1978, earnings are substantially higher among individuals who did not participate in job training than among those who did. If one were to (incorrectly) interpret this difference as the causal effect of job training, one would be suggesting that job training has a tremendously large negative effect on earnings.

It is also immediately clear that there are big differences in other variables. In 1974 and 1975 (before anyone participated in job training), those who eventually participated in job training had much lower earnings and were much more likely to be unemployed. The gap in pre-training earnings immediately suggests that the simple difference in means above is unlikely to provide a reasonable estimate of the causal effect of job training. There are also notable differences in age, education, race, and marital status, all of which could plausibly be related to earnings.

Empirical Strategy

I am going to consider three different approaches to try to estimate the causal effect of participating in job training:

Strategy 1: Regress re78-re75 on train, unem75, unem74, age, educ, black, hisp, married

Strategy 2: Regress re78 on train, re75, re74, unem75, unem74, age, educ, black, hisp, married

Strategy 3: Regress re78-re75 on train

The first strategy is essentially a panel data / difference-in-differences type of strategy but it also includes additional covariates that may explain the change in earnings over time. For the second strategy, the outcome is just the level (not the difference) of earnings in 1978; this strategy includes all the same covariates as in the previous case but additionally includes earnings in 1975 and 1974 as covariates. Finally, the last strategy is a standard version of panel data / difference-in-differences (i.e., doesn’t condition on any other covariates).
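Strategy 3 has a simple closed form: regressing the change in earnings on a binary indicator returns the difference between the two groups' average changes. As a rough check, the sketch below computes that difference-in-differences from the rounded group means in the summary statistics table. The data frame and its column names are only for illustration, and because the inputs are rounded to two decimals, the result should match a regression on the full data only approximately.

```r
# Group means (in $1000s) copied from the summary statistics table.
means <- data.frame(
  group = c("non-train", "train"),
  re75  = c(19.06, 1.53),
  re78  = c(21.55, 6.35)
)

# Average change in earnings from 1975 to 1978 within each group.
means$change <- means$re78 - means$re75

# Difference-in-differences: trainees' change minus non-trainees' change.
did <- means$change[means$group == "train"] -
  means$change[means$group == "non-train"]

print(means)
print(round(did, 2))
```

The non-trainees' average change plays the role of the intercept in the Strategy 3 regression, and did plays the role of the coefficient on train.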

My sense is that you can make a reasonably good case for any of these strategies. Before estimating them, I suspect that Strategy 1 will work the best, Strategy 2 second best (though I would not be shocked if the order were reversed between the first two), and Strategy 3 the third best. My main concern with Strategy 3 is that parallel trends will not hold here. I am mainly worried about this because the two groups seem so different, and it seems like paths of earnings could very well depend on things like a person’s education, employment history, and age, regardless of whether the person participates in job training.

I’ll add one more possible strategy too:

Strategy 4: Regress re78 on train, age, educ, black, hisp, married

My guess is that Strategy 4 will not perform well, but I include it here because these covariates are the sort of demographic variables that are very commonly observed, and, therefore, it seems like a natural model to compare to.

Estimation Results

reg1 <- lm(I(re78-re75) ~ train + unem75 + unem74 + age + educ + black + hisp + married, data=jtrain_obs)
reg2 <- lm(re78 ~ train + re75 + re74 + unem75 + unem74 + age + educ + black + hisp + married, data=jtrain_obs)
reg3 <- lm(I(re78-re75) ~ train, data=jtrain_obs)
reg4 <- lm(re78 ~ train + age + educ + black + hisp + married, data=jtrain_obs)
modelsummary(list(reg1,reg2,reg3,reg4), gof_map=NA)
              (1)        (2)        (3)        (4)
(Intercept)   2.263      0.954      2.491      −9.039
              (1.405)    (1.371)    (0.214)    (1.865)
train         0.690      0.115      2.327      −5.890
              (1.039)    (1.007)    (0.814)    (1.239)
unem75        5.075      −1.462
              (0.876)    (0.947)
unem74        −1.658     2.390
              (0.939)    (1.024)
age           −0.134     −0.090                0.167
              (0.021)    (0.022)               (0.028)
educ          0.302      0.514                 1.775
              (0.074)    (0.076)               (0.098)
black         −0.003     −0.454                −2.731
              (0.513)    (0.497)               (0.680)
hisp          2.825      2.197                 0.885
              (1.136)    (1.092)               (1.507)
married       0.922      1.205                 4.533
              (0.606)    (0.585)               (0.803)
re75                     0.544
re74                     0.313

These are very interesting results. Let’s start with the easy one. In Strategy 4, we estimate that job training decreases earnings by almost $6000 — this is clearly an unreasonable estimate. That we get an unreasonable estimate here is not surprising based on our earlier discussion.

The estimates from Strategy 1 and Strategy 2 are actually quite similar. In both cases, the estimated effect is positive but not statistically significant. I would interpret these as not being strong enough to indicate that job training has any effect on earnings, but it is also worth pointing out that the confidence intervals here are fairly wide and don’t rule out that job training could have increased people’s earnings by as much as roughly $2000 to $2700. In Strategy 3, we estimate a positive and statistically significant effect of job training on earnings; here, we estimate that job training increases earnings by a little over $2000 per year.

Before moving forward, I think it is worth also writing down a 95% confidence interval for the estimated effect of job training from each of the first three strategies (in thousands of dollars):

             CI-lower   CI-upper
Strategy 1   -1.35      2.73
Strategy 2   -1.86      2.09
Strategy 3   0.73       3.92
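These intervals can be reproduced, up to rounding, from the point estimates and standard errors in the regression table as estimate plus or minus a critical value times the standard error. The sketch below assumes the normal critical value (about 1.96). Calling confint() on the fitted models would use a t critical value instead, which is practically identical at this sample size.

```r
# Coefficients on train and their standard errors (in $1000s), copied from
# columns (1)-(3) of the regression table.
est <- c("Strategy 1" = 0.690, "Strategy 2" = 0.115, "Strategy 3" = 2.327)
se  <- c("Strategy 1" = 1.039, "Strategy 2" = 1.007, "Strategy 3" = 0.814)

# Normal-approximation critical value for a 95% interval (an assumption;
# confint() on an lm object would use the t distribution).
crit <- qnorm(0.975)

ci <- cbind("CI-lower" = est - crit * se, "CI-upper" = est + crit * se)
print(round(ci, 2))
```

An interval that excludes zero corresponds to a coefficient that is statistically significant at the 5% level, which is why only Strategy 3 is described as significant.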

All in all, based on these results, if only the observational data were available, I would interpret it as saying that, while there may be a relatively small positive effect of job training on earnings, there is not even strong evidence that the job training program has any effect on earnings.

Part 2: Estimates from Experimental Data

For this part, we are going to estimate the causal effect of participating in job training using the experimental data. This should directly deliver a credible estimate of the effect of job training.

jtrain_exp <- read_dta("jtrain_experimental.dta")
exp_reg <- lm(re78 ~ train, data=jtrain_exp)
modelsummary(exp_reg, gof_map=c("nobs"))
(Intercept)   4.555
train         1.794
Num.Obs.      445

Interestingly, the estimate here is that the job training program increased earnings by $1800 on average. This is actually closer to our estimate from Strategy 3 than the preferred estimates from Strategy 1 or Strategy 2. That said, the estimate from the experimental data falls within the 95% confidence interval for all of Strategies 1-3.
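The comparison with the observational results can be made explicit with a few lines of arithmetic. The sketch below hard-codes the numbers reported in this document and treats the experimental estimate as a fixed benchmark, which ignores its own sampling error (no standard error is reported for it here). The data frame and column names are only for illustration.

```r
# Experimental estimate and observational results (in $1000s), copied from
# the tables in this document.
exp_est <- 1.794
obs <- data.frame(
  strategy = c("Strategy 1", "Strategy 2", "Strategy 3"),
  estimate = c(0.690, 0.115, 2.327),
  ci_lower = c(-1.35, -1.86, 0.73),
  ci_upper = c(2.73, 2.09, 3.92)
)

# Does each 95% interval contain the experimental benchmark?
obs$covers_exp <- obs$ci_lower <= exp_est & exp_est <= obs$ci_upper

# How far is each point estimate from the benchmark?
obs$abs_gap <- abs(obs$estimate - exp_est)

print(obs)
```

Coverage is a weak test when intervals are wide, so the gap between point estimates is worth reading alongside it.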


Relative to the first project, where predicting house prices seemed to go very well, the results from this project were a bit lacking. Based on the experimental data, it seems that the job training program did have a positive (not huge, but certainly economically meaningful) effect on participants’ earnings. However, the two approaches that I thought would work best for estimating the causal effect produced smaller estimates that were not statistically significant. The experimental estimate did fall within the 95% confidence interval of both of the non-experimental estimators that I thought would perform well. In some sense, this suggests that I did ok: not great, but not terrible either, and certainly way, way better than the naive comparison of mean earnings between job training participants and non-participants. Being within the 95% confidence interval also makes it hard to determine whether my theory (i.e., estimation strategy) is driving the problem or whether the estimates are just kind of noisy and we got a little unlucky.