Thursday, December 12, 2013

Update on strata and heterogeneous assignment

In my last post I was dealing with an issue where the assignment to treatment differed across strata. Duflo et al. in the Handbook note that an OLS regression with indicators for the strata is comparable to averaging the difference between treatment and control across strata, where the weights are the probability of being in a particular stratum conditional on treatment.

The notation on pg 3935 in the handbook is quite sloppy, and so is the statement:

"In general, controlling for variables that have a large effect on the outcome can help
reduce standard errors of the estimates and thus the sample size needed. This is a reason
why baseline surveys can greatly reduce sample size requirement when the outcome
variables are persistent."

This statement seems to imply that they're suggesting this regression is correct:


No. The correct regression is:


where each dummy is interacted with the treatment. The ATE is now the sum of the coefficients on the dummies interacted with the treatment. 

Also, Macartan Humphreys has a well-written paper on heterogeneous effects.

Thursday, December 5, 2013

Strata Weights from Duflo, Glennerster, Kremer 2008


In **USING RANDOMIZATION IN DEVELOPMENT ECONOMICS RESEARCH: A TOOLKIT**, the authors discuss how to weight an average treatment effect when the probability of treatment varies across strata.

I had such an issue, and so looked to this page for help. But the information provided on this page does not seem to be correct. Essentially, the authors say that weighting the treatment effect (namely, the average difference between treated and untreated) in each stratum by the probability of being in that stratum, conditional on treatment, is equivalent to running a regression of the outcome on the treatment dummy controlling for strata dummies (and all interactions between strata dummies, if the sample was stratified along more than one dimension, say city and gender).

This is likely not the case, as the variance-covariance matrix from a weighted regression will not be the same as the one from a regression with additional dummies.

The ATE is measured by:
$E(Y^T \mid X, T) - E(Y^C \mid X, C)$, and we are interested in the overall effect.

When the data are stratified, Duflo et al. suggest the following:
$$E_X\left[E(Y^T \mid X, T) - E(Y^C \mid X, C)\right] = \sum_x \left(E(Y^T \mid X = x, T) - E(Y^C \mid X = x, C)\right) P(X = x \mid T)$$
(duflo_toolkit.pdf, pg 3935), where, using data,

  • Y–>outcome
  • X–>strata
  • T–>treatment==1 and C–> treatment==0

The authors use a continuous stratum variable in their equation (an integral sign). I’m not sure why, as strata are usually discrete. The authors leave it at the above, but I’ll walk you through what those weights mean first.

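So, what is $P(X=x \mid T)$? It would be great if the authors spelled this out. By Bayes' rule,

$$P(X=x \mid T) = \frac{P(T \mid X=x)\,P(X=x)}{P(T)},$$

where $P(T) = \sum_x P(T \mid X=x)\,P(X=x)$.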

An example of computing these weights follows with my own data.

An Example

Here is a summary of observations from an experiment. You can see from this simple table that the probability of being treated in each stratum is very different. In stratum 1, 107 individuals were treated and 648 were not (755 in total), whereas in stratum 2, 98 were treated and 1,075 were not (1,173 in total). So we need to correct for this if we’d like an average effect of the treatment variable.


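So, $P(X=x \mid T) = \frac{P(T \mid X=x)\,P(X=x)}{P(T)}$, where, for example, the weight in stratum 1, taking $P(X=x) = \frac{1}{4}$ for each of the four strata, is:

$$\frac{\frac{107}{755}\cdot\frac{1}{4}}{\frac{1}{4}\left(\frac{107}{755}+\frac{98}{1173}+\frac{99}{1650}+\frac{98}{2119}\right)} = 0.427495568$$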
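As a rough check on the arithmetic, here is a minimal sketch of the Bayes' rule weights in Python. The treated counts and stratum sizes are the ones that appear in the weight formula above. The helper name `strata_weights` and the second choice of $P(X=x)$ (sample shares $N_x/N$ instead of equal shares) are my own assumptions, included only to show how sensitive the weight is to that choice.

```python
from fractions import Fraction

# Treated counts and stratum sizes as they appear in the weight formula
# (107/755, 98/1173, 99/1650, 98/2119).
treated = [107, 98, 99, 98]
size = [755, 1173, 1650, 2119]


def strata_weights(treated, size, p_x):
    """Bayes' rule: P(X=x|T) = P(T|X=x) P(X=x) / P(T). Hypothetical helper."""
    p_t_given_x = [Fraction(t, n) for t, n in zip(treated, size)]
    joint = [ptx * px for ptx, px in zip(p_t_given_x, p_x)]
    p_t = sum(joint)  # P(T) = sum_x P(T|X=x) P(X=x)
    return [j / p_t for j in joint]


# Assumption 1: equal stratum shares, P(X=x) = 1/4, as in the formula above.
equal = strata_weights(treated, size, [Fraction(1, 4)] * 4)

# Assumption 2 (mine, not the post's): stratum shares from the sample, N_x / N.
n_total = sum(size)
sample = strata_weights(treated, size, [Fraction(n, n_total) for n in size])

print("equal shares: ", [round(float(w), 6) for w in equal])
print("sample shares:", [round(float(w), 6) for w in sample])
```

With sample shares the weight collapses to the share of all treated units found in each stratum (107/402 for stratum 1), so the choice of $P(X=x)$ matters quite a bit for the final weights.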

Handbook’s Suggested Methodologies

Suppose we’d like to estimate the effect of the treatment on an outcome. Our hypothesis is:

H0: The effect of the treatment on outcome is zero.

Duflo et al. give two methods of estimating the treatment effect that correct for the fact that the probability of treatment depends on the stratum:

  1. Run a regression with controls for each stratum

  2. Use weights in the form of: $P(X=x \mid T) = \frac{P(T \mid X=x)\,P(X=x)}{P(T)}$

Both (1) and (2) can be done in a regression form. Given that, how do the OLS estimator (or probit, since my primary outcome variable is binary) and its standard errors (clustered at the strata level) change between (1) and (2)? Is $\hat\beta$ the same? Is the variance-covariance matrix, and therefore inference, the same for (1) and (2)? Probably not.
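To get a feel for whether the point estimates agree, here is a small simulation sketch. It reuses the stratum sizes and treated counts from the example, but the outcomes and the stratum-specific effects in `true_effect` are made up purely for illustration. Method (1) is computed through the Frisch-Waugh-Lovell shortcut (demeaning treatment within each stratum) rather than a full regression routine, and method (2) uses the share of treated units in each stratum as $P(X=x \mid T)$, which is an assumption about how the weights are built.

```python
import random

random.seed(0)

# Stratum sizes and treated counts from the example; outcomes are SIMULATED
# with hypothetical stratum-specific effects (illustration only).
size = [755, 1173, 1650, 2119]
treated = [107, 98, 99, 98]
true_effect = [0.5, 1.0, 1.5, 2.0]  # hypothetical

rows = []  # (stratum, treatment, outcome)
for s, (n, n_t, tau) in enumerate(zip(size, treated, true_effect)):
    for i in range(n):
        t = 1 if i < n_t else 0
        rows.append((s, t, s + tau * t + random.gauss(0, 1)))


def mean(xs):
    return sum(xs) / len(xs)


# Difference in mean outcomes, treated minus control, within each stratum.
diff = []
for s in range(len(size)):
    y_t = [y for (k, t, y) in rows if k == s and t == 1]
    y_c = [y for (k, t, y) in rows if k == s and t == 0]
    diff.append(mean(y_t) - mean(y_c))

# Method (2): weight stratum differences by P(X=x|T), taken here as the
# share of all treated units that fall in stratum x.
w_treated = [n_t / sum(treated) for n_t in treated]
beta_weights = sum(w * d for w, d in zip(w_treated, diff))

# Method (1): OLS of outcome on treatment plus strata dummies. By
# Frisch-Waugh-Lovell, the treatment coefficient is the slope of the outcome
# on treatment after demeaning treatment within each stratum.
p = [n_t / n for n_t, n in zip(treated, size)]
num = sum((t - p[s]) * y for (s, t, y) in rows)
den = sum((t - p[s]) ** 2 for (s, t, y) in rows)
beta_dummies = num / den

# The same coefficient rewritten as a weighted average of the stratum
# differences, with weights proportional to N_s * p_s * (1 - p_s).
v = [n * ps * (1 - ps) for n, ps in zip(size, p)]
beta_implied = sum(vi * d for vi, d in zip(v, diff)) / sum(v)

print("weights (2):      ", beta_weights)
print("strata dummies (1):", beta_dummies)
print("implied weighting: ", beta_implied)
```

In this sketch the dummy-variable regression matches the $N_s p_s (1 - p_s)$ weighting, and it differs from the $P(X=x \mid T)$ weighting whenever the simulated effects differ across strata. It says nothing about the standard errors, which would need a separate comparison.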

General OLS estimators:

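Let $\beta = (X'X)^{-1}(X'Y)$ be the effect of treatment on the outcome, and let

$sd(\beta) = \Sigma_s (X_s'X_s)^{-1}(X_s'\Omega_s X_s)(X_s'X_s)^{-1}$, where $s$ denotes strata. Note that $X_1$ is 755x1, while $X_2$ is 1173x1, etc.

$\Omega_s = E(\epsilon_s \epsilon_s')$, for stratum $s$ with $N_s$ individuals and vector of error terms $\epsilon_s$, is an $N_s \times N_s$ matrix.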


1. If we use controls, X=[treatment, strata1, strata2, strata3, strata4], Y=[outcome] (I guess we can have a dummy for each stratum and suppress the intercept, or take one stratum dummy out). β and sd(β) are as given above.

2. If we use weights, then X=[treatment]’, Y=[outcome]’, and W=[vector of strata weights]’. For example, $w_1 = P(X=1 \mid T) = .42$, and it is constant within stratum 1.

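$\beta = (X'WX)^{-1}(X'WY)$, and

$sd(\beta) = (X'WX)^{-1}(X'W\Omega WX)(X'WX)^{-1}$, where $W$ enters as the diagonal matrix of the strata weights and

$\Omega_s = E(\epsilon_s \epsilon_s')$, for stratum $s$ with $N_s$ individuals and a vector of error terms $\epsilon_s$.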

Perhaps (1) and (2) could produce equivalent results for β if the added controls in (1) create a weighted sum of the x’s for each stratum that is equivalent to the weights.
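One way to make this conjecture concrete is a short derivation, sketched under the assumption that (1) is a regression of the outcome on the treatment dummy and a full set of strata dummies with a single treatment coefficient. The symbols $p_s$ (share treated in stratum $s$), $N_s$ (stratum size), and $\hat\Delta_s$ (difference in mean outcomes between treated and control in stratum $s$) are my notation, not the Handbook's. Partialling out the strata dummies demeans treatment within each stratum, so

$$
\hat\beta_{(1)} = \frac{\sum_i (T_i - p_{s(i)})\,Y_i}{\sum_i (T_i - p_{s(i)})^2}
= \frac{\sum_s N_s\,p_s\,(1-p_s)\,\hat\Delta_s}{\sum_s N_s\,p_s\,(1-p_s)},
\qquad
\hat\beta_{(2)} = \sum_s P(X=s \mid T)\,\hat\Delta_s .
$$

If $P(X=s \mid T)$ is taken to be the share of treated units in stratum $s$, that is $N_s p_s / \sum_r N_r p_r$, the two sets of weights differ by the factor $(1-p_s)$. Under these assumptions the estimates coincide when $p_s$ is the same in every stratum or when $\hat\Delta_s$ does not vary across strata, and otherwise they generally differ.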

In terms of the variance of β, $\Omega_s$ under (1) should not be the same as $\Omega_s$ under (2), given that the error terms come from different linear models in (1) and (2).

More General Issues on Heterogeneous Treatment Effects

Stratifying a random sample can often be used to get a representative sample within strata, allowing the researcher to look at the treatment effect by stratum. Without stratification, if we subset the data to a certain city, for example, we can’t be sure that our sample within that city is representative of the population.

From a Bayesian perspective, this would be equivalent to saying that our prior distribution of the data is incomplete–we’re missing a whole matrix of individuals within that city who would respond differently to the treatment than the ones we happened to pick up in our sample that was not stratified by city.

In fact, some great Bayesians, like Andrew Gelman, have written about this issue from a Bayesian perspective.

There are several other methodologies out there for heterogeneous treatment effects that use machine learning methods rather than the chop, dice, and data-mine methods:

Imai’s work at Princeton essentially says, rather than look at all the interaction effects between treatments and strata to decide on which treatment is best for which strata, let’s reduce the problem to just a few treatments and strata a priori. The algorithm is here, but it’s essentially an optimization problem with an added constraint that dampens the effect of some interaction effects (hopefully, I’m getting this right):

Yet one more method that I haven’t delved into is by Grimmer et al.

The traditional go-to method in the sciences of throwing in many interaction effects between treatment(s) and a stratum (or whatever you may be conditioning on) is fraught with two major issues:
1. It expands your parameter space, requiring a much larger sample to maintain the same power as before.
2. Designs don’t exogenously vary both the treatment AND the strata conditioned on, so the interaction cannot necessarily be interpreted as causal.
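To put a rough number on the first issue, here is a back-of-the-envelope sketch using the counts from the example earlier in this post. It assumes a simple difference in means with a common outcome standard deviation (`sigma`, set to 1 as a placeholder), and the pooled comparison ignores the strata entirely, so treat it as an illustration rather than a power calculation.

```python
from math import sqrt

# Treated counts and stratum sizes from the example in this post.
treated = [107, 98, 99, 98]
size = [755, 1173, 1650, 2119]


def se_diff(n_t, n_c, sigma=1.0):
    """Standard error of a difference in means, common outcome s.d. sigma."""
    return sigma * sqrt(1 / n_t + 1 / n_c)


# Pooled comparison: all treated units against all control units.
se_pooled = se_diff(sum(treated), sum(size) - sum(treated))

# Stratum-by-stratum comparisons, as a fully interacted model would estimate.
se_strata = [se_diff(t, n - t) for t, n in zip(treated, size)]

for s, se in enumerate(se_strata, start=1):
    print(f"stratum {s}: SE is {se / se_pooled:.2f}x the pooled SE")
```

With four strata of roughly 100 treated units each, every stratum-specific standard error comes out at about twice the pooled one, which is the sense in which interactions eat into power.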

It’s time to consider new methods, I think.