1) Your outcome variable may have different variances across treatment arms, so your sample sizes may need to differ across arms if you're trying to get the smallest possible sample size for a given power and significance level (see http://www.nber.org/papers/w15701.pdf?new_window=1, pg 8, equation 7, for the derivation of different N across treatment groups). I wrote the authors to ask whether they had code to accompany this derivation, and whether they had thought about extending it to more than 2 groups, and they said no.

2) Multiple testing increases your type I error rate. See pg 5 of these slides. Basically, 5% significance means we have a 1-in-20 chance of rejecting the null when it is true. So if we run 20 hypothesis tests on our data, we should expect about 1 of them to show up as significant even when every null is true.
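To see the size of this effect, here is a quick illustrative sketch (not from the slides, just the standard independent-tests calculation): with 20 independent tests at the 5% level, the chance of at least one false rejection is about 64%, not 5%.

```python
import numpy as np

rng = np.random.default_rng(0)

alpha = 0.05
m = 20           # number of independent hypothesis tests
n_sims = 10_000

# Analytic familywise error rate for m independent tests at level alpha
fwer = 1 - (1 - alpha) ** m
print(f"P(at least one false rejection in {m} tests) = {fwer:.3f}")

# Monte Carlo check: under the null, p-values are uniform on [0, 1];
# count how often at least one of the m p-values falls below alpha.
p = rng.uniform(size=(n_sims, m))
mc_rate = (p < alpha).any(axis=1).mean()
print(f"Monte Carlo estimate = {mc_rate:.3f}")
```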

So how do we deal with both these issues?

We could just compute the power calcs separately for each pair of Treatment and Control, i.e. 3 pairs:

T1 and C

T2 and C

T3 and C

But that wouldn't deal with the multiple testing issue, since what we want to test is:

H1: Y_T1 != Y_C

H2: Y_T2 != Y_C

H3: Y_T3 != Y_C

We could instead run a single omnibus test:

H0: Y_T1 = Y_T2 = Y_T3 = Y_C (against the alternative that at least one group differs, e.g. an ANOVA F-test)

and avoid the multiple testing issue, but then that doesn't tell us which treatment is effective.
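One standard compromise (my addition, not something from the sources above) is to keep the three pairwise tests but apply a multiple-comparison correction such as Bonferroni, which runs each test at level alpha divided by the number of tests. A sketch, using the hypothetical distributions from the example further down and Welch's t-test (which allows unequal variances), and treating the second parameter of N(m, v) as the variance, which is an assumption on my part:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

alpha, n = 0.05, 200

# Hypothetical data under the null (all means equal), unequal variances
y_c = rng.normal(10, np.sqrt(10), n)
arms = {"T1": rng.normal(10, np.sqrt(5), n),
        "T2": rng.normal(10, np.sqrt(20), n),
        "T3": rng.normal(10, np.sqrt(100), n)}

# Bonferroni: run each treatment-vs-control test at level alpha / (number
# of tests); this caps the familywise error rate at alpha.
adj_alpha = alpha / len(arms)
pvals = {name: stats.ttest_ind(y_t, y_c, equal_var=False).pvalue
         for name, y_t in arms.items()}
for name, p in pvals.items():
    print(name, "reject" if p < adj_alpha else "fail to reject")
```

Bonferroni is conservative (it lowers power), which makes the sample-size question in point 1 bite even harder.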

The Columbia Statistics Department told me that statisticians normally handle this kind of problem by simulation. The simulation would go something like this:

Step 1:

Generate data according to your hypothesized distributions of the outcome variable for which you're interested in detecting a change. This is where you can impose your different variances across groups:

e.g. Y_C ~ N(10, 10)

e.g. Y_T1 ~ N(10, 5)

e.g. Y_T2 ~ N(10, 20)

e.g. Y_T3 ~ N(10, 100)

Generate the data, starting with an equal number of observations in each group, say 200:

     C   | T1  | T2  | T3
     1   |  2  |  3  |  4
     .   |  .  |  .  |  .
     .   |  .  |  .  |  .
N =  200 | 200 | 200 | 200

Now test:

H1: Y_T1 != Y_C

H2: Y_T2 != Y_C

H3: Y_T3 != Y_C
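Step 1 can be sketched like this (a minimal illustration, again treating N(m, v) as mean m and variance v, and using Welch's t-test since the variances differ by design):

```python
import numpy as np
from scipy import stats

def one_draw(n=200, rng=None):
    """One simulated dataset (Step 1): equal n per arm, unequal variances.
    Returns the p-value for each treatment-vs-control comparison."""
    rng = rng or np.random.default_rng()
    y = {"C":  rng.normal(10, np.sqrt(10), n),
         "T1": rng.normal(10, np.sqrt(5), n),
         "T2": rng.normal(10, np.sqrt(20), n),
         "T3": rng.normal(10, np.sqrt(100), n)}
    # Test H1, H2, H3: each treatment arm against the control arm.
    # Welch's t-test does not assume equal variances.
    return {arm: stats.ttest_ind(y[arm], y["C"], equal_var=False).pvalue
            for arm in ("T1", "T2", "T3")}

print(one_draw(rng=np.random.default_rng(0)))
```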

Step 2:

Repeat Step 1 about 1,000 times.

Step 3:

Calculate the percentage of times that you reject H1, H2, and H3.

Is that rejection rate greater or less than 5%? (Note: with all group means set to 10, as above, every null is true, so this rejection rate estimates your type I error rate and should come out near 5%. To estimate power, shift each treatment mean away from the control mean by the effect size you want to detect and compute the rejection rate for those non-null hypotheses.)
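Steps 1 through 3 together might look like this (an illustrative sketch; the function and its arguments are my own naming, not from any of the sources above):

```python
import numpy as np
from scipy import stats

def rejection_rates(means, variances, n_per_arm, n_sims=1000, alpha=0.05, seed=0):
    """Steps 1-3: simulate n_sims datasets and record how often each
    treatment-vs-control Welch test rejects at level alpha.

    means / variances / n_per_arm: dicts keyed by arm name ('C' plus treatments).
    If all means are equal, the rates estimate type I error (~alpha); to
    estimate power, set treatment means to control mean + effect size."""
    rng = np.random.default_rng(seed)
    arms = [a for a in means if a != "C"]
    rejects = {a: 0 for a in arms}
    for _ in range(n_sims):
        y = {a: rng.normal(means[a], np.sqrt(variances[a]), n_per_arm[a])
             for a in means}
        for a in arms:
            p = stats.ttest_ind(y[a], y["C"], equal_var=False).pvalue
            rejects[a] += (p < alpha)
    return {a: rejects[a] / n_sims for a in arms}

means     = {"C": 10, "T1": 10, "T2": 10, "T3": 10}   # all nulls true
variances = {"C": 10, "T1": 5, "T2": 20, "T3": 100}
n_per_arm = {a: 200 for a in means}
rates = rejection_rates(means, variances, n_per_arm)
print(rates)
```

Raising a treatment mean above 10 (say T1 to 11) and re-running gives the simulated power for that comparison instead.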

Step 4:

Change up the split of N across the groups, raising N_i for a group with higher variance and lowering it for a group with lower variance, while keeping the total fixed.

How you'd increment the N for each group is unknown to me. It seems like you could loop through an endless number of possibilities, increasing and decreasing the N's in various combinations.

Anyone have ideas???
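One heuristic to narrow the search, rather than loop over everything: if I'm reading the two-group derivation cited above correctly, it implies allocating in proportion to the standard deviations (n_T / n_C = sigma_T / sigma_C). Extending that same proportional-to-SD rule to several arms sharing one control is only a heuristic, not a derived optimum, but it gives a sensible starting allocation to feed back into Step 1:

```python
import numpy as np

def sd_proportional_split(variances, total_n):
    """Heuristic for Step 4: split a fixed total sample across arms in
    proportion to each arm's standard deviation. Generalizes the two-group
    rule n_T / n_C = sigma_T / sigma_C; with multiple treatments sharing
    one control this is a starting point, not a proven optimum."""
    sds = {a: np.sqrt(v) for a, v in variances.items()}
    total_sd = sum(sds.values())
    return {a: int(round(total_n * s / total_sd)) for a, s in sds.items()}

variances = {"C": 10, "T1": 5, "T2": 20, "T3": 100}
print(sd_proportional_split(variances, total_n=800))
```

Note how the high-variance arm (T3) absorbs about half the total sample under this rule, while the low-variance arm (T1) shrinks well below the equal split of 200.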

In general, it would be nice to code this up for all the practitioners out there. My guess is that researchers, aside from the statisticians at Columbia, don't do anything this complicated. I can tell you that a pretty well-known economist told me yesterday that he doesn't even compute power. He just looks at how much money he has for participation fees, and that sets his N!

By the way, some packages out there for vanilla power calcs are pwr in R and sampsi/sampclus in Stata.

http://www.statmethods.net/stats/power.html

http://economics.ozier.com/owen/slides/ozier_powercalc_talk_20100914a.pdf
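For a sanity check against those packages, the textbook normal-approximation formula for a two-sided, two-sample comparison of means (equal variances, equal group sizes) is easy to compute directly; the function name here is my own:

```python
from scipy.stats import norm

def n_per_group(effect_size, alpha=0.05, power=0.80):
    """Normal-approximation sample size per group for a two-sided,
    two-sample test of means with equal variances, where effect_size
    is the standardized difference (delta / sigma)."""
    z_a = norm.ppf(1 - alpha / 2)   # critical value, two-sided test
    z_b = norm.ppf(power)           # quantile for the target power
    return 2 * (z_a + z_b) ** 2 / effect_size ** 2

print(n_per_group(0.5))  # roughly 63 per group for a medium effect
```

R's pwr.t.test reports a slightly larger number for the same inputs because it uses the t distribution rather than the normal approximation.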