Monday, July 11, 2016

Power Calcs; R versus Stata



I found this nice post with R code for more complicated power calculations done via simulation - which is nice, because it's not a black box.

http://egap.org/methods-guides/10-things-you-need-know-about-statistical-power

They provide 3 stages of power calculations:

  1. Two sided power calculation - no clustering, covariates, or multiple treatment arms
  2. Two sided power calculation with covariates - no clustering, or multiple treatment arms
  3. Two sided power calculation with multiple treatment arms 

  1. Two sided power calculation - no clustering, covariates, or multiple treatment arms
First I compare (1) to Stata, and they report the same sample size: N = 500, split between T1 and Control, assuming the SD for both samples is the same (which is generally an erroneous assumption - you expect your treatment group to have greater variation in outcomes. But so it is.)

R
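The post's R code isn't reproduced here, so here is a minimal sketch of the simulation idea (my own illustration, not the EGAP code): draw control and treatment outcomes under the assumed means (60 and 65) and SD (20), run a t-test, and count how often p < 0.05 across many replicates. The grid of candidate N's and the number of simulations are arbitrary choices.

# Simulation-based power: means 60 vs 65, common SD 20, as in the sampsi call below
possible.ns <- seq(100, 1000, by = 50)            # total sample sizes to try
powers <- sapply(possible.ns, function(N) {
  rejects <- replicate(500, {
    y0 <- rnorm(N/2, mean = 60, sd = 20)          # control
    y1 <- rnorm(N/2, mean = 65, sd = 20)          # treatment
    t.test(y1, y0)$p.value < 0.05
  })
  mean(rejects)
})
possible.ns[which(powers >= 0.8)[1]]              # smallest N on the grid reaching 80% power (~500)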


Stata 12.1 ( sampsi 60 65, alpha(.05) power(.8) sd(20))


2. Two sided power calculation with covariates - no clustering, or multiple treatment arms

Then I look at their calcs with covariates. The simulated outcome now depends on covariates, which makes the raw data noisier. If you don't account for the reduction in the SD of your estimates that adjusting for those covariates buys you later down the road, then you're overstating N. See the difference between sample_size = 1500 and sample_size.cov = 500. (I didn't do this in Stata. I think power in Stata 13 lets you add in covariates, and I'm not bothering to write an ado-file for Stata to do this.)
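As a rough illustration of why the covariate-adjusted N is smaller (again my own sketch, not the EGAP code; the covariate, effect size, and SDs are made-up numbers): compare the power of a plain t-test with the power of a regression that adjusts for a prognostic covariate, at the same N.

# Power with vs. without adjusting for a covariate that explains much of the outcome
sim.power <- function(N, sims = 500) {
  res <- replicate(sims, {
    x     <- rnorm(N, 50, 15)                      # prognostic covariate
    treat <- rbinom(N, 1, 0.5)
    y     <- 0.8 * x + 5 * treat + rnorm(N, 0, 10)
    p.raw <- t.test(y[treat == 1], y[treat == 0])$p.value
    p.cov <- summary(lm(y ~ treat + x))$coefficients["treat", 4]
    c(unadjusted = p.raw < 0.05, adjusted = p.cov < 0.05)
  })
  rowMeans(res)                                    # share of replicates rejecting, i.e. power
}
sim.power(500)                                     # adjusted power is much higher at the same N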



3. Two sided power calculation with multiple treatment arms


First, let's assume the same effect size (5) for each treatment group. If we consider making only one comparison (a t-test), then N ~ 600.


Stata 12.1 (sampsi 60 65, alpha(.05) power(.8) sd(20) ratio(3))

And that's what Stata's sampsi command gives too, N = 600, if we assume the same effect size for each treatment and no adjustment to the needed p-value for multiple testing.


In the R simulations, the runs that correct the p-values for multiple testing show that we don't achieve 80% power with a sample of 2,000 or less (possible.ns[which(power.fullranking > 0.8)[1]] returns NA, and you can see the series in blue is flat).
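A minimal sketch of that kind of correction (mine, not the EGAP code, and using "all three adjusted comparisons reject" as the definition of power, which is only one of several reasonable choices): three arms with the same effect (5) against one control, with Bonferroni-adjusted p-values.

# Power for three treatment arms vs. control, with a Bonferroni correction
power.multi <- function(n.per.arm, sims = 500) {
  hits <- replicate(sims, {
    y.c  <- rnorm(n.per.arm, 60, 20)
    y.ts <- lapply(1:3, function(k) rnorm(n.per.arm, 65, 20))   # same effect (5) in each arm
    p    <- sapply(y.ts, function(y) t.test(y, y.c)$p.value)
    all(p.adjust(p, method = "bonferroni") < 0.05)               # require all three to reject
  })
  mean(hits)
}
power.multi(150)                                                 # e.g. 150 per arm, 600 total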

4. The final step that they don't do is to also adjust for clustering. So I add this in.
I'm working on adding this to the R code and rewriting it in Python... tbc.
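In the meantime, here is a rough sketch (not the finished code referred to above) of how clustering could enter the simulation: assign treatment at the cluster level, add a common within-cluster shock, and analyze with a t-test on cluster means. The cluster counts, sizes, and shock SD are all assumptions.

# Cluster-randomized version, analyzed crudely as a t-test on cluster means
power.cluster <- function(n.clusters, cluster.size, sims = 500) {
  hits <- replicate(sims, {
    treat.cl <- rbinom(n.clusters, 1, 0.5)                 # treatment assigned by cluster
    shock    <- rnorm(n.clusters, 0, 10)                   # common shock shared within a cluster
    y.bar <- sapply(1:n.clusters, function(j)
      mean(60 + 5 * treat.cl[j] + shock[j] + rnorm(cluster.size, 0, 20)))
    t.test(y.bar[treat.cl == 1], y.bar[treat.cl == 0])$p.value < 0.05
  })
  mean(hits)
}
power.cluster(n.clusters = 40, cluster.size = 25)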




Monday, April 11, 2016

Looping over numbered vars in Stata vs Python

If you want to loop over a list of variables whose names contain an incrementing number, you can do:

Stata

foreach i of numlist 1(1)34 {
      summ var`i'
}

Python. 


import numpy as np

for i in range(1, 35):   # range() excludes the stop value, so 35 covers x1 through x34
    print(np.mean(eval("x" + str(i))))

Wednesday, January 13, 2016

Margins Command in Stata

This walks through an example to help explain the margins command in Stata for nonlinear regressions.

I use the psi example (https://www3.nd.edu/~rwilliam/stats3/Margins02.pdf). I was getting confused between the 0.psi and 1.psi in the header and what the coefficients under Margin in the table were.

(A)
use http://www3.nd.edu/~rwilliam/statafiles/glm-logit.dta, clear
logit grade gpa tuce i.psi, nolog
margins psi,  atmeans

In the header, 1.psi (.4375) is just average psi (like average female, if the dummy were female) - i.e. the percentage of observations with psi==1 - and 0.psi is 1 - 1.psi. You can just type summ psi to get .4375.


(B)
In the table, the psi==0 and psi==1 rows are the predicted probabilities of the outcome when psi is set to 0 or to 1, and the difference (.5632555 - .1067571) is the increase in the probability of going from 0 to 1 (with all other variables held constant at their means), so
this difference is the same thing as the coefficient on 1.psi (.4564984) after running:
logit grade gpa tuce i.psi, nolog
margins, dydx(*) atmeans

Finally, the change from psi==0 to psi==1, and dydx of psi at means, will be close to the marginal effect of that dichotomous variable in a linear regression; mainly the standard errors change.
I think that's what I've figured out from all of this so far. Also, the UCLA site was better for me than the pdf.
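A rough way to see what margins psi, atmeans is doing, sketched in R (the data frame dat and its column names are assumed to match the Stata example; load the data however you like):

# Replicate the logic of -margins psi, atmeans- by hand
# (assumes a data frame `dat` with columns grade, gpa, tuce, psi)
fit <- glm(grade ~ gpa + tuce + psi, data = dat, family = binomial)

at.means <- data.frame(gpa = mean(dat$gpa), tuce = mean(dat$tuce))
p0 <- predict(fit, newdata = transform(at.means, psi = 0), type = "response")   # like 0.psi
p1 <- predict(fit, newdata = transform(at.means, psi = 1), type = "response")   # like 1.psi
p1 - p0   # the discrete change, i.e. what dydx(psi) atmeans reports for a dummy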



***

Person A: 
I am still not clear on the difference between margins, atmeans and margins, dydx atmeans. If I run a multivariate regression (y on x1, x2, and x3) and I want the marginal effect for each x, which command do I use? Do I put my varlist after margins, or inside dydx()?

Person B: 
margins, atmeans predicts the probabilities of your dichotomous outcome variable holding all other variables constant at their means.
margins, dydx() atmeans will calculate the marginal effect of each of your variables at the mean values (so if it's a gnarly function, it will only take the derivative at the center of that function).
You want the second: margins, dydx(Michelle's variables) atmeans

Person A: 
So the first (margins, atmeans) tells us what we could expect from the average person in the sample?

Person B: 
Under treatment vs. control, yes. The difference between those two numbers will equal the dydx estimate.

Thursday, January 15, 2015

Bootstrapping Standard Errors: Methods and AR(1) processes

The famous Cameron/Gelbach paper goes through:

Non-parametric bootstraps
1. Pairs cluster bootstrap-se and bootstrap-t
2. Residual and wild bootstrap
Parametric bootstraps
3. Residual bootstrap

The first method is just to resample your data and re-run your regression/correlation X number of times, and plot the distribution of all X estimates to get a nonparametric estimate of the variance. What's key here is that the sampling method mimics whatever issue you're trying to correct for. If it's clustering of standard errors, then you'll want a clustered sampler.
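As a concrete sketch of that clustered resampling idea, in R (the data frame dat, its columns y, x, and cluster, and the regression are all hypothetical):

# Pairs cluster bootstrap of a regression slope
# (assumes a data frame `dat` with columns y, x, and cluster)
cluster.boot <- function(dat, B = 999) {
  ids <- unique(dat$cluster)
  replicate(B, {
    draw <- sample(ids, length(ids), replace = TRUE)                   # resample whole clusters
    boot <- do.call(rbind, lapply(draw, function(g) dat[dat$cluster == g, ]))
    coef(lm(y ~ x, data = boot))["x"]
  })
}
# sd(cluster.boot(dat)) gives the cluster-bootstrapped standard error on x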

The second and third methods (in some sense) simulate "new data" on each run, and then proceed as in (1). Method 2 reassigns the initial estimated residuals (nonparametric), while method 3 randomly samples new residuals from a standard normal (parametric, because it assumes a distribution on the residuals). Nonparametric is preferred to parametric, but isn't always possible (like in your case).

Method (2) is popular for bootstrapping when the # of clusters is small (so you're not just resampling the same data repeatedly, which would give you a crude estimate of the true distribution). 

***Bootstrapping for an AR(1) process*******

Now, in the case of getting SEs with an AR(1) process there are two options: non-parametric and parametric. With non-parametric you'll want to mimic the AR(1) process via the sampling method. With parametric you'll mimic the AR(1) process by literally generating data from an AR(1) process (see slide 23/28: http://ocw.mit.edu/courses/sloan-school-of-management/15-450-analytics-of-finance-fall-2010/lecture-notes/MIT15_450F10_lec09.pdf).

I haven't come across time-series sampling akin to clustered sampling, so to bootstrap the SEs of an AR(1) process I don't see any possibility other than a parametric bootstrap.
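A minimal sketch of that parametric route, assuming a simple AR(1) in some observed series y (the series, the number of replicates, and the use of arima() are my choices, not from the slides): estimate rho and the innovation SD, simulate new AR(1) series from those estimates, and re-estimate rho on each one.

# Parametric bootstrap of the AR(1) coefficient's standard error
# (assumes a numeric time series `y`)
fit <- arima(y, order = c(1, 0, 0))
rho <- coef(fit)["ar1"]
sig <- sqrt(fit$sigma2)

boot.rho <- replicate(999, {
  y.star <- arima.sim(model = list(ar = rho), n = length(y), sd = sig)   # new AR(1) series
  coef(arima(y.star, order = c(1, 0, 0)))["ar1"]
})
sd(boot.rho)   # bootstrapped standard error of the AR(1) coefficient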

Attached is Stata code for a basic nonparametric X,Y pairwise se-bootstrap. You can find R code here: 
http://www.r-bloggers.com/the-cluster-bootstrap/. My particular code doesn't really help your specific case, but gives an example. 

Some other links I checked out to understand this better:
http://www.stata.com/support/faqs/statistics/bootstrap-with-panel-data/
http://stat.duke.edu/~ab216/report.pdf
http://www.jstor.org/discover/10.2307/1392427?uid=3738224&uid=2&uid=4&sid=21104335031653
http://www.stat.umn.edu/geyer/5102/slides/s8.pdf
http://www.stat.umn.edu/geyer/5601/examp/parm.html
http://magician.ucsd.edu/essentials/WebBookse76.html
http://www.epixanalytics.com/modelassist/CrystalBall/Model_Assist.htm#Analysing_and_using_data/Bootstrap/Parametric_Bootstrap_example.htm
http://www.stat.cmu.edu/~cshalizi/402/lectures/08-bootstrap/lecture-08.pdf
http://www.economics.uci.edu/~dbrownst/bootmi.pdf

Session Effects and Treatment Effects Collinear in a Between Experiment

In between experiments, we look at the average effect of a treatment for a given session, and compare averages across sessions. Imagine we have two sessions, A & B, and two treatments, T1 and T2:
A is assigned T1
B is assigned T2

Here we can't distinguish whether it was the session or the treatment that caused a change in outcome behavior. What would solve it is having at least 2 sessions for each treatment, for a total of four sessions, A, B, C, D:

A and B are assigned T1
C and D are assigned T2

Now session and treatment are not perfectly collinear. But of course, 4 sessions are costlier than 2.
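As a toy illustration of the original two-session problem (made-up data): with one session per treatment, the treatment dummy is an exact copy of the session dummy, so a regression cannot estimate both.

# Two sessions, one treatment each: session and treatment dummies are identical
n <- 40
session <- factor(rep(c("A", "B"), each = n / 2))   # session A ran T1, session B ran T2
treat   <- as.numeric(session == "B")
y <- 1 + 0.5 * treat + rnorm(n)

coef(lm(y ~ session + treat))   # treat comes back NA: perfectly collinear with the session dummy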

One paper: https://files.nyu.edu/gf35/public/print/Frechette_2011b.pdf
suggests that clustering will solve session effects. A dummy for the session and clustering within the cov-var matrix are not the same thing. The latter will take care of interactions between individuals in a session, but not the average effect of being in session A versus session B. Perhaps session A was very noisy, so everyone was distracted, but no one talked to one another--so clustering wouldn't do much to control for a noisy session.


Responder A: 
Why do you disagree with Session fixed effects?

I have 5 sessions and characteristics are evenly distributed across the sessions. Unless there is something big that I am unaware of (possible) the session effect should not vary by session. That is, any unobservable interaction should not be different from session to session and almost certainly would not be linear. Hence, I am inclined not to use session effects.

I am coming to this idea slowly. In general, I think session effects are a good default since people select into their session. But I am having collinearity problems when I include all my session effects, so that 2 dummies are dropped. Observables should not contribute differently to session effects since they are well distributed, so we are then talking about interactions, which is a whole other can of worms and makes me think the linear session effects are not the way to go. My sample is probably too small to look at this rigorously.

How can I justify this approach to an audience, reader, reviewer?

Responder B: 
My issue with the session effect paper is that it totally confounds within-group correlations (clustering) with session effects (fixed effects). It says that feedback between subjects is a session effect. I see session effects as an average intercept shift in each session (a classic fixed effect).
It also claims that clustering will handle dynamic session effects and heterogeneity in behavior.
****
Regarding your issue--I'll think out loud a bit.
I think session effects should account for anything you may not have controlled for or can't control for--e.g. there's the day of the week, the time, and maybe if there was a baby crying during the session (and you didn't know that) and people couldn't concentrate, etc.
Now, suppose you put in dummies to account for this session intercept. Why would some drop?
1. The session is perfectly collinear with another set of variables--like the treatment.
2. Suppose the session effect (relative to the left out session) is the same across a few sessions. That's feasible. It could be that the baby came to 2 sessions and not the others, and that's the only thing that differed between sessions. In that case, you wouldn't be able to estimate a different intercept for each session.
This must be an issue in cross-country papers that include country dummies? Country dummies must drop out... You might post/look here: http://stats.stackexchange.com/questions


Responder B: 
1. I can't put in session dummies into my regs, i.e. fixed effects for session, because they coincide with my treatments perfectly and drop out.
2. Does it make any sense to have a session variable (or similarly a city variable) rather than a session or city fixed effect?
namely, rather than having a dummy for each city or each session, one controls for city in a city variable that is coded as 1=NYC, 2=London etc.
What does that variable mean? And does it make any sense? When you first learn regressions you use a city variable, but nowadays you use a dummy for each city.
Do you ever use one city, or country control rather than a dummy for each country or city?


Responder C:
Here is my take:
If you use an ordering of 1,2,3,4,5 for a region, it has to make sense, because 5 means a higher magnitude than 4, etc. Therefore there are only a few situations where you can go for an ordering instead of fixed effects, and a time trend is one such case. For regions, I'm afraid you have to go for fixed effects. The one alternate solution is to use a variable that captures something by region (so is region specific) and use that instead of region fixed effects. If this variable comes close to approximating your region-specific unobservable, you may be able to get away with it. And you can then say you control for what you think is the biggest issue, but you can't use region fixed effects as they kill your main variable.
That's my 2 cents.

Saturday, March 29, 2014

Power Calculations and Multiple Treatment Arms

There are two issues at play with more than 2 treatment arms:
1) Your outcome variable may have different variances across treatment arms, so your sample sizes may vary across treatment arms if you're trying to get the smallest possible sample size for a given power and significance (see http://www.nber.org/papers/w15701.pdf?new_window=1, pg 8 equation 7 for the derivation of different N across treatment groups). I wrote the authors to see if they had code to accompany this derivation, and whether they've thought about this derivation for more than 2 groups, and they said no.

2) Multiple testing increases your type I error rate. See pg 5 of these slides. Basically, 5% significance says we have a 1 in 20 chance of rejecting the null when it is true. So if we make 20 hypothesis tests on our data, at least one of them is likely to show up as significant by chance alone (across 20 independent tests, the chance of at least one false rejection is 1 - 0.95^20 ≈ 64%).

So how do we deal with both these issues?

We could just compute the power calcs separately for each pair of Treatment and Control, i.e. 3 pairs:
T1 and C
T2 and C
T3 and C

But that wouldn't deal with the multiple testing issue, in that we want to test:
H1: Y_T1 !=Y_C
H2: Y_T2 !=Y_C
H3: Y_T3 !=Y_C


We could instead run a single joint test of:
H0: Y_T1 = Y_T2 = Y_T3 = Y_C

and avoid the multiple testing issue, but then that doesn't tell us which treatment is effective.

The Columbia Statistics Department told me that statisticians normally use simulation for this kind of problem. The simulation would go something like this:

Step 1:
Generate data according to your hypothesized distributions of the outcome variable for which you're interested in detecting a change. Here is where you can impose your different variance across groups:

e.g. Y_c ~ N(10,10)
e.g. Y_T1 ~ N(10,5)
e.g. Y_T2 ~ N(10,20)
e.g. Y_T3 ~ N(10,100)

Generate the data, starting with an equal number of observations in each group, say 200.



       C     T1    T2    T3
       1     2     3     4
       .     .     .     .
       .     .     .     .
       .     .     .     .
N =    200   200   200   200


Now test:
H1: Y_T1 !=Y_C
H2: Y_T2 !=Y_C
H3: Y_T3 !=Y_C

Step 2:
Repeat Step 1 about 1,000 times.

Step 3:
Calculate the percentage of times that you reject H1, H2, and H3.
With the example distributions above the means are all equal, so that rejection rate is your type I error rate - is it > or < 5%? Once you build the effect sizes you want to detect into the means, the rejection rate is your power, and you'd compare it to your target (e.g. 80%).
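Here is a minimal sketch of Steps 1-3 in R, under a couple of stated assumptions: the second argument of each N(·,·) above is read as a variance, and the three p-values get a Bonferroni correction to address issue (2).

# Steps 1-3: generate data with the hypothesized variances, test the three hypotheses, repeat
sims <- 1000
n    <- c(C = 200, T1 = 200, T2 = 200, T3 = 200)          # Step 4 would change this split

reject <- replicate(sims, {
  y.c  <- rnorm(n["C"],  10, sqrt(10))
  y.t1 <- rnorm(n["T1"], 10, sqrt(5))
  y.t2 <- rnorm(n["T2"], 10, sqrt(20))
  y.t3 <- rnorm(n["T3"], 10, sqrt(100))
  p <- c(t.test(y.t1, y.c)$p.value,
         t.test(y.t2, y.c)$p.value,
         t.test(y.t3, y.c)$p.value)
  p.adjust(p, method = "bonferroni") < 0.05               # H1, H2, H3 after the correction
})
rowMeans(reject)   # rejection rate for each hypothesis across the 1,000 draws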

Step 4:
Change up the split of N across each treatment group, raising N_i for the group that has higher variance and lowering it for the group that has lower variance.

How you'd increment the N for each group is unknown to me; it seems like you could loop through an infinite number of possibilities, increasing and decreasing N in various combinations.

Have any ideas???

In general, it would be nice to code this up for all the practitioners out there. My guess is that researchers, aside from the statisticians at Columbia, don't do something this complicated. I can tell you that a pretty well known economist told me yesterday that he doesn't even compute power. He just looks at how much money he has for participation fees and that sets his N!

By the way, some packages out there to do vanilla power calcs are:
pwr in R and sampsi/sampclus in Stata.

http://www.statmethods.net/stats/power.html
http://economics.ozier.com/owen/slides/ozier_powercalc_talk_20100914a.pdf





Thursday, December 12, 2013

Update on strata and heterogeneous assignment

In my last post I was dealing with an issue where the assignment to treatment differed across strata. Duflo et al. in the Handbook note that an OLS regression with indicators for the strata is comparable to averaging the difference between treatment and control across strata, where the weights are the probability of treatment conditional on being in a particular stratum.

The notation on pg 3935 in the handbook is quite sloppy, and so is the statement:

"In general, controlling for variables that have a large effect on the outcome can help
reduce standard errors of the estimates and thus the sample size needed. This is a reason
why baseline surveys can greatly reduce sample size requirement when the outcome

variables are persistent."

This statement seems to imply that they're suggesting this regression is correct:

y = alpha + beta*dummy_1 + .... + z*dummy_z + gamma*Treatment_dummy + error

No. The correct regression is:

y = alpha + beta*(dummy_1 x Treatment) + .... + z*(dummy_z x Treatment) + error

where each strata dummy is interacted with the treatment. The ATE is then a weighted average of the coefficients on the dummy-treatment interactions, with the strata shares as weights.
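A rough sketch of that regression in R (the data frame dat with columns y, treat, and a factor strata is hypothetical, and I include stratum intercepts alongside the interactions):

# Strata-by-treatment interactions, with the ATE as a share-weighted average
# (assumes a data frame `dat` with columns y, treat, strata)
fit <- lm(y ~ strata + strata:treat, data = dat)          # stratum intercepts + stratum-specific effects

effects <- coef(fit)[grep(":treat", names(coef(fit)))]    # within-stratum treatment effects
shares  <- prop.table(table(dat$strata))                  # strata shares, in the same level order

sum(shares * effects)   # ATE as the weighted average of within-stratum effects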


Also, Macartan Humphreys has a well-written paper on heterogeneous effects: http://www.columbia.edu/~mh2245/papers1/monotonicity7.pdf