Monday, July 11, 2016

Power Calcs; R versus Stata



I found this nice post with R code for more complicated power calculations using simulations, which is nice because it opens up the black box:

http://egap.org/methods-guides/10-things-you-need-know-about-statistical-power

They provide 3 stages of power calculations:

  1. Two sided power calculation - no clustering, covariates, or multiple treatment arms
  2. Two sided power calculation with covariates - no clustering, or multiple treatment arms
  3. Two sided power calculation with multiple treatment arms 

  1. Two sided power calculation - no clustering, covariates, or multiple treatment arms
First I compare (1) to Stata, and both report the same sample size: N = 500, split evenly between treatment and control, assuming the SD is the same in both groups (generally an erroneous assumption: you'd expect your treatment group to have greater variation in outcomes. But so it is.)

R


Stata 12.1 (sampsi 60 65, alpha(.05) power(.8) sd(20))
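As a rough check on both, here's a minimal simulation-based version of stage (1) in Python. This is my own sketch, not the egap code: draw both arms from normals with the assumed means (60, 65) and SD (20), run a large-sample two-sided test, and count rejections.

```python
import numpy as np

def sim_power(n_per_arm, mu_c=60, mu_t=65, sd=20, z_crit=1.96,
              sims=2000, seed=1):
    """Estimate two-sided power by simulation. Uses the normal
    approximation to the t critical value, which is fine at these
    sample sizes."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(sims):
        c = rng.normal(mu_c, sd, n_per_arm)  # control arm
        t = rng.normal(mu_t, sd, n_per_arm)  # treatment arm
        se = np.sqrt(c.var(ddof=1) / n_per_arm + t.var(ddof=1) / n_per_arm)
        hits += abs((t.mean() - c.mean()) / se) > z_crit
    return hits / sims

# 251 per arm (N of about 500 total) should give roughly 80% power
print(sim_power(251))
```

At 251 per arm the simulated power hovers around 0.80, matching the analytic N = 500.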


2. Two sided power calculation with covariates - no clustering, or multiple treatment arms

Then I look at their calcs with covariates. Their data-generating process includes covariates that make the raw outcome noisier; a good covariate soaks up that residual variance at analysis time, so if your power calc ignores the reduction in residual SD that covariate adjustment buys you, you overstate the required N. See the difference between sample_size = 1500 (unadjusted) and sample_size.cov = 500 (covariate-adjusted). (I didn't do this in Stata. I think power in Stata 13 lets you add covariates, and I'm not bothering to write an ado-file for this.)
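The same point can be made with a quick Python simulation (my own illustrative DGP, not egap's): generate an outcome that depends on a covariate, then compare power from a raw difference in means against power from OLS that adjusts for the covariate, at the same N.

```python
import numpy as np

def sim_power_cov(n_per_arm, tau=5.0, sims=1000, seed=2):
    """Power with vs. without covariate adjustment (illustrative DGP).

    Outcome: y = 60 + tau*treat + 1.5*x + eps, with x ~ N(0, 10) and
    eps ~ N(0, 10), so the raw SD of y is about 18 but the residual SD
    after adjusting for x is only 10.
    """
    rng = np.random.default_rng(seed)
    z_crit = 1.96
    hits_raw = hits_adj = 0
    n = 2 * n_per_arm
    treat = np.repeat([0.0, 1.0], n_per_arm)
    for _ in range(sims):
        x = rng.normal(0, 10, n)
        y = 60 + tau * treat + 1.5 * x + rng.normal(0, 10, n)
        # unadjusted: difference in means
        diff = y[treat == 1].mean() - y[treat == 0].mean()
        se = np.sqrt(y[treat == 1].var(ddof=1) / n_per_arm
                     + y[treat == 0].var(ddof=1) / n_per_arm)
        hits_raw += abs(diff / se) > z_crit
        # adjusted: OLS of y on treat and x, test the treat coefficient
        X = np.column_stack([np.ones(n), treat, x])
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        resid = y - X @ beta
        sigma2 = resid @ resid / (n - 3)
        cov = sigma2 * np.linalg.inv(X.T @ X)
        hits_adj += abs(beta[1] / np.sqrt(cov[1, 1])) > z_crit
    return hits_raw / sims, hits_adj / sims

power_raw, power_adj = sim_power_cov(150)
print(power_raw, power_adj)
```

At the same sample size, the covariate-adjusted test has much higher power, which is the flip side of the 1500-vs-500 sample-size gap.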



3. Two sided power calculation with multiple treatment arms


First, let's assume the same effect size (5) for each treatment group. If we consider only making one comparison (t-test), then N ~ 600.


Stata 12.1 (sampsi 60 65, alpha(.05) power(.8) sd(20) ratio(3))

And that's what Stata's sampsi command gives too, N = 600, if we assume the same effect size for each treatment and make no adjustment to the required p-value for multiple testing.


In R, the simulations that correct the p-values for multiple testing show that we don't achieve 80% power with a sample of 2,000 or less (possible.ns[which(power.fullranking > 0.8)[1]] returns NA, and you can see that the series in blue is flat).
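The intuition can be checked with a quick Python sketch (mine, not the egap R code): with two treatment arms each compared against control, a Bonferroni correction tests each comparison at alpha/2, which raises the critical value and drags down power at any fixed N.

```python
import numpy as np

def sim_power_at_crit(n_per_arm, z_crit, mu_c=60, mu_t=65, sd=20,
                      sims=2000, seed=3):
    """Share of simulations where one treatment-vs-control comparison
    rejects at the given critical value (normal approximation)."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(sims):
        c = rng.normal(mu_c, sd, n_per_arm)
        t = rng.normal(mu_t, sd, n_per_arm)
        se = np.sqrt(c.var(ddof=1) / n_per_arm + t.var(ddof=1) / n_per_arm)
        hits += abs((t.mean() - c.mean()) / se) > z_crit
    return hits / sims

plain = sim_power_at_crit(251, 1.960)  # unadjusted two-sided alpha = .05
bonf = sim_power_at_crit(251, 2.241)   # Bonferroni for 2 comparisons: alpha/2
print(plain, bonf)
```

Because the same seed is used, the Bonferroni run sees identical data with a stricter threshold, so its power is strictly lower; the sample that sufficed for one comparison no longer hits 80%.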

4. The final step that they don't do is to also adjust for clustering, so I add this in.
I'm working on adding this to the R code and rewriting it in Python... tbc.
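Until that's done, a standard back-of-the-envelope adjustment (textbook approximation, not from the egap post) is to inflate the unclustered N by the design effect, DEFF = 1 + (m - 1) * ICC, where m is the cluster size and ICC is the intracluster correlation:

```python
# Design-effect adjustment for clustering: a standard approximation,
# not the simulation approach from the egap post.
def clustered_n(n_unclustered, cluster_size, icc):
    deff = 1 + (cluster_size - 1) * icc  # design effect
    return n_unclustered * deff

# e.g. the N = 500 from stage (1), clusters of 20, ICC = 0.05
print(clustered_n(500, 20, 0.05))  # roughly doubles the required N (~975)
```

Even a modest ICC nearly doubles the required sample here, which is why skipping the clustering step is dangerous.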




Monday, April 11, 2016

Looping over numbered vars in Stata vs Python

If you want to loop over a list of variables with an incrementing number nested inside the variable name, you can:

Stata

foreach i of numlist 1(1)34 {
      summ var`i'
}

Python (note that range's endpoint is exclusive, so you need range(1, 35) to match Stata's numlist 1(1)34):


import numpy as np

for i in range(1, 35):  # 1 through 34; range excludes the endpoint
    print(np.mean(eval("x" + str(i))))  # assumes arrays x1 ... x34 exist in scope
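eval works but is fragile; a cleaner pattern (my suggestion, variable names hypothetical) is to keep the numbered series in a dict keyed by number instead of 34 separate variables:

```python
import numpy as np

# store numbered series in a dict instead of variables x1, x2, ..., x34
xs = {i: np.arange(i, i + 5) for i in range(1, 35)}  # dummy data

for i in range(1, 35):
    print(i, np.mean(xs[i]))
```

This avoids eval entirely and makes it easy to slice, filter, or rename the set of series later.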

Wednesday, January 13, 2016

Margins Command in Stata

This walks through an example to help explain the margins command in Stata for nonlinear regressions.

I use the psi example (https://www3.nd.edu/~rwilliam/stats3/Margins02.pdf). I was getting confused between the 0.psi and 1.psi in the header and what the coefficients under Margin in the table were.

(A)
use http://www3.nd.edu/~rwilliam/statafiles/glm-logit.dta, clear
logit grade gpa tuce i.psi, nolog
margins psi,  atmeans

1.psi (.4375) is just the average of psi (like the share female for a female dummy): it's the percentage of observations with psi==1, and 0.psi is 1 - 1.psi. You can just type summ psi to get .4375.


(B)
psi==0 and psi==1 give the predicted probabilities of the outcome at psi==0 and at psi==1, and the difference (.5632555 - .1067571) is the increase in probability from going from 0 to 1 (with all other variables held at their means), so
this difference is the same thing as the marginal effect on 1.psi (.4564984) after running:
logit grade gpa tuce i.psi, nolog
margins, dydx(*) atmeans

Finally, the delta from psi==0 to psi==1, and dydx(psi) at means, will be the same as the marginal effect of that dichotomous variable in a linear regression; just the standard errors change.
I think that's what I've figured out from all of this so far. Also, the UCLA site was better for me than the pdf.
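The mechanics can be reproduced by hand. Using hypothetical logit coefficients and covariate means (placeholders for illustration, not the actual estimates from the psi example), the margins at means are just the inverse-logit of the linear index with psi set to 0 and then 1, holding gpa and tuce at their means; the difference is the discrete marginal effect.

```python
import math

def inv_logit(z):
    """Inverse logit: maps the linear index to a probability."""
    return 1 / (1 + math.exp(-z))

# hypothetical coefficients and covariate means, for illustration only
b0, b_gpa, b_tuce, b_psi = -13.0, 2.8, 0.10, 2.4
gpa_mean, tuce_mean = 3.12, 21.94

base = b0 + b_gpa * gpa_mean + b_tuce * tuce_mean
p0 = inv_logit(base)           # the 0.psi row of margins psi, atmeans
p1 = inv_logit(base + b_psi)   # the 1.psi row
ame = p1 - p0                  # the difference, i.e. dydx(psi) atmeans
print(p0, p1, ame)
```

This is exactly the (B) calculation above: two predicted probabilities at the means, and their difference is the marginal effect of the dummy.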



***

Person A: 
I am still not clear on the difference between margins, atmeans and margins, dydx atmeans. If I run a multivariate regression (y on x1, x2, and x3) and I want the marginal effect for each x, which command do I use? Do I put my varlist after margins, or inside dydx()?

Person B: 
margins, atmeans predicts the probabilities of your dichotomous outcome variable holding all other variables constant at their means.
margins, dydx(...) atmeans calculates the marginal effect of each of your variables at the mean values (so if it's a gnarly function, it will only take the derivative at the center of that function).
You want the second: margins, dydx(michelle's variables) atmeans

Person A: 
So the first (margins, atmeans) tells us what we could expect from the average person in the sample?

Person B: 
Under treatment vs control, yes. The difference between those two numbers will equal the dydx estimate.