Monday, June 18, 2012

Why Would We Use GLM Log-Link?

Person A:

Hi! I am finding in my current dataset that my residuals are sufficiently non-normal (using kdensity, pnorm, and swilk in Stata). One Statalister suggests using glm with an appropriate link(). As I recall, you used glm with a log link. Can you tell me why you used that? Have you heard of it being used for non-normal errors? Right now it is either a glm procedure or a non-linear transformation of the dependent variable, to which I have an aversion. Any advice/info you can share is super appreciated.

Person B:

I used glm with a log link because my outcome variable (income) had a lot of zeros, so my data were right-skewed. Applying a log to the outcome would normalize the distribution (i.e. make bigger numbers smaller :))

But the log of zero is undefined. This Nichols pdf (slide 3) explains why a glm with log link gets around that. Short answer: ln(E(y|X)) is feasible even when some y = 0, because only the expected value of y needs to be positive, while E(ln(y)|X) is still undefined for y = 0. Why is it called the log-link?
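A quick numeric sketch of that point, using made-up "income" data with zeros: the mean of the logs blows up, but the log of the mean is perfectly fine.

```python
import math

# toy right-skewed income data containing zeros (hypothetical numbers)
incomes = [0, 0, 0, 5, 10, 20, 40, 200]

# E(ln y) is undefined here: math.log(0) raises an error
try:
    mean_of_logs = sum(math.log(y) for y in incomes) / len(incomes)
except ValueError:
    mean_of_logs = None  # hit ln(0)

# ln(E(y)) is fine: the mean is positive even though some y are zero
mean_income = sum(incomes) / len(incomes)
log_of_mean = math.log(mean_income)

print(mean_of_logs)   # None
print(log_of_mean)
```

This is exactly the loophole the log link exploits: it only ever takes the log of the conditional mean, never of the individual observations.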

Well, if you convert the above into what we're used to:

ln(E(y|x)) = x'b, or equivalently, E(y|x) = exp(x'b)

Literally, we're saying that the log of the expected value of y varies linearly in x'b. Or: the exponential of the linear predictor gives the expected y, and we estimate the parameters by MLE.

The errors, and therefore the variance, are calculated on the original scale of the outcome variable. What I'm not clear on is how the assumption on the distribution of the error terms affects inference. Anyone?

Why would we want to transform the data in this way?
Well, with right-skewed data, for instance, high incomes are unlikely: the data are clumped near zero. So clearly, income is not normally distributed, though under OLS it's assumed that the errors are normal, and so the outcome variable is as well.
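A tiny illustration of that skew, on hypothetical income numbers: with right-skewed data the mean gets dragged well above the median, and taking logs pulls the two back together.

```python
import math
import statistics

# most observations clumped near zero, a few high "incomes" (made-up data)
incomes = [1, 2, 2, 3, 3, 4, 5, 8, 50, 120]

# right skew: mean is far above the median
print(statistics.mean(incomes) > statistics.median(incomes))  # True

# after logging, the gap between mean and median shrinks dramatically
logs = [math.log(y) for y in incomes]
print(statistics.mean(logs) - statistics.median(logs))
```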

Sunday, January 29, 2012

Hausman Test, Small Number of Clusters and Bootstrapped Standard Errors

Person A:
Have you ever run into estimation problems due to finite number of
clusters (M<50) or uneven cluster size (where some clusters make up
more than 5% of the sample)?

A problem that is troubling me currently is that WITHOUT clustering (at the unit level), I cannot
reject exogeneity of unit specific effects and thus would be inclined to use random effects. But once I cluster and rerun the test, I reject exogeneity. SO:

WITHOUT clustering --> cannot reject exogeneity of RE (i.e. can use RE)
WITH clustering --> reject use of RE

I know that clustering is supposed to make the SEs larger, but how would that lead to rejection of exogeneity (using xtoverid after the re, cluster regression), since SE estimates for both the RE and FE estimations would be affected that way? Could this have anything to do with finite clusters or having a few really big clusters (each of the three big ones comprises 8-11% of the sample)?

Person B:
 - Which Stata version?
 - Secondly, you are using xtoverid essentially to choose between RE and FE, correct?
 - Thirdly, I think I read somewhere that you can't use xtoverid after clustering, but there may be a way to use the Hausman test.
 - Fourth, I don't think I have ever had a situation where someone says, "why are you clustering?" Usually they complain if you are NOT clustering. So you may be able to get away with this in a publication sense.

Person A:
- I am using Stata 10, all updated
- I am using xtoverid to test exogeneity of the individual-specific
effect. If they come out as exogenous, I say "okay, the RE assumptions
are met and I can use that."
- I am using xtoverid because the Hausman test does not work with
clustered standard errors; xtoverid does work with clustered errors.
So, yeah, they are two different tests with two different results, but
they are asymptotically equivalent; if you do xtoverid with ordinary
standard errors it is the same as the Hausman test.
- I know I should cluster, so I guess that is not the issue. The issue
is: can I use random effects? I can just say "oh, the test for
exogeneity failed, so I assume the individual effects are correlated
and the RE assumptions are not met. I will go with FE." BUT I really
want to know why, because I worry it means there is some bigger issue
at stake here (like issues with asymptotics due to a finite number of
clusters, or some clusters accounting for more than 5% of the data). If
I have these problems that can mess with the asymptotics, then my
inference can be all wrong.
- SO, I want to know if these things are driving this weird
difference in results when clustering vs. not clustering, and if I need
to be worried in a more general sense, OR if there is something
else going on. Why would the results be different with clustering?

Person B:
Isn't the idea behind clustering that within a cluster there is not much variance, but outside there is, and thus you want to treat each cluster as a unit?

There is a lot of grey area here that I haven't really pondered. My gut intuition says it may have to do with unbalanced panels, so you are right, it's related to the size of the clusters.

Let me restate the problem to see if I understand. 

Problem A
If you cluster, i.e. extract the variation in your explanatory variable into the error term by each cluster,  then you reject RE in a Hausman test. 

Namely, once the within cluster variation is removed from the regressors, then regressors (sans within cluster variation) appear to be correlated with the error term. 

Problem B
You have a small number of clusters, i.e. < 50.
So asymptotic rules at the cluster level (i.e. the betas being approximately Gaussian, or rather t-distributed) don't hold.

Given A&B
So should Problem A even be considered if we face Problem B?

Solution A
Have you tried bootstrapped clustering so as not to rely on asymptotic distribution of your stats?

I have had this problem in the past, and bootstrapping assuaged concerns with my small number of clusters. Asymptotic properties of statistics depend on the CLT and a large sample size. Inference with bootstrapping does not, because repeated resampling (with replacement) essentially re-creates the distribution rather than assuming it.
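The pairs cluster bootstrap idea can be sketched in a few lines: resample whole clusters with replacement, re-estimate the statistic each time, and take the standard deviation across replications as the standard error. The data and the statistic (a simple mean, standing in for a regression coefficient) are made up.

```python
import random
import statistics

# toy data: 8 clusters, each a list of outcomes (hypothetical values)
clusters = [[2.1, 2.3], [1.8, 2.0, 1.9], [3.0], [2.5, 2.6],
            [1.2, 1.4], [2.9, 3.1], [2.2], [1.7, 1.6, 1.8]]

def estimate(data):
    # statistic of interest; in practice this would be a coefficient
    flat = [y for cluster in data for y in cluster]
    return statistics.mean(flat)

random.seed(42)
B = 999
boot = []
for _ in range(B):
    # resample CLUSTERS (not individual observations) with replacement
    draw = [random.choice(clusters) for _ in range(len(clusters))]
    boot.append(estimate(draw))

bs_se = statistics.stdev(boot)  # bootstrapped standard error
print(bs_se)
```

Resampling at the cluster level preserves whatever within-cluster correlation the data have, which is the whole point of clustering in the first place.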

See: Cameron, Gelbach & Miller, REStat, 2008, "Bootstrap-Based Improvements for Inference with Clustered Errors"

They consider G (the number of clusters) small, as small as 4 in some simulations. They use bootstrap methods which, under certain circumstances, can actually yield tighter confidence intervals than analytically "correct" (i.e. asymptotically correct) standard errors.

Solution B
Beginning at the bottom of page 3, it describes two alternatives you could try:

"One approach, suggested by Donald and Lang (2001), is to effectively treat the number of groups as the number of observations, and use finite sample analysis (with individual-specific unobservables becoming unimportant – relative to the cluster effect – as the cluster sizes get large). A second approach is to view the cluster-level covariates as imposing restrictions on cluster-specific intercepts in a set of individual-specific regression models, and then imposing and testing the restrictions using minimum distance estimation."

Person B:
That is a pretty cool response to small clusters (I have not yet faced that problem as I'm usually clustering on states or countries :-p But this could come in handy). One question though: can you still use xtoverid after using bootstrapped clustering?

I believe so, as xtoverid is used after xtreg, which accepts bootstrapping.

You can also bootstrap "by hand":

local B = 1000
matrix bs = J(`B', 1, 0)
forvalues b = 1(1)`B' {
    preserve
    qui {
        * resample whole clusters (with replacement) from the original data
        bsample, cluster(cluster_unit)
        capture drop xb lambda
        * first stage: probit, then the inverse Mills ratio
        probit y x1 x2
        predict xb, xb
        gen lambda = normalden(xb) / normal(xb)
        * second stage: wage equation with the selection-correction term
        reg log_wage edyrs age lambda
        matrix e = e(b)
        matrix bs[`b', 1] = e[1, 1]  // coefficient on edyrs
    }
    restore
}
svmat bs
summ bs1, det
* the bootstrapped SE is the std. dev. of the estimates across replications
local bs_se = r(sd)
di "bootstrapped standard error: `bs_se'"

Person A:

Update -- I figured out the reason my exogeneity test fails for random effects when I allow for arbitrary heteroskedasticity and autocorrelation by clustering errors. Recall that one of the assumptions of Random Effects is that u_it and e_it are both homoskedastic and uncorrelated across t. When I tell Stata to cluster the errors, I am relaxing this assumption. This is not fatal for RE, per se (see Wooldridge's panel data textbook), but it does mean that if my SE estimates change after allowing for het. and a.c., then that assumption was never sufficiently true in the first place and I was underestimating my errors. Underestimating the errors would over-estimate the t-stats in any exogeneity test and lead to over-rejection of the null. A separate problem I was having was that, in addition to this, one of my variables was correlated with the facility effect (which I was modeling with RE) -- when I estimate things without that variable, everything behaves better. The question remains as to what to do from here, but at least I figured out that a) my cluster size should not be an issue (in fact, cluster size is WAY less important than the number of clusters) and b) the estimation method that Stata's cluster option uses does 'reasonably well' with a small number of clusters (G=10). With G=30, I fall between small and enough (safe is considered G=50, but some people say 30 is fine). In any case, cluster size or number does not seem to be my problem. Nonetheless, bootstrapping has been shown to perform better than analytic clustered standard errors, especially with fixed effects (not sure about REs), so that is my next step.

Python Rocks

Attended a great Python workshop led by Hackers DC. Detailed and fast-paced. Learned a ton.
Can't wait to write my own API!!!

Cool things I learned:

  • Join columns with tabs or commas:
columns = []
for header in headers:  # headers: a list of strings scraped earlier
    columns.append(header)
print '\t'.join(columns)
print ','.join(columns)

  • Generic try statement with pass instead of specifying an error:
try:
    bank_name = data[0].text
    print bank_name
except:
    pass

  • Randomize the frequency of URL hits (and fake the user agent) when scraping

Regular Expressions in Stata

Useful for flagging values that will break destring:

tab var if regexm(var, "[^0-9 .]")
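The same character-class check in Python, for comparison: flag any value containing something other than digits, spaces, or decimal points before converting to numeric (the sample values are made up).

```python
import re

values = ["123", "45.6", "1,200", "n/a", " 78 "]

# anything outside [0-9 .] will make a destring-style conversion fail
non_numeric = [v for v in values if re.search(r"[^0-9 .]", v)]
print(non_numeric)  # ['1,200', 'n/a']
```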

Thursday, January 26, 2012


At last night's Data Science meetup in DC, we heard a talk led by Sean Murphy about the Maximal Information Coefficient, recently developed by two brothers.

It's a non-parametric statistic which helps detect meaningful associations between a slew of variables, but without assuming a linear relationship as a correlation matrix or principal components/factor analysis methods do.

It's useful when you're taking a first pass at the data.

Thursday, January 19, 2012

Latest and Greatest of Metrics with Networks

A well-informed friend told me about the latest MIT grad development students working on the metrics behind networks:

Which we all know is majorly hopeless because of the reflection problem and the difficulty of creating exogenous shocks to social networks (except in my dissertation :).

There's some gnarly modeling of individuals' correlations with other individuals in the var-cov matrix (a parametric estimation), as well as a relaxation of assumptions, like independence across individuals' networks in a sample.