**Person A:**

Have you ever run into estimation problems due to a small number of clusters (M<50) or uneven cluster sizes (where some clusters make up more than 5% of the sample)?

A problem that is troubling me currently: WITHOUT clustering (at the unit level), I cannot reject exogeneity of the unit-specific effects and thus would be inclined to use random effects. But once I cluster and rerun the test, I reject exogeneity. So:

WITHOUT clustering --> cannot reject exogeneity of the unit effects (i.e. can use RE)

WITH clustering --> reject exogeneity (i.e. cannot use RE)

I know that clustering usually makes the se's larger, but how would that lead to rejection of exogeneity (using xtoverid after the re, cluster regression), since the se estimates for the RE and FE estimations would both be affected that way? Could this have anything to do with a small number of clusters or with having a few really big clusters (each of the three big ones comprises 8-11% of the sample)?

**Person B:**

- Which Stata version?

- Secondly, you are using xtoverid essentially to choose between RE and FE, correct?

- Thirdly, I think I read somewhere that you can't use xtoverid after clustering, but there may be a way to use the Hausman test.

- Fourth, I don't think I have ever had a situation where someone says, "Why are you clustering?" Usually they complain if you are NOT clustering. So you may be able to get away with this in a publication sense.

**Person A:**

- I am using Stata 10, all updated.

- I am using xtoverid to test exogeneity of the individual-specific effect. If the effects come out as exogenous, I say "okay, the RE assumptions are met and I can use that."

- I am using xtoverid because the Hausman test does not work with clustered standard errors; xtoverid does. So, yeah, they are two different tests with two different results, but they are asymptotically equivalent: if you run xtoverid with ordinary standard errors, it is the same as the Hausman test.

- I know I should cluster, so I guess that is not the issue. The issue is: can I use random effects? I can just say "oh, the test for exogeneity failed, so I assume the individual effects are correlated with the regressors and the RE assumptions are not met; I will go with FE." BUT I really want to know why, because I worry it means there is some bigger issue at stake here (like problems with the asymptotics due to a small number of clusters, or some clusters accounting for more than 5% of the data). If I have problems that can mess with the asymptotics, then my inference can be all wrong.

- SO, I want to know whether these things are driving this weird difference in results when clustering vs. not clustering, whether I need to be worried in a more general sense, OR whether something else is going on. Why would the results differ with clustering?
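
The two tests discussed above can be sketched on simulated data. Everything here is illustrative: the variable names (`id`, `t`, `x`, `y`, `a_i`), the data-generating process, and the seed are assumptions, not the poster's data. `xtoverid` is a user-written command (typically installed with `ssc install xtoverid`), so the sketch wraps it in `capture noisily` in case it is missing or fails on a given Stata version.

```stata
* Simulated panel: 30 units observed for 10 periods (all names are hypothetical)
clear
set seed 12345
set obs 30
gen id = _n
* unit-specific effect
gen a_i = rnormal()
expand 10
bysort id: gen t = _n
* regressor deliberately correlated with the unit effect
gen x = rnormal() + 0.5*a_i
gen y = 1 + 2*x + a_i + rnormal()
xtset id t

* Classical Hausman test: requires default (non-robust) standard errors
xtreg y x, fe
estimates store fe
xtreg y x, re
estimates store re
hausman fe re
scalar haus_df = r(df)

* Robust alternative: RE with cluster-robust errors, then xtoverid
xtreg y x, re vce(cluster id)
capture noisily xtoverid
if _rc != 0 display "xtoverid unavailable or failed (try: ssc install xtoverid)"
```

Because `x` is built to be correlated with `a_i`, both tests would be expected to reject here; the point is only to show where the clustered and non-clustered versions of the test enter the workflow.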

**Person B:**

Isn't the idea behind clustering that within a cluster there is not much variance, but across clusters there is, and thus you want to treat each cluster as a unit?

There is a lot of grey area here that I really hadn't pondered. My gut intuition says it may have to do with unbalanced panels, and so you are right, it's related to the size of the clusters.

**Iamjustapointe wrote:**

Let me restate the problem to see if I understand.

__Problem A__
If you cluster, i.e. absorb the within-cluster variation in your explanatory variables into the error term, cluster by cluster, then you reject RE in a Hausman-type test.

Namely, once the within-cluster variation is removed from the regressors, the regressors (sans within-cluster variation) appear to be correlated with the error term.

__Problem B__
You have a small number of clusters, i.e. <50.

So asymptotic results at the cluster level (i.e. the betas being approximately Gaussian, or rather t-distributed) don't hold.
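
To see why the number of clusters matters, it can help to write out the cluster-robust variance estimator. The version below is the textbook pooled-OLS form, shown only as an illustration (the RE case applies the same idea to transformed data). The notation is assumed here rather than taken from the thread: $X_g$ and $\hat u_g$ are the regressors and residuals of cluster $g$, and $G$ is the number of clusters.

```latex
\hat V_{\text{cluster}}
  = (X'X)^{-1}
    \left( \sum_{g=1}^{G} X_g' \hat u_g \hat u_g' X_g \right)
    (X'X)^{-1}
```

The middle term is a sum of $G$ cluster-level contributions, so its accuracy depends on $G$ being large, not on the number of observations inside each cluster.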

__Given A&B__
So should Problem A even be considered if we face Problem B?

***********************************************************************************

__Solution A__
Have you tried a clustered bootstrap, so as not to rely on the asymptotic distribution of your statistics?

They consider GK small (as small as 4 in some simulations). They use bootstrap methods, which, under certain circumstances, can actually yield tighter confidence intervals than those based on analytically "correct" (i.e. asymptotically correct) standard errors.

__Solution B__
beginning at the bottom of page 3, describes two alternatives you could try:

"One approach, suggested by Donald and Lang (2001), is to effectively treat the number of groups as the number of observations, and use finite sample analysis (with individual-specific unobservables becoming unimportant – relative to the cluster effect – as the cluster sizes get large). A second approach is to view the cluster-level covariates as imposing restrictions on cluster-specific intercepts in a set of individual-specific regression models, and then imposing and testing the restrictions using minimum distance estimation."

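The first approach in the quote can be sketched in its simplest form: collapse the data to cluster means and run the regression on the G cluster-level observations, so that inference uses G-based degrees of freedom. This is only a minimal illustration on simulated data; the variable names (`g`, `z`, `c_g`, `y`, `ybar`) and the data-generating process are assumptions.

```stata
* Simulated data: 30 clusters with 20 observations each (names are hypothetical)
clear
set seed 2024
set obs 30
gen g = _n
* cluster-level covariate and cluster-level shock
gen z = rnormal()
gen c_g = rnormal()
expand 20
gen y = 1 + 0.5*z + c_g + rnormal()

* Treat the number of groups as the number of observations
collapse (mean) ybar = y (first) z, by(g)
* 30 observations, so t-statistics are based on 28 degrees of freedom
regress ybar z
```
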
**Person B:**

That is a pretty cool response to small clusters (I have not yet faced that problem, as I'm usually clustering on states or countries :-p But this could come in handy). One question though: can you still use xtoverid after using bootstrapped clustering?

**Iamjustapointe wrote:**

I believe so, as xtoverid is used after xtreg which accepts bootstrapping:

http://www.stata.com/statalist/archive/2010-04/msg01412.html
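
As a rough sketch of what "xtreg accepts bootstrapping" can look like: `xtreg` takes a `vce(bootstrap)` option, and for panel estimators Stata should resample whole panels rather than single observations. The simulated data, variable names, seed, and number of replications below are all assumptions for illustration. Whether `xtoverid` runs cleanly on top of bootstrapped results is not something this sketch checks.

```stata
* Simulated panel: 30 units with 10 observations each (names are hypothetical)
clear
set seed 2024
set obs 30
gen id = _n
gen a_i = rnormal()
expand 10
gen x = rnormal()
gen y = 1 + 2*x + a_i + rnormal()

* declare only the panel variable; RE does not need a time variable
xtset id

* random effects with bootstrapped standard errors
xtreg y x, re vce(bootstrap, reps(200) seed(2024))
```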

You can also bootstrap "by hand":

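    * BOOTSTRAPPED STD ERRORS (cluster bootstrap of a two-step selection model)
    local B = 1000
    * B x 1 matrix that collects one coefficient estimate per replication
    matrix bs = J(`B', 1, 0)
    forvalues b = 1/`B' {
        quietly {
            * NOTE: use "preserve"/"restore" to bsample from the original dataset in each iteration
            preserve
            * draw whole clusters with replacement; cluster_unit is a placeholder
            * for your cluster identifier variable
            bsample, cluster(cluster_unit)
            capture drop xb lamda
            * first step: probit selection equation and inverse Mills ratio
            probit y x1 x2
            predict xb, xb
            gen lamda = normalden(xb) / normal(xb)
            * second step: outcome equation including the Mills ratio
            reg log_wage edyrs age lamda
            matrix b_hat = e(b)
            * keep the first coefficient (edyrs)
            matrix bs[`b', 1] = b_hat[1,1]
            restore
        }
    }
    * svmat turns the matrix column into the variable bs1
    svmat bs
    summarize bs1, detail
    * the bootstrapped standard error is the std deviation of the replicates
    local bs_se = r(sd)
    di "bootstrapped standard error: `bs_se'"
    * add_stat is not a built-in Stata command; it is presumably a user-written
    * helper that passes the statistic on to an output table
    add_stat "bs_se" `bs_se'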

**Person A:**

Update: I figured out the reason my exogeneity test fails for random effects when I allow for arbitrary heteroskedasticity and autocorrelation by clustering the errors. Recall that one of the assumptions of random effects is that u_i and e_it are both homoskedastic and that e_it is uncorrelated across t. When I tell Stata to cluster the errors, I am relaxing this assumption. This is not fatal for RE, per se (see Wooldridge, panel data textbook), but it does mean that if my se estimates change after allowing for het. and a.c., then that assumption was never sufficiently true in the first place and I was underestimating my standard errors. Underestimating the standard errors would inflate the test statistics in any exogeneity test and lead to over-rejection of the null.

A separate problem I was having was that, in addition to this, one of my variables was correlated with the facility effect (which I was modeling with RE). When I estimate things without that variable, everything behaves better.

The question remains as to what to do from here, but at least I figured out that a) my cluster size should not be an issue (in fact, cluster size is WAY less important than the number of clusters) and b) the estimation method that Stata's cluster option uses does 'reasonably well' with a small number of clusters (G=10). With G=30, I fall between small and enough (safe is considered G=50, but some people say 30 is fine). In any case, cluster size or number does not seem to be my problem. Nonetheless, bootstrapping has been shown to perform better than clustering, esp. with fixed effects (not sure about re's), so that is my next step.
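
The check described in this update, whether the standard errors move once heteroskedasticity and autocorrelation are allowed for, can be sketched as follows. The data are simulated with an AR(1) idiosyncratic error so that the classical RE assumption fails by construction; all variable names, the AR coefficient of 0.7, and the seed are assumptions for illustration.

```stata
* Simulated panel: 30 units, 10 periods, serially correlated errors (names are hypothetical)
clear
set seed 777
set obs 30
gen id = _n
gen a_i = rnormal()
gen x_unit = rnormal()
expand 10
bysort id: gen t = _n
* regressor with a persistent unit-level component (independent of a_i)
gen x = x_unit + rnormal()
* AR(1) idiosyncratic error within each unit
sort id t
gen eps = rnormal()
by id: replace eps = 0.7*eps[_n-1] + rnormal() if _n > 1
gen y = 1 + 2*x + a_i + eps
xtset id t

* RE with default standard errors, then with cluster-robust standard errors
xtreg y x, re
scalar se_default = _se[x]
xtreg y x, re vce(cluster id)
scalar se_cluster = _se[x]
display "default SE: " se_default "   clustered SE: " se_cluster
```

A noticeable gap between the two numbers is the symptom described above: the default standard errors were relying on an assumption that does not hold in the data.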
