## Monday, June 18, 2012

### Why Would We Use GLM Log-Link?

Person A:

Hi! I am finding in my current dataset that my residuals are sufficiently non-normal (using kdensity, pnorm, and swilk in Stata). One Statalister suggests using glm with an appropriate link(). As I recall, you used glm with a log link. Can you tell me why you used that? Have you heard of it being used for non-normal errors? Right now it is either a glm procedure or a non-linear transformation of the dependent variable, to which I have an aversion. Any advice/info you can share is super appreciated.

Person B:

I used glm log-link because my outcome variable (income) had a lot of zeros, so my data were right-skewed. Applying a log to the outcome would normalize the distribution (i.e., make bigger numbers smaller :))

But the log of zero is undefined. This Nichols pdf (slide 3) explains why a glm with log link gets around that. Short answer: ln(E(y|x)) is feasible when y = 0, because only the expected value of y needs to be positive, while E(ln(y)|x) is still undefined when y = 0. Why is it called the log-link?

Well, if you convert the above into what we're used to:

E(y|x) = exp(x'b)

Literally, we're saying that the log of E(y|x) varies linearly in x'b. Or equivalently, the exponential of the linear index x'b predicts y, and we estimate the parameters by MLE.

The errors, and therefore the expected variance, are calculated on the original scale of the predictor variables. What I'm not clear on is how the assumption on the distribution of the error terms affects inference. Anyone?

Why would we want to transform the data in this way?

Well, with right-skewed data, for instance, high incomes are unlikely: the data are clumped near zero. So clearly income is not normally distributed, though under OLS it's assumed that the errors are normal, and so the outcome variable is as well.
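The point above, that ln(E(y|x)) is defined even when some y = 0 while E(ln(y)|x) is not, can be checked in a few lines. A minimal pure-Python sketch with made-up income values:

```python
import math

# made-up outcome values, including zeros (as with income data)
y = [0.0, 0.0, 2.0, 5.0, 13.0]

mean_y = sum(y) / len(y)   # E(y) = 4.0, strictly positive
print(math.log(mean_y))    # log of the mean is well defined

# the other order fails: E(ln y) needs log(0), which is undefined
try:
    logs = [math.log(v) for v in y]
except ValueError as e:
    print("cannot take E(ln y):", e)
```

This is exactly why the log-link models the log of the conditional mean rather than the mean of the logged outcome.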

## Sunday, January 29, 2012

### Hausman Test, Small Number of Clusters and Bootstrapped Standard Errors

**Person A:**

Have you ever run into estimation problems due to a finite number of clusters (M < 50) or uneven cluster size (where some clusters make up more than 5% of the sample)?

A problem that is troubling me currently is that WITHOUT clustering (at the unit level), I cannot reject exogeneity of unit-specific effects and thus would be inclined to use random effects. But once I cluster and rerun the test, I reject exogeneity. SO:

WITHOUT clustering --> cannot reject exogeneity of RE (i.e. can use RE)

WITH clustering --> reject use of RE

I know that clustering is supposed to make the SEs smaller, but how would that lead to rejection of exogeneity (using xtoverid after the re, cluster regression), since SE estimates for both the RE and FE estimations would be affected that way? Could this have anything to do with a finite number of clusters or having a few really big clusters (each of the three big ones comprises 8-11% of the sample)?

**Person B:**

- Which Stata version?

- Secondly, you are using xtoverid to essentially choose between RE and FE correct?

- Thirdly, I think I read somewhere that you can't use xtoverid after clustering, but there may be a way to use the Hausman test

- Fourth, I don't think I have ever had a situation where someone asks, "why are you clustering?" Usually they complain if you are NOT clustering. So you may be able to get away with this in a publication sense.

**Person A:**

- I am using Stata 10, all updated

- I am using xtoverid to test exogeneity of the individual-specific effect. If they come out as exogenous, I say, "okay, the RE assumptions are met and I can use that."

- I am using xtoverid because the Hausman test does not work with clustered standard errors; xtoverid does. So, yeah, they are two different tests with two different results, but they are asymptotically equivalent: if you do xtoverid with ordinary standard errors, it is the same as the Hausman test.

- I know I should cluster, so I guess that is not the issue. The issue is: can I use random effects? I can just say, "oh, the test for exogeneity failed, so I assume the individual effects are correlated and the RE assumptions are not met. I will go with FE." BUT I really want to know why, because I worry it means there is some bigger issue at stake here (like issues with asymptotics due to a finite number of clusters, or some clusters accounting for more than 5% of the data). If I have these problems that can mess with the asymptotics, then my inference can be all wrong.

- SO, I want to know if these things are driving this weird difference in results when clustering vs. not clustering, and whether I need to be worried in a more general sense, OR if there is something else going on. Why would the results differ with clustering?

**Person B:**

Isn't the idea behind clustering that within a cluster there is not much variance, but outside there is, and thus you want to treat each cluster as a unit?

There is a lot of grey area here that I really haven't pondered. My gut intuition says it may have to do with unbalanced panels, and so you are right, it's related to the size of the clusters.

**Iamjustapointe wrote:**

Let me restate the problem to see if I understand.

__Problem A__
If you cluster, i.e. extract the variation in your explanatory variable into the error term by each cluster, then you reject RE in a Hausman test.

Namely, once the within-cluster variation is removed from the regressors, the regressors (sans within-cluster variation) appear to be correlated with the error term.

__Problem B__
You have a small number of clusters, i.e. < 50.

So asymptotic results at the cluster level (i.e. the betas being approximately Gaussian, or rather t-distributed) don't hold.

__Given A&B__
So should Problem A even be considered if we face Problem B?

***********************************************************************************

__Solution A__
Have you tried bootstrapped clustering so as not to rely on asymptotic distribution of your stats?

They consider GK small (as small as 4 in some simulations). They use bootstrap methods, which, under certain circumstances, can actually yield tighter confidence intervals than analytically "correct" (i.e. asymptotically correct) standard errors.

__Solution B__
The reference, beginning at the bottom of page 3, describes two alternatives you could try:

"One approach, suggested by Donald and Lang (2001), is to effectively treat the number of groups as the number of observations, and use finite sample analysis (with individual-specific unobservables becoming unimportant – relative to the cluster effect – as the cluster sizes get large). A second approach is to view the cluster-level covariates as imposing restrictions on cluster-specific intercepts in a set of individual-specific regression models, and then imposing and testing the restrictions using minimum distance estimation."

**Person B:**

That is a pretty cool response to small clusters (I have not yet faced that problem, as I'm usually clustering on states or countries :-p But this could come in handy). One question, though: can you still use xtoverid after using bootstrapped clustering?

**Iamjustapointe wrote:**

I believe so, as xtoverid is used after xtreg which accepts bootstrapping:


http://www.stata.com/statalist/archive/2010-04/msg01412.html


You can also bootstrap "by hand":

```stata
* BOOTSTRAPPED STD ERRORS
local B = 1000
matrix bs = J(`B', 1, 0)
forvalues b = 1(1)`B' {
    qui {
        * NOTE: use "preserve"/"restore" to bsample from the original dataset each iteration
        preserve
        bsample, cluster(cluster_unit)
        capture drop xb lamda
        probit y x1 x2
        predict xb, xb
        gen lamda = normalden(xb) / normal(xb)
        reg log_wage edyrs age lamda
        matrix e = e(b)
        matrix bs[`b', 1] = e[1,1]
        restore
    }
}
* svmat puts the matrix into a variable (bs1); its std deviation is the bootstrapped SE
svmat bs
summ bs1, det
* store the std deviation from summarize via the returned r(sd)
local bs_se = r(sd)
di "bootstrapped standard error: `bs_se' ..."
add_stat "bs_se" `bs_se'
```

**Person A:**

Update -- I figured out the reason my exogeneity test fails for random effects when I allow for arbitrary heteroskedasticity and autocorrelation by clustering errors. Recall that one of the assumptions of Random Effects is that uit and eit are both homoskedastic and uncorrelated across t. When I tell Stata to cluster the errors, I am relaxing this assumption. This is not fatal for RE per se (see Wooldridge's panel data textbook), but it does mean that if my SE estimates change after allowing for heteroskedasticity and autocorrelation, then that assumption was never sufficiently true in the first place and I was underestimating my errors. Underestimating the errors would over-estimate the t-stats in any exogeneity test and lead to over-rejection of the null. A separate problem I was having was that, in addition to this, one of my variables was correlated with the facility effect (which I was modeling with RE) -- when I estimate things without that variable, everything behaves better. The question remains as to what to do from here, but at least I figured out that a) my cluster size should not be an issue (in fact, cluster size is WAY less important than the number of clusters) and b) the estimation method that Stata's cluster command uses does 'reasonably well' with a small number of clusters (G=10). With G=30, I fall between small and enough (safe is considered G=50, but some people say 30 is fine). In any case, cluster size or number does not seem to be my problem. Nonetheless, bootstrapping has been shown to perform better than clustering, esp. with fixed effects (not sure about REs), so that is my next step.
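The cluster bootstrap discussed above resamples whole clusters (not individual observations) with replacement and recomputes the statistic on each resample; the bootstrapped SE is the standard deviation of those statistics. A minimal pure-Python sketch with made-up data (the cluster IDs, values, and the mean as the statistic are all illustrative):

```python
import random

random.seed(42)

# made-up clustered data: cluster id -> observations in that cluster
clusters = {
    "a": [1.0, 2.0, 3.0],
    "b": [10.0, 11.0],
    "c": [5.0, 6.0, 7.0, 8.0],
}

def mean(xs):
    return sum(xs) / len(xs)

ids = list(clusters)
B = 1000
stats = []
for _ in range(B):
    # resample CLUSTERS with replacement, keeping each cluster intact
    draw = [random.choice(ids) for _ in ids]
    sample = [x for cid in draw for x in clusters[cid]]
    stats.append(mean(sample))

# bootstrapped SE = std deviation of the resampled statistics
m = mean(stats)
bs_se = (sum((s - m) ** 2 for s in stats) / (B - 1)) ** 0.5
print("bootstrapped SE of the mean:", bs_se)
```

Keeping clusters intact preserves the within-cluster correlation in each resample, which is the whole point of clustering in the first place.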

### Python Rocks

Attended a great Python workshop led by Hackers DC. Detailed and fast-paced. Learned a ton.

Can't wait to write my own API!!!

Cool things I learned:


- Cool API for newspaper information: http://www.seomoz.org/blog/seomoz-free-api-and-enough-power-to-build-open-site-explorer
- Cron: a program that lets you **execute scripts automatically** at a specified time/date: http://unixgeeks.org/security/newbie/unix/cron-1.html
- ipython: install ipython to have autocomplete

- Join columns with tabs or commas:

```python
columns = []
for header in headers:
    columns.append(header.text)
print '\t'.join(columns)
# or
print ','.join(columns)
```

- Generic try statement with **pass** instead of specifying an error:

```python
try:
    bank_name = data[0].text
    print bank_name
except:
    pass
```

- Randomize the frequency of URL hits:

```python
# sleep a random number of seconds between requests
# (also: fake the user client)
random.choice(range(1, 10))
```
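To make the cron item above concrete, here's a sample crontab entry (the script path and name are hypothetical) that would run a scraper every day at 6:15 AM:

```shell
# minute hour day-of-month month day-of-week  command
15 6 * * * /usr/bin/python /home/user/scrape.py
```

Edit the table with `crontab -e`; cron runs each line on its schedule.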

### Regular Expressions in Stata

Useful to destring elements:

http://www.stata.com/support/faqs/data/regex.html

http://statadaily.wordpress.com/2011/06/10/destring-complication/

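The character-class pattern used with regexm below can be tried out in Python's re module first. A small sketch with made-up string values:

```python
import re

# made-up string values you might want to destring
vals = ["1,234", "56", "7.8", "n/a"]

# flag entries containing anything other than digits, spaces, or periods
bad = [v for v in vals if re.search(r"[^0-9 .]", v)]
print(bad)  # -> ['1,234', 'n/a']: the entries that would break destring
```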

**tab** *var* if regexm(*var*, "[^0-9 .]")

## Thursday, January 26, 2012

### MIC

At last night's Data Science meetup in DC, we heard a talk led by Sean Murphy about the Maximal Information Coefficient (MIC), recently developed by two brothers.

It's a non-parametric statistic that helps identify meaningful associations among a slew of variables, without assuming a linear relationship the way a correlation matrix or principal components/factor analysis methods do.

It's useful when you're taking a first swipe at the data.
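To see why a linear measure can miss what MIC is designed to catch, here's a pure-Python sketch with made-up data: Pearson's r is exactly zero for a perfect, but nonlinear, quadratic relationship.

```python
# a perfect (but nonlinear) relationship: y = x^2
xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
ys = [x * x for x in xs]

def pearson_r(a, b):
    # sample Pearson correlation, computed by hand
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sd_a = sum((x - ma) ** 2 for x in a) ** 0.5
    sd_b = sum((y - mb) ** 2 for y in b) ** 0.5
    return cov / (sd_a * sd_b)

print(pearson_r(xs, ys))  # -> 0.0: linear correlation sees nothing here
```

A statistic like MIC would flag this pair as strongly associated, which is exactly its selling point for a first pass over many variables.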

## Thursday, January 19, 2012

### Latest and Greatest of Metrics with Networks

A well-informed friend told me about the latest MIT grad development students working on the metrics behind networks:

http://econ-www.mit.edu/files/6909

Which we all know is majorly hopeless because of the reflection problem and the difficulty of creating exogenous shocks to social networks (except in my dissertation :).

There's some gnarly modeling of individuals' correlation with other individuals in the var-cov matrix (this is a parametric estimation), as well as a relaxation of assumptions, like independence across individuals' networks in a sample.

