Monday, June 18, 2012
Hi! I am finding in my current dataset that my residuals are sufficiently non-normal (using kdensity, pnorm and swilk in Stata). One Statalister suggests using glm with an appropriate link(). As I recall you used glm with log link. Can you tell me why you used that? Have you heard of it being used for non-normal errors? Right now it is either a glm procedure or a non-linear transformation of the dep var, to which I have an aversion. Any advice/info you can shed is super appreciated.
I used glm log-link because my outcome variable (income) had a lot of zeros, so my data were right skewed. Therefore, applying a log to the outcome would normalize the distribution (i.e. make bigger numbers smaller :))
But the log of zero is undefined. This Nichols pdf explains why a glm with log link gets around that, slide 3. Short answer: ln(E(y|X)) is feasible for y=0, because only the expected value of y need be positive, while E(ln(y)|X) for y=0 is still undefined. Why is it called the log-link?
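To make the difference between the two expressions concrete, here is a minimal Python sketch. The income values are made up purely for illustration (they are not from my data). It shows that the log of the mean can be computed when some observations are zero, while the mean of the logs cannot.

```python
import math

# Hypothetical incomes with several zeros (illustrative values only).
incomes = [0, 0, 0, 12000, 25000, 40000, 150000]

# ln(E(y)): take the mean first, then the log. Only the mean must be positive.
mean_income = sum(incomes) / len(incomes)
log_of_mean = math.log(mean_income)

# E(ln(y)): log each observation first. This fails as soon as some y = 0.
try:
    mean_of_logs = sum(math.log(y) for y in incomes) / len(incomes)
except ValueError:
    mean_of_logs = None  # math.log(0) raises ValueError

print("log of mean:", log_of_mean)
print("mean of logs:", mean_of_logs)
```

The first quantity is what the log link models; the second is what you would be modelling after log-transforming the outcome by hand.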
Well, if you convert the above into what we're used to: ln(E(y|X)) = x'b, or equivalently E(y|X) = exp(x'b).
Literally, we're saying that the log of the expected value of y is predicted (by x'b) to vary linearly. Or, put the other way around, the exponential of the linear predictor x'b gives the expected value of y, and we estimate the parameters by MLE.
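For anyone curious what the estimation step looks like, here is a rough sketch of a log-link fit with one predictor, written in plain Python. It is not what Stata's glm does internally line for line. It assumes a Poisson-type variance (variance proportional to the mean), which is only one possible family choice, and it uses iteratively reweighted least squares, the usual way to get the MLE for a GLM. The function name `fit_log_link` and the toy data are hypothetical.

```python
import math

def fit_log_link(x, y, iterations=50):
    """Fit E(y|x) = exp(b0 + b1*x) by iteratively reweighted least squares.

    Assumes a Poisson-type variance function (an assumption of this sketch).
    """
    # Start from the log of the overall mean, with zero slope.
    b0, b1 = math.log(sum(y) / len(y)), 0.0
    for _ in range(iterations):
        mu = [math.exp(b0 + b1 * xi) for xi in x]
        # Working response on the log scale; the weights equal mu here.
        z = [b0 + b1 * xi + (yi - mi) / mi for xi, yi, mi in zip(x, y, mu)]
        sw = sum(mu)
        swx = sum(m * xi for m, xi in zip(mu, x))
        swxx = sum(m * xi * xi for m, xi in zip(mu, x))
        swz = sum(m * zi for m, zi in zip(mu, z))
        swxz = sum(m * xi * zi for m, xi, zi in zip(mu, x, z))
        # Solve the 2x2 weighted normal equations for (b0, b1).
        det = sw * swxx - swx * swx
        b0 = (swxx * swz - swx * swxz) / det
        b1 = (sw * swxz - swx * swz) / det
    return b0, b1

# Toy data with zeros in the outcome (illustrative values only).
x = [0, 1, 2, 3, 4, 5, 6, 7]
y = [0, 0, 1, 0, 3, 2, 6, 9]

b0, b1 = fit_log_link(x, y)
fitted = [math.exp(b0 + b1 * xi) for xi in x]
print("b0 =", b0, "b1 =", b1)
```

Note that y itself is never logged, so the zeros cause no trouble. Only the fitted mean exp(b0 + b1*x) has to be positive, and it always is.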
The errors, and therefore the expected variance, are calculated on the original scale of the outcome variable rather than on the log scale. What I'm not clear on is how the assumption on the distribution of the error terms affects inference. Anyone?
Why would we want to transform the data in this way?
Well, with right-skewed data, for instance, high incomes are unlikely: the data are clumped near zero. So clearly, income is not normally distributed, though under OLS it's assumed that the errors are normal, and so the outcome variable (conditional on the predictors) is as well.