Comments on Probably Overthinking It: Are your data normal? Hint: no.

That's a good example of what I'm talking ...

2016-07-01T06:27:28.194-07:00

That's a good example of what I'm talking about: if you have enough data, Anderson-Darling and related tests will eventually indicate that your data are not normally distributed, but that doesn't mean a normal model would not be reasonable and useful.

Hi sir I have data of all blood pressure readings ...

2016-07-01T05:42:09.572-07:00

Hi sir,
I have blood pressure data with a maximum of 150 and a minimum of 70. All of the readings follow a normal pattern and the patients look very normal: 872 readings from 30 patients in total. However, the AD test says the data are not normal. They fit only a 3-parameter log-logistic distribution, and only approximately, with AD of 4,.. and p < 0.0005. Can I run a process capability evaluation on these data, treating them as normal? If I do, I will get excellent results.

I usually go the other way: I use Bayesian methods...

2014-01-05T11:11:27.720-08:00

I usually go the other way: I use Bayesian methods to generate a posterior distribution, and then if I need an MLE, I can get it from the posterior. But if all you have is an MLE, that's not enough to determine the distribution (in general).

And here it is: http://allendowney.blogspot.com/20...

2014-01-05T11:09:10.200-08:00

And here it is: http://allendowney.blogspot.com/2013/09/how-to-consume-statistical-analysis.html

Do you ever use a maximum likelihood estimation to...

2014-01-05T10:19:38.719-08:00

Do you ever use a maximum likelihood estimation to determine distributions?

Good suggestion. Thanks! Or there's one more...

2013-08-16T12:23:54.610-07:00

Good suggestion. Thanks!

Or there's one more alternative -- you can generate several rows with size N, sort the rows, then take the average of each column. As the number of rows increases, the averages converge on the rankits.
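A minimal sketch of that alternative, assuming NumPy is available. The function name `simulated_rankits`, the row count, and the seed are illustrative choices, not part of the original suggestion:

```python
import numpy as np

def simulated_rankits(n, rows=10_000, seed=0):
    """Approximate the rankits (expected normal order statistics) for sample size n."""
    rng = np.random.default_rng(seed)
    # Each row is one standard-normal sample of size n.
    samples = rng.normal(0.0, 1.0, size=(rows, n))
    # Sort within each row so column k holds the k-th smallest value.
    samples.sort(axis=1)
    # Averaging each column estimates the expected k-th order statistic.
    return samples.mean(axis=0)

rankits = simulated_rankits(7)
print(rankits)
```

Increasing `rows` tightens the estimate, at the cost of more memory and time.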

For the normality plot, I wanted to mention a litt...

2013-08-16T12:18:46.448-07:00

For the normality plot, I wanted to mention a little trick that helps out with small sample sizes (which will have varying results because of your small random draw from the normal distribution): you can draw a much larger set of normally distributed points, sort them, and then take only a subset of those sorted points. Honestly, the code is simpler than my prose, so here is a little sample:

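import numpy as np

n = 5000  # draws per data point; a large n makes for very solid normpts
data = [820, 770, 710, 910, 840, 670, 830]
sdata = sorted(data)
# Sort one big normal sample, then keep the middle point of each block of n.
normpts = sorted(np.random.normal(0, 1, len(data) * n))[n // 2::n]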

I am working on a talk that addresses exactly this...

2013-08-08T05:13:00.434-07:00

I am working on a talk that addresses exactly this question. I will post it soon.

Can you say something about why CDF is preferable ...

2013-08-08T02:30:51.633-07:00

Can you say something about why CDF is preferable to PDF for comparing distributions?

Yes, most of my objections about this kind of dist...

2013-08-07T13:17:54.267-07:00

Yes, most of my objections about this kind of distribution testing also apply to hypothesis testing in general.

I am picking on this particular application because when people ask about distribution testing, it is almost always the wrong question. What they really want, most of the time, is help making modeling decisions.

Thanks for this comment!

What you're describing is actually a problem w...

2013-08-07T13:05:49.082-07:00

What you're describing is actually a problem with hypothesis testing in general: rejecting the null does not tell you anything about effect size, so we can fool ourselves with over-powered tests. Goodness-of-fit tests like Kolmogorov-Smirnov and Anderson-Darling model the empirical process as a Brownian bridge and look for irregularities in the sup-norm or a weighted L2 norm, respectively. It's certainly possible to interpret an effect size here, assuming one bothers to report it, and it tells the same kind of story that your pictures do.

For example, suppose I want to know if 10^6 data points are well-modeled by a uniform distribution on [0,1] -- I can form the empirical process, which should be approximately a Brownian bridge. The sup of the empirical process, which is after all a one-sided KS statistic, is nothing more than 1000 * (max difference between empirical and desired cdf), where 1000 is the square root of 10^6. If I reject above some level T, I can say what I actually mean: I am rejecting for errors in probability on the order of T/1000. If this effect size were reported, you could make a decision about the meaning of the test, and would probably conclude for many purposes that it was ridiculously over-powered.
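The arithmetic in that example can be sketched as follows, assuming NumPy. The variable names, the seed, and the threshold `T` are illustrative assumptions, not values from the comment:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10**6
x = np.sort(rng.uniform(0.0, 1.0, n))

# One-sided KS distance: the largest amount by which the empirical CDF
# exceeds the uniform CDF. The sup is attained at a jump, where the
# empirical CDF equals i/n and the uniform CDF equals x[i-1].
i = np.arange(1, n + 1)
d_plus = np.max(i / n - x)

# Scaled statistic: sqrt(n) * distance, i.e. 1000 * distance for n = 10^6.
stat = np.sqrt(n) * d_plus

# Rejecting when stat > T means rejecting for CDF errors of about T / sqrt(n).
T = 1.5  # hypothetical rejection level
effect_size = T / np.sqrt(n)

print(d_plus, stat, effect_size)
```

With a million points, even a rejection corresponds to a gap between the two CDFs on the order of a thousandth, which is the effect size the comment argues should be reported.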

What I'm saying is just basic hygiene when we talk about tests for a single parameter, but for some reason as soon as things get non-parametric, common sense goes out the window.

Yes! The QQ plot requires you to compare your dat...

2013-08-07T10:09:39.808-07:00

Yes! The QQ plot requires you to compare your data to a specific distribution, so it is only as good as your estimated parameters.

The normal probability plot does not require you to estimate parameters. In the example I used the standard normal distribution, but I could have used any normal distribution. This works because of the property I mentioned: normal distributions are closed under linear transformation.
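As a rough illustration of that property, the sketch below pairs sorted data with a sorted standard normal sample and fits a line. It assumes NumPy, and the values of `mu`, `sigma`, and `n` are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n = 120.0, 15.0, 2000  # hypothetical parameters for the "data"

# Sorted data, and a sorted standard normal sample of the same size.
data = np.sort(rng.normal(mu, sigma, n))
ref = np.sort(rng.normal(0.0, 1.0, n))

# If the data are normal, the points (ref, data) fall near a straight line
# whose slope and intercept approximate sigma and mu.
slope, intercept = np.polyfit(ref, data, 1)
print(slope, intercept)
```

No parameters were estimated before plotting: the reference sample is standard normal, and the linear transformation shows up as the slope and intercept of the line.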

Could you explain why you are using a "normal...

2013-08-07T09:42:49.026-07:00

Could you explain why you are using a "normal probability plot" instead of a QQ-plot?