var(), sd() and cov() definitions #5023
Comments
We used to offer this and got rid of it. I'd be happy to add back a keyword argument called
Thanks. So you'd follow R and Matlab's default behaviors? Wondering about the name and the form of the keyword argument. Also, do you think it would make sense to add an optional weights argument? This is often useful, and many languages feel lame by not providing such elementary statistics easily.

1: http://en.wikipedia.org/wiki/Standard_deviation#Uncorrected_sample_standard_deviation
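To make the weights idea concrete, here is a minimal sketch in Python (the thread never settles on an API, so the `weighted_mean_var` helper and its frequency-weight semantics are assumptions for illustration, not a proposal from the thread):

```python
import statistics

def weighted_mean_var(xs, ws):
    """Frequency-weighted mean and (uncorrected, divide-by-N) variance.
    Treating ws as nonnegative frequency weights is an assumption;
    other weighting conventions (reliability weights, etc.) differ."""
    total = sum(ws)
    mean = sum(w * x for x, w in zip(xs, ws)) / total
    var = sum(w * (x - mean) ** 2 for x, w in zip(xs, ws)) / total
    return mean, var

# With unit weights this reduces to the uncorrected (divide-by-N) variance:
xs = [1.0, 2.0, 3.0, 4.0]
m, v = weighted_mean_var(xs, [1, 1, 1, 1])
assert abs(v - statistics.pvariance(xs)) < 1e-12
```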
I like the name
d74ad66#diff-b85f81409a2a21f2174ccb078b0d7030 +1 for weighted versions as well.
There is a
How about "unbiased" instead of "corrected"? The computations using N are perfectly correct, so one cannot really "correct" them; one can only compute something different, for a different purpose. The purpose of using N-1 is to obtain an unbiased estimator for the corresponding population quantity. Basically the only reason you'd want that is that you are averaging a lot of such estimators and want the result to be more accurate. That requires the base estimator to be unbiased.

The clearest situation would be estimating the covariance matrix. The usual "N-1" formula gives you a horrible estimator in high dimensions, and you'll want to use a shrinkage estimator (e.g. Ledoit-Wolf) unless you are doing averaging as explained above. Should the `cov()` function use a shrinkage estimator when one asks for a "corrected" result? If not, then clearly it is the unbiasedness that is the defining quality of the result, hence "unbiased" is a better term. If yes, then there still ought to be a more descriptive term ("corrected" how? there are many reasonable covariance matrix estimators, all different). Even "estimator" is more descriptive (the case with N isn't really intended as an estimator; it is more a property of the data set itself, divorced from any notion of populations or sampling, much like its length, sum or mean is).
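The unbiasedness claim here can be checked exhaustively on a tiny example: averaging the N-1 estimator over every size-2 sample (drawn with replacement) from a two-point population recovers the population variance exactly, while the divide-by-N estimator comes out low by the factor (n-1)/n. A Python sketch (the two-point population is made up purely for illustration):

```python
from itertools import product
from statistics import mean, pvariance, variance

pop = [0.0, 1.0]
sigma2 = pvariance(pop)  # population variance (divide by N): 0.25

# Enumerate all size-2 samples with replacement and average each estimator:
samples = list(product(pop, repeat=2))
avg_unbiased = mean(variance(s) for s in samples)   # each divides by n-1
avg_biased = mean(pvariance(s) for s in samples)    # each divides by n

assert abs(avg_unbiased - sigma2) < 1e-12       # unbiased on average
assert abs(avg_biased - sigma2 / 2) < 1e-12     # biased low by (n-1)/n = 1/2
```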
My preference for
Yeah, "unbiased" really implies that the other formula gives biased results, which is not the case when you are not estimating the population quantities. While the opposite to "unbiased" is "biased", the opposite to "corrected" is just "uncorrected", which means either "wrong" or "does not need correction to be right". So I see it as more neutral. "estimator" would also make sense to me. |
And even if the term

In general I mean, doesn't everyone want their estimator to be "unbiased"? How could that not be a good thing? How could we possibly think that a "biased" estimator would be acceptable? It is just a mathematical property, and not a particularly important one at that.

Sorry for getting on a soapbox about this. My research is related to estimating "variance components", and people do all sorts of stupid things in this area because they haven't thought through the consequences of their assumptions and just go by gut instinct influenced by loaded terminology.
The original
Whether you say it is a

Having said all this, I am in favor of having the version with

By the way, the use of

I'm beginning to understand the term "bikeshedding".
FWIW, the term
I suppose you could call it "bias_corrected" or "bessel_corrected" or just "bessel" and still convey what exactly is being "corrected", or what correction is being applied (i.e. the most common correction for bias used in certain circumstances, as opposed to some other correction).

Of course the claim of unbiasedness requires assumptions (but not a normality assumption, @dmbates). The whole claim that the modification is a correction at all (as opposed to making things worse) requires pretty much the same assumptions, so "correction" is not any better on that front. Also, for Gaussians the estimate with N is the maximum likelihood one, so one could argue that the N-1 estimate is worse as it is less likely, and being worse it is not a "correction" at all; calling it a "correction" is more loaded than calling it "unbiased", as the latter is simply a fact.

As for sd(), one could call it "bessel_corrected" (or just "bessel"?), or simply leave the option out altogether. The N-1 tweak for that one is pretty much useless, as it does not correct the bias. People use it simply because it is the square root of an unbiased variance estimator, not because it is a good estimator of the standard deviation as such. One can really correct the bias if one uses e.g. a normality assumption (in which case "unbiased" would work as a term), but nobody bothers, because nobody cares about standard deviations except as more human-friendly "pretty-printings" of variances. One almost always averages variances, not standard deviations, and unbiasedness really only matters when averaging (who cares about a factor of N/(N-1) ≈ 1 when it doesn't accumulate anyway?). Leaving the whole correction out for this and requiring the user to write e.g. `sqrt(var(x, unbiased=true))` wouldn't even be a bad choice if Bessel is never applied by default.
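The `sqrt(var(x, unbiased=true))` idiom mentioned above can be illustrated with Python's stdlib (a sketch only; the data set is just an example, and the point about bias in the last comment is noted inline):

```python
import math
from statistics import pstdev, pvariance, stdev, variance

xs = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

# "Bessel-corrected" sd as commonly computed: sqrt of the N-1 variance.
sd_bessel = math.sqrt(variance(xs))
# Uncorrected: sqrt of the divide-by-N variance.
sd_plain = math.sqrt(pvariance(xs))

# These match the stdlib's two standard-deviation functions:
assert abs(sd_bessel - stdev(xs)) < 1e-12
assert abs(sd_plain - pstdev(xs)) < 1e-12

# Note: sqrt of an unbiased variance is NOT an unbiased sd estimator;
# sqrt is concave, so by Jensen's inequality it underestimates on average.
```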
So, it seems to me that one very simple option here would be to basically revert d74ad66#diff-b85f81409a2a21f2174ccb078b0d7030 and call the argument e.g.
So, there would be a version with only

Should
For the (probable) minority of users who need the 1/n scaling, why can't they just multiply by (n-1)/n? Unless n is really small, the cost of this scaling should be negligible compared to the computation of
Is it so rare to center and scale a set of values that are absolutely not drawn from any population? Of course it's easy to get the correct value, but I just find it nicer to have code calling the stock function with a documented parameter than multiplying the result by a quantity whose origin somebody reading the code for the first time will have to puzzle out. The API clutter is minimal, since there aren't many parameters to pass to these functions anyway. Also, this may be useful if
;) Jokes aside, personally I can only provide anecdotal evidence, but I think it's somewhat telling that both numpy and Matlab give the option (and numpy goes as far as providing a generalization, via the

1. it's ugly; even more so for

So I'd say that is a terrible option to offer.
Well, you can always do
Yeah, I think his new functions cover everything here. |
Currently `var()` only allows using one definition of the variance: the unbiased estimate of the distribution variance based on an IID sample (dividing by N-1). This is also what R and Matlab do for `var()`, `cov()` and `std()` [1][3]. OTOH, Numpy has chosen the more basic mathematical definition dividing by N, and offers an option to use N-1 instead [2]. It also uses this behavior for `std()`, but not for `cov()` [4].

Are there any plans to add such an option to Julia? Scaling values using the basic mathematical (N) definition is a relatively common operation. I would support using this definition by default at least for `var()` and `std()`, but that may be a little too late. ;-)

1: http://www.mathworks.fr/fr/help/matlab/ref/var.html
2: http://docs.scipy.org/doc/numpy/reference/generated/numpy.var.html
3: http://www.mathworks.fr/fr/help/matlab/ref/std.html
4: http://docs.scipy.org/doc/numpy/reference/generated/numpy.cov.html

EDIT: Fix mistake about Matlab.
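Numpy's option mentioned in [2] is the `ddof` ("delta degrees of freedom") argument: the sum of squared deviations is divided by N - ddof, so `ddof=0` (the default) gives the N convention and `ddof=1` gives the N-1 convention used by R and Matlab. A stdlib-only Python sketch of that behavior (the `var_ddof` helper is written here for illustration, not taken from numpy's code):

```python
from statistics import pvariance, variance

def var_ddof(xs, ddof=0):
    # Mimics numpy.var's ddof generalization: divide the sum of
    # squared deviations from the mean by n - ddof.
    n = len(xs)
    m = sum(xs) / n
    return sum((x - m) ** 2 for x in xs) / (n - ddof)

xs = [1.0, 2.0, 3.0, 4.0]
assert abs(var_ddof(xs) - pvariance(xs)) < 1e-12    # ddof=0: numpy default (N)
assert abs(var_ddof(xs, 1) - variance(xs)) < 1e-12  # ddof=1: R/Matlab (N-1)
```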