New implementation of cov/cor with extended API #6273
This seems like a much better API. |
The PR is ready for review. |
👍 |
Nice! |
This is a much better API but I really would love to figure out a way to get rid of the |
You used a new type for weighted vectors in StatsBase – I wonder if something like that is warranted here? I.e. a vector of values with known mean? Is that an approach we can generalize? It's kind of elegant the way it works with multiple dispatch. In that case, it may make sense to move the known mean cases out of Base, but I think that's not unreasonable. |
In the current implementation, cov(x) = covm(x, mean(x)). So it makes sense to have Another way to design the API without the need of
When people want to additionally provide a known mean, they may use a tuple |
I like that tuple approach but I wouldn't mind getting an opinion from @JeffBezanson. I feel like he might have a reason to object that isn't occurring to me. |
If no objection, I will implement the tuple-based interface tomorrow and update the PR. |
Consider this more, with the

Suppose you already know the mean

cov(x .- xmean; zeromean=true)

or

z = x .- xmean
cov(z; zeromean=true)
# z can be reused later

I am considering just deprecating the |
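
The centering trick in the comment above can be checked with a small sketch. The helper name `cov_zeromean` is hypothetical and is not code from this PR; it only assumes that its inputs have already been centered, and it compares the result with `cov` from the `Statistics` standard library.

```julia
using Statistics  # stdlib; provides mean, var and cov

# Hypothetical helper, not the PR's code: covariance of two vectors
# that the caller promises are already centered (zero mean).
function cov_zeromean(zx::AbstractVector, zy::AbstractVector; corrected::Bool=true)
    n = length(zx)
    length(zy) == n || throw(DimensionMismatch("vectors must have equal length"))
    # No mean is subtracted here; only the cross-product is scaled.
    return sum(zx .* zy) / (corrected ? n - 1 : n)
end

x = [1.0, 2.0, 4.0, 7.0]
y = [2.0, 1.0, 5.0, 8.0]

# Center once with the known means, then reuse the centered vectors.
zx = x .- mean(x)
zy = y .- mean(y)

c = cov_zeromean(zx, zy)
```

The cost noted in the next comment is visible here: `x .- mean(x)` allocates a new vector before the covariance is computed.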
How smart! The only gotcha is that subtracting the mean implies traversing the vector once and allocating space before calling Personally I really don't care about this very subtle potential difference in performance, and the variants could easily be added later if needed. |
Currently, the implementation of My proposal: modify this PR a little bit, remove |
How about making the keyword

julia> var(x; mean=Base.mean(x)) = (x,mean)
var (generic function with 1 method)
julia> var([1,2,3])
([1,2,3],2.0)

To specify the mean just do something like |
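
As a rough illustration of the keyword-default idea, the following sketch computes an actual variance. The name `var_sketch` is hypothetical, and the default is written as `Statistics.mean(x)` because the keyword itself is named `mean` and shadows the function inside the body.

```julia
using Statistics

# Hypothetical function name; the keyword defaults to the computed mean,
# so callers pass `mean` only when they already know it.
function var_sketch(x::AbstractVector; mean::Real=Statistics.mean(x), corrected::Bool=true)
    n = length(x)
    # Inside the body, `mean` is the keyword value, not the function.
    return sum(abs2, x .- mean) / (corrected ? n - 1 : n)
end

x = [1, 2, 3]
v_default = var_sketch(x)            # mean computed internally
v_known   = var_sketch(x; mean=2.0)  # mean supplied by the caller
v_zero    = var_sketch(x; mean=0)    # data treated as zero-mean
```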
@StefanKarpinski Isn't this what you proposed here? #5249 (comment) I like this syntax, but it's not clear whether the performance hit due to the keyword argument can be eliminated in the future (cf. the discussion following your comment). |
I am fine with the
Even if the type of The problem remains API design, what about

cov(x; mean=m)
cov(x, y; xmean=mx, ymean=my) |
How about mean is a tuple of the same size as the number of variables? |
Yeah, the 2-uple sounds like the best solution. |
How about |
@jiahao, that's what @StefanKarpinski suggested. I am implementing this. Will be ready soon. |
This is the new API:

cov(x[; vardim=1, corrected=true]) # default: using Base.mean to compute mean
cov(x; mean=0[; vardim=1, corrected=true]) # zero mean
cov(x; mean=m[; vardim=1, corrected=true]) # user provided mean
cov(x, y[; vardim=1, corrected=true]) # default: using Base.mean to compute mean
cov(x, y; mean=0[; vardim=1, corrected=true]) # zero mean
cov(x, y; mean=(mx, my)[; vardim=1, corrected=true]) # user provided means
cor(x[; vardim=1]) # default: using Base.mean to compute mean
cor(x; mean=0[; vardim=1]) # zero mean
cor(x; mean=m[; vardim=1]) # user provided mean
cor(x, y[; vardim=1]) # default: using Base.mean to compute mean
cor(x, y; mean=0[; vardim=1]) # zero mean
cor(x, y; mean=(mx, my)[; vardim=1]) # user provided means

The |
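
The three forms of the `mean` keyword for two inputs could be handled by dispatch, roughly as below. This is a sketch for vectors only, not the PR's implementation: `resolve_means` and `cov_sketch` are hypothetical names, `nothing` stands in for "keyword not given", and rejecting a nonzero scalar is an assumption of this sketch.

```julia
using Statistics

# Hypothetical helper: turn the `mean` keyword into a pair (mx, my).
#   nothing   -> compute both means
#   0         -> treat both vectors as zero-mean
#   (mx, my)  -> use the supplied means
resolve_means(x, y, ::Nothing) = (mean(x), mean(y))
resolve_means(x, y, m::Tuple{Real,Real}) = m
function resolve_means(x, y, m::Real)
    # Assumption of this sketch: a scalar is accepted only when it is 0.
    m == 0 || throw(ArgumentError("a scalar mean must be 0 for two inputs"))
    return (zero(m), zero(m))
end

function cov_sketch(x::AbstractVector, y::AbstractVector; mean=nothing, corrected::Bool=true)
    n = length(x)
    length(y) == n || throw(DimensionMismatch("x and y must have equal length"))
    mx, my = resolve_means(x, y, mean)
    return sum((x .- mx) .* (y .- my)) / (corrected ? n - 1 : n)
end

x = [1.0, 2.0, 4.0, 7.0]
y = [2.0, 1.0, 5.0, 8.0]

c_default = cov_sketch(x, y)                   # means computed internally
c_known   = cov_sketch(x, y; mean=(3.5, 4.0))  # user-provided means
c_zero    = cov_sketch(x, y; mean=0)           # zero mean assumed
```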
I think it is ready to merge. If no further API change is needed, I will merge this soon. |
This API looks perfect to me. ;-) |
My one last (I swear) possible bikeshed is the name of the |
When The name |
That seems like a reasonable argument. Merge at will! |
Just to continue the bikeshed :) In numpy, this kind of argument is |
I dunno. I don't find the name |
We're talking about which dimension to reduce along. Although I like |
Oh, I see what you're saying – not that we should use |
There are two kinds of dimensions. The dimension of each sample/observation, or the dimension of each variable. I am just being explicit here. |
I'm not really clear on how this relates to the |
Reduction functions actually do not use keyword arguments. They simply use the second positional argument to specify the dimensions to reduce over. |
I will add documentation & news later. |
the external abi is to call var, the internal abi doesn't need to branch to alternative functions based on whether mean is given as zero. that simply made the dispatch less straightforward to understand, even though doing an exact comparison to 0 isn't generally reliable. ref #6273, which initially introduced these functions as the public API, before changing them to be the internal implementation
This PR is made based on #5249.
Here, the arguments x and y can be either vectors or matrices.

Below is the explanation of keyword arguments:

- corrected: scale the matrix by 1 / (n - 1) if set to true (by default), or 1 / n otherwise. Note that this option does not apply to cor and corm, as it does not affect the resultant correlation values.
- vardim: consider variables in columns and observations in rows if set to 1 (by default), or the other way round, if set to 2.
- zeromean: x and y are considered to have zero means, if set to true. Otherwise, the means will be computed from x and y and subtracted therefrom, if set to false (default).

covzm and corzm, which take centered data. These functions further delegate to unscaled_covzm, which is the core implementation. All these internal functions are not exported.
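
The layering described above (center the data, delegate to a centered-data function, which delegates to an unscaled core) might look roughly like this for a single matrix. All names ending in `_sketch` are hypothetical stand-ins for illustration, not the internal functions of this PR, and the sketch uses the `dims` keyword of the `Statistics` standard library for centering.

```julia
using Statistics

# Core step (hypothetical): cross-product of already-centered data, no scaling.
function unscaled_covzm_sketch(Z::AbstractMatrix, vardim::Int)
    if vardim == 1
        return Z' * Z      # variables in columns, observations in rows
    elseif vardim == 2
        return Z * Z'      # variables in rows, observations in columns
    else
        throw(ArgumentError("vardim must be 1 or 2"))
    end
end

# Centered-data layer (hypothetical): apply the 1/(n-1) or 1/n scaling.
function covzm_sketch(Z::AbstractMatrix; vardim::Int=1, corrected::Bool=true)
    n = size(Z, vardim)    # observations run along dimension `vardim`
    return unscaled_covzm_sketch(Z, vardim) / (corrected ? n - 1 : n)
end

# Outer layer (hypothetical): center the data, then delegate.
function cov_sketch(X::AbstractMatrix; vardim::Int=1, corrected::Bool=true)
    Z = X .- mean(X, dims=vardim)
    return covzm_sketch(Z; vardim=vardim, corrected=corrected)
end

X = [1.0 2.0; 3.0 5.0; 4.0 9.0]       # 3 observations (rows), 2 variables (columns)
C1 = cov_sketch(X)                     # vardim=1: variables in columns
C2 = cov_sketch(copy(X'); vardim=2)    # same data with variables in rows
```

Both calls should give the same 2×2 matrix, since only the orientation of the data differs.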