Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

New implementation of cov/cor with extended API #6273

Merged
merged 16 commits into from
Mar 30, 2014
Merged

New implementation of cov/cor with extended API #6273

merged 16 commits into from
Mar 30, 2014

Conversation

lindahua
Copy link
Contributor

This PR is made based on #5249.

  • Here is the extended API:
cov(x[, corrected=true, vardim=1, zeromean=false])
cov(x, y[, corrected=true, vardim=1, zeromean=false])
cor(x[, vardim=1, zeromean=false])
cor(x, y[, vardim=1, zeromean=false])

covm(x, xmean[, corrected=true, vardim=1])
covm(x, xmean, y, ymean[, corrected=true, vardim=1])
corm(x, xmean[, vardim=1])
corm(x, xmean, y, ymean[, vardim=1])

Here, the arguments x and y can be either vectors or matrices.

Below is the explanation of keyword arguments:
- corrected: scale the matrix by 1 / (n - 1) if set to true (by default), or 1 / n otherwise. Note that this option does not apply to cor and corm, as it does not affect the resultant correlation values.
- vardim: consider variables in columns and observations in rows if set to 1 (by default), or the other way round, if set to 2.
- zeromean: x and y are considered to have zero means, if set to true. Otherwise, the means will be computed from x and y and subtracted therefrom, if set to false (default).

  • The computation is internally delegated to covzm and corzm, which take centered data. These functions further delegate to unscaled_covzm, which is the core implementation. All these internal functions are not exported.
  • The functions are thoroughly tested -- I expand the test cases to ensure correctness under all combinations of keyword arguments.

@johnmyleswhite
Copy link
Member

This seems like a much better API.

@lindahua
Copy link
Contributor Author

The PR is ready for review.

@jiahao
Copy link
Member

jiahao commented Mar 27, 2014

👍

@nalimilan
Copy link
Member

Nice!

@StefanKarpinski
Copy link
Member

This is a much better API but I really would love to figure out a way to get rid of the *m variants. It was awkward with varm but now there's a whole parallel family, which is worse. Every time there are parallel families of functions like this, it's a major API design smell.

@StefanKarpinski
Copy link
Member

You used a new type for weighted vectors in StatsBase – I wonder if something like that is warranted here? I.e. a vector of values with known mean? Is that an approach we can generalize? It's kind of elegant the way it works with multiple dispatch. In that case, it may make sense to move the known mean cases out of Base, but I think that's not unreasonable.

@lindahua
Copy link
Contributor Author

In current implementation, cor relies on corm, as something like

cov(x) = covm(x, mean(x))

So it makes sense to have covm to live in Base.

Another way to design the API without the need of *m functions may be the following:

cov(x)
cov(x, y)
cov((x, xmean))
cov((x, xmean), (y, ymean))

When people want to additionally provide a known mean, he may use a tuple (x, xmean). This way also uses multiple dispatch, without introducing any new types.

@StefanKarpinski
Copy link
Member

I like that tuple approach but I wouldn't mind getting an opinion from @JeffBezanson. I feel like he might have a reason to object that isn't occurring to me.

@lindahua
Copy link
Contributor Author

If no objection, I will implement the tuple-based interface tomorrow and update the PR.

@lindahua
Copy link
Contributor Author

Consider this more, with the zeromean keyword argument, it doesn't seem that we need to use covm or other APIs that allow people to input a known mean.

Suppose you already know the mean xmean, you can always write:

cov(x .- xmean; zeromean=true)

or

z = x .- xmean
cov(z; zeromean=true)
# z can be reused later

I am considering just deprecating the *m functions altogether. Opinions?

@nalimilan
Copy link
Member

How smart! The only gotcha is that subtracting the mean implies traversing the vector once and allocating space before calling cov, while the *m variants could do the subtraction on the fly if they wanted (though I see that currently you don't do this).

Personally I really don't care about this very subtle potential difference in performance, and the variants could easily be added later if needed.

@lindahua
Copy link
Contributor Author

Currently, the implementation of covm actually just does the subtraction for the user. The varm does everything in one pass. But the performance difference doesn't seem to be very big.

My proposal: modify this PR a little bit, remove covm and corm, which simply just does x .- xmean for the user, nothing more. The varm and stdm may continue to stay if we don't want to get rid of them for now.

@StefanKarpinski
Copy link
Member

How about making the keyword mean instead:

julia> var(x; mean=Base.mean(x)) = (x,mean)
var (generic function with 1 method)

julia> var([1,2,3])
([1,2,3],2.0)

To specify the mean just do something like var(x, mean=0).

@nalimilan
Copy link
Member

@StefanKarpinski Isn't it what you proposed here? #5249 (comment) I like this syntax, but it's not clear whether the performance hit due to the keyword argument can be eliminated in the future (cf. the discussion following your comment)..

@lindahua
Copy link
Contributor Author

I am fine with the mean keyword argument. In particularly, we can do

cov(x; mean=m) = covm(x, m)

Even if the type of mean is not specialized, the only overhead is to resolve the proper covm method in runtime. As the computation in cov is usually heavy, such overhead is generally negligible. In this way, we can have covm as an internal implementation function, which need not be exported.

The problem remains API design, what about cov(x, y). Shall we use xmean and ymean, as?

cov(x; mean=m)
cov(x, y; xmean=mx, ymean=my)

@StefanKarpinski
Copy link
Member

How about mean is a triple of the same size as the number of variables?

@nalimilan
Copy link
Member

Yeah, the 2-uple sounds like the best solution.

@jiahao
Copy link
Member

jiahao commented Mar 29, 2014

How about cov(x, y; mean=(mx,my))?

@lindahua
Copy link
Contributor Author

@jiahao, that's what @StefanKarpinski suggested. I am implementing this. Will be ready soon.

@lindahua
Copy link
Contributor Author

This is the new API:

cov(x[; vardim=1, corrected=true])    # default: using Base.mean to compute mean
cov(x; mean=0[; vardim=1, corrected=true])  # zero mean
cov(x; mean=m[; vardim=1, corrected=true])   # user provided mean

cov(x, y[; vardim=1, corrected=true])    # default: using Base.mean to compute mean
cov(x, y; mean=0[; vardim=1, corrected=true])  # zero mean
cov(x, y; mean=(mx, my)[; vardim=1, corrected=true])   # user provided means

cor(x[; vardim=1])    # default: using Base.mean to compute mean
cor(x; mean=0[; vardim=1])  # zero mean
cor(x; mean=m[; vardim=1])   # user provided mean

cor(x, y[; vardim=1])    # default: using Base.mean to compute mean
cor(x, y; mean=0[; vardim=1])  # zero mean
cor(x, y; mean=(mx, my)[; vardim=1])   # user provided means

The mean, corrected and vardim keyword arguments are also added to var and std.

@lindahua
Copy link
Contributor Author

I think it is ready to merge. If no further API change is needed, I will merge this soon.

@nalimilan
Copy link
Member

This API looks perfect to me. ;-)

@StefanKarpinski
Copy link
Member

My one last (I swear) possible bikeshed is the name of the vardim keyword. Would this be better to call variable? Is vardim clearer or just a weirder name? Sorry for the additional bikeshed. Once we settle on this it would be great to never have to change it ever again :-)

@lindahua
Copy link
Contributor Author

vardim indicates the dimension of variables. So it has the meaning of dimension.

When vardim = 1, it indicates that the variables are in columns, otherwise (vardim = 2) the variables are in rows.

The name variable does not convey the meaning of dimension. At first glance, variable = 1 seems to mean the variable value is 1.

@StefanKarpinski
Copy link
Member

That seems like a reasonable argument. Merge at will!

@toivoh
Copy link
Contributor

toivoh commented Mar 29, 2014

Just to continue thd bikeshed :) In numpy, this kind of argument is
invariably called axis. I think that we should be similarly consistent.

@StefanKarpinski
Copy link
Member

I dunno. I don't find the name axis to be enlightening at all.

@toivoh
Copy link
Contributor

toivoh commented Mar 29, 2014

We're talking about which dimension to reduce along. Although I like
axis, I primarily think that we should be consistent with reduction
functions and others that take the same kind of argument. I think that
dim would be fine; I don't see that vardim adds anything beyond that.

@StefanKarpinski
Copy link
Member

Oh, I see what you're saying – not that we should use axis but that we should be consistent.

@lindahua
Copy link
Contributor Author

There are two kinds of dimensions. The dimension of each sample/observation, or the dimension of each variable. I am just being explicit here.

@StefanKarpinski
Copy link
Member

I'm not really clear on how this relates to the dim keyword argument for reductions. Does it? @toivoh, do you have some unifying view of these kinds of dimensions arguments?

@lindahua
Copy link
Contributor Author

Reduction functions actually do not use keyword arguments. They simply use the second positional argument to specify the dimensions to reduce over.

@nalimilan
Copy link
Member

cov is a kind of reduction along the dimension of variables. But indeed dim is ambiguous; vardim looks like a good choice to me, both explicit enough and similar enough to dims used in reductions.

lindahua added a commit that referenced this pull request Mar 30, 2014
New implementation of cov/cor with extended API
@lindahua lindahua merged commit 5c72672 into JuliaLang:master Mar 30, 2014
@lindahua
Copy link
Contributor Author

I will add documentation & news later.

vtjnash added a commit that referenced this pull request Jun 7, 2016
the external abi is to call var,
the internal abi doesn' need to branch to alternative functions based on whether mean is given as zero
that simply made the dispatch less straightforward to understand
even though doing an exact comparison to 0 isn't generally reliable

ref #6273, which initially introduced these functions as the public API, before changing them to be the internal implementation
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants