New implementation of cov/cor with extended API #6273

lindahua · 2014-03-27T00:35:11Z

This PR is based on #5249.

Here is the extended API:

cov(x[, corrected=true, vardim=1, zeromean=false])
cov(x, y[, corrected=true, vardim=1, zeromean=false])
cor(x[, vardim=1, zeromean=false])
cor(x, y[, vardim=1, zeromean=false])

covm(x, xmean[, corrected=true, vardim=1])
covm(x, xmean, y, ymean[, corrected=true, vardim=1])
corm(x, xmean[, vardim=1])
corm(x, xmean, y, ymean[, vardim=1])

Here, the arguments x and y can be either vectors or matrices.

Below is the explanation of keyword arguments:
- corrected: scale the matrix by 1 / (n - 1) if set to true (by default), or 1 / n otherwise. Note that this option does not apply to cor and corm, as it does not affect the resultant correlation values.
- vardim: consider variables in columns and observations in rows if set to 1 (by default), or the other way round, if set to 2.
- zeromean: if set to true, x and y are assumed to have zero means. If set to false (default), the means are computed from x and y and subtracted from them.
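To make the three keywords concrete, here is a minimal sketch for the single-matrix case. It is written against current Julia (1.x) rather than the 2014 Base, and the name sketch_cov and its body are illustrative assumptions, not the code in this PR.

```julia
using Statistics  # provides mean

# Hypothetical helper (not the PR's implementation): covariance matrix of the
# variables stored in X, honouring the three keywords described above.
function sketch_cov(X::AbstractMatrix; corrected::Bool=true, vardim::Int=1,
                    zeromean::Bool=false)
    # Normalise the layout so that columns are variables and rows are observations.
    A = vardim == 1 ? X : permutedims(X)
    n = size(A, 1)                            # number of observations
    # Centre each column unless the caller promises the data is already centred.
    Z = zeromean ? A : A .- mean(A, dims=1)
    # Scale by 1 / (n - 1) when corrected, by 1 / n otherwise.
    return (Z' * Z) ./ (corrected ? n - 1 : n)
end

X = [1.0 2.0; 3.0 6.0; 5.0 4.0]               # 3 observations of 2 variables
sketch_cov(X)                                 # variables in columns
sketch_cov(permutedims(X); vardim=2)          # same data, variables in rows
```

Both calls at the end describe the same data set, so they should return the same 2×2 matrix.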

The computation is internally delegated to covzm and corzm, which take centered data. These functions further delegate to unscaled_covzm, which is the core implementation. None of these internal functions are exported.

The functions are thoroughly tested: I expanded the test cases to ensure correctness under all combinations of keyword arguments.
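The layering described here can be pictured roughly as follows for two vectors. The function names echo the ones mentioned in this comment with a _sketch suffix to mark them as hypothetical; the bodies are guesses written for current Julia (1.x), not the PR's actual code.

```julia
using Statistics  # provides mean

# Core: unscaled sum of products of already-centred data.
unscaled_covzm_sketch(x::AbstractVector, y::AbstractVector) = sum(x .* y)

# Centred data in, scaled covariance out.
covzm_sketch(x, y; corrected::Bool=true) =
    unscaled_covzm_sketch(x, y) / (length(x) - (corrected ? 1 : 0))

# Known means in: centre the data, then delegate.
covm_sketch(x, xmean, y, ymean; corrected::Bool=true) =
    covzm_sketch(x .- xmean, y .- ymean; corrected=corrected)

# Public entry point: compute the means, then delegate.
cov_sketch(x, y; corrected::Bool=true) =
    covm_sketch(x, mean(x), y, mean(y); corrected=corrected)

x = [1.0, 2.0, 4.0]
y = [2.0, 1.0, 6.0]
cov_sketch(x, y)
```

Each layer adds exactly one concern (means, centring, scaling), which is why only the outermost function would need to be public.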

johnmyleswhite · 2014-03-27T00:36:21Z

This seems like a much better API.

lindahua · 2014-03-27T02:26:51Z

The PR is ready for review.

jiahao · 2014-03-27T04:01:45Z

👍

nalimilan · 2014-03-27T09:16:49Z

Nice!

StefanKarpinski · 2014-03-27T14:07:15Z

This is a much better API but I really would love to figure out a way to get rid of the *m variants. It was awkward with varm but now there's a whole parallel family, which is worse. Every time there are parallel families of functions like this, it's a major API design smell.

StefanKarpinski · 2014-03-27T14:09:22Z

You used a new type for weighted vectors in StatsBase – I wonder if something like that is warranted here? I.e. a vector of values with known mean? Is that an approach we can generalize? It's kind of elegant the way it works with multiple dispatch. In that case, it may make sense to move the known mean cases out of Base, but I think that's not unreasonable.

lindahua · 2014-03-27T15:01:02Z

In the current implementation, cov relies on covm, roughly as follows:

cov(x) = covm(x, mean(x))

So it makes sense for covm to live in Base.

Another way to design the API without the need of *m functions may be the following:

cov(x)
cov(x, y)
cov((x, xmean))
cov((x, xmean), (y, ymean))

When users want to additionally provide a known mean, they can pass a tuple (x, xmean). This approach also uses multiple dispatch, without introducing any new types.
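A minimal sketch of how the tuple-based dispatch could look, restricted to vectors and to the corrected (n - 1) scaling. The name tuple_cov is hypothetical and the code targets current Julia (1.x); it only illustrates the dispatch idea, not an implementation from this PR.

```julia
using Statistics  # provides mean

# Core method: each argument is a (data, known mean) tuple.
function tuple_cov(xm::Tuple{AbstractVector,Real}, ym::Tuple{AbstractVector,Real})
    x, mx = xm
    y, my = ym
    return sum((x .- mx) .* (y .- my)) / (length(x) - 1)
end

# Plain vectors: compute the means and forward to the tuple method.
tuple_cov(x::AbstractVector, y::AbstractVector) = tuple_cov((x, mean(x)), (y, mean(y)))

# One-argument forms reduce to the variance of x.
tuple_cov(x::AbstractVector) = tuple_cov(x, x)
tuple_cov(xm::Tuple{AbstractVector,Real}) = tuple_cov(xm, xm)

x = [1.0, 2.0, 4.0]
y = [2.0, 1.0, 6.0]
tuple_cov(x, y)                          # means computed internally
tuple_cov((x, mean(x)), (y, mean(y)))    # means supplied by the caller
```

Because tuples and vectors are distinct types, the two call styles never collide, so no *m variants are needed.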

StefanKarpinski · 2014-03-27T15:17:48Z

I like that tuple approach but I wouldn't mind getting an opinion from @JeffBezanson. I feel like he might have a reason to object that isn't occurring to me.

lindahua · 2014-03-28T00:08:09Z

If there is no objection, I will implement the tuple-based interface tomorrow and update the PR.

lindahua · 2014-03-28T20:35:11Z

Thinking about this more: with the zeromean keyword argument, it doesn't seem that we need covm or any other API that lets people pass in a known mean.

Suppose you already know the mean xmean, you can always write:

cov(x .- xmean; zeromean=true)

or

z = x .- xmean
cov(z; zeromean=true)
# z can be reused later

I am considering just deprecating the *m functions altogether. Opinions?

nalimilan · 2014-03-28T20:48:49Z

How smart! The only gotcha is that subtracting the mean implies traversing the vector once and allocating space before calling cov, while the *m variants could do the subtraction on the fly if they wanted (though I see that currently you don't do this).

Personally I really don't care about this very subtle potential difference in performance, and the variants could easily be added later if needed.
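The difference being discussed can be sketched for a single vector as follows. Both function names are hypothetical and the code is written for current Julia (1.x); it illustrates the allocation trade-off only and makes no claim about how covm or varm were implemented.

```julia
using Statistics  # provides mean and var

# Two-step form: `x .- m` materialises a centred copy before the reduction.
cov_precentred(x, m) = sum(abs2, x .- m) / (length(x) - 1)

# On-the-fly form: the generator subtracts the mean element by element,
# so no temporary array is allocated.
cov_onthefly(x, m) = sum(abs2(xi - m) for xi in x) / (length(x) - 1)

x = [1.0, 2.0, 4.0]
m = mean(x)
cov_precentred(x, m)
cov_onthefly(x, m)
```

The two forms should agree up to floating-point rounding; they differ only in whether a centred copy of x is created along the way.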

lindahua · 2014-03-28T20:56:16Z

Currently, covm just does the subtraction for the user, whereas varm does everything in one pass. But the performance difference doesn't seem to be very big.

My proposal: modify this PR a little and remove covm and corm, which simply do x .- xmean for the user and nothing more. varm and stdm may stay if we don't want to get rid of them for now.

StefanKarpinski · 2014-03-28T21:54:30Z

How about making the keyword mean instead:

julia> var(x; mean=Base.mean(x)) = (x,mean)
var (generic function with 1 method)

julia> var([1,2,3])
([1,2,3],2.0)

To specify the mean just do something like var(x, mean=0).

nalimilan · 2014-03-28T23:00:40Z

@StefanKarpinski Isn't this what you proposed here? #5249 (comment) I like this syntax, but it's not clear whether the performance hit due to the keyword argument can be eliminated in the future (cf. the discussion following your comment).

lindahua · 2014-03-29T00:14:12Z

I am fine with the mean keyword argument. In particular, we can do

cov(x; mean=m) = covm(x, m)

Even if the type of mean is not specialized, the only overhead is resolving the proper covm method at run time. As the computation in cov is usually heavy, such overhead is generally negligible. In this way, we can keep covm as an internal implementation function, which need not be exported.

The remaining problem is API design: what about cov(x, y)? Shall we use xmean and ymean, as follows?

cov(x; mean=m)
cov(x, y; xmean=mx, ymean=my)

StefanKarpinski · 2014-03-29T02:15:34Z

How about making mean a tuple of the same size as the number of variables?

nalimilan · 2014-03-29T10:06:12Z

Yeah, the 2-tuple sounds like the best solution.

jiahao · 2014-03-29T14:25:48Z

How about cov(x, y; mean=(mx,my))?

lindahua · 2014-03-29T14:43:14Z

@jiahao, that's what @StefanKarpinski suggested. I am implementing this. Will be ready soon.

lindahua · 2014-03-29T15:55:40Z

This is the new API:

cov(x[; vardim=1, corrected=true])                      # default: using Base.mean to compute mean
cov(x; mean=0[, vardim=1, corrected=true])              # zero mean
cov(x; mean=m[, vardim=1, corrected=true])              # user provided mean

cov(x, y[; vardim=1, corrected=true])                   # default: using Base.mean to compute mean
cov(x, y; mean=0[, vardim=1, corrected=true])           # zero mean
cov(x, y; mean=(mx, my)[, vardim=1, corrected=true])    # user provided means

cor(x[; vardim=1])                                      # default: using Base.mean to compute mean
cor(x; mean=0[, vardim=1])                              # zero mean
cor(x; mean=m[, vardim=1])                              # user provided mean

cor(x, y[; vardim=1])                                   # default: using Base.mean to compute mean
cor(x, y; mean=0[, vardim=1])                           # zero mean
cor(x, y; mean=(mx, my)[, vardim=1])                    # user provided means

The mean, corrected and vardim keyword arguments are also added to var and std.
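As a rough illustration of the three ways of passing mean in the two-vector case, here is a sketch for current Julia (1.x). The names kw_cov and _covm_sketch are hypothetical, and using nothing as the "compute it for me" default is an assumption made for this sketch; the cov in today's Statistics standard library does not accept a mean keyword.

```julia
using Statistics  # provides mean; also brings the module name Statistics into scope

# Internal worker that takes explicit means (plays the role of covm in this thread).
_covm_sketch(x, mx, y, my; corrected::Bool=true) =
    sum((x .- mx) .* (y .- my)) / (length(x) - (corrected ? 1 : 0))

function kw_cov(x::AbstractVector, y::AbstractVector; mean=nothing, corrected::Bool=true)
    if mean === nothing
        # Default: compute both means. The keyword shadows the function name,
        # hence the qualified Statistics.mean.
        mx, my = Statistics.mean(x), Statistics.mean(y)
    elseif mean isa Tuple
        mx, my = mean        # mean=(mx, my): user provided means
    else
        mx = my = mean       # a single number, e.g. mean=0, applies to both
    end
    return _covm_sketch(x, mx, y, my; corrected=corrected)
end

x = [1.0, 2.0, 4.0]
y = [2.0, 1.0, 6.0]
kw_cov(x, y)                                   # means computed internally
kw_cov(x, y; mean=(mean(x), mean(y)))          # user provided means
kw_cov(x .- mean(x), y .- mean(y); mean=0)     # data already centred
```

All three calls describe the same computation, so they should agree up to floating-point rounding.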

lindahua · 2014-03-29T16:15:09Z

I think it is ready to merge. If no further API change is needed, I will merge this soon.

nalimilan · 2014-03-29T16:49:56Z

This API looks perfect to me. ;-)

StefanKarpinski · 2014-03-29T19:08:35Z

My one last (I swear) possible bikeshed is the name of the vardim keyword. Would this be better to call variable? Is vardim clearer or just a weirder name? Sorry for the additional bikeshed. Once we settle on this it would be great to never have to change it ever again :-)

lindahua · 2014-03-29T19:40:55Z

vardim indicates the dimension of variables. So it has the meaning of dimension.

When vardim = 1, it indicates that the variables are in columns, otherwise (vardim = 2) the variables are in rows.

The name variable does not convey the meaning of dimension. At first glance, variable = 1 seems to mean the variable value is 1.

StefanKarpinski · 2014-03-29T20:01:18Z

That seems like a reasonable argument. Merge at will!

toivoh · 2014-03-29T20:08:14Z

Just to continue the bikeshed :) In numpy, this kind of argument is
invariably called axis. I think that we should be similarly consistent.

StefanKarpinski · 2014-03-29T20:09:38Z

I dunno. I don't find the name axis to be enlightening at all.

toivoh · 2014-03-29T20:19:38Z

We're talking about which dimension to reduce along. Although I like
axis, I primarily think that we should be consistent with reduction
functions and others that take the same kind of argument. I think that
dim would be fine; I don't see that vardim adds anything beyond that.

StefanKarpinski · 2014-03-29T20:23:09Z

Oh, I see what you're saying – not that we should use axis but that we should be consistent.

lindahua · 2014-03-29T22:59:40Z

There are two kinds of dimensions: the dimension of each sample/observation, and the dimension of each variable. I am just being explicit here.

StefanKarpinski · 2014-03-30T02:56:35Z

I'm not really clear on how this relates to the dim keyword argument for reductions. Does it? @toivoh, do you have some unifying view of these kinds of dimension arguments?

lindahua · 2014-03-30T11:51:58Z

Reduction functions actually do not use keyword arguments. They simply use the second positional argument to specify the dimensions to reduce over.

nalimilan · 2014-03-30T12:30:19Z

cov is a kind of reduction along the dimension of variables. But indeed dim is ambiguous; vardim looks like a good choice to me, both explicit enough and similar enough to dims used in reductions.

lindahua · 2014-03-30T19:16:55Z

I will add documentation & news later.

the external abi is to call var; the internal abi doesn't need to branch to alternative functions based on whether mean is given as zero. that simply made the dispatch less straightforward to understand, even though doing an exact comparison to 0 isn't generally reliable. ref #6273, which initially introduced these functions as the public API, before changing them to be the internal implementation

lindahua added 10 commits March 23, 2014 14:06

add keyword arguments to var and std

8f4d9da

Merge branch 'master' into dh/cov2

20e5df8

new covm and cov implementations

1e6073c

cov tested

357f961

use unscaled_cov as core function

c004fe9

simplify the interface of covm and cov

d09192b

add new corm and cor methods

5126c0d

cor & corm tested

dab5fc4

export covm and corm

06d513a

optimized implementation of unscaled_covzm

9c2f8e0

Merge branch 'master' into dh/cov2

b178c71

new API for function cov

91bba8f

lindahua added 3 commits March 29, 2014 10:12

new API for cor

6628f3d

mean keyword argument for var and std

03594ab

minor adjustment of API for cov and cor

3b947d7

minor adjustment of API for cov

7513a20

lindahua added a commit that referenced this pull request Mar 30, 2014

Merge pull request #6273 from lindahua/dh/cov2

5c72672

New implementation of cov/cor with extended API

lindahua merged commit 5c72672 into JuliaLang:master Mar 30, 2014

lindahua mentioned this pull request Mar 31, 2014

More options for statistical functions #5249

Closed

This was referenced Apr 11, 2014

var(), sd() and cov() definitions #5023

Closed

Add corrcov, cov2cor function #4113

Closed

vtjnash mentioned this pull request Jun 5, 2016

change identity comparisons to use isequal #16764

Merged
