More options for statistical functions #5249

lindahua · 2013-12-28T14:30:50Z

In practice, I find that the current interface of some statistical functions in Julia base is limited. It would be great to allow more options (preferably through keyword arguments).

Three options that I think are useful:

byrows: currently, cov(X, Y) considers each column as a variable and each row as an observation (sample). This is useful in a lot of settings. However, the opposite case is also very important. In many problems, we consider each column as a sample, and each row as a variable. Particularly, this is the convention of Distributions.jl to generate samples by columns.

The function cor also needs such an option.

allow users to choose between normalization by N-1 and by N. More generally, we can follow NumPy and introduce a ddof argument, normalizing the result by N - ddof. Of course, the default value of ddof should be 1.

This option also applies to var, std, and cor.

introduce covm and corm, which allow users to additionally supply a pre-computed mean.

Note: Numpy allows the first two options: http://docs.scipy.org/doc/numpy/reference/generated/numpy.cov.html
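
As an illustration of the second option, here is a minimal sketch of normalization by N - ddof for a single vector. The name var_ddof is hypothetical and not part of Julia Base, and the code assumes current Julia (1.x) syntax rather than the syntax available when this issue was opened.

```julia
# Hypothetical helper (not in Julia Base): variance normalized by N - ddof.
# ddof = 1 gives the usual corrected estimate, ddof = 0 divides by N.
function var_ddof(x::AbstractVector{<:Real}; ddof::Integer=1)
    n = length(x)
    m = sum(x) / n                          # sample mean
    return sum(abs2, x .- m) / (n - ddof)   # sum of squared deviations over N - ddof
end

x = [1.0, 3.0, 5.0]
v1 = var_ddof(x)            # squared deviations sum to 8, divided by N - 1 = 2
v0 = var_ddof(x; ddof=0)    # same sum, divided by N = 3
```

The same divisor would apply to std (as the square root) and to each entry of a covariance matrix.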

I can implement all these in a PR. Before I do this, I would like to see people's opinions.

johnmyleswhite · 2013-12-28T14:33:59Z

I wholeheartedly agree. I was considering making a PR for the byrows change myself yesterday.

johnmyleswhite · 2013-12-28T14:36:13Z

For reference, there have been a lot of debates about the normalization constant before.

nalimilan · 2013-12-30T16:45:14Z

Yeah, see #5023 for the cor() normalization issue.

andreasnoack · 2013-12-30T21:09:20Z

Just out of curiosity: What is the argument, besides habit, for observations as columns?

lindahua · 2013-12-30T21:18:17Z

It is a widely adopted convention in the machine learning literature to treat each feature vector as a column when working with a data set.

Many algorithms treat each feature vector (observation) as a whole and process them one by one. In such algorithms, arranging feature vectors (observations) as columns is more efficient, given that Julia arrays are column-major.
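
To make the column-major point concrete, here is a small sketch, assuming current Julia (1.x) syntax. The function name sum_sq_norms and the array sizes are made up for illustration. When observations are stored as columns, each one occupies a contiguous block of memory, so a per-observation loop reads memory sequentially.

```julia
# 4 features (rows), 1000 observations (columns); sizes are arbitrary.
X = rand(4, 1000)

# In a dense column-major matrix, moving down a column has stride 1 and
# moving across a row has stride equal to the number of rows.
s = strides(X)

# Hypothetical per-observation computation: sum of squared norms.
function sum_sq_norms(X::AbstractMatrix)
    total = zero(eltype(X))
    for j in axes(X, 2)
        x = view(X, :, j)        # j-th observation, contiguous in memory
        total += sum(abs2, x)
    end
    return total
end

t = sum_sq_norms(X)
```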

lindahua · 2014-03-19T13:32:14Z

While writing some machine learning packages, I find myself needing an extended version of cov and cor that allows uncorrected estimation and treats observations as columns.

I now propose the following syntax:

cov(x; ...)
cov(x, y; ...)
cor(x; ...)
cor(x, y; ...)

They support three keyword arguments:

corrected (according to var(), sd() and cov() definitions #5023), default = true
rowvar: when this is set to true, it considers variables as rows and observations as columns; otherwise, it considers variables as columns and observations as rows (default = false, conforming to current behavior)
zeromean: default = false. When this is set to true, it indicates that the input samples have already been centered and thus have zero mean, so there is no need to center them within the function. (I have considered the name centered for this, but find it less explicit.)
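
The following is only a sketch of how the three proposed keywords could interact for the single-matrix case. It is not the implementation that was later merged. The function name cov_kw is hypothetical, and the code assumes current Julia (1.x) syntax.

```julia
# Hypothetical sketch of the proposed keyword interface (single-matrix case only).
function cov_kw(X::AbstractMatrix{<:Real};
                corrected::Bool=true, rowvar::Bool=false, zeromean::Bool=false)
    obsdim = rowvar ? 2 : 1                   # dimension that indexes observations
    n = size(X, obsdim)
    # Skip centering when the caller states the data already have zero mean.
    Z = zeromean ? X : X .- sum(X, dims=obsdim) ./ n
    S = rowvar ? Z * Z' : Z' * Z              # cross-products between variables
    return S ./ (corrected ? n - 1 : n)
end

X  = [1.0 2.0; 3.0 6.0; 5.0 7.0]              # 3 observations (rows), 2 variables (columns)
C  = cov_kw(X)                                # default: variables as columns, divide by N - 1
Ct = cov_kw(permutedims(X); rowvar=true)      # same data stored with observations as columns
Xc = X .- [3.0 5.0]                           # centered by hand (column means are 3 and 5)
Cz = cov_kw(Xc; zeromean=true)                # centering inside the function is skipped
Cu = cov_kw(X; corrected=false)               # divide by N instead of N - 1
```

In this sketch the first three results agree, and the uncorrected result differs only by the factor (N - 1) / N.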

I will add a weighted version in StatsBase.jl (which might rely on the WeightVec type there), just like the weighted mean.

I think all three of these keyword arguments are quite useful. The only issue might be the choice of names. I will open a PR for this soon, as some other packages are waiting on it.

StefanKarpinski · 2014-03-19T14:36:09Z

The names seem ok to me, although I might suggest rowvars, plural. Also, should it be mean=0 instead so that specifying a different mean can be done, or is that not common? That would also leave the issue of what default values should indicate that the inputs should be centered.

lindahua · 2014-03-19T14:38:49Z

We can add covm and corm (just like varm), which allow a mean vector to be provided. It is not easy to use a single keyword argument to do two things in a type-stable way: (1) indicate whether the data have been centered, and (2) specify a pre-computed mean.
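
A rough sketch of the covm idea, assuming current Julia (1.x) syntax: the pre-computed mean is a positional argument with a fixed kind of type, so it never has to double as a flag. The name covm_sketch and its exact signature are hypothetical and should not be read as the signature of the actual covm.

```julia
# Hypothetical covm-style function: variables as columns, observations as rows.
# mu holds one pre-computed mean per variable (one entry per column of X).
function covm_sketch(X::AbstractMatrix{<:Real}, mu::AbstractVector{<:Real};
                     corrected::Bool=true)
    n = size(X, 1)
    Z = X .- mu'                  # subtract the supplied mean from every row
    return (Z' * Z) ./ (corrected ? n - 1 : n)
end

X  = [1.0 2.0; 3.0 6.0; 5.0 7.0]
mu = vec(sum(X, dims=1)) ./ size(X, 1)   # column means, computed once by the caller
C  = covm_sketch(X, mu)
```

By contrast, a single keyword that could hold either a Boolean flag or a mean vector would have more than one possible type, which is the type-stability concern raised above.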

StefanKarpinski · 2014-03-19T14:44:16Z

That seems reasonable.

toivoh · 2014-03-19T15:34:16Z

Shouldn't we allow taking covariances along any axis, not just rows or columns?

StefanKarpinski · 2014-03-19T15:43:52Z

Yes, that's a very good point. Maybe we should call it vardim and default to 1.

lindahua · 2014-03-19T15:46:15Z

I am ok with the vardim idea.

Currently, cov only accepts matrix arguments. I have no immediate intention to extend that to accept arrays of arbitrary rank. However, naming the keyword argument vardim lets us extend the methods in the future without changing the interface. So I will implement this suggestion.

nalimilan · 2014-03-19T19:18:23Z

Makes sense. Though mean = 0 would really be the most elegant solution. Isn't there any chance that Julia will allow optimizing this in the future? Have you done some benchmarking? Putting the centering code in an if mean != 0 block could easily be optimized away if one day Julia were smart enough to detect that the code special-cases 0 and that the compiler should do the same.

lindahua · 2014-03-19T20:02:35Z

@nalimilan Julia does not dispatch on keyword arguments, which implies that it won't compile a separate method when the keyword arguments have a different type, so it can't do much optimization here. I don't think this behavior will change anytime soon.

We already have varm and stdm, hence I believe covm and corm are consistent with this.

toivoh · 2014-03-19T20:24:16Z

But Julia could inline a short method definition for the keywords case,
which could defer to standard multiple dispatch or similar.
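
The suggestion above could look roughly like the sketch below: a thin keyword front end forwards its mean keyword to positional methods, so that ordinary multiple dispatch picks the specialized code path. All names here (center, cov_mean) are hypothetical. The code assumes current Julia (1.x) syntax and makes no claim about how the compiler handled keyword arguments at the time of this discussion.

```julia
# Positional helpers selected by multiple dispatch on the type of the mean.
center(X::AbstractMatrix, ::Nothing) = X .- sum(X, dims=1) ./ size(X, 1)  # no mean given: estimate it
center(X::AbstractMatrix, mu::Real) = iszero(mu) ? X : X .- mu            # scalar mean; 0 skips centering
center(X::AbstractMatrix, mu::AbstractVector) = X .- mu'                  # one pre-computed mean per column

# Hypothetical keyword front end: short enough to be a candidate for inlining.
function cov_mean(X::AbstractMatrix{<:Real}; mean=nothing, corrected::Bool=true)
    Z = center(X, mean)           # dispatch happens here, on a positional argument
    n = size(X, 1)
    return (Z' * Z) ./ (corrected ? n - 1 : n)
end

X  = [1.0 2.0; 3.0 6.0; 5.0 7.0]
Ca = cov_mean(X)                          # mean estimated inside
Cb = cov_mean(X; mean=[3.0, 5.0])         # pre-computed mean supplied
Cc = cov_mean(X .- [3.0 5.0]; mean=0)     # data already centered
```

With this layout a single mean keyword covers both use cases discussed in the thread, at the cost of one extra layer of helper methods.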

nalimilan · 2014-03-21T13:14:38Z

@lindahua I guess it really depends on whether "not anytime soon" means "never" or "not in the next few months". APIs are going to stay for a long time (if not forever), so I'd say it is better to accept a slight performance loss while waiting for optional dispatch on keyword arguments, if a solution is planned in the long term.

I think this situation (where parameters can be set to arbitrary values, but some values are much more common than others and should thus be optimized) is going to arise with many other functions. So the present case is a good way of thinking about how to handle this as cleanly as possible in Julia. It would be awesome to avoid having to write separate functions (varm) for what's just a specialized compilation of the original function (var); to me this is the sign that the language/compiler should gain more features or become smarter.

lindahua · 2014-03-21T13:41:10Z

@nalimilan Regarding this problem, there was a long discussion in #2265. In the end, we decided to use varm and stdm. What I am going to do conforms to that. Perhaps we can open another issue to discuss whether to change varm, stdm, etc. to use a mean keyword argument.

However, I think it is important to have a consistent API (var, std, cov, cor and friends should handle arguments in a consistent way). If we decide to deprecate varm and stdm in favor of the mean argument, we can then make all the changes at the same time.

nalimilan · 2014-03-21T14:09:57Z

@lindahua Sure.

nalimilan · 2014-03-21T14:15:18Z

(Note that when #2265 was discussed, keyword arguments did not exist.)

lindahua · 2014-03-31T13:19:41Z

Closed by #6273.

lindahua added feature and removed feature labels Mar 19, 2014

lindahua mentioned this issue Mar 27, 2014

New implementation of cov/cor with extended API #6273

Merged

lindahua closed this as completed Mar 31, 2014
