More options for statistical functions #5249

Closed
lindahua opened this issue Dec 28, 2013 · 20 comments


@lindahua
Contributor

In practice, I find that the current interface of some statistical functions in Julia base is limited. It would be great to allow more options (preferably through keyword arguments).

Three options that I think are useful:

  • byrows: currently, cov(X, Y) treats each column as a variable and each row as an observation (sample). This is useful in many settings. However, the opposite layout is also very important: in many problems, each column is a sample and each row is a variable. In particular, Distributions.jl follows this convention and generates samples as columns.

The function cor also needs such an option.

  • allow users to choose between normalization by N-1 or by N. More generally, we can follow NumPy and introduce a ddof argument, normalizing the result by N - ddof. Of course, the default value of ddof should be 1.

This option also applies to var, std, and cor.

  • introduce covm and corm, which allow users to additionally supply a pre-computed mean.

Note: Numpy allows the first two options: http://docs.scipy.org/doc/numpy/reference/generated/numpy.cov.html
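As a rough sketch of the ddof idea, here is a variance normalized by N - ddof. The helper name var_ddof is hypothetical and not part of Base; it only illustrates how the default ddof = 1 reproduces the usual corrected estimate while ddof = 0 gives the uncorrected one.

```julia
using Statistics

# Hypothetical helper (not in Base): variance normalized by N - ddof.
function var_ddof(x::AbstractVector{<:Real}; ddof::Integer=1)
    n = length(x)
    n > ddof || throw(ArgumentError("need more than ddof observations"))
    m = mean(x)
    # Sum of squared deviations divided by the chosen normalizer.
    return sum(abs2, x .- m) / (n - ddof)
end

x = [1.0, 2.0, 3.0, 4.0]
v_corrected   = var_ddof(x)           # normalized by N - 1
v_uncorrected = var_ddof(x; ddof=0)   # normalized by N
```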

I can implement all these in a PR. Before I do this, I would like to see people's opinions.

@johnmyleswhite
Member

I wholeheartedly agree. I was considering making a PR for the byrows change myself yesterday.

@johnmyleswhite
Member

For reference, there have been a lot of debates about the normalization constant before.

@nalimilan
Member

Yeah, see #5023 for the cor() normalization issue.

@andreasnoack
Member

Just out of curiosity: What is the argument, besides habit, for observations as columns?

@lindahua
Contributor Author

It is a widely adopted convention in the machine learning literature to treat each feature vector as a column when working with a data set.

Many algorithms treat each feature vector (observation) as a whole and process them one by one. In such algorithms, arranging feature vectors (observations) as columns is more efficient (given that Julia is column-major).

@lindahua
Contributor Author

While writing some machine learning packages, I find myself needing an extended version of cov and cor that allows uncorrected estimation and treats observations as columns.

I now propose the following syntax:

cov(x; ...)
cov(x, y; ...)
cor(x; ...)
cor(x, y; ...)

They support three keyword arguments:

  • corrected (according to var(), sd() and cov() definitions #5023), default = true
  • rowvar: when this is set to true, variables are rows and observations are columns; otherwise, variables are columns and observations are rows (default = false, conforming to current behavior)
  • zeromean: default = false. When this is set to true, it indicates that the input samples have already been centered and thus have zero mean, so there is no need to center them within the function. (I have considered the name centered for this, but find it not as explicit.)

I will add a weighted version in StatsBase.jl (which might rely on the WeightVec type there), just like the weighted mean.

I think all three of these keyword arguments are quite useful. The only issue might be the choice of names. I will open a PR for this soon, as some other packages depend on it.
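To make the proposal concrete, the three keywords could behave roughly as follows. This is only a sketch under assumptions: cov_sketch is a hypothetical name, it handles a single real matrix, and it is not the implementation that ended up in Base.

```julia
using Statistics

# Hypothetical sketch of the proposed keywords; cov_sketch is an illustrative name.
function cov_sketch(X::AbstractMatrix{<:Real};
                    corrected::Bool=true, rowvar::Bool=false, zeromean::Bool=false)
    # Bring the data into the "observations in rows" layout.
    Z = rowvar ? permutedims(X) : X
    n = size(Z, 1)
    # Skip centering when the caller states the data already have zero mean.
    C = zeromean ? Z : Z .- mean(Z; dims=1)
    return (C' * C) / (corrected ? n - 1 : n)
end

X = [1.0 2.0; 3.0 5.0; 4.0 9.0]        # 3 observations (rows), 2 variables (columns)
Xc = X .- mean(X; dims=1)              # pre-centered copy of the same data

c_default  = cov_sketch(X)                              # current behavior
c_rowvar   = cov_sketch(permutedims(X); rowvar=true)    # observations as columns
c_zeromean = cov_sketch(Xc; zeromean=true)              # centering skipped
c_biased   = cov_sketch(X; corrected=false)             # normalized by N
```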

@lindahua lindahua added feature and removed feature labels Mar 19, 2014
@StefanKarpinski
Member

The names seem ok to me, although I might suggest rowvars, plural. Also, should it be mean=0 instead so that specifying a different mean can be done, or is that not common? That would also leave the issue of what default values should indicate that the inputs should be centered.

@lindahua
Contributor Author

We can add covm and corm (just like varm), which allow a mean vector to be provided. It is not easy to make a single keyword argument do two things in a type-stable way: (1) indicate whether the data have been centered, and (2) specify a pre-computed mean.
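A covm-style function might look like the sketch below. The name covm_sketch and its signature are assumptions made for illustration; the point is that the mean is a positional argument of a known type rather than a keyword that sometimes means "already centered" and sometimes carries a vector.

```julia
using Statistics

# Hypothetical covm-style helper: the caller passes the column means explicitly.
function covm_sketch(X::AbstractMatrix{<:Real}, mu::AbstractVector{<:Real};
                     corrected::Bool=true)
    size(X, 2) == length(mu) || throw(DimensionMismatch("one mean per column is required"))
    n = size(X, 1)
    C = X .- mu'            # subtract the supplied mean from every row
    return (C' * C) / (corrected ? n - 1 : n)
end

X = [1.0 2.0; 3.0 5.0; 4.0 9.0]
mu = vec(mean(X; dims=1))   # pre-computed once, reusable across calls
c = covm_sketch(X, mu)
```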

@StefanKarpinski
Member

That seems reasonable.

@toivoh
Contributor

toivoh commented Mar 19, 2014

Shouldn't we allow taking covariances along any axis, not just rows or columns?


@StefanKarpinski
Member

Yes, that's a very good point. Maybe we should call it vardim and default to 1.

@lindahua
Contributor Author

I am ok with the vardim idea.

Currently, cov only accepts matrix arguments. I have no immediate intention to extend it to accept arrays of arbitrary rank. However, naming the keyword argument vardim lets us extend the methods in the future without changing the interface. So I will implement this suggestion.
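A matrix-only sketch of the vardim keyword follows. It assumes that vardim = 1 reproduces the current default (each column is a variable, with observations running along dimension 1) and that vardim = 2 is the transposed layout; the function name cov_vardim is hypothetical.

```julia
using Statistics

# Hypothetical sketch: vardim selects the dimension along which observations run
# (assumed: 1 = current default with variables in columns, 2 = variables in rows).
function cov_vardim(X::AbstractMatrix{<:Real}; vardim::Integer=1)
    vardim in (1, 2) || throw(ArgumentError("only matrices are handled in this sketch"))
    n = size(X, vardim)
    C = X .- mean(X; dims=vardim)          # center along the chosen dimension
    S = vardim == 1 ? C' * C : C * C'      # always a variables-by-variables matrix
    return S / (n - 1)
end

X = [1.0 2.0; 3.0 5.0; 4.0 9.0]
c1 = cov_vardim(X)                          # variables in columns
c2 = cov_vardim(permutedims(X); vardim=2)   # variables in rows
```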

@nalimilan
Member

Makes sense, though mean = 0 would really be the most elegant solution. Is there any chance that Julia will be able to optimize this in the future? Have you done some benchmarking? Putting the centering code in an if mean != 0 block could easily be optimized away if one day Julia were smart enough to detect that the code special-cases 0, so that the compiler could do the same.

@lindahua
Contributor Author

@nalimilan Julia does not dispatch on keyword arguments, which means it won't compile another method when the keyword arguments have a different type, and therefore it can't do much optimization here. I don't think this behavior will change anytime soon.

We already have varm and stdm, hence I believe covm and corm are consistent with this.

@toivoh
Contributor

toivoh commented Mar 19, 2014

But Julia could inline a short method definition for the keywords case, which could defer to standard multiple dispatch or similar.
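The pattern could be sketched like this, with hypothetical names throughout (var_kw and _var are illustrative, not Base functions): a thin keyword method forwards the keyword as a positional argument, and ordinary multiple dispatch then selects the specialized method. Whether the compiler inlines the wrapper is not guaranteed by this sketch.

```julia
using Statistics

# Hypothetical names; this only illustrates the forwarding pattern.
# The keyword method is a thin wrapper that passes `mean` on positionally...
var_kw(x::AbstractVector{<:Real}; mean=nothing) = _var(x, mean)

# ...so ordinary multiple dispatch picks a method for each case.
_var(x, ::Nothing) = _var(x, Statistics.mean(x))          # no mean given: compute it
_var(x, m::Real)   = sum(abs2, x .- m) / (length(x) - 1)  # mean supplied by the caller

x = [1.0, 2.0, 3.0, 4.0]
v_auto  = var_kw(x)             # mean computed internally
v_given = var_kw(x; mean=2.5)   # pre-computed mean
v_zero  = var_kw(x; mean=0)     # data treated as already centered
```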

@nalimilan
Member

@lindahua I guess it really depends on whether "not anytime soon" means "never" or "not in the next few months". APIs are going to stay for a long time (if not forever), so I'd say it is better to accept a slight performance loss while waiting for optional dispatch on keyword arguments, if a solution is planned in the long term.

I think this situation (where parameters can be set to arbitrary values, but some values are much more common than others and should thus be optimized) is going to arise with many other functions. So the present case is a good way of thinking about how to handle this as cleanly as possible in Julia. It would be awesome to avoid having to write separate functions (varm) for what's just a specialized compilation of the original function (var); to me this is the sign that the language/compiler should gain more features or become smarter.

@lindahua
Contributor Author

@nalimilan Regarding this problem, there was a long discussion in #2265. In the end, we decided to use varm and stdm. What I am going to do conforms to that decision. Perhaps we can open another issue to discuss whether to change varm, stdm, etc. to use a mean keyword argument.

However, I think it is important to have a consistent API (var, std, cov, cor and friends should handle arguments in a consistent way). If we decide to deprecate varm and stdm in favor of the mean argument, we can then make all the changes at the same time.

@nalimilan
Member

@lindahua Sure.

@nalimilan
Member

(Note that when #2265 was discussed, keyword arguments did not exist.)

@lindahua
Contributor Author

Closed by #6273.
