More options for statistical functions #5249
Comments
I wholeheartedly agree. I was considering making a PR for the |
For reference, there have been a lot of debates about the normalization constant before. |
Yeah, see #5023 for the |
Just out of curiosity: What is the argument, besides habit, for observations as columns? |
It is a widely adopted convention in the machine learning literature to treat each feature vector as a column when working with a data set. Many algorithms consider each feature vector (observation) as a whole and process them one by one. In such algorithms, arranging feature vectors (observations) as columns is more efficient, given that Julia is column-major. |
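To make the column-major point concrete, here is a minimal sketch that processes observations one at a time when they are stored as columns. The helper name `sqnorms_by_column` is made up for illustration and is not part of Base or any package mentioned here.

```julia
# Hypothetical helper: squared Euclidean norm of each observation,
# assuming observations are stored as the columns of X.
function sqnorms_by_column(X::AbstractMatrix{<:Real})
    out = zeros(float(eltype(X)), size(X, 2))
    for j in axes(X, 2)
        # In a column-major Matrix, X[:, j] is one contiguous block of memory,
        # so this view is traversed with unit stride.
        out[j] = sum(abs2, view(X, :, j))
    end
    return out
end
```

With observations stored as rows instead, the same loop would walk `view(X, i, :)`, which strides across memory rather than reading a contiguous block.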
I find myself needing an extended version of
I now propose the following syntax:
cov(x; ...)
cov(x, y; ...)
cor(x; ...)
cor(x, y; ...)
They support three keyword arguments:
I will add a weighted version in StatsBase.jl (which might rely on the
I think all three of these keyword arguments are quite useful. The only issue might be the choice of names. I will add a PR for this soon, as some other packages are waiting on this. |
The names seem ok to me, although I might suggest |
We can do |
That seems reasonable. |
Shouldn't we allow taking covariances along any axis, not just rows or columns? |
Yes, that's a very good point. Maybe we should call it |
I am ok with the Currently, |
Makes sense. Though |
@nalimilan Julia does not dispatch on keyword arguments, which means it won't compile a separate method when the keyword arguments have a different type, and therefore can't do much optimization here. I don't think this behavior will change anytime soon. We already have |
But Julia could inline a short method definition for the keywords case, |
@lindahua I guess it really depends on whether "not anytime soon" means "never" or "not in the next few months". APIs are going to stay for a long time (if not forever), so I'd say it is better to accept a slight performance loss while waiting for optional dispatch on keyword arguments, if a solution is planned in the long term. I think this situation (where parameters can be set to arbitrary values, but some values are much more common than others and should thus be optimized) is going to arise with many other functions. So the present case is a good opportunity to think about how to handle this as cleanly as possible in Julia. It would be awesome to avoid having to write separate functions ( |
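One pattern that fits the concern above is a thin keyword front end that branches to positional kernels, so the common values each get their own specialized method. This is only a sketch under that assumption. All names below are hypothetical, and it is not the design this thread settled on.

```julia
# Hypothetical positional kernels, one per common setting.
_ssq_corrected(x)   = sum(abs2, x) / (length(x) - 1)
_ssq_uncorrected(x) = sum(abs2, x) / length(x)

# Thin keyword front end: the keyword value only selects which
# positional kernel runs; the kernels themselves are compiled as usual.
scaled_ssq(x; corrected::Bool=true) =
    corrected ? _ssq_corrected(x) : _ssq_uncorrected(x)
```

The public API stays a single function with a keyword, while the work happens in ordinary positional methods.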
@nalimilan Regarding this problem, there was a long discussion in #2265. Finally, we decided to use
However, I think it is important to have a consistent API ( |
@lindahua Sure. |
(Note that when #2265 was discussed, keyword arguments did not exist.) |
Closed by #6273. |
In practice, I find that the current interface of some statistical functions in Julia base is limited. It would be great to allow more options (preferably through keyword arguments).
Three options that I think are useful:

- `byrows`: currently, `cov(X, Y)` considers each column as a variable and each row as an observation (sample). This is useful in a lot of settings. However, the opposite case is also very important. In many problems, we consider each column as a sample and each row as a variable. In particular, this is the convention Distributions.jl follows when generating samples by columns. The function `cor` also needs such an option.
- Normalize by `N-1` or by `N`. Or more generally, we can learn from NumPy and introduce a `ddof` argument, normalizing the result by `N - ddof`. Of course, the default value of `ddof` should be 1. This option also applies to `var`, `std`, and `cor`.
- `covm` and `corm` that allow the user to additionally supply a pre-computed mean.

Note: NumPy allows the first two options: http://docs.scipy.org/doc/numpy/reference/generated/numpy.cov.html
I can implement all these in a PR. Before I do this, I would like to see people's opinions.
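As a rough illustration of how the three options could fit together, here is a minimal sketch. The function name `cov_sketch` and the keyword names `obs_in_columns` and `mu` are placeholders invented for this example, not the names proposed or adopted here; `ddof` follows the NumPy-style meaning described above.

```julia
using Statistics  # used only for `Statistics.mean`

# Hypothetical function and keyword names, for illustration only.
function cov_sketch(X::AbstractMatrix{<:Real};
                    obs_in_columns::Bool=false,  # layout option
                    ddof::Integer=1,             # normalize by N - ddof
                    mu=nothing)                  # optional pre-computed mean
    # Bring observations into rows so the rest handles a single layout.
    A = obs_in_columns ? permutedims(X) : X
    n = size(A, 1)
    # Use the supplied per-variable mean if given, otherwise compute it.
    m = mu === nothing ? Statistics.mean(A, dims=1) : reshape(mu, 1, :)
    Z = A .- m
    return (Z' * Z) ./ (n - ddof)
end
```

Calling `cov_sketch(X)` treats rows as observations and divides by `N - 1`; `ddof=0` divides by `N`, and `mu` skips recomputing the mean when it is already known.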