-
Notifications
You must be signed in to change notification settings - Fork 40
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Make covariance and correlation work for iterators, skipmissing in particular. #34
base: master
Are you sure you want to change the base?
Conversation
Project.toml
Outdated
@@ -1,5 +1,4 @@ | |||
name = "Statistics" | |||
uuid = "10745b16-79ce-11e8-11f9-7d13ad32a3b2" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Intentional?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. It's a hack to make sure julia knows to load this folder, it's described here for Pkg.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Normally the Travis script does that automatically, so you can revert this: https://github.com/JuliaLang/Statistics.jl/blob/master/.travis.yml#L24
Though you need it to run tests locally.
I have added the functionality we want and added tests. What's left, assuming what I've written is okay, is to disallow some things that only kind of work at the moment.
The above works, but we can't add any of the |
collects
everywhere
src/Statistics.jl
Outdated
|
||
Return the number one. | ||
""" | ||
cor(itr::Any) = one(real(eltype(collect(itr)))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Better first check whether Base.IteratorEltype(itr) isa Base.HasEltype && isconcrete(eltype(itr))
, and in that case avoid calling collect
.
Also remove the docstring for AbstractVector
below, which is just a special case of this one.
src/Statistics.jl
Outdated
@@ -630,7 +663,7 @@ function cov2cor!(C::AbstractMatrix, xsd::AbstractArray, ysd::AbstractArray) | |||
end | |||
|
|||
# corzm (non-exported, with centered data) | |||
|
|||
corzm(x::Any) = corzm(collect(x)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same remark here and for corm
as for cor
about using the eltype
when it's known.
src/Statistics.jl
Outdated
|
||
Compute the covariance between the iterators `x` and `y`. If `corrected` is `true` (the | ||
default), computes ``\\frac{1}{n-1}\\sum_{i=1}^n (x_i-\\bar x) (y_i-\\bar y)^*`` where | ||
``*`` denotes the complex conjugate and `n = length(collect(x)) = length(collect(y))`. If `corrected` is |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
``*`` denotes the complex conjugate and `n = length(collect(x)) = length(collect(y))`. If `corrected` is | |
``*`` denotes the complex conjugate and ``n`` the number of elements. If `corrected` is |
src/Statistics.jl
Outdated
|
||
Compute the variance of the iterator `itr`. If `corrected` is `true` (the default) then the sum | ||
is scaled with `n-1`, whereas the sum is scaled with `n` if `corrected` is `false` where | ||
`n = length(collect(itr))`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
`n = length(collect(itr))`. | |
``n`` is the number of elements. |
src/Statistics.jl
Outdated
""" | ||
function cov(itr::Any; corrected::Bool=true) | ||
x = collect(itr) | ||
covm(x, mean(x); corrected=corrected) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Better call covzm
directly to avoid an additional copy:
covm(x, mean(x); corrected=corrected) | |
covzm(map!(t -> t - xmean, x, x); corrected=corrected) |
Same for the two-argument method.
src/Statistics.jl
Outdated
@@ -518,16 +519,32 @@ end | |||
# covm (with provided mean) | |||
## Use map(t -> t - xmean, x) instead of x .- xmean to allow for Vector{Vector} | |||
## which can't be handled by broadcast | |||
covm(itr::Any, itrmean; corrected::Bool=true) = | |||
@show covm(collect(itr), itrmean; corrected=corrected) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@show covm(collect(itr), itrmean; corrected=corrected) | |
covm(collect(itr), itrmean; corrected=corrected) |
Project.toml
Outdated
@@ -1,5 +1,4 @@ | |||
name = "Statistics" | |||
uuid = "10745b16-79ce-11e8-11f9-7d13ad32a3b2" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Normally the Travis script does that automatically, so you can revert this: https://github.com/JuliaLang/Statistics.jl/blob/master/.travis.yml#L24
Though you need it to run tests locally.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
If these methods give correct results, then it's OK to allow it and only add keyword arguments later if needed. But they should be tested.
Can you also add tests for cov
and cor
?
src/Statistics.jl
Outdated
@@ -644,9 +671,10 @@ corzm(x::AbstractMatrix, y::AbstractMatrix, vardim::Int=1) = | |||
cov2cor!(unscaled_covzm(x, y, vardim), sqrt!(sum(abs2, x, dims=vardim)), sqrt!(sum(abs2, y, dims=vardim))) | |||
|
|||
# corm | |||
|
|||
corm(x::Any, xmean) = corm(collect(x), xmean) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also apply the eltype
check here.
Co-Authored-By: Milan Bouchet-Valat <[email protected]>
I have added many more tests. Everything is covered. The rules for
Are that the rows of I'm not sure how to best put that into a doscstring. Future PRs should
|
src/Statistics.jl
Outdated
@@ -630,7 +653,13 @@ function cov2cor!(C::AbstractMatrix, xsd::AbstractArray, ysd::AbstractArray) | |||
end | |||
|
|||
# corzm (non-exported, with centered data) | |||
|
|||
function corzm(itr::Any) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you put this code in an internal method which will be called by all functions that need it? It's repeated three times.
Also:
function corzm(itr::Any) | |
function corzm(itr::Any) |
Co-Authored-By: Milan Bouchet-Valat <[email protected]>
|
The methods mutate the result to subtract the mean, so calling
Yes that's something that bothered me too. Probably only vectors should be allowed to be mixed with other iterators. For other cases it's not clear what should happen so better throw an error for now. |
I do not think they always do. Have a look at |
Ah right. So we need |
I was under the impression that |
@bkamins you are right about non-allocations. But adding methods results in tons of method ambiguity errors. To resolve this without re-thinking the whole dispatch scheme, I implemented
just in places where we don't modify This result feels hacky, but it's better than method ambiguities. This is ready for review, by Milan and hopefully by Triage. |
|
||
corm(cx, mean(cx), cy, mean(cy)) | ||
end | ||
|
||
""" | ||
cor(x::AbstractVector, y::AbstractVector) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove this docstring which is a special case of the previous one.
src/Statistics.jl
Outdated
_lazycollect(x::Any) = collect(x) | ||
_lazycollect(x::AbstractVector) = x | ||
|
||
function _matrix_error(x, y, fun) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not just throw an error from _lazycollect
if passed a matrix? The error message will be less precise but that's not a big deal. And then you can also throw an error for any AbstractArray
that isn't an AbstractVector
, which is a case which isn't allowed currently and should probably remain an error.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes I would just add a special method to _lazycollect
for other types. Actually I would do it for AbstractArray
not only AbstractMatrix
. The only thing to think about if we want to allow 0-dimensional AbstractArrays
(they would produce NaN
anyway).
I think that collecting 2 or more dimensional arrays in places where we expect vectors is not useful (but we can discuss this).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It doesn't work because we don't use it all the time, for instance when we collect
and use map!
.
I can add a _collect_if_itr_or_vec
method for that scenario.
Co-authored-by: Milan Bouchet-Valat <[email protected]>
Yes, but these "many functions" have also a defined meaningful behaviour for Actually I would have preferred a completely different design for Now regarding |
Current implementation excludes
Yes. The original attempt at #30 was essentially this, working only with iterators. I wold prefer an implementation which doesn't know about matrix inputs and only cares about iterators. Taking advantage of BLAS etc. for
This PR is motivated by the new |
Ah - then I would also prefer iterators as you do 😄. Actually @nalimilan proposed And I fully agree that having Regarding |
Say
|
But the typical use case is the following setting:
and you want to calculate correlation matrix of the columns using "pairwise complete observations". |
There is currently no |
We discussed this a bit on triage and while we don't feel super qualified to comment, we felt that having independent iterators for the two arguments was likely problematic, because iterators don't in general have a strong guarantee over their ordering. E.g. |
@Keno - this is exactly what needs to be done in an ideal world in my opinion (and then you can have several rules what you "pass down" to the But doing it right would require a significant API change (i.e. be breaking) - so what should be a practical approach to this? Should a breaking PR be put on a table so that it can be judged for inclusion in 2.0 release. An alternative is to leave What would be the preference here? |
I'd say design the interface you want and then figure out whether it's possible to do most of it backwards compatible. If not, stdlibs can also go to 2.0 before base Julia (that's the whole reason why we introduced them). |
So let me put my proposal on a table (it is essentially what I think @pdeffebach had in mind in #30). I write it for
Here The second form is:
That does the same but between In this design we can treat everything inside To be less breaking we can keep methods:
(note that I have added Alternatively we could allow |
Thank you for your comments. Here are my thoughts: I think it's important to understand the purpose
Now they move onto the rest of their data, which now has So That said, if the researcher wrote their Therefore, I would want |
See previous discussion at JuliaStats/StatsBase.jl#343. I think there are legitimate use cases for the current design of Also, allowing to pass any iterator to Now, regarding the So an alternative I had in mind is to introduce Probably this shouldn't be discussed here... :-) BTW, another design challenge is to allow combining this with weights to allow computing weighted pairwise correlation while skipping missing values. Composability would be great to have in that case. |
@@ -504,6 +516,10 @@ function covzm(x::AbstractMatrix, vardim::Int=1; corrected::Bool=true) | |||
A .= A .* b | |||
return A | |||
end | |||
function covzm(x::Any, y::Any; corrected::Bool = true) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In what case covzm
can get Any
? It is an internal method and I thought it can only get already processed data.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The same question applies to covm
below.
""" | ||
function cov(itr::Any; corrected::Bool=true) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we want to allow 0 or more than 2 dimensional arrays here?
|
||
Return the number one. | ||
""" | ||
cor(itr::Any) = _return_one(itr) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we touch this part of code then I do not understand the following (this is in general a separate PR, but we implement this method here, so I would like to clarify what is intended):
julia> x = rand(10, 2)
10×2 Array{Float64,2}:
0.281236 0.0338547
0.691944 0.830649
0.627939 0.62187
0.251539 0.162161
0.649065 0.627302
0.67754 0.227709
0.904292 0.481443
0.768511 0.439196
0.56268 0.885131
0.520348 0.0185026
julia> y = collect(eachrow(x))
10-element Array{SubArray{Float64,1,Array{Float64,2},Tuple{Int64,Base.Slice{Base.OneTo{Int64}}},true},1}:
[0.28123641854321346, 0.03385469973171129]
[0.6919437304763723, 0.8306494675149141]
[0.6279387501997036, 0.6218700109315964]
[0.2515386883173856, 0.16216070470557398]
[0.6490648176650291, 0.6273015737594985]
[0.6775400549397448, 0.22770867432380815]
[0.9042917317032804, 0.4814426050177454]
[0.7685107010600403, 0.4391959677108144]
[0.562680390776801, 0.8851305746997942]
[0.5203479821836228, 0.018502575073925165]
julia> cov(x)
2×2 Array{Float64,2}:
0.0409994 0.0321085
0.0321085 0.0983311
julia> cov(y)
2×2 Array{Float64,2}:
0.0409994 0.0321085
0.0321085 0.0983311
julia> cor(x)
2×2 Array{Float64,2}:
1.0 0.505691
0.505691 1.0
julia> cor(y)
ERROR: MethodError: no method matching zero(::Type{SubArray{Float64,1,Array{Float64,2},Tuple{Int64,Base.Slice{Base.OneTo{Int64}}},true}})
and I do not understand why cov
and cor
behave differently in how they handle x
and y
.
OK - if we want to go forward with the API proposal in this PR I have left some comments related only to it. |
Ah - and a general question, since |
I just became aware of this method of """
cor(x::AbstractVector)
Return the number one.
""" This basically torpedoes any idea of having a itr = ((1, 4), (3, 2), (5, 8), (7, 6))
cor(itr) itr = [(1, 4), (3, 2), (5, 8), (7, 6)]
cor(itr) Given this unfortunate situation, perhaps triage would reconsider allowing |
So, maybe removing single-argument For functions where arguments are fundamentally coupled (like |
Currently
cov
andcor
fail with iterators that are not vectors, e.g.skipmissing
iterators or vectors withIterators.Filter
applied to them. This is part of the plan I have commented on at JuliaLang/julia#35050 (comment) to improve quality of life issues withmissings
.Thanks to
Missings.skipmissings
(JuliaData/Missings.jl#111), this allows computing the correlation without missing values viacor(skipmissings(x, y)...)
.Supersedes #30 because it is a more minimal implementation.