Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make dot handle missings #40769

Open
wants to merge 2 commits into
base: master
Choose a base branch
from
Open

Make dot handle missings #40769

wants to merge 2 commits into from

Conversation

dkarrasch
Copy link
Member

Fixes #40743.

@dkarrasch dkarrasch added the domain:linear algebra Linear algebra label May 10, 2021
Copy link
Member

@simeonschaub simeonschaub left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems uncontroversial enough to me.

@KristofferC
Copy link
Sponsor Member

The doc says that

missing values propagate automatically when passed to standard mathematical operators and functions

Extending this to also encompass the full LinearAlgebra standard library seems excessive. This would require adding a huge number of this special cased Missing methods. And if the goal is not the whole LinearAlgebra standard library, what is special about this function? What about ldiv, norm, det etc?

@Datseris
Copy link

When I read:

missing values propagate automatically when passed to standard mathematical operators and functions

then I expect that the dot of two vectors that have missing will be missing, as dot is the sequential application of +, * which are standard mathematical operators. Same must happen with norm, which is the sequential application of +, ^2, sqrt.

The norm throws off a completely different error by the way:

julia> using LinearAlgebra

julia> norm([1.0, missing, 2.0])
ERROR: TypeError: non-boolean (Missing) used in boolean context
Stacktrace:
 [1] generic_normInf(x::Vector{Union{Missing, Float64}})
   @ LinearAlgebra C:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.6\LinearAlgebra\src\generic.jl:472
 [2] normInf
   @ C:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.6\LinearAlgebra\src\generic.jl:556 [inlined]
 [3] generic_norm2(x::Vector{Union{Missing, Float64}})
   @ LinearAlgebra C:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.6\LinearAlgebra\src\generic.jl:497
 [4] norm2
   @ C:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.6\LinearAlgebra\src\generic.jl:558 [inlined]
 [5] norm(itr::Vector{Union{Missing, Float64}}, p::Int64)
   @ LinearAlgebra C:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.6\LinearAlgebra\src\generic.jl:627
 [6] norm(itr::Vector{Union{Missing, Float64}})
   @ LinearAlgebra C:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.6\LinearAlgebra\src\generic.jl:625

Perhaps LinearAlgebra as a whole wasn't designed with potential missing values in mind. I am not sure whether this is intentional or not, but other libraries use LinearAlgebra.dot to do things that do not necessarily count as "linear algebra operations" in the traditional sense.

e.g. I came across the dot error via StatsBase.mean. I think we can all agree that one expects to do a statistical analysis with missing values, even if a linear algebra analysis doesn't really make sense with missing values.

@pdeffebach
Copy link
Contributor

pdeffebach commented May 10, 2021

I think dot should handle missing values, because I was not aware it was recursive, I would imagine dot(x, y) to just call * and + a lot, so handling missing values would propagate normally. It's also a lot more common an operation than ldiv! or norm (though arguably norm should also be written in such a way that it "just works").

That said, skipmissings in Missings.jl can fix this.

julia> using StatsBase, Missings

julia> x = [5, 6, missing];

julia> w = StatsBase.weights([1, 1, 0]);

julia> LinearAlgebra.dot(skipmissings(x, w)...)
11

Downside is that it's going to be less fast than handling dot via dispatch. skipmissing is also intentionally marked as unstable in the Missings.jl 1.0 release.

I have a stagnant PR here which should provide a more general and performant solution. Implementation is a bit beyond me, so I'm looking for someone to take it over.

EDIT: My solution doesn't actually fix OP's problem. They want missings to propagate, whereas this proposal gets rid of them. I reiterate that because dot is conceptually a sequence of operations which all support propagation, dot should propagate missings.

@dkarrasch
Copy link
Member Author

BTW, we do have a definition of norm(::Missing) = missing already. It doesn't get called by the iterator-based methods because they throw for other reasons ("non-Boolean value used in a Boolean context"). I'm aware that adding more and more missing-methods is controversial for good reasons. But I believe that dot and norm are used as fast reduction methods, as @Datseris argued above also in non-linalg-contexts, and that supporting these doesn't imply at all that we should get all of LinearAlgebra missing-capable.

@dkarrasch
Copy link
Member Author

dkarrasch commented May 11, 2021

I'm gonna push the corresponding changes to make norm also work with missings, for discussion and benchmarking. Does anybody have good benchmarks for the generic dot and norm? I tried

using LinearAlgebra, BenchmarkTools
x = convert(Vector{Union{Missing,Float64}}, rand(10000))
@btime norm($x)

and seem to get massive speed-up on this branch?! (33.714 μs vs 2.089 ms (10004 allocations: 156.34 KiB))

EDIT: The speed-up seems to be real, and probably because of some type-instability that is "fixed" here.

@KristofferC
Copy link
Sponsor Member

KristofferC commented May 11, 2021

e.g. I came across the dot error via StatsBase.mean. I think we can all agree that one expects to do a statistical analysis with missing values, even if a linear algebra analysis doesn't really make sense with missing values.

That seems like an issue with StatsBase.mean then? Maybe open an issue there instead?

It's also a lot more common an operation than ldiv! or norm

The frequency by which a function is called surely cannot be used as a deciding factor for whether it should be Missing-special-cased? Either it makes sense, or it doesn't, no matter how often the function is called.

@dkarrasch dkarrasch changed the title Make dot handle missings Make dot and norm handle missings May 11, 2021
@Datseris
Copy link

That seems like an issue with StatsBase.mean then? Maybe open an issue there instead?

No, it's not,StatsBase uses dot as intended. To my understanding of what a "dot product" is, for two AbstractVectors it should boil down to dot = sum(x*y for (x, y) in zip(u, v)). Wikipedia's definition aligns with this idea.

Let's re-wind a bit. @KristofferC can you please state here your definition of a "dot product"? If we have different definitions in mind, then we can clarify that first. However, if we have the same definition in mind, then the situation is clear. StatsBase does what its supposed to, and a dot product of two vectors with missing values must return missing. Same with norm.

@pdeffebach
Copy link
Contributor

The frequency by which a function is called surely cannot be used as a deciding factor for whether it should be Missing-special-cased? Either it makes sense, or it doesn't, no matter how often the function is called.

It's important to note that dot already supported missing values conceptually. We didn't need to add new methods in this implementation because dot(x::Number, y::Number) = conj(x) * y would already support missing values if we widened the type signature, since both conj and * propagate missing.

As I said above, it's the recursiveness that's causing the problem, and this is a technical detail. If sum(x .* y) propagates missing, then so should dot.

@pdeffebach
Copy link
Contributor

@dkarrasch can you split the norm stuff into a separate PR? It's hard enough to get one missing feature added at a time. Maybe we can save the discussion on norm for another day?

@dkarrasch dkarrasch changed the title Make dot and norm handle missings Make dot handle missings May 11, 2021
@cfauchereau
Copy link
Contributor

I noticed that dot(1, [5]) currently return 5 but dot(missing, [5]) would still throw an error since dot(::Missing, ::AbstractArray) is not defined. So, should it be supported? We could define dot for Missing and AbstractArray but other types probably have the same issue. Widening the type signature like previously proposed may be cleaner to solve the issue.

@dkarrasch
Copy link
Member Author

It's a bit unclear. The difference is that both 1 and [5] have length 1, so there is a way to check whether this dot-product should work or not. But if we allow dot(missing, [5]) by types, that implies that dot(missing, [4, 5]) also work, which it wouldn't if you replaced missing by a number.

@stevengj
Copy link
Member

I think the rationale to support missing here is that dot and norm are "elementary" in that that they are essential to the definition of vector algebra (for Hilbert/Banach spaces), in a way that more complicated functions like qr and det are not.

@KristofferC
Copy link
Sponsor Member

KristofferC commented May 18, 2021

(just pointing out that I don't/didn't mean to block this PR but I do think the arguments for it have been kinda weak)

@simeonschaub simeonschaub added needs decision A decision on this change is needed status:triage This should be discussed on a triage call labels May 19, 2021
@vtjnash vtjnash added status:linalg triage and removed status:triage This should be discussed on a triage call labels May 27, 2021
@nalimilan
Copy link
Member

See #42812 for another case where this PR would have been useful.

@nalimilan nalimilan added the status:triage This should be discussed on a triage call label Oct 27, 2021
@andreasnoack
Copy link
Member

So the failure here is a consequence of #27401. @Datseris and @pdeffebach are expecting dot to behave as it did before that PR. In #35174 another issue with that change was brought up. The fix of that issue allows for

julia> x'w
missing

so I think it make more sense just to use that syntax in e.g. StatsBase instead of introducing special handling of missing in linear algebra functions. See also #27677 for a version of dot that would support missing right away.

@JeffBezanson
Copy link
Sponsor Member

From triage:

  • We'd rather have linalg triage make this decision, but perhaps that has already been tried and failed.
  • I don't like adding methods for missing, but I do see the argument that it's weird for a "written-out" dot product using sum to support missing, but dot does not.
  • An alternative is to have dot support missing elements, but not add a method to dot itself. For example instead of calling dot(x, y) recursively, call lift_missing(dot)(x, y). That way it will work without any new user-visible methods.

@LilithHafner LilithHafner removed the status:triage This should be discussed on a triage call label Nov 9, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
domain:linear algebra Linear algebra domain:missing data Base.missing and related functionality needs decision A decision on this change is needed status:linalg triage
Projects
None yet
Development

Successfully merging this pull request may close these issues.

LinearAlgebra.dot fails with Vector{Union{Float, Missing}}