Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NamedTuple covariance #32825

Closed
bramtayl opened this issue Aug 8, 2019 · 13 comments
Closed

NamedTuple covariance #32825

bramtayl opened this issue Aug 8, 2019 · 13 comments

Comments

@bramtayl
Copy link
Contributor

bramtayl commented Aug 8, 2019

I thought there was already an issue open for this but I couldn't find it. I don't know if people saw my JuliaCon talk but I implemented covariant namedtuples for LightQuery. If that would be useful, maybe we could get it in Base, and if its not useful, I'd like to understand why, and I can switch LightQuery over to using DataValues and invariant NamedTuples.

@JeffBezanson
Copy link
Member

See also #24614 😂

I think the main reason invariant named tuples are useful is that a named tuple with Union{Int,Missing} as a field type can be a concrete type with a known layout.

The main issue I see with covariance is that a tuple of Int or Missing must have either the type Tuple{Int} or Tuple{Missing}. If you have N elements there are 2^N possible types. Even if there is an enclosing container with element type Tuple{Union{Int,Missing},...} the types of the elements are still all taken from the set of 2^N. With invariance, every element can have the same type InvariantTuple{Union{Int,Missing}, ...}. As far as I recall that was our reasoning.

@quinnj
Copy link
Member

quinnj commented Aug 8, 2019

In addition to Jeff's great points, I've also been diving a lot into the remaining inconveniences of working w/ Union{T, Missing} and I think we have a pretty good handle on things we can do to make them better:

With #32448, NamedTuples (and all structs) w/ isbits Union fields will be stored inline in Arrays.

With #32699, we actually get into a scenario where you can do a lazy map (a la Query.EnumerableMap, or like this simple Tables.jl version that doesn't rely on inference at all), and the compiler is able to inline/infer everything all the way down, which == great performance and no extra allocations. Now, that does depend on who's collecting the lazy map, and this ideal scenario works for most cases: columnar outputs like DataFrame/IndexedTable/TypedTable, file-based outputs like CSV/Feather, and databases, like ODBC/MySQL/SQLite. The one case that is inconvenient is collecting into an Array of NamedTuples, since the resulting type would be Vector{NamedTuple{names, T} where T <: Tuple{Int, Union{Float64, Missing}}, which wouldn't have an inlined/concrete layout. Now, you can still get the nice case by manually specifying the output type of your NamedTuple mapping function (like return NamedTuple{names, Tuple{Int, Union{Float64, Missing}}}((1, missing)), and I think it'd be worth exploring some syntax ideas to make that simpler/nicer.

And honestly, by having a lazy map that can be inferred when collecting, we do away with the main reason I'm aware of for the DataValue type (since, on a technical basis, the same inference improvement would apply to inferring join key mapping functions, etc.). (the other differences I'm aware of with Missing vs. DataValue are orthogonal to inference issues, like 3-valued logic, etc.). That to me seems like a huge win!

At this point, I think there's strong reasoning and momentum to not further complicate the data ecosystem by hanging on to DataValues. There are so many great developments going on between machine learning frameworks, core statistics, applied math and optimization techniques, and they've all adopted and moved forward with the Missing approach. It would just be unfortunate if a subset of packages kept themselves isolated, needing to always wrap/unwrap themselves to go in/out w/ the rest of the ecosystem instead of "just working" out of the box.

I think sometimes we also get really caught up on temporary inconveniences and miss the forest for the trees; Jeff and I sat for maybe 20 minutes at the JuliaCon hackathon and were able to hammer out a bunch of concrete issues to improve various scenarios. I think it'd be much more productive for everyone to highlight the inconveniences and then come together to figure out solutions; having dug so deep into all of this over the last month or two, I really think it was the right choice to go w/ the Missing representation, and I don't think it fundamentally limits things in any way. Do we need to keep evolving APIs, improving inference, and make things easier/simpler? Definitely. But I, for one, think it would be a huge win to move everything off of DataValue, avoid the need to maintain all these extra data structures, and collectively collaborate on a more cohesive data ecosystem, even if that means dealing with some short-term inconveniences.

@bramtayl
Copy link
Contributor Author

bramtayl commented Aug 8, 2019

If you have N elements there are 2^N possible types

I came to the opposite conclusion by the same reasoning, e.g.

using LightQuery: @name
using InteractiveUtils: @code_warntype

unstable() = rand(Bool) ? 1 : 1.0

test1() = (a = unstable(), b = unstable(), c = unstable())
test2() = @name (a = unstable(), b = unstable(), c = unstable())

@code_warntype test1()
@code_warntype test2()

@bramtayl
Copy link
Contributor Author

bramtayl commented Aug 8, 2019

Although maybe #32699 would solve that?

@bramtayl
Copy link
Contributor Author

bramtayl commented Aug 8, 2019

Well anyways once #32699 comes out I'll try switching LightQuery over to regular NamedTuples and see what happens to performance

@JeffBezanson
Copy link
Member

If you have N elements there are 2^N possible types

I came to the opposite conclusion by the same reasoning, e.g.

The "2^N types" is not an inference issue; it's more fundamental. We have no way to efficiently store such a structure. We'd like to keep a bit for each element saying whether it's Int or Missing, but we can't because neither of the types Tuple{Int} or Tuple{Missing} provides for such a bit, and Tuple{Union{Int,Missing}} is abstract so it doesn't have a layout at all.

We could fix that by having the outer structure (an array of tuples) store all the tuples inline, with N bits next to each specifying which tuple type it has. But we don't have that implemented, and it doesn't really seem worth it --- we would still only infer an abstract type when you index the array, and code_warntype would still be red.

@bramtayl
Copy link
Contributor Author

bramtayl commented Aug 8, 2019

We have no way to efficiently store such a structure

That's not a problem for me; I store data column-wise in LightQuery. What is important is to get efficient iterators

@quinnj
Copy link
Member

quinnj commented Aug 8, 2019

We have no way to efficiently store such a structure

That's not a problem for me; I store data column-wise in LightQuery. What is important is to get efficient iterators

I really think #32699 is the answer here; it should give inference just enough information to realize it can avoid materializing the temporary NamedTuple all together and efficiently store the computed values into output columns directly.

@vtjnash
Copy link
Member

vtjnash commented Aug 8, 2019

We have no way to efficiently store such a structure

This might not have been the best choice of word. As Jeff said later, we know how to efficiently store this structure without too much difficulty, the problem is that we can't compute the value of the NamedTuple.

@JeffBezanson
Copy link
Member

That's not a problem for me; I store data column-wise in LightQuery. What is important is to get efficient iterators

Right; the problem can be avoided by never materializing the tuples. For example you can return a lazy Row{Tuple{Union{Int,Missing}}} type that just contains the index of a row --- but that type is invariant! So I still don't see what work the covariant named tuples are doing here; maybe spell it out more?

@bramtayl
Copy link
Contributor Author

bramtayl commented Aug 8, 2019

The covariant named tuples came out of me experimenting with a way to write named tuples before NamedTuples came about, and realizing that my version of named tuples didn't suffer from #32699 . I thought it was because of covariance but I guess its just a missing optimization?

@bramtayl
Copy link
Contributor Author

bramtayl commented Aug 8, 2019

Reopen if no I guess

@bramtayl bramtayl closed this as completed Aug 8, 2019
@o314
Copy link
Contributor

o314 commented Oct 31, 2019

Missing case apart,
The structdiff @ https://github.com/JuliaLang/julia/blob/v1.3.0-rc4/base/namedtuple.jl#L303 catch an instance or a type by following the generated function way. I wonder if there is a workaround playing with this to bring nametuple covariance.

My own case

@test NamedTuple{(:a,),Tuple{Int64}} <: NamedTuple{NS,T} where {NS,T<:Tuple}
@test_broken Type{NamedTuple{(:a,),Tuple{Int64}}} <: Type{NamedTuple{NS,T} where {NS,T<:Tuple}}

Edited: may be rather due to a Type covariance issue than a namedtuple one

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants