RFC/Julep: Introduce getproperty on Array for built-in data tables. #30646
A simple trick to make our built-in arrays behave as data tables. Automatically broadcasts `getproperty` over array `Array` (greedily via `map`).
This seems tantalizing and thought provoking and obviously incomplete :-) Some quick thoughts. We should have a good native interface for relational algebra, I agree 100% with that. If that ends up involving It seems to me that we'd want a view-based version of this for memory efficiency. Also, ideally, have setindex work, though the difficulties of #11902 seem the same as what we'd have here? Agreed that |
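A view-based variant could look roughly like the sketch below. The type `PropertyView` is hypothetical (it is not part of this PR, Base, or any package), it is read-only, and it assumes the parent array has a concrete element type whose property is a real field.

```julia
# Hypothetical lazy, read-only view of one property of every element.
struct PropertyView{T,N,A<:AbstractArray} <: AbstractArray{T,N}
    parent::A
    name::Symbol
end

# Infer the view's element type from the field type of the parent's eltype
# (assumption: concrete eltype, and `name` is an actual field of it).
function PropertyView(a::AbstractArray, name::Symbol)
    T = fieldtype(eltype(a), name)
    return PropertyView{T,ndims(a),typeof(a)}(a, name)
end

Base.size(v::PropertyView) = size(getfield(v, :parent))

# Each access reads the property from the parent element; nothing is copied.
Base.getindex(v::PropertyView{T,N}, I::Vararg{Int,N}) where {T,N} =
    getproperty(getfield(v, :parent)[I...], getfield(v, :name))

table = [(a=1, b=true), (a=2, b=false), (a=3, b=true)]
col = PropertyView(table, :a)   # wraps `table` without materializing a column
total = sum(col)                # reads table[i].a on the fly
```

Supporting `setindex!` on such a view is the harder part when the elements are immutable, because the whole element has to be rebuilt and written back.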
For me that is somewhat the crux here. It encourages people to know and care about the internals of data types, which up till now, for composability, we have encouraged them not to do. I am somewhat fine with that for the category of types that are records/rows, but already the Complex example makes me hesitant about this. The slightly crazy proposal would be to make |
I wanted to point out that all of this is already implemented in StructArrays:
julia> using StructArrays
julia> table = StructArray([(a=1, b=true), (a=2, b=false), (a=3, b=true)])
3-element StructArray{NamedTuple{(:a, :b),Tuple{Int64,Bool}},1,NamedTuple{(:a, :b),Tuple{Array{Int64,1},Array{Bool,1}}}}:
(a = 1, b = true)
(a = 2, b = false)
(a = 3, b = true)
julia> table.a
3-element Array{Int64,1}:
1
2
3
julia> table.b
3-element Array{Bool,1}:
true
false
true
julia> using Random  # provides rand!
julia> a = StructArray{Complex{Float64}}(undef, 2, 2);
julia> rand!(a)
2×2 StructArray{Complex{Float64},2,NamedTuple{(:re, :im),Tuple{Array{Float64,2},Array{Float64,2}}}}:
0.212628+0.705269im 0.396557+0.974033im
0.452444+0.0929322im 0.42598+0.739531im
julia> a.re
2×2 Array{Float64,2}:
0.212628 0.396557
0.452444 0.42598
with the extra advantage that storage is column-wise, so that The reason why using So a counter-proposal would be, rather than using
I hope the "counter-proposal" does not derail the discussion, I wrote it here as I think it's relevant in that in my view it would make this PR less necessary. |
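To make the column-wise storage point concrete without depending on StructArrays, here is a minimal sketch of the same layout using only Base. The helpers `columns` and `row` are hypothetical names chosen for illustration, not StructArrays API.

```julia
rows = [(a=1, b=true), (a=2, b=false), (a=3, b=true)]

# Hypothetical helper: convert a vector of named tuples (row storage)
# into a named tuple of vectors (column storage).
function columns(rows::AbstractVector{<:NamedTuple{names}}) where {names}
    return NamedTuple{names}(map(n -> [getproperty(r, n) for r in rows], names))
end

cols = columns(rows)

# Column access is now a plain field lookup that returns the stored vector.
col_a = cols.a

# Hypothetical helper: rebuild row `i` on demand from the columns.
row(cols::NamedTuple, i::Integer) = map(c -> c[i], cols)
second = row(cols, 2)
```

With this layout `cols.a` costs nothing per row, whereas extracting a column from `rows` has to visit every element.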
Well said, @c42f , I agree. I don't like this specific proposal, since it is basically a "vectorized" function of the sort we don't tend to have anymore. Now, I know the alternative
That would be totally fine. Just to state the obvious though, an array of NamedTuples should have the same interface. It would be nice to have something like StructArrays more integrated with Base functions, as you describe. |
Query.jl also uses this kind of construct for groups: if one groups a table, one gets an array of I think that it makes for a very neat interface in this (and other) table context. But, I'm not convinced that this should be made available in base for all arrays. The examples with e.g. complex numbers seem really cases that we don't want to think of as a table. I think I'm generally not convinced that we should generically treat |
I thought of this as quite different to the vectorized functions that we got rid of. To me, this is more in the spirit of We have overloaded Upon usage, it feels very natural to use the same function to extract a column of a table by its name as to extract a cell from a row by its name. I speculate that there is possibly some simple algebra of tables/columns/rows/cell values that you could write down in a formal sense. Just like
To me, this is key, to make it an interface different implementations can share. And yes, we can introduce more efficient implementations (along the lines of
Gosh, this really wasn't the intention... if you should not mess with the fields of one element, you definitely should not mess with the fields of all the elements! Again, we don't restrict |
The symmetry of My gut feeling so far about this proposal for |
I would like to address this - it is a very reasonable concern and gets to the crux of why I wanted to submit this PR and open a discussion. A little while after starting Julia, I was wondering: why do we pile all this linear algebra stuff on top of These days, I realize the joy and productivity of Julia has something to do with how rich the interface for arrays (and other basic types) is. You can use really simple syntax to define a container like an Now, to address being "specific to data analysis" - I actually feel this is a very strategic target for us as a technical language, equally important as linear algebra. Comparing against Python, for example, it's my opinion that Thus, I'm imagining a future where it's ridiculously easy to create, access and manipulate a table. As easy as linear algebra is now. Really, for the strategic direction I'd absolutely have to bow to Jeff, Stefan, Viral and so on, but to me it seems that it could potentially be worthwhile to have a "linear algebra" level of integration between arrays and data analysis. (Whether or not that is precisely what is suggested in this PR is a different story, of course, but I saw the |
Yes, exactly! What you've got here addresses (in prototype) the access case when you happen to have an array of named tuples. On the other hand, if we made it so that common constructions like comprehensions of named tuples could somehow return a But for comprehensions that would mean breaking compatibility in returning something which is not an I'm not sure whether it's been done to death in data circles, but is it also worth mentioning the availability of the |
Keep in mind that row-based storage of tables is still valid and entirely useful (and not to be treated as second-class), and should definitely follow the same interface as columnar storage, even if/when columnar storage makes it to IMO the idea of having a Like Jeff said, "Just to state the obvious though, an array of NamedTuples should have the same interface." If you look at Tables.jl however it seems that this interface currently is
I wonder if this would be suitable for the columnar-storage version of |
This is an interesting point, to find a common interface I think there are three options:
UPDATE: rewrote the example with a postfix operator as I think it works better with |
Just for the record: my current thinking about |
Another possible syntax would be something like I would tend to prefer something where the syntax indicates the type of operation, as opposed to |
Ooo! I really like that syntax. |
We could potentially make |
One thing to keep in mind is auto complete. It would be nice if this proposal could at some point lead to a good auto complete story in IDEs. I do think that means something like |
Maybe a good syntax is just something like |
On the other hand there's something incredibly appealing and natural about the symmetry of Anyway, if the
|
Or better, simply |
Perhaps using This line of thought leads to the following possible solution:
If we could solve the private vs public interface problem at the syntactic level, we could much more happily overload |
@c42f, the issue to me is not public vs. private, it is broadcasted vs. non-broadcasted. Even if |
Exactly - I actually expected that people would think it's crazy that all This PR should not change the interface you use on the elements (or what constitutes good practice for interacting with the elements from the perspective of software maintainability). |
Regarding
Haha this reminded me of a crazy experiment I did once. It's very interesting to me that making
(s::Symbol)(x) = getproperty(x, s)
This gives things like
map(:a, table) # extract column `a` from `table`
:a.(table) # like above, with broadcast
count(:a, table) # count how many `true`s in column `a` of `table`
filter(:a, table) # filter rows of table where row.a == true
# from SplitApplyCombine
group(:a, :b, table) # group table.b with grouping keys from table.a
innerjoin(:a, :b, table1, table2) # join tables on `table1.a == table2.b` |
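Making `Symbol` itself callable is type piracy, so the same experiment can be tried with a small wrapper instead. The sketch below uses a hypothetical functor `Prop` (not a Base or package type) and only the Base functions from the list above; `group` and `innerjoin` are left out because they come from SplitApplyCombine.

```julia
# Hypothetical callable wrapper around a property name.
struct Prop{name} end
Prop(name::Symbol) = Prop{name}()

# Calling the wrapper extracts the named property from its argument.
(::Prop{name})(x) where {name} = getproperty(x, name)

table = [(a=true, b=1), (a=false, b=2), (a=true, b=3)]

col_b    = map(Prop(:b), table)     # extract column `b`
col_b2   = Prop(:b).(table)         # the same, via broadcast
n_true   = count(Prop(:a), table)   # how many rows have `a == true`
selected = filter(Prop(:a), table)  # keep rows where `a` is true
```

Because the property name is a type parameter, the compiler can specialize each call on the property being read.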
Callable symbols are an idiom in Clojure (where they're called keywords):
(def population {:zombies 2700, :humans 9})
(:zombies population)
;=> 2700 |
Of course not. But looking at the reasons for not using |
Not entirely --- the data ecosystem already uses |
The ability to fuse with dot calls or transform to a view with |
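As a small illustration of the fusion point, the existing dot-call spelling `getproperty.(table, :name)` already takes part in broadcast fusion. The table contents below are made up for the example.

```julia
table = [(average=2.0, SD=0.5), (average=4.0, SD=1.0), (average=0.5, SD=0.1)]

# All three dot calls fuse into a single loop, so neither the `SD` column
# nor the `average` column is materialized as a temporary array.
cv = getproperty.(table, :SD) ./ getproperty.(table, :average)

# Boolean indexing works too, although here the mask itself is allocated.
large = table[getproperty.(table, :average) .> 1]
```

`Symbol` arguments are treated as scalars by broadcast, which is why `:SD` does not need to be wrapped in `Ref`.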
Very true, which is why I suggested a function called |
As a variation to @stevengj's idea and sticking with that what we need is a broadcasted version of
table :: Vector{NamedTuple{(:average, :SD),Tuple{Float64,Float64}}}
table.(.average) # get :average column
table.(.SD ./ .average) # compute CV (w/o materializing :average and :SD columns)
table[.average .> 1] # Boolean indexing (w/o materializing the Boolean vector)
(They throw a parse error ATM.)
I suggested (I expect there are many pandas users who don't like prefixing every column name with dataframe name. I guess it would be appealing for such users.) |
@tkf Now we're talking! This syntax seems tricky and "inside out" compared to the usual broadcasting of a function over a collection. But it's incredibly appealing if we can have a broadcast-driven syntax which scopes the fields to their tabular context. In particular, can this or something similar express all the unary tabular operations of the relational algebra? It's kind of close. |
I'm a noob when it comes to the relational algebra, but reading Wikipedia pages, how about
Projection:
unique((row -> (Age = row.Age, Weight = row.Weight)).(table))
Selection:
table[(row -> row.Age == row.Weight).(table)]
Rename... seems to require a function to do it but can already be done? I imagine you can do it by using a function of the form
Yeah, to be honest, I thought this was a bit odd syntax too. So I wondered if it was possible to "derive" this syntax from smaller sub-rules. Here is something I came up with:
From those rules, I think we have
This lets us do
Note that
y = x(.a = value)
to
l = .a
s = Setter(l, value)
y = x(s)
which is defined to be equivalent to
l = .a
y = set(x, l, value)
i.e., replace the field This can be handy for updating columns like Some more thoughts:
|
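The `set`/`Setter` semantics described in the comment above can be approximated in today's Julia. This is only a sketch under assumptions: `PropLens` is a hypothetical stand-in for the proposed `.a` syntax, `set` and `Setter` borrow the names from the comment, and the setter is applied as `s(x)` rather than `x(s)`, since making named tuples callable would require type piracy.

```julia
# Hypothetical stand-in for the proposed `.a` lens syntax.
struct PropLens{name} end

# Return a copy of `x` with the named field replaced; `x` is not mutated.
set(x::NamedTuple, ::PropLens{name}, value) where {name} =
    merge(x, NamedTuple{(name,)}((value,)))

# Pairs a lens with the value to write through it.
struct Setter{L,V}
    lens::L
    value::V
end
(s::Setter)(x) = set(x, s.lens, s.value)

x = (a=1, b=2)
l = PropLens{:a}()
y = Setter(l, 10)(x)            # same result as set(x, l, 10)

# Broadcasting the setter replaces a whole "column" of a row table.
table = [(a=1, b=2), (a=3, b=4)]
zeroed = Setter(l, 0).(table)
```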
Split-apply-combine already has |
Off-topic: I really like the semantics of invert. |
This is a bit of a fun one, while the v1.2 development cycle is young. I'm not sure if this approach will receive universal support, but discussion of this may be helpful, so here goes.
Julia is an awesome language for maths with arrays (linear algebra) and for a while now I've been wondering if we can make it equally ergonomic and awesome for other typical data operations with array-like data structures, such as tables. For example, relations are collections of (named) tuples and `Array{<:NamedTuple}` seems like a relatively natural candidate to behave as a relation. In fact, the community (e.g. Tables.jl) generally seems to be trending towards "table rows are things that we can do `getproperty` on to get cell values" and "table columns iterate cell values", and from tables themselves we might want to extract rows or columns via iteration or `getproperty`, respectively.
This PR uses a simple trick to make our built-in arrays behave more friendly as data tables. It automatically broadcasts `getproperty` over array `Array` (greedily via `map`, in this basic implementation, though we theoretically can do a lazy approach and support `setproperty!` and so-on). Here's a simple example:
And finally, it's notable that this is useful for general data manipulation and broader contexts. It works on structs and so-on, not just `NamedTuple` elements. For example, for doing complex linear algebra this is relatively nice:
Looking forward speculatively, I wonder if we could make this a part of the `AbstractArray` spec for Julia 2.0 (obviously a breaking change) where library writers would use `getfield` and helper functions to manipulate custom array internals, and mostly external users don't directly access the fields/properties of arrays anyway.
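The behaviour described here can be tried out without changing Base. The sketch below uses a hypothetical helper `column` in place of overloading `getproperty` on `Array`; it mirrors the greedy `map`-based approach and is not the PR's actual implementation.

```julia
# Hypothetical helper: eagerly map `getproperty` over every element,
# roughly what `array.name` would do under this proposal.
column(a::AbstractArray, name::Symbol) = map(x -> getproperty(x, name), a)

# An array of named tuples behaving as a table.
table = [(a=1, b=true), (a=2, b=false), (a=3, b=true)]
as = column(table, :a)
bs = column(table, :b)

# The same helper works on struct elements, e.g. complex numbers.
z = [1.0 + 2.0im, 3.0 - 4.0im]
re_part = column(z, :re)
im_part = column(z, :im)
```

Each call allocates a new array, which is the cost a lazy variant would avoid.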