Don't make join output DataValueArray unless there are NAs #121

shashi · 2018-02-15T11:45:40Z

No description provided.

piever · 2018-02-15T12:00:13Z

I used the same strategy in unstack (assume DataValue even if all values are present), following the join example. I should probably change it there as well.

shashi · 2018-02-15T12:39:34Z

This PR turned out to be unrelated to the _promote_op removal, though. What it does is avoid creating a DataValueArray up front: it records the indices where a region (left or right) of the output is null, and only creates the DataValueArray at the end.
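A minimal sketch of that idea, for illustration only and not the code in this PR: the helper name gather_right is hypothetical, Julia 1.x is assumed, and Union{T,Missing} from Base stands in for DataValueArray so the example has no dependencies.

```julia
# Hypothetical helper: gather a right-hand column for a left-join-style output.
# `idxs[i]` is the matching row in `col`, or 0 when output row i has no match.
function gather_right(col::Vector{T}, idxs::Vector{Int}) where {T}
    out = Vector{T}(undef, length(idxs))   # plain array, no speculative widening
    nullrows = Int[]                       # output rows that have no match
    for (i, j) in enumerate(idxs)
        if j == 0
            push!(nullrows, i)
        else
            out[i] = col[j]
        end
    end
    isempty(nullrows) && return out        # no nulls: keep the concrete eltype

    # Nulls exist: widen once, at the end (Missing stands in for DataValue here).
    isnull = falses(length(out))
    isnull[nullrows] .= true
    wide = Vector{Union{T,Missing}}(undef, length(out))
    for i in eachindex(out)
        wide[i] = isnull[i] ? missing : out[i]
    end
    return wide
end
```

With idxs = [3, 1, 2] the result stays a Vector{Int}; with idxs = [3, 0, 2] it becomes a Vector{Union{Int,Missing}} whose second entry is missing.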

Getting rid of _promote_op considerably increases the complexity of _join!. It gets even harder once we have missing (i.e. the type may change from the first element to the second), in which case you have to use something like try; push!(..); catch err @goto retry end to handle this meaningfully, and that is a performance disaster.

piever · 2018-02-24T03:21:47Z

We're not the only ones encountering this issue: there is a closely related issue, #25925 in Julia Base, by @nalimilan. The issue is explained very clearly there, but I'm not 100% sure what the proposed solutions are.

From what I understood, the general plan seems to be to improve inference so that it can figure out which fields of the output named tuples could be missing. If this succeeds, preallocate correctly. If it fails (you get type Any or something unusable), start with some guess and widen as needed.
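The "start with some guess and widen as needed" fallback could look roughly like the sketch below. This is only an illustration under assumptions: collect_widening and push_widen are hypothetical names, not functions from Base or IndexedTables, and Julia 1.x is assumed. It checks the type before pushing rather than catching an exception from push!.

```julia
# Hypothetical helper: append `y`, widening the array's eltype if `y` does not fit.
function push_widen(out::Vector{T}, y) where {T}
    if y isa T
        push!(out, y)
        return out
    end
    # Guess was too narrow: allocate a wider array and copy what we have so far.
    wide = Vector{Union{T,typeof(y)}}(undef, length(out))
    copyto!(wide, out)
    push!(wide, y)
    return wide
end

# Hypothetical helper: collect f(x) for each x, guessing the eltype from the
# first result and widening only when a later result has a different type.
function collect_widening(f, xs)
    isempty(xs) && return Any[]            # nothing to guess from
    y1 = f(first(xs))
    out = Vector{typeof(y1)}()
    push!(out, y1)
    for x in Iterators.drop(xs, 1)
        out = push_widen(out, f(x))
    end
    return out
end
```

For example, collect_widening(x -> x > 2 ? missing : x, [1, 2, 3]) starts as a Vector{Int} and is widened to Vector{Union{Int,Missing}} only when the third element arrives.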

nalimilan · 2018-03-17T15:27:41Z

Not really knowing what this PR does, I don't think it's related. Inference isn't needed for joins, as the type of the data is perfectly known: it's that of the input (without any function in between). What we do in DataFrames is that columns have a Union{T, Missing} eltype if one of the input columns allowed for missing values, or when some values have to be filled with missing values for joins other than inner (JuliaData/DataFrames.jl#1316).
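One reading of that rule, as a sketch rather than the DataFrames implementation: the function joined_eltype and its side and kind arguments are hypothetical, and the sketch widens whenever the join kind could leave rows unmatched, without checking whether any actually are.

```julia
# Hypothetical helper. `T` is the input column's eltype, `side` (:left or :right)
# is the table the column comes from, `kind` is :inner, :left, :right or :outer.
function joined_eltype(::Type{T}, side::Symbol, kind::Symbol) where {T}
    # A column may need filling when rows of the *other* table can go unmatched.
    may_fill = kind == :outer ||
               (kind == :left && side == :right) ||
               (kind == :right && side == :left)
    # If T already allows missing, Union{T,Missing} is just T again.
    return may_fill ? Union{T,Missing} : T
end
```

Here joined_eltype(Int, :right, :left) gives Union{Int,Missing}, while joined_eltype(Int, :left, :inner) stays Int. No inference is involved, because the answer depends only on the input eltype and the join kind.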

piever · 2018-03-17T18:39:25Z

The main difference is that join in IndexedTables can accept an arbitrary joining function:

join(f, x, y)

where f defaults to the concatenation function (joining the two data rows), but it could be an arbitrary function taking two Tuples as input and returning a Tuple.
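To illustrate why an arbitrary f makes the output type hard to know in advance, here is a toy inner join over sorted, unique key vectors. It is a sketch under assumptions, not the IndexedTables implementation: toy_join and concat are hypothetical names, and rows are plain Tuples.

```julia
# Toy inner join on sorted, unique keys; applies f to each pair of matching rows.
function toy_join(f, lkeys, lrows, rkeys, rrows)
    out = []                               # eltype is unknown until f has run
    i, j = 1, 1
    while i <= length(lkeys) && j <= length(rkeys)
        if lkeys[i] == rkeys[j]
            push!(out, f(lrows[i], rrows[j]))
            i += 1
            j += 1
        elseif lkeys[i] < rkeys[j]
            i += 1
        else
            j += 1
        end
    end
    return map(identity, out)              # narrow the eltype to what was observed
end

# Stand-in for the default f: concatenate the two data rows.
concat(a::Tuple, b::Tuple) = (a..., b...)
```

With concat, the output row type follows directly from the input row types, which is the case described for DataFrames. With something like (a, b) -> (a[1] + 0.5,), the output row type depends on what f returns, so it has to come from inference or from the values themselves.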

nalimilan · 2018-03-17T20:12:56Z

I see, thanks. Then indeed the problem is very similar to #25925.

Shashi Gowda added 2 commits February 15, 2018 15:59

don't speculatively create DataValueArray in joins. Do so only if the…

a9ccb1b

…re are going to be nulls

fix bug when using custom function

851207e

shashi merged commit a86c7fc into master Feb 15, 2018

shashi deleted the s/no-na-join branch February 15, 2018 12:49
