Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Don't make join output DataValueArray unless there are NAs #121

Merged
merged 2 commits into from
Feb 15, 2018

Conversation

shashi
Copy link
Collaborator

@shashi shashi commented Feb 15, 2018

No description provided.

@piever
Copy link
Collaborator

piever commented Feb 15, 2018

I used the same strategy in unstack (assume DataValue even if all values are there) following join example: I should probably change it there as well.

@shashi
Copy link
Collaborator Author

shashi commented Feb 15, 2018

This PR turned out to be unrelated to _promote_op removal though... What it does is it never creates DataValueArray but instead maintains indices where a region (left or right) of the output is null, then creates DataValueArray at the end.

Getting rid of _promote_op considerably increases the complexity of the _join!... Especially harder once we have missing (i.e. the type may change from the first element to the second) in which case you have to use something like try; push!(..); catch err @goto retry end to meaningfully handle this -- this is a performance disaster.

@shashi shashi merged commit a86c7fc into master Feb 15, 2018
@shashi shashi deleted the s/no-na-join branch February 15, 2018 12:49
@piever
Copy link
Collaborator

piever commented Feb 24, 2018

We're not the only ones encountering this issue: there is a closely related issue #25925 in Julia Base by @nalimilan: the issue is explained very clearly there but I'm not 100% sure what are the ideas to solve it.

From what I understood the general plan seems to improve inference so that it can figure out which field of the output named tuples could be missing: if this succeeds, preallocate correctly. If this fails (you get type ANY or something unusable), start with some guess and widen as needed.

@nalimilan
Copy link
Member

Not really knowing what this PR does, I don't think it's related. Inference isn't needed for joins, as the type of the data is perfectly known: it's that of the input (without any function in-between). What we do in DataFrames is that columns have Union{T, Missing} eltype if one of the input columns allowed for missing values, or when some values have to be filled with missing values for joints other than inner (JuliaData/DataFrames.jl#1316).

@piever
Copy link
Collaborator

piever commented Mar 17, 2018

The main difference is that join in IndexedTables can accept an arbitrary joining function:

join(f, x, y)

where f defaults to be the concatenation function (joining the two data rows) but it could be an arbitrary function taking two Tuples as input and returning a Tuple.

@nalimilan
Copy link
Member

I see, thanks. Then indeed the problem is very similar to #25925.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants