Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NamedTuples with lots of missing data #36712

Closed
bramtayl opened this issue Jul 17, 2020 · 8 comments
Closed

NamedTuples with lots of missing data #36712

bramtayl opened this issue Jul 17, 2020 · 8 comments

Comments

@bramtayl
Copy link
Contributor

bramtayl commented Jul 17, 2020

Given that rows in tables are essentially named tuples of a bunch of values, many or most of which are of type Union{Missing, T}, I think it would be good to prioritize dealing with these kind of types in a reasonable way if Julia is aiming to be used for data science. I've been waiting a while for #32699 but it doesn't seemed to have helped much. For example

function test()
    data = (a = if rand(Bool)
        1
    else
        "1"
    end,)
    merge(data, (b = data.a,))
end
@code_warntype test() # NamedTuple{(:a, :b),_A} where _A<:Tuple

Also, #31909 means that you can't use union missing data as captures. Given that these types are kind of messy, it would be understandable if Base cannot support them, but if so, please make an announcement so that the data ecosystem can fully shift to alternatives such as DataValues. Apologies if there is an issue for this open already.

@KristofferC
Copy link
Member

KristofferC commented Jul 17, 2020

I think it would be good to prioritize dealing with these kind of types in a reasonable way if Julia is aiming to be used for data science

if Base cannot support them, but if so, please make an announcement so that the data ecosystem can fully shift to alternatives such as DataValues.

It's very unclear what this issue is about with all these fuzzy statements and claims that Julia should somehow announce that you shouldn't use unions in data science or something. Please reopen an issue written in such a way that it is clear what the issue is, (what code ran, what was the result, what was the expected result), leaving out irrelevant side comments and ultimatums.

@bramtayl
Copy link
Contributor Author

Which comments are irrelevant? I have some code and the result above???

@KristofferC
Copy link
Member

I quoted them.

Again, feel free to open a new issue where you leave out that stuff. A good format is:

Title: Imprecise inference when merging a union value into a NamedTuple of unions
Running this code:

code

the returned value from inference is ....

I expect it to be ....

This would be useful for ...

@bramtayl
Copy link
Contributor Author

Ok, well, I still don't see why those comments aren't relevant, but see #36713

@quinnj
Copy link
Member

quinnj commented Jul 17, 2020

I agree this issue lacks clarity:

  • "NamedTuples with lots of missing data" isn't a very clear issue title; it's just a statement; is there something wrong? is something not inferring? Are you referring to NamedTuples with missing values or potentially missing types like Union{T, Missing}?
  • "please make an announcement so that the data ecosystem can fully shift to alternatives such as DataValues"; the tone of this statement feels inappropriate, like you're giving an ultimatum to core developers or something. Is there a consensus among data ecosystem developers that this single issue (whatever it is?) is a make or break issue for using NamedTuples as rows? Or using missing to represent missing data? I bring this up because as a data ecosystem developer, I'm not aware of the direness of this issue or how it threatens to change everything.

As for the actual issue, it seems like the request is to just extend some of the inference improvements from #32699 to apply in more cases? These kinds of issues are definitely tricky to spell out exactly; #32699 originally came out at last year's JuliaCon when @JeffBezanson and I sat down for a half hour and walked through some code; it was only then that we were both able to fully communicate and understand what the issue was and what we could reasonably do in Base. I'm not saying that you need a private meeting with Jeff to get anything done, but I'm just trying to emphasize that #32699 was really hard for me to explain or write up in an issue, because there are lots of steps and factors that come into play. I think it'd be helpful in this case to provide as much context as possible: what is the code? what isn't inferring like you'd expect? how is the specific uninferrable case affecting larger code flow?

@bramtayl
Copy link
Contributor Author

If you would like context, you can see the benchmarks.jl file of LightQuery on master. This code:

using LightQuery: @>, @_, By, Group, over, make_columns, @name_str, Rows, value

using CSV: File
using DataFrames: DataFrame, groupby, combine
using Tables: columntable

cd("/home/brandon/benchmark")
download("http://rapidsai-data.s3-website.us-east-2.amazonaws.com/notebook-mortgage-data/mortgage_2000.tgz", "mortgage_2000.tgz")
run(`tar --gzip --extract --file=mortgage_2000.tgz`)
cd("perf")

file = File(
    "Performance_2000Q1.txt",
    delim = '|',
    header = Symbol.(string.("Column", 1:31)),
    missingstrings = ["NULL", ""],
    dateformat = "mm/dd/yyyy",
    truestrings = ["Y"],
    falsestrings = ["N"],
)

# just the type stable ones
columns = (name"Column1", name"Column2", name"Column4", name"Column6", name"Column10", name"Column11", name"Column12")(columntable(file))

function process_with_lightquery(columns)
    @> Rows(; columns...) |>
    Group(By(_, name"Column1")) |>
    over(_, @_ (Count = length(value(_)),)) |>
    make_columns
end

@time process_with_lightquery(columns)

This code runs in 0.4 seconds (beating DataFrames). As as I add column 3, which has an eltype of Union{Missing, T}, the code runs...indefinitely? Longer than 2 minutes

@StefanKarpinski
Copy link
Member

Ok, now that's a concrete, actionable issue. In the future, please start with that.

@bramtayl
Copy link
Contributor Author

Ok, will do

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants