[BUG] Allow nulls in partition column #2344

Merged 3 commits on Jun 6, 2024

@colin-ho colin-ho commented Jun 6, 2024

Closes #2292

Currently, Daft panics if a partition column contains nulls; the detailed error message can be found in the linked issue.

A simple reproduction:

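from deltalake import write_deltalake
import pandas as pd
import daft

# One row has a null value in the partition column "group".
df = pd.DataFrame(
    {
        "group": [1, 2, 3, None],
        "num": list(range(4)),
    }
)
# Write a Delta Lake table to the local path "z", partitioned by "group".
write_deltalake("z", df, partition_by="group", mode="overwrite")

# Reading the table back triggers the panic before this PR.
df = daft.read_deltalake("z")
df.show()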

This PR modifies the partition spec equality logic and partition pruning semantics to allow reading nulls in partition columns.
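
To illustrate what pruning with a null partition value can look like, here is a minimal Python sketch. It is not Daft's implementation (the actual change is in the Rust sources); the helper names `keep_for_equals`, `keep_for_is_null`, and `prune`, the file names, and the SQL-style filter semantics are all assumptions made for illustration.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical partition listing: (partition value of "group", data file).
# The file names are made up for this sketch.
partitions: List[Tuple[Optional[int], str]] = [
    (1, "group_1.parquet"),
    (2, "group_2.parquet"),
    (3, "group_3.parquet"),
    (None, "group_null.parquet"),
]


def keep_for_equals(partition_value: Optional[int], literal: int) -> bool:
    # Assumed SQL-style semantics: `group == literal` is never true for a
    # null value, so a null partition can be skipped for this filter.
    return partition_value is not None and partition_value == literal


def keep_for_is_null(partition_value: Optional[int]) -> bool:
    # `group IS NULL` keeps only the null partition.
    return partition_value is None


def prune(
    parts: List[Tuple[Optional[int], str]],
    keep: Callable[[Optional[int]], bool],
) -> List[str]:
    # Return the files whose partition value may satisfy the filter.
    return [path for value, path in parts if keep(value)]


files_for_two = prune(partitions, lambda v: keep_for_equals(v, 2))
files_for_null = prune(partitions, keep_for_is_null)
files_unfiltered = prune(partitions, lambda v: True)
```

The point of the sketch is that a null partition value is an ordinary case to handle during pruning rather than an error: without a filter, all four files (including the null partition) are read.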

@github-actions bot added the bug label Jun 6, 2024

codecov bot commented Jun 6, 2024

Codecov Report

Attention: Patch coverage is 81.81818% with 2 lines in your changes missing coverage. Please review.

Please upload report for BASE (main@408f977). Learn more about missing BASE report.

Additional details and impacted files

@@           Coverage Diff           @@
##             main    #2344   +/-   ##
=======================================
  Coverage        ?   79.00%           
=======================================
  Files           ?      475           
  Lines           ?    55264           
  Branches        ?        0           
=======================================
  Hits            ?    43661           
  Misses          ?    11603           
  Partials        ?        0           
Files                                   Coverage Δ
src/daft-scan/src/python.rs             67.48% <100.00%> (ø)
src/daft-stats/src/partition_spec.rs    78.94% <77.77%> (ø)

return false;
}
} else {
let both_null = self_column

Member commented:

So for partition spec, we want the behavior that null == null? If so, we should document that here.

if !value_eq {
return false;
}
} else {

Member commented:

A simpler way to represent this could be:

else if self_is_null xor other_is_null {
  return false;
}
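
The equivalence behind this suggestion can be checked with a small Python sketch. This is a hedged illustration only: the function names `column_eq_branches` and `column_eq_xor` are hypothetical, and the code models a single-value comparison where null == null counts as equal, rather than reproducing the Rust code under review.

```python
from itertools import product
from typing import Optional


def column_eq_branches(a: Optional[int], b: Optional[int]) -> bool:
    # Explicit branching: compare values when both are present,
    # otherwise the two sides are equal only if both are null.
    if a is not None and b is not None:
        return a == b
    both_null = a is None and b is None
    return both_null


def column_eq_xor(a: Optional[int], b: Optional[int]) -> bool:
    a_is_null = a is None
    b_is_null = b is None
    # For booleans, `!=` acts as xor: exactly one side is null.
    if a_is_null != b_is_null:
        return False
    # Here both are null (equal) or both are present (compare values).
    return a_is_null or a == b


# Compare the two formulations over every combination of sample inputs.
samples = [None, 1, 2]
results_match = all(
    column_eq_branches(a, b) == column_eq_xor(a, b)
    for a, b in product(samples, repeat=2)
)
```

Both formulations treat two nulls as equal and a null paired with a value as unequal; the xor form just folds the mismatch case into a single early return.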

@colin-ho colin-ho force-pushed the colin/nulls-in-partition-col branch from 17cc5f8 to c794012 Compare June 6, 2024 15:53
@colin-ho colin-ho requested a review from samster25 June 6, 2024 16:24
@colin-ho colin-ho merged commit 87f6706 into main Jun 6, 2024
44 checks passed
@colin-ho colin-ho deleted the colin/nulls-in-partition-col branch June 6, 2024 18:47
Linked issue: Error when reading data from Delta Lake table on S3