Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEAT]: support optional rowgroups to read_parquet #2534

Merged

Conversation

universalmind303
Copy link
Collaborator

closes #2500

@universalmind303 universalmind303 marked this pull request as ready for review July 18, 2024 15:14
@github-actions github-actions bot added the enhancement New feature or request label Jul 18, 2024
daft/io/_parquet.py Show resolved Hide resolved
daft/io/_parquet.py Outdated Show resolved Hide resolved
Some(row_groups.clone())
} else {
parquet_sources_to_row_groups(scan_task.sources.as_slice())
};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this logic also need to be fixed? I think it should similarly be using the rowgroups in the scantask rather than the rowgroups on the ParquetSourceConfig.

I think it's not triggering right now because your Micropartitions aren't lazy -- they're being eagerly materialized in from_scan_task -> read_parquet_into_micropartition, so materialize_scan_task isn't ever being called...

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there a good way to test the materialize_scan_task path?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Usually I do it by:

  1. Adding a filter predicate will trigger the eager materialization path
  2. Omitting a filter predicate should trigger the lazy materialize_scan_task path

I'm actually not sure why in your case it doesn't seem to trigger this code-path... Since you should be in the (2) camp.

Copy link
Collaborator Author

@universalmind303 universalmind303 Jul 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it looks like it eagerly materializes if the parquet file does not contain statistics.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see... I have wondered if it might be a better idea to just always go the lazy route and simplify the code here.

Maybe @samster25 might know why we don't do that?

Definitely very bug prone right now, I've had to tiptoe around this code a lot.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah the parquet logic in general has a lot of extra logic branches not in the other file types that make it a bit hard to debug. Definitely a good candidate for some cleanup/refactoring.

src/daft-scan/src/anonymous.rs Outdated Show resolved Hide resolved
@universalmind303 universalmind303 merged commit 9cd1482 into Eventual-Inc:main Jul 19, 2024
42 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Read specific row group from Parquets
2 participants