-
Notifications
You must be signed in to change notification settings - Fork 147
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[FEAT]: support optional rowgroups to read_parquet
#2534
[FEAT]: support optional rowgroups to read_parquet
#2534
Conversation
Some(row_groups.clone()) | ||
} else { | ||
parquet_sources_to_row_groups(scan_task.sources.as_slice()) | ||
}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this logic also need to be fixed? I think it should similarly be using the rowgroups in the scantask rather than the rowgroups on the ParquetSourceConfig.
I think it's not triggering right now because your Micropartitions aren't lazy -- they're being eagerly materialized in from_scan_task -> read_parquet_into_micropartition
, so materialize_scan_task
isn't ever being called...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is there a good way to test the materialize_scan_task
path?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Usually I do it by:
- Adding a filter predicate will trigger the eager materialization path
- Omitting a filter predicate should trigger the lazy
materialize_scan_task
path
I'm actually not sure why in your case it doesn't seem to trigger this code-path... Since you should be in the (2) camp.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it looks like it eagerly materializes if the parquet file does not contain statistics.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see... I have wondered if it might be a better idea to just always go the lazy route and simplify the code here.
Maybe @samster25 might know why we don't do that?
Definitely very bug prone right now, I've had to tiptoe around this code a lot.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah the parquet logic in general has a lot of extra logic branches not in the other file types that make it a bit hard to debug. Definitely a good candidate for some cleanup/refactoring.
closes #2500