Add Avro Reader projection API #9162
- Introduced a new `project` method in `AvroSchema` to support schema projection by field indices.
- Enhanced `ReaderBuilder` to accept optional top-level field projection via `.with_projection`.
- Updated the `Decoder` to handle effective reader schema pruning for projections.
- Added extensive unit tests to validate projection behavior, including edge cases and nested schemas.
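To make index-based projection concrete, here is a minimal, standard-library-only sketch. It is not the `arrow-avro` implementation: `project_fields` is a hypothetical helper, field names stand in for full Avro field definitions, and it assumes that output columns follow the order of the indices (which the doc-test excerpt later in this thread appears to suggest) and that an out-of-range index is an error.

```rust
/// Select `fields` by zero-based index, in the order given by `projection`.
/// Hypothetical helper for illustration only; not part of arrow-avro.
fn project_fields(fields: &[&str], projection: &[usize]) -> Result<Vec<String>, String> {
    projection
        .iter()
        .map(|&i| {
            fields
                .get(i)
                .map(|name| name.to_string())
                // Assumption: an index past the end of the schema is rejected.
                .ok_or_else(|| {
                    format!(
                        "projection index {} out of bounds for {} fields",
                        i,
                        fields.len()
                    )
                })
        })
        .collect()
}

fn main() {
    let fields = ["id", "name", "value"];

    // Indices select and reorder: field 2 first, then field 0.
    let projected = project_fields(&fields, &[2, 0]).unwrap();
    println!("{:?}", projected);

    // An index beyond the last field produces an error instead of panicking.
    assert!(project_fields(&fields, &[3]).is_err());
}
```

Collecting into a `Result` stops at the first bad index, so a caller sees one clear error rather than a partially projected schema.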
088e68b to
2b52765
Thanks for the great feature. I’ll try it out in DataFusion this week, and I’ll leave a comment if I have any further suggestions.
nathaniel-d-ef
left a comment
Great work, glad to have the fix in the right spot.
alamb
left a comment
Thank you @jecsand838 and @nathaniel-d-ef and @getChan
It will be great to get some confirmation that this works with DataFusion
.projection
.as_deref()
.map(|projection| {
    let base_schema = if let Some(reader_schema) = reader_schema {
It might be worth pointing out in the comments that the projection is relative to the reader schema if set; otherwise it is relative to whatever is in the file.
I'll do a follow-up PR with comments covering this.
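The rule discussed in this thread can be sketched in a few lines. This is a simplified illustration, not the crate's code: `resolve_projection` is a hypothetical function and schemas are reduced to lists of field names. It assumes only what the review comment states, namely that indices are resolved against the reader schema when one is set and against the file's (writer) schema otherwise.

```rust
/// Resolve projection indices against the reader schema if present,
/// otherwise against the writer schema stored in the file.
/// Hypothetical helper for illustration only; returns None on a bad index.
fn resolve_projection<'a>(
    writer_fields: &[&'a str],
    reader_fields: Option<&[&'a str]>,
    projection: &[usize],
) -> Option<Vec<&'a str>> {
    // The base schema is the reader schema when supplied, else the writer schema.
    let base = reader_fields.unwrap_or(writer_fields);
    projection.iter().map(|&i| base.get(i).copied()).collect()
}

fn main() {
    let writer: &[&str] = &["id", "name", "value"];
    let reader: &[&str] = &["value", "id"];

    // No reader schema: index 0 refers to the file's first field.
    assert_eq!(resolve_projection(writer, None, &[0]), Some(vec!["id"]));

    // With a reader schema: index 0 refers to the reader schema's first field.
    assert_eq!(resolve_projection(writer, Some(reader), &[0]), Some(vec!["value"]));
}
```

The same index can therefore name different columns depending on whether a reader schema was configured, which is why the review asks for this to be documented.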
/// assert_eq!(out.schema().field(0).name(), "value");
/// assert_eq!(out.schema().field(1).name(), "id");
/// # Ok(()) }
/// ```
😍
]
}"#;
let schema = AvroSchema::new(schema_json.to_string());
// Project in reverse order
nit: it isn't really in reverse order
Ah good catch, I'll clean the comment up.
assert_eq!(fields[2].get("name").and_then(|n| n.as_str()), Some("f4"));
}
#[test]
this is quite a collection of tests
I am merging this in to make it easier to test downstream -- let's make any additional improvements as follow-on PRs
@getChan @nathaniel-d-ef Let me know if we need any further enhancements and I'll be sure to get those up prior to this month's major release.
100%.
Which issue does this PR close?
Rationale for this change
The `arrow-avro` crate's `ReaderBuilder` previously lacked the ability to project (select) specific columns when reading Avro files. This is a common feature in other Arrow readers (like `arrow-csv` and `arrow-ipc`) that enables users to read only the columns they need, improving performance and reducing memory usage.
What changes are included in this PR?
- `with_projection(projection: Vec<usize>)` method added to `ReaderBuilder` that accepts zero-based column indices
- `AvroSchema::project()` method added to create a projected Avro schema with only the selected fields
Are these changes tested?
Yes, comprehensive tests have been added:
- `AvroSchema::project()` covering:
Are there any user-facing changes?
Yes, this adds a new public API method:
This is consistent with the projection API in `arrow-csv::ReaderBuilder` and `arrow-ipc::FileReaderBuilder`. There are no breaking changes to existing APIs.