@jecsand838
Contributor

Which issue does this PR close?

Rationale for this change

The arrow-avro crate's ReaderBuilder previously lacked the ability to project (select) specific columns when reading Avro files. This is a common feature in other Arrow readers (like arrow-csv and arrow-ipc) that enables users to read only the columns they need, improving performance and reducing memory usage.

What changes are included in this PR?

  • Added a with_projection(projection: Vec<usize>) method to ReaderBuilder that accepts zero-based column indices
  • Implemented AvroSchema::project() method to create a projected Avro schema with only the selected fields
  • The projection supports:
    • Selecting a subset of fields
    • Reordering fields
    • Preserving all record and field metadata (namespace, doc, defaults, aliases, etc.)
    • Preserving nested/complex types (records, arrays, maps, unions)
  • Added validation for out-of-bounds indices and duplicate indices
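
The selection, reordering, and validation rules listed above can be illustrated with a small standalone sketch. It does not use arrow-avro: `project_fields`, its plain string field list, and its error messages are hypothetical stand-ins for what a projection does to a record's top-level fields, not the code added in this PR.

```rust
use std::collections::HashSet;

/// Hypothetical helper (not the arrow-avro API): select and reorder
/// top-level field names by zero-based index.
fn project_fields(fields: &[&str], projection: &[usize]) -> Result<Vec<String>, String> {
    let mut seen = HashSet::with_capacity(projection.len());
    let mut out = Vec::with_capacity(projection.len());
    for &idx in projection {
        // Reject indices past the end of the record's field list.
        let name = fields.get(idx).ok_or_else(|| {
            format!("projection index {idx} out of bounds for {} fields", fields.len())
        })?;
        // Reject repeated indices; each field may be selected at most once.
        if !seen.insert(idx) {
            return Err(format!("duplicate projection index {idx}"));
        }
        // Output order follows the projection, not the original schema.
        out.push((*name).to_string());
    }
    Ok(out)
}
```

With fields `["id", "name", "value"]`, the projection `[2, 0]` yields `["value", "id"]`, while `[3]` and `[1, 1]` each return an `Err` instead of panicking. An empty projection yields an empty field list.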

Are these changes tested?

Yes, comprehensive tests have been added:

  • Unit tests for AvroSchema::project() covering:
    • Empty projections
    • Single and multiple field selection
    • Field reordering
    • Metadata preservation (record-level and field-level)
    • Nested records and complex types (arrays, maps, unions)
    • Error cases (invalid JSON, non-record schemas, out-of-bounds indices, duplicate indices)
  • Integration tests in the reader module for end-to-end projection with OCF files

Are there any user-facing changes?

Yes, this adds a new public API method:

impl ReaderBuilder {
    /// Set a projection of columns to read (zero-based column indices).
    pub fn with_projection(self, projection: Vec<usize>) -> Self
}

This is consistent with the projection API in arrow-csv::ReaderBuilder and arrow-ipc::FileReaderBuilder. There are no breaking changes to existing APIs.

- Introduced a new `project` method in `AvroSchema` to support schema projection by field indices.
- Enhanced `ReaderBuilder` to accept optional top-level field projection via `.with_projection`.
- Updated the `Decoder` to handle effective reader schema pruning for projections.
- Added extensive unit tests to validate projection behavior, including edge cases and nested schemas.
@github-actions bot added the "arrow" (Changes to the arrow crate) and "arrow-avro" (arrow-avro crate) labels on Jan 13, 2026
@getChan
getChan commented Jan 14, 2026

Thanks for the great feature. I’ll try it out in DataFusion this week, and I’ll leave a comment if I have any further suggestions.

Contributor

@nathaniel-d-ef left a comment

Great work, glad to have the fix in the right spot.

Contributor

@alamb left a comment

Thank you @jecsand838 and @nathaniel-d-ef and @getChan

It will be great to get some confirmation that this works with DataFusion

.projection
.as_deref()
.map(|projection| {
let base_schema = if let Some(reader_schema) = reader_schema {
Contributor

It might be worth pointing out in the comments that the projection is relative to the reader schema if one is set; otherwise it is relative to whatever schema is in the file.

Contributor Author

I'll do a follow-up PR with comments covering this.
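
The point raised in this exchange can be sketched without the crate. In the sketch below, `effective_fields` and the string slices standing in for schemas are hypothetical, and the real decoder reports errors rather than returning `None`. The sketch only shows which field list the projection indices refer to.

```rust
/// Hypothetical sketch (not the arrow-avro implementation): resolve which
/// field list a projection indexes into, then apply it.
fn effective_fields<'a>(
    file_fields: &[&'a str],
    reader_fields: Option<&[&'a str]>,
    projection: Option<&[usize]>,
) -> Option<Vec<&'a str>> {
    // Indices are relative to the reader schema when one is supplied,
    // otherwise to the writer schema stored in the file.
    let base = reader_fields.unwrap_or(file_fields);
    match projection {
        // No projection: every field of the base schema is read.
        None => Some(base.to_vec()),
        // `get` yields None for an out-of-bounds index, and `collect`
        // propagates that as an overall None.
        Some(indices) => indices.iter().map(|&i| base.get(i).copied()).collect(),
    }
}
```

With file fields `["a", "b", "c"]` and a reader schema listing `["c", "a"]`, index `0` selects `"c"`. With no reader schema, the same index selects `"a"`.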

/// assert_eq!(out.schema().field(0).name(), "value");
/// assert_eq!(out.schema().field(1).name(), "id");
/// # Ok(()) }
/// ```
Contributor

😍

]
}"#;
let schema = AvroSchema::new(schema_json.to_string());
// Project in reverse order
Contributor

nit: it isn't really in reverse order

Contributor Author

Ah good catch, I'll clean the comment up.

assert_eq!(fields[2].get("name").and_then(|n| n.as_str()), Some("f4"));
}

#[test]
Contributor

this is quite a collection of tests

@alamb merged commit 8b67776 into apache:main on Jan 15, 2026
23 checks passed
@alamb
Contributor

alamb commented Jan 15, 2026

I am merging this in to make it easier to test downstream -- let's make any additional improvements as follow-on PRs

@jecsand838
Contributor Author

@getChan @nathaniel-d-ef Let me know if we need any further enhancements and I'll be sure to get those up prior to this month's major release.

> I am merging this in to make it easier to test downstream -- let's make any additional improvements as follow-on PRs

100%.

@jecsand838 deleted the avro-reader-projection branch on January 15, 2026 23:34
Development

Successfully merging this pull request may close these issues.

[arrow-avro] Add Explicit Projection API to ReaderBuilder