Skip to content

Conversation

@carpecodeum
Copy link
Contributor

@carpecodeum carpecodeum commented Aug 10, 2025

Which issue does this PR close?

Rationale for this change

This PR implements comprehensive integration tests for Parquet files with Variant columns, using the real test data from parquet-testing PR #90.

Are these changes tested?

Yes

If tests are not included in your PR, please explain why (for example, are they covered by existing tests)?

Are there any user-facing changes?

No
Thanks to @mprammer

@github-actions github-actions bot added the parquet Changes to the parquet crate label Aug 10, 2025
@carpecodeum
Copy link
Contributor Author

CC @alamb @scovich

@github-actions github-actions bot added the arrow Changes to the arrow crate label Aug 12, 2025
@carpecodeum carpecodeum force-pushed the tests-reading-variants branch from 5a6b96b to 3f38b82 Compare August 12, 2025 19:40
@github-actions github-actions bot removed the arrow Changes to the arrow crate label Aug 12, 2025
@alamb
Copy link
Contributor

alamb commented Aug 12, 2025

Thank you @carpecodeum -- this is the first PR on my review queue tomorrow

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you so much @carpecodeum and @mprammer. I played around with this PR locally and I think it is great.

I had a bunch of suggestions / ideas but I think we can potentially address them all as part of follow on PRs.

So here is my plan

  1. I'll merge this PR
  2. I'll make a PR to move this test to a new testing crate (where we can use the proper dependencies, etc)
  3. If you want, you can help the iterate in that new testing crate (e.g. actually loading Variants via VariantArray)

Thanks again for getting this project rolling

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think the simple in the name of the test adds much and thus I suggest using a simpler (🥁 ) name: variant_integration.rs

//! Inspired by the arrow-go implementation: https://github.com/apache/arrow-go/pull/455/files

// These tests require the arrow feature
#![cfg(feature = "arrow")]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this only applies to the first arrow_array use statement

To declare a test requires a particular feature, I think you need to add a entry like this to Cargo.toml for simple_variant_integration

arrow-rs/parquet/Cargo.toml

Lines 162 to 165 in 52fd59c

[[test]]
name = "encryption"
required-features = ["arrow"]
path = "./tests/encryption/mod.rs"

}

// Individual test functions with #[ignore] for progressive enablement
// Following the exact pattern from the PR description
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

}

/// Load expected Variant binary data from .variant.bin file
#[allow(dead_code)]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am pretty sure this function is not used, so maybe we can either remove it or change the code above to call it?

Comment on lines +528 to +541
#[ignore] // Enable once parquet-variant dependencies are added
fn test_variant_integration_case_001() {
let harness = VariantIntegrationHarness::new().expect("Failed to create test harness");

let test_case = harness
.test_cases
.iter()
.find(|case| case.case_number == 1)
.expect("case-001 not found");

harness
.run_single_test(test_case)
.expect("case-001 should pass");
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this pattern of creating one #test for each input case is very nice and results in easy to debug errors. However, since there will be so many cases (over 100 as I understand) I recommend we try and avoid as much repetition as possible

Could we write something like the following for each case? Then we could make a macro or something to generate all the functions

Suggested change
#[ignore] // Enable once parquet-variant dependencies are added
fn test_variant_integration_case_001() {
let harness = VariantIntegrationHarness::new().expect("Failed to create test harness");
let test_case = harness
.test_cases
.iter()
.find(|case| case.case_number == 1)
.expect("case-001 not found");
harness
.run_single_test(test_case)
.expect("case-001 should pass");
}
#[ignore] // Enable once parquet-variant dependencies are added
fn test_variant_integration_case_001() {
test_harness().run_case(1)
}

}

/// Run a test case that should succeed
fn run_success_test(&self, test_case: &VariantTestCase) -> Result<(), Box<dyn Error>> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In tests, I personally find it helpful to fail fast if there is a problem

Thus one pattern you might consider here returning () and using expect or unwrap in the test code

So something instead of

        let struct_arrays = self.read_parquet_file(test_case)?;

More like

        let struct_arrays = self.read_parquet_file(test_case).expect("error reading parquet file");

println!(" Parquet file reading failed: {}", e);
if let Some(expected_error) = &test_case.error_message {
println!(" Expected error: {}", expected_error);
// TODO: Match actual error against expected error pattern
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not quite sure if we want the rust implementation to match the same error messages as iceberg.

Another potential thing would be to call panic! here with the actual error message and then annotate the tests that fail with should_panic, for example here:

#[test]
#[should_panic(expected = "assertion failed: (offset + length) <= self.len()")]
fn test_struct_array_from_data_with_offset_and_length_error() {

parse_cases_json(&content)
}

/// Parse cases.json manually without serde
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see any reason we can't use serde_json here, but maybe we can do that as a follow on

let _metadata = fs::read(&metadata_path)?;
let _value = fs::read(&value_path)?;

// TODO: Parse variant when parquet_variant crate is available
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, see my notes above -- I have an idea of how to do this

}

/// Check if a StructArray has the expected Variant structure (metadata, value fields)
fn is_variant_struct_array(&self, struct_array: &StructArray) -> Result<bool, Box<dyn Error>> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

eventually, I think this could just reuse VariantArray which does all the same validations

However, to do that we would also need to add a new dependency on parquet-variant, and there is an interesting question about what crates depends on what

I think the idea is that we will NOT add any new dependencies to the main parquet crate. However, since this test harness is in that create, we can't add a dependency on VariantArray

So what I think we should do is as a follow on PR we can move this test into a new (testing only) crate like parquet-variant-integration-testing which can depend on both Variant and VariantArray

@alamb alamb changed the title [VARIANT] add integration tests for variant reads [VARIANT] Initial integration tests for variant reads Aug 13, 2025
@alamb alamb merged commit 10a0610 into apache:main Aug 13, 2025
18 checks passed
mbrobbel added a commit that referenced this pull request Sep 8, 2025
# Which issue does this PR close?

- Closes #8132
- Part of #8084
- Follow on to #8104

# Rationale for this change

TLDR is we need a way to test and work out how Variant integration with
the actual parquet reader/writer will look, so let's do it in the
parquet crate.

Please see the essay on #8132
for background

Follow on tasks (I will file tickets for these items if we agree on this
as an integration mechanism):
- [x] Do not `panic` when writing VariantArray with the ArrowWriter:
#8296
- [ ] Add some way to write the logical annotation to parquet metadata
- [ ] Read arrays annotated with VARIANT logical type as VariantArrays
in ArrowReader
- [x] Update the variant_integration test to use `VariantArray` :
#8084
- [x] Rename `variant_experimental` flag to `variant` and remove
warnings about being experimental:
#8297


Follow up tasks that came out of this PR but do not depend on it
- [x] #8145
- [x] #8144

# What changes are included in this PR?

1. Add the `variant_experimental` feature to the `parquet` crate
2. Publicly export the variant crates
3. Add docs and examples

# Are these changes tested?
Yes by new CI

# Are there any user-facing changes?

This adds a new feature flag, and new

---------

Co-authored-by: Matthijs Brobbel <[email protected]>
alamb added a commit that referenced this pull request Sep 16, 2025
…ing` data (#8325)

# Which issue does this PR close?

We generally require a GitHub issue to be filed for all bug fixes and
enhancements and this helps us generate change logs for our releases.
You can link an issue to this PR using the GitHub syntax.

- Closes #8084

# Rationale for this change

Now that we have merged the upstream parquet-variant tests:
- apache/parquet-testing#91

We can test how far we are from the rust variant implementation working
for all the values

This PR updates the test harness added
#8104 by @carpecodeum to use the
final parquet files and the currnet APIs

# What changes are included in this PR?

1. Update parquet-testing pin
2. Update the test harness to use the standard rust test runner
(`#[test]`) rather than a custom main function
3. Added links to follow on tickets

You can run this test manually like this:

```shell
cargo test --all-features --test variant_integration

...
running 138 tests
test test_variant_integration_case_106 ... ok
test test_variant_integration_case_107 ... ok
test test_variant_integration_case_109 ... ok
test test_variant_integration_case_110 ... ok
..
test test_variant_integration_case_90 ... ok
test test_variant_integration_case_91 ... ok
test test_variant_integration_case_93 ... ok
test test_variant_integration_case_83 - should panic ... ok
test test_variant_integration_case_84 - should panic ... ok
```

# Are these changes tested?
Yes this is all tests

# Are there any user-facing changes?
No
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

parquet Changes to the parquet crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants