[VARIANT] Initial integration tests for variant reads #8104

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

Sign up for GitHub

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Jump to bottom

Merged

alamb merged 3 commits into apache:main from cmu-db:tests-reading-variants

Aug 13, 2025

Contributor

carpecodeum commented Aug 10, 2025 •

edited by alamb

Loading

Which issue does this PR close?

part of [Variant] Integration tests for reading parquet w/ Variants #8084 .

Rationale for this change

This PR implements comprehensive integration tests for Parquet files with Variant columns, using the real test data from parquet-testing PR #90.

Are these changes tested?

Yes

If tests are not included in your PR, please explain why (for example, are they covered by existing tests)?

Are there any user-facing changes?

No
Thanks to @mprammer


          [ADD] add integration tests for variant testing

c002613

github-actions bot added the parquet label

Contributor Author

carpecodeum commented Aug 10, 2025

CC @alamb @scovich


          [FIX] fix issues

8ae4b42

github-actions bot added the arrow label


          [FIX] fix clippy and linting issues

3f38b82

carpecodeum force-pushed the tests-reading-variants branch from 5a6b96b to 3f38b82 Compare

August 12, 2025 19:40

github-actions bot removed the arrow label

Contributor

alamb commented Aug 12, 2025

Thank you @carpecodeum -- this is the first PR on my review queue tomorrow

alamb approved these changes

View reviewed changes

Contributor

alamb left a comment

Thank you so much @carpecodeum and @mprammer. I played around with this PR locally and I think it is great.

I had a bunch of suggestions / ideas but I think we can potentially address them all as part of follow on PRs.

So here is my plan

I'll merge this PR
I'll make a PR to move this test to a new testing crate (where we can use the proper dependencies, etc)
If you want, you can help the iterate in that new testing crate (e.g. actually loading Variants via VariantArray)

Thanks again for getting this project rolling

parquet/tests/simple_variant_integration.rs

Contributor

alamb Aug 13, 2025

I don't think the simple in the name of the test adds much and thus I suggest using a simpler (🥁 ) name: variant_integration.rs

parquet/tests/simple_variant_integration.rs

+              //! Inspired by the arrow-go implementation: https://github.com/apache/arrow-go/pull/455/files
+              // These tests require the arrow feature
+              #![cfg(feature = "arrow")]

Contributor

alamb Aug 13, 2025

I think this only applies to the first arrow_array use statement

To declare a test requires a particular feature, I think you need to add a entry like this to Cargo.toml for simple_variant_integration

arrow-rs/parquet/Cargo.toml

Lines 162 to 165 in 52fd59c

    
           [[test]] 
        
           name = "encryption" 
        
           required-features = ["arrow"] 
        
           path = "./tests/encryption/mod.rs"

parquet/tests/simple_variant_integration.rs

+              }
+              // Individual test functions with #[ignore] for progressive enablement
+              // Following the exact pattern from the PR description

Contributor

alamb Aug 13, 2025

👍

parquet/tests/simple_variant_integration.rs

+                  }
+                  /// Load expected Variant binary data from .variant.bin file
+                  #[allow(dead_code)]

Contributor

alamb Aug 13, 2025

I am pretty sure this function is not used, so maybe we can either remove it or change the code above to call it?

parquet/tests/simple_variant_integration.rs

Comment on lines +528 to +541

+              #[ignore] // Enable once parquet-variant dependencies are added
+              fn test_variant_integration_case_001() {
+                  let harness = VariantIntegrationHarness::new().expect("Failed to create test harness");
+                  let test_case = harness
+                      .test_cases
+                      .iter()
+                      .find(|case| case.case_number == 1)
+                      .expect("case-001 not found");
+                  harness
+                      .run_single_test(test_case)
+                      .expect("case-001 should pass");
+              }

Contributor

alamb Aug 13, 2025

I think this pattern of creating one #test for each input case is very nice and results in easy to debug errors. However, since there will be so many cases (over 100 as I understand) I recommend we try and avoid as much repetition as possible

Could we write something like the following for each case? Then we could make a macro or something to generate all the functions

Suggested change

      
            #[ignore] // Enable once parquet-variant dependencies are added
          
            fn test_variant_integration_case_001() {
          
                let harness = VariantIntegrationHarness::new().expect("Failed to create test harness");
          
                let test_case = harness
          
                    .test_cases
          
                    .iter()
          
                    .find(|case| case.case_number == 1)
          
                    .expect("case-001 not found");
          
                harness
          
                    .run_single_test(test_case)
          
                    .expect("case-001 should pass");
          
            }
          
            #[ignore] // Enable once parquet-variant dependencies are added
          
            fn test_variant_integration_case_001() {
          
                test_harness().run_case(1)
          
            }

parquet/tests/simple_variant_integration.rs

+                  }
+                  /// Run a test case that should succeed
+                  fn run_success_test(&self, test_case: &VariantTestCase) -> Result<(), Box<dyn Error>> {

Contributor

alamb Aug 13, 2025

In tests, I personally find it helpful to fail fast if there is a problem

Thus one pattern you might consider here returning () and using expect or unwrap in the test code

So something instead of

        let struct_arrays = self.read_parquet_file(test_case)?;

More like

        let struct_arrays = self.read_parquet_file(test_case).expect("error reading parquet file");

parquet/tests/simple_variant_integration.rs

+                              println!("    Parquet file reading failed: {}", e);
+                              if let Some(expected_error) = &test_case.error_message {
+                                  println!("    Expected error: {}", expected_error);
+                                  // TODO: Match actual error against expected error pattern

Contributor

alamb Aug 13, 2025

I am not quite sure if we want the rust implementation to match the same error messages as iceberg.

Another potential thing would be to call panic! here with the actual error message and then annotate the tests that fail with should_panic, for example here:

arrow-rs/arrow-array/src/array/struct_array.rs

Lines 644 to 646 in 13fc3c8

    
           #[test] 
        
           #[should_panic(expected = "assertion failed: (offset + length) <= self.len()")] 
        
           fn test_struct_array_from_data_with_offset_and_length_error() {

parquet/tests/simple_variant_integration.rs

+                  parse_cases_json(&content)
+              }
+              /// Parse cases.json manually without serde

Contributor

alamb Aug 13, 2025

I don't see any reason we can't use serde_json here, but maybe we can do that as a follow on

parquet/tests/simple_variant_integration.rs

+                  let _metadata = fs::read(&metadata_path)?;
+                  let _value = fs::read(&value_path)?;
+                  // TODO: Parse variant when parquet_variant crate is available

Contributor

alamb Aug 13, 2025

yeah, see my notes above -- I have an idea of how to do this

parquet/tests/simple_variant_integration.rs

+                  }
+                  /// Check if a StructArray has the expected Variant structure (metadata, value fields)
+                  fn is_variant_struct_array(&self, struct_array: &StructArray) -> Result<bool, Box<dyn Error>> {

Contributor

alamb Aug 13, 2025

eventually, I think this could just reuse VariantArray which does all the same validations

However, to do that we would also need to add a new dependency on parquet-variant, and there is an interesting question about what crates depends on what

I think the idea is that we will NOT add any new dependencies to the main parquet crate. However, since this test harness is in that create, we can't add a dependency on VariantArray

So what I think we should do is as a follow on PR we can move this test into a new (testing only) crate like parquet-variant-integration-testing which can depend on both Variant and VariantArray

alamb changed the title ~~[VARIANT] add integration tests for variant reads~~ [VARIANT] Initial integration tests for variant reads

alamb merged commit 10a0610 into apache:main

18 checks passed

This was referenced Aug 13, 2025

[Variant] Introduce parquet-variant-integration-testing crate #8129

Closed

[Variant] Integration tests for reading parquet w/ Variants #8084

Closed

[Variant] Add variant feature to parquet crate #8132

Closed

Add variant_experimental feature to parquet crate #8133

Merged

carpecodeum mentioned this pull request

Add shredded Variant reader cases with variant logical type apache/parquet-testing#91

Merged

mbrobbel added a commit that referenced this pull request


          Add variant_experimental feature to parquet crate (#8133)

d9a4b39

# Which issue does this PR close?

- Closes #8132
- Part of #8084
- Follow on to #8104

# Rationale for this change

TLDR is we need a way to test and work out how Variant integration with
the actual parquet reader/writer will look, so let's do it in the
parquet crate.

Please see the essay on #8132
for background

Follow on tasks (I will file tickets for these items if we agree on this
as an integration mechanism):
- [x] Do not `panic` when writing VariantArray with the ArrowWriter:
#8296
- [ ] Add some way to write the logical annotation to parquet metadata
- [ ] Read arrays annotated with VARIANT logical type as VariantArrays
in ArrowReader
- [x] Update the variant_integration test to use `VariantArray` :
#8084
- [x] Rename `variant_experimental` flag to `variant` and remove
warnings about being experimental:
#8297


Follow up tasks that came out of this PR but do not depend on it
- [x] #8145
- [x] #8144

# What changes are included in this PR?

1. Add the `variant_experimental` feature to the `parquet` crate
2. Publicly export the variant crates
3. Add docs and examples

# Are these changes tested?
Yes by new CI

# Are there any user-facing changes?

This adds a new feature flag, and new

---------

Co-authored-by: Matthijs Brobbel <[email protected]>

alamb mentioned this pull request

Update variant_integration test to use final approved parquet-testing data #8325

Merged

alamb added a commit that referenced this pull request


          Update variant_integration test to use final approved `parquet-test…

2ec77b5

…ing` data (#8325)

# Which issue does this PR close?

We generally require a GitHub issue to be filed for all bug fixes and
enhancements and this helps us generate change logs for our releases.
You can link an issue to this PR using the GitHub syntax.

- Closes #8084

# Rationale for this change

Now that we have merged the upstream parquet-variant tests:
- apache/parquet-testing#91

We can test how far we are from the rust variant implementation working
for all the values

This PR updates the test harness added
#8104 by @carpecodeum to use the
final parquet files and the currnet APIs

# What changes are included in this PR?

1. Update parquet-testing pin
2. Update the test harness to use the standard rust test runner
(`#[test]`) rather than a custom main function
3. Added links to follow on tickets

You can run this test manually like this:

```shell
cargo test --all-features --test variant_integration

...
running 138 tests
test test_variant_integration_case_106 ... ok
test test_variant_integration_case_107 ... ok
test test_variant_integration_case_109 ... ok
test test_variant_integration_case_110 ... ok
..
test test_variant_integration_case_90 ... ok
test test_variant_integration_case_91 ... ok
test test_variant_integration_case_93 ... ok
test test_variant_integration_case_83 - should panic ... ok
test test_variant_integration_case_84 - should panic ... ok
```

# Are these changes tested?
Yes this is all tests

# Are there any user-facing changes?
No

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

parquet