rdblue commented Jul 23, 2025

For more information, review README.md first. These test cases were generated from Iceberg test cases; the change that adds them is still an open PR on the Iceberg project: apache/iceberg#13654

rdblue force-pushed the add-shredded-read-cases branch from 66ea1a9 to 31286ef Compare July 23, 2025 23:42

rdblue commented Jul 24, 2025

"variant_files" : [ null, "case-083_row-1.variant.bin", "case-083_row-2.variant.bin", "case-083_row-3.variant.bin" ],
"variants" : "[null, Variant(metadata=VariantMetadata(dict={0 => a, 1 => b, 2 => c, 3 => d, 4 => e}), value=VariantObject(fields={c: VariantObject(fields={b: Variant(type=STRING, value=iceberg)})})), Variant(metadata=VariantMetadata(dict={0 => a, 1 => b, 2 => c, 3 => d, 4 => e}), value=VariantObject(fields={c: Variant(type=INT8, value=8), d: Variant(type=DOUBLE, value=-0.0)})), Variant(metadata=VariantMetadata(dict={0 => a, 1 => b, 2 => c, 3 => d, 4 => e}), value=VariantObject(fields={c: VariantObject(fields={a: Variant(type=INT32, value=34), b: Variant(type=STRING, value=)}), d: Variant(type=DOUBLE, value=0.0)}))]"
}, {
"case_number" : 84,

This came up on the PR with the Go integration test:

Test case 84, testShreddedObjectWithOptionalFieldStructs, tests the scenario where the shredded fields of an object are listed as optional in the schema, but the spec states that they must be required. Thus, the Go implementation errors on this test, as the spec says this is an error. Clarification is needed on whether this is a valid test case.

I think this case doesn't have an error_message because it was created by Iceberg, which chose (as the spec allows) to still read the invalid data.

@rdblue says in apache/arrow-go#455 (comment):

They are not allowed by the spec. The implementation I generated these cases from is defensive and tries to read if it can rather than producing errors. I'd recommend doing the same thing to handle outside-of-spec cases.

Thus, I suggest we resolve the confusion by updating these tests in this PR to make it clearer that they are not valid. For example, @julienledem suggested naming such invalid files like case-084-INVALID.parquet.

Another option might be to add a notes field, something like:

  "notes": "This parquet file is not valid according to the spec and implementations can choose to error, or read the non shredded value",
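
To make the two suggestions above concrete, here is a minimal Python sketch of how a test harness might tell the cases apart. It is only an illustration under assumptions: `case_number`, `variant_files`, and `error_message` appear in this thread and `notes` is the field proposed above, but the `parquet_file` key, the `classify` helper, and the meaning given to each label are hypothetical and not taken from the actual manifest or any implementation.

```python
import json

# Hypothetical manifest entries. "parquet_file" is an assumed key name;
# "notes" is the field proposed in this comment, not an existing one.
CASES_JSON = """
[
  {"case_number": 83,
   "parquet_file": "case-083.parquet",
   "variant_files": [null, "case-083_row-1.variant.bin"]},
  {"case_number": 84,
   "parquet_file": "case-084-INVALID.parquet",
   "notes": "This parquet file is not valid according to the spec and implementations can choose to error, or read the non shredded value"}
]
"""


def classify(case):
    """Return how a reader is expected to treat one manifest entry.

    'must_error': the entry carries an error_message (assumed to mean the
                  reader is expected to fail).
    'may_error':  the file name carries the suggested INVALID marker, so a
                  reader may either fail or read defensively.
    'must_read':  anything else is treated as a valid file.
    """
    if "error_message" in case:
        return "must_error"
    if "INVALID" in case.get("parquet_file", ""):
        return "may_error"
    return "must_read"


if __name__ == "__main__":
    for case in json.loads(CASES_JSON):
        label = classify(case)
        note = case.get("notes", "")
        print(case["case_number"], label, note)
```

With something like this, a strict reader (such as the Go one) and a defensive reader (such as the Iceberg one) could both pass a `may_error` case, while `must_read` cases would still require matching output.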

cc @aihuaxu @RussellSpitzer

alamb pushed a commit to apache/arrow-rs that referenced this pull request Aug 13, 2025
# Which issue does this PR close?

- Part of #8084.

# Rationale for this change
This PR implements comprehensive integration tests for Parquet files
with Variant columns, using the real test data from
apache/parquet-testing#90.

# Are these changes tested?
Yes

If tests are not included in your PR, please explain why (for example,
are they covered by existing tests)?

# Are there any user-facing changes?

No
Thanks to @mprammer

alamb commented Aug 15, 2025

@aihuaxu has a proposed replacement PR that clearly calls out the error cases:


alamb commented Aug 21, 2025

As @wgtmac merged #91, I think this PR is now superseded and should probably be closed.


wgtmac commented Aug 22, 2025

Thanks @alamb for the reminder! Let me close this.

wgtmac closed this Aug 22, 2025