-
Notifications
You must be signed in to change notification settings - Fork 70
Add shredded Variant reader test cases #90
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
66ea1a9 to
31286ef
Compare
|
I've also added example tests to Iceberg that run the cases defined here: https://github.com/apache/iceberg/pull/13654/files#diff-67a89b76a5ea72983fb3e6e564b824984ab934c893423f80ef0aa52f4f785394 |
| "variant_files" : [ null, "case-083_row-1.variant.bin", "case-083_row-2.variant.bin", "case-083_row-3.variant.bin" ], | ||
| "variants" : "[null, Variant(metadata=VariantMetadata(dict={0 => a, 1 => b, 2 => c, 3 => d, 4 => e}), value=VariantObject(fields={c: VariantObject(fields={b: Variant(type=STRING, value=iceberg)})})), Variant(metadata=VariantMetadata(dict={0 => a, 1 => b, 2 => c, 3 => d, 4 => e}), value=VariantObject(fields={c: Variant(type=INT8, value=8), d: Variant(type=DOUBLE, value=-0.0)})), Variant(metadata=VariantMetadata(dict={0 => a, 1 => b, 2 => c, 3 => d, 4 => e}), value=VariantObject(fields={c: VariantObject(fields={a: Variant(type=INT32, value=34), b: Variant(type=STRING, value=)}), d: Variant(type=DOUBLE, value=0.0)}))]" | ||
| }, { | ||
| "case_number" : 84, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It came up on the PR with the go integration test
Test case 84, testShreddedObjectWithOptionalFieldStructs tests the schenario where the shredded fields of an object are listed as optional in the schema, but the spec states that they must be required. Thus, the Go implementation errors on this test as the spec says this is an error. Clarification is needed on if this is a valid test case.
I think this case doesn't have an error_message because it was created wby iceberg which chose (which is allowed per the spec) to still read the invalid data
@rdblue says in apache/arrow-go#455 (comment):
They are not allowed by the spec. The implementation I generated these cases from is defensive and tries to read if it can rather than producing errors. I'd recommend doing the same thing to handle outside-of-spec cases.
Thus, I suggest we resolve the confusion by updating these tests in this PR to make it clearer that they are not valid. For example, @julienledem suggested naming such invalid files as case-084-INVALID.parquet
ANother option might be to add a notes field, something like:
"notes": "This parquet file is not valid according to the spec and implementations can choose to error, or read the non shredded value",# Which issue does this PR close? - part of #8084 . # Rationale for this change This PR implements comprehensive integration tests for Parquet files with Variant columns, using the real test data from parquet-testing PR #[90](apache/parquet-testing#90). # Are these changes tested? Yes If tests are not included in your PR, please explain why (for example, are they covered by existing tests)? # Are there any user-facing changes? No Thanks to @mprammer
|
@aihuaxu has a proposed replacement PR that clearly calls out the error cases: |
|
Thanks @alamb for the reminder! Let me close this. |
For more information, review README.md first. The test cases were generated from Iceberg test cases. The PR for the addition is an open PR on the Iceberg project: apache/iceberg#13654