-
Notifications
You must be signed in to change notification settings - Fork 2.9k
API: Move Variant interfaces and serialized implementations to API #12374
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| @@ -18,11 +18,8 @@ | |||
| */ | |||
| package org.apache.iceberg.variants; | |||
|
|
|||
| /** A variant metadata and value pair. */ | |||
| public interface Variant { | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was moved from Variants and did not replace the Variant interface. Looks like a bad diff detection in git.
| /** Returns the variant value. */ | ||
| VariantValue value(); | ||
|
|
||
| static Variant of(VariantMetadata metadata, VariantValue value) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To be replaced with the implementation in the Parquet writer PR (#12323).
| @@ -133,8 +132,8 @@ public boolean hasNext() { | |||
| } | |||
|
|
|||
| @Override | |||
| public Pair<String, Integer> next() { | |||
| Pair<String, Integer> next = Pair.of(metadata.get(id(index)), index); | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pair is part of core and can't be moved because it has Avro class references.
|
|
||
| static final ByteBuffer EMPTY_V1_BUFFER = | ||
| ByteBuffer.wrap(new byte[] {0x01, 0x00}).order(ByteOrder.LITTLE_ENDIAN); | ||
| ByteBuffer.wrap(new byte[] {0x01, 0x00, 0x00}).order(ByteOrder.LITTLE_ENDIAN); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This implementation now finds the end of the Variant metadata buffer so that metadata and value buffers can be concatenated.
| throw new IllegalArgumentException("Not an array: " + this); | ||
| } | ||
|
|
||
| static VariantValue from(VariantMetadata metadata, ByteBuffer value) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Factory methods are copied into better places in the API now, rather than all being in Variants.
aihuaxu
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM.
| @Test | ||
| public void testHeaderSorted() { | ||
| SerializedMetadata metadata = SerializedMetadata.from(new byte[] {0b10001, 0x00}); | ||
| SerializedMetadata metadata = SerializedMetadata.from(new byte[] {0b10001, 0x00, 0x00}); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Later we probably need test helper function to create such byte arrays from strings.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have those helpers in VariantTestUtil, but most of the tests here use hard-coded byte arrays for a couple reasons. First, I don't want to rely on equally complicated code. The most basic cases (like whether the sorted flag is set) should use values created directly from the spec. Second, it's hard to exercise the cases with generated values. For instance, testHeaderOffsetSize checks the offset size without needing to generate a metadata dictionary that has more than 65k unique values. There's a test that does this later (testThreeByteFieldIds) but that's testing the offsets and it is good to know from an independent test that the offset size is being interpreted correctly.
880ea5f to
a820db5
Compare
danielcweeks
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 Overall. I think the one preference I would have is that if we're moving some of the type info (basic/logical/physical), I feel it would be better to move it inner enum to Variant interface so we have Variant.BasicType or Variant.LogicalType as opposed to just standalone enums.
Not a strong opinion, but I but breaking them out like this just feels disconnected.
I agree. The reason for this is that they can't be package-private when nested in an interface and I'm not yet sure whether we will expose the basic and logical types. |
|
Thanks for the reviews, @danielcweeks and @aihuaxu! |
This has been part of other PRs, but because the
Serialized*classes are moving it is getting big enough to be a separate PR.This moves the Variant interfaces from core to API and also moves the implementations that work with serialized variant buffers. The motivation for this move is to make it possible to read Variant buffers stored in manifest file metadata as lower and upper bounds. The
InclusiveMetricsEvaluatoris in API (#12311) and needs to be able to deserialize variants.