Conversation

@rdblue
Contributor

@rdblue rdblue commented Feb 21, 2025

This has been part of other PRs, but because the Serialized* classes are moving it is getting big enough to be a separate PR.

This moves the Variant interfaces from core to API and also moves the implementations that work with serialized variant buffers. The motivation for this move is to make it possible to read Variant buffers stored in manifest file metadata as lower and upper bounds. The InclusiveMetricsEvaluator is in API (#12311) and needs to be able to deserialize variants.

@@ -18,11 +18,8 @@
*/
package org.apache.iceberg.variants;

/** A variant metadata and value pair. */
public interface Variant {
Contributor Author

@rdblue rdblue Feb 21, 2025

This was moved from Variants and did not replace the Variant interface. Looks like a bad diff detection in git.

/** Returns the variant value. */
VariantValue value();

static Variant of(VariantMetadata metadata, VariantValue value) {
Contributor Author

To be replaced with the implementation in the Parquet writer PR (#12323).

@@ -133,8 +132,8 @@ public boolean hasNext() {
}

@Override
public Pair<String, Integer> next() {
Pair<String, Integer> next = Pair.of(metadata.get(id(index)), index);
Contributor Author

Pair is part of core and can't be moved because it has Avro class references.


static final ByteBuffer EMPTY_V1_BUFFER =
- ByteBuffer.wrap(new byte[] {0x01, 0x00}).order(ByteOrder.LITTLE_ENDIAN);
+ ByteBuffer.wrap(new byte[] {0x01, 0x00, 0x00}).order(ByteOrder.LITTLE_ENDIAN);
Contributor Author

This implementation now finds the end of the Variant metadata buffer so that metadata and value buffers can be concatenated.
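As an illustration of what "finding the end of the metadata buffer" involves, here is a minimal sketch that assumes the Variant metadata layout from the spec: a one-byte header, a dictionary size, `dictionary_size + 1` offsets, then the string bytes. The class and method names (`MetadataEndSketch`, `metadataLength`, `readUnsigned`) are hypothetical and are not the code in this PR. Under that layout an empty dictionary still carries one offset, so the smallest valid metadata buffer is 3 bytes long.

```java
import java.nio.ByteBuffer;

/** Hypothetical helper for illustration only; not the Iceberg implementation. */
final class MetadataEndSketch {
  private MetadataEndSketch() {}

  /** Reads an unsigned little-endian integer of {@code size} bytes at absolute index {@code pos}. */
  static int readUnsigned(ByteBuffer buf, int pos, int size) {
    int result = 0;
    for (int i = 0; i < size; i++) {
      result |= (buf.get(pos + i) & 0xFF) << (8 * i);
    }
    return result;
  }

  /**
   * Returns the length in bytes of the metadata that starts at the buffer's position.
   * Assumes offsets fit in a non-negative int (no overflow handling in this sketch).
   */
  static int metadataLength(ByteBuffer metadata) {
    int start = metadata.position();
    int header = metadata.get(start) & 0xFF;
    // bits 6-7 of the header hold (offset size - 1)
    int offsetSize = 1 + ((header >> 6) & 0b11);
    int dictSize = readUnsigned(metadata, start + 1, offsetSize);
    int offsetsStart = start + 1 + offsetSize;
    // the last of the (dictSize + 1) offsets marks the end of the string bytes
    int lastOffset = readUnsigned(metadata, offsetsStart + dictSize * offsetSize, offsetSize);
    int bytesStart = offsetsStart + (dictSize + 1) * offsetSize;
    return bytesStart + lastOffset - start;
  }
}
```

With the length known, a reader can slice the metadata off the front of a concatenated buffer and treat the remaining bytes as the value.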

throw new IllegalArgumentException("Not an array: " + this);
}

static VariantValue from(VariantMetadata metadata, ByteBuffer value) {
Contributor Author

Factory methods are copied into better places in the API now, rather than all being in Variants.

Contributor

@aihuaxu aihuaxu left a comment

LGTM.

@Test
public void testHeaderSorted() {
- SerializedMetadata metadata = SerializedMetadata.from(new byte[] {0b10001, 0x00});
+ SerializedMetadata metadata = SerializedMetadata.from(new byte[] {0b10001, 0x00, 0x00});
Contributor

Later we will probably need a test helper function to create such byte arrays from strings.

Contributor Author

We have those helpers in VariantTestUtil, but most of the tests here use hard-coded byte arrays for a couple of reasons. First, I don't want the tests to rely on code that is as complicated as the code under test. The most basic cases (like whether the sorted flag is set) should use values created directly from the spec. Second, some cases are hard to exercise with generated values. For instance, testHeaderOffsetSize checks the offset size without needing to generate a metadata dictionary that has more than 65k unique values. There's a test that does this later (testThreeByteFieldIds), but that one is testing the offsets, and it is good to know from an independent test that the offset size is being interpreted correctly.
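To make the point about spec-derived bytes concrete, here is a small sketch of how a metadata header byte decodes. It assumes the header layout from the Variant spec (version in the low 4 bits, sorted flag in bit 4, offset size minus one in bits 6-7). `MetadataHeaderSketch` and its methods are hypothetical names, not part of SerializedMetadata.

```java
/** Hypothetical decoder for a Variant metadata header byte; for illustration only. */
final class MetadataHeaderSketch {
  private MetadataHeaderSketch() {}

  /** Version is stored in the low 4 bits. */
  static int version(int header) {
    return header & 0x0F;
  }

  /** Bit 4 is set when the dictionary strings are sorted. */
  static boolean isSorted(int header) {
    return (header & 0b10000) != 0;
  }

  /** Bits 6-7 store the offset size minus one, so the result is 1 to 4 bytes. */
  static int offsetSize(int header) {
    return 1 + ((header >> 6) & 0b11);
  }
}
```

For the byte used in testHeaderSorted, `0b10001` decodes as version 1, sorted, with 1-byte offsets. A header such as `0b01000001` would declare 2-byte offsets without any dictionary having to be generated, which is the kind of case hard-coded arrays make cheap to test.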

@rdblue rdblue force-pushed the variant-move-to-api branch from 880ea5f to a820db5 Compare February 21, 2025 21:31
@rdblue rdblue changed the title from "Variant: Move interfaces and serialized implementations to API" to "API: Move Variant interfaces and serialized implementations to API" Feb 21, 2025
Contributor

@danielcweeks danielcweeks left a comment

+1 overall. The one preference I have is that if we're moving some of the type info (basic/logical/physical), it would be better to make them inner enums of the Variant interface, so that we have Variant.BasicType or Variant.LogicalType as opposed to standalone enums.

Not a strong opinion, but breaking them out like this just feels disconnected.

@rdblue
Contributor Author

rdblue commented Feb 21, 2025

I think the one preference I would have is that if we're moving some of the type info (basic/logical/physical), I feel it would be better to move it inner enum to Variant interface so we have Variant.BasicType or Variant.LogicalType as opposed to just standalone enums.

I agree. The reason they are standalone for now is that they can't be package-private when nested in an interface, and I'm not yet sure whether we will expose the basic and logical types.
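The visibility constraint is a Java language rule: a type declared inside an interface is implicitly public and static, while a top-level enum with no modifier is package-private. A minimal sketch follows, using hypothetical names (`SketchVariant`, `StandaloneBasicType`) and illustrative constants rather than the real Iceberg types.

```java
import java.lang.reflect.Modifier;

/** Hypothetical stand-in for an API interface; not the Iceberg Variant interface. */
interface SketchVariant {
  // Member types of an interface are implicitly public and static,
  // so this enum is visible wherever SketchVariant is visible.
  enum BasicType {
    PRIMITIVE,
    OBJECT
  }
}

/** A top-level enum with no modifier stays package-private. */
enum StandaloneBasicType {
  PRIMITIVE,
  OBJECT
}

final class NestedVisibilitySketch {
  private NestedVisibilitySketch() {}

  /** True if the nested enum is public even though no modifier was written. */
  static boolean nestedIsPublic() {
    return Modifier.isPublic(SketchVariant.BasicType.class.getModifiers());
  }

  /** True if the standalone enum is public; expected to be false here. */
  static boolean standaloneIsPublic() {
    return Modifier.isPublic(StandaloneBasicType.class.getModifiers());
  }
}
```

Keeping the enums standalone therefore leaves the option of not exposing them, which nesting in a public interface would rule out.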

@rdblue rdblue merged commit dc1e0b2 into apache:main Feb 21, 2025
43 checks passed
@rdblue
Contributor Author

rdblue commented Feb 21, 2025

Thanks for the reviews, @danielcweeks and @aihuaxu!

3 participants