API: Move Variant interfaces and serialized implementations to API #12374

rdblue · 2025-02-21T20:04:41Z

This has been part of other PRs, but because the Serialized* classes are moving it is getting big enough to be a separate PR.

This moves the Variant interfaces from core to API and also moves the implementations that work with serialized variant buffers. The motivation for this move is to make it possible to read Variant buffers stored in manifest file metadata as lower and upper bounds. The InclusiveMetricsEvaluator is in API (#12311) and needs to be able to deserialize variants.

rdblue · 2025-02-21T20:05:45Z

api/src/main/java/org/apache/iceberg/variants/Serialized.java

@@ -18,11 +18,8 @@
 */
 package org.apache.iceberg.variants;

-/** A variant metadata and value pair. */
-public interface Variant {


This was moved from Variants and did not replace the Variant interface. Looks like a bad diff detection in git.

rdblue · 2025-02-21T20:07:02Z

api/src/main/java/org/apache/iceberg/variants/Variant.java

+  /** Returns the variant value. */
+  VariantValue value();
+
+  static Variant of(VariantMetadata metadata, VariantValue value) {


To be replaced with the implementation in the Parquet writer PR (#12323).

rdblue · 2025-02-21T20:07:37Z

api/src/main/java/org/apache/iceberg/variants/SerializedObject.java

@@ -133,8 +132,8 @@ public boolean hasNext() {
          }

          @Override
-          public Pair<String, Integer> next() {
-            Pair<String, Integer> next = Pair.of(metadata.get(id(index)), index);


Pair is part of core and can't be moved because it has Avro class references.

rdblue · 2025-02-21T20:08:34Z

api/src/main/java/org/apache/iceberg/variants/SerializedMetadata.java


  static final ByteBuffer EMPTY_V1_BUFFER =
-      ByteBuffer.wrap(new byte[] {0x01, 0x00}).order(ByteOrder.LITTLE_ENDIAN);
+      ByteBuffer.wrap(new byte[] {0x01, 0x00, 0x00}).order(ByteOrder.LITTLE_ENDIAN);


This implementation now finds the end of the Variant metadata buffer so that metadata and value buffers can be concatenated.
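As a rough illustration of what finding the end of the metadata buffer involves, here is a minimal sketch. It assumes the layout described in the Variant encoding spec: one header byte, a dictionary size, `dictionary_size + 1` offsets, then the string bytes. The class and method names are hypothetical, and this is not the code in SerializedMetadata.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical helper, not part of the Iceberg API.
final class MetadataEndSketch {
  // Reads an unsigned little-endian integer of `size` bytes (1 to 4) at absolute index `pos`.
  static int readUnsigned(ByteBuffer buf, int pos, int size) {
    int result = 0;
    for (int i = 0; i < size; i++) {
      result |= (buf.get(pos + i) & 0xFF) << (8 * i);
    }
    return result;
  }

  // Returns the number of bytes occupied by the metadata that starts at the buffer's position.
  static int metadataLength(ByteBuffer metadata) {
    int start = metadata.position();
    int header = metadata.get(start) & 0xFF;
    // Bits 6-7 of the header hold (offset size - 1).
    int offsetSize = ((header >> 6) & 0x3) + 1;
    int dictSize = readUnsigned(metadata, start + 1, offsetSize);
    int offsetsStart = start + 1 + offsetSize;
    // The last of the (dictSize + 1) offsets is the total length of the string bytes.
    int lastOffset = readUnsigned(metadata, offsetsStart + dictSize * offsetSize, offsetSize);
    int bytesStart = offsetsStart + (dictSize + 1) * offsetSize;
    return bytesStart + lastOffset - start;
  }

  public static void main(String[] args) {
    ByteBuffer empty =
        ByteBuffer.wrap(new byte[] {0x01, 0x00, 0x00}).order(ByteOrder.LITTLE_ENDIAN);
    System.out.println(metadataLength(empty));
  }
}
```

Under this layout an empty dictionary still needs its single trailing offset, which is why the empty V1 buffer grows from two bytes to three: header, dictionary size, and one zero offset.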

rdblue · 2025-02-21T20:09:15Z

api/src/main/java/org/apache/iceberg/variants/VariantValue.java

    throw new IllegalArgumentException("Not an array: " + this);
  }
+
+  static VariantValue from(VariantMetadata metadata, ByteBuffer value) {


Factory methods are copied into better places in the API now, rather than all being in Variants.

aihuaxu

LGTM.

aihuaxu · 2025-02-21T20:30:59Z

api/src/test/java/org/apache/iceberg/variants/TestSerializedMetadata.java

  @Test
  public void testHeaderSorted() {
-    SerializedMetadata metadata = SerializedMetadata.from(new byte[] {0b10001, 0x00});
+    SerializedMetadata metadata = SerializedMetadata.from(new byte[] {0b10001, 0x00, 0x00});


Later we will probably need a test helper function to create such byte arrays from strings.

We have those helpers in VariantTestUtil, but most of the tests here use hard-coded byte arrays for a couple of reasons. First, I don't want the tests to rely on code that is as complicated as the code under test. The most basic cases (like whether the sorted flag is set) should use values created directly from the spec. Second, it's hard to exercise some cases with generated values. For instance, testHeaderOffsetSize checks the offset size without needing to generate a metadata dictionary that has more than 65k unique values. There's a test that does this later (testThreeByteFieldIds), but that test checks the offsets, and it is good to know from an independent test that the offset size is being interpreted correctly.
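To show why a hard-coded header byte is enough for these cases, here is a small sketch of decoding the header fields. It assumes the bit layout from the Variant encoding spec (version in the low 4 bits, sorted flag in bit 4, offset size minus one in bits 6-7). `HeaderSketch` and its methods are hypothetical names, not the SerializedMetadata implementation.

```java
// Hypothetical helper, not part of the Iceberg API.
final class HeaderSketch {
  // Low 4 bits: format version.
  static int version(byte header) {
    return header & 0x0F;
  }

  // Bit 4: set when the dictionary strings are sorted.
  static boolean sortedStrings(byte header) {
    return (header & 0x10) != 0;
  }

  // Bits 6-7 store (offset size - 1), so the result is between 1 and 4 bytes.
  static int offsetSize(byte header) {
    return (((header & 0xFF) >> 6) & 0x3) + 1;
  }
}
```

With this layout, the `0b10001` header used in testHeaderSorted decodes as version 1 with the sorted flag set, and a header such as `(byte) 0b10000001` would indicate 3-byte offsets without any dictionary entries being present.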

danielcweeks

+1 overall. The one preference I have is that if we're moving some of the type info (basic/logical/physical), it would be better to make those inner enums of the Variant interface, so that we have Variant.BasicType or Variant.LogicalType rather than standalone enums.

Not a strong opinion, but breaking them out like this just feels disconnected.

rdblue · 2025-02-21T23:12:49Z

The one preference I have is that if we're moving some of the type info (basic/logical/physical), it would be better to make those inner enums of the Variant interface, so that we have Variant.BasicType or Variant.LogicalType rather than standalone enums.

I agree. The reason they are standalone for now is that types nested in an interface can't be package-private, and I'm not yet sure whether we will expose the basic and logical types.
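The visibility constraint can be seen in a small sketch. In Java, a type declared inside an interface is implicitly public and static, while a top-level enum can stay package-private. All names below are hypothetical and do not match the actual Iceberg enums.

```java
import java.lang.reflect.Modifier;

// Hypothetical stand-in for the Variant interface.
interface SketchVariant {
  // Implicitly public and static because it is a member of an interface.
  enum BasicType {
    PRIMITIVE,
    SHORT_STRING,
    OBJECT,
    ARRAY
  }
}

// A standalone enum without a modifier is package-private.
enum SketchPhysicalType {
  INT8,
  INT16
}

final class VisibilitySketch {
  // Reports whether the given type is declared (or implied to be) public.
  static boolean isPublic(Class<?> type) {
    return Modifier.isPublic(type.getModifiers());
  }
}
```

Keeping the enums standalone therefore leaves room to decide later whether they become part of the public API.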

rdblue · 2025-02-21T23:13:12Z

Thanks for the reviews, @danielcweeks and @aihuaxu!

Move Variant interfaces to API.

f33dc3c

github-actions bot added API core labels Feb 21, 2025

rdblue commented Feb 21, 2025

aihuaxu approved these changes Feb 21, 2025

API: Move serialized variant classes to API for bounds.

a820db5

rdblue force-pushed the variant-move-to-api branch from 880ea5f to a820db5 Compare February 21, 2025 21:31

rdblue changed the title ~~Variant: Move interfaces and serialized implementations to API~~ API: Move Variant interfaces and serialized implementations to API Feb 21, 2025

rdblue mentioned this pull request Feb 21, 2025

API, Core: Update inclusive metrics evaluator for extract and transforms #12311

Merged

danielcweeks approved these changes Feb 21, 2025

rdblue merged commit dc1e0b2 into apache:main Feb 21, 2025
43 checks passed

rdblue mentioned this pull request Feb 21, 2025

Parquet: Implement Variant writers #12323

Merged

3 participants
