Flink: add _row_id and _last_updated_sequence_number readers #14148
Conversation
Marking CI failure: https://github.com/apache/iceberg/actions/runs/17906247731/job/50907838966?pr=14148
Only some minor comments. @mxm: Could you please take a look?
OutputFile output = new InMemoryOutputFile();
try (FileAppender<Record> writer =
    Parquet.write(Files.localOutput(testFile))
What's the reason for moving from local file to in memory?
In #12836, Spark and Core changed their behavior, so I aligned this part with them.
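For reference, a minimal sketch of the in-memory write pattern discussed here; it is not the test code from this PR, the schema, record list, and class/method names are placeholders, and it assumes Iceberg's InMemoryOutputFile and the generic Parquet writer:

import java.io.IOException;
import java.util.List;
import org.apache.iceberg.Schema;
import org.apache.iceberg.data.Record;
import org.apache.iceberg.data.parquet.GenericParquetWriter;
import org.apache.iceberg.inmemory.InMemoryOutputFile;
import org.apache.iceberg.io.FileAppender;
import org.apache.iceberg.io.InputFile;
import org.apache.iceberg.io.OutputFile;
import org.apache.iceberg.parquet.Parquet;

class InMemoryWriteSketch {
  // Writes records to an in-memory output file instead of a local temp file,
  // so the test does not touch the local filesystem.
  static InputFile writeInMemory(Schema schema, List<Record> records) throws IOException {
    OutputFile output = new InMemoryOutputFile();
    try (FileAppender<Record> writer =
        Parquet.write(output)
            .schema(schema)
            .createWriterFunc(GenericParquetWriter::buildWriter)
            .build()) {
      writer.addAll(records);
    }
    // The written bytes can be read back through the corresponding input file.
    return output.toInputFile();
  }
}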
Removed:

ParquetValueReader<?> reader = readersById.get(id);
if (idToConstant.containsKey(id)) {
  // containsKey is used because the constant may be null
  int fieldMaxDefinitionLevel =
      maxDefinitionLevelsById.getOrDefault(id, defaultMaxDefinitionLevel);
  reorderedFields.add(
      ParquetValueReaders.constant(idToConstant.get(id), fieldMaxDefinitionLevel));
} else if (id == MetadataColumns.ROW_POSITION.fieldId()) {
  reorderedFields.add(ParquetValueReaders.position());
} else if (id == MetadataColumns.IS_DELETED.fieldId()) {
  reorderedFields.add(ParquetValueReaders.constant(false));
} else if (reader != null) {
  reorderedFields.add(reader);
} else if (field.initialDefault() != null) {
  reorderedFields.add(
      ParquetValueReaders.constant(
          RowDataUtil.convertConstant(field.type(), field.initialDefault()),
          maxDefinitionLevelsById.getOrDefault(id, defaultMaxDefinitionLevel)));
} else if (field.isOptional()) {
  reorderedFields.add(ParquetValueReaders.nulls());
} else {
  throw new IllegalArgumentException(
      String.format("Missing required field: %s", field.name()));
}

Replaced with:

ParquetValueReader<?> reader =
    ParquetValueReaders.replaceWithMetadataReader(
        id, readersById.get(id), idToConstant, constantDefinitionLevel);
reorderedFields.add(defaultReader(field, reader, constantDefinitionLevel));
Just curious, how did you decide to make this refactoring? It consolidates several branches of the if statement.
In #12836, the shared logic was moved into Core and used by Spark. I noticed the Flink code was the same, so I reused it here as well.
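For context, a rough sketch of how the consolidated helpers slot into the struct-reader loop, based on the new call site shown above; expectedFields and the surrounding builder state (readersById, idToConstant, constantDefinitionLevel, defaultReader) are assumed to come from the enclosing visitor and are not spelled out here:

// Sketch only: reorder the field readers for the expected schema, letting the
// shared helpers handle metadata columns, constants, defaults, and nulls.
List<ParquetValueReader<?>> reorderedFields = new ArrayList<>(expectedFields.size());
for (Types.NestedField field : expectedFields) {
  int id = field.fieldId();
  ParquetValueReader<?> reader =
      ParquetValueReaders.replaceWithMetadataReader(
          id, readersById.get(id), idToConstant, constantDefinitionLevel);
  reorderedFields.add(defaultReader(field, reader, constantDefinitionLevel));
}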
import org.apache.flink.table.data.GenericRowData;
import org.apache.flink.table.data.MapData;
import org.apache.flink.table.data.RawValueData;
import org.apache.flink.table.data.RowData;
The PR needs to go against the 2.1 target (currently: 2.0), as #13714 has been merged.
Can we focus on the 2.0.0 version for this PR first, and then backport it to 2.1.0 and 1.20 later?
I'm fine with both.
Marking CI failure: https://github.com/apache/iceberg/actions/runs/17911586761/job/50924149718?pr=14148
@mxm: Any comments?
mxm left a comment:
LGTM!
Merged to main.
This PR adds readers in Flink for _row_id and _last_updated_sequence_number to support row lineage in Flink.
The change mainly aligns with and references #12836.
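For illustration, a hypothetical read projection that would exercise these readers; it assumes MetadataColumns exposes ROW_ID and LAST_UPDATED_SEQUENCE_NUMBER for v3 row lineage, and the "id" column and class/method names are placeholders, not part of this PR:

import org.apache.iceberg.MetadataColumns;
import org.apache.iceberg.Schema;
import org.apache.iceberg.Table;

class LineageProjectionSketch {
  // Builds a read projection with one data column plus the two lineage metadata
  // columns, so the new Flink readers populate _row_id and
  // _last_updated_sequence_number in the emitted rows.
  static Schema lineageProjection(Table table) {
    return new Schema(
        table.schema().findField("id"),
        MetadataColumns.ROW_ID,
        MetadataColumns.LAST_UPDATED_SEQUENCE_NUMBER);
  }
}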