Flink: Preserve row lineage in RewriteDataFiles on compaction #14127

Guosmilesmile · 2025-09-21T01:13:56Z

This PR is split into two parts to support preserving lineage information in Flink RewriteDataFiles. It only supports RewriteDataFiles for streaming compaction.

1. Adds readers in Flink for _row_id and _last_updated_sequence_number.
   Flink: add _row_id and _last_updated_sequence_number readers #14148
   This change mainly follows Spark: Add _row_id and _last_updated_sequence_number readers #12836
2. When RewriteDataFiles executes the rewrite tasks for a PlannedGroup and detects that the table supports row lineage, it extends the schema with ROW_ID and LAST_UPDATED_SEQUENCE_NUMBER. It then reads these two fields and writes the lineage information into the merged DataFiles.
   Flink: Preserve row lineage in RewriteDataFiles #14149
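
The schema step described above can be sketched roughly as follows. This is a minimal illustration in plain Java, not the Iceberg or Flink API: the class `RowLineageSketch`, the helper `rewriteColumns`, and the `formatVersion >= 3` check are hypothetical stand-ins, and the real change works on Iceberg `Schema` objects rather than lists of column names.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: plain Java stand-ins, not the Iceberg or Flink API.
final class RowLineageSketch {
  // Metadata column names as they appear in the PR description.
  static final String ROW_ID = "_row_id";
  static final String LAST_UPDATED_SEQUENCE_NUMBER = "_last_updated_sequence_number";

  // Assumption: row lineage is treated as available (and mandatory) from format version 3.
  static boolean supportsRowLineage(int formatVersion) {
    return formatVersion >= 3;
  }

  // Returns the columns a rewrite would read and write for a planned group.
  // For tables with row lineage, the two lineage columns are appended so their
  // values are carried into the merged data files instead of being dropped.
  static List<String> rewriteColumns(List<String> tableColumns, int formatVersion) {
    List<String> columns = new ArrayList<>(tableColumns);
    if (supportsRowLineage(formatVersion)) {
      columns.add(ROW_ID);
      columns.add(LAST_UPDATED_SEQUENCE_NUMBER);
    }
    return columns;
  }

  public static void main(String[] args) {
    // prints [id, data]
    System.out.println(rewriteColumns(List.of("id", "data"), 2));
    // prints [id, data, _row_id, _last_updated_sequence_number]
    System.out.println(rewriteColumns(List.of("id", "data"), 3));
  }
}
```

The point of the sketch is only that the same extended column set is used for both the read and the write side of the rewrite, so the lineage values survive compaction.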

pvary · 2025-09-22T05:46:10Z

Thanks @Guosmilesmile for the PR.

A few questions:

Would it make sense to separate the read path into a different PR for faster reviews?
For V3 tables, should we enforce this behavior?

CC: @mxm

Guosmilesmile · 2025-09-22T06:16:46Z

@pvary Thanks for the quick reply.

Would it make sense to separate the read path into a different PR for faster reviews?

Yes, I have split this PR into two separate PRs. The second part relies on the first one, so the CI on the second PR will fail. However, I don't think this should affect the review.

For V3 tables, should we enforce this behavior?

From my perspective, if the v3 table does not enforce this behavior, the lineage information will be lost, and users will not be aware of it. So I enforce this behavior.

pvary · 2025-09-22T07:29:32Z

From my perspective, if the v3 table does not enforce this behavior, the lineage information will be lost, and users will not be aware of it. So I enforce this behavior.

So we definitely make this mandatory for V3 tables.

Guosmilesmile · 2025-09-23T13:19:25Z

I will close this PR since the code has been split into two PRs.

github-actions bot added the flink label Sep 21, 2025

Flink: add _row_id and _last_updated_sequence_number readers

b4b2519

Guosmilesmile force-pushed the flink_lineage branch from 13ef273 to 80c055a on September 23, 2025 05:02

Flink: Preserve row lineage in RewriteDataFiles on compaction

c81d036

Guosmilesmile force-pushed the flink_lineage branch from 80c055a to c81d036 on September 23, 2025 12:41

Guosmilesmile closed this Sep 23, 2025
