@Guosmilesmile
Contributor

@Guosmilesmile Guosmilesmile commented Sep 21, 2025

This PR has been split into two parts to support preserving row lineage information in Flink RewriteDataFiles. Only RewriteDataFiles for streaming compaction is supported.

  1. Adds readers in Flink for _row_id and _last_updated_sequence_number.
    Flink: add _row_id and _last_updated_sequence_number readers #14148
    This change mainly aligns with/references Spark: Add _row_id and _last_updated_sequence_number readers #12836

  2. When RewriteDataFiles executes rewrite tasks for a PlannedGroup and the table is detected to support row lineage, it rewrites the schema to add ROW_ID and LAST_UPDATED_SEQUENCE_NUMBER. It then reads these two fields and writes the lineage information into the merged DataFiles.
    Flink: Preserve row lineage in RewriteDataFiles #14149
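
To illustrate the idea in part 2, here is a minimal, self-contained sketch. It does not use any Iceberg or Flink classes: `LineageSketch`, `Row`, `SourceFile`, and `rewrite` are hypothetical names made up for this example. It assumes the V3 inheritance rules as I understand them: a row without a materialized `_row_id` takes the file's first row ID plus its position in the file, and a row without a materialized `_last_updated_sequence_number` takes the file's data sequence number. The actual readers and writers in the two PRs may work differently.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; these are not Iceberg or Flink classes.
final class LineageSketch {

  /** A row as read from a data file. Lineage fields are null when not materialized. */
  static final class Row {
    final String payload;
    final Long rowId;           // stands in for _row_id
    final Long lastUpdatedSeq;  // stands in for _last_updated_sequence_number

    Row(String payload, Long rowId, Long lastUpdatedSeq) {
      this.payload = payload;
      this.rowId = rowId;
      this.lastUpdatedSeq = lastUpdatedSeq;
    }
  }

  /** Stand-in for a data file plus the metadata that lineage is inherited from. */
  static final class SourceFile {
    final long firstRowId;          // assumed: first row ID assigned to the file
    final long dataSequenceNumber;  // assumed: sequence number of the file's commit
    final List<Row> rows;

    SourceFile(long firstRowId, long dataSequenceNumber, List<Row> rows) {
      this.firstRowId = firstRowId;
      this.dataSequenceNumber = dataSequenceNumber;
      this.rows = rows;
    }
  }

  /** Merges a group of files, materializing lineage so it survives the rewrite. */
  static List<Row> rewrite(List<SourceFile> group) {
    List<Row> merged = new ArrayList<>();
    for (SourceFile file : group) {
      for (int pos = 0; pos < file.rows.size(); pos++) {
        Row row = file.rows.get(pos);
        // Keep an explicit value if present, otherwise inherit from the file.
        long rowId = row.rowId != null ? row.rowId : file.firstRowId + pos;
        long seq = row.lastUpdatedSeq != null ? row.lastUpdatedSeq : file.dataSequenceNumber;
        merged.add(new Row(row.payload, rowId, seq));
      }
    }
    return merged;
  }
}
```

The point of the sketch is that the merged rows carry explicit lineage values, so compaction does not hand out fresh row IDs or reset the last-updated sequence number.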

@github-actions github-actions bot added the flink label Sep 21, 2025
@pvary
Contributor

pvary commented Sep 22, 2025

Thanks @Guosmilesmile for the PR.

A few questions:

  • Would it make sense to separate the read path into a different PR for faster reviews?
  • For V3 tables, should we enforce this behavior?

CC: @mxm

@Guosmilesmile
Contributor Author

Guosmilesmile commented Sep 22, 2025

@pvary Thanks for the quick reply.

> Would it make sense to separate the read path into a different PR for faster reviews?

Yes, I have split this PR into two separate PRs. The second part relies on the first one, so the CI for the second part will fail. However, I don't think this should affect the review.

> For V3 tables, should we enforce this behavior?

From my perspective, if this behavior is not enforced for V3 tables, the lineage information will be lost without users being aware of it. So I enforce this behavior.

@pvary
Contributor

pvary commented Sep 22, 2025

> From my perspective, if this behavior is not enforced for V3 tables, the lineage information will be lost without users being aware of it. So I enforce this behavior.

So we definitely make this mandatory for V3 tables.

@Guosmilesmile
Contributor Author

I will close this PR since the code has been split into two PRs.
