Skip to content

Conversation

@dramaticlly
Copy link
Contributor

Instead of scanning all entries in data/manifest for identifying list of content files to copy, scan only the live one. This is essential to prevent rewrite table path to carry the files already expired as part of snapshot expiration in the source table.

Existing logic fetch both added/existing/deleted entry from manifest to collect list of content files to be copied and rely on reducer for deduplicate based on file name.

However we want to avoid the scenario where the given content file with only deleted status in older manifest, as snapshot expiration might already removed the snapshot which reference the given content file, and deleted as part of snapshot expiration.

With some concrete examlpe to help with explanation,

assume we have 3 snapshots of overwrite operation
1. 8729031490038117099 (dataSeq=1, added d2.parquet)
2. 6024975807438659167 (dataSeq=2, removed d2,parquet and added d3.parquet)
3. 4358334817990999907 (dataSeq=3, removed d3.parquet and added d4.parquet)

the expiration of first snapshot 8729031490038117099, will remove d2.parquet on disk,
second snapshot 6024975807438659167 might still have data manifest entry of deleted (status=2) for d2.parquet.

However it's not desired to include d2.parquet as part of files for path rewrite.

CC @szehon-ho

@szehon-ho
Copy link
Member

Yes, thanks for fixing the issue (found by our internal usage).

I wonder, because the deleted entry may be important for CDC (to mark that this file at some point existed), is another possibility to skip the deleted file for copy? cc @flyrain for thoughts as well

@dramaticlly dramaticlly marked this pull request as ready for review January 21, 2025 04:31
@dramaticlly
Copy link
Contributor Author

Yes, thanks for fixing the issue (found by our internal usage).

I wonder, because the deleted entry may be important for CDC (to mark that this file at some point existed), is another possibility to skip the deleted file for copy? cc @flyrain for thoughts as well

Sounds good, I updated the PR to keep the deleted entry but exclude deleted content files from copy plan

@flyrain
Copy link
Contributor

flyrain commented Jan 24, 2025

Yes, thanks for fixing the issue (found by our internal usage).
I wonder, because the deleted entry may be important for CDC (to mark that this file at some point existed), is another possibility to skip the deleted file for copy? cc @flyrain for thoughts as well

Sounds good, I updated the PR to keep the deleted entry but exclude deleted content files from copy plan

CDC is fine as it won't work anyway if the data file has been removed. +1 to keep the entry in the manifest file as this tool isn't supposed to change it.

@dramaticlly dramaticlly changed the title Core, Spark: Scan only live entries in RewriteTablePathUtil Core, Spark: emExclude non live content file in RewriteTablePathUtil Jan 25, 2025
@dramaticlly dramaticlly changed the title Core, Spark: emExclude non live content file in RewriteTablePathUtil Core, Spark: Exclude non live content file in RewriteTablePathUtil Jan 25, 2025
@dramaticlly dramaticlly deleted the rewrite-live branch January 26, 2025 04:06
@dramaticlly dramaticlly restored the rewrite-live branch January 26, 2025 04:06
@dramaticlly dramaticlly reopened this Jan 26, 2025
Copy link
Member

@szehon-ho szehon-ho left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @dramaticlly this looks better, a few nits. Sorry for delay

@szehon-ho szehon-ho merged commit e91655b into apache:main Feb 7, 2025
46 checks passed
@szehon-ho
Copy link
Member

Merged, thanks @dramaticlly !

@dramaticlly
Copy link
Contributor Author

Thank you @szehon-ho for reviewing my change!

slfan1989 added a commit to slfan1989/iceberg that referenced this pull request Mar 23, 2025
nastra pushed a commit that referenced this pull request Mar 24, 2025
… procedure (#12006 #12172 #11929 #12282 #12569) (#12568)

* [BackPort#12006] Core: Exclude deleted content file in RewriteTablePathUtil copy plan (#12006)

* [BackPort#12172] Core: Fix RewriteTablePath Incremental Replication (#12172)

* [BackPort#11929] Spark 3.5: Support Statistics Files in RewriteTablePath (#11929)

* [BackPort#12282] Spark 3.5: Fix job description of RewriteTablePathSparkAction(#12282)

* [BackPort#12569] Spark: Improve assertions for better debuggability (#12569)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants