Skip to content

Add test for vacuum procedure with CDF table in Delta#14590

Merged
ebyhr merged 1 commit intomasterfrom
ebi/delta-vacuum-cdf
Oct 27, 2022
Merged

Add test for vacuum procedure with CDF table in Delta#14590
ebyhr merged 1 commit intomasterfrom
ebi/delta-vacuum-cdf

Conversation

@ebyhr
Copy link
Copy Markdown
Member

@ebyhr ebyhr commented Oct 12, 2022

Description

Relates to #12637

Release notes

(x) This is not user-visible or docs only and no release notes are required.

@ebyhr ebyhr added the no-release-notes This pull request does not require release notes entry label Oct 12, 2022
@ebyhr ebyhr requested review from alexjo2144 and homar October 12, 2022 12:12
Comment on lines +327 to +348
// Vacuum procedure should remove files in _change_data directory
// https://docs.delta.io/2.1.0/delta-change-data-feed.html#change-data-storage
onTrino().executeQuery("SET SESSION delta.vacuum_min_retention = '0s'");
onTrino().executeQuery("CALL delta.system.vacuum('default', '" + tableName + "', '0s')");

Assertions.assertThat(s3.listObjectsV2(bucketName, changeDataPrefix).getObjectSummaries()).hasSize(0);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should or shouldn't remove?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My gut feeling is that CDF files which are referenced from transaction log versions which are retained should also be retained.

So in this case, we vacuum everything except for the most recent version of the table, there should still be one CDF file left after the vacuum.

Does that seem like the right behavior?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we could see what databricks is doing and do the same thing? Or have you checked it already?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My gut feeling is that CDF files which are referenced from transaction log versions which are retained should also be retained.

I just tried this out on Databricks 10.4 and this is not what they do. If you have the retention time set to zero they delete all of the CDF files.

I guess this is fine then

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or have you checked it already?

Yes, I confirmed Databricks behavior before sending this PR. Added another test which runs on Databricks just in case.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just tried this out on Databricks 10.4 and this is not what they do. If you have the retention time set to zero they delete all of the CDF files.

😮
so they treat CDF files just like any other untracked files?
@vkorukanti can you please confirm?

@ebyhr ebyhr force-pushed the ebi/delta-vacuum-cdf branch from e4faeb3 to fd7dcaa Compare October 12, 2022 22:41
@cla-bot cla-bot bot added the cla-signed label Oct 12, 2022
@ebyhr ebyhr self-assigned this Oct 12, 2022
@ebyhr ebyhr mentioned this pull request Oct 14, 2022
6 tasks
@ebyhr ebyhr force-pushed the ebi/delta-vacuum-cdf branch from c9bd3c8 to 4df2713 Compare October 26, 2022 01:06
@ebyhr ebyhr force-pushed the ebi/delta-vacuum-cdf branch from 4df2713 to f124b39 Compare October 26, 2022 03:28
@ebyhr ebyhr requested a review from findepi October 26, 2022 06:38
@ebyhr ebyhr merged commit 8867def into master Oct 27, 2022
@ebyhr ebyhr deleted the ebi/delta-vacuum-cdf branch October 27, 2022 03:06
@github-actions github-actions bot added this to the 402 milestone Oct 27, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cla-signed no-release-notes This pull request does not require release notes entry

Development

Successfully merging this pull request may close these issues.

5 participants