Skip to content

Ensure Delta Lake vacuum procedure runs on supported table versions#14579

Closed
alexjo2144 wants to merge 1 commit intotrinodb:masterfrom
alexjo2144:delta/vacuum-version
Closed

Ensure Delta Lake vacuum procedure runs on supported table versions#14579
alexjo2144 wants to merge 1 commit intotrinodb:masterfrom
alexjo2144:delta/vacuum-version

Conversation

@alexjo2144
Copy link
Copy Markdown
Member

@alexjo2144 alexjo2144 commented Oct 11, 2022

Description

Add version checks to the Delta Lake vacuum procedure. Version 4 of the Delta Lake writer specification adds new metadata files created by the change data feed, this prevents vacuuming v4 tables until the procedure is updated to account for those files.

Relates to: #12637

Non-technical explanation

Add version checks to avoid corrupting newer Delta Lake tables.

Release notes

(x) This is not user-visible or docs only and no release notes are required.
( ) Release notes are required, please propose a release note for me.
( ) Release notes are required, with the following suggested text:

@cla-bot cla-bot bot added the cla-signed label Oct 11, 2022
@alexjo2144 alexjo2144 requested review from ebyhr and homar October 11, 2022 19:54
Comment on lines +72 to +73
private static final int MAX_SUPPORTED_WRITER_VERSION = 3;
private static final int MAX_SUPPORTED_READER_VERSION = 2;
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've made these separate because I think we can advance them faster than we advance support for updates/deletes. As long as newer versions don't include new types of metadata files we can vacuum them.

import static java.lang.String.format;
import static org.assertj.core.api.Assertions.assertThatThrownBy;

public class TestDeltaLakeVacuumCompatibility
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No changes requested. I feel it would better to have vacuum execution between both Trino and Delta if we call "compatibility" tests.

@Test(groups = {DELTA_LAKE_DATABRICKS, DELTA_LAKE_OSS, PROFILE_SPECIFIC_TESTS})
public void testVacuumOnUnsupportedTableVersion()
{
String tableName = "test_dl_create_table_compat_" + randomTableSuffix();
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit:

Suggested change
String tableName = "test_dl_create_table_compat_" + randomTableSuffix();
String tableName = "test_dl_unsupported_vacuum_" + randomTableSuffix();

@ebyhr
Copy link
Copy Markdown
Member

ebyhr commented Oct 12, 2022

Version 4 of the Delta Lake writer specification adds new metadata files created by the change data feed, this prevents vacuuming v4 tables until the procedure is updated to account for those files.

https://docs.delta.io/2.1.0/delta-change-data-feed.html#change-data-storage mentions the below and it's the current behavior as far as I tested in #14590. What's preventing vacuuming v4 tables?

The files in the _change_data folder follow the retention policy of the table. Therefore, if you run the VACUUM command, change data feed data is also deleted.

@alexjo2144
Copy link
Copy Markdown
Member Author

After doing some testing with the Databricks implementation, I don't think we actually need to make this any smarter than it is for v4 tables. I thought there would be some logic needed to make sure that the CDF files for retained versions were also left in place but Databricks doesn't seem to have implemented it this way either.

Closing this issue, we can have a separate one for adding compatibility tests to ensure we have the same treatment of CDF files as Databricks does.

@alexjo2144 alexjo2144 closed this Oct 12, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Development

Successfully merging this pull request may close these issues.

3 participants