Skip to content

Conversation

@7hong
Copy link
Contributor

@7hong 7hong commented Oct 17, 2024

Description

When using the iceberg table, the xxx-metadata.json file is generated each time commit is executed. In the iceberg table, we can automatically clean the previous metadata file through configuration.

write.metadata.delete-after-commit.enabled = true
write.metadata.previous-versions-max = 10
However, the metadata file is not automatically cleaned in trino. A large number of metadata.json files are left on the hdfs.

Different from #20863 ,I have followed the configurations in iceberg (write.metadata.delete-after-commit.enabled and write.metadata.previous-version-max ). Instead of adding a new configuration, this keeps it compatible with iceberg.

Release notes

( x) Release notes are required, with the following suggested text:

## Iceberg connector

* Delete the oldest tracked version metadata files after commit. ({issue} #19582 ). 

@cla-bot cla-bot bot added the cla-signed label Oct 17, 2024
@github-actions github-actions bot added the iceberg Iceberg connector label Oct 17, 2024
@7hong 7hong changed the title Iceberg-connector: Delete the oldest tracked version metadata files a… Iceberg-connector: Delete the oldest tracked version metadata files after commit Oct 17, 2024
@mosabua mosabua removed their request for review October 18, 2024 20:13
Copy link
Member

@ebyhr ebyhr Oct 24, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need a new test class? Please use existing test classes as much as possible to avoid redundant bootstrap time.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Iceberg-connector: Delete the oldest tracked version metadata files after commit

Please follow commit message guideline. https://cbea.ms/git-commit/

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reminder, I see you changed a PR title. Please also update a commit title.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I thought the PR's title would be the title of the final merge commit. I will reopen a PR

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You don't need to reopen a PR. Please amend the commit title and push it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we use Sets.difference instead?

Comment on lines 330 to 332
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Unwrap and static import METADATA_DELETE_AFTER_COMMIT_ENABLED & METADATA_DELETE_AFTER_COMMIT_ENABLED_DEFAULT.

Comment on lines 342 to 343
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Use enhanced instanceof.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://trino.io/docs/current/develop/tests.html

Test methods should be defined as package-private.

Same for tearDown

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Parquet is the default format. Remove.

Comment on lines 97 to 102
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unwrap.

Comment on lines 110 to 113
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are these config properties related to the test purpose? Please remove if it's unrelated.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

trino/plugin/trino-iceberg/src/test/java/io/trino/plugin/iceberg/TestIcebergHiveMetastoreAutoCleanMetadataFile.java

The file path is wrong. It should be plugin/trino-iceberg/..., remove a leading trino.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should verify that the oldest metadata file was exactly removed instead of just checking the count.

@7hong 7hong changed the title Iceberg-connector: Delete the oldest tracked version metadata files after commit Delete the oldest tracked version metadata files after commit Oct 24, 2024
@7hong
Copy link
Contributor Author

7hong commented Oct 24, 2024

Thank you very much for your patient review. According to your suggestions, I have completed the revisions. Please review it again. Thank you very much. @ebyhr

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reminder, I see you changed a PR title. Please also update a commit title.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need to put this constant in fields as this is used only from testInsertWithAutoCleanMetadataFile. Please change to a local variable.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test class is for verifying file operation counts. Please move to TestIcebergV2 which has loadTable method or somewhere.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please assert metadata files in this loop.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AssertJ provides hasSize method.

Comment on lines 335 to 336
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unwrap.

Comment on lines 329 to 332
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unwrap.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rename f to file.

Jan Waś and others added 22 commits October 28, 2024 17:56
Single file without a delete in a partition can't be
optimized any further
When a memory input file is created for avro, rc, and line readers, we
need to update the length that will be passed in to the reader since the
length of the memory input file can possibly be less than the original
input file length.
While HTTP/2 was supported before, now it is used by default with
CLI as well as the JDBC driver.
Previous logic was setting all columns as non-nullable
which is not valid when the data contains nulls
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Development

Successfully merging this pull request may close these issues.

8 participants