
Don't truncate min/max for iceberg delete files #12671

Merged
findepi merged 1 commit into trinodb:master from homar:homar/cleanup_unused_delete_files_during_iceberg_optimize on Jun 8, 2022

Conversation

@homar
Member

@homar homar commented Jun 2, 2022

Description

Previously, min/max values were truncated to very short values, which may have had a very bad impact on performance.

Related issues, pull requests, and links

Documentation

( ) No documentation is needed.
( ) Sufficient documentation is included in this PR.
( ) Documentation PR is available with #prnumber.
( ) Documentation issue #issuenumber is filed, and can be handled later.

Release notes

( ) No release notes entries required.
( ) Release notes entries required with the following suggested text:

# Section
* Fix some things. ({issue}`issuenumber`)

@cla-bot cla-bot bot added the cla-signed label Jun 2, 2022
@homar
Member Author

homar commented Jun 2, 2022

I used the FULL_METRIC_CONFIG metrics setting for Parquet to allow storing full paths in min/max. It was impossible to do the same for ORC, so I decided to put a hard limit there instead.
Does it make sense to do it this way, or should I also put a hard limit on the length for Parquet to keep it consistent?
@alexjo2144 @findepi
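
For context, Apache Iceberg exposes metric truncation as a documented table property, `write.metadata.metrics.default`; the snippet below is an assumption based on the Iceberg docs, not something taken from this PR:

```properties
# Iceberg's default truncates column min/max metrics to 16 bytes
write.metadata.metrics.default=truncate(16)
# "full" keeps untruncated min/max values instead
# write.metadata.metrics.default=full
```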

@homar homar requested review from alexjo2144 and findepi and removed request for findepi June 2, 2022 21:04
@homar homar changed the title from "Homar/cleanup unused delete files during iceberg optimize" to "Cleanup unused delete files during iceberg optimize" on Jun 2, 2022
@homar homar changed the title from "Cleanup unused delete files during iceberg optimize" to "Don't truncate min/max for iceberg delete files" on Jun 3, 2022
Member

We never want to truncate the stats path, right? So could we do `DataSize.ofBytes(Long.MAX_VALUE)` here?

Member Author

hmm that makes sense, I will change it

Member Author

Actually, it has to be `Integer.MAX_VALUE`.
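
A minimal standalone sketch of why this is the case (hypothetical names, not the actual Trino writer code): if the statistics limit is ultimately narrowed to an `int`, a `Long.MAX_VALUE` byte count cannot be represented, so `Integer.MAX_VALUE` is the largest safe "effectively unlimited" value:

```java
// Hypothetical sketch, not Trino's actual code: the writer's string-stats
// limit is an int, so Long.MAX_VALUE overflows when narrowed, while
// Integer.MAX_VALUE behaves as an effectively unlimited length.
public class StatsLimitSketch
{
    // Narrow a byte count to the int limit a writer expects, rejecting overflow
    static int toIntLimit(long bytes)
    {
        if (bytes > Integer.MAX_VALUE) {
            throw new IllegalArgumentException("limit overflows int: " + bytes);
        }
        return (int) bytes;
    }

    // Truncate a min/max string value to the given limit
    static String truncate(String value, int limit)
    {
        return value.length() <= limit ? value : value.substring(0, limit);
    }

    public static void main(String[] args)
    {
        int limit = toIntLimit(Integer.MAX_VALUE); // fine: fits in an int
        String path = "s3://bucket/warehouse/table/data/delete-00042.parquet";
        // With an effectively unlimited limit, the full path survives as a stat
        System.out.println(truncate(path, limit).equals(path)); // prints true
        // toIntLimit(Long.MAX_VALUE) would throw: the value does not fit in an int
    }
}
```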

Member

Maybe it's cleaner to not have the stats limit be optional in `createOrcWriter`, and instead get it from the session here, so we always have a value.
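
A minimal sketch of that suggestion (the names and the default are assumptions for illustration, not Trino's actual API): resolve the limit from the session at the call site, so the writer always receives a concrete value instead of an `Optional`:

```java
import java.util.Optional;

// Hypothetical sketch: push Optional handling to the caller so the writer
// signature takes a plain int limit that is always present.
public class SessionLimitSketch
{
    static final int SESSION_DEFAULT_LIMIT = 64; // assumed session-level default

    // Caller resolves the limit from the session (or an override) up front
    static int resolveFromSession(Optional<Integer> override)
    {
        return override.orElse(SESSION_DEFAULT_LIMIT);
    }

    // Writer no longer deals with Optional at all
    static String writeStat(String value, int limit)
    {
        return value.length() <= limit ? value : value.substring(0, limit);
    }

    public static void main(String[] args)
    {
        int limit = resolveFromSession(Optional.empty());
        System.out.println(limit); // prints 64: the session default always applies
        System.out.println(writeStat("short-value", limit)); // prints short-value
    }
}
```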

Member Author

ok!

Member

Is there a GitHub issue to support metrics mode for Parquet? If not, can you make one?

Either way, link it here.

@homar homar force-pushed the homar/cleanup_unused_delete_files_during_iceberg_optimize branch 2 times, most recently from 6168f68 to c242315 on June 4, 2022 07:23
@homar homar force-pushed the homar/cleanup_unused_delete_files_during_iceberg_optimize branch from c242315 to 582a1b1 on June 4, 2022 09:09
@findepi findepi requested a review from alexjo2144 June 4, 2022 19:11
@findepi findepi merged commit b385185 into trinodb:master Jun 8, 2022
@findepi findepi mentioned this pull request Jun 8, 2022
@github-actions github-actions bot added this to the 385 milestone Jun 8, 2022