
Generate delete file statistics for DLO strategy #287

Open · wants to merge 2 commits into base: main

Conversation

@cbb330 (Collaborator) commented Feb 15, 2025

Summary

problem: delete files are accumulating across tables due to Trino's default write behavior for UPDATE/MERGE/DELETE statements

solution:

  1. (this PR) - record delete file count/size per table and persist to table ✅
  2. let the DLO strategy execute for many days to accumulate data 🕐
  3. analyze the optimization algorithm and update the model to discover and act on tables that would benefit most from compacting their delete files 🕐
  4. (future PR) - codify the algorithm updates into DLO strategy 🕐

outcome: tables accumulating delete files will be discovered and compacted appropriately, preventing degraded query performance
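The per-table statistics in step 1 reduce to four aggregates: position/equality delete file counts and bytes. A minimal sketch of that aggregation, using hypothetical record types rather than the real Iceberg metadata API:

```java
import java.util.List;

// Hypothetical stand-in for a delete file entry read from table metadata;
// Iceberg distinguishes position deletes from equality deletes.
record DeleteFile(boolean isPositional, long sizeInBytes) {}

// The four aggregates this PR persists per table.
record DeleteFileStats(long posCount, long eqCount, long posBytes, long eqBytes) {

  // Single pass over the table's delete files, bucketing by delete type.
  static DeleteFileStats of(List<DeleteFile> files) {
    long posCount = 0, eqCount = 0, posBytes = 0, eqBytes = 0;
    for (DeleteFile f : files) {
      if (f.isPositional()) {
        posCount++;
        posBytes += f.sizeInBytes();
      } else {
        eqCount++;
        eqBytes += f.sizeInBytes();
      }
    }
    return new DeleteFileStats(posCount, eqCount, posBytes, eqBytes);
  }
}
```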

bonus content ✨ :

I noticed there is a gap in DLO testing because manual validation must be done to verify the final form and contents of the DLO table, e.g. #284

so:

  1. will add a functional test to IntegrationTest by moving the DLO output table logic into StrategiesDAOInternal with save/load methods, and validate that DLO finishes and outputs a table with the expected contents and shape 🕐
  2. will add an acceptance test in the internal app to validate that the schema is evolved correctly on the actual tables 🕐

this will help automate the testing for future changes and prevent bad changes from occurring due to validation being done manually
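Since StrategiesDAOInternal and its save/load methods are still planned, here is only a sketch of the round-trip shape that functional test would take, using an in-memory stand-in with hypothetical signatures:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// In-memory stand-in for the planned StrategiesDAOInternal; the real save/load
// signatures are not finalized, so this only illustrates the round-trip test.
class InMemoryStrategiesDao {
  private final Map<String, String> store = new HashMap<>();

  // Persist the serialized strategy row for a fully qualified table name.
  void save(String fqtn, String serializedStrategy) {
    store.put(fqtn, serializedStrategy);
  }

  // Load it back; empty if the table was never saved.
  Optional<String> load(String fqtn) {
    return Optional.ofNullable(store.get(fqtn));
  }
}
```

The test would save a strategy, load it back, and assert the contents match, instead of manually inspecting the output table.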

Changes

  • Client-facing API Changes
  • Internal API Changes
  • Bug Fixes
  • New Features
  • Performance Improvements
  • Code Style
  • Refactoring
  • Documentation
  • Tests

Testing Done

  • Manually Tested on local docker setup. Please include commands ran, and their output.
  • Added new tests for the changes made.
  • Updated existing tests to reflect the changes made.
  • No tests added or updated. Please explain why. If unsure, please feel free to ask for help.
  • Some other form of testing like staging or soak time in production. Please explain.

Additional Information

  • Breaking Changes
  • Deprecations
  • Large PR broken into smaller PRs, and PR plan linked in the description.

@cbb330 force-pushed the main branch 3 times, most recently from 1898935 to 5867db2 on February 15, 2025 at 22:51
@teamurko (Collaborator) left a comment

Thank you, Christian. Looks good! Left a few questions

Comment on lines +90 to +93
strategy.getPosDeleteFileCount(),
strategy.getEqDeleteFileCount(),
strategy.getPosDeleteFileBytes(),
strategy.getEqDeleteFileBytes()));
Collaborator:

Should it be a nested struct?

Collaborator (Author):

I see that these values are related, but a con is that a struct would add more steps to ser/de.

Is there another pro?
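For context on the trade-off being discussed, a sketch contrasting the flat layout in this PR with the nested alternative (field names are illustrative, not the actual schema):

```java
// Flat layout, as in this PR: four top-level fields, one-step ser/de.
record FlatStrategy(
    long posDeleteFileCount,
    long eqDeleteFileCount,
    long posDeleteFileBytes,
    long eqDeleteFileBytes) {}

// Nested alternative: groups the related values under one struct, at the
// cost of an extra level to wrap and unwrap during ser/de.
record DeleteStats(long posCount, long eqCount, long posBytes, long eqBytes) {}
record NestedStrategy(DeleteStats deleteStats) {}
```

A pro of nesting is that new delete-file fields can be added in one place without widening the top-level schema; the con, as noted, is the extra ser/de step.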

+ "PARTITIONED BY (days(timestamp))",
outputFqtn));
}
// TODO: is StrategiesDaoTableProps supposed to be here?
Collaborator (Author):

@jiang95-dev did we forget to call StrategiesDaoTableProps.save in the partitioned table condition here, or was it intentional?
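For readers without the full diff: the snippet above is the tail of a CREATE TABLE statement built with String.format. A hedged reconstruction of how such DDL is typically assembled (the column list here is illustrative and may differ from the real persistence code; outputFqtn is the fully qualified output table name):

```java
// Illustrative reconstruction of the DDL around the snippet above; the real
// column list lives in the DLO persistence code and may differ.
class StrategiesDdl {
  static String createTable(String outputFqtn) {
    return String.format(
        "CREATE TABLE IF NOT EXISTS %s ("
            + "fqtn STRING, strategy STRING, timestamp TIMESTAMP) "
            // Day partitioning keeps one partition per run date, so older
            // strategy rows can be pruned or expired cheaply.
            + "PARTITIONED BY (days(timestamp))",
        outputFqtn);
  }
}
```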

@@ -0,0 +1,131 @@
package com.linkedin.openhouse.datalayout.persistence;
Collaborator:

Can we do DAO refactor into a separate PR and not mix with the delete file stats PR?

Collaborator:

In fact, let's do the delete file stats PR first and then the DAO refactor; we will need to think more about it, and it will take longer for that to land.

@cbb330 (Collaborator, Author) commented Feb 18, 2025:

OK, will take this out of this PR.

> will need to think more about it and will take longer for that to land.

I'd like to follow up with the refactor shortly after this PR deploys and works. What's the concern? The refactor is intended not to change behavior.

Collaborator:

No major concern.

@sumedhsakdeo (Collaborator) left a comment:

Reduce the scope of this PR to not include the DAO refactor.

3 participants