
Conversation

@homar
Member

@homar homar commented Jan 10, 2023

Description

Depends on #16205
Add a test making sure vacuum doesn't delete CDF files that are not old enough for cleanup

Additional context and related issues

Previous implementation deleted all CDF files no matter their age.

Release notes

(x) This is not user-visible or docs only and no release notes are required.
( ) Release notes are required, please propose a release note for me.
( ) Release notes are required, with the following suggested text:

@cla-bot cla-bot bot added the cla-signed label Jan 10, 2023
@homar homar changed the title Homar/support cdf in vacuum in delta lake Modify vacuum so it only deletes cdf file older then threshold Jan 11, 2023
@homar homar marked this pull request as ready for review January 11, 2023 09:26
Member

@alexjo2144 alexjo2144 left a comment


Only reviewing the last commit.

If my memory is right that checkpoint files don't contain CDC references, this is not going to work. If the most recent commit was checkpointed getJsonTransactionLogEntries is going to return an empty list and you'll miss some valid CDC files.

I would argue that what we have now is good enough, comparing the CDC file timestamp against the cutoff. It doesn't ensure that we keep the CDC files for the most recent checkpoint, but I don't think that's a big deal.
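The timestamp-vs-cutoff check described in the comment above can be sketched roughly as follows. This is an illustrative sketch only, not Trino's actual vacuum code; the names `CdcFile`, `filesToDelete`, and `retentionCutoff` are hypothetical:

```java
import java.time.Instant;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical stand-in for a CDC (change data feed) file with its modification time.
record CdcFile(String path, Instant modificationTime) {}

class CdfVacuumSketch {
    // Keep a CDC file unless it is older than the retention cutoff.
    static List<CdcFile> filesToDelete(List<CdcFile> cdcFiles, Instant retentionCutoff) {
        return cdcFiles.stream()
                // Older than the cutoff -> eligible for deletion; newer files are retained.
                .filter(file -> file.modificationTime().isBefore(retentionCutoff))
                .collect(Collectors.toList());
    }
}
```

As the reviewer notes, a pure timestamp comparison does not guarantee that CDC files referenced by the most recent checkpointed commit are kept; it only guarantees that nothing newer than the cutoff is deleted.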

@github-actions

This pull request has gone a while without any activity. Tagging the Trino developer relations team: @bitsondatadev @colebow @mosabua

@github-actions github-actions bot added the stale label Feb 20, 2023
@bitsondatadev
Member

@homar, is this still being actively worked on?

@homar
Member Author

homar commented Feb 21, 2023

@homar, is this still being actively worked on?

Not actively, but I will work on this soon, this or next week.

@bitsondatadev
Member

Sounds good :) I'll leave it alone!

@github-actions github-actions bot removed the stale label Feb 21, 2023
kasiafi and others added 11 commits February 27, 2023 11:09
TableFunctionProcessorNode has a mapping from source symbols
to marker symbols. When the plan is optimized with UnaliasSymbolReferences,
some source symbols, and their corresponding marker symbols,
might be recognized as semantically identical, and rewritten
to the same symbols.
Before this change, building the symbol-to-marker map failed
in such a case, because the map collector in use could not resolve
conflicts.
After this change, duplicate entries are accepted provided that
equal keys have equal values (markers) assigned.
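The duplicate-tolerant map collection described in this commit message can be sketched generically like this (an illustration of the technique, not the actual Trino code; `MarkerMapSketch` and `requireEqual` are made-up names):

```java
import java.util.Map;
import java.util.function.BinaryOperator;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class MarkerMapSketch {
    // Merge function that tolerates duplicate keys as long as both entries
    // carry the same value; genuinely conflicting values still fail fast.
    static <T> BinaryOperator<T> requireEqual() {
        return (left, right) -> {
            if (!left.equals(right)) {
                throw new IllegalStateException(
                        "conflicting values for the same key: " + left + " vs " + right);
            }
            return left;
        };
    }

    // Collect (source symbol -> marker symbol) entries, accepting duplicates
    // produced when UnaliasSymbolReferences rewrites symbols to the same name.
    static Map<String, String> buildMarkerMap(Stream<Map.Entry<String, String>> entries) {
        return entries.collect(
                Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue, requireEqual()));
    }
}
```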
TableFunctionProcessorNode specifies pass-through symbols for all sources.
When the plan is optimized with UnaliasSymbolReferences, some pass-through
symbols from one or more sources might be recognized as semantically identical,
and rewritten to the same symbol.
Before this change, all the rewritten symbols remained in the pass-through lists.
If duplicates occurred, the LocalExecutionPlanner failed while
computing the output mapping for the TableFunctionProcessorNode.
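The deduplication of rewritten pass-through symbols can be illustrated as follows (simplified sketch using plain strings in place of Trino's Symbol type; `PassThroughDedupSketch` is a made-up name):

```java
import java.util.List;

class PassThroughDedupSketch {
    // After unaliasing, pass-through symbols from different sources may collapse
    // to the same name; keep only the first occurrence, preserving order, so the
    // downstream output mapping sees each symbol exactly once.
    static List<String> deduplicate(List<String> passThroughSymbols) {
        return passThroughSymbols.stream().distinct().toList();
    }
}
```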
Adds support for table functions that have table inputs.
Execution by operator for table functions without sources
is not yet implemented.
Before this change, if the function had KEEP WHEN EMPTY property,
and all input relations were empty, the property was not respected,
and the function was not executed.

This change introduces EmptyTableFunctionPartition for executing the
function when there is no input.
It is ensured that the function with empty input is executed once per node,
and that there is a single-node distribution in case of the KEEP WHEN EMPTY
property.
Rename the existing TableFunctionProcessor to TableFunctionDataProcessor,
and introduce another interface for processing splits.
This change adds the LeafTableFunctionOperator for processing
table functions that do not take input (without table arguments).
Splits are supported.
…elta lake

table_changes allows reading the stream of CDF entries.
startVersion and endVersion are optional.
If startVersion is not provided, the function starts reading entries from the first one.
startVersion is exclusive, endVersion is inclusive.
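The version-range semantics above (exclusive startVersion, inclusive endVersion, with an absent startVersion meaning "from the first entry") can be sketched as a simple predicate. This is a hypothetical helper for illustration, not Trino's implementation:

```java
import java.util.OptionalLong;

class TableChangesRangeSketch {
    // startVersion is exclusive; when absent, reading starts from the first entry.
    // endVersion is inclusive; when absent, reading continues to the latest entry.
    static boolean inRange(long commitVersion, OptionalLong startVersion, OptionalLong endVersion) {
        if (startVersion.isPresent() && commitVersion <= startVersion.getAsLong()) {
            return false; // at or before the exclusive lower bound
        }
        return endVersion.isEmpty() || commitVersion <= endVersion.getAsLong();
    }
}
```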
@homar homar force-pushed the homar/support_cdf_in_vacuum_in_delta_lake branch from a643df0 to 0834a92 Compare February 27, 2023 17:28
@homar homar changed the title Modify vacuum so it only deletes cdf file older then threshold Add test for vacuum with CDF files Feb 27, 2023
@github-actions github-actions bot added the docs label Feb 27, 2023
@homar homar requested a review from alexjo2144 February 27, 2023 21:07
@alexjo2144
Member

@homar do you want to convert this to a Databricks compatibility test so that it doesn't depend on your other PR?
Or just include it in the table_changes PR?

@homar
Member Author

homar commented Mar 7, 2023

I will include it in table_changes

@findepi
Member

findepi commented Mar 27, 2023

I will include it in table_changes

should this one be closed now?

@homar homar closed this Mar 28, 2023
@homar homar deleted the homar/support_cdf_in_vacuum_in_delta_lake branch March 28, 2023 11:35
