Add test for vacuum with CDF files #15663
Conversation
Only reviewing the last commit.
If my memory is right that checkpoint files don't contain CDC references, this is not going to work. If the most recent commit was checkpointed, getJsonTransactionLogEntries is going to return an empty list and you'll miss some valid CDC files.
I would argue that what we have now is good enough, comparing the CDC file timestamp against the cutoff. It doesn't ensure that we keep the CDC files for the most recent checkpoint, but I don't think that's a big deal.
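For illustration, a minimal sketch of the cutoff comparison the reviewer describes, with hypothetical names; this is not the actual Trino code:

```java
import java.time.Instant;

// A minimal sketch (hypothetical names, not the actual Trino code) of the
// retention check discussed above: a CDC file is eligible for deletion only
// if it is older than the vacuum retention cutoff.
class CdcRetention
{
    static boolean isEligibleForVacuum(Instant cdcFileModificationTime, Instant retentionCutoff)
    {
        // Files at or after the cutoff are still within the retention period
        // and are kept, without consulting transaction log references.
        return cdcFileModificationTime.isBefore(retentionCutoff);
    }
}
```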
Resolved review threads (outdated):
...ake/src/main/java/io/trino/plugin/deltalake/transactionlog/DeltaLakeTransactionLogEntry.java
...c/main/java/io/trino/plugin/deltalake/transactionlog/checkpoint/CheckpointEntryIterator.java
This pull request has gone a while without any activity. Tagging the Trino developer relations team: @bitsondatadev @colebow @mosabua

@homar, is this still being actively worked on?

Not actively, but I will work on this soon, this or next week.

Sounds good :) I'll leave it alone!
TableFunctionProcessorNode has a mapping from source symbols to marker symbols. When the plan is optimized with UnaliasSymbolReferences, some source symbols, and their corresponding marker symbols, might be recognized as semantically identical and rewritten to the same symbols. Before this change, building the symbol-to-marker map failed in such a case, because the map collector in use could not resolve conflicts. After this change, duplicate entries are accepted, provided that equal keys have equal values (markers) assigned.
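For illustration, a minimal, self-contained sketch of the merge-function approach the commit message describes; the names here are hypothetical and this is not the actual Trino code:

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// A minimal sketch (not the actual Trino code) of the fix: after unaliasing,
// two entries may share the same source symbol, so the collector needs a
// merge function that accepts duplicates only when the markers agree.
class MarkerMap
{
    record SymbolMarker(String source, String marker) {}

    static Map<String, String> buildMarkerMap(List<SymbolMarker> pairs)
    {
        return pairs.stream()
                .collect(Collectors.toMap(
                        SymbolMarker::source,
                        SymbolMarker::marker,
                        (left, right) -> {
                            if (!left.equals(right)) {
                                throw new IllegalStateException("duplicate source symbol mapped to different markers");
                            }
                            return left;
                        }));
    }
}
```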
TableFunctionProcessorNode specifies pass-through symbols for all sources. When the plan is optimized with UnaliasSymbolReferences, some pass-through symbols from one or more sources might be recognized as semantically identical and rewritten to the same symbol. Before this change, all the rewritten symbols remained in the pass-through lists, and the duplicates caused a failure in the LocalExecutionPlanner while computing the output mapping for the TableFunctionProcessorNode.
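A minimal sketch of the corresponding pass-through fix, again with hypothetical names rather than the actual Trino code:

```java
import java.util.List;

// A minimal sketch (not the actual Trino code): after symbol unaliasing,
// drop duplicates from each source's pass-through list so the
// LocalExecutionPlanner computes one output channel per symbol.
class PassThroughSymbols
{
    static List<String> deduplicate(List<String> passThroughSymbols)
    {
        return passThroughSymbols.stream().distinct().toList();
    }
}
```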
Adds support for table functions that have table inputs. Execution by operator for table functions without sources is not yet implemented.
Before this change, if the function had the KEEP WHEN EMPTY property and all input relations were empty, the property was not respected and the function was not executed. This change introduces EmptyTableFunctionPartition to execute the function when there is no input. It is ensured that the function with empty input is executed once per node, and that there is single-node distribution in the KEEP WHEN EMPTY case.
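A hedged sketch of the empty-input dispatch described above; apart from EmptyTableFunctionPartition, which the commit message names, the structure and names here are assumptions:

```java
// A hedged sketch of the KEEP WHEN EMPTY handling; everything except
// EmptyTableFunctionPartition is an assumption, not the actual Trino code.
class EmptyInputHandling
{
    interface Partition {}

    // Stand-in for the class introduced by this change.
    static class EmptyTableFunctionPartition implements Partition {}

    static void execute(boolean allInputsEmpty, boolean keepWhenEmpty)
    {
        if (allInputsEmpty && keepWhenEmpty) {
            // KEEP WHEN EMPTY: the function still runs exactly once,
            // on a single node, over an empty partition.
            process(new EmptyTableFunctionPartition());
        }
        // Otherwise (PRUNE WHEN EMPTY with empty input), the function
        // is not executed at all.
    }

    static void process(Partition partition) {}
}
```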
Rename the existing TableFunctionProcessor to TableFunctionDataProcessor, and introduce another interface for processing splits.
This change adds the LeafTableFunctionOperator for processing table functions that do not take input (without table arguments). Splits are supported.
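A hedged sketch of the resulting interface split; TableFunctionDataProcessor and LeafTableFunctionOperator are named in the commit messages, but the split-processor name, the method shapes, and the stub types are assumptions:

```java
import java.util.List;
import java.util.Optional;

// A hedged sketch, not the actual Trino SPI: Page, ConnectorSplit, and
// TableFunctionProcessorState are stubs standing in for the engine's types.
class TableFunctionInterfaces
{
    interface Page {}
    interface ConnectorSplit {}
    interface TableFunctionProcessorState {}

    // Renamed from TableFunctionProcessor: consumes pages of input from the
    // function's table arguments.
    interface TableFunctionDataProcessor
    {
        TableFunctionProcessorState process(List<Optional<Page>> input);
    }

    // New interface (name assumed) for table functions without table
    // arguments, driven by splits in the LeafTableFunctionOperator.
    interface TableFunctionSplitProcessor
    {
        TableFunctionProcessorState process(ConnectorSplit split);
    }
}
```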
…elta lake table_changes allows reading the stream of CDF entries. startVersion and endVersion are optional. If startVersion is not provided, the function starts reading entries from the first one. startVersion is exclusive, endVersion is inclusive.
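A minimal sketch of the version-range semantics just described, assuming hypothetical names; not the actual Trino code:

```java
import java.util.OptionalLong;

// A minimal sketch (not the actual Trino code): startVersion is exclusive,
// endVersion is inclusive, and an absent startVersion means "read from the
// first entry".
class VersionRange
{
    static boolean isIncluded(long version, OptionalLong startVersion, long endVersion)
    {
        boolean afterStart = startVersion.isEmpty() || version > startVersion.getAsLong();
        return afterStart && version <= endVersion;
    }
}
```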
Force-pushed from a643df0 to 0834a92
@homar do you want to convert this to a Databricks compatibility test so that it doesn't depend on your other PR?
I will include it in table_changes |
Should this one be closed now?
Description
Depends on #16205
Add a test making sure vacuum doesn't delete CDF files that are not old enough for cleanup.
Additional context and related issues
The previous implementation deleted all CDF files regardless of their age.
Release notes
(x) This is not user-visible or docs only and no release notes are required.
( ) Release notes are required, please propose a release note for me.
( ) Release notes are required, with the following suggested text: