Store extended statistics filename in Delta Lake CommitInfoEntry#29099
Open
chenjian2664 wants to merge 10 commits intotrinodb:masterfrom
Open
Store extended statistics filename in Delta Lake CommitInfoEntry#29099chenjian2664 wants to merge 10 commits intotrinodb:masterfrom
CommitInfoEntry#29099chenjian2664 wants to merge 10 commits intotrinodb:masterfrom
Conversation
41553dc to
dca45cc
Compare
dca45cc to
f617dae
Compare
findinpath
reviewed
Apr 14, 2026
findinpath
reviewed
Apr 14, 2026
findinpath
reviewed
Apr 14, 2026
| @@ -212,7 +213,7 @@ public TableStatistics getTableStatistics(ConnectorSession session, DeltaLakeTab | |||
|
|
|||
| Optional<ExtendedStatistics> statistics = Optional.empty(); | |||
| if (isExtendedStatisticsEnabled(session)) { | |||
Contributor
There was a problem hiding this comment.
i'm missing integration test coverage for this new development
we should cover situations where a commit does extended stats writing and does not write extended stats, but they are retrieved nevertheless.
also vacuuming doesn't seem to be covered in the integration tests.
Contributor
Author
There was a problem hiding this comment.
Could you please elaborate more about "commit extended stats but not write"?
7565fc9 to
fa1d083
Compare
Simplify the code by adding a Builder to MetadataAndProtocolEntries to incrementally collect `MetadataEntry` and `ProtocolEntry`.
Avoid multiple traversals of transaction log entries by using `getMetadataAndProtocolEntry` to retrieve both metadata and protocol entries in a single pass, instead of fetching them individually and scanning the log multiple times.
Previously, extended statistics were always written to a fixed path (`_delta_log/_trino_meta/extended_stats.json`), which made updates non-atomic with respect to the transaction log: a crash between writing the stats file and committing the transaction could leave stale or inconsistent statistics visible to readers. This change records the extended statistics filename in the `commitInfo`, when updating extended stats, we first write the statistics file, then leverage the commit log mechanism to handle concurrent conflicts.
fa1d083 to
1857e83
Compare
Contributor
Author
|
@findepi Would you mind to take a look when you are available :) |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Description
This change records the extended statistics filename in the
commitInfo, when updating extended stats, we first write the statistics file, then leverage the commit log mechanism to handle concurrent conflicts - in other words it will allow the concurrent read/write if the log synchronizer support it.This contribution ensures that there are unique stats files for each transaction log file (when the writing of the extended stats is enabled).
It is a fix for the existing flaky tests:
This is an alternative to #29006, but requires extensive modifications to the current implementation.
Design
According to the Delta Lake protocol specification:
https://github.com/delta-io/delta/blob/master/PROTOCOL.md#commit-provenance-information
The commitInfo action is therefore the most appropriate place to persist additional metadata during a commit.
In Trino, the
CommitInfoEntryis written for every data modification operation. This makes it a natural integration point for recording extended statistics. In contrast,MetadataEntryandProtocolEntryare not suitable for this purpose, as they are intended for schema or protocol changes and should not be rewritten on every commit solely to update statistics.The write flow for extended statistics follows the same pattern as data file writes:
CommitInfoEntry.Review
For reviewers:
CommitInfoviaDeltaLakeMetadata#getTableHandleand cache it withinTableSnapshot.CommitInfo.Limitation
Concurrency guarantees
Conflict resolution is bounded by the capabilities of the underlying log synchronizer. In multi-cluster (e.g., cloud) deployments, this support is limited and depends on the guarantees provided by the underlying storage system.
Engine-specific visibility
Extended statistics are Trino-specific. When a table is written by other engines, these statistics will not be updated or recognized, which may lead to stale statistics being used by Trino.
Fixes:
testConcurrentInsertsSelectingFromTheSameVersionedTableandtestConcurrentInsertsSelectingFromTheSameTemporalVersionedTableinTestDeltaLakeLocalConcurrentWritesTest#21725TestDeltaLakeLocalConcurrentWritesTest.testConcurrentInsertsReconciliationForMixedInsertsflaky:Failed to decode JSONfrom statistics access #22455Fixes #21725
Fixes #22455
Fixes #12032
Fixes #16088
Fixes #29064
Additional context and related issues
allowFailureflag to statistics access #29006INSERTin Delta Lake #18506 the flag was introducedRelease notes
(x) This is not user-visible or is docs only, and no release notes are required.