Restore predicate pushdown of metadata field in CheckpointEntryIterator #19157
ebyhr wants to merge 1 commit into trinodb:master
Conversation
    HiveColumnHandle metadata = columns.stream()
            .filter(column -> column.getBaseColumnName().equals("metadata"))
            .collect(onlyElement());
    tupleDomain = buildTupleDomainColumnHandle(METADATA, metadata);
Is it the case that whenever "protocol" is non-null, "metadata" is also guaranteed to be non-null?
I'm wondering whether this is safe, given that this filter can exclude rows where "metadata" is null but "protocol" is non-null.
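To make the concern concrete, here is a minimal sketch of a "metadata is not null" filter applied to rows that each carry a single action. It is not Trino code: `CheckpointFilterSketch`, `Row`, `keepNonNullMetadata`, and `containsProtocol` are hypothetical names invented for illustration, and the strings stand in for real action payloads.

```java
import java.util.ArrayList;
import java.util.List;

public class CheckpointFilterSketch {
    // Hypothetical stand-in for one checkpoint row; only one field is set per row.
    static final class Row {
        final String metadata; // null when the row is not a metadata action
        final String protocol; // null when the row is not a protocol action

        Row(String metadata, String protocol) {
            this.metadata = metadata;
            this.protocol = protocol;
        }
    }

    // Models a pushed-down "metadata IS NOT NULL" predicate at row level.
    static List<Row> keepNonNullMetadata(List<Row> rows) {
        List<Row> kept = new ArrayList<>();
        for (Row row : rows) {
            if (row.metadata != null) {
                kept.add(row);
            }
        }
        return kept;
    }

    // True if any row in the list carries a protocol action.
    static boolean containsProtocol(List<Row> rows) {
        for (Row row : rows) {
            if (row.protocol != null) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<Row> rows = List.of(
                new Row("metadata-action", null),
                new Row(null, "protocol-action"));
        List<Row> filtered = keepNonNullMetadata(rows);
        System.out.println("rows kept: " + filtered.size());
        System.out.println("protocol still visible: " + containsProtocol(filtered));
    }
}
```

Under these assumptions the protocol row does not survive the metadata filter, so a reader that needs both entries cannot rely on that single filter alone.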
"metadata" is a required field and, as far as I can tell from the Delta protocol, it is not nullable.
https://github.com/delta-io/delta/blob/master/PROTOCOL.md#checkpoints-1
Each row in the checkpoint corresponds to a single action. The checkpoint must contain all information regarding the following actions:
- The protocol version
- The metadata of the table
Sounds okay to me. It would be great if someone more familiar with the Delta spec than I am also reviewed this.
@ebyhr the above quote doesn't actually specify that the protocol and metadata actions are in the same Parquet file row group.
AFAIU, #17408 is about retrieving only the row groups which correspond to non-null filters.
For finding protocol or metadata there are different Parquet filters used:
In the absence of composed Parquet predicate pushdown, we're probably left with reading the checkpoint file twice (although this actually means reading two row groups, which are most likely the same one).
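The two-pass idea can be sketched at row-group granularity. This is a simplified model rather than Trino's or Parquet's API: `TwoPassCheckpointSketch`, `RowGroup`, and the two selection methods are hypothetical, and the boolean flags stand in for whatever column statistics a reader would consult to decide whether a row group can hold non-null values.

```java
import java.util.ArrayList;
import java.util.List;

public class TwoPassCheckpointSketch {
    // Hypothetical summary of one row group's statistics.
    static final class RowGroup {
        final int id;
        final boolean hasMetadata; // some row has a non-null "metadata"
        final boolean hasProtocol; // some row has a non-null "protocol"

        RowGroup(int id, boolean hasMetadata, boolean hasProtocol) {
            this.id = id;
            this.hasMetadata = hasMetadata;
            this.hasProtocol = hasProtocol;
        }
    }

    // Pass 1: row groups selected by a "metadata IS NOT NULL" filter.
    static List<Integer> groupsWithMetadata(List<RowGroup> groups) {
        List<Integer> ids = new ArrayList<>();
        for (RowGroup group : groups) {
            if (group.hasMetadata) {
                ids.add(group.id);
            }
        }
        return ids;
    }

    // Pass 2: row groups selected by a "protocol IS NOT NULL" filter.
    static List<Integer> groupsWithProtocol(List<RowGroup> groups) {
        List<Integer> ids = new ArrayList<>();
        for (RowGroup group : groups) {
            if (group.hasProtocol) {
                ids.add(group.id);
            }
        }
        return ids;
    }

    public static void main(String[] args) {
        // Likely case: both actions live in the same row group.
        List<RowGroup> typical = List.of(
                new RowGroup(0, true, true),
                new RowGroup(1, false, false));
        // Possible case: the actions land in different row groups.
        List<RowGroup> split = List.of(
                new RowGroup(0, true, false),
                new RowGroup(1, false, true),
                new RowGroup(2, false, false));

        System.out.println(groupsWithMetadata(typical) + " " + groupsWithProtocol(typical));
        System.out.println(groupsWithMetadata(split) + " " + groupsWithProtocol(split));
    }
}
```

In the split case, a single pass filtered only on metadata would never visit the row group that holds the protocol entry, which is why each field gets its own pass here. Row groups with neither entry are skipped by both passes.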
Description
#18423 caused a regression of #17408.
This is a short-term solution until #19156 is addressed.
Release notes
(x) This is not user-visible or is docs only, and no release notes are required.