Skip to content
Closed
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
Expand Up @@ -72,6 +72,7 @@
import static com.google.common.collect.ImmutableList.toImmutableList;
import static com.google.common.collect.ImmutableSet.toImmutableSet;
import static com.google.common.collect.Iterables.getOnlyElement;
import static com.google.common.collect.MoreCollectors.onlyElement;
import static io.trino.plugin.deltalake.DeltaLakeColumnType.REGULAR;
import static io.trino.plugin.deltalake.DeltaLakeErrorCode.DELTA_LAKE_BAD_DATA;
import static io.trino.plugin.deltalake.DeltaLakeErrorCode.DELTA_LAKE_INVALID_SCHEMA;
Expand Down Expand Up @@ -182,9 +183,20 @@ public CheckpointEntryIterator(
.map(field -> buildColumnHandle(field, checkpointSchemaManager, this.metadataEntry, this.protocolEntry).toHiveColumnHandle())
.collect(toImmutableList());

TupleDomain<HiveColumnHandle> tupleDomain = columns.size() > 1 ?
TupleDomain.all() :
buildTupleDomainColumnHandle(getOnlyElement(fields), getOnlyElement(columns));
TupleDomain<HiveColumnHandle> tupleDomain;
if (columns.size() == 1) {
tupleDomain = buildTupleDomainColumnHandle(getOnlyElement(fields), getOnlyElement(columns));
}
else if (columns.size() == 2 && fields.contains(METADATA) && fields.contains(PROTOCOL)) {
// TODO https://github.com/trinodb/trino/issues/19156 Add support for predicate pushdown for both metadata and protocol in CheckpointEntryIterator
HiveColumnHandle metadata = columns.stream()
.filter(column -> column.getBaseColumnName().equals("metadata"))
.collect(onlyElement());
tupleDomain = buildTupleDomainColumnHandle(METADATA, metadata);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it the case that whenever "protocol" is non-null, then "metadata" is also guaranteed to be non-null ?
I'm wondering if this is safe given that this filter can exclude rows where "metadata" is null but "protocol" is non-null.

Copy link
Copy Markdown
Member Author

@ebyhr ebyhr Sep 26, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"metadata" is a required field and it's not nullable as far as I read Delta protocol.

https://github.com/delta-io/delta/blob/master/PROTOCOL.md#checkpoints-1

Each row in the checkpoint corresponds to a single action. The checkpoint must contain all information regarding the following actions:

  • The protocol version
  • The metadata of the table

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sounds okay to me, would be great if someone more familiar with delta spec than me would also review this

Copy link
Copy Markdown
Contributor

@findinpath findinpath Sep 26, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ebyhr the above quote doesn't actually specify that the protocol and metadata actions are in the same Parquet file row group.
AFAIU, #17408 is about retrieving only the row groups which correspond to non-null filters.

For finding protocol or metadata there are different Parquet filters used:

case METADATA -> {
field = "id";
type = VARCHAR;
}
case PROTOCOL -> {
field = "minReaderVersion";
type = BIGINT;
}

In the absence of composed Parquet predicate pushdown, we're probably left with reading two times the checkpoint file (although this means reading actually two row groups - which are likely the same most of the time).

}
else {
tupleDomain = TupleDomain.all();
}

ReaderPageSource pageSource = ParquetPageSourceFactory.createPageSource(
checkpoint,
Expand Down