
Accelerate Iceberg when reading partition columns only#19303

Merged
findepi merged 1 commit into trinodb:master from findepi:findepi/iceberg-count-only on Oct 16, 2023

@findepi findepi commented Oct 6, 2023

Description

Avoid data file I/O when:

  • reading only partitioning columns
  • running count(*) queries on Iceberg tables (with no GROUP BY, or grouping by partitioning columns)
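
As a rough illustration of the idea in the description, the sketch below answers count(*) from per-file metadata alone. It is not the code in this PR: `FileEntry`, `countAll`, and `countByPartition` are hypothetical names, and the sketch assumes each manifest entry exposes a partition value and a record count, and that no delete files apply to the table.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ManifestCountSketch {
    // Hypothetical stand-in for one manifest entry: the partition value
    // and the record count recorded for a single data file.
    static final class FileEntry {
        final String partition;
        final long recordCount;

        FileEntry(String partition, long recordCount) {
            this.partition = partition;
            this.recordCount = recordCount;
        }
    }

    // count(*) with no GROUP BY: sum the recorded counts, no data file is opened.
    static long countAll(List<FileEntry> entries) {
        long total = 0;
        for (FileEntry entry : entries) {
            total += entry.recordCount;
        }
        return total;
    }

    // count(*) grouped by the partitioning column: sum the counts per partition value.
    static Map<String, Long> countByPartition(List<FileEntry> entries) {
        Map<String, Long> counts = new LinkedHashMap<>();
        for (FileEntry entry : entries) {
            counts.merge(entry.partition, entry.recordCount, Long::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<FileEntry> entries = List.of(
                new FileEntry("2023-10-01", 10),
                new FileEntry("2023-10-01", 20),
                new FileEntry("2023-10-02", 30));
        System.out.println(countAll(entries));          // prints 60
        System.out.println(countByPartition(entries));  // prints {2023-10-01=30, 2023-10-02=30}
    }
}
```

Metadata-only counting of this kind is only safe when the recorded counts match the visible rows, which is presumably why the condition quoted later in this thread also requires that a file has no deletes.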

@findepi findepi requested review from alexjo2144 and ebyhr October 6, 2023 21:27
@cla-bot cla-bot bot added the cla-signed label Oct 6, 2023
@findepi findepi force-pushed the findepi/iceberg-count-only branch from 7f1a960 to 7a10c4b Compare October 6, 2023 21:28
@findepi findepi changed the title Findepi/iceberg count only Process count(*) on Iceberg without opening data files Oct 6, 2023

findepi commented Oct 6, 2023

cc @osscm

@findinpath findinpath Oct 6, 2023

Please add tests for count(*) over the whole table.

Add a test involving a filter on a non-partition column of a partitioned table.

It is also important to check the file system accesses for min(C)/max(C).

Also add a test for a table that has delete files.

@findepi findepi changed the title Process count(*) on Iceberg without opening data files Accelerate Iceberg when reading partition columns only Oct 7, 2023
@findepi findepi force-pushed the findepi/iceberg-count-only branch from 7a10c4b to b587f51 Compare October 9, 2023 11:48
Manifests contain trustworthy information about record count, so it can
be used to answer the count(*) queries.
@findepi findepi force-pushed the findepi/iceberg-count-only branch from b587f51 to b756564 Compare October 16, 2023 09:48

maswin commented Jan 23, 2025

We have a query that joins with a table on partition columns only.
Because of the following code, we are not able to use the combined table scan:

if (wholeFileTask.deletes().isEmpty() && noDataColumnsProjected(wholeFileTask)) {
    fileTasksIterator = List.of(wholeFileTask).iterator();
}
else {
    fileTasksIterator = wholeFileTask.split(targetSplitSize).iterator();
}

The table has 1 million small files, but more than 1 million splits are created because this scan is not converted into a SplittableScanTask.
Is there a reason why combining splits is skipped when no data columns are projected?
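
To make the question concrete, here is a simplified model of the branch quoted above. It is a sketch under stated assumptions, not Trino or Iceberg code: `FileTask`, `plan`, and `countSplits` are hypothetical names, and `split` simply cuts a file into chunks of at most the target size.

```java
import java.util.ArrayList;
import java.util.List;

public class SplitPlanningSketch {
    // Hypothetical stand-in for a whole-file scan task.
    static final class FileTask {
        final long lengthBytes;
        final boolean hasDeletes;

        FileTask(long lengthBytes, boolean hasDeletes) {
            this.lengthBytes = lengthBytes;
            this.hasDeletes = hasDeletes;
        }

        // Cut the file into chunks of at most targetSplitSize bytes (at least one chunk).
        List<FileTask> split(long targetSplitSize) {
            List<FileTask> parts = new ArrayList<>();
            for (long offset = 0; offset < lengthBytes; offset += targetSplitSize) {
                parts.add(new FileTask(Math.min(targetSplitSize, lengthBytes - offset), hasDeletes));
            }
            if (parts.isEmpty()) {
                parts.add(this);
            }
            return parts;
        }
    }

    // Mirrors the shape of the quoted condition: keep the whole file as one task
    // when it has no deletes and no data columns are projected, otherwise split it.
    static List<FileTask> plan(FileTask whole, boolean noDataColumnsProjected, long targetSplitSize) {
        if (!whole.hasDeletes && noDataColumnsProjected) {
            return List.of(whole);
        }
        return whole.split(targetSplitSize);
    }

    static long countSplits(List<FileTask> files, boolean noDataColumnsProjected, long targetSplitSize) {
        long count = 0;
        for (FileTask file : files) {
            count += plan(file, noDataColumnsProjected, targetSplitSize).size();
        }
        return count;
    }

    public static void main(String[] args) {
        long target = 128L << 20; // 128 MiB, an assumed target split size
        List<FileTask> smallFiles = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            smallFiles.add(new FileTask(1L << 20, false)); // 1 MiB each
        }
        System.out.println(countSplits(smallFiles, true, target));   // prints 1000
        System.out.println(countSplits(smallFiles, false, target));  // prints 1000

        List<FileTask> largeFile = List.of(new FileTask(300L << 20, false)); // 300 MiB
        System.out.println(countSplits(largeFile, true, target));    // prints 1
        System.out.println(countSplits(largeFile, false, target));   // prints 3
    }
}
```

In this model, files smaller than the target size yield one split each on either branch, so the split count tracks the file count unless small tasks are combined in a separate step, which the model does not include.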

Labels

cla-signed iceberg Iceberg connector
