
Conversation

Member

@raunaqmorarka raunaqmorarka commented Jun 3, 2024

Description

Offsets can be used by readers to align splits with row-group/stripe boundaries.
For data written after this change, https://github.com/apache/iceberg/blob/main/core/src/main/java/org/apache/iceberg/OffsetsAwareSplitScanTaskIterator.java#L30 will be used to generate one split per Parquet row-group or ORC stripe.

Additional context and related issues

Fixes #9018

Release notes

( ) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
(x) Release notes are required, with the following suggested text:

# Iceberg
* Populate `split_offsets` in file metadata to allow faster reads. ({issue}`9018`)

@cla-bot cla-bot bot added the cla-signed label Jun 3, 2024
@github-actions github-actions bot added the iceberg Iceberg connector label Jun 3, 2024
@raunaqmorarka raunaqmorarka force-pushed the ice-offsets branch 2 times, most recently from 05ba753 to d715350 on June 4, 2024 05:10
.mapToLong(ColumnStatistics::getRetainedSizeInBytes)
.sum()).orElse(0L);
stripeOffsets = closedStripes.stream()
.map(closedStripe -> closedStripe.getStripeInformation().getOffset())
Contributor

Could you please add to the description of the PR that it relates to:

https://github.com/apache/iceberg/blob/0a26f02876dfb3b9bbfac6720fb2506326e97273/core/src/main/java/org/apache/iceberg/BaseContentScanTask.java#L102-L110

?

Now, what does this mean in plain English for Trino to use OffsetsAwareSplitScanTaskIterator?
How does this improve the efficiency of Trino read workloads?

Member Author

It means that we get one split per Parquet row-group or ORC stripe instead of a 128MB (or configured) split, which might map to 0-N row-groups depending on the row-group size configured in the writer. It should give us the same or more parallelism, and it avoids empty splits caused by misalignment of logical split offsets with file row-group boundaries.
We were already getting this behaviour when the data was written by some other engine/library that populated the split offsets.
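The split-generation mechanism described above can be sketched as follows. This is a hypothetical illustration (not Iceberg's actual `OffsetsAwareSplitScanTaskIterator` code): given the split offsets recorded in file metadata, each split spans from one recorded offset to the next, so splits align exactly with row-group/stripe boundaries and none of them is empty.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: derive one (offset, length) split per
// row-group/stripe from the offsets recorded in file metadata.
public class OffsetSplits {
    // Each split runs from one recorded offset to the next recorded offset;
    // the last split runs to the end of the file.
    public static List<long[]> splitsFromOffsets(List<Long> offsets, long fileLength) {
        List<long[]> splits = new ArrayList<>();
        for (int i = 0; i < offsets.size(); i++) {
            long start = offsets.get(i);
            long end = (i + 1 < offsets.size()) ? offsets.get(i + 1) : fileLength;
            splits.add(new long[] {start, end - start});
        }
        return splits;
    }
}
```

Because every split starts exactly at a row-group/stripe offset, a fixed-size split (e.g. 128MB) can no longer straddle a boundary or cover zero row-groups.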

Member

It means that we get one split per Parquet row-group or ORC stripe instead of a 128MB (or configured) split, which might map to 0-N row-groups depending on the row-group size configured in the writer.

Do we have a minimal split size if the row-groups are super tiny, e.g. to avoid reading Parquet footers over and over again for tiny stripes?

Member Author

We rely on the Iceberg library logic to split the file, and it doesn't have such a fallback. There's no good reason for any writer to produce a lot of small row-groups; doing so would nullify the advantages of a columnar file format. Keep in mind that other writers already populate this field, and we've already been relying on the same Iceberg logic for data written that way. If someone has such data due to a misconfiguration, they need to fix it by running OPTIMIZE or CTAS.

Member

There's no good reason why any writer would be producing a lot of small row-groups

Right, but small row-groups could be a byproduct of wide rows (I think I've seen that happen).

Member Author

They can be small in row count, but they can't be small in size, due to the column sizes, and the existing split size was not based on row count.
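A quick back-of-the-envelope illustration of this point (the numbers here are assumed for illustration, not Trino/Iceberg defaults): with a size-based row-group target, wide rows shrink the row count per group but not the group's byte size, so offset-based splits stay reasonably large.

```java
// Illustrative arithmetic only; targetGroupBytes/rowWidthBytes are assumed values.
public class WideRows {
    // Rows that fit in one row-group when the writer targets a byte size.
    public static long rowsPerGroup(long targetGroupBytes, long rowWidthBytes) {
        return Math.max(1, targetGroupBytes / rowWidthBytes);
    }
}
```

With 1MB-wide rows and a 128MB target, a group holds only 128 rows, yet it is still 128MB on disk, so per-split footer-read overhead stays amortized.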


.withFileSizeInBytes(task.fileSizeInBytes())
.withFormat(table.getFileFormat().toIceberg())
.withMetrics(task.metrics().metrics());
task.fileSplitOffsets().ifPresent(builder::withSplitOffsets);
Member

How does this builder relate to IcebergPageSink?

Member Author

The builder is an existing Iceberg API; we're just using it to populate the offsets that we received from the workers through CommitTaskData.
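The conditional population in the diff above (`task.fileSplitOffsets().ifPresent(builder::withSplitOffsets)`) can be sketched with toy stand-ins. `FileBuilder` and `CommitTask` here are illustrative placeholders, not Trino's or Iceberg's actual classes; the point is only that offsets are attached to the file metadata exclusively when the writer recorded them.

```java
import java.util.List;
import java.util.Optional;

// Toy sketch of conditionally populating split offsets on a builder.
public class OffsetsBuilderSketch {
    static class FileBuilder {
        List<Long> splitOffsets; // stays null when the writer recorded no offsets

        FileBuilder withSplitOffsets(List<Long> offsets) {
            this.splitOffsets = offsets;
            return this;
        }
    }

    static class CommitTask {
        private final Optional<List<Long>> fileSplitOffsets;

        CommitTask(Optional<List<Long>> fileSplitOffsets) {
            this.fileSplitOffsets = fileSplitOffsets;
        }

        Optional<List<Long>> fileSplitOffsets() {
            return fileSplitOffsets;
        }
    }

    public static FileBuilder build(CommitTask task) {
        FileBuilder builder = new FileBuilder();
        // Mirrors task.fileSplitOffsets().ifPresent(builder::withSplitOffsets):
        // offsets are set only when the commit task carries them.
        task.fileSplitOffsets().ifPresent(builder::withSplitOffsets);
        return builder;
    }
}
```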

@raunaqmorarka raunaqmorarka force-pushed the ice-offsets branch 2 times, most recently from 51ba5ea to 4102731 on June 7, 2024 13:16
Avoids creating too many splits when split offsets are populated
Offsets can be used by readers to align splits with row-group/stripe boundaries
@raunaqmorarka raunaqmorarka merged commit 4cc2421 into master Jun 10, 2024
@raunaqmorarka raunaqmorarka deleted the ice-offsets branch June 10, 2024 04:29
@github-actions github-actions bot added this to the 450 milestone Jun 10, 2024
Successfully merging this pull request may close these issues.

Populate split_offsets in Iceberg metadata
