
Conversation

Member

@raunaqmorarka raunaqmorarka commented Jun 3, 2024

Description

Offsets can be used by readers to align splits with row-group/stripe boundaries.
For data written after this change, https://github.com/apache/iceberg/blob/main/core/src/main/java/org/apache/iceberg/OffsetsAwareSplitScanTaskIterator.java#L30 will be used to generate one split per Parquet row-group or ORC stripe.

Additional context and related issues

Fixes #9018

Release notes

( ) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
(x) Release notes are required, with the following suggested text:

# Iceberg
* Populate `split_offsets` in file metadata to allow faster reads. ({issue}`9018`)

@cla-bot cla-bot bot added the cla-signed label Jun 3, 2024
@github-actions github-actions bot added the iceberg Iceberg connector label Jun 3, 2024
@raunaqmorarka raunaqmorarka force-pushed the ice-offsets branch 2 times, most recently from 05ba753 to d715350 on June 4, 2024 05:10
.mapToLong(ColumnStatistics::getRetainedSizeInBytes)
.sum()).orElse(0L);
stripeOffsets = closedStripes.stream()
.map(closedStripe -> closedStripe.getStripeInformation().getOffset())
Contributor

Could you please add to the description of the PR that it relates to:

https://github.com/apache/iceberg/blob/0a26f02876dfb3b9bbfac6720fb2506326e97273/core/src/main/java/org/apache/iceberg/BaseContentScanTask.java#L102-L110

?

Now, what does this mean in plain English for Trino to use OffsetsAwareSplitScanTaskIterator?
How does this improve the efficiency of Trino read workloads?

Member Author

It means that we get one split per Parquet row-group or ORC stripe instead of a 128MB (or configured) split, which might map to 0-N row-groups depending on the row-group size configured in the writer. It should give us the same or more parallelism, and it avoids empty splits caused by misalignment of logical split offsets with file row-group boundaries.
We were already getting this behaviour when the data was written by some other engine/library that populated the split offsets.
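The split-generation mechanism described above can be sketched as follows. This is a hypothetical illustration (not Iceberg's actual `OffsetsAwareSplitScanTaskIterator` code): given the split offsets recorded in file metadata, each split spans from one recorded offset to the next, so splits align exactly with row-group/stripe boundaries and none of them is empty.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: derive one (offset, length) split per
// row-group/stripe from the offsets recorded in file metadata.
public class OffsetSplits {
    // Each split runs from one recorded offset to the next recorded offset;
    // the last split runs to the end of the file.
    public static List<long[]> splitsFromOffsets(List<Long> offsets, long fileLength) {
        List<long[]> splits = new ArrayList<>();
        for (int i = 0; i < offsets.size(); i++) {
            long start = offsets.get(i);
            long end = (i + 1 < offsets.size()) ? offsets.get(i + 1) : fileLength;
            splits.add(new long[] {start, end - start});
        }
        return splits;
    }
}
```

Because every split starts exactly at a row-group/stripe offset, a fixed-size split (e.g. 128MB) can no longer straddle a boundary or cover zero row-groups.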

Member

It means that we get one split per Parquet row-group or ORC stripe instead of a 128MB (or configured) split, which might map to 0-N row-groups depending on the row-group size configured in the writer.

Do we have a minimal split size if the row-groups are super tiny, e.g. to avoid reading Parquet footers over and over again for tiny stripes?

Member Author

We rely on the Iceberg library logic to split the file, and it doesn't have such a fallback. There's no good reason for any writer to produce a lot of small row-groups; doing so would nullify the advantages of a columnar file format. Keep in mind that other writers already populate this field, and we've already been relying on the same Iceberg logic for data written that way. If someone has such data due to a misconfiguration, they need to fix it by running OPTIMIZE or CTAS.

Member

There's no good reason why any writer would be producing a lot of small row-groups

Right, but small row-groups could be a byproduct of wide rows (I think I've seen that happen).

Member Author

They can be small in row count, but they can't be small in size, due to the column sizes, and the existing split size was not based on row count.
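A quick back-of-the-envelope illustration of this point (the numbers here are assumed for illustration, not Trino/Iceberg defaults): with a size-based row-group target, wide rows shrink the row count per group but not the group's byte size, so offset-based splits stay reasonably large.

```java
// Illustrative arithmetic only; targetGroupBytes/rowWidthBytes are assumed values.
public class WideRows {
    // Rows that fit in one row-group when the writer targets a byte size.
    public static long rowsPerGroup(long targetGroupBytes, long rowWidthBytes) {
        return Math.max(1, targetGroupBytes / rowWidthBytes);
    }
}
```

With 1MB-wide rows and a 128MB target, a group holds only 128 rows, yet it is still 128MB on disk, so per-split footer-read overhead stays amortized.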


.withFileSizeInBytes(task.fileSizeInBytes())
.withFormat(table.getFileFormat().toIceberg())
.withMetrics(task.metrics().metrics());
task.fileSplitOffsets().ifPresent(builder::withSplitOffsets);
Member

How does this builder relate to IcebergPageSink?

Member Author

The builder is an existing Iceberg API; we're just using it to populate the offsets that we received from the workers through CommitTaskData.
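The conditional population in the diff above (`task.fileSplitOffsets().ifPresent(builder::withSplitOffsets)`) can be sketched with toy stand-ins. `FileBuilder` and `CommitTask` here are illustrative placeholders, not Trino's or Iceberg's actual classes; the point is only that offsets are attached to the file metadata exclusively when the writer recorded them.

```java
import java.util.List;
import java.util.Optional;

// Toy sketch of conditionally populating split offsets on a builder.
public class OffsetsBuilderSketch {
    static class FileBuilder {
        List<Long> splitOffsets; // stays null when the writer recorded no offsets

        FileBuilder withSplitOffsets(List<Long> offsets) {
            this.splitOffsets = offsets;
            return this;
        }
    }

    static class CommitTask {
        private final Optional<List<Long>> fileSplitOffsets;

        CommitTask(Optional<List<Long>> fileSplitOffsets) {
            this.fileSplitOffsets = fileSplitOffsets;
        }

        Optional<List<Long>> fileSplitOffsets() {
            return fileSplitOffsets;
        }
    }

    public static FileBuilder build(CommitTask task) {
        FileBuilder builder = new FileBuilder();
        // Mirrors task.fileSplitOffsets().ifPresent(builder::withSplitOffsets):
        // offsets are set only when the commit task carries them.
        task.fileSplitOffsets().ifPresent(builder::withSplitOffsets);
        return builder;
    }
}
```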

@raunaqmorarka raunaqmorarka force-pushed the ice-offsets branch 2 times, most recently from 51ba5ea to 4102731 on June 7, 2024 13:16
Avoids creating too many splits when split offsets are populated
Offsets can be used by readers to align splits with row-group/stripe boundaries
@raunaqmorarka raunaqmorarka merged commit 4cc2421 into master Jun 10, 2024
@raunaqmorarka raunaqmorarka deleted the ice-offsets branch June 10, 2024 04:29
@github-actions github-actions bot added this to the 450 milestone Jun 10, 2024
Successfully merging this pull request may close these issues.

Populate split_offsets in Iceberg metadata
