Remove Hive initial splits config properties by sopel39 · Pull Request #22787 · trinodb/trino

sopel39 · 2024-07-24T10:12:38Z

Initial splits cause split scheduling to be
non-deterministic, which causes split level cache misses.

Additionally, initial splits were most useful
for non-columnar file formats (e.g. text, JSON files), while it's not useful for modern era formats where split boundaries are determined in files themselves.

This prevents subquery cache misses due to non-deterministic split scheduling in Hive connector, see #21888 (comment)

Release notes

( ) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
(x) Release notes are required, with the following suggested text:

# Hive
* Remove `hive.max-initial-splits` and `hive.max-initial-split-size`. ({issue}`issuenumber`)

sopel39 · 2024-07-24T10:14:17Z

cc @assaf2

raunaqmorarka

I think we should remove the code altogether. Changing default doesn't solve the problem as there will be lots of configs where this is already set to some non-default value.

sopel39 · 2024-07-24T10:46:26Z

I think we should remove the code altogether. Changing default doesn't solve the problem as there will be lots of configs where this is already set to some non-default value.

@raunaqmorarka How about making it deprecated and removing it in next 2 releases?

raunaqmorarka · 2024-07-24T10:50:11Z

I think we should remove the code altogether. Changing default doesn't solve the problem as there will be lots of configs where this is already set to some non-default value.

@raunaqmorarka How about making it deprecated and removing it in next 2 releases?

I don't think we will learn anything new on this topic in next 2 releases, I would just remove it and mark as defunct

sopel39 · 2024-07-24T11:26:34Z

I don't think we will learn anything new on this topic in next 2 releases, I would just remove it and mark as defunct

Okey. Let's try

Initial splits cause split scheduling to be non-deterministic, which causes split level cache misses. Additionally, initial splits were most useful for non-columnar file formats (e.g. text, JSON files), while it's not useful for modern era formats where split boundaries are determined in files themselves.

dain · 2024-07-24T20:40:10Z

Ths code is present for to give queries a parallelism boost when they first start. I don't know what split-level caching is, but it sounds like a bad design if it is dependent on stable splits.

sopel39 · 2024-07-24T20:59:10Z

Ths code is present for to give queries a parallelism boost when they first start

Only if the file can be sliced into arbitrary small chunks, which is not the case for Parquet or ORC. Default initial split size is also 32MB, so much smaller than typical row-group size, so empty splits are produced usually. This feature is noop in most of the cases, but introduces unnecessary complexity.

I don't know what split-level caching is, but it sounds like a bad design if it is dependent on stable splits.

If you cache split data or split level computations rather that Parquet/ORC/file themselves, then split determinism is required (and actually provided by Iceberg/Delta). Parquet/ORC are rather slow formats, so cache like Alluxio is only helping partially and will loose every time against higher-level cache.

electrum

If the concern is that this is ineffective (and in fact counter-productive) for ORC/Parquet, why not disable it when generating splits for those formats?

It should still provide a benefit for text formats. We should only remove it entirely if there is no benefit, or if this improves something for Trino.

We don't have higher level caching in Trino. If we do add that feature, then removing initial splits can be revisited.

electrum · 2024-07-28T17:10:02Z

We can benchmark and possibly remove this after #22827 is merged.

sopel39 · 2024-07-29T09:59:17Z

If the concern is that this is ineffective (and in fact counter-productive) for ORC/Parquet, why not disable it when generating splits for those formats?

@electrum It's making complicated thing even more complicated.

It should still provide a benefit for text formats. We should only remove it entirely if there is no benefit, or if this improves something for Trino.

it's always complexity vs gain. Perf difference between raw text or JSON compared to ORC/Parquet is like comparing old truck vs race car. If people want perf, then they should migrate to other formats. If they stick with raw text, then they pay perf and cost penalty anyway. Making splits somewhat smaller won't change anything fundamentally.

sopel39 added the release-notes label Jul 24, 2024

sopel39 requested review from martint and raunaqmorarka July 24, 2024 10:12

cla-bot bot added the cla-signed label Jul 24, 2024

github-actions bot added docs hive Hive connector labels Jul 24, 2024

raunaqmorarka requested review from dain and electrum July 24, 2024 10:22

raunaqmorarka reviewed Jul 24, 2024

View reviewed changes

sopel39 force-pushed the ks/disable_initial branch from ed9ee33 to 5881184 Compare July 24, 2024 12:21

sopel39 changed the title ~~Disable Hive initial splits by default~~ Remove Hive initial splits config properties Jul 24, 2024

sopel39 requested a review from raunaqmorarka July 24, 2024 12:21

sopel39 force-pushed the ks/disable_initial branch from 5881184 to bb0e580 Compare July 24, 2024 12:45

sopel39 added 2 commits July 24, 2024 14:47

Remove max-initial-splits from Delta doc

ca5dd91

sopel39 force-pushed the ks/disable_initial branch from bb0e580 to ca5dd91 Compare July 24, 2024 12:47

raunaqmorarka approved these changes Jul 24, 2024

View reviewed changes

sopel39 mentioned this pull request Jul 24, 2024

Subquery cache & friends #21888

Closed

findepi approved these changes Jul 24, 2024

View reviewed changes

assaf2 approved these changes Jul 25, 2024

View reviewed changes

electrum requested changes Jul 28, 2024

View reviewed changes

sopel39 closed this Jul 29, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Remove Hive initial splits config properties#22787

Remove Hive initial splits config properties#22787
sopel39 wants to merge 2 commits intotrinodb:masterfrom
sopel39:ks/disable_initial

sopel39 commented Jul 24, 2024 •

edited

Loading

Uh oh!

sopel39 commented Jul 24, 2024

Uh oh!

raunaqmorarka left a comment

Uh oh!

sopel39 commented Jul 24, 2024

Uh oh!

raunaqmorarka commented Jul 24, 2024

Uh oh!

sopel39 commented Jul 24, 2024

Uh oh!

dain commented Jul 24, 2024

Uh oh!

sopel39 commented Jul 24, 2024 •

edited

Loading

Uh oh!

electrum left a comment

Uh oh!

electrum commented Jul 28, 2024

Uh oh!

sopel39 commented Jul 29, 2024

Uh oh!

Reviewers

Assignees

Labels

Milestone

Development

Uh oh!

6 participants

Conversation

sopel39 commented Jul 24, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Release notes

Uh oh!

sopel39 commented Jul 24, 2024

Uh oh!

raunaqmorarka left a comment

Choose a reason for hiding this comment

Uh oh!

sopel39 commented Jul 24, 2024

Uh oh!

raunaqmorarka commented Jul 24, 2024

Uh oh!

sopel39 commented Jul 24, 2024

Uh oh!

dain commented Jul 24, 2024

Uh oh!

sopel39 commented Jul 24, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

electrum left a comment

Choose a reason for hiding this comment

Uh oh!

electrum commented Jul 28, 2024

Uh oh!

sopel39 commented Jul 29, 2024

Uh oh!

Reviewers

Assignees

Labels

Milestone

Development

Uh oh!

6 participants

sopel39 commented Jul 24, 2024 •

edited

Loading

sopel39 commented Jul 24, 2024 •

edited

Loading