Skip to content

Remove Hive initial splits config properties#22787

Closed
sopel39 wants to merge 2 commits intotrinodb:masterfrom
sopel39:ks/disable_initial
Closed

Remove Hive initial splits config properties#22787
sopel39 wants to merge 2 commits intotrinodb:masterfrom
sopel39:ks/disable_initial

Conversation

@sopel39
Copy link
Copy Markdown
Member

@sopel39 sopel39 commented Jul 24, 2024

Initial splits cause split scheduling to be
non-deterministic, which causes split level cache misses.

Additionally, initial splits were most useful
for non-columnar file formats (e.g. text, JSON files), while it's not useful for modern era formats where split boundaries are determined in files themselves.

This prevents subquery cache misses due to non-deterministic split scheduling in Hive connector, see #21888 (comment)

Release notes

( ) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
(x) Release notes are required, with the following suggested text:

# Hive
* Remove `hive.max-initial-splits` and `hive.max-initial-split-size`. ({issue}`issuenumber`)

@sopel39 sopel39 requested review from martint and raunaqmorarka July 24, 2024 10:12
@cla-bot cla-bot bot added the cla-signed label Jul 24, 2024
@github-actions github-actions bot added docs hive Hive connector labels Jul 24, 2024
@sopel39
Copy link
Copy Markdown
Member Author

sopel39 commented Jul 24, 2024

cc @assaf2

@raunaqmorarka raunaqmorarka requested review from dain and electrum July 24, 2024 10:22
Copy link
Copy Markdown
Member

@raunaqmorarka raunaqmorarka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should remove the code altogether. Changing default doesn't solve the problem as there will be lots of configs where this is already set to some non-default value.

@sopel39
Copy link
Copy Markdown
Member Author

sopel39 commented Jul 24, 2024

I think we should remove the code altogether. Changing default doesn't solve the problem as there will be lots of configs where this is already set to some non-default value.

@raunaqmorarka How about making it deprecated and removing it in next 2 releases?

@raunaqmorarka
Copy link
Copy Markdown
Member

I think we should remove the code altogether. Changing default doesn't solve the problem as there will be lots of configs where this is already set to some non-default value.

@raunaqmorarka How about making it deprecated and removing it in next 2 releases?

I don't think we will learn anything new on this topic in next 2 releases, I would just remove it and mark as defunct

@sopel39
Copy link
Copy Markdown
Member Author

sopel39 commented Jul 24, 2024

I don't think we will learn anything new on this topic in next 2 releases, I would just remove it and mark as defunct

Okey. Let's try

@sopel39 sopel39 force-pushed the ks/disable_initial branch from ed9ee33 to 5881184 Compare July 24, 2024 12:21
@sopel39 sopel39 changed the title Disable Hive initial splits by default Remove Hive initial splits config properties Jul 24, 2024
@sopel39 sopel39 requested a review from raunaqmorarka July 24, 2024 12:21
@sopel39 sopel39 force-pushed the ks/disable_initial branch from 5881184 to bb0e580 Compare July 24, 2024 12:45
sopel39 added 2 commits July 24, 2024 14:47
Initial splits cause split scheduling to be
non-deterministic, which causes split level cache misses.

Additionally, initial splits were most useful
for non-columnar file formats (e.g. text, JSON files),
while it's not useful for modern era formats where
split boundaries are determined in files themselves.
@sopel39 sopel39 force-pushed the ks/disable_initial branch from bb0e580 to ca5dd91 Compare July 24, 2024 12:47
@dain
Copy link
Copy Markdown
Member

dain commented Jul 24, 2024

Ths code is present for to give queries a parallelism boost when they first start. I don't know what split-level caching is, but it sounds like a bad design if it is dependent on stable splits.

@sopel39
Copy link
Copy Markdown
Member Author

sopel39 commented Jul 24, 2024

Ths code is present for to give queries a parallelism boost when they first start

Only if the file can be sliced into arbitrary small chunks, which is not the case for Parquet or ORC. Default initial split size is also 32MB, so much smaller than typical row-group size, so empty splits are produced usually. This feature is noop in most of the cases, but introduces unnecessary complexity.

I don't know what split-level caching is, but it sounds like a bad design if it is dependent on stable splits.

If you cache split data or split level computations rather that Parquet/ORC/file themselves, then split determinism is required (and actually provided by Iceberg/Delta). Parquet/ORC are rather slow formats, so cache like Alluxio is only helping partially and will loose every time against higher-level cache.

Copy link
Copy Markdown
Member

@electrum electrum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the concern is that this is ineffective (and in fact counter-productive) for ORC/Parquet, why not disable it when generating splits for those formats?

It should still provide a benefit for text formats. We should only remove it entirely if there is no benefit, or if this improves something for Trino.

We don't have higher level caching in Trino. If we do add that feature, then removing initial splits can be revisited.

@electrum
Copy link
Copy Markdown
Member

We can benchmark and possibly remove this after #22827 is merged.

@sopel39
Copy link
Copy Markdown
Member Author

sopel39 commented Jul 29, 2024

If the concern is that this is ineffective (and in fact counter-productive) for ORC/Parquet, why not disable it when generating splits for those formats?

@electrum It's making complicated thing even more complicated.

It should still provide a benefit for text formats. We should only remove it entirely if there is no benefit, or if this improves something for Trino.

it's always complexity vs gain. Perf difference between raw text or JSON compared to ORC/Parquet is like comparing old truck vs race car. If people want perf, then they should migrate to other formats. If they stick with raw text, then they pay perf and cost penalty anyway. Making splits somewhat smaller won't change anything fundamentally.

@sopel39 sopel39 closed this Jul 29, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Development

Successfully merging this pull request may close these issues.

6 participants