Use dynamic split sizes in hive connector#22051
Conversation
32f5022 to
4114f9d
Compare
a5ada00 to
6e52f59
Compare
6e52f59 to
9099406
Compare
mbasmanova
left a comment
There was a problem hiding this comment.
@pranjalssh Would you document the new configuration property?
https://prestodb.io/docs/current/connector/hive.html#hive-configuration-properties
CC: @steveburnett
mbasmanova
left a comment
There was a problem hiding this comment.
@pranjalssh Thank you for working on this. What kind of speed up have you observed on production queries? It would be nice to update commit message to provide more details about this change. Is there a way to see whether this optimization kicked in and what was the 'ratio' applied after the query finished running?
| this.splitWeightProvider = isSizeBasedSplitWeightsEnabled(session) ? new SizeBasedSplitWeightProvider(getMinimumAssignedSplitWeight(session), maxSplitSize) : HiveSplitWeightProvider.uniformStandardWeightProvider(); | ||
| // Clamp value within [0.1, 1.0]. | ||
| // This ratio will be used to increase split sizes. The range implies | ||
| // 1) We do not increase more than 10x(>= 0.1) |
There was a problem hiding this comment.
Curious why is this limit?
It would be helpful to update commit message to provide more details about the implementation and these kinds of limits.
presto-hive/src/main/java/com/facebook/presto/hive/HiveSplitManager.java
Outdated
Show resolved
Hide resolved
| HiveTableHandle hiveTableHandle = new HiveTableHandle(tableName.getSchemaName(), tableName.getTableName()); | ||
| List<HiveColumnHandle> allColumnHandles = new ArrayList<>(); | ||
| allColumnHandles.addAll(getRegularColumnHandles(table)); | ||
| allColumnHandles.addAll(getPartitionKeyColumnHandles(table)); |
There was a problem hiding this comment.
Why do we include partitioning keys? These are not part of the file.
| double totalSize = tableStatistics.getTotalSize().getValue(); | ||
| double requiredSize = 0; | ||
| double rowCount = tableStatistics.getRowCount().getValue(); | ||
| for (Map.Entry<ColumnHandle, ColumnStatistics> entry : tableStatistics.getColumnStatistics().entrySet()) { |
There was a problem hiding this comment.
Here we loop over all columns (can be thousands), but process only a set of readColumnHandles columns (can be a handful). Would it make sense to change this to loop over readColumnHandles instead?
There was a problem hiding this comment.
Refactored to only query for read columns. I previously incorrectly assumed totalSize for summed for just columns provided
presto-hive/src/main/java/com/facebook/presto/hive/HiveSplitManager.java
Outdated
Show resolved
Hide resolved
| if (readColumnHandles.contains(entry.getKey())) { | ||
| double value = entry.getValue().getDataSize().getValue(); | ||
| // We do not compute total size stats for fixed width types, so count them manually. | ||
| if (!isFinite(value) && isFinite(rowCount)) { |
There was a problem hiding this comment.
isFinite(rowCount)
Can we move this check before the loop?
| for (Map.Entry<ColumnHandle, ColumnStatistics> entry : tableStatistics.getColumnStatistics().entrySet()) { | ||
| if (readColumnHandles.contains(entry.getKey())) { | ||
| double value = entry.getValue().getDataSize().getValue(); | ||
| // We do not compute total size stats for fixed width types, so count them manually. |
There was a problem hiding this comment.
Does that mean 'totalSize' doesn't include fixed-width columns?
There was a problem hiding this comment.
It does, its just missing in column stats. Refactored
presto-hive/src/main/java/com/facebook/presto/hive/HiveSplitManager.java
Outdated
Show resolved
Hide resolved
presto-hive/src/test/java/com/facebook/presto/hive/TestHiveSplitScheduling.java
Show resolved
Hide resolved
| TestHiveEventListenerPlugin.TestingHiveEventListener eventListener = getEventListener(); | ||
|
|
||
| // Wait for previous events to finish | ||
| Thread.sleep(2000); |
There was a problem hiding this comment.
Is there a way to avoid explicit sleep calls?
There was a problem hiding this comment.
I found that event manager is synchronous in these tests, so we can just remove all sleep calls
|
I have seen 90% reduction in number of splits for some queries with this option (380K vs 3M)! And corresponding reduction in latency. so this can help sometimes especially in things like count( * ) queries |
@pranjalssh, when you can, please add documentation as suggested by @mbasmanova and recommended in the Documentation topic of the Review and Commit Guidelines. When you do, you can use the request review feature to tag me so I'll be notified and able to respond in a timely way and not delay the PR. Thanks! |
a47125b to
99d773e
Compare
|
Addressed comments @steveburnett @mbasmanova |
|
Codenotify: Notifying subscribers in CODENOTIFY files for diff ece7741...ce14167.
|
steveburnett
left a comment
There was a problem hiding this comment.
LGTM! (docs)
Pull updated branch, new local build, everything looks good.
99d773e to
b1e2cb2
Compare
| if (!isDynamicSplitSizesEnabled(session)) { | ||
| return ratio; | ||
| } | ||
| HiveTableHandle hiveTableHandle = new HiveTableHandle(tableName.getSchemaName(), tableName.getTableName()); |
There was a problem hiding this comment.
use the function mergeRequestedAndPredicateColumns(). It will also handle if the same struct columns appear in both sets with different subfields.
presto-hive/src/main/java/com/facebook/presto/hive/HiveSplitManager.java
Outdated
Show resolved
Hide resolved
| return splitSource; | ||
| } | ||
|
|
||
| private double getSplitScanRatio( |
There was a problem hiding this comment.
Can you clarify what splitScanRatio means? The name is confusing to me. It looks like it's the proportion of bytes we expect to read based on column stats of columns we actually use compared to the total data size.
|
|
||
| TableStatistics tableStatistics = metadata.getHiveStatisticsProvider().getTableStatistics(session, tableName, readColumns, readColumnTypes, partitions); | ||
| double totalSize = tableStatistics.getTotalSize().getValue(); | ||
| double requiredSize = 0; |
There was a problem hiding this comment.
what does requiredSize mean? it looks like we're getting the total size of all columns we're selecting. What's "required" about it?
There was a problem hiding this comment.
Updated to readSize
presto-hive/src/main/java/com/facebook/presto/hive/HiveSplitManager.java
Outdated
Show resolved
Hide resolved
b1e2cb2 to
f8a2b3f
Compare
presto-hive/src/main/java/com/facebook/presto/hive/HiveSplitManager.java
Outdated
Show resolved
Hide resolved
| this.splitWeightProvider = isSizeBasedSplitWeightsEnabled(session) ? new SizeBasedSplitWeightProvider(getMinimumAssignedSplitWeight(session), maxSplitSize) : HiveSplitWeightProvider.uniformStandardWeightProvider(); | ||
| // Clamp value within [0.1, 1.0]. | ||
| // This ratio will be used to increase split sizes. The range implies | ||
| // 1) We do not increase more than 10x(>= 0.1) |
If data scanned by query is smaller than total size of files, we use larger splits than configured by Hive. We schedule upto 10x larger splits - which is a very generous limit - and being conservative not to schedule splits with too many rows
f8a2b3f to
ce14167
Compare
|
@rschlussel updated comment |
Description
FIxes #21911
Motivation and Context
Impact
Test Plan
Added unit test, gated changes behind config, and ran test queries on a cluster
Contributor checklist
Release Notes
Please follow release notes guidelines and fill in the release notes below.