Support shuffle on Hive partition columns before write #13969
HiveMetadata.java

@@ -171,6 +171,7 @@
 import static com.facebook.presto.hive.HiveSessionProperties.isOfflineDataDebugModeEnabled;
 import static com.facebook.presto.hive.HiveSessionProperties.isOptimizedMismatchedBucketCount;
 import static com.facebook.presto.hive.HiveSessionProperties.isRespectTableFormat;
+import static com.facebook.presto.hive.HiveSessionProperties.isShufflePartitionedColumnsForTableWriteEnabled;
 import static com.facebook.presto.hive.HiveSessionProperties.isSortedWriteToTempPathEnabled;
 import static com.facebook.presto.hive.HiveSessionProperties.isSortedWritingEnabled;
 import static com.facebook.presto.hive.HiveSessionProperties.isStatisticsEnabled;
@@ -278,6 +279,8 @@ public class HiveMetadata
     private static final String PARTITIONS_TABLE_SUFFIX = "$partitions";
     private static final String PRESTO_TEMPORARY_TABLE_NAME_PREFIX = "__presto_temporary_table_";

+    private static final int SHUFFLE_MAX_PARALLELISM_FOR_PARTITIONED_TABLE_WRITE = 1009;
+
     private final boolean allowCorruptWritesForTesting;
     private final SemiTransactionalHiveMetastore metastore;
     private final HdfsEnvironment hdfsEnvironment;
@@ -2369,9 +2372,26 @@ public Optional<ConnectorNewTableLayout> getInsertLayout(ConnectorSession sessio
                 .orElseThrow(() -> new TableNotFoundException(tableName));

         Optional<HiveBucketHandle> hiveBucketHandle = getHiveBucketHandle(table);

         if (!hiveBucketHandle.isPresent()) {
Contributor:
So this case is basically: it hasn't already been bucketed for us by the user, so we are going to pretend it's bucketed when writing out the files?

Contributor (Author):
Right. If the table is bucketed, we have to follow the table bucketing; there is no way around that. Now that I rethink it, distinguishing between "table data partitioning" and "table write shuffle partitioning" (or some other name) might actually make the code easier to understand. Otherwise the natural question is: why is the table actually not partitioned by X, yet we pretend it is partitioned by X? This "if bucketed, use bucketing; otherwise, use partition columns" logic can be a bit difficult to understand and maintain. cc @arhimondr
+            if (isShufflePartitionedColumnsForTableWriteEnabled(session) && !table.getPartitionColumns().isEmpty()) {
+                // TODO: the shuffle partitioning doesn't have to be the same as Hive bucket partitioning
+                HivePartitioningHandle partitioningHandle = new HivePartitioningHandle(
+                        SHUFFLE_MAX_PARALLELISM_FOR_PARTITIONED_TABLE_WRITE,
+                        table.getPartitionColumns().stream()
+                                .map(Column::getType)
+                                .collect(Collectors.toList()),
+                        OptionalInt.empty());
+                List<String> partitionedBy = table.getPartitionColumns().stream()
+                        .map(Column::getName)
+                        .collect(Collectors.toList());
+
+                return Optional.of(new ConnectorNewTableLayout(partitioningHandle, partitionedBy));
+            }

             return Optional.empty();
         }

         HiveBucketProperty bucketProperty = table.getStorage().getBucketProperty()
                 .orElseThrow(() -> new NoSuchElementException("Bucket property should be set"));
         if (!bucketProperty.getSortedBy().isEmpty() && !isSortedWritingEnabled(session)) {
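As background for why the PR shuffles on the partition columns at all, here is a toy, self-contained simulation (illustrative only; writer, partition, and row counts are made up, and this is not Presto code). Without a shuffle, rows for any partition can arrive at any writer, so nearly every (writer, partition) pair needs an open file; hashing on the partition columns routes each partition to exactly one writer:

```java
import java.util.HashSet;
import java.util.Random;
import java.util.Set;

// Toy model: count how many (writer, partition) file handles each strategy needs.
public class ShuffleWriteDemo
{
    public static void main(String[] args)
    {
        int writers = 100;      // hypothetical writer tasks in the cluster
        int partitions = 5000;  // hypothetical distinct partition values
        int rows = 5_000_000;
        Random random = new Random(42);

        Set<Long> unshuffled = new HashSet<>();
        Set<Long> shuffled = new HashSet<>();
        for (int row = 0; row < rows; row++) {
            long partition = random.nextInt(partitions);
            // Without a shuffle, rows reach writers in arbitrary order (modeled as
            // round-robin), so each writer keeps a file open for nearly every partition.
            unshuffled.add((row % writers) * 10_000L + partition);
            // With a shuffle on the partition columns, a given partition always maps
            // to the same writer, so only one file per partition is ever open.
            shuffled.add((partition % writers) * 10_000L + partition);
        }
        System.out.println("open (writer, partition) files without shuffle: " + unshuffled.size());
        System.out.println("open (writer, partition) files with shuffle:    " + shuffled.size());
    }
}
```

The flip side, raised in the thread below, is that one writer per partition caps write parallelism at one file per partition.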
@@ -2396,9 +2416,31 @@ public Optional<ConnectorNewTableLayout> getNewTableLayout(ConnectorSession sess
         validatePartitionColumns(tableMetadata);
Contributor:
Is this for when they create the table at the same time as the select, and we want to make sure we write it out as if it was bucketed? In other words, it won't ever come back and call getInsertLayout before inserting?
         validateBucketColumns(tableMetadata);
         Optional<HiveBucketProperty> bucketProperty = getBucketProperty(tableMetadata.getProperties());

         if (!bucketProperty.isPresent()) {
+            List<String> partitionedBy = getPartitionedBy(tableMetadata.getProperties());
+            if (isShufflePartitionedColumnsForTableWriteEnabled(session) && !partitionedBy.isEmpty()) {
+                List<HiveColumnHandle> columnHandles = getColumnHandles(tableMetadata, ImmutableSet.copyOf(partitionedBy), typeTranslator);
+                Map<String, HiveColumnHandle> columnHandlesByName = Maps.uniqueIndex(columnHandles, HiveColumnHandle::getName);
+                List<Column> partitionColumns = partitionedBy.stream()
+                        .map(columnHandlesByName::get)
+                        .map(column -> new Column(column.getName(), column.getHiveType(), column.getComment()))
+                        .collect(toList());
+
+                // TODO: the shuffle partitioning doesn't have to be the same as Hive bucket partitioning
+                HivePartitioningHandle partitioningHandle = new HivePartitioningHandle(
+                        SHUFFLE_MAX_PARALLELISM_FOR_PARTITIONED_TABLE_WRITE,
+                        partitionColumns.stream()
+                                .map(Column::getType)
+                                .collect(Collectors.toList()),
+                        OptionalInt.empty());
+
+                return Optional.of(new ConnectorNewTableLayout(partitioningHandle, partitionedBy));
+            }
+
             return Optional.empty();
         }

         if (!bucketProperty.get().getSortedBy().isEmpty() && !isSortedWritingEnabled(session)) {
             throw new PrestoException(NOT_SUPPORTED, "Writing to bucketed sorted Hive tables is disabled");
         }
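On the question above: in the ConnectorMetadata SPI, CREATE TABLE ... AS SELECT is planned through getNewTableLayout (the table does not exist yet, so the partition columns come from the statement's table properties), while a plain INSERT into the existing table is planned through getInsertLayout. Both hooks now return the same kind of shuffle layout, so a table created via CTAS also gets the shuffle on later inserts. A toy sketch of the two call paths (hypothetical names, not engine code):

```java
import java.util.List;
import java.util.Optional;

// Hypothetical sketch of which SPI hook runs for which statement shape.
public class LayoutCallPaths
{
    // CREATE TABLE ... AS SELECT: the table doesn't exist yet, so the write
    // layout is derived from the statement's own partitioned_by property.
    static Optional<String> planCreateTableAs(List<String> partitionedBy)
    {
        return partitionedBy.isEmpty()
                ? Optional.empty()
                : Optional.of("exchange on " + partitionedBy + " before TableWriter");
    }

    // INSERT INTO an existing table: the layout is re-derived from the stored
    // table's partition columns, so a CTAS'd table still shuffles on later inserts.
    static Optional<String> planInsert(List<String> tablePartitionColumns)
    {
        return planCreateTableAs(tablePartitionColumns);
    }

    public static void main(String[] args)
    {
        System.out.println(planCreateTableAs(List.of("ds"))); // CTAS path -> getNewTableLayout
        System.out.println(planInsert(List.of("ds")));        // INSERT path -> getInsertLayout
    }
}
```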
Comment (@mbasmanova):
@wenleix Thanks for working on this. I have a few questions.

One file per partition might be too little. Is there a way to still use up to 100 writer threads per node to write files?

What's the significance of 1009? Is this the maximum number of dynamic partitions or something else?

How many nodes will be used for writing? hash_partition_count or some other number?
Comment (@aweisberg):
I wonder if larger numbers of writer threads per node actually work, and if it still works on T1. One case where you would enable this is when a query writes to a large number of partitions on T1s. In that case it can use up all the available memory for the ORC encoding buffers.

I think this is useful as is, because it takes a query that doesn't run at all and makes it able to run. That is strictly better than it being unable to run on T1.
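A rough back-of-envelope for that memory concern (all numbers hypothetical): ORC keeps encoding buffers per open writer, and an unshuffled write keeps one writer open per partition a task has seen, so the cost is multiplicative.

```java
// Hypothetical sizing: why many open partitions per writer exhausts memory.
public class OrcBufferEstimate
{
    public static void main(String[] args)
    {
        long openPartitionsPerTask = 2_000;   // partitions a single writer task touches
        long bufferBytesPerWriter = 8L << 20; // ~8 MB of ORC encoding buffers each (assumed)
        long total = openPartitionsPerTask * bufferBytesPerWriter;
        System.out.println("per-task writer memory ~ " + (total >> 30) + " GB"); // ~15 GB
    }
}
```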
Comment (@aweisberg):
Should the bucket count match the actual hash partition count, so we don't have to map from a constant number of buckets to a different number of nodes actually executing the stage?
Comment (@wenleix, Author):
Thanks @mbasmanova and @aweisberg for the comments!

In the current approach that's difficult, as Presto still thinks this is the "table partitioning" and will therefore make the local exchange comply with the table partitioning. I do agree one file per partition is very inflexible. To solve this, we might want to differentiate between "table partitioning" and "write/shuffle partitioning"; for the latter, the local exchange can be a simple round robin. What do you think, @arhimondr?

I chose a prime number that is large enough (>1000). The reason is that the Hive bucket function is reported to degenerate when the bucket column values follow some pattern. I can cc you on the internal FB post.

I think it will be max_tasks_per_stage, but I can double check.
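A small self-contained demo of that degeneration (the stride and counts are made up): if the incoming hash values share a common stride d, hash % m can only reach m / gcd(d, m) distinct buckets, so a prime modulus such as 1009 keeps gcd(d, m) = 1 and every bucket reachable.

```java
import java.util.HashSet;
import java.util.Set;

public class PrimeBucketDemo
{
    public static void main(String[] args)
    {
        int stride = 24; // patterned input: all hash values are multiples of 24
        for (int buckets : new int[] {1008, 1009}) {
            Set<Integer> used = new HashSet<>();
            for (long hash = 0; hash < 1_000_000; hash += stride) {
                used.add((int) (hash % buckets));
            }
            // 1008 shares factors with 24 -> only 1008 / gcd(24, 1008) = 42 buckets get data;
            // 1009 is prime -> all 1009 buckets get data.
            System.out.printf("buckets=%d -> %d distinct buckets used%n", buckets, used.size());
        }
    }
}
```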
Comment (@wenleix, Author):
@aweisberg I agree. For T1 we probably want to configure it to 1, but we might still want some flexibility in how many files we can have per partition.

No, it doesn't have to. Buckets will be mapped to nodes in a random, "round robin" fashion. See NodePartitioningManager#createArbitraryBucketToNode:
presto/presto-main/src/main/java/com/facebook/presto/sql/planner/NodePartitioningManager.java, lines 218 to 228 in f555aa6.
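The linked snippet did not survive the page capture. The idea it points at can be sketched as follows (a simplified stand-in, not the actual Presto source): shuffle the worker list once, then assign bucket i to worker i mod workerCount, so a fixed bucket count like 1009 maps onto any cluster size.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Simplified stand-in for "arbitrary bucket to node" assignment: buckets are
// spread round-robin over a randomly shuffled copy of the worker list.
public class ArbitraryBucketToNode
{
    static List<String> createArbitraryBucketToNode(List<String> nodes, int bucketCount)
    {
        List<String> shuffled = new ArrayList<>(nodes);
        Collections.shuffle(shuffled);
        List<String> bucketToNode = new ArrayList<>(bucketCount);
        for (int bucket = 0; bucket < bucketCount; bucket++) {
            bucketToNode.add(shuffled.get(bucket % shuffled.size()));
        }
        return bucketToNode;
    }

    public static void main(String[] args)
    {
        List<String> assignment = createArbitraryBucketToNode(List.of("node1", "node2", "node3"), 1009);
        System.out.println("bucket 0 -> " + assignment.get(0));
        System.out.println("bucket 1008 -> " + assignment.get(1008));
    }
}
```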