Support shuffle on Hive partition columns before write #13969
wenleix wants to merge 1 commit into prestodb:master
Conversation
cc @mbasmanova, @kaikalur, @aweisberg. Note that it now writes exactly one file per partition, which might slow down the table write. We might want to introduce an extra local round-robin shuffle to increase the number of file writers if necessary.
private static final String PARTITIONS_TABLE_SUFFIX = "$partitions";
private static final String PRESTO_TEMPORARY_TABLE_NAME_PREFIX = "__presto_temporary_table_";
private static final int SHUFFLE_MAX_PARALLELISM_FOR_PARTITIONED_TABLE_WRITE = 1009;
@wenleix Thanks for working on this. I have a few questions.
Note that it now writes exactly one file per partition, which might slow down the table write. We might want to introduce an extra local round-robin shuffle to increase the number of file writers if necessary.
One file per partition might be too little. Is there a way to still use up to 100 writer threads per node to write files?
What's the significance of 1009? Is this the maximum number of dynamic partitions or something else?
How many nodes will be used for writing? hash_partition_count or some other number?
I wonder if larger numbers of writer threads per node actually work, and if it still works on T1. One case where you would enable this is when a query writes to a large number of partitions on T1s. In that case it can use up all the available memory for the ORC encoding buffers.
I think this is useful as is because it takes a query that doesn't run at all and makes it able to run. That is strictly better than it being unable to run on T1.
Should the bucket count match the actual hash partition count so we don't have to map from a constant number of buckets to a different number of nodes actually executing the stage?
Thanks @mbasmanova and @aweisberg for the comments!
One file per partition might be too little. Is there a way to still use up to 100 writer threads per node to write files?
In the current approach that's difficult, because Presto still treats this as the "table partitioning", so the local exchange has to comply with the table partitioning.
I do agree one file per partition is very inflexible. To solve this, we might want to differentiate between "table partitioning" and "write/shuffle partitioning". For the latter, the local exchange can be a simple round robin (see the sketch below). What do you think, @arhimondr?
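To make the round-robin idea concrete, here is a minimal sketch of how a round-robin local exchange could fan incoming pages out across several file writers on a node, independent of the partition columns. The class and method names are hypothetical and not Presto code; this is only an illustration of the idea.

```java
import java.util.concurrent.atomic.AtomicLong;

// Sketch only: pages arriving at a node are spread evenly across N local file
// writers instead of being re-hashed on the partition columns.
final class RoundRobinLocalExchangeSketch
{
    private final AtomicLong counter = new AtomicLong();
    private final int writerCount;

    RoundRobinLocalExchangeSketch(int writerCount)
    {
        this.writerCount = writerCount;
    }

    int nextWriter()
    {
        // Each incoming page goes to the next writer in turn, so a single
        // partition can be written by up to writerCount threads on the node.
        return (int) (counter.getAndIncrement() % writerCount);
    }
}
```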
What's the significance of 1009? Is this the maximum number of dynamic partitions or something else?
I chose a prime number that is large enough (>1000). The reason is that the Hive bucket function is reported to degenerate when the bucket column values follow certain patterns -- I can cc you on the internal FB post.
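To illustrate why a prime bucket count helps, here is a toy example. It models the bucket function as a plain modulo on integer keys, which is a simplification of the actual Hive bucket hashing, but it shows the failure mode: when values follow a pattern (here, multiples of a power of two), a power-of-two bucket count collapses onto a single bucket, while a large prime such as 1009 still spreads the same keys across all buckets.

```java
import java.util.HashSet;
import java.util.Set;

public class BucketCountDemo
{
    // Count how many distinct buckets a patterned key stream lands in.
    // Illustrative sketch only, not Presto's actual HiveBucketFunction.
    private static int distinctBuckets(int bucketCount)
    {
        Set<Integer> buckets = new HashSet<>();
        for (int i = 0; i < 100_000; i++) {
            int key = i * 1024;                 // patterned values: all multiples of 1024
            buckets.add(Math.floorMod(key, bucketCount));
        }
        return buckets.size();
    }

    public static void main(String[] args)
    {
        // A power-of-two bucket count collapses to a single bucket...
        System.out.println("bucketCount=1024 -> " + distinctBuckets(1024)); // 1
        // ...while a large prime such as 1009 uses every bucket.
        System.out.println("bucketCount=1009 -> " + distinctBuckets(1009)); // 1009
    }
}
```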
How many nodes will be used for writing? hash_partition_count or some other number?
I think it will be max_tasks_per_stage, but I can double-check.
I wonder if larger numbers of writer threads per node actually work, and if it still works on T1. One case where you would enable this is when a query writes to a large number of partitions on T1s. In that case it can use up all the available memory for the ORC encoding buffers.
I agree. For T1 we probably want to configure it to 1. But we might still want some flexibility in terms of how many files we can have per partition.
Should the bucket count match the actual hash partition count so we don't have to map from a constant number of buckets to a different number of nodes actually executing the stage?
No, it doesn't have to. Buckets are mapped to nodes in a random, "round robin" fashion; see NodePartitioningManager#createArbitraryBucketToNode:
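Roughly speaking (this is a simplified sketch of the idea, not the actual code in NodePartitioningManager), the mapping shuffles the node list once and then assigns the fixed number of buckets to nodes cyclically:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class ArbitraryBucketToNodeSketch
{
    // Map a fixed bucket count onto however many nodes are running the stage:
    // shuffle the node list once, then assign buckets to nodes round robin.
    static <N> List<N> assignBucketsToNodes(List<N> nodes, int bucketCount)
    {
        List<N> shuffled = new ArrayList<>(nodes);
        Collections.shuffle(shuffled);            // random starting arrangement
        List<N> bucketToNode = new ArrayList<>(bucketCount);
        for (int bucket = 0; bucket < bucketCount; bucket++) {
            bucketToNode.add(shuffled.get(bucket % shuffled.size()));   // round robin
        }
        return bucketToNode;
    }
}
```

So with 1009 buckets and, say, 50 worker nodes, each node ends up owning roughly 20 buckets, which is why the bucket count does not need to match the number of nodes executing the stage.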
CC: @biswapesh
Optional<HiveBucketHandle> hiveBucketHandle = getHiveBucketHandle(table);
if (!hiveBucketHandle.isPresent()) {
So this case is basically: the table hasn't already been bucketed for us by the user, so we are going to pretend it's bucketed when writing out the files?
Right. If the table is bucketed we have to follow the table bucketing; there is no other way around it.
Now that I rethink it, distinguishing between table data partitioning and table write shuffle partitioning (or some other name) might actually make the code easier to understand. Otherwise the question becomes: why is the data not actually partitioned by XX, yet we pretend it is? And this "if bucketed, use the bucketing; otherwise, use the partition columns" rule can be a bit difficult to understand and maintain (a rough sketch of the branch follows below).
cc @arhimondr
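For reference, here is a rough sketch of the branch being discussed. The types and helper names are illustrative only, not the exact code in this PR, and the layout representation is heavily simplified.

```java
import java.util.List;
import java.util.Optional;

final class InsertLayoutSketch
{
    // Sketch of the write-layout choice: a bucketed table must follow its own
    // bucketing; an unbucketed, partitioned table can instead shuffle on the
    // partition columns before write.
    static Optional<WriteLayout> chooseWriteLayout(
            Optional<Bucketing> tableBucketing,
            List<String> partitionColumns,
            boolean shuffleOnPartitionColumnsEnabled)
    {
        if (tableBucketing.isPresent()) {
            // Bucketed table: we have to follow the table bucketing, no other way around.
            return Optional.of(new WriteLayout(tableBucketing.get().columns()));
        }
        if (partitionColumns.isEmpty() || !shuffleOnPartitionColumnsEnabled) {
            return Optional.empty();
        }
        // Not bucketed: shuffle on the partition columns so each partition is
        // written by one (or a few) writers.
        return Optional.of(new WriteLayout(partitionColumns));
    }

    record Bucketing(List<String> columns) {}

    record WriteLayout(List<String> shuffleColumns) {}
}
```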
@@ -2396,9 +2416,31 @@ public Optional<ConnectorNewTableLayout> getNewTableLayout(ConnectorSession sess
validatePartitionColumns(tableMetadata);
Is this for when they create the table at the same time as the select, and we want to make sure we write it out as if it were bucketed? In other words, it won't ever come back and call getInsertLayout before inserting?
aweisberg left a comment
Just had some questions about how this works in practice.
Also wondering about the hard-coded 1009 bucket count.
I am interested in seeing this in action as we work on tuning the queries that hit this issue.
@wenleix Wenlei, thanks for explaining.
I like this proposal.
Superseded by #14010