Implement Parallel Partition Pruning for Glue Hive Metastore #1465
electrum merged 1 commit into trinodb:master from anoopj:master
Conversation
presto-hive/src/main/java/io/prestosql/plugin/hive/metastore/glue/GlueHiveMetastore.java
nit: per code style, if there are too many params to put on one line, we put every one on a separate line:
public GlueHiveMetastore(
HdfsEnvironment hdfsEnvironment,
GlueHiveMetastoreConfig glueConfig,
@ForGlueHiveMetastore Executor executor)
requireNonNull(executor, "executor is null");
we prefer immutable collections, so: toImmutableList()
however, in this case you don't need the list at all:
.forEach(segment -> completionService.submit(() -> ...));
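A hypothetical, self-contained sketch of the suggested pattern (names like fetchSegment and SegmentFetchSketch are illustrative, not the actual GlueHiveMetastore code): submit one task per segment directly from the stream, with no intermediate list of futures.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Executor;
import java.util.concurrent.ExecutorCompletionService;
import java.util.stream.IntStream;

public class SegmentFetchSketch
{
    // Stand-in for one paginated GetPartitions call over a single segment
    private static List<String> fetchSegment(int segment)
    {
        return List.of("partition-from-segment-" + segment);
    }

    public static List<String> fetchAllSegments(int totalSegments, Executor executor)
            throws InterruptedException, ExecutionException
    {
        CompletionService<List<String>> completionService = new ExecutorCompletionService<>(executor);
        // Submit directly from the stream; no list of futures is needed
        IntStream.range(0, totalSegments)
                .forEach(segment -> completionService.submit(() -> fetchSegment(segment)));
        // Collect results as they complete
        List<String> partitions = new ArrayList<>();
        for (int i = 0; i < totalSegments; i++) {
            partitions.addAll(completionService.take().get());
        }
        return partitions;
    }
}
```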
we don't use abbreviations like "num"
maybe getPartitionThreads ?
- no need for "Glue" here
- add "total" to emphasize this is total cap, not per invocation:
maxGetPartitionTotalThreads ?
Also, what is the rationale for 50 as the default?
What's the default request limit for Glue GetPartition call?
What's the typical duration of a call?
Also, should this be off by default?
Will change variable names. Typically the call is expected to take hundreds of millis. 5 concurrent segments is conservative enough to be on by default.
This needs to be renamed appropriately, see comment at field name.
When hiveConfig.getMaxGlueGetPartitionThreads() == 1 we could return directExecutor() here.
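A minimal sketch of that suggestion, assuming a hypothetical createGetPartitionsExecutor factory; Runnable::run is the dependency-free equivalent of Guava's directExecutor().

```java
import java.util.concurrent.Executor;
import java.util.concurrent.Executors;

public class ExecutorChoiceSketch
{
    // With a single configured thread there is nothing to parallelize,
    // so run tasks directly on the calling thread instead of paying for a pool.
    public static Executor createGetPartitionsExecutor(int maxThreads)
    {
        if (maxThreads == 1) {
            return Runnable::run;
        }
        return Executors.newFixedThreadPool(maxThreads);
    }
}
```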
presto-hive/src/test/java/io/prestosql/plugin/hive/metastore/glue/TestHiveGlueMetastore.java
Sort partitions before returning. We want planning to be deterministic (as much as possible).
When sorting, please do this outside of try block, so that try-catch encompasses as little as possible.
OK. Will do the sorting based on partition values.
Adding this method may change number of test methods effectively run in TestHiveGlueMetastore. Please extract to a separate commit.
This change will cause all of the tests that depend on existing tables to fail (they are currently skipped by the method in the super class). What is the reason for this change?
I found that the getPartitions tests are getting skipped without this change. What is the best way to test them?
It's complicated. Travis invokes presto-hive-hadoop2/bin/run_hive_tests.sh, which:
- starts up HDFS and a Hive metastore in Docker
- runs the presto-hive/src/test/sql/create-test.sql Hive script to create test tables
- runs the AbstractTestHive tests against this environment
The purpose is to create tables in various ways using Hive, then make sure that Presto can read them. That's where the "existing tables" part comes from -- they already exist when the tests run.
It's fine to leave this out, since there are plenty of existing tests that exercise the metastore partition calls. I verified this by running TestHiveInMemoryMetastore with logging on the various getPartition* metastore calls.
@anoopj technically, this looks decent. Are we improving Glue throughput this way?
This change can improve query planning time on heavily partitioned tables because we can do the scan of the partitions in parallel. For large tables with millions of partitions, a query can be stuck in the planning phase for dozens of minutes. My tests show that the query planning time can be improved by up to an order of magnitude.
Please note that Glue GetPartitions is a paginated API that returns only a limited number of partitions per call. So even if the Glue service did parallel reads across segments, it is not likely to help clients, because they would be making the same number of calls to the service anyway.
I think 5 is a conservative number and should be a safe default. For what it's worth, this is the default followed by Spark and Hive on EMR (search for segments on the doc page). We can add some documentation in Presto about the new setting and advise Presto users to either adjust this setting if they run into throttling or contact AWS to raise the throttling limits.
good point.
nit
Suggested change:
- public @interface ForGlueHiveMetastore
+ public @interface ForGlueHiveMetastore {}
it's hierarchical: next page (token) within a segment, so move withSegment before withNextToken
Unrelated, so best to separate commit or drop the change.
@findepi Updated the PR incorporating the review comments.
I have some doubts about intuitiveness of the configuration.
This may be confusing to administrators. Also, we don't have an option to configure a "1 thread throttle". At least in theory, this might be an issue in an organization with multiple clusters hitting Glue API limits. I think we could use
@electrum what's your thinking?
Currently, when hiveConfig.getMaxGlueGetPartitionThreads() == 1, we are using a direct executor. Why don't we just use a thread pool of size 1 so that the behavior is consistent?
Let's rename the config to
electrum
left a comment
A few minor comments. Overall code looks good.
This name seems wrong, since it's not for caching. It could just be createExecutor, since the scope is the Glue metastore module.
Sorry for missing that one.
I think this would be easier to read as a traditional for loop
presto-hive/src/main/java/io/prestosql/plugin/hive/metastore/glue/GlueHiveMetastore.java
We shouldn't need to convert to string here, as List<String> is naturally comparable with the same semantics
Also this should probably be partitions.sort() rather than Collections.sort(partitions)
Will change to partitions.sort(). Not using the string would require me to write a custom comparator that compares List since List is not Comparable. Something like:
partitions.sort((p1, p2) -> {
List<String> values1 = p1.getValues();
List<String> values2 = p2.getValues();
if (values1.size() != values2.size()) {
return values1.size() - values2.size();
}
for (int i = 0; i < values1.size(); i++) {
int c = values1.get(i).compareTo(values2.get(i));
if (c != 0) {
return c;
}
}
return 0;
});
Does that sound reasonable?
partitions.sort(com.google.common.collect.Ordering.natural().lexicographical())
(there should be a way to do this without guava... but i didn't find it.)
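For the record, a Guava-free version is possible with a hand-rolled comparator. This hypothetical sketch matches the semantics of Ordering.natural().lexicographical(): elements are compared first, and list length is only a tiebreaker when one list is a prefix of the other (which differs from comparing sizes up front):

```java
import java.util.Comparator;
import java.util.List;

public class LexicographicalSketch
{
    // JDK-only equivalent of Ordering.natural().lexicographical() for List<String>:
    // compare element by element; a shorter list that is a prefix of a longer
    // one sorts first.
    public static final Comparator<List<String>> BY_VALUES = (left, right) -> {
        int limit = Math.min(left.size(), right.size());
        for (int i = 0; i < limit; i++) {
            int result = left.get(i).compareTo(right.get(i));
            if (result != 0) {
                return result;
            }
        }
        return Integer.compare(left.size(), right.size());
    };
}
```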
This catch block is not needed -- any exception thrown by the test method will fail the test.
Sure. Had to add throws to the parent class method too.
Yep, throws Exception is very common for test methods (and only for test methods)
use ordinary if
if (hiveConfig.getTotalGetPartitionThreads() == 1) {
return directExecutor();
}
@electrum how to say that 1 is a special value here?
The difference is subtle and complicated to explain. I think the description here is fine, though we could document it in the main Hive documentation.
presto-hive/src/main/java/io/prestosql/plugin/hive/metastore/glue/GlueHiveMetastoreConfig.java
I don't think "total" should be here.
@electrum, hive.metastore.glue.partitions-segments ?
Please let me know if this needs a change.
Agreed, let's change this as @findepi suggested. The "total segments" makes sense in terms of the parameter in the API call (since it is saying X of Y), but "partition segments" seems better for a configuration name.
@anoopj a new thought.
@findepi My recommendation is to keep it simple and use parallel calls, because a typical query usually spans way more than 5 partitions. Also, the default throttling limit of Glue is much higher and should allow several concurrent queries (and can be raised by contacting AWS).
I agree with keeping the code simple, but I also agree with @findepi that we could be a bit smarter about parallel reads. One idea is to have
total = min(((partitionCount - 1) / minPartitionsPerSegment) + 1, totalSegments);
This would limit parallelism for small numbers of partitions while adding minimal complexity.
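Sketching that formula in isolation (the names partitionCount, minPartitionsPerSegment, and effectiveSegments are illustrative, not from the PR): it is a capped ceiling division, so small tables get one segment and only large tables fan out to the configured maximum.

```java
public class SegmentCountSketch
{
    // Cap the number of Glue segments so that small partition counts do not
    // fan out into needless parallel calls. Ceiling division is written as
    // ((n - 1) / d) + 1 for positive n.
    public static int effectiveSegments(int partitionCount, int minPartitionsPerSegment, int totalSegments)
    {
        return Math.min(((partitionCount - 1) / minPartitionsPerSegment) + 1, totalSegments);
    }
}
```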
Maybe
Sorry, I missed the last update.
If this is an assumption, then we have divergent assumptions here.
I am aware. This is why I proposed to go parallel only after the first call. Maybe we try to implement that and see how much we sacrifice in code simplicity?
I'm more worried about making an unnecessary API call to switch to parallel partition pruning. In my experience with partitioned tables, the tables were typically partitioned using a set of keys plus a time-series field, and many queries hit several partitions, with some queries spending more time in query planning than in execution -- hence this PR. If there are concerns, we could make this serial by default.
electrum
left a comment
If no one has strong objections, I think this is good to merge with the current behavior. It's a win in the case of many partitions, and for the few partitions case, it likely doesn't add enough overhead to matter. Otherwise, I think we'd need to collect real stats to compare the different cases in various scenarios.
@anoopj can you rebase and address all of the existing comments so that we can get this merged?
@dain and I were looking at this and noticed a big inefficiency in the interaction between Presto and the Glue API design. Presto accesses partition information in two phases: it lists partition names during planning, then fetches the full partition metadata during execution.
The problem is that the Glue API only allows fetching full partition metadata -- there is no efficient way to fetch just the partition names. So for the Glue metastore implementation, we fetch the full partition metadata during planning, throw away everything but the name, then fetch the metadata again during execution. If the Glue API had a way to fetch just partition names, we might not need this segmented fetching, since listing names would (hopefully) be significantly faster.
That is a good observation and we are aware that this is suboptimal. We are currently working on adding a flag to the Glue
I don't think it would obviate the parallel/segmented calls, though, since there could be heavily partitioned tables and queries that need to read a lot of partitions. Maybe we could adjust the defaults to be even lower than 5 based on some tests. I'm a bit overbooked this week and will update the PR with feedback next week, hopefully.
Updated the PR incorporating feedback. I've also tested this on a Presto cluster. On a table with about 2000 partitions,
On heavily partitioned tables, this can result in an order of magnitude improvement.
electrum
left a comment
There was a problem hiding this comment.
A few minor comments, otherwise looks good. There is one minor fix needed for the Travis build.
Nit: use a method reference and static import
private static final Comparator<Partition> PARTITION_COMPARATOR =
        comparing(Partition::getValues, lexicographical(CASE_INSENSITIVE_ORDER));
Nit: put throws clause on next line
Create a shared cached thread pool for the test class using @BeforeClass / @AfterClass so that we shut it down, otherwise we can run out of JVM threads when running many tests. See TestThrottledAsyncQueue for an example.
These need to be updated to match the new names. This is the cause of the Travis CI failure.
Fixed. Sorry for missing that.
This change parallelizes the partition fetch for the Glue metastore by splitting the partitions into non-overlapping segments [1]. This can speed up query planning by up to an order of magnitude.

[1] https://docs.aws.amazon.com/glue/latest/webapi/API_Segment.html
Updated the PR with feedback.
Merged, thanks!
Document the new configurations introduced as part of #1465 and a few other configs introduced over time.

Co-authored-by: Piotr Findeisen <piotr.findeisen@gmail.com>