Support distribution of broadcast table using disk storage in Presto-on-Spark by pgupta2 · Pull Request #15669 · prestodb/presto

pgupta2 · 2021-02-03T01:49:14Z

== Test plan ==

Added unit tests to trigger broadcast join using disk.
Ran verifier for around 100 broadcast join queries.

== RELEASE NOTES ==
General Changes

Add support for distributing broadcast tables using permanent storage, thereby removing the Spark driver from the distribution flow.
This feature can be enabled or disabled using the 'distribute_broadcast_table_using_disk' session property.

Hive Changes

None

linux-foundation-easycla · 2021-02-03T01:49:17Z

The committers are authorized under a signed CLA.

✅ Arjun Gupta (c477f84eb68e2be06fe5c3b51e99db8b87b946cd, c62a7b323446bcd4af8c9a07392b5c0591cf2e3b, f9d6568e0033836c23df7681ec780cf3c5dc5313, 755d0482f85f3135c4c659eda7995bb1c2956b7c)

pgupta2 · 2021-02-03T01:55:19Z

presto-main/src/main/java/com/facebook/presto/SystemSessionProperties.java

This will be disabled by default.

arhimondr

Generally looks good to me, some comments

presto-main/src/main/java/com/facebook/presto/SystemSessionProperties.java

arhimondr · 2021-02-05T22:10:46Z

presto-main/src/main/java/com/facebook/presto/spiller/LocalTempStorage.java

LocalTempStorage is meant to store files locally, and local files are generally not expected to be remotely accessible or to survive a restart.

For broadcast, only a remote storage service can be used, and files should survive a restart.

What do you think about adding a getCapabilities method to the StorageService interface that would return a Set<StorageCapabilities>, where capabilities could be an enum, e.g.:

enum StorageCapabilities { REMOTELY_ACCESSIBLE, PERSISTENT_BETWEEN_RESTARTS }

Then the broadcast implementation should check if the configured storage provides required capabilities.
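A minimal sketch of the capability check proposed above, assuming a simplified StorageService interface. The class name, the REQUIRED_FOR_BROADCAST constant, and the checkSupportsBroadcast helper are hypothetical and are not taken from the Presto code base.

```java
import java.util.EnumSet;
import java.util.Set;

final class StorageCapabilitiesSketch {
    // Capability names follow the review suggestion; the exact names are an assumption.
    enum StorageCapabilities { REMOTELY_ACCESSIBLE, PERSISTENT_BETWEEN_RESTARTS }

    // Hypothetical, simplified stand-in for the real StorageService interface.
    interface StorageService {
        Set<StorageCapabilities> getCapabilities();
    }

    static final Set<StorageCapabilities> REQUIRED_FOR_BROADCAST =
            EnumSet.of(StorageCapabilities.REMOTELY_ACCESSIBLE, StorageCapabilities.PERSISTENT_BETWEEN_RESTARTS);

    // Fails fast when the configured storage lacks a capability that broadcast needs.
    static void checkSupportsBroadcast(StorageService storage) {
        Set<StorageCapabilities> missing = EnumSet.copyOf(REQUIRED_FOR_BROADCAST);
        missing.removeAll(storage.getCapabilities());
        if (!missing.isEmpty()) {
            throw new IllegalStateException("Storage is missing capabilities required for broadcast: " + missing);
        }
    }
}
```

With a check like this, a local-only storage would be rejected at configuration time rather than failing later when an executor tries to read a file it cannot reach.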

CC: @wenleix @viczhang861

arhimondr · 2021-02-05T22:18:05Z

presto-main/src/main/java/com/facebook/presto/spiller/LocalTempStorage.java

Originally, the API of this interface used a URI or Path as an identifier. But then we realized that some storages (e.g., Manifold) may not have primary keys in the form of a URL. That's why we decided to go with an opaque StorageHandle. But as you have noticed, the StorageHandle is not serializable. We planned to add serialization methods later, since serialization wasn't needed for spilling.

What do you think about adding 2 methods to the storage interface:

byte[] serializeHandle(StorageHandle storageHandle)
StorageHandle deserialize(byte[] serializedStorageHandle)

?

Sure. The problem with the storage handle was that it cannot be serialized. I fixed it by adding a new API that accepts a URI, which is serializable (under the assumption that every storage will have a URI). The ideal option would be to make StorageHandle serializable, which would work for any storage implementation.

StorageHandle is just a marker interface. It might be hard to enforce that every implementation is Java serializable. Instead, I suggest adding two explicit methods to the StorageService interface to serialize and deserialize handles.
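The two methods could look roughly like this for a storage whose handles wrap a URI. This is only a sketch: UriStorageHandle and the enclosing class are hypothetical stand-ins, and the methods are written as static helpers here rather than as members of the real StorageService interface.

```java
import java.net.URI;

import static java.nio.charset.StandardCharsets.UTF_8;

final class HandleSerializationSketch {
    // Marker interface, as described in the discussion.
    interface StorageHandle {}

    // Hypothetical handle for a storage that identifies files by URI.
    static final class UriStorageHandle implements StorageHandle {
        private final URI uri;

        UriStorageHandle(URI uri) {
            this.uri = uri;
        }

        URI getUri() {
            return uri;
        }
    }

    // Each storage knows its own handle type, so it can choose its own byte representation.
    static byte[] serializeHandle(StorageHandle storageHandle) {
        if (!(storageHandle instanceof UriStorageHandle)) {
            throw new IllegalArgumentException("Unexpected handle type: " + storageHandle.getClass().getName());
        }
        return ((UriStorageHandle) storageHandle).getUri().toString().getBytes(UTF_8);
    }

    static StorageHandle deserialize(byte[] serializedStorageHandle) {
        return new UriStorageHandle(URI.create(new String(serializedStorageHandle, UTF_8)));
    }
}
```

Keeping the byte format private to the storage means callers only ever pass opaque bytes around, so storages without URL-like keys can use a different encoding.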

arhimondr · 2021-02-06T00:27:15Z

presto-main/src/main/java/com/facebook/presto/SystemSessionProperties.java

Please back that by a configuration property as well

arhimondr · 2021-02-06T00:32:09Z

...-base/src/main/java/com/facebook/presto/spark/PrestoSparkMemoryBasedBroadcastDependency.java

nit: extract the RddAndMore into a standalone class, since it is now used not only by the PrestoSparkQueryExecutionFactory

arhimondr · 2021-02-06T01:35:06Z

...k-base/src/main/java/com/facebook/presto/spark/execution/PrestoSparkTaskExecutorFactory.java

Sometimes pages can be very small. From what I remember, the storage implementation is not required to buffer. If that's the case, we should buffer up to some amount here. For the PageFile format we buffer up to 24 MB by default (that's an optimal block size for tempfs, but it is configurable).

CC: @wenleix @viczhang861
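One way to buffer small pages before handing them to an unbuffered storage sink is sketched below. The class and its flush-count accessor are hypothetical, and the threshold is a constructor parameter rather than the actual Presto configuration.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;

final class BufferedPageWriterSketch implements AutoCloseable {
    private final OutputStream sink;
    private final int flushThresholdBytes;
    private final ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    private int flushCount;

    BufferedPageWriterSketch(OutputStream sink, int flushThresholdBytes) {
        this.sink = sink;
        this.flushThresholdBytes = flushThresholdBytes;
    }

    // Accumulates serialized pages and writes to the sink only once enough bytes are buffered.
    void write(byte[] serializedPage) throws IOException {
        buffer.write(serializedPage);
        if (buffer.size() >= flushThresholdBytes) {
            flush();
        }
    }

    private void flush() throws IOException {
        if (buffer.size() == 0) {
            return;
        }
        buffer.writeTo(sink);
        buffer.reset();
        flushCount++;
    }

    // Exposed only so the batching behavior can be observed in this sketch.
    int getFlushCount() {
        return flushCount;
    }

    @Override
    public void close() throws IOException {
        // Write out any remaining bytes before closing the underlying sink.
        flush();
        sink.close();
    }
}
```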

arhimondr · 2021-02-06T01:35:36Z

...k-base/src/main/java/com/facebook/presto/spark/execution/PrestoSparkTaskExecutorFactory.java

nit: UncheckedIOException

arhimondr · 2021-02-06T01:38:28Z

presto-spark-base/src/main/java/com/facebook/presto/spark/PrestoSparkQueryExecutionFactory.java

nit: ditto about wrapping

arhimondr · 2021-02-06T01:39:51Z

...rk-base/src/main/java/com/facebook/presto/spark/PrestoSparkDiskBasedBroadcastDependency.java

Let's make the broadcastId a String and use a UUID; that is simpler and more reliable. Once you use a local temporary file as a cache, the requirement for the ID to be a long will no longer be there.

This broadcastId is being used to cache the HT in Spark's BlockManager. The BlockManager API requires us to pass a BlockId object, which can be one of several types. The BroadcastBlockId class accepts a long identifier. There is another block ID class called TestBlockId that accepts a string. We could use it, but it seemed counterintuitive since it is meant for testing purposes.

We cannot extend the BlockId class since it's sealed, and we cannot add a new class since we depend on OSS Spark artifacts. So either we keep what we have now, or we use TestBlockId, which accepts a string as input.

arhimondr · 2021-02-06T01:41:00Z

...rk-base/src/main/java/com/facebook/presto/spark/PrestoSparkDiskBasedBroadcastDependency.java

let's wrap remove in a try-catch

arhimondr

Some comments

arhimondr · 2021-02-11T01:06:17Z

presto-main/src/main/java/com/facebook/presto/spiller/LocalTempStorage.java

You can simply do uri.toString().getBytes(UTF_8)

arhimondr · 2021-02-11T01:07:14Z

presto-main/src/main/java/com/facebook/presto/spiller/LocalTempStorage.java

In Presto we prefer using airlift.json. You can have a look at JsonMapper. But for the purpose of serialization for this class we can simply use uri.toString()

arhimondr · 2021-02-11T01:08:21Z

presto-spark-base/src/main/java/com/facebook/presto/spark/PrestoSparkConfig.java

nit: maybe storageBasedBroadcastJoinEnabled (also change the config names)

arhimondr · 2021-02-11T01:08:44Z

presto-spark-base/src/main/java/com/facebook/presto/spark/PrestoSparkConfig.java

nit: maybe storageBasedBroadcastJoinWriteBufferSize (also change the config names)

arhimondr · 2021-02-11T01:10:43Z

presto-spark-base/src/main/java/com/facebook/presto/spark/PrestoSparkSessionProperties.java

nit: if you decide to rename config names please don't forget to rename the session properties to keep the naming consistent

arhimondr · 2021-02-11T02:05:02Z

...o-spark-base/src/main/java/com/facebook/presto/spark/execution/PrestoSparkDiskPageInput.java

Since we now have control over caching, we need to reserve the memory used by the cache using the Presto memory reservation mechanism. It might be somewhat tricky though. Presto offers LocalMemoryContext, which is "per operator". There will be several instances of the PrestoSparkRemoteSourceOperator for a single plan node, though the memory for caching should be accounted for only once.

arhimondr · 2021-02-11T02:07:07Z

...o-spark-base/src/main/java/com/facebook/presto/spark/execution/PrestoSparkDiskPageInput.java

Ideally, we would cache already deserialized pages. However, deserialization should be done outside the lock; only reading pages from the input stream must be done under the lock.
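The split between locked reading and unlocked deserialization might look like the sketch below. An Iterator of byte arrays stands in for the shared input stream and a String stands in for a deserialized page; both are simplifying assumptions, as are all names used here.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

final class LockedReadSketch {
    private final Object lock = new Object();
    // Stand-in for the shared input stream of serialized pages.
    private final Iterator<byte[]> input;

    LockedReadSketch(Iterator<byte[]> input) {
        this.input = input;
    }

    List<String> readAll() {
        // Only the raw reads touch shared state, so only they are done under the lock.
        List<byte[]> raw = new ArrayList<>();
        synchronized (lock) {
            while (input.hasNext()) {
                raw.add(input.next());
            }
        }
        // The potentially expensive deserialization runs outside the lock.
        List<String> pages = new ArrayList<>();
        for (byte[] bytes : raw) {
            pages.add(deserialize(bytes));
        }
        return pages;
    }

    // Placeholder for real page deserialization.
    private static String deserialize(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }
}
```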

arhimondr · 2021-02-11T02:08:07Z

...k-base/src/main/java/com/facebook/presto/spark/execution/PrestoSparkTaskExecutorFactory.java

don't forget to change the config here (to use the broadcast one)

arhimondr · 2021-02-11T02:10:28Z

...k-base/src/main/java/com/facebook/presto/spark/execution/PrestoSparkTaskExecutorFactory.java

Make sure it is closed if the execution terminates exceptionally (it may perform some cleanups on close)

arhimondr

Generally looks good. A whole bunch of nits though. We should be good to go once the comments are resolved.

arhimondr · 2021-02-17T15:22:38Z

...src/main/java/com/facebook/presto/spark/execution/PrestoSparkBroadcastTableCacheManager.java

static final

It's a singleton class. Do we still need to make it static?

Oh, right. That's a good point. Yeah, we don't have to make it static. Let's make it final though.

arhimondr · 2021-02-17T15:23:16Z

...src/main/java/com/facebook/presto/spark/execution/PrestoSparkBroadcastTableCacheManager.java

This might cause the ConcurrentModificationException. Create a copy of the key set before iterating.
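A small sketch of the suggested fix, assuming the cache is a plain HashMap keyed by strings. The key format and the removeStage helper are hypothetical.

```java
import java.util.ArrayList;
import java.util.Map;

final class KeySetCopySketch {
    // Removing entries while iterating directly over cache.keySet() can throw
    // ConcurrentModificationException; iterating over a snapshot of the keys avoids that.
    static void removeStage(Map<String, byte[]> cache, String stagePrefix) {
        for (String key : new ArrayList<>(cache.keySet())) {
            if (key.startsWith(stagePrefix)) {
                cache.remove(key);
            }
        }
    }
}
```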

arhimondr · 2021-02-17T15:24:54Z

...src/main/java/com/facebook/presto/spark/execution/PrestoSparkBroadcastTableCacheManager.java

nit: I would recommend removing this method, as it is not very clear what size it returns (e.g., number of pages, number of cached stages, size in bytes, etc.)

arhimondr · 2021-02-17T15:27:47Z

...src/main/java/com/facebook/presto/spark/execution/PrestoSparkBroadcastTableCacheManager.java

The ClassLayout.parseClass(BroadcastTableCacheKey.class).instanceSize() will only include the size of the BroadcastTableCacheKey object itself; it won't include the size of the stageId and planNodeId objects. Since the memory footprint of the cache keys is very small, I would recommend simply not accounting for the keys' memory usage.

arhimondr · 2021-02-17T15:29:02Z

...src/main/java/com/facebook/presto/spark/execution/PrestoSparkBroadcastTableCacheManager.java

This method will be called on every page. I would recommend caching the current retained size of the cache in a private static long variable and recomputing the size in the cache method.
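The idea can be sketched as follows, with byte arrays standing in for pages. This is a simplification: it uses an instance field and updates the size incrementally on insert rather than recomputing it from scratch, and every name here is hypothetical.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class RetainedSizeCacheSketch {
    private final Map<String, List<byte[]>> cache = new HashMap<>();
    // Maintained on every insert so the getter is O(1) even when called once per page.
    private long retainedSizeInBytes;

    void cache(String key, List<byte[]> pages) {
        List<byte[]> previous = cache.put(key, pages);
        if (previous != null) {
            retainedSizeInBytes -= sizeOf(previous);
        }
        retainedSizeInBytes += sizeOf(pages);
    }

    long getRetainedSizeInBytes() {
        return retainedSizeInBytes;
    }

    private static long sizeOf(List<byte[]> pages) {
        long total = 0;
        for (byte[] page : pages) {
            total += page.length;
        }
        return total;
    }
}
```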

arhimondr · 2021-02-17T16:25:16Z

...rk-base/src/main/java/com/facebook/presto/spark/PrestoSparkDiskBasedBroadcastDependency.java

Drop the String.format(...); the Airlift logger can do the formatting

arhimondr · 2021-02-17T16:31:51Z

...o-spark-base/src/main/java/com/facebook/presto/spark/execution/PrestoSparkDiskPageInput.java

ImmutableList.builder()

arhimondr · 2021-02-17T16:36:04Z

...o-spark-base/src/main/java/com/facebook/presto/spark/execution/PrestoSparkDiskPageInput.java

This list could be immutable or shared; I would recommend creating a copy and then shuffling the copy
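A minimal sketch of the copy-then-shuffle pattern; the helper name is hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class ShuffleCopySketch {
    // Shuffles a private copy so that an immutable or shared input list is never mutated.
    static <T> List<T> shuffledCopy(List<T> input) {
        List<T> copy = new ArrayList<>(input);
        Collections.shuffle(copy);
        return copy;
    }
}
```

Calling Collections.shuffle directly on an unmodifiable list would throw UnsupportedOperationException, and shuffling a shared list in place would reorder it for every other holder of the reference.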

arhimondr · 2021-02-17T16:40:04Z

...k-base/src/main/java/com/facebook/presto/spark/execution/PrestoSparkTaskExecutorFactory.java

remove the requireNonNull check

arhimondr

LGTM % comments

arhimondr · 2021-02-18T16:30:12Z

presto-hive/src/main/java/com/facebook/presto/hive/HivePartitionManager.java

It seems like a rebase artifact

arhimondr · 2021-02-18T16:36:07Z

...k-base/src/main/java/com/facebook/presto/spark/execution/PrestoSparkTaskExecutorFactory.java

Check for ioException != exception, as self suppression is not allowed
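The guard matters because Throwable.addSuppressed throws IllegalArgumentException when an exception is asked to suppress itself. A sketch of closing a resource after a failure, with hypothetical names:

```java
import java.io.Closeable;
import java.io.IOException;

final class CloseOnFailureSketch {
    // Closes the resource after the main execution has failed, attaching any
    // close-time failure to the original exception without self-suppression.
    static void closeAfterFailure(Closeable resource, Throwable failure) {
        try {
            resource.close();
        }
        catch (IOException | RuntimeException closeFailure) {
            // addSuppressed(this) is not allowed, so skip it when close rethrows the same exception.
            if (closeFailure != failure) {
                failure.addSuppressed(closeFailure);
            }
        }
    }
}
```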

arhimondr · 2021-02-18T16:36:28Z

...k-base/src/main/java/com/facebook/presto/spark/execution/PrestoSparkTaskExecutorFactory.java

arhimondr · 2021-02-18T16:41:50Z

...src/main/java/com/facebook/presto/spark/execution/PrestoSparkBroadcastTableCacheManager.java

As an afterthought: although the cache is currently always used from a single thread, since it is a singleton it might accidentally be used from different threads. I would recommend protecting all the public methods with synchronized to be on the safe side.

Spark-provided broadcast variables are not scalable enough to reliably distribute large volumes of data (gigabytes). Broadcast variables also cause additional memory overhead, as serialized blocks have to be stored in memory for the torrent algorithm to function. This implementation uses distributed storage as a medium instead. It stores the broadcast data in distributed storage and then broadcasts only the pointers (usually file names) with a Broadcast variable to let the executors know where to read the broadcast data. This PR doesn't provide a StorageService implementation. Testing is done in local mode with the local file system as the storage.

pgupta2 requested a review from arhimondr February 3, 2021 01:54

pgupta2 force-pushed the broadcast_using_checkpoint branch 2 times, most recently from 5677060 to f77c9d3 Compare February 5, 2021 00:31

arhimondr reviewed Feb 6, 2021

arhimondr reviewed Feb 11, 2021

pgupta2 force-pushed the broadcast_using_checkpoint branch 2 times, most recently from 3dfc264 to ad52305 Compare February 17, 2021 07:35

arhimondr reviewed Feb 17, 2021

arhimondr requested a review from viczhang861 February 17, 2021 16:51

pgupta2 force-pushed the broadcast_using_checkpoint branch from ad52305 to 755d048 Compare February 18, 2021 07:24

arhimondr approved these changes Feb 18, 2021

pgupta2 force-pushed the broadcast_using_checkpoint branch 3 times, most recently from 700f802 to 427fd03 Compare February 24, 2021 21:29

pgupta2 force-pushed the broadcast_using_checkpoint branch from 427fd03 to ab26717 Compare February 26, 2021 19:55

arhimondr merged commit bd93326 into prestodb:master Mar 1, 2021

arhimondr mentioned this pull request Mar 12, 2021

Add release notes for 0.249 #15831

Merged

13 tasks

ajaygeorge mentioned this pull request Mar 15, 2021

[TEST] Add release notes for 0.249 #15835

Closed

9 tasks
