Support distribution of broadcast table using disk storage in Presto-on-Spark #15669
Conversation
This will be disabled by default.
5677060 to f77c9d3
arhimondr left a comment
Generally looks good to me, some comments
presto-main/src/main/java/com/facebook/presto/SystemSessionProperties.java
LocalTempStorage is expected to store files locally, and local files are generally not remotely accessible and are not expected to survive a restart.
For broadcast, only a remote storage service can be used, and files must survive a restart.
What do you think about adding a getCapabilities method to the StorageService interface that would return a Set<StorageCapabilities>, where StorageCapabilities could be an enum, e.g.:
StorageCapabilities {
    REMOTELY_ACCESSIBLE,
    PERSISTENT_BETWEEN_RESTARTS,
}
Then the broadcast implementation should check whether the configured storage provides the required capabilities.
CC: @wenleix @viczhang861
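The capability check suggested above could look roughly like the following. This is a minimal sketch, not the PR's actual code: the `StorageService` stand-in, the enum, and `checkBroadcastCapable` are hypothetical names used for illustration.

```java
import java.util.EnumSet;
import java.util.Set;

public class CapabilitiesSketch {
    // Hypothetical capabilities enum, as proposed in the review
    enum StorageCapabilities {
        REMOTELY_ACCESSIBLE,
        PERSISTENT_BETWEEN_RESTARTS,
    }

    // Simplified stand-in for the real StorageService interface
    interface StorageService {
        Set<StorageCapabilities> getCapabilities();
    }

    // Broadcast requires a remote, restart-surviving storage
    static void checkBroadcastCapable(StorageService storage) {
        Set<StorageCapabilities> required = EnumSet.of(
                StorageCapabilities.REMOTELY_ACCESSIBLE,
                StorageCapabilities.PERSISTENT_BETWEEN_RESTARTS);
        if (!storage.getCapabilities().containsAll(required)) {
            throw new IllegalArgumentException("configured storage does not support broadcast");
        }
    }

    public static void main(String[] args) {
        StorageService remote = () -> EnumSet.allOf(StorageCapabilities.class);
        StorageService local = () -> EnumSet.noneOf(StorageCapabilities.class);
        checkBroadcastCapable(remote); // passes
        boolean rejected = false;
        try {
            checkBroadcastCapable(local);
        }
        catch (IllegalArgumentException e) {
            rejected = true;
        }
        System.out.println("local rejected: " + rejected);
    }
}
```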
Originally the API of this interface used a URI or Path as an identifier. But then we realized that some storages (e.g. Manifold) may not have primary keys in the form of a URL. That's why we decided to go with an opaque StorageHandle. But as you have noticed, the StorageHandle is not serializable. We thought about adding serialization methods later (for the purpose of spilling, serialization wasn't needed).
What do you think about adding 2 methods to the storage interface:
byte[] serializeHandle(StorageHandle storageHandle)
StorageHandle deserialize(byte[] serializedStorageHandle)
?
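A sketch of the two proposed methods, assuming a trivial handle backed by an opaque string key (the `KeyHandle` class and the in-line implementation are hypothetical, for illustration only):

```java
import static java.nio.charset.StandardCharsets.UTF_8;

public class HandleSerializationSketch {
    // Opaque marker interface, as in the existing spilling API
    interface StorageHandle {}

    // The two methods proposed in the review
    interface StorageService {
        byte[] serializeHandle(StorageHandle handle);
        StorageHandle deserialize(byte[] serializedHandle);
    }

    // Illustrative handle backed by an opaque string key (not necessarily a URI)
    static final class KeyHandle implements StorageHandle {
        final String key;
        KeyHandle(String key) { this.key = key; }
    }

    static final StorageService SERVICE = new StorageService() {
        @Override
        public byte[] serializeHandle(StorageHandle handle) {
            return ((KeyHandle) handle).key.getBytes(UTF_8);
        }

        @Override
        public StorageHandle deserialize(byte[] serializedHandle) {
            return new KeyHandle(new String(serializedHandle, UTF_8));
        }
    };

    public static void main(String[] args) {
        KeyHandle original = new KeyHandle("broadcast/42/part-0");
        StorageHandle roundTripped = SERVICE.deserialize(SERVICE.serializeHandle(original));
        System.out.println(((KeyHandle) roundTripped).key);
    }
}
```

This keeps the handle opaque to callers while still letting the coordinator ship it to executors as bytes.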
Sure. The problem with StorageHandle was that it cannot be serialized. I fixed it by adding a new API that accepts a URI, which is serializable (under the assumption that every storage will have a URI). The ideal option would be to make StorageHandle serializable, which would work for any storage implementation.
StorageHandle is just a marker interface. It might be hard to enforce that every implementation is Java serializable. Instead, I suggest adding 2 explicit methods to the StorageService interface to serialize / deserialize handles.
Please back that with a configuration property as well
nit: extract RddAndMore into a standalone class, since now it is used not only by the PrestoSparkQueryExecutionFactory
Sometimes pages can be very small. From what I remember, the storage implementation is not required to buffer. If that's the case, we should buffer up to some amount here. For the PageFile format we buffer up to 24MB by default (that's an optimal block size for tmpfs, but it is configurable).
CC: @wenleix @viczhang861
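The buffering suggested above can be sketched by wrapping the storage-provided stream in a `BufferedOutputStream`; the constant name and wrapper are assumptions for illustration, with the 24MB default taken from the PageFile figure mentioned in the review.

```java
import java.io.BufferedOutputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;

public class BufferedWriteSketch {
    // Hypothetical default matching the PageFile write buffer mentioned in the review
    static final int DEFAULT_WRITE_BUFFER_BYTES = 24 * 1024 * 1024;

    // Wrap the storage-provided stream so many small pages coalesce into large writes
    static OutputStream buffered(OutputStream storageStream, int bufferBytes) {
        return new BufferedOutputStream(storageStream, bufferBytes);
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (OutputStream out = buffered(sink, 1024)) {
            for (int i = 0; i < 100; i++) {
                out.write(new byte[] {1, 2, 3}); // small "pages"
            }
        }
        System.out.println(sink.size());
    }
}
```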
Let's make the broadcastId a String and use a UUID; that is simpler and more reliable. Once you use a local temporary file as a cache, the requirement for the ID to be a long will no longer be there.
This broadcastId is used to cache the hash table in Spark's BlockManager. The BlockManager API requires us to pass a BlockId object, which can be of different types. The BroadcastBlockId class accepts a long identifier. There is another BlockId called TestBlockId that accepts a String. We can use it, but that sounded counter-intuitive since it is meant for testing purposes.
We cannot extend the BlockId class since it's sealed, and we cannot add a new class since we depend on OSS Spark artifacts. So either we use what we have now, or we use TestBlockId, which accepts a String as input.
let's wrap remove in a try-catch
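A minimal sketch of the best-effort cleanup being asked for; the `TempStorage` stand-in and `tryRemove` helper are hypothetical names, not the PR's actual code.

```java
import java.io.IOException;

public class SafeRemoveSketch {
    // Simplified stand-in for the storage interface
    interface TempStorage {
        void remove(String handle) throws IOException;
    }

    // Best-effort cleanup: a failed remove should not fail the query
    static boolean tryRemove(TempStorage storage, String handle) {
        try {
            storage.remove(handle);
            return true;
        }
        catch (IOException e) {
            // in the real code this would be logged, e.g. log.warn(e, "failed to remove %s", handle)
            return false;
        }
    }

    public static void main(String[] args) {
        TempStorage failing = handle -> { throw new IOException("gone"); };
        System.out.println(tryRemove(failing, "broadcast/1"));
    }
}
```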
You can simply do uri.toString().getBytes(UTF_8)
In Presto we prefer using airlift.json. You can have a look at JsonMapper. But for the purpose of serializing this class we can simply use uri.toString()
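The uri.toString() approach from the two comments above amounts to a simple round trip; the helper names below are illustrative, not from the PR.

```java
import static java.nio.charset.StandardCharsets.UTF_8;

import java.net.URI;

public class UriSerializationSketch {
    // Serialize a URI-based handle as suggested: uri.toString().getBytes(UTF_8)
    static byte[] serialize(URI uri) {
        return uri.toString().getBytes(UTF_8);
    }

    static URI deserialize(byte[] bytes) {
        return URI.create(new String(bytes, UTF_8));
    }

    public static void main(String[] args) {
        URI uri = URI.create("file:///tmp/broadcast/part-0");
        System.out.println(deserialize(serialize(uri)));
    }
}
```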
nit: maybe storageBasedBroadcastJoinEnabled (also change the config names)
nit: maybe storageBasedBroadcastJoinWriteBufferSize (also change the config names)
nit: if you decide to rename config names please don't forget to rename the session properties to keep the naming consistent
Since we now have control over caching, we need to reserve the memory used by the cache through the Presto memory reservation mechanism. It might be somewhat tricky though. Presto offers a LocalMemoryContext that is "per operator". There will be several instances of the PrestoSparkRemoteSourceOperator for a single plan node, but the memory for caching should be accounted only once.
Ideally it would be best to cache already-deserialized pages. Deserialization should be done outside the lock, though; only reading pages from the input stream must be done under the lock.
don't forget to change the config here (to use the broadcast one)
Make sure it is closed if the execution terminates exceptionally (it may perform some cleanups on close)
...k-base/src/main/java/com/facebook/presto/spark/execution/PrestoSparkTaskExecutorFactory.java
3dfc264 to ad52305
arhimondr left a comment
Generally looks good. A whole bunch of nits though. We should be good to go once the comments are resolved.
It's a singleton class. Do we still need to make it static?
Oh, right. That's a good point. Yeah, we don't have to make it static. Let's make it final though.
This might cause a ConcurrentModificationException. Create a copy of the key set before iterating.
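A small sketch of the fix (the cache map here is a hypothetical stand-in): iterating `keySet()` directly while removing from the map typically throws `ConcurrentModificationException`, whereas iterating a copy is safe.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Map;

public class SafeIterationSketch {
    public static void main(String[] args) {
        Map<String, String> cache = new HashMap<>();
        cache.put("stage-1", "table1");
        cache.put("stage-2", "table2");
        // Copy the key set first, so removals don't invalidate the iterator
        for (String key : new ArrayList<>(cache.keySet())) {
            cache.remove(key);
        }
        System.out.println(cache.size());
    }
}
```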
nit: I would recommend removing this method as it is not very clear what size it returns (e.g. number of pages, number of cached stages, size in bytes, etc.)
The ClassLayout.parseClass(BroadcastTableCacheKey.class).instanceSize() will only include the size of the BroadcastTableCacheKey object itself; it won't include the sizes of the stageId and planNodeId objects. Since the memory footprint of the cache keys is very small, I would recommend simply not accounting for the keys' memory usage, for simplicity.
This method will be called on every page. I would recommend caching the current retained size of the cache in a private static long variable, and recomputing the size in the cache method.
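A sketch of that bookkeeping, with hypothetical method names: the retained size is updated when pages are cached, so the per-page getter is a constant-time field read rather than a recomputation.

```java
public class RetainedSizeSketch {
    // Updated only when the cache changes, not on every page
    private static long currentRetainedSizeInBytes;

    static void cache(long addedPageSizeInBytes) {
        currentRetainedSizeInBytes += addedPageSizeInBytes;
    }

    static long getRetainedSizeInBytes() {
        return currentRetainedSizeInBytes; // O(1), safe to call per page
    }

    public static void main(String[] args) {
        cache(1024);
        cache(2048);
        System.out.println(getRetainedSizeInBytes());
    }
}
```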
Drop the String.format(); the Airlift logger can do formatting
This could potentially be immutable or shared; I would recommend creating a copy and then shuffling.
...k-base/src/main/java/com/facebook/presto/spark/execution/PrestoSparkTaskExecutorFactory.java
remove the requireNonNull check
ad52305 to 755d048
It seems like a rebase artifact
Check for ioException != exception, as self-suppression is not allowed
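The reason for the identity check: `Throwable.addSuppressed` throws `IllegalArgumentException` when asked to suppress the exception itself. A sketch of the guarded pattern (the helper name is hypothetical):

```java
import java.io.IOException;

public class SuppressionSketch {
    // addSuppressed(t) throws IllegalArgumentException when t is the exception itself,
    // so the identity check is required before suppressing
    static void closeWithSuppression(Throwable inScopeException, AutoCloseable closeable) {
        try {
            closeable.close();
        }
        catch (Exception closeException) {
            if (inScopeException != closeException) {
                inScopeException.addSuppressed(closeException);
            }
        }
    }

    public static void main(String[] args) {
        IOException primary = new IOException("read failed");
        IOException onClose = new IOException("close failed");
        closeWithSuppression(primary, () -> { throw onClose; });
        System.out.println(primary.getSuppressed().length);
    }
}
```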
as an afterthought: although the cache is currently always used from a single thread, since it is a singleton it might accidentally be used from different threads. I would recommend protecting all the public methods with synchronized, to be on the safe side.
700f802 to 427fd03
Spark-provided broadcast variables are not scalable enough to reliably distribute large volumes of data (gigabytes). Broadcast variables also cause additional memory overhead, as serialized blocks have to be stored in memory for the torrent algorithm to function. This implementation uses distributed storage as a medium: it stores the broadcast data in a distributed storage and then broadcasts only the pointers (usually file names) with a Broadcast variable, to let the executors know where to read the broadcast data. This PR doesn't provide a StorageService implementation. Testing is done in local mode with a local file system as local storage.
427fd03 to ab26717
== Test plan ==
== RELEASE NOTES ==
General Changes
This feature can be enabled/disabled using the 'distribute_broadcast_table_using_disk' session property
Hive Changes