Conversation
Force-pushed 3325fd5 to 659298d.
Would also love to see this or #16375 merged.
Since we use soft scheduling, this might end up unnecessarily caching data on nodes whenever soft scheduling fails. That is, if a piece of data A is supposed to be cached on node 1, but node 1 is too busy, it might be scheduled to run on node 2. Then node 2 will read A from cloud storage, and add it to its cache. However, the next time data A is supposed to be read, and node 1 is still too busy, we are just as likely to schedule the read to happen on node 3. So we will just end up consuming write capacity and cache capacity from node 2, without really getting any benefit of caching. This will especially be a problem in a cluster with a lot of nodes.
The nodes need a way to know if they are supposed to cache a piece of data or not, and fall back to not caching.
That might be possible to mitigate by populating a list of preferred host addresses in the connector split instead of just one host.
Would this potentially reduce the cache hit rate? For instance, let's say we have 100 nodes, each with a 1 TB cache, 100 TB of data we want to query, and all data is only cached on a single node. If soft scheduling never fails we will eventually have a 100% cache hit rate. If soft scheduling fails 10% of the time, 10% of the data will be scheduled on a random node not supposed to cache this data, and store "garbage" in its cache. Eventually 10% of all cached data is not cached on a node with the correct cache key. This should give roughly an 80% cache hit rate.
If on the other hand we cache all data on two nodes, then only 50 TB of the data we want to query can be cached. This should give a cache hit rate of 50%, even if soft scheduling never fails.
If nodes only cache data they are supposed to cache, we would get a 90% cache hit rate if soft scheduling fails 10% of the time.
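One way to make those estimates explicit (my reading of the arithmetic above, assuming the total cache is exactly as large as the dataset and that misplaced entries permanently occupy a matching share of each node's cache):

$$
\Pr[\text{hit}] \approx \underbrace{(1-p)}_{\text{split lands on its owner node}} \times \underbrace{(1-p)}_{\text{share of the owner's cache holding correctly keyed data}} = 0.9 \times 0.9 \approx 0.8
$$

If nodes decline to cache data they do not own, the second factor stays at 1 and the hit rate is simply 1 - p = 0.9, while replicating every item on two nodes halves the cacheable fraction of the dataset and caps the hit rate at 50% regardless of p.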
I don't think this is a problem as long as you deterministically provide the same list of preferred host addresses for a given split and the scheduler attempts to schedule splits on these hosts in the provided preferred order. I'm assuming that the probability that all the preferred hosts are too busy to schedule splits on is low with 3 preferred hosts. You might cache some data on multiple nodes, but that would be useful when the primary preferred node is too busy.
A couple of alternatives to this are:
- The worker node checks whether it is the preferred host for the split and uses the non-cached file system implementation when it's not. This would give up on caching if the preferred host is too busy to schedule splits on.
- The embedded cache implementation itself has the ability to remotely read cached data from any worker. This way even if a split gets scheduled on a node which didn't cache the data, it can still read the cached data from another node. This is the approach that Rubix takes. However, this is probably not worth the added complexity.
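To illustrate the "deterministic ordered list of preferred hosts" idea, here is a minimal, self-contained sketch using rendezvous (highest-random-weight) hashing. It is not the ConsistentHashingNodeProvider from this PR; the class and method names are made up, and a real implementation would use a stronger hash than String.hashCode().

```java
import java.util.Comparator;
import java.util.List;

public class PreferredHostsExample
{
    // Deterministically rank all workers for a cache key: the same key always yields
    // the same ordered list, so the primary owner is stable and so are the fallbacks.
    public static List<String> preferredHosts(String cacheKey, List<String> workers, int count)
    {
        return workers.stream()
                .sorted(Comparator.comparingLong((String worker) -> score(cacheKey, worker)).reversed())
                .limit(count)
                .toList();
    }

    private static long score(String cacheKey, String worker)
    {
        // Any stable hash works for the sketch; production code would use a better one
        return (cacheKey + "|" + worker).hashCode() & 0xffffffffL;
    }

    public static void main(String[] args)
    {
        List<String> workers = List.of("worker-1", "worker-2", "worker-3", "worker-4");
        // First entry is the primary cache owner; the rest are fallbacks when it is too busy
        System.out.println(preferredHosts("s3://bucket/part-0001.parquet", workers, 3));
    }
}
```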
It's a bit unfortunate that we have to do the caching on the file level, and not on the part-of-file level. With the current caching scheme a single large file from cloud storage will be cached on a single node. For a large and hot file this could potentially lead to problems. But it might be hard to solve, and might not be worth it.
@beinan could you please weigh in on whether this is possible to improve?
An update on this here - based on the feedback above, we're reworking the PR to:
We're planning to polish it a bit internally, then bring it up here for discussion, probably next week.
Amazing, thank you for this work.
Force-pushed 659298d to d5118f6.
Hey, a small update here. We have just pushed our latest changes with the promised refactor. There is likely still some work to do, especially around testing. Still, we hope this code structure can be a good basis for continuing the review. To summarize some of the changes and open questions:
If anyone wants to test this PR with other connectors, please be aware that the …
Force-pushed d5118f6 to 5684fe2.
Thanks for your work on this. Can you move … Is the Alluxio cache code fundamentally tied to Hadoop? I see …
Force-pushed 5684fe2 to 8132290.
I've moved it to a new module …
@jkylling can you rebase to resolve the conflict?
Yes, I'm looking at it. It's a bit tricky, as the alluxio-shaded client contains a lot of classes which can overlap with other dependencies, and rebasing on master has brought about a new set of duplicate resources. I'm trying to switch to using non-shaded Alluxio libraries instead.
@jkylling we should use the shaded version if possible.
mosabua left a comment:
Just a few suggestions for the javadocs while I work on the documentation PR.
Btw, this is the extension of this PR for the Iceberg connector, in case somebody is interested: amoghmargoor@cdefc91
We definitely want to get this into the project as soon as we can as well. Ideally also for Hive and Hudi, but best to do it after this PR merges. Hopefully soon.
Makes sense. Looking forward to it. Thanks!
Thanks for all the hard work on this; we look forward to this being ready and merged. This will be a huge and much-needed feature.
We've been running with an earlier iteration of this PR for a while now, and see equal or better performance, depending on the volumes used for caching. With EBS we have seen equal performance, while with NVMe disks on the nodes we have seen maybe a 10%-20% speed-up on some benchmark workloads (@Pluies has the exact numbers). Typically we see worse performance while the cache is being populated, and then slightly better performance once the cache is warm. I did a very unscientific benchmark with a build using the latest iteration of this PR and got the numbers below (1 coordinator + 2 workers, m6g.8xlarge, 30 CPU, 115 Gi, each with a 2000 Gi EBS volume for caching): That is, with a cold cache, the scan of 84 GB takes ~38 seconds, with a warm cache it takes ~22 seconds, and without a cache it takes ~33 seconds.
Please rebase to the latest master as well.
private DataSize cachePageSize = DataSize.valueOf("1MB");

@NotNull
public List<@FileExists String> getCacheDirectories()
I'm not sure that checking for existence here is correct, as a previously working config now fails:
2024-01-31T14:51:10.4367179Z 2024-01-31T14:51:06.347Z ERROR main io.trino.server.Server Configuration errors:
2024-01-31T14:51:10.4367195Z
2024-01-31T14:51:10.4368303Z 1) Error: Invalid configuration property with prefix '': file does not exist: /opt/data (for class AlluxioFileSystemCacheConfig.cacheDirectories[0].<list element>)
2024-01-31T14:51:10.4368323Z
2024-01-31T14:51:10.4368437Z 1 error
2024-01-31T14:51:10.4368443Z
2024-01-31T14:51:10.4368577Z ======================
2024-01-31T14:51:10.4368712Z Full classname legend:
2024-01-31T14:51:10.4368848Z ======================
2024-01-31T14:51:10.4369391Z AlluxioFileSystemCacheConfig: "io.trino.filesystem.alluxio.AlluxioFileSystemCacheConfig"
2024-01-31T14:51:10.4369533Z ========================
2024-01-31T14:51:10.4369672Z End of classname legend:
2024-01-31T14:51:10.4369802Z ========================
2024-01-31T14:51:10.4369808Z
2024-01-31T14:51:10.4370288Z io.airlift.bootstrap.ApplicationConfigurationException: Configuration errors:
2024-01-31T14:51:10.4370378Z
2024-01-31T14:51:10.4371486Z 1) Error: Invalid configuration property with prefix '': file does not exist: /opt/data (for class AlluxioFileSystemCacheConfig.cacheDirectories[0].<list element>)
2024-01-31T14:51:10.4371493Z
2024-01-31T14:51:10.4371626Z 1 error
2024-01-31T14:51:10.4371631Z
2024-01-31T14:51:10.4371749Z ======================
2024-01-31T14:51:10.4371898Z Full classname legend:
2024-01-31T14:51:10.4372014Z ======================
2024-01-31T14:51:10.4372575Z AlluxioFileSystemCacheConfig: "io.trino.filesystem.alluxio.AlluxioFileSystemCacheConfig"
2024-01-31T14:51:10.4372700Z ========================
2024-01-31T14:51:10.4372854Z End of classname legend:
2024-01-31T14:51:10.4372988Z ========================
2024-01-31T14:51:10.4372994Z
2024-01-31T14:51:10.4373348Z at io.airlift.bootstrap.Bootstrap.configure(Bootstrap.java:217)
2024-01-31T14:51:10.4373726Z at io.airlift.bootstrap.Bootstrap.initialize(Bootstrap.java:246)
2024-01-31T14:51:10.4374592Z at io.trino.plugin.deltalake.InternalDeltaLakeConnectorFactory.createConnector(InternalDeltaLakeConnectorFactory.java:112)
2024-01-31T14:51:10.4375225Z at io.trino.plugin.deltalake.DeltaLakeConnectorFactory.create(DeltaLakeConnectorFactory.java:56)
Does Alluxio create the directory automatically if it doesn't exist?
All tests are green even though Trino won't start with the Alluxio cache configuration right now. That's why we need an end-to-end test in this PR.
Alluxio will create the directory if it does not exist. I've updated the validation mechanism to check if any parent directory is writable.
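For illustration, here is a rough, self-contained sketch of the kind of check described ("any parent directory is writable"), assuming the intent is to accept a configured cache directory whose closest existing ancestor is a writable directory; the class and method names are hypothetical and this is not the exact validation code in the PR.

```java
import java.nio.file.Files;
import java.nio.file.Path;

public final class CacheDirectoryValidation
{
    private CacheDirectoryValidation() {}

    // Walk up from the configured directory to its closest existing ancestor and check
    // that it is a writable directory, since Alluxio creates missing directories itself.
    public static boolean hasWritableAncestor(Path directory)
    {
        Path current = directory.toAbsolutePath();
        while (current != null) {
            if (Files.exists(current)) {
                return Files.isDirectory(current) && Files.isWritable(current);
            }
            current = current.getParent();
        }
        return false;
    }

    public static void main(String[] args)
    {
        System.out.println(hasWritableAncestor(Path.of("/opt/data/LOCAL")));
    }
}
```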
I've added a product test which tests the configuration and the caching.
I encountered some problems when running the product tests. The Trino nodes launched by the product tests crashed with an error message saying that Java was not installed. The root cause turned out to be that the JDK downloaded by the product test to /tmp/ptl-tmp-download/temurin21-amd64 was corrupted and missing the library libjli.so. Deleting /tmp/ptl-tmp-download solved the issue. I'm not exactly sure how this state came about, but thought I'd write it down for posterity.
The error message on java.nio.file.NoSuchFileException: /opt/data/LOCAL seems to come from Alluxio's TTL mechanism. From what I can tell, this code will always throw when encountering an empty cache. It's not entirely clear whether this is harmless, as it might cause the scheduleAtFixedRate task to stop executing.
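For reference, the ScheduledExecutorService Javadoc states that if any execution of a task scheduled with scheduleAtFixedRate throws an exception, subsequent executions are suppressed. The standalone example below (unrelated to the Alluxio code) demonstrates that behavior: only one "tick" is printed before the task stops running.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class SuppressedScheduleExample
{
    public static void main(String[] args)
            throws InterruptedException
    {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(() -> {
            System.out.println("tick");
            // Simulates a failure such as the NoSuchFileException seen on an empty cache
            throw new RuntimeException("simulated failure");
        }, 0, 100, TimeUnit.MILLISECONDS);

        Thread.sleep(500);
        scheduler.shutdownNow();
        // Wrapping the task body in try/catch would keep the schedule alive despite failures
    }
}
```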
raunaqmorarka left a comment:
You can squash the last two commits.
Is the TODO "CheckpointEntryIterator is always closed" still pending? If it is, please include a link to an open GH issue.
It should be resolved by #20054 now. I'll enable the caching of checkpoint files.
Please make sure we include some test with checkpoint files in delta if we're doing that in this PR.
The TestDeltaLakeAlluxioCacheFileOperations does have coverage for the cache operations for checkpoint files, but it's unclear how much coverage on checkpoints we get through the TestDeltaLakeAlluxioCacheMinioAndHmsConnectorSmokeTest test. It might be prudent to just leave it as is, and consider it as a future optimization. I'll keep the code as is, and leave a TODO to enable caching of checkpoint files.
Enabling caching of checkpoint files could make the delta.checkpoint-filtering.enabled=true feature work better. When we tried enabling this feature earlier we saw a major slowdown of the planning and analysis phases. The in-memory cache of checkpoints seemed to no longer be used, and lots of requests were made by the coordinator to object storage to fetch checkpoint files. However, if we only need to read the checkpoint files from disk it could work without the in-memory checkpoint cache.
Ran the SF1000 TPC benchmark on 6 r6gd.8xlarge workers.
Results on TPC look pretty good: there is a significant reduction in wall time and some reduction in CPU time. As a follow-up we should look at exposing bytes read from cache as a connector metric; this will make it easy to see usage of the cache for each table scan in a query in the output of EXPLAIN ANALYZE VERBOSE, the queryinfo JSON, event listener metrics, etc.
I'm going to merge the Rubix drop in a while (#20102), so we can rebase and drop the conflict resolution commit from this PR.
@jkylling please rephrase the Rubix commit. We've decided internally to remove Rubix when we implement Hive caching with Alluxio. We will merge this PR and add Hive support in a separate change.
Co-authored-by: Florent Delannoy <florent.delannoy@gmail.com>
electrum left a comment:
See comments, otherwise looks good.
<groupId>org.alluxio</groupId>
<artifactId>alluxio-core-common</artifactId>
<version>${dep.alluxio.version}</version>
<exclusions>
Can we file an issue to fix these in Alluxio?
DistributedQueryRunner queryRunner = DistributedQueryRunner.builder(session)
        .build();
try {
    File metastoreDirectory = queryRunner.getCoordinator().getBaseDataDir().resolve("delta_lake_metastore").toFile().getAbsoluteFile();
We're trying to migrate the testing code so that
- we create connectors using properties rather than hand wiring
- we use OpenTelemetry tracing rather than custom tracking code
For example, 458bfd7 replaced a custom metastore wrapper with a getSpans() method on DistributedQueryRunner (take a look at the assertMetastoreInvocationsForQuery() utility method). If we can do a similar thing here, the test construction becomes simpler and easier to maintain, and we know that we're testing the same code that runs in production.
I'd like to remove TrackingFileSystemFactory, so it's best if we don't introduce more usages of it.
<scope>runtime</scope>
</dependency>

<dependency>
If we can convert TestDeltaLakeAlluxioCacheFileOperations to construct the connector with properties and use tracing (per my other comment), then we should be able to remove these runtime dependencies.
import java.util.List;

public class NoneCachingHostAddressProvider
I think NoCachingHostAddressProvider would sound better.
    factories.addBinding("gs").to(GcsFileSystemFactory.class);
}

newOptionalBinder(binder, CachingHostAddressProvider.class).setDefault().to(NoneCachingHostAddressProvider.class).in(Scopes.SINGLETON);
Rather than using an optional binder here, the configured cache implementation should bind the implementation. So the NONE cache should install NoCachingHostAddressProvider and ALLUXIO should install ConsistentHashingHostAddressProvider.
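A minimal sketch of the suggested structure, using plain Guice with hypothetical stand-in types (the real Trino modules are wired differently, e.g. via config-driven conditional modules): the selected cache implementation decides which CachingHostAddressProvider binding is installed, rather than relying on an optional binder with a default.

```java
import com.google.inject.AbstractModule;
import com.google.inject.Guice;
import com.google.inject.Scopes;

public class FileSystemCacheModuleSketch
        extends AbstractModule
{
    // Hypothetical stand-ins for the types mentioned in the review comment
    public interface CachingHostAddressProvider {}
    public static class NoCachingHostAddressProvider implements CachingHostAddressProvider {}
    public static class ConsistentHashingHostAddressProvider implements CachingHostAddressProvider {}

    public enum CacheType { NONE, ALLUXIO }

    private final CacheType cacheType;

    public FileSystemCacheModuleSketch(CacheType cacheType)
    {
        this.cacheType = cacheType;
    }

    @Override
    protected void configure()
    {
        // Each cache type installs its own provider; no optional binder default is needed
        switch (cacheType) {
            case NONE -> bind(CachingHostAddressProvider.class)
                    .to(NoCachingHostAddressProvider.class)
                    .in(Scopes.SINGLETON);
            case ALLUXIO -> bind(CachingHostAddressProvider.class)
                    .to(ConsistentHashingHostAddressProvider.class)
                    .in(Scopes.SINGLETON);
        }
    }

    public static void main(String[] args)
    {
        CachingHostAddressProvider provider = Guice.createInjector(new FileSystemCacheModuleSketch(CacheType.ALLUXIO))
                .getInstance(CachingHostAddressProvider.class);
        System.out.println(provider.getClass().getSimpleName()); // ConsistentHashingHostAddressProvider
    }
}
```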
public Optional<String> getCacheKey(TrinoInputFile delegate)
        throws IOException
{
    return Optional.of(delegate.location().path() + delegate.lastModified());
This should use a separator that won't appear in file names. Otherwise, we could have a collision with a filename ending in a number.
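A small standalone illustration of the collision (the paths and timestamps are made up):

```java
public class CacheKeyCollisionExample
{
    public static void main(String[] args)
    {
        // Without a separator, a path ending in a digit can collide with a different path
        // that has a different modification time.
        String key1 = "s3://bucket/data1" + 23456L;  // file "data1" modified at 23456
        String key2 = "s3://bucket/data" + 123456L;  // file "data" modified at 123456
        System.out.println(key1.equals(key2));       // true -> same cache key for different files
    }
}
```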
public Optional<String> getCacheKey(TrinoInputFile delegate)
{
    // TODO: Consider caching of the Parquet checkpoint files within _delta_log
    if (!delegate.location().path().contains("/_delta_log/")) {
Why do we skip caching this directory?
The _delta_log directory contains the files _last_checkpoint and _trino_meta/extended_stats.json. These are not immutable, so they are tricky to cache. Also, the commit files of the form 000...123.json might not be immutable on ABFS. The checkpoint files should be immutable when accessed by Trino. We decided to leave it as a future optimization in #18719 (comment).
@colebow .. the release notes entry should link to the docs and maybe just say "Add support for filesystem caching". Same for the other incoming PRs for Iceberg, Hive, and Hudi.
Description
👋
This PR integrates Alluxio caching into Trino. It is a reworking of #16375 with:
- optimized-local-scheduling rather than introducing a new concept of node affinity
- lastModifiedTime passed from the coordinator to the workers via ConnectorSplit to allow for immutable caching without workers having to maintain their own file status cache
- TestDeltaLakeMinioAndHmsConnectorSmokeTest into TestCachingDeltaLakeMinioAndHmsConnectorSmokeTest

Additional context and related issues
This is very much #16375 from @beinan reworked after helpful feedback from @raunaqmorarka (thank you! 🙏 ).
I'm putting it out there so we can discuss next steps on integrating this into Trino 🥳 with the following notes:
- optimized-local-scheduling means each connector has to implement split scheduling by specifying an address on the splits it generates. Otherwise, each split will randomly be assigned to a worker node, and the cache won't be distributed.
- lastModifiedTime from the underlying file system, and use the cached file directly if available
- trino-delta-lake, the connector I'm developing this for, but will need to be implemented separately for other connectors to take full advantage of Alluxio
( ) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
(x) Release notes are required, with the following suggested text: