Route splits by affinity key in the scheduler #29182
raunaqmorarka wants to merge 7 commits into master
Conversation
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 29e4195738
📝 Walkthrough
This pull request implements a distributed scheduler optimization that replaces host-address based split placement with cache-aware split affinity routing. The change moves consistent hashing logic from the filesystem caching layer to the core scheduler, introducing a new `ConsistentHashingAddressProvider`.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (4)
plugin/trino-hive/src/test/java/io/trino/plugin/hive/TestHiveSplit.java (1)
77-105: ⚠️ Potential issue | 🟡 Minor
Assert the affinity key in the JSON round-trip.
Line 77 seeds a non-empty affinity key, but the test never verifies it after deserialization. If the JSON property is dropped or deserialized as empty, this test would still pass.
Proposed test assertion
```diff
 assertThat(actual.getBucketConversion()).isEqualTo(expected.getBucketConversion());
 assertThat(actual.isForceLocalScheduling()).isEqualTo(expected.isForceLocalScheduling());
+assertThat(actual.getAffinityKey()).isEqualTo(expected.getAffinityKey());
 assertThat(actual.getAcidInfo().get()).isEqualTo(expected.getAcidInfo().get());
 assertThat(actual.getSplitWeight()).isEqualTo(expected.getSplitWeight());
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@plugin/trino-hive/src/test/java/io/trino/plugin/hive/TestHiveSplit.java` around lines 77 - 105: the test seeds a non-empty affinity key on the HiveSplit but never verifies it after JSON round-trip; add an assertion comparing the deserialized split's affinity key to the original. Specifically, after creating expected and actual (via codec.toJson/fromJson), add an assertion that actual.getAffinityKey().equals(expected.getAffinityKey()) (or use assertThat(actual.getAffinityKey()).isEqualTo(expected.getAffinityKey()) consistent with other assertions) to ensure the affinity key survives serialization/deserialization.

plugin/trino-delta-lake/src/main/java/io/trino/plugin/deltalake/DeltaLakeSplit.java (1)
48-100: ⚠️ Potential issue | 🟠 Major
Include `affinityKey` in retained-size accounting.
The key is stored on coordinator split objects and can contain path-derived strings, so queued split memory is now under-counted.
Proposed fix
```diff
 public long getRetainedSizeInBytes()
 {
     return INSTANCE_SIZE
             + estimatedSizeOf(path)
             + sizeOf(fileRowCount, value -> LONG_INSTANCE_SIZE)
             + sizeOf(deletionVector, DeletionVectorEntry::sizeInBytes)
+            + sizeOf(affinityKey, SizeOf::estimatedSizeOf)
             + splitWeight.getRetainedSizeInBytes()
             + statisticsPredicate.getRetainedSizeInBytes(DeltaLakeColumnHandle::retainedSizeInBytes)
             + estimatedSizeOf(partitionKeys, SizeOf::estimatedSizeOf, value -> sizeOf(value, SizeOf::estimatedSizeOf));
 }
```
Also applies to: 178-188
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@plugin/trino-delta-lake/src/main/java/io/trino/plugin/deltalake/DeltaLakeSplit.java` around lines 48 - 100: the retained-size calculation for DeltaLakeSplit misses the affinityKey field, undercounting memory for coordinator split objects; update the retained size computation (the retainedSize method in DeltaLakeSplit) to account for the Optional<String> affinityKey by adding its retained size when present (include the string object's size plus Optional overhead as you handle other Optional fields like fileRowCount/deletionVector), ensuring affinityKey is included both in the primary retainedSize implementation and the alternative branch referenced around lines 178-188 so queued split memory is correctly tallied.

plugin/trino-hive/src/main/java/io/trino/plugin/hive/HiveSplit.java (1)
54-150: ⚠️ Potential issue | 🟠 Major
Account for `affinityKey` in `getRetainedSizeInBytes()`.
The split now retains an optional path-derived affinity key, but the memory estimate still only includes addresses and other existing fields.
Proposed fix
```diff
                 + schema.getRetainedSizeInBytes()
                 + estimatedSizeOf(partitionKeys, HivePartitionKey::estimatedSizeInBytes)
                 + estimatedSizeOf(addresses, HostAddress::getRetainedSizeInBytes)
+                + sizeOf(affinityKey, value -> estimatedSizeOf(value))
                 + estimatedSizeOf(partitionName)
                 + sizeOf(readBucketNumber)
                 + sizeOf(tableBucketNumber)
```
Also applies to: 280-296
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@plugin/trino-hive/src/main/java/io/trino/plugin/hive/HiveSplit.java` around lines 54 - 150: getRetainedSizeInBytes currently omits the new field affinityKey; update the HiveSplit.getRetainedSizeInBytes() implementation to account for the Optional<String> affinityKey by adding the retained size of the Optional itself and, when present, the retained size of the contained String (use the same retained-size utility/pattern used for addresses/partitionName/partitionKeys in the method). Reference the affinityKey field and the getRetainedSizeInBytes() method so the Optional and its inner String are included in the total memory estimate.

core/trino-main/src/main/java/io/trino/execution/scheduler/ConsistentHashingAddressProvider.java (1)
88-111: ⚠️ Potential issue | 🟡 Minor
Set `lastRefreshTime` only after a successful ring update.
Today `lastRefreshTime` is advanced on line 96, before the `addAll`/`remove` mutations (lines 102, 105). If a mutation throws, the catch on line 108 logs and returns, leaving the ring only partially updated while the timestamp already reflects "recently refreshed" — so the next refresh is suppressed for another `WORKER_NODES_CACHE_TIMEOUT_SECS`. Moving the timestamp assignment to after the mutations (or inside a `finally` only on success) would let a transient failure self-heal on the next call.
🛡️ Proposed fix
```diff
 @VisibleForTesting
 synchronized void refreshHashRing()
 {
     try {
         Set<TrinoNode> trinoNodes = nodeManager.getNodes(NodeState.ACTIVE).stream()
                 .filter(node -> !node.isCoordinator())
                 .map(TrinoNode::of)
                 .collect(toImmutableSet());
-        lastRefreshTime = System.nanoTime();
         Set<TrinoNode> hashRingNodes = consistentHashRing.getNodes();
         Set<TrinoNode> removedNodes = Sets.difference(hashRingNodes, trinoNodes);
         Set<TrinoNode> newNodes = Sets.difference(trinoNodes, hashRingNodes);
         // Avoid acquiring a write lock in consistentHashRing if possible
         if (!newNodes.isEmpty()) {
             consistentHashRing.addAll(newNodes);
         }
         if (!removedNodes.isEmpty()) {
             removedNodes.forEach(consistentHashRing::remove);
         }
+        lastRefreshTime = System.nanoTime();
     }
     catch (Exception e) {
         log.error(e, "Error refreshing hash ring");
     }
 }
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@core/trino-main/src/main/java/io/trino/execution/scheduler/ConsistentHashingAddressProvider.java` around lines 88 - 111, The method refreshHashRing() updates lastRefreshTime too early (before mutations) which can suppress subsequent refreshes if addAll/remove throw; move the assignment of lastRefreshTime so it only executes after consistentHashRing.addAll(newNodes) and removedNodes.forEach(consistentHashRing::remove) complete successfully (or set it inside the try block after those mutations), ensuring lastRefreshTime reflects a successful refresh; reference refreshHashRing(), lastRefreshTime, consistentHashRing, addAll and remove when applying the change.
🧹 Nitpick comments (3)
core/trino-spi/src/main/java/io/trino/spi/connector/ConnectorSplit.java (1)
44-53: Consider strengthening the Javadoc contract.
The PR objective states the contract is "identical keys map to the same worker(s)". The current doc mentions routing of related content but not that equal keys deterministically resolve to the same preferred worker(s) across queries. Worth calling out explicitly so connector authors know how to construct stable keys (e.g., include file path/offset but not per-query state).
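For illustration, a split honoring that contract might build its key purely from stable file coordinates. The `ExampleSplit` type below is a hypothetical sketch, not Trino's actual SPI:

```java
import java.util.Optional;

// Hypothetical split: the affinity key is derived only from stable file
// coordinates (path, start, length), never from per-query state, so the
// same file range produces the same key on every query.
record ExampleSplit(String path, long start, long length, boolean remotelyAccessible)
{
    Optional<String> getAffinityKey()
    {
        if (!remotelyAccessible) {
            // Non-remote splits are already pinned to their own addresses
            return Optional.empty();
        }
        return Optional.of(path + "#" + start + "#" + length);
    }
}
```

Two splits built from the same file range then yield equal keys, which is what lets the engine route them to the same worker.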
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@core/trino-spi/src/main/java/io/trino/spi/connector/ConnectorSplit.java` around lines 44 - 53: update the Javadoc for ConnectorSplit.getAffinityKey() to explicitly state that equal affinity keys must deterministically resolve to the same preferred worker(s) across queries (i.e., identical keys map to the same worker(s)), that only remotely accessible splits may return a key (see isRemotelyAccessible()), and to advise connector implementers to build stable keys (for example using file path and offset rather than per-query state) so affinity remains consistent across runs.

core/trino-main/src/test/java/io/trino/execution/scheduler/faulttolerant/TestArbitraryDistributionSplitAssigner.java (1)
62-62: Add test cases for split affinity key routing.
The `ConsistentHashingAddressProvider` is wired into `ArbitraryDistributionSplitAssigner`, but test splits via `TestingConnectorSplit` return empty affinity keys, so the actual affinity key resolution path in `getNodeRequirements()` (lines 372-374) is never exercised. Consider adding a test case where splits return non-empty affinity keys via `getAffinityKey()` to verify the provider's host resolution logic is correct.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@core/trino-main/src/test/java/io/trino/execution/scheduler/faulttolerant/TestArbitraryDistributionSplitAssigner.java` at line 62: add a unit test that exercises split affinity key routing by creating a TestingConnectorSplit (or a small test implementation of ConnectorSplit) that returns a non-empty affinity key from getAffinityKey(), then pass that split through ArbitraryDistributionSplitAssigner (wired with the existing CONSISTENT_HASHING_ADDRESS_PROVIDER) and assert that getNodeRequirements() resolves hosts via ConsistentHashingAddressProvider as expected; specifically, locate where ArbitraryDistributionSplitAssigner.getNodeRequirements() is invoked in tests and add a case using a split with a populated affinity key to verify the provider's host resolution logic.

plugin/trino-hive/src/main/java/io/trino/plugin/hive/HiveSplitSource.java (1)
307-315: Optional: consider extracting the affinity-key computation for readability.
The nested ternary is correct (and `splitBytes` is deterministic here because `deterministicSplits` is only true when `maxInitialSplits == 0` or `maxInitialSplitSize.equals(maxSplitSize)`), but lifting it into a small helper would make the two strategies easier to scan.
♻️ Possible extraction
```diff
-        Optional<String> affinityKey = internalSplit.isForceLocalScheduling()
-                ? Optional.empty()
-                : deterministicSplits
-                        ? splitAffinityProvider.getKey(internalSplit.getPath(), internalSplit.getStart(), splitBytes)
-                        : splitAffinityProvider.getKey(internalSplit.getPath(), 0, internalSplit.getEstimatedFileSize());
+        Optional<String> affinityKey = computeAffinityKey(internalSplit, splitBytes);
```

```java
private Optional<String> computeAffinityKey(InternalHiveSplit internalSplit, long splitBytes)
{
    if (internalSplit.isForceLocalScheduling()) {
        return Optional.empty();
    }
    if (deterministicSplits) {
        return splitAffinityProvider.getKey(internalSplit.getPath(), internalSplit.getStart(), splitBytes);
    }
    // Use a file-wide key so all splits of the same file land on the same worker and reuse cached content
    return splitAffinityProvider.getKey(internalSplit.getPath(), 0, internalSplit.getEstimatedFileSize());
}
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@plugin/trino-hive/src/main/java/io/trino/plugin/hive/HiveSplitSource.java` around lines 307 - 315, Extract the nested ternary that computes the affinityKey into a small helper method (e.g., computeAffinityKey) to improve readability: move the logic that checks internalSplit.isForceLocalScheduling(), deterministicSplits, and calls splitAffinityProvider.getKey(internalSplit.getPath(), ..., ...) into that helper and return Optional.empty() for force-local, the path/start/splitBytes variant when deterministicSplits is true, otherwise the file-wide key using start=0 and internalSplit.getEstimatedFileSize(); then replace the existing ternary with a single call to computeAffinityKey(internalSplit, splitBytes) inside HiveSplitSource.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In
`@core/trino-main/src/main/java/io/trino/execution/scheduler/ConsistentHashingAddressProvider.java`:
- Around line 53-74: The HashRing is mutated in refreshHashRing() while
getHosts() reads it via consistentHashRing.locate(...), causing a race; replace
this with a thread-safe pattern: make the ring a mutable holder (e.g.,
AtomicReference<ConsistentHash<TrinoNode>> ringRef) or use a final lock object
and synchronize both refreshHashRing() and getHosts() around accesses to the
ring. If using AtomicReference, construct a new HashRing instance in
refreshHashRing(), populate it (addAll/remove) off-thread, then atomically swap
ringRef.set(newRing) and have getHosts() call ringRef.get().locate(cacheKey,
preferredHostsCount); if using synchronization, guard all references to
consistentHashRing (both locate and mutation in refreshHashRing) with the same
lock to eliminate concurrent read/write on the non-thread-safe HashRing.
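The `AtomicReference` swap pattern suggested above can be sketched as follows; `SwappableRing` and its `Ring` stand-in are hypothetical types for illustration, not the actual HashRing from the library:

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.atomic.AtomicReference;

// Sketch of the safe-publication pattern: the refresher builds a complete
// replacement ring off to the side, then publishes it with one atomic set().
// Readers only ever see a fully constructed snapshot.
class SwappableRing
{
    // Stand-in for the consistent-hash ring; never mutated after publication.
    record Ring(TreeMap<Long, String> positions)
    {
        String locate(long hash)
        {
            Map.Entry<Long, String> entry = positions.ceilingEntry(hash);
            return entry != null ? entry.getValue() : positions.firstEntry().getValue();
        }
    }

    private final AtomicReference<Ring> ringRef = new AtomicReference<>(new Ring(new TreeMap<>()));

    // Refresher: build fully, then swap atomically — no partially updated state.
    void refresh(List<String> workers)
    {
        TreeMap<Long, String> positions = new TreeMap<>();
        for (String worker : workers) {
            positions.put((long) worker.hashCode(), worker);
        }
        ringRef.set(new Ring(positions));
    }

    // Reader path: grab the current snapshot and locate within it.
    String locate(long hash)
    {
        return ringRef.get().locate(hash);
    }
}
```

The alternative, synchronizing both paths on one lock, trades the extra allocation per refresh for contention on every `locate` call.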
📒 Files selected for processing (52)
- core/trino-main/pom.xml
- core/trino-main/src/main/java/io/trino/execution/scheduler/ConsistentHashingAddressProvider.java
- core/trino-main/src/main/java/io/trino/execution/scheduler/ConsistentHashingAddressProviderConfig.java
- core/trino-main/src/main/java/io/trino/execution/scheduler/TopologyAwareNodeSelector.java
- core/trino-main/src/main/java/io/trino/execution/scheduler/TopologyAwareNodeSelectorFactory.java
- core/trino-main/src/main/java/io/trino/execution/scheduler/UniformNodeSelector.java
- core/trino-main/src/main/java/io/trino/execution/scheduler/UniformNodeSelectorFactory.java
- core/trino-main/src/main/java/io/trino/execution/scheduler/faulttolerant/ArbitraryDistributionSplitAssigner.java
- core/trino-main/src/main/java/io/trino/execution/scheduler/faulttolerant/EventDrivenTaskSourceFactory.java
- core/trino-main/src/main/java/io/trino/metadata/Split.java
- core/trino-main/src/main/java/io/trino/server/CoordinatorModule.java
- core/trino-main/src/main/java/io/trino/testing/PlanTester.java
- core/trino-main/src/test/java/io/trino/execution/BenchmarkNodeScheduler.java
- core/trino-main/src/test/java/io/trino/execution/TestNodeScheduler.java
- core/trino-main/src/test/java/io/trino/execution/scheduler/TestMultiSourcePartitionedScheduler.java
- core/trino-main/src/test/java/io/trino/execution/scheduler/TestScaledWriterScheduler.java
- core/trino-main/src/test/java/io/trino/execution/scheduler/TestSourcePartitionedScheduler.java
- core/trino-main/src/test/java/io/trino/execution/scheduler/TestUniformNodeSelector.java
- core/trino-main/src/test/java/io/trino/execution/scheduler/faulttolerant/TestArbitraryDistributionSplitAssigner.java
- core/trino-spi/src/main/java/io/trino/spi/connector/ConnectorSplit.java
- lib/trino-filesystem-cache-alluxio/src/main/java/io/trino/filesystem/alluxio/AlluxioFileSystemCacheModule.java
- lib/trino-filesystem-manager/src/main/java/io/trino/filesystem/manager/FileSystemModule.java
- lib/trino-filesystem/pom.xml
- lib/trino-filesystem/src/main/java/io/trino/filesystem/cache/CacheSplitAffinityProvider.java
- lib/trino-filesystem/src/main/java/io/trino/filesystem/cache/CachingHostAddressProvider.java
- lib/trino-filesystem/src/main/java/io/trino/filesystem/cache/NoopSplitAffinityProvider.java
- lib/trino-filesystem/src/main/java/io/trino/filesystem/cache/SplitAffinityProvider.java
- lib/trino-filesystem/src/test/java/io/trino/filesystem/cache/TestConsistentHashingCacheHostAddressProvider.java
- plugin/trino-delta-lake/src/main/java/io/trino/plugin/deltalake/DeltaLakeSplit.java
- plugin/trino-delta-lake/src/main/java/io/trino/plugin/deltalake/DeltaLakeSplitManager.java
- plugin/trino-delta-lake/src/test/java/io/trino/plugin/deltalake/TestDeltaLakeMetadata.java
- plugin/trino-delta-lake/src/test/java/io/trino/plugin/deltalake/TestDeltaLakeSplitManager.java
- plugin/trino-hive/src/main/java/io/trino/plugin/hive/HiveSplit.java
- plugin/trino-hive/src/main/java/io/trino/plugin/hive/HiveSplitManager.java
- plugin/trino-hive/src/main/java/io/trino/plugin/hive/HiveSplitSource.java
- plugin/trino-hive/src/test/java/io/trino/plugin/hive/TestBackgroundHiveSplitLoader.java
- plugin/trino-hive/src/test/java/io/trino/plugin/hive/TestHivePageSink.java
- plugin/trino-hive/src/test/java/io/trino/plugin/hive/TestHiveSplit.java
- plugin/trino-hive/src/test/java/io/trino/plugin/hive/TestHiveSplitSource.java
- plugin/trino-hive/src/test/java/io/trino/plugin/hive/TestNodeLocalDynamicSplitPruning.java
- plugin/trino-hudi/src/main/java/io/trino/plugin/hudi/HudiSplit.java
- plugin/trino-hudi/src/main/java/io/trino/plugin/hudi/HudiSplitManager.java
- plugin/trino-hudi/src/main/java/io/trino/plugin/hudi/HudiSplitSource.java
- plugin/trino-hudi/src/main/java/io/trino/plugin/hudi/split/HudiBackgroundSplitLoader.java
- plugin/trino-hudi/src/main/java/io/trino/plugin/hudi/split/HudiSplitFactory.java
- plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergBucketFunction.java
- plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergPageSourceProvider.java
- plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergSplit.java
- plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergSplitManager.java
- plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergSplitSource.java
- plugin/trino-iceberg/src/test/java/io/trino/plugin/iceberg/TestIcebergNodeLocalDynamicSplitPruning.java
- plugin/trino-iceberg/src/test/java/io/trino/plugin/iceberg/TestIcebergSplitSource.java
💤 Files with no reviewable changes (3)
- lib/trino-filesystem/pom.xml
- lib/trino-filesystem/src/main/java/io/trino/filesystem/cache/CachingHostAddressProvider.java
- lib/trino-filesystem/src/test/java/io/trino/filesystem/cache/TestConsistentHashingCacheHostAddressProvider.java
You're effectively hard-wiring all connectors to a single consistent-hash strategy whenever …
The old …
Once you roll back to use …
If …
chenjian2664
left a comment
Please rebase and fix the CI failures.
Add an optional affinity key to ConnectorSplit. When a split returns a key, the engine maps it to preferred worker addresses so splits reading related content land on the same worker across queries. When empty, scheduling keeps using split.getAddresses(). Only remotely accessible splits may carry an affinity key; the engine's Split wrapper enforces this, since a non-remote split is already pinned to its own addresses.
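The mapping this commit describes (identical affinity keys resolve to the same preferred workers, with minimal churn when workers join or leave) can be illustrated with a toy consistent-hash ring. This is a sketch using CRC32 and a plain `TreeMap`, not the allgood-consistent-hash library the engine actually uses:

```java
import java.nio.charset.StandardCharsets;
import java.util.SortedMap;
import java.util.TreeMap;
import java.util.zip.CRC32;

// Toy consistent-hash ring: each worker occupies many virtual positions on a
// ring of hash values; a key is routed to the first worker position at or
// after the key's own hash (wrapping around at the end of the ring).
class HashRing
{
    private static final int VIRTUAL_NODES = 100;
    private final TreeMap<Long, String> ring = new TreeMap<>();

    void add(String worker)
    {
        for (int i = 0; i < VIRTUAL_NODES; i++) {
            ring.put(hash(worker + "#" + i), worker);
        }
    }

    void remove(String worker)
    {
        for (int i = 0; i < VIRTUAL_NODES; i++) {
            ring.remove(hash(worker + "#" + i));
        }
    }

    String locate(String affinityKey)
    {
        SortedMap<Long, String> tail = ring.tailMap(hash(affinityKey));
        // Wrap around to the start of the ring when no position follows the key
        return tail.isEmpty() ? ring.firstEntry().getValue() : tail.get(tail.firstKey());
    }

    private static long hash(String value)
    {
        CRC32 crc = new CRC32();
        crc.update(value.getBytes(StandardCharsets.UTF_8));
        return crc.getValue();
    }
}
```

Removing one worker only remaps the keys that hashed to its virtual positions; all other keys keep their placement, which is what preserves worker-local cache hit rates across cluster resizes.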
Populate HiveSplit's affinity key when filesystem caching is enabled so splits reading the same file reuse worker-local cache. Hive no longer injects CachingHostAddressProvider; it receives a SplitAffinityProvider configured by FileSystemModule. HDFS block locations still flow through HiveSplit.getAddresses(), preserving the forceLocalScheduling path for clusters that pin splits to specific hosts.
Populate DeltaLakeSplit's affinity key when filesystem caching is enabled so splits reading the same file reuse worker-local cache. Delta no longer injects CachingHostAddressProvider. Drop the unused addresses field; Delta never emitted HDFS block locations or used the forceLocalScheduling path.
Populate IcebergSplit's affinity key when filesystem caching is enabled so splits reading the same file reuse worker-local cache. Iceberg no longer injects CachingHostAddressProvider. Drop the unused addresses field; Iceberg never emitted HDFS block locations or used the forceLocalScheduling path. As the last connector to rely on the per-catalog CachingHostAddressProvider, this commit also removes the interface and its implementations from lib/trino-filesystem, drops the allgood-consistent-hash dependency there (the engine now owns it), and removes the now-unused bindings in FileSystemModule and AlluxioFileSystemCacheModule.
Populate HudiSplit's affinity key when filesystem caching is enabled so splits reading the same file reuse worker-local cache. This is net-new cache-aware scheduling for Hudi; no prior placement mechanism existed. HudiSplitFactory and its loader chain receive SplitAffinityProvider so split creation can consult it for each file range.
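The provider wiring these commits describe can be sketched roughly as follows. The interface name and the `getKey(path, offset, length)` shape come from the PR text, but the signatures and implementations here are illustrative, not the exact classes added in `lib/trino-filesystem`:

```java
import java.util.Optional;

// Sketch of the provider shape: connectors ask it for a key per file range
// and put the result on the split; they never compute placement themselves.
interface SplitAffinityProvider
{
    Optional<String> getKey(String path, long offset, long length);
}

// No-op default: empty keys when filesystem caching is disabled, so the
// scheduler falls back to split.getAddresses().
class NoopSplitAffinityProvider
        implements SplitAffinityProvider
{
    @Override
    public Optional<String> getKey(String path, long offset, long length)
    {
        return Optional.empty();
    }
}

// Cache-enabled variant: a deterministic key per file range, so splits
// reading the same range land on the same worker across queries.
class CacheSplitAffinityProvider
        implements SplitAffinityProvider
{
    @Override
    public Optional<String> getKey(String path, long offset, long length)
    {
        return Optional.of(path + "#" + offset + "#" + length);
    }
}
```

With this split, a connector's split factory only depends on the interface; which implementation is bound is decided once, by the filesystem module on the coordinator.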
Description
Move split-to-worker placement for cache-aware scheduling out of connectors and into the engine, behind a new `ConnectorSplit#getAffinityKey()` SPI. Connectors return an opaque key; the engine maps it to preferred workers via a single consistent-hash ring.

Why
Today, Hive, Delta, and Iceberg each inject a per-catalog `CachingHostAddressProvider` and compute split→worker placement themselves. This puts scheduling logic in connectors, duplicates a hash-ring + refresher thread per catalog over the same worker set, and leaves Hudi out because wiring it requires copying the pattern. The name `CacheKey` also couples the API to filesystem caching, but the underlying mechanism — "splits with the same key run on the same worker across queries" — is a general node-affinity primitive.

Approach
- `ConnectorSplit#getAffinityKey(): Optional<String>`. Contract: same key ⇒ same worker(s); only remotely accessible splits may supply a key. The engine's `Split` wrapper enforces the latter.
- `ConsistentHashingAddressProvider` bound in `CoordinatorModule`, lazily refreshed on access (no background executor). Scheduler call sites consult the key inline; when empty, `split.getAddresses()` is used unchanged.
- `SplitAffinityProvider` interface with a no-op default and a cache-enabled impl bound on the coordinator when `fs.cache.enabled=true`.
- Connectors receive `SplitAffinityProvider` and call `getKey(path, offset, length)` during split creation. Hive keeps its `forceLocalScheduling` + HDFS block-location path.
- Remove `CachingHostAddressProvider` + its impls + its test, drop the `allgood-consistent-hash` dep from `lib/trino-filesystem`, remove the old Guice bindings.

Additional context and related issues
Suggested commit-order review:
- `SplitAffinityProvider` and its bindings.

Release notes
(x) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
( ) Release notes are required, with the following suggested text: