[CORE-4268] Spill key optimizations #24829
Conversation
This is a refactor of the inner loop in spill_key_index::add_key into a coroutine which will be modified in subsequent commits. Signed-off-by: Noah Watkins <[email protected]>
This is effectively a refactor commit which splits the spill(key) interface into separate functions: Prior to this commit the spill key index had the spill(key) interface which would accept a single key. This commit introduces a spill_payload type which can be used to collect many spilled keys and then spill them all at once with a spill(payload) interface. Signed-off-by: Noah Watkins <[email protected]>
I don't want it using instance state because I'm going to use it for estimates without actually changing state. Signed-off-by: Noah Watkins <[email protected]>
When keys must be spilled, collect them all in a buffer and spill them all at once to avoid per-key co_await invocations and calls into the seastar IO subsystem. Signed-off-by: Noah Watkins <[email protected]>
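A minimal standalone sketch of the batching idea described in this commit, with hypothetical names (spill_payload, evict_to, spill) and std::unordered_map standing in for the real index type: evicted keys are staged in a payload and handed to a single spill call, instead of one write (and one co_await) per key.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical payload type: keys evicted from the index are staged here and
// written out in one batch rather than one at a time.
struct spill_payload {
    std::vector<std::string> keys;
    std::size_t bytes = 0;
};

class key_index {
public:
    // Evict entries until the index shrinks to the target size, staging the
    // evicted keys instead of spilling each one individually.
    void evict_to(std::size_t target_size) {
        spill_payload payload;
        while (index_.size() > target_size) {
            auto node = index_.extract(index_.begin());
            payload.bytes += node.key().size();
            payload.keys.push_back(std::move(node.key()));
        }
        spill(payload); // one write (one suspension point in coroutine form)
    }

private:
    // Stand-in for appending the batch to the compaction key file.
    void spill(const spill_payload& payload) { (void)payload; }

    std::unordered_map<std::string, long> index_; // stand-in for the real map
};
```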
CI test results
- test results on build#60805
- test results on build#60823
- test results on build#60847
Try to start spilling from a dynamic location in the index based on the last inserted key as a seed. The actual removal starts after the last inserted key, which will map to a random key and (hopefully) not a key that has temporal locality with recently inserted items. If an index receives a high number of duplicates then the removal should be no worse than the previous repeated removal from begin(). Signed-off-by: Noah Watkins <[email protected]>
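A rough illustration of the seeded eviction point from this commit, again with assumed names and std::unordered_map standing in for the actual map: look up the last inserted key, skip past it, and start removing from whatever entry the iterator lands on, falling back to begin().

```cpp
#include <string>
#include <unordered_map>
#include <utility>

class key_index {
public:
    void insert(std::string key, long value) {
        last_inserted_ = key;
        index_[std::move(key)] = value;
    }

    // Choose where eviction starts: the entry immediately after the last
    // inserted key (presumed hot, so it is skipped), falling back to begin()
    // if that key is no longer present or the iterator runs off the end.
    std::unordered_map<std::string, long>::iterator eviction_point() {
        auto it = index_.find(last_inserted_);
        if (it != index_.end()) {
            ++it; // skip the last inserted key itself
        }
        if (it == index_.end()) {
            it = index_.begin();
        }
        return it;
    }

private:
    std::string last_inserted_;
    std::unordered_map<std::string, long> index_;
};
```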
src/v/storage/spill_key_index.cc (outdated)
* Evict first entry, we use hash function that guarante good
* randomness so evicting first entry is actually evicting a
* pseudo random elemnent
as Travis pointed out in the previous PR, this isn't necessarily true. It's just copied here verbatim in the first refactor commit.
*
* In order to avoid creating hot erase spots we start removal at the
* location immediately after the last inserted key. The exact last inserted | ||
* key is skipped to maintain good hit locality, and the next key should not |
The exact last inserted key is skipped to maintain good hit locality
This reads a bit weird to me looking through this again. To clarify, what I mean is that we shouldn't remove the last inserted key for fear of destroying workloads that repeat the same key.
It is a bit hard to evaluate the constant factor stuff (everything except the change to how the elements to spill are selected) without a benchmark, but assuming we know the O(n^2)
* introduced to reduce the chance of a reactor stall.
*
* In order to avoid creating hot erase spots we start removal at the
* location immediately after the last inserted key. The exact last inserted |
Assuming that some keys are more common than others, won't this have the same kind of problem as begin(), though to a lesser extent unless it's just 1 common key? Each common key acts as a deletion point, and skipping to the next key doesn't help with that.
Ahh, the comment is sloppy, I'll update it. There are two things at play here:
- With respect to the phrase start removal at the location immediately after the last inserted key: for a workload where there is a strong bias towards a set of keys, we'd like to avoid evicting them because the longer they stay in the index the more they can help deduplicate. In this PR we don't track information to help tell us if a key is hot, so we assume the last inserted key is hot, and that the next key in the index that the iterator points to is independent / uncorrelated.
- With respect to the phrase in order to avoid creating hot erase spots: to the extent that we spill from a specific point in the index and that repeatedly starting at a specific point (e.g. begin()) is what results in the hot spot, the PR attempts to move that hot spot around by starting at the last inserted key. I don't yet know if this is sufficient for the abseil data structure. It could be that we need to spread the eviction out over a much larger area of the key space (e.g. your proposal of randomly sampling from a mirrored vector of keys). For this it may be sufficient to merely track the last N unique inserted keys as an enhancement.
It would be interesting to see comparison benchmarks for various workloads (though that's a larger ask than I think is required of the PR). Is it correct to say that it could still trend towards quadratic (though it is less likely to in the general case) for certain workloads?
The assumption in the approach is that when it comes time to spill, the last added key isn't the same as it was the last time we spilled. That isn't guaranteed, of course.
@dotnwat - yeah, I understood the comment, but I guess my comment is that if, say, 1 key is very hot then the performance is the same as before, right? We just moved the (single) hotspot from begin() to "right after the hot key".
If there are 10 hot keys, I guess the problem is 10 times better since each hotspot will be 1/10th as long. Maybe that's enough, but I would lean towards doing it right if possible because if this bites us even just once out in the wild it will cost a lot. However, maybe doing it right is too costly here.
1 key is very hot then the performance is the same as before, right?
correct.
If there are 10 hot keys, I guess the problem is 10 times better since each hotspot will be 1/10th as long. Maybe that's enough, but I would lean towards doing it right
I think then the best minimal enhancement here would be to keep a memory of more than one key for spreading the load (a rough sketch of this follows below).
I would lean towards doing it right
Curious if you were implying something specific here, or rather, doing it right as in making sure we have a resilient solution?
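A rough sketch of that minimal enhancement, with hypothetical names: remember the last N inserted keys in a small ring buffer and pick one at random as the eviction seed, so a single hot key no longer pins the eviction point to one spot.

```cpp
#include <array>
#include <cstddef>
#include <random>
#include <string>

class eviction_seeds {
public:
    void record_insert(const std::string& key) {
        seeds_[next_ % seeds_.size()] = key;
        ++next_;
    }

    // Returns one of the last N inserted keys, chosen uniformly at random.
    const std::string& pick(std::mt19937& rng) const {
        if (next_ == 0) {
            return seeds_[0]; // nothing recorded yet; empty key
        }
        std::size_t filled = next_ < seeds_.size() ? next_ : seeds_.size();
        std::uniform_int_distribution<std::size_t> dist(0, filled - 1);
        return seeds_[dist(rng)];
    }

private:
    static constexpr std::size_t N = 8; // how many recent keys to remember
    std::array<std::string, N> seeds_;
    std::size_t next_ = 0;
};
```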
curious if you were implying something specific here, or rather, doing it right as in making sure we have a resilient solution?
Yeah, I meant doing it in a way that is true random sampling or quite close to it, since it's hard to actually assess how this one will work in practice, and it has some cases where it will fall into the same O(n^2) pit as the last one.
Examples would include:
- Using an associated data structure to allow selecting random keys (like a vector; see the sketch below)
- Digging into the internals of the hashmap to do random selection. E.g., you could select randomly from an open-addressed hash map by selecting a random bucket and removing the element found there (trying again if it was empty).
However, the first sounds expensive in memory, and the second sounds expensive in effort and risk (changing the hash map, or the unsavory approach of digging into private internals). That's why I am not actually pushing either of those.
So I am OK with the current approach as at least being better than the status quo.
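For reference, a standalone sketch of the first alternative (the mirrored vector), which isn't being pushed for here because of the memory cost: keep a vector of keys alongside the map so a uniformly random victim can be removed with swap-and-pop. Names and the use of std::unordered_map are assumptions for illustration.

```cpp
#include <cstddef>
#include <random>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

class random_evicting_index {
public:
    void insert(const std::string& key, long value) {
        auto [it, inserted] = index_.try_emplace(key, value);
        if (inserted) {
            keys_.push_back(key); // mirror the new key for random selection
        } else {
            it->second = value;
        }
    }

    // Remove one uniformly random entry: pick a random slot in the mirror
    // vector, erase that key from the map, then swap-and-pop the vector.
    void evict_one(std::mt19937& rng) {
        if (keys_.empty()) {
            return;
        }
        std::uniform_int_distribution<std::size_t> dist(0, keys_.size() - 1);
        std::size_t i = dist(rng);
        index_.erase(keys_[i]);
        std::swap(keys_[i], keys_.back());
        keys_.pop_back();
    }

private:
    std::unordered_map<std::string, long> index_;
    std::vector<std::string> keys_; // mirror of the map's keys
};
```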
Can we apply some similar changes in this PR to spill_key_index::drain_all_keys()? There's also a loop here with _midx.extract(_midx.begin()) present (and a scheduling point per key).
Yeah, probably a good idea. I'll try to add a simple benchmark.
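A rough sketch of how the drain_all_keys() loop mentioned above could be batched (hypothetical names, std::unordered_map as a stand-in): extract entries in fixed-size chunks and take one write/scheduling point per chunk rather than per key.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

class key_index {
public:
    // Drain the whole index, yielding once per chunk instead of once per key.
    void drain_all() {
        constexpr std::size_t chunk_size = 128; // assumed batch size
        std::vector<std::pair<std::string, long>> chunk;
        chunk.reserve(chunk_size);
        while (!index_.empty()) {
            chunk.clear();
            while (!index_.empty() && chunk.size() < chunk_size) {
                auto node = index_.extract(index_.begin());
                chunk.emplace_back(std::move(node.key()), node.mapped());
            }
            drain_chunk(chunk); // one write / suspension point per chunk
        }
    }

private:
    // Stand-in for writing a batch of drained entries.
    void drain_chunk(const std::vector<std::pair<std::string, long>>& chunk) {
        (void)chunk;
    }

    std::unordered_map<std::string, long> index_;
};
```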
This PR attempts to implement two suggestions by @travisdowns related to the spilling of compaction keys to disk. Both are taken from this discussion on @StephanDollberg's PR https://github.com/redpanda-data/redpanda/pull/19836/files#r1646331859.
The first suggestion is implemented in this PR: we choose a starting point in the index, and then build up the buffer to be spilled to disk without a suspension point per key.
The second is approximated, not by generating a random sequence of keys to spill but rather by choosing a random location in the index from which to start spilling keys. That is, something other than the current fixed location begin(). So we may still get hot spots, but we try to move the hot spot around. The approximation isn't particularly robust: we use the last inserted key as the "random" location, so a workload that skews heavily towards a few keys would have a few hot spots. Still, perhaps better than always using begin().
As @StephanDollberg pointed out, we don't currently have any great benchmarking tools for compaction so I can't provide any concrete numbers here. I think that's something we may be able to add, but it kinda sounds like we were ready to move ahead with the previous PR attempt sans benchmarks?
Backports Required
Release Notes