
[CORE-4268] Spill key optimizations #24829

Merged: 5 commits merged into redpanda-data:dev on Jan 24, 2025

Conversation

@dotnwat (Member) commented Jan 15, 2025

This PR attempts to implement two suggestions by @travisdowns related to the spilling of compaction keys to disk. Both are taken from this discussion on @StephanDollberg's PR https://github.com/redpanda-data/redpanda/pull/19836/files#r1646331859.

What if we just extract all the keys first, then loop over all of them and spill them? I guess the stopping condition would have to be somewhat different because currently we rely on release_entry_memory to be called once per extraction to update the memory usage estimate.

This one is implemented in this PR. We choose a starting point in the index, and then build up the buffer to be spilled to disk without a suspension point per key.

Other option is just to maintain a separate vector-ish of compaction_keys in _midx: you can select randomly from this vector easily and erase that element from both. This solves the bias problem and probably ends up cheaper anyway because right now even the occasional begin() call is going to be very expensive over time as the map gets highly warped (all elements condensed on the right hand side of the bucket space). Of course, we won't want to duplicate the key itself.

This one is approximated not by generating a random sequence of keys to spill but rather by choosing a random location in the index from which to start spilling keys. That is, something other than the current fixed location begin(). So we may still get hot spots, but we try to move the hot spot around.

The approximation isn't particularly robust. We use the last inserted key as the "random" location. So a workload that skewed heavily towards a few keys would have a few hot spots. Still, perhaps better than always using begin().

As @StephanDollberg pointed out, we don't currently have any great benchmarking tools for compaction so I can't provide any concrete numbers here. I think that's something we may be able to add but it kinda sounds like we were ready to move ahead with the previous PR attempt sans benchmarks?

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v24.3.x
  • v24.2.x
  • v24.1.x

Release Notes

  • none

This is a refactor of the inner loop in spill_key_index::add_key into a
coroutine, which will be modified in subsequent commits.

Signed-off-by: Noah Watkins <[email protected]>
This is effectively a refactor commit which splits the spill(key)
interface into separate functions.

Prior to this commit the spill key index had the spill(key) interface
which would accept a single key. This commit introduces a spill_payload
type which can be used to collect many spilled keys and then spill them
all at once with a spill(payload) interface.

Signed-off-by: Noah Watkins <[email protected]>
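A minimal sketch of what the spill_payload accumulation could look like; the field names and the flat record encoding here are assumptions for illustration, not the PR's actual layout. Keys are appended in memory only, and the write to disk happens later, once per payload.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical flat encoding per record: [u32 key length][key bytes][i64 offset].
struct spill_payload {
    std::vector<uint8_t> data; // serialized records, written to disk in one go
    size_t keys{0};            // number of records appended so far

    void append(const std::string& key, int64_t offset) {
        auto put = [this](const void* p, size_t n) {
            auto b = static_cast<const uint8_t*>(p);
            data.insert(data.end(), b, b + n);
        };
        auto len = static_cast<uint32_t>(key.size());
        put(&len, sizeof(len));
        put(key.data(), len);
        put(&offset, sizeof(offset));
        ++keys;
    }
};
```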
I don't want it using instance state because I'm going to use it for
estimates without actually changing state.

Signed-off-by: Noah Watkins <[email protected]>
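A tiny illustration of that separation, with invented names: a pure size estimate that can be consulted while only planning how much to spill, kept apart from the member that actually mutates the tracked memory usage.

```cpp
#include <cstddef>
#include <string>

// Pure estimate: safe to call repeatedly while only deciding what to spill.
inline size_t entry_memory_estimate(const std::string& key, size_t value_size) {
    return key.size() + value_size;
}

struct memory_tracker {
    size_t used{0};

    // The state change happens only once an entry is actually released.
    void release(const std::string& key, size_t value_size) {
        used -= entry_memory_estimate(key, value_size);
    }
};
```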
When keys must be spilled, collect them all in a buffer and spill them
all at once to avoid per-key co_await invocations and calls into the
seastar IO subsystem.

Signed-off-by: Noah Watkins <[email protected]>
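A rough sketch of the batched loop described in this commit, under assumed names (std::map stands in for the real index and write_batch for the actual spill write). The key point is that draining the index is synchronous, and the only suspension point is the single write of the accumulated batch.

```cpp
#include <seastar/core/coroutine.hh>
#include <seastar/core/future.hh>

#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

struct spiller {
    std::map<std::string, long> _midx; // stand-in for the in-memory key index

    static size_t entry_size(const std::pair<const std::string, long>& e) {
        return e.first.size() + sizeof(e.second);
    }

    // Stub standing in for the single write of the accumulated payload.
    seastar::future<> write_batch(std::vector<std::pair<std::string, long>>) {
        co_return;
    }

    seastar::future<> spill_some(size_t target_bytes) {
        std::vector<std::pair<std::string, long>> batch;
        size_t released = 0;
        for (auto it = _midx.begin();
             it != _midx.end() && released < target_bytes;) {
            released += entry_size(*it);
            batch.emplace_back(it->first, it->second);
            it = _midx.erase(it); // synchronous: no per-key co_await
        }
        co_await write_batch(std::move(batch)); // single suspension point
    }
};
```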
@dotnwat changed the title from "Core 4268" to "[CORE-4268] Spill key optimizations" on Jan 15, 2025
@vbotbuildovich (Collaborator) commented Jan 16, 2025

CI test results

test results on build#60805
test_id test_kind job_url test_status passed
storage_single_thread_rpunit.storage_single_thread_rpunit unit https://buildkite.com/redpanda/redpanda/builds/60805#01946c19-7568-4334-89f1-cf0537e5edf3 FAIL 0/2
storage_single_thread_rpunit.storage_single_thread_rpunit unit https://buildkite.com/redpanda/redpanda/builds/60805#01946c19-7569-4357-8853-5ca2f41de72f FAIL 0/2
test results on build#60823
test_id test_kind job_url test_status passed
gtest_storage_e2e_rpfixture.gtest_storage_e2e_rpfixture unit https://buildkite.com/redpanda/redpanda/builds/60823#01946da6-23ef-4e88-bec5-7a82f74f4d7d FLAKY 1/2
test results on build#60847
test_id test_kind job_url test_status passed
rptest.tests.compaction_recovery_test.CompactionRecoveryUpgradeTest.test_index_recovery_after_upgrade ducktape https://buildkite.com/redpanda/redpanda/builds/60847#0194700c-1ac2-4993-9205-57d60d158125 FLAKY 5/6
rptest.tests.partition_reassignments_test.PartitionReassignmentsTest.test_reassignments_kafka_cli ducktape https://buildkite.com/redpanda/redpanda/builds/60847#01947011-26bb-41bb-ad4c-505d4aa1f451 FLAKY 2/6

Try to start spilling from a dynamic location in the index, using the last
inserted key as a seed. The actual removal starts after the last inserted
key, which will (hopefully) map to a random key with no temporal locality
to recently inserted items.

If an index receives a high number of duplicates then the removal should
be no worse than the previous repeated removal from begin().

Signed-off-by: Noah Watkins <[email protected]>
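A sketch of the start-point selection this commit describes, again with hypothetical names and std::map standing in for the real index.

```cpp
#include <map>
#include <string>

struct spill_start_sketch {
    std::map<std::string, long> _midx;  // stand-in for the real index
    std::string _last_inserted_key;     // remembered on every insertion

    // Start just past the last inserted key (assumed hot, so it is kept),
    // wrapping around to begin() when that lands on end().
    std::map<std::string, long>::iterator spill_start() {
        auto it = _midx.find(_last_inserted_key);
        if (it == _midx.end()) {
            return _midx.begin();       // seed key already evicted
        }
        ++it;                           // skip the presumed-hot key itself
        return it == _midx.end() ? _midx.begin() : it;
    }
};
```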
@WillemKauf (Contributor) commented:
🔥

Comment on lines 109 to 111
* Evict first entry, we use hash function that guarante good
* randomness so evicting first entry is actually evicting a
* pseudo random elemnent
Member Author

As Travis pointed out in the previous PR, this isn't necessarily true. It's just copied here verbatim in the first refactor commit.

*
* In order to avoid creating hot erase spots we start removal at the
* location immediately after the last inserted key. The exact last inserted
* key is skipped to maintain good hit locality, and the next key should not
Member Author

The exact last inserted key is skipped to maintain good hit locality

This reads a bit oddly to me looking through it again. To clarify, what I mean is that we shouldn't remove the last inserted key, for fear of hurting deduplication for workloads that repeat the same key.

@travisdowns (Member) commented Jan 18, 2025

It is a bit hard to evaluate the constant factor stuff (everything except the change to how the elements to spill are selected) without a benchmark, but assuming we know the O(n^2) "delete at begin()" part is the main problem, then I guess we can analyze that part.

* introduced to reduce the chance of a reactor stall.
*
* In order to avoid creating hot erase spots we start removal at the
* location immediately after the last inserted key. The exact last inserted
Member

Assuming that some keys are more common than others, won't this have the same kind of problem as begin(), though to a lesser extent unless it's just 1 common key? Each common key acts as a deletion point, and skipping to the next key doesn't help with that.

Member Author

Ahh, the comment is sloppy, I'll update it. There are two things at play here:

  1. With respect to the phrase start removal at the location immediately after the last inserted key: for a workload with a strong bias towards a set of keys, we'd like to avoid evicting them, because the longer they stay in the index the more they can help deduplicate. In this PR we don't track information to tell us whether a key is hot, so we assume the last inserted key is hot and that the next key in the index that the iterator points to is independent / uncorrelated.
  2. With respect to the phrase in order to avoid creating hot erase spots: to the extent that we spill from a specific point in the index, and that repeatedly starting at a specific point (e.g. begin()) is what creates the hot spot, the PR attempts to move that hot spot around by starting at the last inserted key. I don't yet know if this is sufficient for the abseil data structure. It could be that we need to spread the eviction out over a much larger area of the key space (e.g. your proposal of randomly sampling from a mirrored vector of keys). For that, it may be sufficient to merely track the last N unique inserted keys as an enhancement (a rough sketch follows).
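Purely for illustration, a rough sketch of the last-N idea from point 2; this is not something the PR implements, and all names are made up.

```cpp
#include <array>
#include <cstddef>
#include <string>

struct recent_keys {
    static constexpr size_t N = 8;      // arbitrary ring size
    std::array<std::string, N> keys{};
    size_t next{0};                     // slot for the next recorded insert
    size_t seed{0};                     // slot that seeds the next spill

    void on_insert(const std::string& k) {
        keys[next] = k;
        next = (next + 1) % N;
    }

    // Rotating through several remembered keys spreads the eviction start
    // points even when the workload keeps re-inserting one hot key.
    const std::string& next_seed() {
        const std::string& k = keys[seed];
        seed = (seed + 1) % N;
        return k;
    }
};
```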

Contributor

It would be interesting to see comparison benchmarks for various workloads (though that's a larger ask than I think is required of the PR). Is it correct to say that it could still trend towards quadratic (though it is less likely to in the general case) for certain workloads?

Member Author

The assumption in the approach is that when it comes time to spill, the last added key isn't the same as it was the last time we spilled. That isn't guaranteed, of course.

Member

@dotnwat - yeah, I understood the comment, but I guess my point is that if, say, 1 key is very hot then the performance is the same as before, right? We just moved the (single) hotspot from begin() to "right after the hot key".

If there are 10 hot keys, I guess the problem is 10 times better since each hotspot will be 1/10th as long. Maybe that's enough, but I would lean towards doing it right if possible because if this bites us even just once out in the wild it will cost a lot. However, maybe doing it right is too costly here.

@dotnwat (Member Author) commented Jan 22, 2025

@travisdowns

1 key is very hot then the performance is the same as before, right?

Correct.

If there are 10 hot keys, I guess the problem is 10 times better since each hotspot will be 1/10th as long. Maybe that's enough, but I would lean towards doing it right

I think then the best minimal enhancement here would be to keep a memory of more than one key for spreading the load.

I would lean towards doing it right

Curious if you were implying something specific here, or rather, doing it right as in making sure we have a resilient solution?

Member

curious if you were implying something specific here, or rather, doing it right as in making sure we have a resilient solution?

Yeah, I meant doing it in a way that is true random sampling, or quite close to it, since it's hard to actually assess how this one will work in practice and it has some cases where it will fall into the same O(n^2) pit as the last one.

Examples would include:

  • Using an associated data structure to allow selecting random keys (like a vector); see the sketch at the end of this comment.
  • Digging into the internals of the hashmap to do random selection. E.g., you could select randomly from an open-addressed hash map by picking a random bucket and removing the element found there (trying again if it was empty).

However, the first sounds expensive in memory, and the second sounds expensive in effort and risk (changing the hash map, or the unsavory approach of digging into private internals). That's why I am not actually pushing either of those.

So I am OK with the current approach as at least being better than the status quo.
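For reference, a minimal sketch of the first option above (a key vector mirrored alongside the map, with swap-and-pop removal). The names are invented, and the duplicated key copies are exactly the memory cost noted above.

```cpp
#include <cstddef>
#include <map>
#include <random>
#include <string>
#include <utility>
#include <vector>

struct random_evicting_index {
    std::map<std::string, long> index;
    std::vector<std::string> keys; // mirrors the index's key set
    std::mt19937 rng{std::random_device{}()};

    void insert(std::string k, long v) {
        auto [it, inserted] = index.emplace(k, v);
        if (inserted) {
            keys.push_back(std::move(k)); // duplicate copy of the key
        } else {
            it->second = v;
        }
    }

    // Uniform random eviction; assumes the index is non-empty.
    // Swap-and-pop keeps the vector removal O(1).
    std::pair<std::string, long> evict_random() {
        std::uniform_int_distribution<size_t> d(0, keys.size() - 1);
        std::swap(keys[d(rng)], keys.back());
        std::string k = std::move(keys.back());
        keys.pop_back();
        auto node = index.extract(k);
        return {std::move(k), node.mapped()};
    }
};
```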

Contributor

Can we apply some similar changes in this PR to spill_key_index::drain_all_keys()? There's also a loop here with _midx.extract(_midx.begin()) present (and a scheduling point per key).

Member Author

Yeah, probably a good idea. I'll try to add a simple benchmark.

@dotnwat dotnwat merged commit d677af1 into redpanda-data:dev Jan 24, 2025
18 checks passed