[CORE-4268] Spill key optimizations #24829
Conversation
This is a refactor of the inner loop in spill_key_index::add_key into a coroutine which will be modified in subsequent commits. Signed-off-by: Noah Watkins <[email protected]>
This is effectively a refactor commit which splits the spill(key) interface into separate functions: Prior to this commit the spill key index had the spill(key) interface which would accept a single key. This commit introduces a spill_payload type which can be used to collect many spilled keys and then spill them all at once with a spill(payload) interface. Signed-off-by: Noah Watkins <[email protected]>
I don't want it using instance state because I'm going to use it for estimates without actually changing state. Signed-off-by: Noah Watkins <[email protected]>
When keys must be spilled, collect them all in a buffer and spill them all at once to avoid per-key co_await invocations and calls into the seastar IO subsystem. Signed-off-by: Noah Watkins <[email protected]>
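A minimal standalone sketch of the batching idea described in this commit, with hypothetical names (spill_payload, evict_to, spill) and std::unordered_map standing in for the real index type: evicted keys are staged in a payload and handed to a single spill call, instead of one write (and one co_await) per key.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical payload type: keys evicted from the index are staged here and
// written out in one batch rather than one at a time.
struct spill_payload {
    std::vector<std::string> keys;
    std::size_t bytes = 0;
};

class key_index {
public:
    // Evict entries until the index shrinks to the target size, staging the
    // evicted keys instead of spilling each one individually.
    void evict_to(std::size_t target_size) {
        spill_payload payload;
        while (index_.size() > target_size) {
            auto node = index_.extract(index_.begin());
            payload.bytes += node.key().size();
            payload.keys.push_back(std::move(node.key()));
        }
        spill(payload); // one write (one suspension point in coroutine form)
    }

private:
    // Stand-in for appending the batch to the compaction key file.
    void spill(const spill_payload& payload) { (void)payload; }

    std::unordered_map<std::string, long> index_; // stand-in for the real map
};
```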
CI test results
- test results on build#60805
- test results on build#60823
- test results on build#60847
Try to start spilling from a dynamic location in the index based on the last inserted key as a seed. The actual removal starts after the last inserted key, which will map to a random key and (hopefully) not a key that has temporal locality with recently inserted items. If an index receives a high number of duplicates then the removal should be no worse than the previous repeated removal from begin(). Signed-off-by: Noah Watkins <[email protected]>
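A rough illustration of the seeded eviction point from this commit, again with assumed names and std::unordered_map standing in for the actual map: look up the last inserted key, skip past it, and start removing from whatever entry the iterator lands on, falling back to begin().

```cpp
#include <string>
#include <unordered_map>
#include <utility>

class key_index {
public:
    void insert(std::string key, long value) {
        last_inserted_ = key;
        index_[std::move(key)] = value;
    }

    // Choose where eviction starts: the entry immediately after the last
    // inserted key (presumed hot, so it is skipped), falling back to begin()
    // if that key is no longer present or the iterator runs off the end.
    std::unordered_map<std::string, long>::iterator eviction_point() {
        auto it = index_.find(last_inserted_);
        if (it != index_.end()) {
            ++it; // skip the last inserted key itself
        }
        if (it == index_.end()) {
            it = index_.begin();
        }
        return it;
    }

private:
    std::string last_inserted_;
    std::unordered_map<std::string, long> index_;
};
```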
src/v/storage/spill_key_index.cc (outdated)
* Evict first entry, we use hash function that guarante good
* randomness so evicting first entry is actually evicting a
* pseudo random elemnent
as Travis pointed out in the previous PR, this isn't necessarily true. It's just copied here verbatim in the first refactor commit.
*
* In order to avoid creating hot erase spots we start removal at the
* location immediately after the last inserted key. The exact last inserted | ||
* key is skipped to maintain good hit locality, and the next key should not |
The exact last inserted key is skipped to maintain good hit locality
This reads a bit weird to me looking through this again. To clarify, what I mean is that we shouldn't remove the last inserted key for fear of destroying workloads that repeat the same key.
It is a bit hard to evaluate the constant factor stuff (everything except the change to how the elements to spill are selected) without a benchmark, but assuming we know the O(n^2)
* introduced to reduce the chance of a reactor stall.
*
* In order to avoid creating hot erase spots we start removal at the
* location immediately after the last inserted key. The exact last inserted |
Assuming that some keys are more common than others, won't this have the same kind of problem as begin(), though to a lesser extent unless it's just 1 common key? Each common key acts as a deletion point, and skipping to the next key doesn't help with that.
Ahh, the comment is sloppy, I'll update it. There are two things at play here:
- With respect to the phrase start removal at the location immediately after the last inserted key: for a workload where there is a strong bias towards a set of keys, we'd like to avoid evicting them because the longer they stay in the index the more they can help deduplicate. In this PR we don't track information to help tell us if a key is hot, so we assume the last inserted key is hot, and that the next key in the index that the iterator points to is independent / uncorrelated.
- With respect to the phrase in order to avoid creating hot erase spots: to the extent that we spill from a specific point in the index and that repeatedly starting at a specific point (e.g. begin()) is what results in the hot spot, the PR attempts to move that hot spot around by starting at the last inserted key. I don't yet know if this is sufficient for the abseil data structure. It could be that we need to spread the eviction out over a much larger area of the key space (e.g. your proposal of randomly sampling from a mirrored vector of keys). For this it may be sufficient to merely track the last N unique inserted keys as an enhancement.
It would be interesting to see comparison benchmarks for various workloads (though that's a larger ask than I think is required of the PR). Is it correct to say that it could still trend towards quadratic (though it is less likely to in the general case) for certain workloads?
The assumption in the approach is that when it comes time to spill, the last added key isn't the same as it was the last time we spilled. That isn't guaranteed, of course.
@dotnwat - yeah, I understood the comment, but I guess my comment is that if, say, 1 key is very hot then the performance is the same as before, right? We just moved the (single) hotspot from begin() to "right after the hot key".
If there are 10 hot keys, I guess the problem is 10 times better since each hotspot will be 1/10th as long. Maybe that's enough, but I would lean towards doing it right if possible because if this bites us even just once out in the wild it will cost a lot. However, maybe doing it right is too costly here.
1 key is very hot then the performance is the same as before, right?
correct.
If there are 10 hot keys, I guess the problem is 10 times better since each hotspot will be 1/10th as long. Maybe that's enough, but I would lean towards doing it right
I think then the best minimal enhancement here would be to keep a memory of more than one key for spreading the load (a rough sketch of this follows below).
I would lean towards doing it right
Curious if you were implying something specific here, or rather, doing it right as in making sure we have a resilient solution?
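A rough sketch of that minimal enhancement, with hypothetical names: remember the last N inserted keys in a small ring buffer and pick one at random as the eviction seed, so a single hot key no longer pins the eviction point to one spot.

```cpp
#include <array>
#include <cstddef>
#include <random>
#include <string>

class eviction_seeds {
public:
    void record_insert(const std::string& key) {
        seeds_[next_ % seeds_.size()] = key;
        ++next_;
    }

    // Returns one of the last N inserted keys, chosen uniformly at random.
    const std::string& pick(std::mt19937& rng) const {
        if (next_ == 0) {
            return seeds_[0]; // nothing recorded yet; empty key
        }
        std::size_t filled = next_ < seeds_.size() ? next_ : seeds_.size();
        std::uniform_int_distribution<std::size_t> dist(0, filled - 1);
        return seeds_[dist(rng)];
    }

private:
    static constexpr std::size_t N = 8; // how many recent keys to remember
    std::array<std::string, N> seeds_;
    std::size_t next_ = 0;
};
```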
curious if you were implying something specific here, or rather, doing it right as in making sure we have a resilient solution?
Yeah, I meant doing it in a way that is true random sampling or quite close to it, since it's hard to actually assess how this one will work in practice, and it has some cases where it will fall into the same O(n^2) pit as the last one.
Examples would include:
- Using an associated data structure to allow selecting random keys (like a vector; see the sketch below)
- Digging into the internals of the hashmap to do random selection. E.g., you could select randomly from an open-addressed hash map by selecting a random bucket and removing the element found there (trying again if it was empty).
However, the first sounds expensive in memory, and the second sounds expensive in effort and risk (changing the hash map, or the unsavory approach of digging into private internals). That's why I am not actually pushing either of those.
So I am OK with the current approach as at least being better than the status quo.
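For reference, a standalone sketch of the first alternative (the mirrored vector), which isn't being pushed for here because of the memory cost: keep a vector of keys alongside the map so a uniformly random victim can be removed with swap-and-pop. Names and the use of std::unordered_map are assumptions for illustration.

```cpp
#include <cstddef>
#include <random>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

class random_evicting_index {
public:
    void insert(const std::string& key, long value) {
        auto [it, inserted] = index_.try_emplace(key, value);
        if (inserted) {
            keys_.push_back(key); // mirror the new key for random selection
        } else {
            it->second = value;
        }
    }

    // Remove one uniformly random entry: pick a random slot in the mirror
    // vector, erase that key from the map, then swap-and-pop the vector.
    void evict_one(std::mt19937& rng) {
        if (keys_.empty()) {
            return;
        }
        std::uniform_int_distribution<std::size_t> dist(0, keys_.size() - 1);
        std::size_t i = dist(rng);
        index_.erase(keys_[i]);
        std::swap(keys_[i], keys_.back());
        keys_.pop_back();
    }

private:
    std::unordered_map<std::string, long> index_;
    std::vector<std::string> keys_; // mirror of the map's keys
};
```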
Can we apply some similar changes in this PR to spill_key_index::drain_all_keys()? There's also a loop here with _midx.extract(_midx.begin()) present (and a scheduling point per key).
Yeah, probably a good idea. I'll try to add a simple benchmark.
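A rough sketch of how the drain_all_keys() loop mentioned above could be batched (hypothetical names, std::unordered_map as a stand-in): extract entries in fixed-size chunks and take one write/scheduling point per chunk rather than per key.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

class key_index {
public:
    // Drain the whole index, yielding once per chunk instead of once per key.
    void drain_all() {
        constexpr std::size_t chunk_size = 128; // assumed batch size
        std::vector<std::pair<std::string, long>> chunk;
        chunk.reserve(chunk_size);
        while (!index_.empty()) {
            chunk.clear();
            while (!index_.empty() && chunk.size() < chunk_size) {
                auto node = index_.extract(index_.begin());
                chunk.emplace_back(std::move(node.key()), node.mapped());
            }
            drain_chunk(chunk); // one write / suspension point per chunk
        }
    }

private:
    // Stand-in for writing a batch of drained entries.
    void drain_chunk(const std::vector<std::pair<std::string, long>>& chunk) {
        (void)chunk;
    }

    std::unordered_map<std::string, long> index_;
};
```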
This PR attempts to implement two suggestions by @travisdowns related to the spilling of compaction keys to disk. Both are taken from this discussion on @StephanDollberg's PR https://github.com/redpanda-data/redpanda/pull/19836/files#r1646331859.
The first suggestion is implemented in this PR: we choose a starting point in the index, and then build up the buffer to be spilled to disk without a suspension point per key.
The second is approximated, not by generating a random sequence of keys to spill but rather by choosing a random location in the index from which to start spilling keys. That is, something other than the current fixed location begin(). So we may still get hot spots, but we try to move the hot spot around. The approximation isn't particularly robust: we use the last inserted key as the "random" location, so a workload that skews heavily towards a few keys would have a few hot spots. Still, perhaps better than always using begin().
As @StephanDollberg pointed out, we don't currently have any great benchmarking tools for compaction so I can't provide any concrete numbers here. I think that's something we may be able to add, but it kinda sounds like we were ready to move ahead with the previous PR attempt sans benchmarks?
Backports Required
Release Notes