Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Copy Avoidance #1457

Open
wants to merge 2 commits into
base: unstable
Choose a base branch
from

Conversation

alexander-shabanov
Copy link

@alexander-shabanov alexander-shabanov commented Dec 18, 2024

Overview

This PR introduces the ability to avoid copying whole content of string object into replies (i.e. bulk string replies) and to allow I/O threads refer directly to obj->ptr in writev iov as described at #1353.

Key Changes

  • Added capability to reply construction allowing to interleave regular replies with copy avoid replies in client reply buffers
  • Extended write-to-client handlers to support copy avoid replies
  • Added copy avoidance of string bulk replies when copy avoidance indicated by I/O threads
  • Minor changes in cluster slots stats in order to support network-bytes-out for copy avoid replies
  • Copy avoidance is beneficial for performance despite object size only starting certain number of threads. So it will be enabled only starting certain number of threads. Internal configuration min-io-threads-copy-avoid introduced to manage this number of threads
  • Copy avoidance is even more efficient starting certain number of threads without value prefetching. Internal configuration min-io-threads-value-prefetch-off introduced to manage this number of threads

Note: When copy avoidance disabled content and handling of client reply buffers remains as before this PR

Implementation Details

client and clientReplyBlock structs:

  1. buf_encoded flag has been added to clientReplyBlock struct and to client struct for static c->buf to indicate if reply buffer is in copy avoidance mode (i.e. include headers and payloads) or not (i.e. plain replies only).
  2. io_last_written_buf, io_last_written_bufpos, io_last_written_data_len fields added client struct to to keep track of write state between writevToClient invocations

Reply construction:

  1. Original _addReplyToBuffer and _addReplyProtoToList have been renamed to _addReplyPayloadToBuffer and _addReplyPayloadToList and extended to support different types of payloads - regular replies and copy avoid replies.
  2. New _addReplyToBuffer and _addReplyProtoToList calls now _addReplyPayloadToBuffer and _addReplyPayloadToList and used for adding regular replies to client reply buffers.
  3. Newly introduced _addBulkOffloadToBuffer and _addBulkOffloadToList are used for adding copy avoid replies to client reply buffers.

Write-to-client infrastructure:

The writevToClient and _postWriteToClient has been significantly changed to support copy avoidance capability.

Internal configuration:

  1. min-io-threads-copy-avoid - Minimum number of IO threads for copy avoidance
  2. min-io-threads-value-prefetch-off - Minimum number of IO threads for disabling value prefetch
  3. min-string-size-copy-avoid - Minimum bulk string size for copy avoidance when IO threads disabled
  4. min-string-size-copy-avoid-threaded - Minimum bulk string size for copy avoidance when IO threads enabled

Testing

  1. Existing unit and integration tests passed. Copy avoidance enabled on tests with --io-threads flag
  2. Added unit tests for copy avoidance functionality

Performance Tests

Note: pay attention io-threads 1 config means only main thread with no additional io-threads, io-threads 2 means main thread plus 1 I/O thread, io-threads 9 means main thread plus 8 I/O threads.

512 byte object size

Tests are conducted on memory optimized instances using:

  • 3,000,000 keys
  • 512 bytes object size
  • 1000 clients
io-threads (including main thread) Plain Reply Copy Avoidance
7 1,160,000 1,160,000
8 1,150,000 1,280,000
9 1,150,000 1,330,000
10 N/A 1,380,000
11 N/A 1,420,000

64 KB object size

Tests are conducted on network optimized instances using:

  • 500 clients
iothreads Data size Keys Plain Reply Copy Avoidance Comment
1 16K 1,000,000 100,000 120,000
2 16K 1,000,000 169,000 135,000
1 32K 500,000 57,000 90,000
2 32K 500,000 110,000 96,000
3 32K 500,000 200,000 190,000
1 64K 100,000 30,000 57,000
2 64K 100,000 69,000 61,000
3 64K 100,000 120,000 117,000
4 64K 100,000 115,000 - 175,000 175,000 Starting 4 threads plain reply performance not stable
5 64K 100,000 115,000 - 165,000 230,000

valkey.conf Outdated Show resolved Hide resolved
src/config.c Outdated Show resolved Hide resolved
@madolson
Copy link
Member

Copy link
Member

@madolson madolson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a super comprehensive review. Mostly just some comments to improve the clarity, since the code is complex but seems mostly reasonable.

The TPS with reply offload enabled and without I/O threads slightly decreased from 200,000 to 190,000. So, reply offload is not recommended without I/O threads until decrease in cob size is highly important for some customers.

I didn't follow the second half of this sentence. Do you mean "unless decrease in cob size is important"? I find that unlikely to be the case. I would also still like to understand better why it degrades performance.

src/networking.c Outdated Show resolved Hide resolved
src/networking.c Outdated Show resolved Hide resolved
src/networking.c Outdated Show resolved Hide resolved
src/networking.c Outdated Show resolved Hide resolved
@zuiderkwast
Copy link
Contributor

I just looked briefly, mostly at the discussions.

I don't think we should call this feature "reply offload". The design is not strictly limited to offloading to threads. It's rather about avoiding copying.

The TPS for GET commands with data size 512 byte increased from 1.09 million to 1.33 million requests per second in test with 1000 clients and 8 I/O threads.

The TPS with reply offload enabled and without I/O threads slightly decreased from 200,000 to 190,000. So, reply offload is not recommended without I/O threads until decrease in cob size is highly important for some customers.

So there appears to be some overhead with this approach? It could be that cob memory is already in CPU cache, but when the cob is written to the client, the values are not in CPU cache anymore, so we get more cold memory accesses.

Anyhow, you tested it only with 512 byte values? I would guess this feature is highly dependent on the value size. With a value size of 100MB, I would be surprised if we don't see an improvement also in single-threaded mode.

Is there any size threshold for when we embed object pointers in the cob? Is it as simple as if the value is stored as OBJECT_ENCODING_RAW, the string is stored in this way? In that case, the threshold is basically around 64 bytes practice, because smaller strings are stored as EMBSTR.

I think we should benchmark this feature with several different value sizes and find the reasonable size threshold where we benefit from this. Probably there will be a different (higher) threshold for single-threaded and a lower one for IO-threaded. Could it even depend on the number of threads?

@alexander-shabanov
Copy link
Author

@alexander-shabanov alexander-shabanov force-pushed the reply_offload branch 2 times, most recently from ac7e1f5 to a40e72e Compare December 19, 2024 14:03
@alexander-shabanov
Copy link
Author

alexander-shabanov commented Dec 20, 2024

Not a super comprehensive review. Mostly just some comments to improve the clarity, since the code is complex but seems mostly reasonable.

The TPS with reply offload enabled and without I/O threads slightly decreased from 200,000 to 190,000. So, reply offload is not recommended without I/O threads until decrease in cob size is highly important for some customers.

I didn't follow the second half of this sentence. Do you mean "unless decrease in cob size is important"? I find that unlikely to be the case. I would also still like to understand better why it degrades performance.

From the tests and perf profiling it appears that main cause for performance improvement from this feature comes from eliminating expensive memory access to obj->ptr by main thread and much much less from eliminating copy to reply buffers. Without I/O threads, main thread still need to access obj->ptr and writev flow is a bit slower (requires additional preparation work) than plain write flow. I will publish results of various tests with and without I/O threads and with different data sizes on next week.

@alexander-shabanov
Copy link
Author

alexander-shabanov commented Dec 20, 2024

I just looked briefly, mostly at the discussions.

I don't think we should call this feature "reply offload". The design is not strictly limited to offloading to threads. It's rather about avoiding copying.

The TPS for GET commands with data size 512 byte increased from 1.09 million to 1.33 million requests per second in test with 1000 clients and 8 I/O threads.
The TPS with reply offload enabled and without I/O threads slightly decreased from 200,000 to 190,000. So, reply offload is not recommended without I/O threads until decrease in cob size is highly important for some customers.

So there appears to be some overhead with this approach? It could be that cob memory is already in CPU cache, but when the cob is written to the client, the values are not in CPU cache anymore, so we get more cold memory accesses.

Anyhow, you tested it only with 512 byte values? I would guess this feature is highly dependent on the value size. With a value size of 100MB, I would be surprised if we don't see an improvement also in single-threaded mode.

Is there any size threshold for when we embed object pointers in the cob? Is it as simple as if the value is stored as OBJECT_ENCODING_RAW, the string is stored in this way? In that case, the threshold is basically around 64 bytes practice, because smaller strings are stored as EMBSTR.

I think we should benchmark this feature with several different value sizes and find the reasonable size threshold where we benefit from this. Probably there will be a different (higher) threshold for single-threaded and a lower one for IO-threaded. Could it even depend on the number of threads?

Very good questions. I will publish results of various tests with and without I/O threads and with different data sizes on next week. IMPORTANT NOTE: we can't switch on or off reply offload dynamically according to obj(string) size cause main optimization is to eliminate expensive memory access to obj->ptr by main thread (eliminating copy much less important).

@zuiderkwast
Copy link
Contributor

we can't switch on or off reply offload dynamically according to obj(string) size cause main optimization is to eliminate expensive memory access to obj->ptr by main thread

Got it. Thanks!

At least, when the feature is ON, it doesn't make sense to dynamically switch it OFF based on length.

But for single-threaded mode where this feature is normally OFF, we could consider switching it ON dynamically only for really huge strings, right? In this case we will have one expensive memory access, but we could avoid copying megabytes. Let's see the benchmark results if this makes sense.

I appreciate you're testing this with different sizes and with/without IO threading.

@alexander-shabanov
Copy link
Author

we can't switch on or off reply offload dynamically according to obj(string) size cause main optimization is to eliminate expensive memory access to obj->ptr by main thread

Got it. Thanks!

At least, when the feature is ON, it doesn't make sense to dynamically switch it OFF based on length.

But for single-threaded mode where this feature is normally OFF, we could consider switching it ON dynamically only for really huge strings, right? In this case we will have one expensive memory access, but we could avoid copying megabytes. Let's see the benchmark results if this makes sense.

I appreciate you're testing this with different sizes and with/without IO threading.

Published results of performance tests in the PR description

src/unit/test_networking.c Outdated Show resolved Hide resolved
src/cluster_slot_stats.c Outdated Show resolved Hide resolved
src/io_threads.c Show resolved Hide resolved
src/networking.c Show resolved Hide resolved
@zuiderkwast
Copy link
Contributor

Thanks for the benchmarks! This is very interesting. For large values (64KB and above), it is a great improvement also for single-threaded. For 512 bytes values, it's faster only with 9 threads or more.

This confirms my guess that it's not only about offloading work to IO threads, but also about less copying for large values.

We should have some threshold to use it also for single threaded. I suggest we use 64KB as the threshold, or benchmark more sizes to find a better threshold.

@alexander-shabanov
Copy link
Author

Thanks for the benchmarks! This is very interesting. For large values (64KB and above), it is a great improvement also for single-threaded. For 512 bytes values, it's faster only with 9 threads or more.

This confirms my guess that it's not only about offloading work to IO threads, but also about less copying for large values.

We should have some threshold to use it also for single threaded. I suggest we use 64KB as the threshold, or benchmark more sizes to find a better threshold.

Pay attention 9 threads means main thread + 8 I/O threads. Why do we need to find out threshold? I still think it should be config param and customers should test their workloads and activate or not accordingly.

@ranshid
Copy link
Member

ranshid commented Jan 2, 2025

Thanks for the benchmarks! This is very interesting. For large values (64KB and above), it is a great improvement also for single-threaded. For 512 bytes values, it's faster only with 9 threads or more.
This confirms my guess that it's not only about offloading work to IO threads, but also about less copying for large values.
We should have some threshold to use it also for single threaded. I suggest we use 64KB as the threshold, or benchmark more sizes to find a better threshold.

Pay attention 9 threads means main thread + 8 I/O threads. Why do we need to find out threshold? I still think it should be config param and customers should test their workloads and activate or not accordingly.

@alexander-shabanov I do not think adding a configuration parameter is the preferred option in this case. Users are almost never tuning their caches at these levels and it is also very problematic to tell the user to learn his workload and formulate ahis own rules to when to enable this config. I also think there is some risk in introducing this degradation so we should work to understand what other alternatives we have.
I can think of some:

  1. Enable this feature only when the number of active IO-Threads is 8
  2. Track the CPU consumption on the engine thread and synamically enable the feature when the main engine is utilizing high CPU
  3. Maybe we can find a way to tag the reply object should be offloaded so that we will not get the memory access penalty
    And I am sure there are more.

src/unit/test_networking.c Show resolved Hide resolved
src/unit/test_networking.c Show resolved Hide resolved
src/cluster_slot_stats.c Show resolved Hide resolved
src/cluster_slot_stats.c Outdated Show resolved Hide resolved
c->io_last_bufpos = ((clientReplyBlock *)listNodeValue(c->io_last_reply_block))->used;
clientReplyBlock *block = (clientReplyBlock *)listNodeValue(c->io_last_reply_block);
c->io_last_bufpos = block->used;
/* If reply offload enabled force new header */
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we should indeed check if reply offload is enabled before writing NULL to the header? Checking a global is cheaper than write access to the block

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Anyway, with or without reply offload, main thread just finished to write to this block

src/networking.c Outdated Show resolved Hide resolved
src/networking.c Outdated Show resolved Hide resolved
src/networking.c Outdated Show resolved Hide resolved
src/networking.c Outdated Show resolved Hide resolved
}

static inline int updatePayloadHeader(payloadHeader *last_header, uint8_t type, size_t len, int slot) {
if (last_header->type == type && last_header->slot == slot) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When using for example MGET with small values that are not offloaded and come from different slots, wouldn't this cause multiple plain headers, thus requiring writeV instead of a simple write? We should investigate if this causes any performance degradation

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In CMD slot will be always equal to -1 (see if (!clusterSlotStatsEnabled(slot)) slot = -1; in upsertPayloadHeader)
In CME MGET keys must map to the same slot.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does mean that some plans to support cross slot operations in CME would have to consider this as well.
I agree we can consider this to be solved when the time comes. probably per slot accounting will have to be refactored a bit in order to support cross slot commands.

@zuiderkwast
Copy link
Contributor

Why do we need to find out threshold? I still think it should be config param and customers should test their workloads and activate or not accordingly.

Just like Ran and Madelyn, I don't want to introduce a config. There are many reasons or that. This is an optimization, not new functionality. Most users don't tweak configs like that. An optimization can change, but a config needs to be maintained and needs backward compatibility. The more configs we add, the harder it is for users to tune the combination of all configs, so we should do our best to find the best behavior automatically.

Second topic: There are three code paths worth considering.

  1. IO threads >= N. To avoid a memory access, we don't want to check the string size. Offload to IO thread.
  2. IO threads not active. Short string reply. The main thread copies the string to reply buffer.
  3. IO threads not active. Long string reply. When the main thread is about to copy the string to the reply buffer, it can see that the length is >= M bytes long (size threshold) and switch to this feature to avoid copying the string.

Case 3 is a very powerful optimization for long strings. Some users store large data in strings.

IO threads disabled is also not a corner case in any way. It is common to run small instances without IO threads and to scale horizontally using more cluster nodes instead of vertically using threads.

Instead of configs, I think we should pick some safe constants for N and M to make sure we don't get a regression in any case. I suggest N = 9 and M = 16K.

@alexander-shabanov
Copy link
Author

Why do we need to find out threshold? I still think it should be config param and customers should test their workloads and activate or not accordingly.

Just like Ran and Madelyn, I don't want to introduce a config. There are many reasons or that. This is an optimization, not new functionality. Most users don't tweak configs like that. An optimization can change, but a config needs to be maintained and needs backward compatibility. The more configs we add, the harder it is for users to tune the combination of all configs, so we should do our best to find the best behavior automatically.

Second topic: There are three code paths worth considering.

1. IO threads >= N. To avoid a memory access, we don't want to check the string size. Offload to IO thread.

2. IO threads not active. Short string reply. The main thread copies the string to reply buffer.

3. IO threads not active. Long string reply. When the main thread is about to copy the string to the reply buffer, it can see that the length is >= M bytes long (size threshold) and switch to this feature to avoid copying the string.

Case 3 is a very powerful optimization for long strings. Some users store large data in strings.

IO threads disabled is also not a corner case in any way. It is common to run small instances without IO threads and to scale horizontally using more cluster nodes instead of vertically using threads.

Instead of configs, I think we should pick some safe constants for N and M to make sure we don't get a regression in any case. I suggest N = 9 and M = 16K.

@zuiderkwast We discussed it and going to propose in this PR to perform reply offload according to static number of I/O threads (i.e. io-threads config). Static number of I/O threads is more preferable than active I/O threads because it makes all the tests and troubleshooting of potential issues much more deterministic. I am working on final code changes and tests.

The reply offload activation according to size of object is deferred to another PR.

The best solution will be to perform reply offload according to actual main thread load matching the condition where reply offload is beneficial. However, it requires much more complex research.

@alexander-shabanov
Copy link
Author

Thanks for the benchmarks! This is very interesting. For large values (64KB and above), it is a great improvement also for single-threaded. For 512 bytes values, it's faster only with 9 threads or more.
This confirms my guess that it's not only about offloading work to IO threads, but also about less copying for large values.
We should have some threshold to use it also for single threaded. I suggest we use 64KB as the threshold, or benchmark more sizes to find a better threshold.

Pay attention 9 threads means main thread + 8 I/O threads. Why do we need to find out threshold? I still think it should be config param and customers should test their workloads and activate or not accordingly.

@alexander-shabanov I do not think adding a configuration parameter is the preferred option in this case. Users are almost never tuning their caches at these levels and it is also very problematic to tell the user to learn his workload and formulate ahis own rules to when to enable this config. I also think there is some risk in introducing this degradation so we should work to understand what other alternatives we have. I can think of some:

1. Enable this feature only when the number of active IO-Threads is 8

2. Track the CPU consumption on the engine thread and synamically enable the feature when the main engine is utilizing high CPU

3. Maybe we can find a way to tag the reply object should be offloaded so that we will not get the memory access penalty
   And I am sure there are more.

@ranshid please see this reply

@ranshid
Copy link
Member

ranshid commented Jan 13, 2025

Thank you @alexander-shabanov . I think we need to be careful about this PR as this is near the release date. I think it is good we will focus on introducing a simple optimization and later focus on better optimization for dynamic support or large strings when io-threads are disabled.

@ranshid
Copy link
Member

ranshid commented Jan 13, 2025

@alexander-shabanov / @uriyage following the discussion with the maintainers let's try to answer some of the followup questions:

  1. In the performance results we see degradation when using small values which is "mitigated" when the number of io-threads is increased and we get better results when we offload the reply when the number of io-threads is higher than N. I suppose this is explained since the io-threads are not prefetching the values, so when the number of io-thraeds is small they are the bottleneck and thus we get degraded performance while when the number of io-threads is high the bottleneck shifts back to the engine - can you please verify that?

  2. Is does seem the better solution is to make dynamic decision based on the string size. IMO in order to better support it we would need to make use of the alternative solution (tagging the clientReplyBlock). I recall there were some concerns going with this option but would be happy if you can revive some benchmark results and bring more data.

@zuiderkwast
Copy link
Contributor

Just an idea: We could use a flag bit in robj to flag that a string value is larger than say 1K. Then, we can check this bit instead of reading the actual size.

@ranshid
Copy link
Member

ranshid commented Jan 14, 2025

Just an idea: We could use a flag bit in robj to flag that a string value is larger than say 1K. Then, we can check this bit instead of reading the actual size.

I agree it is a valid option and I think we already raised it some comments above but I think it is more related to the way we could dynamically enable this feature per object and not handle the degradation we observe for small objects. From the benchmark results you can see that even when there are no io-threads there is a slight degradation even though the engine does not access the string. IMO it is caused by the extra work to manage/access the headers.
I think we can either decide to allow this degradation (2.5% for the non-iothreads case which is not super bad...) or we can change the implementation to the alternative one (using reply block tagging). I think @alexander-shabanov has some insights about the issues with the alternative implementation which I would like to observe in order to take the decision.

@alexander-shabanov
Copy link
Author

Just an idea: We could use a flag bit in robj to flag that a string value is larger than say 1K. Then, we can check this bit instead of reading the actual size.

This is the intention in "The reply offload activation according to size of object is deferred to another PR."

@alexander-shabanov
Copy link
Author

alexander-shabanov commented Jan 14, 2025

@alexander-shabanov / @uriyage following the discussion with the maintainers let's try to answer some of the followup questions:

1. In the performance results we see degradation when using small values which is "mitigated" when the number of io-threads is increased and we get better results when we offload the reply when the number of io-threads is higher than N. I suppose this is explained since the io-threads are not prefetching the values, so when the number of io-thraeds is small they are the bottleneck and thus we get degraded performance while when the number of io-threads is high the bottleneck shifts back to the engine - can you please verify that?

2. Is does seem the better solution is to make dynamic decision based on the string size. IMO in order to better support it we would need to make use of the alternative solution (tagging the clientReplyBlock). I recall there were some concerns going with this option but would be happy if you can revive some benchmark results and bring more data.

@ranshid

  1. We confirmed with profiling tool for small objects (e.g. 512 byte) and small number of I/O threads, I/O threads are bottleneck. The main reason is memory access done by I/O thread. In code published right now in PR, I/O thread accesses:
  • c->buf
  • obj
  • obj->ptr
    In upcoming code optimization we eliminated access to obj by I/O thread. Will publish this change soon. This improved performance a bit - reply offload started to be beneficial from a bit lower number of threads.
    Note: We tried to apply prefetching on I/O thread side but it did not help. We deferred deeper research of prefetching on I/O thread side.
  1. Solution based on clientReplyBlock has been evaluated in internal POC and found worse than solution proposed in this PR. It does not reduce memory access but require more memory allocations and adds an additional access to clientReplyBlock.

@alexander-shabanov
Copy link
Author

Just an idea: We could use a flag bit in robj to flag that a string value is larger than say 1K. Then, we can check this bit instead of reading the actual size.

I agree it is a valid option and I think we already raised it some comments above but I think it is more related to the way we could dynamically enable this feature per object and not handle the degradation we observe for small objects. From the benchmark results you can see that even when there are no io-threads there is a slight degradation even though the engine does not access the string. IMO it is caused by the extra work to manage/access the headers. I think we can either decide to allow this degradation (2.5% for the non-iothreads case which is not super bad...) or we can change the implementation to the alternative one (using reply block tagging). I think @alexander-shabanov has some insights about the issues with the alternative implementation which I would like to observe in order to take the decision.

please these replies:
#1457 (comment)
#1457 (comment)

@alexander-shabanov alexander-shabanov force-pushed the reply_offload branch 2 times, most recently from 6bf48aa to cb5e97f Compare January 16, 2025 11:14
@alexander-shabanov
Copy link
Author

@zuiderkwast @ranshid I uploaded new revision addressing main comment regarding reply offload efficiency starting certain number of threads. Updated description with changes in the code + config and performance test numbers

Copy link

codecov bot commented Jan 16, 2025

Codecov Report

Attention: Patch coverage is 92.53247% with 23 lines in your changes missing coverage. Please review.

Project coverage is 71.06%. Comparing base (12ec3d5) to head (b9173ee).
Report is 1 commits behind head on unstable.

Files with missing lines Patch % Lines
src/networking.c 94.62% 15 Missing ⚠️
src/io_threads.c 25.00% 6 Missing ⚠️
src/memory_prefetch.c 0.00% 2 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##           unstable    #1457      +/-   ##
============================================
+ Coverage     70.98%   71.06%   +0.08%     
============================================
  Files           121      121              
  Lines         65176    65387     +211     
============================================
+ Hits          46264    46467     +203     
- Misses        18912    18920       +8     
Files with missing lines Coverage Δ
src/cluster_slot_stats.c 94.18% <100.00%> (-0.17%) ⬇️
src/config.c 78.39% <ø> (ø)
src/replication.c 87.64% <100.00%> (+0.19%) ⬆️
src/server.h 100.00% <ø> (ø)
src/memory_prefetch.c 3.05% <0.00%> (-0.08%) ⬇️
src/io_threads.c 7.45% <25.00%> (+0.51%) ⬆️
src/networking.c 89.33% <94.62%> (+0.50%) ⬆️

... and 10 files with indirect coverage changes

@alexander-shabanov
Copy link
Author

Future Enhancements

The reply offload optimization in this PR is enabled according to (static) number of I/O threads with no regard to object size cause it is expected to benefit any bulk string reply starting certain number of threads.

As performance tests show, the reply offload can benefit large objects (e.g. 64 KB) starting much lower number of threads. In order to use this benefit following challenges should be addressed:

  1. string size detection in main thread without expensive memory access to obj→ptr
  2. allowing to mix “plain” (i.e. without headers) and “offloaded” (i.e. with headers) replies in reply buffers of the same client

The 1st challenge can be addressed by using flag on serverObject. The blocker for this approach is that right now all 64 bits on serverObject are already in use. Alternative can be next observation - if number of I/O threads lower than min-io-threads-reply-offload-on threshold main thread access obj→ptr anyway and can detect bulk reply is suitable for offload according to size.

The 2nd challenge can be addressed by using reply_offload flag per client reply buffer (i.e. c→buf or reply block buf) instead of having this flag per client. This will allow to move from “plain” to “offload” mode and back on the same client by adding new client reply block to c->reply .

So, potential solutions for these challenge should be researched, tested and tuned. We are pretty sure these solutions will leverage reply construction and write-to-client mechanisms introduced in this PR.

}

static inline int updatePayloadHeader(payloadHeader *last_header, uint8_t type, size_t len, int slot) {
if (last_header->type == type && last_header->slot == slot) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does mean that some plans to support cross slot operations in CME would have to consider this as well.
I agree we can consider this to be solved when the time comes. probably per slot accounting will have to be refactored a bit in order to support cross slot commands.

return C_ERR;
}

static size_t upsertPayloadHeader(char *buf, size_t *bufpos, payloadHeader **last_header, uint8_t type, size_t len, int slot, size_t available) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is upser? I would have thought a typo so but it is constantly used all over the code and comments.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also can we document what this internal function is responsible of doing?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

upsert is known terminology for update or Insert

Copy link
Member

@ranshid ranshid Jan 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh I now see upsert (need glasses)

Copy link
Contributor

@zuiderkwast zuiderkwast Jan 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update-or-insert, I had to look it up actually. It wasn't in my vocabulary until now.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ranshid upsertPayloadHeader body is commented thoroughly . Let me know if any specific comment required

src/networking.c Outdated
return 0;
}

switch (getClientType(c)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

did we verify all the cases of temp/fake clients? I honestly I am not sure if in all cases the client output will be forwarded to io-threads orare there cases were it is being used internally.

@ranshid ranshid added the run-extra-tests Run extra tests on this PR (Runs all tests from daily except valgrind and RESP) label Jan 23, 2025
src/networking.c Outdated Show resolved Hide resolved
src/networking.c Outdated Show resolved Hide resolved
src/networking.c Outdated Show resolved Hide resolved
src/networking.c Outdated Show resolved Hide resolved
src/networking.c Show resolved Hide resolved
@alexander-shabanov alexander-shabanov force-pushed the reply_offload branch 4 times, most recently from e5342fc to b9173ee Compare January 30, 2025 15:16
@alexander-shabanov alexander-shabanov changed the title Reply offload Copy Avoidance Jan 30, 2025
Signed-off-by: Alexander Shabanov <[email protected]>
Signed-off-by: Alexander Shabanov <[email protected]>
Copy link
Contributor

@zuiderkwast zuiderkwast left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Partial review. I can't continue right now so I'm posting my pending comments. I've reviewed to the middle of networking.c to just before the iovec definitions.

I'm convinced this feature is worth the complexity.

Comment on lines 794 to 802
typedef struct clientReplyBlock {
size_t size, used;
union {
uint8_t raw_flag;
ClientReplyBlockFlags flag;
};
payloadHeader *last_header;
char buf[];
} clientReplyBlock;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

last_header will be pointer-aligned by most compilers I guess, so there will be padding inserted between flags and last header. If we put last header before the flags union, we can avoid the padding? The buf[] is just char aligned.

(Is there where you added packed attribute in a previous commit? I think reordering the fields is somehow more nice than using the packed attribute, because aligned access is better in some architectures.)

Comment on lines +1223 to +1225
char *io_last_written_buf; /* Last buffer that has been written to the client connection */
size_t io_last_written_bufpos; /* The buffer has been written until this position */
size_t io_last_written_data_len; /* The actual length of the data written from this buffer
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The last buffer, can it be either c->buf or a c->reply list node, depending on what io_last_written_buf is pointing at?

Comment on lines +3258 to +3261
createIntConfig("min-io-threads-copy-avoid", NULL, MODIFIABLE_CONFIG | HIDDEN_CONFIG, 0, INT_MAX, server.min_io_threads_copy_avoid, 7, INTEGER_CONFIG, NULL, NULL),
createIntConfig("min-io-threads-value-prefetch-off", NULL, MODIFIABLE_CONFIG | HIDDEN_CONFIG, 0, INT_MAX, server.min_io_threads_value_prefetch_off, 10, INTEGER_CONFIG, NULL, NULL),
createIntConfig("min-string-size-copy-avoid", NULL, MODIFIABLE_CONFIG | HIDDEN_CONFIG, 0, INT_MAX, server.min_string_size_copy_avoid, 16384, INTEGER_CONFIG, NULL, NULL),
createIntConfig("min-string-size-copy-avoid-threaded", NULL, MODIFIABLE_CONFIG | HIDDEN_CONFIG, 0, INT_MAX, server.min_string_size_copy_avoid_threaded, 65536, INTEGER_CONFIG, NULL, NULL),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy avoidance sounds good. Should we include the word reply too?

Ideas, not sure which is the best word order:

  • min-io-threads-reply-copy-avoid
  • min-io-threads-copy-reply-avoid
  • min-io-threads-avoid-copy-reply

Comment on lines +621 to +630

int isCopyAvoidIndicatedByIOThreads(void) {
/* Starting min_io_threads_for_copy_avoid I/O threads copy avoidance should be beneficial for any string size */
return server.min_io_threads_copy_avoid && server.io_threads_num >= server.min_io_threads_copy_avoid;
}

int isValuePrefetchIndicatedByIOThreads(void) {
/* Starting min_io_threads_value_prefetch_off I/O threads copy avoidance should be more efficient without value prefetch */
return server.io_threads_num < server.min_io_threads_value_prefetch_off;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These functions are only called from one place and they just return data from the global server object, from configs. It seems to indicate they maybe don't belong in this file.

Since they are only called in one place, I'd say we can skip these functions and inline the code in memory_prefetch.c and networking.c (in isCopyAvoidAllowed) respectively.

src/networking.c Show resolved Hide resolved
* we store store pointers to object and string itself */
typedef struct __attribute__((__packed__)) bulkCopyAvoid {
robj *obj; /* pointer to object used for reference count management */
void *str; /* pointer to string to optimize memory access by I/O thread */
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be sds str?

return _addReplyPayloadToBuffer(c, s, len, CLIENT_REPLY_PLAIN);
}

static size_t _addBulkOffloadToBuffer(client *c, const void *payload, size_t len) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Offload or CopyAvoid?

Let's think about the terminology for a bit. Maybe we can find some terms that are even more explicit about what the buffer contains.

We have three type of buffer content and we should ideally use different words for each of them:

  • unencoded buffer with PLAIN content, no payload headers
  • encoded buffer, payload header type PLAIN, plain content
  • encoded buffer, payload header type COPY-AVOID, content is a sequence of bulkCopyAvoid objects.

The overloaded use of the word PLAIN is a source of confusion here. I suggest we use PLAIN for unencoded buffer and another word such as INLINE for the payload header type for encoded buffer with plain content.

The bulkCopyAvoid struct is actually a reference to the object. That's why we increment the reference counter. 😀 So maybe we should use the word "reference" for this kind of buffer content, e.g. _addBulkRefToBuffer, and the payload header type can be CLIENT_REPLY_REF? Or use the word INDIRECT?

Please also add a short comment above each of these functions to explain what the function name itself doesn't explain, like what it means that if it returns 0, etc.

And one more thing. I think this function can take const bulkCopyAvoid * instead of void *payload and size_t len. Where this function calls _addReplyPayloadToBuffer we can spell out sizeof(bulkCopyAvoid) for the length argument.

@@ -1127,8 +1277,18 @@ void addReplyBulkLen(client *c, robj *obj) {
_addReplyLongLongWithPrefix(c, len, '$');
}

int tryOffloadBulkReply(client *c, robj *obj) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this used only in this file? Make it static. (Shows the intent that it's a local function and it let's the compiler warn about unused functions.)

Rename "Offload"?

/* Adds the reply to the reply linked list.
* Note: some edits to this function need to be relayed to AddReplyFromClient. */
void _addReplyProtoToList(client *c, list *reply_list, const char *s, size_t len) {
static void _addReplyPayloadToList(client *c, list *reply_list, const char *payload, size_t len, uint8_t payload_type) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment above belongs to _addReplyProtoToList I think...?

Please move it and write a more correct comment about _addReplyPayloadToList.

@@ -484,6 +621,18 @@ void _addReplyToBufferOrList(client *c, const char *s, size_t len) {
if (len > reply_len) _addReplyProtoToList(c, c->reply, s + reply_len, len - reply_len);
}

void _addBulkOffloadToBufferOrList(client *c, robj *obj) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rename?

Add a short one-line comment as documentation.

Make it static. It's what we've decided to use for local functions. (Keep the underscore prefix if you want. I don't care. It's used in this file but not in other files.)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
run-extra-tests Run extra tests on this PR (Runs all tests from daily except valgrind and RESP)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants