Fix intermittent cpu spike in grpc async client#30123
Conversation
Signed-off-by: Vikas Choudhary <choudharyvikas16@gmail.com>
| // the difference between them is less than 1 second. If we don't do this, the timer will | ||
| // be enabled with 0 seconds, which will cause the timer to fire immediately. This will | ||
| // cause cpu spike. | ||
| (std ::chrono::duration_cast<std::chrono::seconds>(next_expire - now).count() == 0)) { |
There was a problem hiding this comment.
Do we have a good way to test this? How did you chose 1 second as a good threshold?
There was a problem hiding this comment.
thanks Lizan for taking a look.
How did you chose 1 second as a good threshold?
While enabling timer at L218, duration is round-off to seconds. Spike happens when this round-off result is 0. Hence I used same rounding-off in the if condition to avoid timer initialiazation with 0s. From the logs, that I shared in PR description as well, when I figured out that if difference between next_expire and now is less than a sec it gets rounded off to 0, hence mentioned 1 sec in code comment. May be it gets round off to 0 only when diff is less than .5sec, I am not exactly sure about this round-off.
Do we have a good way to test this?
I will go through existing tests and see if we have a way to test this out
|
As an alternative to the current implementation, can we make the timer sleep for 1 second when the time interval is less than 1 second to avoid the busy loop? |
yeah we can do that. Actually initially I thought of doing same, but since it require adding another @lizan wdyt? |
|
I'm fine either way too. If you can add a test that would be nice. |
| while (!lru_list_.empty()) { | ||
| MonotonicTime next_expire = lru_list_.back().accessed_time_ + EntryTimeoutInterval; | ||
| if (now >= next_expire) { | ||
| std::chrono::milliseconds time_to_next_expire_sec = |
There was a problem hiding this comment.
| std::chrono::milliseconds time_to_next_expire_sec = | |
| std::chrono::seconds time_to_next_expire_sec = |
| if ((now >= next_expire) || | ||
| // since 'now' and 'next_expire' are in nanoseconds, the following condition is to | ||
| // check if the difference between them is less than 1 second. If we don't do this, the | ||
| // timer will be enabled with 0 seconds, which will cause the timer to fire immediately. | ||
| // This will cause cpu spike. | ||
| (time_to_next_expire_sec.count() == 0)) { |
There was a problem hiding this comment.
this can be simplified to time_to_next_expire_sec.count() <= 0?
| } else { | ||
| cache_eviction_timer_->enableTimer( | ||
| std::chrono::duration_cast<std::chrono::seconds>(next_expire - now)); | ||
| if (time_to_next_expire_sec.count() == 0) { |
There was a problem hiding this comment.
this never happen because the if above.
| const RawAsyncClientSharedPtr& client); | ||
|
|
||
| RawAsyncClientSharedPtr getCache(const GrpcServiceConfigWithHashKey& config_with_hash_key); | ||
| int32_t timer_enabled_with_0_duration_count_ = 0; |
There was a problem hiding this comment.
you shouldn't need this just for test. if you want to test you should use a mock timer.
Signed-off-by: Vikas Choudhary <choudharyvikas16@gmail.com>
In grpc async client, if timer expiry handler function gets fired where time to next_expiry is less than 1 sec, a loop gets created where timer gets enabled with 0 secs expiry again and again until 'now' becomes as next_expiry. This causes random cpu spikes. If there is HPA configured on cpu, this will result in scale up and scale down on proxies. I added some logs while debugging it. Sharing here might help understanding the issue more clearly: ``` AsyncClientManagerImpl::evictEntriesAndResetEvictionTimer [2023-09-08 10:27:32.640][22][info][grpc] [external/envoy/source/common/grpc/async_client_manager_impl.cc:198] AsyncClientManagerImpl::evictEntriesAndResetEvictionTimer next_expire: 966826645315069, now 966825877367850 [2023-09-08 10:27:32.640][22][info][grpc] [external/envoy/source/common/grpc/async_client_manager_impl.cc:208] AsyncClientManagerImpl::evictEntriesAndResetEvictionTimer enable timer: 0 [2023-09-08 10:27:32.640][22][info][grpc] [external/envoy/source/common/grpc/async_client_manager_impl.cc:193] AsyncClientManagerImpl::evictEntriesAndResetEvictionTimer [2023-09-08 10:27:32.640][22][info][grpc] [external/envoy/source/common/grpc/async_client_manager_impl.cc:198] AsyncClientManagerImpl::evictEntriesAndResetEvictionTimer next_expire: 966826645315069, now 966825877389741 [2023-09-08 10:27:32.640][22][info][grpc] [external/envoy/source/common/grpc/async_client_manager_impl.cc:208] AsyncClientManagerImpl::evictEntriesAndResetEvictionTimer enable timer: 0 ``` When this condition hits, logs get filled because of the loop I mentioned in the beginning. Fix is simple that in the `if` condition for expiry consider this round off thing as well. Additional Description: Risk Level: Testing: Docs Changes: Release Notes: Platform Specific Features: Signed-off-by: Vikas Choudhary <choudharyvikas16@gmail.com>
In grpc async client, if timer expiry handler function gets fired where time to next_expiry is less than 1 sec, a loop gets created where timer gets enabled with 0 secs expiry again and again until 'now' becomes as next_expiry. This causes random cpu spikes. If there is HPA configured on cpu, this will result in scale up and scale down on proxies. I added some logs while debugging it. Sharing here might help understanding the issue more clearly: ``` AsyncClientManagerImpl::evictEntriesAndResetEvictionTimer [2023-09-08 10:27:32.640][22][info][grpc] [external/envoy/source/common/grpc/async_client_manager_impl.cc:198] AsyncClientManagerImpl::evictEntriesAndResetEvictionTimer next_expire: 966826645315069, now 966825877367850 [2023-09-08 10:27:32.640][22][info][grpc] [external/envoy/source/common/grpc/async_client_manager_impl.cc:208] AsyncClientManagerImpl::evictEntriesAndResetEvictionTimer enable timer: 0 [2023-09-08 10:27:32.640][22][info][grpc] [external/envoy/source/common/grpc/async_client_manager_impl.cc:193] AsyncClientManagerImpl::evictEntriesAndResetEvictionTimer [2023-09-08 10:27:32.640][22][info][grpc] [external/envoy/source/common/grpc/async_client_manager_impl.cc:198] AsyncClientManagerImpl::evictEntriesAndResetEvictionTimer next_expire: 966826645315069, now 966825877389741 [2023-09-08 10:27:32.640][22][info][grpc] [external/envoy/source/common/grpc/async_client_manager_impl.cc:208] AsyncClientManagerImpl::evictEntriesAndResetEvictionTimer enable timer: 0 ``` When this condition hits, logs get filled because of the loop I mentioned in the beginning. Fix is simple that in the `if` condition for expiry consider this round off thing as well. Additional Description: Risk Level: Testing: Docs Changes: Release Notes: Platform Specific Features: Signed-off-by: Vikas Choudhary <choudharyvikas16@gmail.com>
Signed-off-by: Vikas Choudhary <choudharyvikas16@gmail.com>
In grpc async client, if timer expiry handler function gets fired where time to next_expiry is less than 1 sec, a loop gets created where timer gets enabled with 0 secs expiry again and again until 'now' becomes as next_expiry. This causes random cpu spikes. If there is HPA configured on cpu, this will result in scale up and scale down on proxies. I added some logs while debugging it. Sharing here might help understanding the issue more clearly: ``` AsyncClientManagerImpl::evictEntriesAndResetEvictionTimer [2023-09-08 10:27:32.640][22][info][grpc] [external/envoy/source/common/grpc/async_client_manager_impl.cc:198] AsyncClientManagerImpl::evictEntriesAndResetEvictionTimer next_expire: 966826645315069, now 966825877367850 [2023-09-08 10:27:32.640][22][info][grpc] [external/envoy/source/common/grpc/async_client_manager_impl.cc:208] AsyncClientManagerImpl::evictEntriesAndResetEvictionTimer enable timer: 0 [2023-09-08 10:27:32.640][22][info][grpc] [external/envoy/source/common/grpc/async_client_manager_impl.cc:193] AsyncClientManagerImpl::evictEntriesAndResetEvictionTimer [2023-09-08 10:27:32.640][22][info][grpc] [external/envoy/source/common/grpc/async_client_manager_impl.cc:198] AsyncClientManagerImpl::evictEntriesAndResetEvictionTimer next_expire: 966826645315069, now 966825877389741 [2023-09-08 10:27:32.640][22][info][grpc] [external/envoy/source/common/grpc/async_client_manager_impl.cc:208] AsyncClientManagerImpl::evictEntriesAndResetEvictionTimer enable timer: 0 ``` When this condition hits, logs get filled because of the loop I mentioned in the beginning. Fix is simple that in the `if` condition for expiry consider this round off thing as well. Additional Description: Risk Level: Testing: Docs Changes: Release Notes: Platform Specific Features: Signed-off-by: Vikas Choudhary <choudharyvikas16@gmail.com> fix tests to match older codebase Signed-off-by: Vikas Choudhary <choudharyvikas16@gmail.com> Signed-off-by: Vikas Choudhary (vikasc) <choudharyvikas16@gmail.com>
Signed-off-by: Vikas Choudhary (vikasc) <choudharyvikas16@gmail.com> Signed-off-by: Vikas Choudhary <choudharyvikas16@gmail.com>
In grpc async client, if timer expiry handler function gets fired where time to next_expiry is less than 1 sec, a loop gets created where timer gets enabled with 0 secs expiry again and again until 'now' becomes as next_expiry. This causes random cpu spikes. If there is HPA configured on cpu, this will result in scale up and scale down on proxies. I added some logs while debugging it. Sharing here might help understanding the issue more clearly: ``` AsyncClientManagerImpl::evictEntriesAndResetEvictionTimer [2023-09-08 10:27:32.640][22][info][grpc] [external/envoy/source/common/grpc/async_client_manager_impl.cc:198] AsyncClientManagerImpl::evictEntriesAndResetEvictionTimer next_expire: 966826645315069, now 966825877367850 [2023-09-08 10:27:32.640][22][info][grpc] [external/envoy/source/common/grpc/async_client_manager_impl.cc:208] AsyncClientManagerImpl::evictEntriesAndResetEvictionTimer enable timer: 0 [2023-09-08 10:27:32.640][22][info][grpc] [external/envoy/source/common/grpc/async_client_manager_impl.cc:193] AsyncClientManagerImpl::evictEntriesAndResetEvictionTimer [2023-09-08 10:27:32.640][22][info][grpc] [external/envoy/source/common/grpc/async_client_manager_impl.cc:198] AsyncClientManagerImpl::evictEntriesAndResetEvictionTimer next_expire: 966826645315069, now 966825877389741 [2023-09-08 10:27:32.640][22][info][grpc] [external/envoy/source/common/grpc/async_client_manager_impl.cc:208] AsyncClientManagerImpl::evictEntriesAndResetEvictionTimer enable timer: 0 ``` When this condition hits, logs get filled because of the loop I mentioned in the beginning. Fix is simple that in the `if` condition for expiry consider this round off thing as well. Additional Description: Risk Level: Testing: Docs Changes: Release Notes: Platform Specific Features: Signed-off-by: Vikas Choudhary <choudharyvikas16@gmail.com> fix tests to match older codebase Signed-off-by: Vikas Choudhary <choudharyvikas16@gmail.com> Signed-off-by: Vikas Choudhary (vikasc) <choudharyvikas16@gmail.com>
In grpc async client, if timer expiry handler function gets fired where time to next_expiry is less than 1 sec, a loop gets created where timer gets enabled with 0 secs expiry again and again until 'now' becomes as next_expiry. This causes random cpu spikes. If there is HPA configured on cpu, this will result in scale up and scale down on proxies. I added some logs while debugging it. Sharing here might help understanding the issue more clearly: ``` AsyncClientManagerImpl::evictEntriesAndResetEvictionTimer [2023-09-08 10:27:32.640][22][info][grpc] [external/envoy/source/common/grpc/async_client_manager_impl.cc:198] AsyncClientManagerImpl::evictEntriesAndResetEvictionTimer next_expire: 966826645315069, now 966825877367850 [2023-09-08 10:27:32.640][22][info][grpc] [external/envoy/source/common/grpc/async_client_manager_impl.cc:208] AsyncClientManagerImpl::evictEntriesAndResetEvictionTimer enable timer: 0 [2023-09-08 10:27:32.640][22][info][grpc] [external/envoy/source/common/grpc/async_client_manager_impl.cc:193] AsyncClientManagerImpl::evictEntriesAndResetEvictionTimer [2023-09-08 10:27:32.640][22][info][grpc] [external/envoy/source/common/grpc/async_client_manager_impl.cc:198] AsyncClientManagerImpl::evictEntriesAndResetEvictionTimer next_expire: 966826645315069, now 966825877389741 [2023-09-08 10:27:32.640][22][info][grpc] [external/envoy/source/common/grpc/async_client_manager_impl.cc:208] AsyncClientManagerImpl::evictEntriesAndResetEvictionTimer enable timer: 0 ``` When this condition hits, logs get filled because of the loop I mentioned in the beginning. Fix is simple that in the `if` condition for expiry consider this round off thing as well. Additional Description: Risk Level: Testing: Docs Changes: Release Notes: Platform Specific Features: Signed-off-by: Vikas Choudhary <choudharyvikas16@gmail.com> Signed-off-by: Sean Killeen <SeanKilleen@gmail.com>
Signed-off-by: Vikas Choudhary (vikasc) <choudharyvikas16@gmail.com> Signed-off-by: Vikas Choudhary <choudharyvikas16@gmail.com> Signed-off-by: Sean Killeen <SeanKilleen@gmail.com>
Commit Message:
In grpc async client, if timer expiry handler function gets fired where time to next_expiry is less than 1 sec, a loop gets created where timer gets enabled with 0 secs expiry again and again until 'now' becomes as next_expiry. This causes random cpu spikes. If there is HPA configured on cpu, this will result in scale up and scale down on proxies.
I added some logs while debugging it. Sharing here might help understanding the issue more clearly:
When this condition hits, logs get filled because of the loop I mentioned in the beginning.
Fix is simple that in the
ifcondition for expiry consider this round off thing as well.Additional Description:
Risk Level:
Testing:
Docs Changes:
Release Notes:
Platform Specific Features:
[Optional Runtime guard:]
[Optional Fixes #Issue]
[Optional Fixes commit #PR or SHA]
[Optional Deprecated:]
[Optional API Considerations:]