
Fix intermittent cpu spike in grpc async client#30123

Merged
lizan merged 6 commits intoenvoyproxy:mainfrom
vikaschoudhary16:cpu-spike
Oct 20, 2023

Conversation

@vikaschoudhary16 (Contributor) commented Oct 12, 2023

Commit Message:
In the grpc async client, if the timer expiry handler fires when the time to next_expiry is less than 1 second, a loop forms where the timer is re-enabled with a 0-second expiry again and again until 'now' reaches next_expiry. This causes random CPU spikes. If an HPA is configured on CPU, this results in repeated scale up and scale down on the proxies.

I added some logs while debugging. Sharing them here may help clarify the issue:

```
AsyncClientManagerImpl::evictEntriesAndResetEvictionTimer
[2023-09-08 10:27:32.640][22][info][grpc] [external/envoy/source/common/grpc/async_client_manager_impl.cc:198] AsyncClientManagerImpl::evictEntriesAndResetEvictionTimer next_expire: 966826645315069, now 966825877367850
[2023-09-08 10:27:32.640][22][info][grpc] [external/envoy/source/common/grpc/async_client_manager_impl.cc:208] AsyncClientManagerImpl::evictEntriesAndResetEvictionTimer enable timer: 0
[2023-09-08 10:27:32.640][22][info][grpc] [external/envoy/source/common/grpc/async_client_manager_impl.cc:193] AsyncClientManagerImpl::evictEntriesAndResetEvictionTimer
[2023-09-08 10:27:32.640][22][info][grpc] [external/envoy/source/common/grpc/async_client_manager_impl.cc:198] AsyncClientManagerImpl::evictEntriesAndResetEvictionTimer next_expire: 966826645315069, now 966825877389741
[2023-09-08 10:27:32.640][22][info][grpc] [external/envoy/source/common/grpc/async_client_manager_impl.cc:208] AsyncClientManagerImpl::evictEntriesAndResetEvictionTimer enable timer: 0
```

When this condition hits, the logs fill up because of the loop described above.

The fix is simple: the `if` condition for expiry also accounts for this rounding.

Additional Description:
Risk Level:
Testing:
Docs Changes:
Release Notes:
Platform Specific Features:
[Optional Runtime guard:]
[Optional Fixes #Issue]
[Optional Fixes commit #PR or SHA]
[Optional Deprecated:]
[Optional API Considerations:]

Signed-off-by: Vikas Choudhary <choudharyvikas16@gmail.com>
@vikaschoudhary16 (Contributor Author)

// the difference between them is less than 1 second. If we don't do this, the timer will
// be enabled with 0 seconds, which will cause the timer to fire immediately. This will
// cause a cpu spike.
(std::chrono::duration_cast<std::chrono::seconds>(next_expire - now).count() == 0)) {
@lizan (Member)

Do we have a good way to test this? How did you choose 1 second as a good threshold?

@vikaschoudhary16 (Contributor Author) Oct 12, 2023

Thanks Lizan for taking a look.

How did you choose 1 second as a good threshold?

While enabling the timer at L218, the duration is rounded off to seconds. The spike happens when this round-off result is 0. Hence I used the same rounding in the `if` condition to avoid initializing the timer with 0s. From the logs I shared in the PR description, I figured out that if the difference between next_expire and now is less than a second, it gets rounded off to 0, hence I mentioned 1 second in the code comment. Maybe it gets rounded to 0 only when the diff is less than 0.5s; I am not exactly sure about this round-off.

Do we have a good way to test this?

I will go through the existing tests and see if we have a way to test this.

@chaoqin-li1123 (Contributor) commented Oct 12, 2023

As an alternative to the current implementation, can we make the timer sleep for 1 second when the time interval is less than 1 second to avoid the busy loop?

@vikaschoudhary16 (Contributor Author)

As an alternative to the current implementation, can we make the timer sleep for 1 second when the time interval is less than 1 second to avoid the busy loop?

Yeah, we can do that. Initially I thought of doing the same, but since it requires adding another else branch, I felt the current implementation was simpler in terms of code readability. Behavior-wise, it delays the timer by a fraction of a second, which is also fine. I am fine either way.

@lizan wdyt?

@lizan (Member) commented Oct 15, 2023

I'm fine either way too. If you can add a test that would be nice.

@lizan lizan added the waiting label Oct 17, 2023
Signed-off-by: Vikas Choudhary <choudharyvikas16@gmail.com>
while (!lru_list_.empty()) {
MonotonicTime next_expire = lru_list_.back().accessed_time_ + EntryTimeoutInterval;
if (now >= next_expire) {
std::chrono::milliseconds time_to_next_expire_sec =
Member

Suggested change:
- std::chrono::milliseconds time_to_next_expire_sec =
+ std::chrono::seconds time_to_next_expire_sec =

Comment on lines +211 to +216
if ((now >= next_expire) ||
// since 'now' and 'next_expire' are in nanoseconds, the following condition is to
// check if the difference between them is less than 1 second. If we don't do this, the
// timer will be enabled with 0 seconds, which will cause the timer to fire immediately.
// This will cause cpu spike.
(time_to_next_expire_sec.count() == 0)) {
Member

This can be simplified to `time_to_next_expire_sec.count() <= 0`?

} else {
cache_eviction_timer_->enableTimer(
std::chrono::duration_cast<std::chrono::seconds>(next_expire - now));
if (time_to_next_expire_sec.count() == 0) {
Member

This never happens because of the `if` above.

const RawAsyncClientSharedPtr& client);

RawAsyncClientSharedPtr getCache(const GrpcServiceConfigWithHashKey& config_with_hash_key);
int32_t timer_enabled_with_0_duration_count_ = 0;
Member

You shouldn't need this just for a test. If you want to test this, you should use a mock timer.

Signed-off-by: Vikas Choudhary <choudharyvikas16@gmail.com>
@lizan lizan merged commit 01ae32b into envoyproxy:main Oct 20, 2023
vikaschoudhary16 added a commit to vikaschoudhary16/envoy that referenced this pull request Oct 23, 2023
phlax pushed a commit that referenced this pull request Oct 23, 2023
vikaschoudhary16 added a commit to vikaschoudhary16/envoy that referenced this pull request Oct 23, 2023
Signed-off-by: Vikas Choudhary <choudharyvikas16@gmail.com>
vikaschoudhary16 added a commit to vikaschoudhary16/envoy that referenced this pull request Oct 23, 2023

fix tests to match older codebase

Signed-off-by: Vikas Choudhary <choudharyvikas16@gmail.com>

Signed-off-by: Vikas Choudhary (vikasc) <choudharyvikas16@gmail.com>
phlax pushed a commit that referenced this pull request Oct 23, 2023
Signed-off-by: Vikas Choudhary (vikasc) <choudharyvikas16@gmail.com>
vikaschoudhary16 added a commit to vikaschoudhary16/envoy that referenced this pull request Oct 24, 2023
Signed-off-by: Vikas Choudhary (vikasc) <choudharyvikas16@gmail.com>
Signed-off-by: Vikas Choudhary <choudharyvikas16@gmail.com>
phlax pushed a commit that referenced this pull request Oct 24, 2023
Signed-off-by: Vikas Choudhary (vikasc) <choudharyvikas16@gmail.com>
Signed-off-by: Vikas Choudhary <choudharyvikas16@gmail.com>
phlax pushed a commit that referenced this pull request Oct 24, 2023
SeanKilleen pushed a commit to SeanKilleen/envoy that referenced this pull request Apr 3, 2024
Signed-off-by: Sean Killeen <SeanKilleen@gmail.com>
SeanKilleen pushed a commit to SeanKilleen/envoy that referenced this pull request Apr 3, 2024
Signed-off-by: Vikas Choudhary (vikasc) <choudharyvikas16@gmail.com>
Signed-off-by: Vikas Choudhary <choudharyvikas16@gmail.com>
Signed-off-by: Sean Killeen <SeanKilleen@gmail.com>
