
Fix intermittent cpu spike in grpc async client#30123

Merged
lizan merged 6 commits intoenvoyproxy:mainfrom
vikaschoudhary16:cpu-spike
Oct 20, 2023

Conversation

@vikaschoudhary16 (Contributor) commented Oct 12, 2023

Commit Message:
In the grpc async client, if the timer expiry handler fires when the time to next_expiry is less than 1 second, a loop forms where the timer is re-enabled with a 0-second expiry again and again until 'now' reaches next_expiry. This causes random CPU spikes. If an HPA is configured on CPU, this results in repeated scale up and scale down on the proxies.

I added some logs while debugging. Sharing them here may help clarify the issue:

```
AsyncClientManagerImpl::evictEntriesAndResetEvictionTimer
[2023-09-08 10:27:32.640][22][info][grpc] [external/envoy/source/common/grpc/async_client_manager_impl.cc:198] AsyncClientManagerImpl::evictEntriesAndResetEvictionTimer next_expire: 966826645315069, now 966825877367850
[2023-09-08 10:27:32.640][22][info][grpc] [external/envoy/source/common/grpc/async_client_manager_impl.cc:208] AsyncClientManagerImpl::evictEntriesAndResetEvictionTimer enable timer: 0
[2023-09-08 10:27:32.640][22][info][grpc] [external/envoy/source/common/grpc/async_client_manager_impl.cc:193] AsyncClientManagerImpl::evictEntriesAndResetEvictionTimer
[2023-09-08 10:27:32.640][22][info][grpc] [external/envoy/source/common/grpc/async_client_manager_impl.cc:198] AsyncClientManagerImpl::evictEntriesAndResetEvictionTimer next_expire: 966826645315069, now 966825877389741
[2023-09-08 10:27:32.640][22][info][grpc] [external/envoy/source/common/grpc/async_client_manager_impl.cc:208] AsyncClientManagerImpl::evictEntriesAndResetEvictionTimer enable timer: 0
```

When this condition hits, the logs fill up because of the loop described above.

The fix is simple: the `if` condition for expiry also accounts for this rounding.

Additional Description:
Risk Level:
Testing:
Docs Changes:
Release Notes:
Platform Specific Features:
[Optional Runtime guard:]
[Optional Fixes #Issue]
[Optional Fixes commit #PR or SHA]
[Optional Deprecated:]
[Optional API Considerations:]

Signed-off-by: Vikas Choudhary <choudharyvikas16@gmail.com>
@vikaschoudhary16 (Contributor Author)

// the difference between them is less than 1 second. If we don't do this, the timer will
// be enabled with 0 seconds, which will cause the timer to fire immediately. This will
// cause a cpu spike.
(std::chrono::duration_cast<std::chrono::seconds>(next_expire - now).count() == 0)) {
@lizan (Member)

Do we have a good way to test this? How did you choose 1 second as a good threshold?

@vikaschoudhary16 (Contributor Author) Oct 12, 2023

Thanks Lizan for taking a look.

How did you choose 1 second as a good threshold?

While enabling the timer at L218, the duration is rounded off to seconds. The spike happens when this round-off result is 0. Hence I used the same rounding in the `if` condition to avoid initializing the timer with 0s. From the logs I shared in the PR description, I figured out that if the difference between next_expire and now is less than a second, it gets rounded off to 0, hence I mentioned 1 second in the code comment. Maybe it gets rounded to 0 only when the diff is less than 0.5s; I am not exactly sure about this round-off.

Do we have a good way to test this?

I will go through the existing tests and see if we have a way to test this.

@chaoqin-li1123 (Contributor) commented Oct 12, 2023

As an alternative to the current implementation, can we make the timer sleep for 1 second when the time interval is less than 1 second to avoid the busy loop?

@vikaschoudhary16 (Contributor Author)

As an alternative to the current implementation, can we make the timer sleep for 1 second when the time interval is less than 1 second to avoid the busy loop?

Yeah, we can do that. Initially I thought of doing the same, but since it requires adding another else branch, I felt the current implementation was simpler in terms of code readability. Behavior-wise, it delays the timer by a fraction of a second, which is also fine. I am fine either way.

@lizan wdyt?

@lizan (Member) commented Oct 15, 2023

I'm fine either way too. If you can add a test that would be nice.

@lizan lizan added the waiting label Oct 17, 2023
Signed-off-by: Vikas Choudhary <choudharyvikas16@gmail.com>
while (!lru_list_.empty()) {
MonotonicTime next_expire = lru_list_.back().accessed_time_ + EntryTimeoutInterval;
if (now >= next_expire) {
std::chrono::milliseconds time_to_next_expire_sec =
Member

Suggested change:
- std::chrono::milliseconds time_to_next_expire_sec =
+ std::chrono::seconds time_to_next_expire_sec =

Comment on lines +211 to +216
if ((now >= next_expire) ||
// since 'now' and 'next_expire' are in nanoseconds, the following condition is to
// check if the difference between them is less than 1 second. If we don't do this, the
// timer will be enabled with 0 seconds, which will cause the timer to fire immediately.
// This will cause cpu spike.
(time_to_next_expire_sec.count() == 0)) {
Member

This can be simplified to `time_to_next_expire_sec.count() <= 0`?

} else {
cache_eviction_timer_->enableTimer(
std::chrono::duration_cast<std::chrono::seconds>(next_expire - now));
if (time_to_next_expire_sec.count() == 0) {
Member

This never happens because of the `if` above.

const RawAsyncClientSharedPtr& client);

RawAsyncClientSharedPtr getCache(const GrpcServiceConfigWithHashKey& config_with_hash_key);
int32_t timer_enabled_with_0_duration_count_ = 0;
Member

You shouldn't need this just for a test. If you want to test this, you should use a mock timer.

Signed-off-by: Vikas Choudhary <choudharyvikas16@gmail.com>
@lizan lizan merged commit 01ae32b into envoyproxy:main Oct 20, 2023
vikaschoudhary16 added a commit to vikaschoudhary16/envoy that referenced this pull request Oct 23, 2023
phlax pushed a commit that referenced this pull request Oct 23, 2023
vikaschoudhary16 added a commit to vikaschoudhary16/envoy that referenced this pull request Oct 23, 2023
Signed-off-by: Vikas Choudhary <choudharyvikas16@gmail.com>
vikaschoudhary16 added a commit to vikaschoudhary16/envoy that referenced this pull request Oct 23, 2023

fix tests to match older codebase

Signed-off-by: Vikas Choudhary <choudharyvikas16@gmail.com>

Signed-off-by: Vikas Choudhary (vikasc) <choudharyvikas16@gmail.com>
phlax pushed a commit that referenced this pull request Oct 23, 2023
Signed-off-by: Vikas Choudhary (vikasc) <choudharyvikas16@gmail.com>
vikaschoudhary16 added a commit to vikaschoudhary16/envoy that referenced this pull request Oct 24, 2023
Signed-off-by: Vikas Choudhary (vikasc) <choudharyvikas16@gmail.com>
Signed-off-by: Vikas Choudhary <choudharyvikas16@gmail.com>
phlax pushed a commit that referenced this pull request Oct 24, 2023
Signed-off-by: Vikas Choudhary (vikasc) <choudharyvikas16@gmail.com>
Signed-off-by: Vikas Choudhary <choudharyvikas16@gmail.com>
phlax pushed a commit that referenced this pull request Oct 24, 2023
SeanKilleen pushed a commit to SeanKilleen/envoy that referenced this pull request Apr 3, 2024
Signed-off-by: Sean Killeen <SeanKilleen@gmail.com>
SeanKilleen pushed a commit to SeanKilleen/envoy that referenced this pull request Apr 3, 2024
Signed-off-by: Vikas Choudhary (vikasc) <choudharyvikas16@gmail.com>
Signed-off-by: Vikas Choudhary <choudharyvikas16@gmail.com>
Signed-off-by: Sean Killeen <SeanKilleen@gmail.com>
