"took too long to execute" always in etcd logs #10860

Closed
leon0306 opened this issue Jun 27, 2019 · 15 comments

@leon0306

Please read https://github.com/etcd-io/etcd/blob/master/Documentation/reporting_bugs.md.

My etcd version is 3.3.10

The backend_commit_duration_seconds metric looks like this:

histogram_quantile(0.99, rate(etcd_disk_backend_commit_duration_seconds_bucket[5m]))


{endpoint="metrics",instance="172.22.66.116:2379",job="etcd",namespace="kube-system",service="etcd"} | 0.11571199999999937
{endpoint="metrics",instance="172.22.66.117:2379",job="etcd",namespace="kube-system",service="etcd"} | 0.08104000000000033
{endpoint="metrics",instance="172.22.66.118:2379",job="etcd",namespace="kube-system",service="etcd"} | 0.10784000000000088

The wal_fsync_duration_seconds metric looks like this:

histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m]))

{endpoint="metrics",instance="172.22.66.116:2379",job="etcd",namespace="kube-system",service="etcd"} | 0.0766933333333334
{endpoint="metrics",instance="172.22.66.117:2379",job="etcd",namespace="kube-system",service="etcd"} | 0.04398222222222232
{endpoint="metrics",instance="172.22.66.118:2379",job="etcd",namespace="kube-system",service="etcd"} | 0.04272000000000035

Is the disk being too slow the cause of this warning?
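One way to check whether the disk itself is the bottleneck is the fio fdatasync test suggested in the etcd performance guidance; a minimal sketch (assuming fio >= 3.5 is installed, and test-data is a placeholder directory that should sit on the same device as the etcd data dir):

# measure fdatasync latency on the etcd disk with fio
fio --rw=write --ioengine=sync --fdatasync=1 --directory=test-data --size=22m --bs=2300 --name=etcd-disk-check
# compare the reported fdatasync 99th percentile against the wal_fsync numbers above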

@leon0306
Author

The warning logs look like this:

2019-06-25 08:47:33.191089 W | etcdserver: read-only range request "key:\"/registry/leases/kube-node-lease/m725-c115-26-k8s-pf1-elk01\" " with result "range_response_count:1 size:368" took too long (321.997639ms) to execute
2019-06-25 08:47:36.123405 W | etcdserver: read-only range request "key:\"/registry/priorityclasses\" range_end:\"/registry/priorityclasset\" count_only:true " with result "range_response_count:0 size:9" took too long (130.640623ms) to execute
2019-06-25 08:47:36.123845 W | etcdserver: read-only range request "key:\"/registry/services/endpoints/kube-system/kube-controller-manager\" " with result "range_response_count:1 size:457" took too long (114.454452ms) to execute
2019-06-25 08:47:41.924197 W | etcdserver: read-only range request "key:\"/registry/services/endpoints/kube-system/kube-controller-manager\" " with result "range_response_count:1 size:457" took too long (366.110214ms) to execute
2019-06-25 08:47:44.752364 W | etcdserver: read-only range request "key:\"/registry/configmaps/ingress-nginx/ingress-controller-leader-nginx\" " with result "range_response_count:1 size:463" took too long (130.214951ms) to execute

@leon0306
Author

@jingyih

@jingyih
Contributor

jingyih commented Jun 29, 2019

@boxuan666 Thanks for reporting!

histogram_quantile(0.99, rate(etcd_disk_backend_commit_duration_seconds_bucket[5m]))
{endpoint="metrics",instance="172.22.66.116:2379",job="etcd",namespace="kube-system",service="etcd"} | 0.11571199999999937

Does this mean the P99 duration is 115.7ms? If so, it is too slow.

I am not super familiar with the histogram_quantile expression; maybe we can double-check the P99 duration by searching for the keyword "backend_commit_duration_seconds" in the metrics endpoint output. That way you will find the raw data, something like this:

etcd_disk_backend_commit_duration_seconds_bucket{le="0.001"} 123
etcd_disk_backend_commit_duration_seconds_bucket{le="0.002"} 124
...
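For example, something along these lines (a sketch; 127.0.0.1:2379 is only an assumption for the metrics listen address, use whatever your cluster exposes):

# dump the raw commit-duration histogram straight from the etcd metrics endpoint
curl -s http://127.0.0.1:2379/metrics | grep etcd_disk_backend_commit_duration_seconds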

@leon0306
Author

leon0306 commented Jul 1, 2019

@jingyih

Hi,
The information is as follows:

etcd_disk_backend_commit_duration_seconds_bucket{le="0.001"} 4
etcd_disk_backend_commit_duration_seconds_bucket{le="0.002"} 484209
etcd_disk_backend_commit_duration_seconds_bucket{le="0.004"} 1.0856757e+07
etcd_disk_backend_commit_duration_seconds_bucket{le="0.008"} 1.1494004e+07
etcd_disk_backend_commit_duration_seconds_bucket{le="0.016"} 1.1693697e+07
etcd_disk_backend_commit_duration_seconds_bucket{le="0.032"} 1.1750741e+07
etcd_disk_backend_commit_duration_seconds_bucket{le="0.064"} 1.179604e+07
etcd_disk_backend_commit_duration_seconds_bucket{le="0.128"} 1.1844565e+07
etcd_disk_backend_commit_duration_seconds_bucket{le="0.256"} 1.1864375e+07
etcd_disk_backend_commit_duration_seconds_bucket{le="0.512"} 1.1867404e+07
etcd_disk_backend_commit_duration_seconds_bucket{le="1.024"} 1.1867526e+07
etcd_disk_backend_commit_duration_seconds_bucket{le="2.048"} 1.186753e+07
etcd_disk_backend_commit_duration_seconds_bucket{le="4.096"} 1.186753e+07
etcd_disk_backend_commit_duration_seconds_bucket{le="8.192"} 1.186753e+07
etcd_disk_backend_commit_duration_seconds_bucket{le="+Inf"} 1.186753e+07

@jingyih
Contributor

jingyih commented Jul 3, 2019

The "took too long to execute" warning is triggered whenever applying a raft entry to backend store took more than 100 ms. Backend commit duration is usually the most dominant part in the duration of apply.

From the metrics you just posted, it looks like the P99 is on the order of 32 ms. But anything in the 128 ms bucket and above will definitely trigger a warning, and in your case there are roughly 20k such commits, so it is expected that you see a lot of these warnings in the server log. I think you should use a faster disk.
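For reference, the arithmetic behind that estimate, using only the bucket counts posted above (a sketch; bc is used just for the one multiplication):

# sanity-check the P99 estimate from the cumulative bucket counts
total=11867530                      # le="+Inf": total number of backend commits
rank=$(echo "0.99 * $total" | bc)   # ~= 11748855, the P99 rank
# le="0.016" -> 11693697 is below the rank, le="0.032" -> 11750741 is above it,
# so the 99th percentile falls in the 16-32 ms bucket
above_128ms=$((total - 11844565))   # commits slower than 128 ms, ~= 23k
echo "P99 rank: $rank; commits slower than 128 ms: $above_128ms"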

@bbotte

bbotte commented Aug 2, 2019

Install etcd 3.3.13 or later to solve the problem; I have verified it.
It has no relationship with the disk or the network.

@aoliang

aoliang commented Aug 19, 2019

Install etcd 3.3.13 or later to solve the problem; I have verified it.
It has no relationship with the disk or the network.

I want to know why.

@anguslees
Contributor

The "took too long to execute" warning is triggered whenever applying a raft entry to backend store took more than 100 ms

How can I adjust this 100ms "took too long" warning threshold? I have an etcd cluster running on underpowered arm32 machines. I am quite happy with the speed, but now my logs are full of these "warnings", hiding real errors. I have already increased --heartbeat-interval and --election-timeout .

(Note: I'm running 3.3.10 - perhaps this is already improved in later versions?)

@jingyih
Contributor

jingyih commented Sep 3, 2019

The "took too long to execute" warning is triggered whenever applying a raft entry to backend store took more than 100 ms

How can I adjust this 100ms "took too long" warning threshold? I have an etcd cluster running on underpowered arm32 machines. I am quite happy with the speed, but now my logs are full of these "warnings", hiding real errors. I have already increased --heartbeat-interval and --election-timeout .

(Note: I'm running 3.3.10 - perhaps this is already improved in later versions?)

Starting from 3.4, the log level is configurable, which means you can choose to log only levels higher than warning.
https://github.com/etcd-io/etcd/blob/master/Documentation/op-guide/configuration.md#--log-level
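For example (a sketch; the name and data dir are placeholders for your existing configuration):

# start etcd with only error-level (and above) logging, available since 3.4
etcd --name infra1 --data-dir /var/lib/etcd --log-level error
# or set the equivalent key in the etcd YAML config file:
#   log-level: "error"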

@anguslees
Contributor

Starting from 3.4, log level is configurable

Thanks for the workaround. To be clear: I think you're saying I need to disable all warnings because the 100ms warning threshold is hard-coded. The various examples above indicate that 100ms is not an outlier for some deployments, and generally it seems impossible to mandate a fixed response SLA on behalf of all users.

I suggest we turn this issue into a feature request: "too-long warning threshold should be configurable" (or perhaps even just remove this warning and rely on metrics?)

@agargi
Contributor

agargi commented Nov 4, 2020

I suggest we turn this issue into a feature request: "too-long warning threshold should be configurable" (or perhaps even just remove this warning and rely on metrics?)

Opened a PR to introduce a configurable threshold:

#12448
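If I remember the flag name correctly, later releases (3.5+) expose this as an experimental flag, with the old 100 ms behavior as its default; a sketch:

# assumed flag name: raise the "took too long" apply-duration warning threshold to 500 ms
etcd --experimental-warning-apply-duration 500ms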

@angeloudy

I am having exactly the same issue with Rancher 2.4.3.
I am running Prometheus with lots of alert rules, which creates a large ConfigMap.
Every time I try to delete some Prometheus alert rules, etcd gets stuck.

I tried with an EBS-optimized instance and provisioned IOPS, but it did not help.

@yitian108

Install etcd 3.3.13 or later to solve the problem, I have verified it. No relationship with disk or network

The problem still exists in 3.4.13; please refer to #17305.

@ostapkostyk

The "took too long to execute" warning is triggered whenever applying a raft entry to backend store took more than 100 ms

How can I adjust this 100ms "took too long" warning threshold? I have an etcd cluster running on underpowered arm32 machines. I am quite happy with the speed, but now my logs are full of these "warnings", hiding real errors. I have already increased --heartbeat-interval and --election-timeout .
(Note: I'm running 3.3.10 - perhaps this is already improved in later versions?)

Starting from 3.4, log level is configurable, which means you can choose to log only those levels higher than warning. https://github.com/etcd-io/etcd/blob/master/Documentation/op-guide/configuration.md#--log-level

Could you suggest how to apply that flag to an existing etcd cluster? I have configured etcdctl, but it doesn't have that flag. My etcd API version is 3.5, and etcd is embedded into k3s.

@ostapkostyk

OK, I found the config file /var/lib/rancher/k3s/server/db/etcd/config and added the flag. How do I apply the new configuration to a running cluster?
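As far as I know, the embedded etcd only reads that file at startup, so the change has to be applied by restarting k3s on each server node, one node at a time (a sketch, assuming k3s runs under systemd):

# restart k3s, and with it the embedded etcd, on each server node in turn
sudo systemctl restart k3s
# then check the etcd log output to confirm the new log level took effect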
