"took too long to execute" always in etcd logs #10860

Closed
leon0306 opened this issue Jun 27, 2019 · 15 comments

@leon0306

Please read https://github.com/etcd-io/etcd/blob/master/Documentation/reporting_bugs.md.

My etcd version is 3.3.10

The backend_commit_duration_seconds metric looks like this:

histogram_quantile(0.99, rate(etcd_disk_backend_commit_duration_seconds_bucket[5m]))


{endpoint="metrics",instance="172.22.66.116:2379",job="etcd",namespace="kube-system",service="etcd"} | 0.11571199999999937
{endpoint="metrics",instance="172.22.66.117:2379",job="etcd",namespace="kube-system",service="etcd"} | 0.08104000000000033
{endpoint="metrics",instance="172.22.66.118:2379",job="etcd",namespace="kube-system",service="etcd"} | 0.10784000000000088

The wal_fsync_duration_seconds metric looks like this:

histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m]))

{endpoint="metrics",instance="172.22.66.116:2379",job="etcd",namespace="kube-system",service="etcd"} | 0.0766933333333334
{endpoint="metrics",instance="172.22.66.117:2379",job="etcd",namespace="kube-system",service="etcd"} | 0.04398222222222232
{endpoint="metrics",instance="172.22.66.118:2379",job="etcd",namespace="kube-system",service="etcd"} | 0.04272000000000035

Is the disk being too slow the cause of this warning?
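One way to check whether the disk itself is the bottleneck is the fio fdatasync test suggested in the etcd performance guidance; a minimal sketch (assuming fio >= 3.5 is installed, and test-data is a placeholder directory that should sit on the same device as the etcd data dir):

# measure fdatasync latency on the etcd disk with fio
fio --rw=write --ioengine=sync --fdatasync=1 --directory=test-data --size=22m --bs=2300 --name=etcd-disk-check
# compare the reported fdatasync 99th percentile against the wal_fsync numbers above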

@leon0306
Author

The warning logs look like this:

2019-06-25 08:47:33.191089 W | etcdserver: read-only range request "key:\"/registry/leases/kube-node-lease/m725-c115-26-k8s-pf1-elk01\" " with result "range_response_count:1 size:368" took too long (321.997639ms) to execute
2019-06-25 08:47:36.123405 W | etcdserver: read-only range request "key:\"/registry/priorityclasses\" range_end:\"/registry/priorityclasset\" count_only:true " with result "range_response_count:0 size:9" took too long (130.640623ms) to execute
2019-06-25 08:47:36.123845 W | etcdserver: read-only range request "key:\"/registry/services/endpoints/kube-system/kube-controller-manager\" " with result "range_response_count:1 size:457" took too long (114.454452ms) to execute
2019-06-25 08:47:41.924197 W | etcdserver: read-only range request "key:\"/registry/services/endpoints/kube-system/kube-controller-manager\" " with result "range_response_count:1 size:457" took too long (366.110214ms) to execute
2019-06-25 08:47:44.752364 W | etcdserver: read-only range request "key:\"/registry/configmaps/ingress-nginx/ingress-controller-leader-nginx\" " with result "range_response_count:1 size:463" took too long (130.214951ms) to execute

@leon0306
Author

@jingyih

@jingyih
Contributor

jingyih commented Jun 29, 2019

@boxuan666 Thanks for reporting!

histogram_quantile(0.99, rate(etcd_disk_backend_commit_duration_seconds_bucket[5m]))
{endpoint="metrics",instance="172.22.66.116:2379",job="etcd",namespace="kube-system",service="etcd"} | 0.11571199999999937

Does this mean the P99 duration is 115.7ms? If so, it is too slow.

I am not super familiar with the histogram_quantile expression; maybe we can double-check the P99 duration by searching for the keyword "backend_commit_duration_seconds" in the metrics endpoint output. That way you will find the raw data, something like this:

etcd_disk_backend_commit_duration_seconds_bucket{le="0.001"} 123
etcd_disk_backend_commit_duration_seconds_bucket{le="0.002"} 124
...
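For example, something along these lines (a sketch; 127.0.0.1:2379 is only an assumption for the metrics listen address, use whatever your cluster exposes):

# dump the raw commit-duration histogram straight from the etcd metrics endpoint
curl -s http://127.0.0.1:2379/metrics | grep etcd_disk_backend_commit_duration_seconds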

@leon0306
Author

leon0306 commented Jul 1, 2019

@jingyih

Hi,
The information is as follows:

etcd_disk_backend_commit_duration_seconds_bucket{le="0.001"} 4
etcd_disk_backend_commit_duration_seconds_bucket{le="0.002"} 484209
etcd_disk_backend_commit_duration_seconds_bucket{le="0.004"} 1.0856757e+07
etcd_disk_backend_commit_duration_seconds_bucket{le="0.008"} 1.1494004e+07
etcd_disk_backend_commit_duration_seconds_bucket{le="0.016"} 1.1693697e+07
etcd_disk_backend_commit_duration_seconds_bucket{le="0.032"} 1.1750741e+07
etcd_disk_backend_commit_duration_seconds_bucket{le="0.064"} 1.179604e+07
etcd_disk_backend_commit_duration_seconds_bucket{le="0.128"} 1.1844565e+07
etcd_disk_backend_commit_duration_seconds_bucket{le="0.256"} 1.1864375e+07
etcd_disk_backend_commit_duration_seconds_bucket{le="0.512"} 1.1867404e+07
etcd_disk_backend_commit_duration_seconds_bucket{le="1.024"} 1.1867526e+07
etcd_disk_backend_commit_duration_seconds_bucket{le="2.048"} 1.186753e+07
etcd_disk_backend_commit_duration_seconds_bucket{le="4.096"} 1.186753e+07
etcd_disk_backend_commit_duration_seconds_bucket{le="8.192"} 1.186753e+07
etcd_disk_backend_commit_duration_seconds_bucket{le="+Inf"} 1.186753e+07

@jingyih
Contributor

jingyih commented Jul 3, 2019

The "took too long to execute" warning is triggered whenever applying a raft entry to backend store took more than 100 ms. Backend commit duration is usually the most dominant part in the duration of apply.

From the metrics you just posted, it looks like the P99 is on the order of 32 ms. But anything in the 128 ms bucket and above will definitely trigger a warning, and in your case there are roughly 20k such commits, so it is expected that you see a lot of these warnings in the server log. I think you should use a faster disk.
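For reference, the arithmetic behind that estimate, using only the bucket counts posted above (a sketch; bc is used just for the one multiplication):

# sanity-check the P99 estimate from the cumulative bucket counts
total=11867530                      # le="+Inf": total number of backend commits
rank=$(echo "0.99 * $total" | bc)   # ~= 11748855, the P99 rank
# le="0.016" -> 11693697 is below the rank, le="0.032" -> 11750741 is above it,
# so the 99th percentile falls in the 16-32 ms bucket
above_128ms=$((total - 11844565))   # commits slower than 128 ms, ~= 23k
echo "P99 rank: $rank; commits slower than 128 ms: $above_128ms"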

@bbotte

bbotte commented Aug 2, 2019

Install etcd 3.3.13 or later to solve the problem; I have verified it.
It has no relationship with the disk or the network.

@aoliang

aoliang commented Aug 19, 2019

Install etcd 3.3.13 or later to solve the problem; I have verified it.
It has no relationship with the disk or the network.

I want to know why.

@anguslees
Contributor

The "took too long to execute" warning is triggered whenever applying a raft entry to backend store took more than 100 ms

How can I adjust this 100ms "took too long" warning threshold? I have an etcd cluster running on underpowered arm32 machines. I am quite happy with the speed, but now my logs are full of these "warnings", hiding real errors. I have already increased --heartbeat-interval and --election-timeout .

(Note: I'm running 3.3.10 - perhaps this is already improved in later versions?)

@jingyih
Contributor

jingyih commented Sep 3, 2019

The "took too long to execute" warning is triggered whenever applying a raft entry to backend store took more than 100 ms

How can I adjust this 100ms "took too long" warning threshold? I have an etcd cluster running on underpowered arm32 machines. I am quite happy with the speed, but now my logs are full of these "warnings", hiding real errors. I have already increased --heartbeat-interval and --election-timeout .

(Note: I'm running 3.3.10 - perhaps this is already improved in later versions?)

Starting from 3.4, the log level is configurable, which means you can choose to log only levels higher than warning.
https://github.com/etcd-io/etcd/blob/master/Documentation/op-guide/configuration.md#--log-level
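For example (a sketch; the name and data dir are placeholders for your existing configuration):

# start etcd with only error-level (and above) logging, available since 3.4
etcd --name infra1 --data-dir /var/lib/etcd --log-level error
# or set the equivalent key in the etcd YAML config file:
#   log-level: "error"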

@anguslees
Contributor

Starting from 3.4, log level is configurable

Thanks for the workaround. To be clear: I think you're saying I need to disable all warnings because the 100ms warning threshold is hard-coded. The various examples above indicate that 100ms is not an outlier for some deployments, and generally it seems impossible to mandate a fixed response SLA on behalf of all users.

I suggest we turn this issue into a feature request: "too-long warning threshold should be configurable" (or perhaps even just remove this warning and rely on metrics?)

@agargi
Contributor

agargi commented Nov 4, 2020

I suggest we turn this issue into a feature request: "too-long warning threshold should be configurable" (or perhaps even just remove this warning and rely on metrics?)

Opened a PR to introduce a configurable threshold:

#12448
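If I remember the flag name correctly, later releases (3.5+) expose this as an experimental flag, with the old 100 ms behavior as its default; a sketch:

# assumed flag name: raise the "took too long" apply-duration warning threshold to 500 ms
etcd --experimental-warning-apply-duration 500ms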

@angeloudy

I am having exactly the same issue with Rancher 2.4.3.
I am running Prometheus with lots of alert rules, which creates a large ConfigMap.
Every time I try to delete some Prometheus alert rules, etcd gets stuck.

I tried with an EBS-optimized instance and provisioned IOPS, but it did not help.

@yitian108

Install etcd 3.3.13 or later to solve the problem, I have verified it. No relationship with disk or network

The problem still exists in 3.4.13; please refer to #17305.

@ostapkostyk

The "took too long to execute" warning is triggered whenever applying a raft entry to backend store took more than 100 ms

How can I adjust this 100ms "took too long" warning threshold? I have an etcd cluster running on underpowered arm32 machines. I am quite happy with the speed, but now my logs are full of these "warnings", hiding real errors. I have already increased --heartbeat-interval and --election-timeout .
(Note: I'm running 3.3.10 - perhaps this is already improved in later versions?)

Starting from 3.4, log level is configurable, which means you can choose to log only those levels higher than warning. https://github.com/etcd-io/etcd/blob/master/Documentation/op-guide/configuration.md#--log-level

Could you suggest how to apply that flag to an existing etcd cluster? I have configured etcdctl, but it doesn't have that flag. My etcd API version is 3.5, and etcd is embedded into k3s.

@ostapkostyk

OK, I found the config file /var/lib/rancher/k3s/server/db/etcd/config and added the flag. How do I apply the new configuration to a running cluster?
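As far as I know, the embedded etcd only reads that file at startup, so the change has to be applied by restarting k3s on each server node, one node at a time (a sketch, assuming k3s runs under systemd):

# restart k3s, and with it the embedded etcd, on each server node in turn
sudo systemctl restart k3s
# then check the etcd log output to confirm the new log level took effect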
