feat(kubernetes_logs): Add backoff for watchers #17009

Merged
spencergilbert merged 2 commits into vectordotdev:master from deckhouse:backoff-watcher
Mar 31, 2023

@nabokihms
Contributor

Description

On startup, if the Kubernetes API is unavailable or responds with an error, Vector retries the request immediately, without any delay between attempts.

Vector logs:

2023-03-29T17:28:32.024522Z  WARN vector::kubernetes::reflector: Watcher Stream received an error. Retrying. error=InitialListFailed(Service(Closed))
2023-03-29T17:28:32.024537Z  WARN vector::kubernetes::reflector: Watcher Stream received an error. Retrying. error=InitialListFailed(Service(Closed))
2023-03-29T17:28:32.024551Z  WARN vector::kubernetes::reflector: Watcher Stream received an error. Retrying. error=InitialListFailed(Service(Closed))
2023-03-29T17:28:32.024569Z  WARN vector::kubernetes::reflector: Watcher Stream received an error. Retrying. error=InitialListFailed(Service(Closed))

Note the timestamps: consecutive retries are only about 15 microseconds apart. These messages continue until Vector consumes all available CPU and memory and dies.

Solution

Add the default backoff from kube to the watchers:
https://docs.rs/kube/latest/kube/runtime/utils/struct.StreamBackoff.html
https://docs.rs/kube/latest/kube/runtime/watcher/fn.default_backoff.html

More context can be found in kube-rs/kube#717 (comment).
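
To illustrate the idea behind the fix, here is a minimal, standard-library-only sketch of a capped exponential backoff wrapped around a retry loop. It is not the kube implementation and not the code from this PR: the `ExponentialBackoff` type, the `retry_with_backoff` helper, and all parameter values are hypothetical names chosen for the example.

```rust
use std::time::Duration;

/// Capped exponential backoff (illustrative only; not kube's implementation).
#[derive(Debug, Clone)]
pub struct ExponentialBackoff {
    initial: Duration,
    max: Duration,
    factor: u32,
    next: Duration,
}

impl ExponentialBackoff {
    pub fn new(initial: Duration, max: Duration, factor: u32) -> Self {
        Self { initial, max, factor, next: initial }
    }

    /// Returns the delay to wait before the next retry,
    /// then grows the following delay up to `max`.
    pub fn next_delay(&mut self) -> Duration {
        let delay = self.next;
        self.next = self.next.saturating_mul(self.factor).min(self.max);
        delay
    }

    /// Call after a success so later failures start from `initial` again.
    pub fn reset(&mut self) {
        self.next = self.initial;
    }
}

/// Calls `op` until it succeeds or `max_attempts` calls have failed,
/// waiting between failures. `sleep` is injected so the caller decides
/// how to wait (for example `std::thread::sleep`).
pub fn retry_with_backoff<T, E>(
    mut op: impl FnMut() -> Result<T, E>,
    backoff: &mut ExponentialBackoff,
    max_attempts: usize,
    mut sleep: impl FnMut(Duration),
) -> Result<T, E> {
    let mut attempt = 1;
    loop {
        match op() {
            Ok(value) => {
                backoff.reset();
                return Ok(value);
            }
            // Give up and surface the last error once the budget is spent.
            Err(err) if attempt >= max_attempts => return Err(err),
            Err(_) => {
                sleep(backoff.next_delay());
                attempt += 1;
            }
        }
    }
}
```

With a delay between attempts, a failing initial list produces a handful of warnings per minute instead of thousands per second. The PR itself relies on the kube utilities linked above rather than a hand-rolled policy like this one.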

Signed-off-by: m.nabokikh <maksim.nabokikh@flant.com>
@nabokihms nabokihms requested a review from a team March 30, 2023 17:47
@netlify

netlify bot commented Mar 30, 2023

Deploy Preview for vector-project canceled.

| Name | Link |
| --- | --- |
| 🔨 Latest commit | 032b571 |
| 🔍 Latest deploy log | https://app.netlify.com/sites/vector-project/deploys/6425dbc0b5e6620008d20b22 |

@github-actions github-actions bot added the domain: sources Anything related to the Vector's sources label Mar 30, 2023
@github-actions

Regression Detector Results

Run ID: 6de870c1-6e0a-497a-bd12-365543bf210d
Baseline: 401974e
Comparison: e219017
Total vector CPUs: 7

Explanation

A regression test is an integrated performance test for vector in a repeatable rig, with varying configuration for vector. What follows is a statistical summary of a brief vector run for each configuration across the SHAs given above. The goal of these tests is to determine quickly whether a pull request changes vector performance, and to what degree.

The table below, if present, lists those experiments that have experienced a statistically significant change in mean optimization goal performance between baseline and comparison SHAs with 90.00% confidence OR have been detected as newly erratic. Negative values mean that the baseline is faster; positive values mean that the comparison is faster. Results that do not exhibit more than a ±5.00% change in their mean optimization goal are discarded. An experiment is erratic if its coefficient of variation is greater than 0.1. The abbreviated table will be omitted if no interesting change is observed.

No interesting changes in experiment optimization goals with confidence ≥ 90.00% and |Δ mean %| ≥ 5.00%.
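
The thresholds in the explanation can be sketched as follows. This is a hedged illustration, not the detector's actual code: the `Summary` type and the function names are hypothetical, the coefficient of variation is taken as stdev divided by mean (its standard definition), and the relative change is assumed to be measured against the baseline mean. Only the stated thresholds are mirrored here; how the detector treats experiments that are already declared erratic is not described in the explanation and is not modeled.

```rust
/// Summary statistics for one side (baseline or comparison) of an experiment.
/// Hypothetical type for illustration only.
pub struct Summary {
    pub mean: f64,
    pub stdev: f64,
}

/// Thresholds quoted in the explanation.
pub const ERRATIC_COV: f64 = 0.1;
pub const MIN_CONFIDENCE: f64 = 0.90;
pub const MIN_ABS_DELTA_PCT: f64 = 5.0;

impl Summary {
    /// Coefficient of variation: standard deviation relative to the mean.
    pub fn cov(&self) -> f64 {
        self.stdev / self.mean
    }

    /// Erratic if the coefficient of variation exceeds 0.1.
    pub fn is_erratic(&self) -> bool {
        self.cov() > ERRATIC_COV
    }
}

/// Relative change of the comparison mean versus the baseline mean, in percent
/// (assumed definition). Positive means the comparison is faster.
pub fn delta_mean_pct(baseline: &Summary, comparison: &Summary) -> f64 {
    (comparison.mean - baseline.mean) / baseline.mean * 100.0
}

/// True when a change clears both the confidence and the magnitude threshold.
pub fn passes_thresholds(delta_pct: f64, confidence: f64) -> bool {
    confidence >= MIN_CONFIDENCE && delta_pct.abs() >= MIN_ABS_DELTA_PCT
}
```

As a check against the detailed results, the `syslog_regex_logs2metric_ddmetrics` baseline has a stdev of 373.93 KiB/CPU-s and a mean of 3.59 MiB/CPU-s, which gives a CoV of roughly 0.102. That agrees with the reported 0.101734 up to rounding of the displayed values, and it exceeds the 0.1 limit, so the experiment is flagged as erratic.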

Fine details of change detection per experiment.
| experiment | goal | Δ mean | Δ mean % | confidence | baseline mean | baseline stdev | baseline stderr | baseline outlier % | baseline CoV | comparison mean | comparison stdev | comparison stderr | comparison outlier % | comparison CoV | erratic | declared erratic |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| file_to_blackhole | egress throughput | 874.96KiB/CPU-s | 13.75 | 100.00% | 6.21MiB/CPU-s | 4.17MiB/CPU-s | 124.41KiB/CPU-s | 0.0 | 0.67078 | 7.07MiB/CPU-s | 3.86MiB/CPU-s | 127.67KiB/CPU-s | 0.0 | 0.545371 | True | True |
| http_text_to_http_json | ingress throughput | 730.51KiB/CPU-s | 2.90 | 100.00% | 24.64MiB/CPU-s | 861.66KiB/CPU-s | 10.6KiB/CPU-s | 0.0 | 0.034151 | 25.35MiB/CPU-s | 711.87KiB/CPU-s | 8.76KiB/CPU-s | 0.0 | 0.02742 | False | False |
| otlp_grpc_to_blackhole | ingress throughput | 27.24KiB/CPU-s | 2.67 | 100.00% | 1018.76KiB/CPU-s | 53.0KiB/CPU-s | 667.96B/CPU-s | 0.0 | 0.052018 | 1.02MiB/CPU-s | 43.98KiB/CPU-s | 554.21B/CPU-s | 0.0 | 0.042042 | False | False |
| syslog_log2metric_humio_metrics | ingress throughput | 149.19KiB/CPU-s | 2.47 | 100.00% | 5.9MiB/CPU-s | 469.36KiB/CPU-s | 5.78KiB/CPU-s | 0.0 | 0.077624 | 6.05MiB/CPU-s | 353.01KiB/CPU-s | 4.34KiB/CPU-s | 0.0 | 0.056976 | False | False |
| datadog_agent_remap_datadog_logs_acks | ingress throughput | 595.63KiB/CPU-s | 1.86 | 100.00% | 31.22MiB/CPU-s | 1.98MiB/CPU-s | 24.92KiB/CPU-s | 0.0 | 0.063354 | 31.8MiB/CPU-s | 1.46MiB/CPU-s | 18.39KiB/CPU-s | 0.0 | 0.045891 | False | False |
| datadog_agent_remap_datadog_logs | ingress throughput | 426.52KiB/CPU-s | 1.26 | 100.00% | 33.17MiB/CPU-s | 930.48KiB/CPU-s | 11.45KiB/CPU-s | 0.0 | 0.027391 | 33.59MiB/CPU-s | 784.13KiB/CPU-s | 9.65KiB/CPU-s | 0.0 | 0.022797 | False | False |
| datadog_agent_remap_blackhole | ingress throughput | 306.97KiB/CPU-s | 1.00 | 100.00% | 30.06MiB/CPU-s | 1.32MiB/CPU-s | 16.57KiB/CPU-s | 0.0 | 0.043765 | 30.36MiB/CPU-s | 1.08MiB/CPU-s | 13.62KiB/CPU-s | 0.0 | 0.035612 | False | False |
| http_to_http_acks | ingress throughput | 42.83KiB/CPU-s | 0.80 | 61.64% | 5.22MiB/CPU-s | 2.73MiB/CPU-s | 34.34KiB/CPU-s | 0.0 | 0.521934 | 5.26MiB/CPU-s | 2.79MiB/CPU-s | 35.17KiB/CPU-s | 0.0 | 0.530297 | True | False |
| socket_to_socket_blackhole | ingress throughput | 69.74KiB/CPU-s | 0.51 | 100.00% | 13.44MiB/CPU-s | 441.78KiB/CPU-s | 5.44KiB/CPU-s | 0.0 | 0.032106 | 13.5MiB/CPU-s | 365.39KiB/CPU-s | 4.5KiB/CPU-s | 0.0 | 0.02642 | False | False |
| enterprise_http_to_http | ingress throughput | 7.49KiB/CPU-s | 0.05 | 94.61% | 13.62MiB/CPU-s | 277.14KiB/CPU-s | 3.41KiB/CPU-s | 0.0 | 0.019874 | 13.62MiB/CPU-s | 151.43KiB/CPU-s | 1.86KiB/CPU-s | 0.0 | 0.010854 | False | False |
| splunk_hec_to_splunk_hec_logs_noack | ingress throughput | 1.22KiB/CPU-s | 0.01 | 20.62% | 13.61MiB/CPU-s | 270.93KiB/CPU-s | 3.33KiB/CPU-s | 0.0 | 0.019433 | 13.62MiB/CPU-s | 264.61KiB/CPU-s | 3.25KiB/CPU-s | 0.0 | 0.018978 | False | False |
| fluent_elasticsearch | ingress throughput | 90.97B/CPU-s | 0.00 | 13.32% | 45.41MiB/CPU-s | 30.45KiB/CPU-s | 379.24B/CPU-s | 0.0 | 0.000655 | 45.41MiB/CPU-s | 29.65KiB/CPU-s | 387.55B/CPU-s | 0.0 | 0.000638 | False | False |
| splunk_hec_to_splunk_hec_logs_acks | ingress throughput | -450.03B/CPU-s | -0.00 | 5.71% | 13.61MiB/CPU-s | 342.22KiB/CPU-s | 4.21KiB/CPU-s | 0.0 | 0.024545 | 13.61MiB/CPU-s | 362.99KiB/CPU-s | 4.46KiB/CPU-s | 0.0 | 0.026036 | False | False |
| http_to_http_noack | ingress throughput | -74.22B/CPU-s | -0.00 | 0.93% | 13.61MiB/CPU-s | 356.89KiB/CPU-s | 4.39KiB/CPU-s | 0.0 | 0.025607 | 13.61MiB/CPU-s | 360.67KiB/CPU-s | 4.44KiB/CPU-s | 0.0 | 0.025878 | False | False |
| splunk_hec_indexer_ack_blackhole | ingress throughput | -1.4KiB/CPU-s | -0.01 | 24.10% | 13.62MiB/CPU-s | 260.28KiB/CPU-s | 3.2KiB/CPU-s | 0.0 | 0.018667 | 13.61MiB/CPU-s | 264.69KiB/CPU-s | 3.26KiB/CPU-s | 0.0 | 0.018985 | False | False |
| datadog_agent_remap_blackhole_acks | ingress throughput | -45.96KiB/CPU-s | -0.14 | 97.69% | 31.1MiB/CPU-s | 1.03MiB/CPU-s | 12.99KiB/CPU-s | 0.0 | 0.033148 | 31.06MiB/CPU-s | 1.23MiB/CPU-s | 15.51KiB/CPU-s | 0.0 | 0.039625 | False | False |
| syslog_humio_logs | ingress throughput | -16.42KiB/CPU-s | -0.18 | 99.80% | 8.87MiB/CPU-s | 319.03KiB/CPU-s | 3.93KiB/CPU-s | 0.0 | 0.035121 | 8.85MiB/CPU-s | 289.83KiB/CPU-s | 3.56KiB/CPU-s | 0.0 | 0.031964 | False | False |
| http_to_http_json | ingress throughput | -37.47KiB/CPU-s | -0.27 | 100.00% | 13.62MiB/CPU-s | 225.38KiB/CPU-s | 2.77KiB/CPU-s | 0.0 | 0.016157 | 13.58MiB/CPU-s | 283.85KiB/CPU-s | 3.49KiB/CPU-s | 0.0 | 0.020404 | False | False |
| syslog_loki | ingress throughput | -36.02KiB/CPU-s | -0.41 | 100.00% | 8.58MiB/CPU-s | 180.34KiB/CPU-s | 2.22KiB/CPU-s | 0.0 | 0.02052 | 8.55MiB/CPU-s | 150.25KiB/CPU-s | 1.85KiB/CPU-s | 0.0 | 0.017167 | False | False |
| syslog_splunk_hec_logs | ingress throughput | -70.27KiB/CPU-s | -0.78 | 100.00% | 8.81MiB/CPU-s | 263.97KiB/CPU-s | 3.25KiB/CPU-s | 0.0 | 0.029257 | 8.74MiB/CPU-s | 282.15KiB/CPU-s | 3.47KiB/CPU-s | 0.0 | 0.031517 | False | False |
| syslog_log2metric_splunk_hec_metrics | ingress throughput | -95.12KiB/CPU-s | -0.99 | 100.00% | 9.43MiB/CPU-s | 314.84KiB/CPU-s | 3.87KiB/CPU-s | 0.0 | 0.032619 | 9.33MiB/CPU-s | 545.34KiB/CPU-s | 6.71KiB/CPU-s | 0.0 | 0.057063 | False | False |
| splunk_hec_route_s3 | ingress throughput | -122.83KiB/CPU-s | -1.02 | 100.00% | 11.73MiB/CPU-s | 545.96KiB/CPU-s | 6.72KiB/CPU-s | 0.0 | 0.045459 | 11.61MiB/CPU-s | 818.42KiB/CPU-s | 10.07KiB/CPU-s | 0.0 | 0.06885 | False | False |
| otlp_http_to_blackhole | ingress throughput | -26.44KiB/CPU-s | -1.67 | 100.00% | 1.55MiB/CPU-s | 112.0KiB/CPU-s | 1.38KiB/CPU-s | 0.0 | 0.07057 | 1.52MiB/CPU-s | 131.06KiB/CPU-s | 1.61KiB/CPU-s | 0.0 | 0.083979 | False | False |
| syslog_regex_logs2metric_ddmetrics | ingress throughput | -276.78KiB/CPU-s | -7.53 | 100.00% | 3.59MiB/CPU-s | 373.93KiB/CPU-s | 4.6KiB/CPU-s | 0.0 | 0.101734 | 3.32MiB/CPU-s | 422.27KiB/CPU-s | 5.2KiB/CPU-s | 0.0 | 0.12424 | True | True |

@spencergilbert
Contributor

LGTM, @nabokihms do you mind pushing an empty commit to trigger the k8s tests, just to be on the safe side?

Signed-off-by: m.nabokikh <maksim.nabokikh@flant.com>
@github-actions

Regression Detector Results

Run ID: f12bdd09-68ef-417d-8296-b636f71d7193
Baseline: 972592b
Comparison: 032b571
Total vector CPUs: 7

No interesting changes in experiment optimization goals with confidence ≥ 90.00% and |Δ mean %| ≥ 5.00%.

Fine details of change detection per experiment.
| experiment | goal | Δ mean | Δ mean % | confidence | baseline mean | baseline stdev | baseline stderr | baseline outlier % | baseline CoV | comparison mean | comparison stdev | comparison stderr | comparison outlier % | comparison CoV | erratic | declared erratic |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| file_to_blackhole | egress throughput | 270.34KiB/CPU-s | 3.74 | 85.13% | 7.06MiB/CPU-s | 4.01MiB/CPU-s | 131.46KiB/CPU-s | 0.0 | 0.568349 | 7.32MiB/CPU-s | 3.92MiB/CPU-s | 133.16KiB/CPU-s | 0.0 | 0.534978 | True | True |
| http_text_to_http_json | ingress throughput | 610.45KiB/CPU-s | 2.38 | 100.00% | 25.04MiB/CPU-s | 750.25KiB/CPU-s | 9.23KiB/CPU-s | 0.0 | 0.029255 | 25.64MiB/CPU-s | 685.65KiB/CPU-s | 8.44KiB/CPU-s | 0.0 | 0.026115 | False | False |
| datadog_agent_remap_datadog_logs | ingress throughput | 682.44KiB/CPU-s | 2.07 | 100.00% | 32.13MiB/CPU-s | 1.45MiB/CPU-s | 18.31KiB/CPU-s | 0.0 | 0.04523 | 32.8MiB/CPU-s | 1.62MiB/CPU-s | 20.37KiB/CPU-s | 0.0 | 0.049295 | False | False |
| syslog_log2metric_humio_metrics | ingress throughput | 99.46KiB/CPU-s | 1.57 | 100.00% | 6.17MiB/CPU-s | 298.65KiB/CPU-s | 3.68KiB/CPU-s | 0.0 | 0.047244 | 6.27MiB/CPU-s | 194.31KiB/CPU-s | 2.39KiB/CPU-s | 0.0 | 0.030262 | False | False |
| otlp_grpc_to_blackhole | ingress throughput | 15.91KiB/CPU-s | 1.56 | 100.00% | 1019.97KiB/CPU-s | 55.38KiB/CPU-s | 697.67B/CPU-s | 0.0 | 0.054291 | 1.01MiB/CPU-s | 46.24KiB/CPU-s | 582.76B/CPU-s | 0.0 | 0.044636 | False | False |
| syslog_regex_logs2metric_ddmetrics | ingress throughput | 39.29KiB/CPU-s | 1.04 | 100.00% | 3.68MiB/CPU-s | 419.39KiB/CPU-s | 5.16KiB/CPU-s | 0.0 | 0.111347 | 3.72MiB/CPU-s | 419.31KiB/CPU-s | 5.16KiB/CPU-s | 0.0 | 0.110178 | True | True |
| syslog_splunk_hec_logs | ingress throughput | 89.68KiB/CPU-s | 1.00 | 100.00% | 8.79MiB/CPU-s | 298.46KiB/CPU-s | 3.67KiB/CPU-s | 0.0 | 0.033167 | 8.87MiB/CPU-s | 209.44KiB/CPU-s | 2.58KiB/CPU-s | 0.0 | 0.023045 | False | False |
| datadog_agent_remap_datadog_logs_acks | ingress throughput | 192.43KiB/CPU-s | 0.59 | 100.00% | 32.11MiB/CPU-s | 1.32MiB/CPU-s | 16.57KiB/CPU-s | 0.0 | 0.04095 | 32.3MiB/CPU-s | 1.32MiB/CPU-s | 16.63KiB/CPU-s | 0.0 | 0.040865 | False | False |
| otlp_http_to_blackhole | ingress throughput | 8.24KiB/CPU-s | 0.53 | 99.99% | 1.51MiB/CPU-s | 127.76KiB/CPU-s | 1.57KiB/CPU-s | 0.0 | 0.082503 | 1.52MiB/CPU-s | 119.05KiB/CPU-s | 1.46KiB/CPU-s | 0.0 | 0.076468 | False | False |
| datadog_agent_remap_blackhole | ingress throughput | 138.76KiB/CPU-s | 0.44 | 100.00% | 30.53MiB/CPU-s | 1.41MiB/CPU-s | 17.82KiB/CPU-s | 0.0 | 0.046328 | 30.67MiB/CPU-s | 1.68MiB/CPU-s | 21.12KiB/CPU-s | 0.0 | 0.054672 | False | False |
| splunk_hec_route_s3 | ingress throughput | 28.54KiB/CPU-s | 0.24 | 98.90% | 11.46MiB/CPU-s | 691.19KiB/CPU-s | 8.5KiB/CPU-s | 0.0 | 0.058915 | 11.48MiB/CPU-s | 595.13KiB/CPU-s | 7.32KiB/CPU-s | 0.0 | 0.050604 | False | False |
| http_to_http_json | ingress throughput | 19.87KiB/CPU-s | 0.14 | 99.99% | 13.58MiB/CPU-s | 299.58KiB/CPU-s | 3.69KiB/CPU-s | 0.0 | 0.021539 | 13.6MiB/CPU-s | 262.43KiB/CPU-s | 3.23KiB/CPU-s | 0.0 | 0.018841 | False | False |
| enterprise_http_to_http | ingress throughput | 11.27KiB/CPU-s | 0.08 | 99.05% | 13.61MiB/CPU-s | 318.33KiB/CPU-s | 3.92KiB/CPU-s | 0.0 | 0.022834 | 13.62MiB/CPU-s | 152.93KiB/CPU-s | 1.88KiB/CPU-s | 0.0 | 0.010961 | False | False |
| syslog_log2metric_splunk_hec_metrics | ingress throughput | 5.87KiB/CPU-s | 0.06 | 62.30% | 9.1MiB/CPU-s | 368.14KiB/CPU-s | 4.53KiB/CPU-s | 0.0 | 0.039493 | 9.11MiB/CPU-s | 395.57KiB/CPU-s | 4.87KiB/CPU-s | 0.0 | 0.042409 | False | False |
| fluent_elasticsearch | ingress throughput | -40.63B/CPU-s | -0.00 | 6.11% | 45.41MiB/CPU-s | 30.29KiB/CPU-s | 377.28B/CPU-s | 0.0 | 0.000651 | 45.41MiB/CPU-s | 29.92KiB/CPU-s | 372.8B/CPU-s | 0.0 | 0.000643 | False | False |
| splunk_hec_indexer_ack_blackhole | ingress throughput | -123.94B/CPU-s | -0.00 | 2.14% | 13.62MiB/CPU-s | 258.66KiB/CPU-s | 3.18KiB/CPU-s | 0.0 | 0.018551 | 13.62MiB/CPU-s | 259.48KiB/CPU-s | 3.19KiB/CPU-s | 0.0 | 0.01861 | False | False |
| splunk_hec_to_splunk_hec_logs_acks | ingress throughput | 556.03B/CPU-s | 0.00 | 6.54% | 13.61MiB/CPU-s | 382.72KiB/CPU-s | 4.71KiB/CPU-s | 0.0 | 0.027452 | 13.61MiB/CPU-s | 378.46KiB/CPU-s | 4.65KiB/CPU-s | 0.0 | 0.027145 | False | False |
| http_to_http_noack | ingress throughput | -105.36B/CPU-s | -0.00 | 1.33% | 13.61MiB/CPU-s | 351.66KiB/CPU-s | 4.33KiB/CPU-s | 0.0 | 0.025231 | 13.61MiB/CPU-s | 355.4KiB/CPU-s | 4.37KiB/CPU-s | 0.0 | 0.0255 | False | False |
| splunk_hec_to_splunk_hec_logs_noack | ingress throughput | -2.27KiB/CPU-s | -0.02 | 38.19% | 13.62MiB/CPU-s | 253.12KiB/CPU-s | 3.11KiB/CPU-s | 0.0 | 0.018151 | 13.61MiB/CPU-s | 270.58KiB/CPU-s | 3.33KiB/CPU-s | 0.0 | 0.019407 | False | False |
| http_to_http_acks | ingress throughput | -8.28KiB/CPU-s | -0.16 | 13.65% | 5.2MiB/CPU-s | 2.73MiB/CPU-s | 34.34KiB/CPU-s | 0.0 | 0.52393 | 5.19MiB/CPU-s | 2.68MiB/CPU-s | 33.75KiB/CPU-s | 0.0 | 0.515597 | True | False |
| syslog_loki | ingress throughput | -40.69KiB/CPU-s | -0.47 | 100.00% | 8.42MiB/CPU-s | 293.57KiB/CPU-s | 3.61KiB/CPU-s | 0.0 | 0.03406 | 8.38MiB/CPU-s | 321.46KiB/CPU-s | 3.95KiB/CPU-s | 0.0 | 0.037472 | False | False |
| socket_to_socket_blackhole | ingress throughput | -72.39KiB/CPU-s | -0.52 | 100.00% | 13.55MiB/CPU-s | 413.13KiB/CPU-s | 5.08KiB/CPU-s | 0.0 | 0.029765 | 13.48MiB/CPU-s | 570.15KiB/CPU-s | 7.02KiB/CPU-s | 0.0 | 0.041293 | False | False |
| syslog_humio_logs | ingress throughput | -135.36KiB/CPU-s | -1.46 | 100.00% | 9.03MiB/CPU-s | 322.54KiB/CPU-s | 3.97KiB/CPU-s | 0.0 | 0.03489 | 8.89MiB/CPU-s | 322.07KiB/CPU-s | 3.96KiB/CPU-s | 0.0 | 0.035357 | False | False |
| datadog_agent_remap_blackhole_acks | ingress throughput | -583.58KiB/CPU-s | -1.84 | 100.00% | 30.94MiB/CPU-s | 1.54MiB/CPU-s | 19.36KiB/CPU-s | 0.0 | 0.049647 | 30.37MiB/CPU-s | 1.31MiB/CPU-s | 16.57KiB/CPU-s | 0.0 | 0.043281 | False | False |

Contributor

@spencergilbert spencergilbert left a comment

🤷 some spurious failures in the e2e tests that were resolved on a retry. Would be nice if I had a better idea why those intermittently fail, but it's not a priority right now 🙂

Labels

domain: sources Anything related to the Vector's sources

2 participants