
chore(observability): robustly synchronize component validation framework tasks#19927

Merged
neuronull merged 54 commits into master from neuronull/OPA-1048_component_validation_fix_task_synchronization
Feb 28, 2024
Conversation

@neuronull
Contributor

@neuronull neuronull commented Feb 22, 2024

TLDR:

  • A separate task shutdown order was introduced for sinks, which require knowing that the external resource has processed all of the events.
  • The sink error path was changed to rely on the external resource to emulate an error, rather than on codecs.

This change does a couple of things that were side effects of the main goal, which was to remove the sleep that had been added to the runner for the sink case. That sleep was a temporary hack to get us going, but as seen in #19887, it's not a long-term solution: we had to keep increasing it as the sink took longer to process things.

In this discussion, we talked about synchronizing the sources but not the sinks.

My first attempt was in the TaskCoordinator: splitting shutdown() into two methods, one to trigger the shutdown and the other to wait for it to complete. After that I played around with various orderings: triggering the input task to shut down, triggering the topology to shut down, etc. None of these fixed the issues we were having; in most cases they made other problems arise.
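The two-phase split described above (trigger the shutdown, then separately wait for it to complete) can be sketched with plain threads and channels. This is an illustrative sketch, not Vector's actual TaskCoordinator API; all type and method names here are assumptions:

```rust
use std::sync::{mpsc, Arc, Mutex};
use std::thread;

// Hypothetical coordinator: one channel to signal shutdown, one for the
// task to report that it has actually finished.
struct TaskCoordinator {
    shutdown_tx: mpsc::Sender<()>,
    done_rx: mpsc::Receiver<()>,
}

impl TaskCoordinator {
    fn spawn<F: FnOnce() + Send + 'static>(work: F) -> Self {
        let (shutdown_tx, shutdown_rx) = mpsc::channel();
        let (done_tx, done_rx) = mpsc::channel();
        thread::spawn(move || {
            // Block until the shutdown signal arrives, then finish up.
            let _ = shutdown_rx.recv();
            work();
            let _ = done_tx.send(());
        });
        TaskCoordinator { shutdown_tx, done_rx }
    }

    /// Phase 1: tell the task to shut down, without blocking.
    fn trigger_shutdown(&self) {
        let _ = self.shutdown_tx.send(());
    }

    /// Phase 2: block until the task has actually finished.
    fn wait_for_shutdown(&self) {
        let _ = self.done_rx.recv();
    }
}

// Drive two tasks: trigger both shutdowns first, then wait on each,
// recording the completion events in a shared log.
fn shutdown_order() -> Vec<String> {
    let log = Arc::new(Mutex::new(Vec::new()));
    let l = log.clone();
    let input = TaskCoordinator::spawn(move || l.lock().unwrap().push("input done".to_string()));
    let l = log.clone();
    let output = TaskCoordinator::spawn(move || l.lock().unwrap().push("output done".to_string()));

    input.trigger_shutdown();
    output.trigger_shutdown();
    input.wait_for_shutdown();
    output.wait_for_shutdown();

    let events = log.lock().unwrap().clone();
    events
}

fn main() {
    println!("{:?}", shutdown_order());
}
```

Splitting the phases like this is what makes it possible to experiment with different orderings, since triggers and waits can be interleaved freely across coordinators.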

The "root cause" of the problem for sinks is that we need to allow time for the sink to process all the events, for the telemetry events to be received, AND for the external resource to process all the events. The only way I could see to do this was to drive the shutdown for sinks by collecting the output task's result. Once we have that, we know the external resource has handled everything, and it is safe to shut down the component under test and the other tasks. As such, the solution was to have two different shutdown orders: one for sources and transforms, and one for sinks.

A side effect of this change was that it became obvious to me that the current solution for triggering errors in sinks is not ideal and not representative of real-world scenarios. This was discussed a bit in #18062. The current approach also raised challenges for the error path case when synchronizing task shutdown, because we needed some way to know that the error had been raised by the sink. The solution was to not rely on a codec error for sinks, but instead on an error from the external resource. This involved passing the test case data down to the external resource so it knows when it needs to simulate an error.
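The idea of handing the test case data to the external resource so it can decide when to fail can be sketched as follows. This is a minimal illustration only; `TestCase`, `ExternalResource`, and `handle_request` are invented names, not Vector's real types:

```rust
// Which validation scenario is currently running; passed down from the runner.
#[derive(Clone, Copy)]
enum TestCase {
    HappyPath,
    ExternalResourceRejects,
}

struct ExternalResource {
    test_case: TestCase,
}

impl ExternalResource {
    /// Emulate the downstream service: reject everything when the test case
    /// asks for the error path, otherwise accept and report bytes handled.
    fn handle_request(&self, body: &str) -> Result<usize, String> {
        match self.test_case {
            TestCase::ExternalResourceRejects => {
                Err("simulated error from external resource".to_string())
            }
            TestCase::HappyPath => Ok(body.len()),
        }
    }
}

fn main() {
    let happy = ExternalResource { test_case: TestCase::HappyPath };
    let sad = ExternalResource { test_case: TestCase::ExternalResourceRejects };
    println!("{:?}", happy.handle_request("event"));
    println!("{:?}", sad.handle_request("event"));
}
```

The point of this shape is that the error originates where it would in production (the downstream service), rather than being injected through a deliberately broken codec inside the sink.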

TEST_LOG=info cargo nextest run --fail-fast --retries=0 --no-default-features --features component-validation-tests --status-level pass -j 4  components::validation::tests

    Starting 3 tests across 3 binaries (321 skipped)
        PASS [   5.778s] vector components::validation::tests::validate_component_tests_validation_components_sources_http_server_yaml
        PASS [  10.459s] vector components::validation::tests::validate_component_tests_validation_components_sources_http_client_yaml
        PASS [  11.709s] vector components::validation::tests::validate_component_tests_validation_components_sinks_http_yaml
------------

neuronull added 30 commits July 11, 2023 10:59
… into neuronull/component_validation_sink_component_spec
… into neuronull/component_validation_sink_component_spec
…o neuronull/component_validation_sink_sad_path
@neuronull neuronull added no-changelog Changes in this PR do not need user-facing explanations in the release changelog domain: observability Anything related to monitoring/observing Vector domain: tests Anything related to Vector's internal tests labels Feb 22, 2024
@datadog-vectordotdev

datadog-vectordotdev bot commented Feb 22, 2024

Datadog Report

Branch report: neuronull/OPA-1048_component_validation_fix_task_synchronization
Commit report: d720b9b
Test service: vector

✅ 0 Failed, 2122 Passed, 0 Skipped, 1m 25.27s Wall Time

// this config only runs with the test case "encoding_error" in the yaml file.
ComponentTestCaseConfig::from_sink(
    sad_config,
    Some("encoding_error".to_owned()),
Contributor Author


Note that with this PR, there is currently nothing using the functionality I added to tie a specific config to specific test case(s). But I think this could be valuable in the future, so I figured I'd leave the plumbing in place.
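The "config per test case" plumbing mentioned above can be sketched like this: a config optionally names the test case it applies to, and the runner filters configs against the current case. The struct, field, and method names here are illustrative assumptions, not the actual framework API:

```rust
struct ComponentTestCaseConfig {
    config_name: &'static str,
    // `None` means the config runs for every test case.
    test_case: Option<String>,
}

impl ComponentTestCaseConfig {
    /// A config applies when it is unrestricted, or when its test case
    /// matches the one currently being run.
    fn applies_to(&self, current_case: &str) -> bool {
        match &self.test_case {
            None => true,
            Some(case) => case.as_str() == current_case,
        }
    }
}

fn main() {
    let configs = vec![
        ComponentTestCaseConfig { config_name: "default", test_case: None },
        ComponentTestCaseConfig {
            config_name: "sad",
            test_case: Some("encoding_error".to_owned()),
        },
    ];
    // Select the configs that participate in the "encoding_error" case.
    let active: Vec<&str> = configs
        .iter()
        .filter(|c| c.applies_to("encoding_error"))
        .map(|c| c.config_name)
        .collect();
    println!("{:?}", active);
}
```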

@neuronull neuronull requested a review from tobz February 22, 2024 17:19
@neuronull neuronull changed the title [DRAFT] chore(observability): robustly synchronize component validation framework tasks chore(observability): robustly synchronize component validation framework tasks Feb 22, 2024
@neuronull neuronull marked this pull request as ready for review February 22, 2024 17:34
@neuronull neuronull requested a review from a team February 22, 2024 17:34
Base automatically changed from neuronull/component_validation_sink_sad_path to master February 22, 2024 17:57
Contributor

@tobz tobz left a comment


Overall, I think this is generally fine.

I left some comments around areas that felt perhaps a little too specialized to the test cases you were intending to fix, because it feels like we might be trading one bit of complexity for another in some cases... but none of it is blocking; it's really just meant to point out where to be mindful during any further refactoring.

Comment on lines +407 to +437
if self.configuration.component_type == ComponentType::Sink {
    input_task_coordinator.shutdown().await;

    output_task_coordinator.shutdown().await;

    let output_events = output_driver
        .await
        .expect("output driver task should not have panicked");

    telemetry_task_coordinator.shutdown().await;

    topology_task_coordinator.shutdown().await;

    output_events

// for sources and transforms, the output events are collected from the controlled edge (vector sink),
// and as such the coordination of shutdown of the tasks is more straightforward.
} else {
    input_task_coordinator.shutdown().await;

    telemetry_task_coordinator.shutdown().await;

    topology_task_coordinator.shutdown().await;

    output_task_coordinator.shutdown().await;

    output_driver
        .await
        .expect("output driver task should not have panicked")
}
}
Contributor


I might be misreading this, but I wonder: could you not just use the first approach for all cases?

Like, if the first approach is working for the purpose of collecting/waiting long enough for all output events as well as telemetry data... then I'm trying to think of why it also wouldn't work for sources and transforms? 🤔

Contributor Author


I had this thought as well when I was writing this. I had tried what you suggested and it didn't work for sources. I just tried it again, and I realized there is a three-birds-one-stone solution to the issues you have raised.

Instead of having the external resource be responsible for reporting whether the events received match the expected events, we can do that in the output driver itself. This means sources can use the same shutdown order, because the controlled edge doesn't need to coordinate anything. It also means other external resource implementations don't need to repeat that logic.

It also presented an opportunity to add a timeout mechanism so that we still shut down gracefully.
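An output driver that owns the "did we receive everything?" check, with a deadline so shutdown stays graceful, could look roughly like this. It is a stdlib sketch under assumed names (`drive_output`, a string event type), not the framework's actual driver:

```rust
use std::sync::mpsc;
use std::time::{Duration, Instant};

/// Collect events until the expected count is reached, but give up once the
/// deadline passes (or the sender goes away) so the runner can still shut
/// everything down gracefully with whatever was collected.
fn drive_output(
    rx: mpsc::Receiver<String>,
    expected: usize,
    deadline: Duration,
) -> Vec<String> {
    let start = Instant::now();
    let mut events = Vec::new();
    while events.len() < expected {
        let remaining = deadline.saturating_sub(start.elapsed());
        match rx.recv_timeout(remaining) {
            Ok(event) => events.push(event),
            // Timed out or the channel disconnected: return what we have.
            Err(_) => break,
        }
    }
    events
}

fn main() {
    let (tx, rx) = mpsc::channel();
    for i in 0..3 {
        tx.send(format!("event {i}")).unwrap();
    }
    let events = drive_output(rx, 3, Duration::from_secs(1));
    println!("collected {} events", events.len());
}
```

Centralizing the count check here is what lets every component type share one shutdown order: the driver itself knows when the output side is done.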

@neuronull neuronull enabled auto-merge February 28, 2024 18:59
@neuronull neuronull added this pull request to the merge queue Feb 28, 2024
@github-actions

Regression Detector Results

Run ID: b3e7db72-4e27-429d-89ed-4b62b3b9d64d
Baseline: a9cee3f
Comparison: 43a9129
Total CPUs: 7

Performance changes are noted in the perf column of each table:

  • ✅ = significantly better comparison variant performance
  • ❌ = significantly worse comparison variant performance
  • ➖ = no significant change in performance

No significant changes in experiment optimization goals

Confidence level: 90.00%
Effect size tolerance: |Δ mean %| ≥ 5.00%

There were no significant changes in experiment optimization goals at this confidence level and effect size tolerance.

Fine details of change detection per experiment

perf experiment goal Δ mean % Δ mean % CI
➖ syslog_log2metric_humio_metrics ingress throughput +2.90 [+2.76, +3.05]
➖ syslog_log2metric_splunk_hec_metrics ingress throughput +2.67 [+2.52, +2.83]
➖ syslog_humio_logs ingress throughput +1.57 [+1.47, +1.68]
➖ socket_to_socket_blackhole ingress throughput +1.51 [+1.44, +1.59]
➖ http_text_to_http_json ingress throughput +1.11 [+0.98, +1.24]
➖ otlp_http_to_blackhole ingress throughput +0.82 [+0.67, +0.98]
➖ file_to_blackhole egress throughput +0.82 [-1.74, +3.38]
➖ syslog_regex_logs2metric_ddmetrics ingress throughput +0.75 [+0.62, +0.88]
➖ syslog_loki ingress throughput +0.60 [+0.55, +0.64]
➖ syslog_log2metric_tag_cardinality_limit_blackhole ingress throughput +0.60 [+0.48, +0.71]
➖ splunk_hec_route_s3 ingress throughput +0.46 [-0.04, +0.96]
➖ http_to_s3 ingress throughput +0.42 [+0.14, +0.70]
➖ http_to_http_acks ingress throughput +0.19 [-1.13, +1.51]
➖ datadog_agent_remap_datadog_logs ingress throughput +0.12 [+0.02, +0.21]
➖ http_to_http_noack ingress throughput +0.11 [+0.01, +0.21]
➖ http_to_http_json ingress throughput +0.03 [-0.05, +0.10]
➖ splunk_hec_to_splunk_hec_logs_acks ingress throughput -0.00 [-0.16, +0.16]
➖ splunk_hec_indexer_ack_blackhole ingress throughput -0.00 [-0.15, +0.14]
➖ splunk_hec_to_splunk_hec_logs_noack ingress throughput -0.03 [-0.14, +0.09]
➖ enterprise_http_to_http ingress throughput -0.11 [-0.19, -0.03]
➖ syslog_splunk_hec_logs ingress throughput -0.11 [-0.18, -0.05]
➖ datadog_agent_remap_datadog_logs_acks ingress throughput -0.22 [-0.31, -0.13]
➖ fluent_elasticsearch ingress throughput -0.39 [-0.85, +0.08]
➖ otlp_grpc_to_blackhole ingress throughput -0.51 [-0.60, -0.42]
➖ datadog_agent_remap_blackhole ingress throughput -0.62 [-0.71, -0.53]
➖ datadog_agent_remap_blackhole_acks ingress throughput -0.95 [-1.04, -0.87]
➖ http_elasticsearch ingress throughput -0.96 [-1.02, -0.89]

Explanation

A regression test is an A/B test of target performance in a repeatable rig, where "performance" is measured as "comparison variant minus baseline variant" for an optimization goal (e.g., ingress throughput). Due to intrinsic variability in measuring that goal, we can only estimate its mean value for each experiment; we report uncertainty in that value as a 90.00% confidence interval denoted "Δ mean % CI".

For each experiment, we decide whether a change in performance is a "regression" -- a change worth investigating further -- if all of the following criteria are true:

  1. Its estimated |Δ mean %| ≥ 5.00%, indicating the change is big enough to merit a closer look.

  2. Its 90.00% confidence interval "Δ mean % CI" does not contain zero, indicating that if our statistical model is accurate, there is at least a 90.00% chance there is a difference in performance between baseline and comparison variants.

  3. Its configuration does not mark it "erratic".
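The three criteria above amount to a simple predicate. A minimal sketch of that decision rule (an illustration of the stated criteria, not the regression detector's actual code):

```rust
/// A change is flagged as a regression only if the estimated effect is at
/// least the tolerance (5.00%), its 90% confidence interval excludes zero,
/// and the experiment is not marked erratic.
fn is_regression(delta_mean_pct: f64, ci: (f64, f64), erratic: bool) -> bool {
    let big_enough = delta_mean_pct.abs() >= 5.0;
    let ci_excludes_zero = ci.0 > 0.0 || ci.1 < 0.0;
    big_enough && ci_excludes_zero && !erratic
}

fn main() {
    // e.g. syslog_log2metric_humio_metrics: +2.90 [+2.76, +3.05]
    // CI excludes zero, but the effect is below tolerance, so: not flagged.
    println!("{}", is_regression(2.90, (2.76, 3.05), false));
}
```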

Merged via the queue into master with commit 43a9129 Feb 28, 2024
@neuronull neuronull deleted the neuronull/OPA-1048_component_validation_fix_task_synchronization branch February 28, 2024 20:29
AndrooTheChen pushed a commit to discord/vector that referenced this pull request Sep 23, 2024
…work tasks (vectordotdev#19927)

* add fix and small refactor

* fix compilation errors

* 3 ticks

* dont compute expected metrics in validator

* cleanup

* cleanup

* clippy

* feedback tz: sent_eventssssss

* feedback tz: fix telemetry shutdown finishing logic

* 3 ticks

* small reorg to add sinks

* mini refactor of the component spec validators

* attempt to set expected values from the resource

* feedback tz- from not try_from

* back to 3 ticks

* fix incorrect expected values

* Even more reduction

* clippy

* add the discarded events total check

* workaround the new sync issues

* multi config support

* cleanup

* check events

* partial feedback

* thought i removed that

* use ref

* feedback: dont introduce PassThroughFail variant

* feedback: adjust enum variant names for clarity

* feedback: no idea what I was thinking with `input_codec`

* spell check

* fr

* fix sync issues

* remove unused enum variant

* feedback- update docs

* check_events

* touchup

* spell checker

* merge leftover

* feedback: log formating

* feedback- better approach to driving shutdown

* give a generous timeout
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

  • domain: observability (Anything related to monitoring/observing Vector)
  • domain: sinks (Anything related to the Vector's sinks)
  • domain: sources (Anything related to the Vector's sources)
  • domain: tests (Anything related to Vector's internal tests)
  • no-changelog (Changes in this PR do not need user-facing explanations in the release changelog)
