metrics service sink: generalize the sink and grpc streamer for external use#13919
metrics service sink: generalize the sink and grpc streamer for external use#13919snowp merged 14 commits intoenvoyproxy:masterfrom junr03:generalize-metrics-service
Conversation
…nal use Signed-off-by: Jose Nino <jnino@lyft.com>
|
@jingwei99 I have created as a "draft PR" so we can talk about it before we ask for a review. |
|
Missing one last place to templetize. Will update after lunch. |
Signed-off-by: Jose Nino <jnino@lyft.com>
junr03
left a comment
There was a problem hiding this comment.
@snowp @mattklein123 ready from a pass from either of you
source/extensions/stat_sinks/metrics_service/grpc_metrics_service_impl.h
Outdated
Show resolved
Hide resolved
snowp
left a comment
There was a problem hiding this comment.
Thanks! Overall this makes sense, just a few suggestions
source/extensions/stat_sinks/metrics_service/grpc_metrics_service_impl.cc
Outdated
Show resolved
Hide resolved
source/extensions/stat_sinks/metrics_service/grpc_metrics_service_impl.h
Outdated
Show resolved
Hide resolved
source/extensions/stat_sinks/metrics_service/grpc_metrics_service_impl.h
Outdated
Show resolved
Hide resolved
|
Generally looks fine to me. Will wait for Snow to finish and then I will take a pass. Thanks. /wait |
|
@snowp updated! |
snowp
left a comment
There was a problem hiding this comment.
Thanks this looks cleaner, just some minor comments
source/extensions/stat_sinks/metrics_service/grpc_metrics_service_impl.cc
Outdated
Show resolved
Hide resolved
source/extensions/stat_sinks/metrics_service/grpc_metrics_service_impl.h
Outdated
Show resolved
Hide resolved
source/extensions/stat_sinks/metrics_service/grpc_metrics_service_impl.h
Outdated
Show resolved
Hide resolved
source/extensions/stat_sinks/metrics_service/grpc_metrics_service_impl.h
Outdated
Show resolved
Hide resolved
source/extensions/stat_sinks/metrics_service/grpc_metrics_service_impl.h
Show resolved
Hide resolved
source/extensions/stat_sinks/metrics_service/grpc_metrics_service_impl.h
Outdated
Show resolved
Hide resolved
Signed-off-by: Jose Nino <jnino@lyft.com>
Signed-off-by: Jose Nino <jnino@lyft.com>
snowp
left a comment
There was a problem hiding this comment.
Thanks! This looks good to me
mattklein123
left a comment
There was a problem hiding this comment.
Thanks LGTM with one small question.
| auto metrics = | ||
| std::make_unique<Envoy::Protobuf::RepeatedPtrField<io::prometheus::client::MetricFamily>>(); |
There was a problem hiding this comment.
Are you going to need to override this also? Or if you do will you override the entire flusher?
There was a problem hiding this comment.
If we overrode, then yes, we would have to override the entire flusher. This proto struct is the minimal container that the flusher uses to aggregate metrics, so I don't see us changing it; unless we wanted to change the metrics format we use on the wire, and in the server-side service. From the core use in Envoy Mobile, and from my conversations with @jingwei99, I haven't come up with a good use case for changing the data format.
|
Trying to repro the gcc failure locally |
Signed-off-by: Jose Nino <jnino@lyft.com>
|
I can repro the issue here by compiling on RBE with the remote-gcc target: it seems like the code hangs in waiting for data: for whatever reason it doesn't seem to want to give me the full trace, seems like something is off with the .dwp file? Interestingly it appears to hang on this assertion until this process receives SIGINT, at which point it seems to be able to actually time out: It's a bit hard to tell exactly what's going on with the broken debugging, so I'm not sure why a SIGINT makes the process stop spinning |
|
@jmarantz do you have a few minutes to see if you had any thoughts given your familiarity with stats/timing in tests. |
|
cc @antoniovicente as well |
antoniovicente
left a comment
There was a problem hiding this comment.
I can repro the issue here by compiling on RBE with the remote-gcc target: it seems like the code hangs in waiting for data:
(gdb) bt #0 0x00007ffff6f2c639 in syscall () from /lib/x86_64-linux-gnu/libc.so.6 #1 0x0000555557c656d7 in absl::synchronization_internal::FutexImpl::WaitUntil (v=0x7ffff7fb8140, val=0, t=...) at external/com_google_absl/absl/synchronization/internal/futex.h:97 #2 0x0000555557c65459 in absl::synchronization_internal::Waiter::Wait (this=0x7ffff7fb8140, t=...) at external/com_google_absl/absl/synchronization/internal/waiter.cc:95 #3 0x0000555557c65191 in AbslInternalPerThreadSemWait (t=...) at external/com_google_absl/absl/synchronization/internal/per_thread_sem.cc:93 #4 0x0000555557c63a28 in absl::synchronization_internal::PerThreadSem::Wait (t=...) at external/com_google_absl/absl/synchronization/internal/per_thread_sem.h:112 #5 0x0000555557c63c48 in absl::Mutex::DecrementSynchSem (mu=0xf89bf054610, w=0x7ffff7fb8100, t=...) at external/com_google_absl/absl/synchronization/mutex.cc:568 #6 0x0000555557c5d9cf in absl::Mutex::Block (this=0xf89bf054610, s=0x7ffff7fb8100) at external/com_google_absl/absl/synchronization/mutex.cc:1116 #7 0x0000555557c5f2db in absl::Mutex::AwaitCommon (this=0xf89bf054610, cond=..., t=...) at external/com_google_absl/absl/synchronization/mutex.cc:1580 #8 0x0000555557c5f16e in absl::Mutex::AwaitWithDeadline (this=0xf89bf054610, cond=..., deadline=...) at external/com_google_absl/absl/synchronization/mutex.cc:1561 #9 0x0000555557c5f100 in absl::Mutex::AwaitWithTimeout (this=0xf89bf054610, cond=..., timeout=...) at external/com_google_absl/absl/synchronization/mutex.cc:1549 #10 0x00005555559d3f53 in Envoy::Event::TestTimeSystem::waitFor<std::chrono::duration<long, std::ratio<1l, 1000l> > > (this=0xf89bfd38400, mutex=..., condition=..., duration=...) at ./test/test_common/test_time_system.h:127 #11 0x00005555559dd8f4 in Envoy::(anonymous namespace)::waitForWithDispatcherRun(Envoy::Event::TestTimeSystem &, absl::Mutex &, const std::function<bool()> &, Envoy::Event::Dispatcher &, std::chrono::milliseconds) (time_system=..., lock=..., condition=..., client_dispatcher=..., timeout=...) at test/integration/fake_upstream.cc:156 #12 0x00005555559dd9ff in Envoy::FakeStream::waitForData ( this=<error reading variable: DW_FORM_strp pointing outside of .debug_str section [in module /build/.cache/bazel/_bazel_vscode/2d35de14639eaad1ac7060a4dd7e3351/execroot/envoy/bazel-out/k8-dbg/bin/test/extensions/stats_sinks/metrics_service/metrics_service_integration_test.dwp]>, client_dispatcher=..., body_length=5, timeout=...) at test/integration/fake_upstream.cc:170for whatever reason it doesn't seem to want to give me the full trace, seems like something is off with the .dwp file?
Interestingly it appears to hang on this assertion until this process receives SIGINT, at which point it seems to be able to actually time out:
test/extensions/stats_sinks/metrics_service/metrics_service_integration_test.cc:165: Failure Value of: waitForMetricsRequest() Actual: false (Timed out waiting for start of gRPC message.) Expected: true Stack trace: 0x55e278613932: Envoy::(anonymous namespace)::MetricsServiceIntegrationTest_BasicFlow_Test::TestBody() 0x55e27a94e514: testing::internal::HandleSehExceptionsInMethodIfSupported<>() 0x55e27a949bdb: testing::internal::HandleExceptionsInMethodIfSupported<>() 0x55e27a932104: testing::Test::Run() 0x55e27a932a71: testing::TestInfo::Run() ... Google Test internal frames ...It's a bit hard to tell exactly what's going on with the broken debugging, so I'm not sure why a SIGINT makes the process stop spinning
What are the repo instructions? Something like: bazel test --config=remote-gcc
| } | ||
| EXPECT_TRUE(known_counter_exists); | ||
| EXPECT_TRUE(known_gauge_exists); | ||
| EXPECT_TRUE(known_summary_exists); |
There was a problem hiding this comment.
This change breaks the gcc build due to write-only local variable:
2020-12-07T22:10:37.7506320Z Execution platform: @rbe_ubuntu_gcc//config:platform
2020-12-07T22:10:37.7507984Z test/extensions/stats_sinks/metrics_service/metrics_service_integration_test.cc: In member function 'testing::AssertionResult Envoy::{anonymous}::MetricsServiceIntegrationTest::waitForMetricsRequest()':
2020-12-07T22:10:37.7509326Z test/extensions/stats_sinks/metrics_service/metrics_service_integration_test.cc:69:10: error: variable 'known_summary_exists' set but not used [-Werror=unused-but-set-variable]
2020-12-07T22:10:37.7510349Z 69 | bool known_summary_exists = false;
2020-12-07T22:10:37.7510693Z | ^~~~~~~~~~~~~~~~~~~~
2020-12-07T22:10:37.7511039Z cc1plus: all warnings being treated as errors
There was a problem hiding this comment.
🤦 yeah sorry, a commit from earlier in the day didn't make it up here because of a failed git hook that I missed
From looking at the code it seems that the call to metrics_service_request_->waitForGrpcMessage is timing out after 20 seconds. The failure mode you hit seems different from the one from the most recent CI where the test timed out after 300+ seconds. It would be good to know the status of (known_counter_exists, known_gauge_exists, known_histogram_exists) after N attempts. Also, it may be worth a sleep between iterations so that the proxy has some time to do other things between probes. |
|
@antoniovicente thanks for your comment. It made me re-read the code carefully again and notice that this https://github.com/envoyproxy/envoy/pull/13919/files#diff-049bdaed2b6373b41b631030556ffa750151abac8d91526d515a61a6c5054b53R122 check was missing. I am guessing that in gcc the loop ordering was different than in clang. So in gcc the linked line was breaking out of the for loop into the while loop without getting to the summary metric. Thus keeping the while loop stuck forever until bazel times out the test. Tests are now green |
|
@snowp ready for a final stamp. |
Fantastic. |
* master: buffer: Optimize the layout of Slices in Buffer::OwnedImpl by removing subclassing and storing slice info directly in the SliceDeque (envoyproxy#14282) gRPC client to be used by ext_proc filter (envoyproxy#14283) http2: Add integration tests for PRIORITY frame flood mitigation for upstream servers (envoyproxy#14328) event: touch watchdog before execution of each post callback and before deferred deletion (envoyproxy#14339) stale: more allowed ops (envoyproxy#14345) stale: more changes (envoyproxy#14344) test: TODO fixup making enable_half_close private envoyproxy#14330) event: Reduce potential for lock contention while executing dispatcher post callbacks. (envoyproxy#14289) stale: fix config (envoyproxy#14337) metrics service sink: generalize the sink and grpc streamer for external use (envoyproxy#13919) wasm: update V8 to v8.8.278.8. (envoyproxy#14298) repo: switch to actions based stale bot (envoyproxy#14335) buffer: Use WatermarkFactory to create most WatermarkBuffer instances (envoyproxy#14256) Signed-off-by: Michael Puncel <mpuncel@squareup.com>
Description: Replace the metrics sink from the envoy one to a custom one. The custom metrics sink comes with ack function on the grpc stream. This change is based on upstream envoy change: envoyproxy/envoy#13919 Risk Level: High Testing: Local and ci unit tests and integration tests Signed-off-by: Jingwei Hao <jingweih@lyft.com>
Description: Replace the metrics sink from the envoy one to a custom one. The custom metrics sink comes with ack function on the grpc stream. This change is based on upstream envoy change: #13919 Risk Level: High Testing: Local and ci unit tests and integration tests Signed-off-by: Jingwei Hao <jingweih@lyft.com> Signed-off-by: JP Simard <jp@jpsim.com>
Description: Replace the metrics sink from the envoy one to a custom one. The custom metrics sink comes with ack function on the grpc stream. This change is based on upstream envoy change: #13919 Risk Level: High Testing: Local and ci unit tests and integration tests Signed-off-by: Jingwei Hao <jingweih@lyft.com> Signed-off-by: JP Simard <jp@jpsim.com>
Commit Message: metrics service sink - generalize the sink and grpc streamer for external use
Additional Description: the changes here templetize the abstract grpc streamer class and unwrap the request proto message in order to allow for external reuse of the classes in Envoy to build additional logic in a generic manner. There is no functional difference in the existing sink evidenced by the absence of changes in testing.
Risk Level: low - no functional changes.
Testing: no changes in testing gives confidence in the functional parity before and after the changes being committed.
Signed-off-by: Jose Nino jnino@lyft.com