Skip to content

metrics service sink: generalize the sink and grpc streamer for external use#13919

Merged
snowp merged 14 commits intoenvoyproxy:masterfrom
junr03:generalize-metrics-service
Dec 9, 2020
Merged

metrics service sink: generalize the sink and grpc streamer for external use#13919
snowp merged 14 commits intoenvoyproxy:masterfrom
junr03:generalize-metrics-service

Conversation

@junr03
Copy link
Copy Markdown
Member

@junr03 junr03 commented Nov 5, 2020

Commit Message: metrics service sink - generalize the sink and grpc streamer for external use
Additional Description: the changes here templetize the abstract grpc streamer class and unwrap the request proto message in order to allow for external reuse of the classes in Envoy to build additional logic in a generic manner. There is no functional difference in the existing sink evidenced by the absence of changes in testing.
Risk Level: low - no functional changes.
Testing: no changes in testing gives confidence in the functional parity before and after the changes being committed.

Signed-off-by: Jose Nino jnino@lyft.com

…nal use

Signed-off-by: Jose Nino <jnino@lyft.com>
@junr03
Copy link
Copy Markdown
Member Author

junr03 commented Nov 5, 2020

@jingwei99 I have created as a "draft PR" so we can talk about it before we ask for a review.

@junr03
Copy link
Copy Markdown
Member Author

junr03 commented Nov 5, 2020

Missing one last place to templetize. Will update after lunch.

Signed-off-by: Jose Nino <jnino@lyft.com>
@junr03 junr03 marked this pull request as ready for review November 5, 2020 21:29
@junr03 junr03 requested a review from jmarantz as a code owner November 5, 2020 21:29
Copy link
Copy Markdown
Member Author

@junr03 junr03 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@snowp @mattklein123 ready from a pass from either of you

Copy link
Copy Markdown
Contributor

@snowp snowp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Overall this makes sense, just a few suggestions

@mattklein123
Copy link
Copy Markdown
Member

Generally looks fine to me. Will wait for Snow to finish and then I will take a pass. Thanks.

/wait

Signed-off-by: Jose Nino <jnino@lyft.com>
@junr03
Copy link
Copy Markdown
Member Author

junr03 commented Nov 10, 2020

@snowp updated!

Copy link
Copy Markdown
Contributor

@snowp snowp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks this looks cleaner, just some minor comments

Signed-off-by: Jose Nino <jnino@lyft.com>
Signed-off-by: Jose Nino <jnino@lyft.com>
Signed-off-by: Jose Nino <jnino@lyft.com>
snowp
snowp previously approved these changes Nov 16, 2020
Copy link
Copy Markdown
Contributor

@snowp snowp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! This looks good to me

fmt
Signed-off-by: Jose Nino <jnino@lyft.com>
mattklein123
mattklein123 previously approved these changes Nov 16, 2020
Copy link
Copy Markdown
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks LGTM with one small question.

Comment on lines +50 to +51
auto metrics =
std::make_unique<Envoy::Protobuf::RepeatedPtrField<io::prometheus::client::MetricFamily>>();
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you going to need to override this also? Or if you do will you override the entire flusher?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we overrode, then yes, we would have to override the entire flusher. This proto struct is the minimal container that the flusher uses to aggregate metrics, so I don't see us changing it; unless we wanted to change the metrics format we use on the wire, and in the server-side service. From the core use in Envoy Mobile, and from my conversations with @jingwei99, I haven't come up with a good use case for changing the data format.

@junr03
Copy link
Copy Markdown
Member Author

junr03 commented Nov 17, 2020

Trying to repro the gcc failure locally

Signed-off-by: Jose Nino <jnino@lyft.com>
@snowp
Copy link
Copy Markdown
Contributor

snowp commented Dec 7, 2020

I can repro the issue here by compiling on RBE with the remote-gcc target: it seems like the code hangs in waiting for data:

(gdb) bt
#0  0x00007ffff6f2c639 in syscall () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x0000555557c656d7 in absl::synchronization_internal::FutexImpl::WaitUntil (v=0x7ffff7fb8140, val=0, t=...) at external/com_google_absl/absl/synchronization/internal/futex.h:97
#2  0x0000555557c65459 in absl::synchronization_internal::Waiter::Wait (this=0x7ffff7fb8140, t=...) at external/com_google_absl/absl/synchronization/internal/waiter.cc:95
#3  0x0000555557c65191 in AbslInternalPerThreadSemWait (t=...) at external/com_google_absl/absl/synchronization/internal/per_thread_sem.cc:93
#4  0x0000555557c63a28 in absl::synchronization_internal::PerThreadSem::Wait (t=...) at external/com_google_absl/absl/synchronization/internal/per_thread_sem.h:112
#5  0x0000555557c63c48 in absl::Mutex::DecrementSynchSem (mu=0xf89bf054610, w=0x7ffff7fb8100, t=...) at external/com_google_absl/absl/synchronization/mutex.cc:568
#6  0x0000555557c5d9cf in absl::Mutex::Block (this=0xf89bf054610, s=0x7ffff7fb8100) at external/com_google_absl/absl/synchronization/mutex.cc:1116
#7  0x0000555557c5f2db in absl::Mutex::AwaitCommon (this=0xf89bf054610, cond=..., t=...) at external/com_google_absl/absl/synchronization/mutex.cc:1580
#8  0x0000555557c5f16e in absl::Mutex::AwaitWithDeadline (this=0xf89bf054610, cond=..., deadline=...) at external/com_google_absl/absl/synchronization/mutex.cc:1561
#9  0x0000555557c5f100 in absl::Mutex::AwaitWithTimeout (this=0xf89bf054610, cond=..., timeout=...) at external/com_google_absl/absl/synchronization/mutex.cc:1549
#10 0x00005555559d3f53 in Envoy::Event::TestTimeSystem::waitFor<std::chrono::duration<long, std::ratio<1l, 1000l> > > (this=0xf89bfd38400, mutex=..., condition=..., duration=...)
    at ./test/test_common/test_time_system.h:127
#11 0x00005555559dd8f4 in Envoy::(anonymous namespace)::waitForWithDispatcherRun(Envoy::Event::TestTimeSystem &, absl::Mutex &, const std::function<bool()> &, Envoy::Event::Dispatcher &, std::chrono::milliseconds) (time_system=..., lock=..., condition=..., client_dispatcher=..., timeout=...) at test/integration/fake_upstream.cc:156
#12 0x00005555559dd9ff in Envoy::FakeStream::waitForData (
    this=<error reading variable: DW_FORM_strp pointing outside of .debug_str section [in module /build/.cache/bazel/_bazel_vscode/2d35de14639eaad1ac7060a4dd7e3351/execroot/envoy/bazel-out/k8-dbg/bin/test/extensions/stats_sinks/metrics_service/metrics_service_integration_test.dwp]>, client_dispatcher=..., body_length=5, timeout=...)
    at test/integration/fake_upstream.cc:170

for whatever reason it doesn't seem to want to give me the full trace, seems like something is off with the .dwp file?

Interestingly it appears to hang on this assertion until this process receives SIGINT, at which point it seems to be able to actually time out:

test/extensions/stats_sinks/metrics_service/metrics_service_integration_test.cc:165: Failure
Value of: waitForMetricsRequest()
  Actual: false (Timed out waiting for start of gRPC message.)
Expected: true
Stack trace:
  0x55e278613932: Envoy::(anonymous namespace)::MetricsServiceIntegrationTest_BasicFlow_Test::TestBody()
  0x55e27a94e514: testing::internal::HandleSehExceptionsInMethodIfSupported<>()
  0x55e27a949bdb: testing::internal::HandleExceptionsInMethodIfSupported<>()
  0x55e27a932104: testing::Test::Run()
  0x55e27a932a71: testing::TestInfo::Run()
... Google Test internal frames ...

It's a bit hard to tell exactly what's going on with the broken debugging, so I'm not sure why a SIGINT makes the process stop spinning

@junr03
Copy link
Copy Markdown
Member Author

junr03 commented Dec 7, 2020

@jmarantz do you have a few minutes to see if you had any thoughts given your familiarity with stats/timing in tests.

@junr03
Copy link
Copy Markdown
Member Author

junr03 commented Dec 7, 2020

cc @antoniovicente as well

Signed-off-by: Jose Nino <jnino@lyft.com>
fix
Signed-off-by: Jose Nino <jnino@lyft.com>
Copy link
Copy Markdown
Contributor

@antoniovicente antoniovicente left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can repro the issue here by compiling on RBE with the remote-gcc target: it seems like the code hangs in waiting for data:

(gdb) bt
#0  0x00007ffff6f2c639 in syscall () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x0000555557c656d7 in absl::synchronization_internal::FutexImpl::WaitUntil (v=0x7ffff7fb8140, val=0, t=...) at external/com_google_absl/absl/synchronization/internal/futex.h:97
#2  0x0000555557c65459 in absl::synchronization_internal::Waiter::Wait (this=0x7ffff7fb8140, t=...) at external/com_google_absl/absl/synchronization/internal/waiter.cc:95
#3  0x0000555557c65191 in AbslInternalPerThreadSemWait (t=...) at external/com_google_absl/absl/synchronization/internal/per_thread_sem.cc:93
#4  0x0000555557c63a28 in absl::synchronization_internal::PerThreadSem::Wait (t=...) at external/com_google_absl/absl/synchronization/internal/per_thread_sem.h:112
#5  0x0000555557c63c48 in absl::Mutex::DecrementSynchSem (mu=0xf89bf054610, w=0x7ffff7fb8100, t=...) at external/com_google_absl/absl/synchronization/mutex.cc:568
#6  0x0000555557c5d9cf in absl::Mutex::Block (this=0xf89bf054610, s=0x7ffff7fb8100) at external/com_google_absl/absl/synchronization/mutex.cc:1116
#7  0x0000555557c5f2db in absl::Mutex::AwaitCommon (this=0xf89bf054610, cond=..., t=...) at external/com_google_absl/absl/synchronization/mutex.cc:1580
#8  0x0000555557c5f16e in absl::Mutex::AwaitWithDeadline (this=0xf89bf054610, cond=..., deadline=...) at external/com_google_absl/absl/synchronization/mutex.cc:1561
#9  0x0000555557c5f100 in absl::Mutex::AwaitWithTimeout (this=0xf89bf054610, cond=..., timeout=...) at external/com_google_absl/absl/synchronization/mutex.cc:1549
#10 0x00005555559d3f53 in Envoy::Event::TestTimeSystem::waitFor<std::chrono::duration<long, std::ratio<1l, 1000l> > > (this=0xf89bfd38400, mutex=..., condition=..., duration=...)
    at ./test/test_common/test_time_system.h:127
#11 0x00005555559dd8f4 in Envoy::(anonymous namespace)::waitForWithDispatcherRun(Envoy::Event::TestTimeSystem &, absl::Mutex &, const std::function<bool()> &, Envoy::Event::Dispatcher &, std::chrono::milliseconds) (time_system=..., lock=..., condition=..., client_dispatcher=..., timeout=...) at test/integration/fake_upstream.cc:156
#12 0x00005555559dd9ff in Envoy::FakeStream::waitForData (
    this=<error reading variable: DW_FORM_strp pointing outside of .debug_str section [in module /build/.cache/bazel/_bazel_vscode/2d35de14639eaad1ac7060a4dd7e3351/execroot/envoy/bazel-out/k8-dbg/bin/test/extensions/stats_sinks/metrics_service/metrics_service_integration_test.dwp]>, client_dispatcher=..., body_length=5, timeout=...)
    at test/integration/fake_upstream.cc:170

for whatever reason it doesn't seem to want to give me the full trace, seems like something is off with the .dwp file?

Interestingly it appears to hang on this assertion until this process receives SIGINT, at which point it seems to be able to actually time out:

test/extensions/stats_sinks/metrics_service/metrics_service_integration_test.cc:165: Failure
Value of: waitForMetricsRequest()
  Actual: false (Timed out waiting for start of gRPC message.)
Expected: true
Stack trace:
  0x55e278613932: Envoy::(anonymous namespace)::MetricsServiceIntegrationTest_BasicFlow_Test::TestBody()
  0x55e27a94e514: testing::internal::HandleSehExceptionsInMethodIfSupported<>()
  0x55e27a949bdb: testing::internal::HandleExceptionsInMethodIfSupported<>()
  0x55e27a932104: testing::Test::Run()
  0x55e27a932a71: testing::TestInfo::Run()
... Google Test internal frames ...

It's a bit hard to tell exactly what's going on with the broken debugging, so I'm not sure why a SIGINT makes the process stop spinning

What are the repo instructions? Something like: bazel test --config=remote-gcc

}
EXPECT_TRUE(known_counter_exists);
EXPECT_TRUE(known_gauge_exists);
EXPECT_TRUE(known_summary_exists);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change breaks the gcc build due to write-only local variable:

2020-12-07T22:10:37.7506320Z Execution platform: @rbe_ubuntu_gcc//config:platform
2020-12-07T22:10:37.7507984Z test/extensions/stats_sinks/metrics_service/metrics_service_integration_test.cc: In member function 'testing::AssertionResult Envoy::{anonymous}::MetricsServiceIntegrationTest::waitForMetricsRequest()':
2020-12-07T22:10:37.7509326Z test/extensions/stats_sinks/metrics_service/metrics_service_integration_test.cc:69:10: error: variable 'known_summary_exists' set but not used [-Werror=unused-but-set-variable]
2020-12-07T22:10:37.7510349Z 69 | bool known_summary_exists = false;
2020-12-07T22:10:37.7510693Z | ^~~~~~~~~~~~~~~~~~~~
2020-12-07T22:10:37.7511039Z cc1plus: all warnings being treated as errors

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤦 yeah sorry, a commit from earlier in the day didn't make it up here because of a failed git hook that I missed

@antoniovicente
Copy link
Copy Markdown
Contributor

I can repro the issue here by compiling on RBE with the remote-gcc target: it seems like the code hangs in waiting for data:

(gdb) bt
#0  0x00007ffff6f2c639 in syscall () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x0000555557c656d7 in absl::synchronization_internal::FutexImpl::WaitUntil (v=0x7ffff7fb8140, val=0, t=...) at external/com_google_absl/absl/synchronization/internal/futex.h:97
#2  0x0000555557c65459 in absl::synchronization_internal::Waiter::Wait (this=0x7ffff7fb8140, t=...) at external/com_google_absl/absl/synchronization/internal/waiter.cc:95
#3  0x0000555557c65191 in AbslInternalPerThreadSemWait (t=...) at external/com_google_absl/absl/synchronization/internal/per_thread_sem.cc:93
#4  0x0000555557c63a28 in absl::synchronization_internal::PerThreadSem::Wait (t=...) at external/com_google_absl/absl/synchronization/internal/per_thread_sem.h:112
#5  0x0000555557c63c48 in absl::Mutex::DecrementSynchSem (mu=0xf89bf054610, w=0x7ffff7fb8100, t=...) at external/com_google_absl/absl/synchronization/mutex.cc:568
#6  0x0000555557c5d9cf in absl::Mutex::Block (this=0xf89bf054610, s=0x7ffff7fb8100) at external/com_google_absl/absl/synchronization/mutex.cc:1116
#7  0x0000555557c5f2db in absl::Mutex::AwaitCommon (this=0xf89bf054610, cond=..., t=...) at external/com_google_absl/absl/synchronization/mutex.cc:1580
#8  0x0000555557c5f16e in absl::Mutex::AwaitWithDeadline (this=0xf89bf054610, cond=..., deadline=...) at external/com_google_absl/absl/synchronization/mutex.cc:1561
#9  0x0000555557c5f100 in absl::Mutex::AwaitWithTimeout (this=0xf89bf054610, cond=..., timeout=...) at external/com_google_absl/absl/synchronization/mutex.cc:1549
#10 0x00005555559d3f53 in Envoy::Event::TestTimeSystem::waitFor<std::chrono::duration<long, std::ratio<1l, 1000l> > > (this=0xf89bfd38400, mutex=..., condition=..., duration=...)
    at ./test/test_common/test_time_system.h:127
#11 0x00005555559dd8f4 in Envoy::(anonymous namespace)::waitForWithDispatcherRun(Envoy::Event::TestTimeSystem &, absl::Mutex &, const std::function<bool()> &, Envoy::Event::Dispatcher &, std::chrono::milliseconds) (time_system=..., lock=..., condition=..., client_dispatcher=..., timeout=...) at test/integration/fake_upstream.cc:156
#12 0x00005555559dd9ff in Envoy::FakeStream::waitForData (
    this=<error reading variable: DW_FORM_strp pointing outside of .debug_str section [in module /build/.cache/bazel/_bazel_vscode/2d35de14639eaad1ac7060a4dd7e3351/execroot/envoy/bazel-out/k8-dbg/bin/test/extensions/stats_sinks/metrics_service/metrics_service_integration_test.dwp]>, client_dispatcher=..., body_length=5, timeout=...)
    at test/integration/fake_upstream.cc:170

for whatever reason it doesn't seem to want to give me the full trace, seems like something is off with the .dwp file?

Interestingly it appears to hang on this assertion until this process receives SIGINT, at which point it seems to be able to actually time out:

test/extensions/stats_sinks/metrics_service/metrics_service_integration_test.cc:165: Failure
Value of: waitForMetricsRequest()
  Actual: false (Timed out waiting for start of gRPC message.)
Expected: true
Stack trace:
  0x55e278613932: Envoy::(anonymous namespace)::MetricsServiceIntegrationTest_BasicFlow_Test::TestBody()
  0x55e27a94e514: testing::internal::HandleSehExceptionsInMethodIfSupported<>()
  0x55e27a949bdb: testing::internal::HandleExceptionsInMethodIfSupported<>()
  0x55e27a932104: testing::Test::Run()
  0x55e27a932a71: testing::TestInfo::Run()
... Google Test internal frames ...

It's a bit hard to tell exactly what's going on with the broken debugging, so I'm not sure why a SIGINT makes the process stop spinning

From looking at the code it seems that the call to metrics_service_request_->waitForGrpcMessage is timing out after 20 seconds. The failure mode you hit seems different from the one from the most recent CI where the test timed out after 300+ seconds. It would be good to know the status of (known_counter_exists, known_gauge_exists, known_histogram_exists) after N attempts. Also, it may be worth a sleep between iterations so that the proxy has some time to do other things between probes.

Jose Nino added 3 commits December 7, 2020 20:02
fmt
Signed-off-by: Jose Nino <jnino@lyft.com>
Signed-off-by: Jose Nino <jnino@lyft.com>
Signed-off-by: Jose Nino <jnino@lyft.com>
@junr03
Copy link
Copy Markdown
Member Author

junr03 commented Dec 8, 2020

@antoniovicente thanks for your comment. It made me re-read the code carefully again and notice that this https://github.com/envoyproxy/envoy/pull/13919/files#diff-049bdaed2b6373b41b631030556ffa750151abac8d91526d515a61a6c5054b53R122 check was missing. I am guessing that in gcc the loop ordering was different than in clang. So in gcc the linked line was breaking out of the for loop into the while loop without getting to the summary metric. Thus keeping the while loop stuck forever until bazel times out the test.

Tests are now green

@junr03
Copy link
Copy Markdown
Member Author

junr03 commented Dec 8, 2020

@snowp ready for a final stamp.

@antoniovicente
Copy link
Copy Markdown
Contributor

@antoniovicente thanks for your comment. It made me re-read the code carefully again and notice that this https://github.com/envoyproxy/envoy/pull/13919/files#diff-049bdaed2b6373b41b631030556ffa750151abac8d91526d515a61a6c5054b53R122 check was missing. I am guessing that in gcc the loop ordering was different than in clang. So in gcc the linked line was breaking out of the for loop into the while loop without getting to the summary metric. Thus keeping the while loop stuck forever until bazel times out the test.

Tests are now green

Fantastic.

@snowp snowp merged commit 1a9d72e into envoyproxy:master Dec 9, 2020
mpuncel added a commit to mpuncel/envoy that referenced this pull request Dec 11, 2020
* master:
  buffer: Optimize the layout of Slices in Buffer::OwnedImpl by removing subclassing and storing slice info directly in the SliceDeque (envoyproxy#14282)
  gRPC client to be used by ext_proc filter (envoyproxy#14283)
  http2: Add integration tests for PRIORITY frame flood mitigation for upstream servers (envoyproxy#14328)
  event: touch watchdog before execution of each post callback and before deferred deletion (envoyproxy#14339)
  stale: more allowed ops (envoyproxy#14345)
  stale: more changes (envoyproxy#14344)
  test: TODO fixup making enable_half_close private envoyproxy#14330)
  event: Reduce potential for lock contention while executing dispatcher post callbacks. (envoyproxy#14289)
  stale: fix config (envoyproxy#14337)
  metrics service sink: generalize the sink and grpc streamer for external use (envoyproxy#13919)
  wasm: update V8 to v8.8.278.8. (envoyproxy#14298)
  repo: switch to actions based stale bot (envoyproxy#14335)
  buffer: Use WatermarkFactory to create most WatermarkBuffer instances (envoyproxy#14256)

Signed-off-by: Michael Puncel <mpuncel@squareup.com>
junr03 pushed a commit to envoyproxy/envoy-mobile that referenced this pull request Feb 3, 2021
Description: Replace the metrics sink from the envoy one to a custom one. The custom metrics sink comes with ack function on the grpc stream. This change is based on upstream envoy change: envoyproxy/envoy#13919
Risk Level: High
Testing: Local and ci unit tests and integration tests

Signed-off-by: Jingwei Hao <jingweih@lyft.com>
jpsim pushed a commit that referenced this pull request Nov 28, 2022
Description: Replace the metrics sink from the envoy one to a custom one. The custom metrics sink comes with ack function on the grpc stream. This change is based on upstream envoy change: #13919
Risk Level: High
Testing: Local and ci unit tests and integration tests

Signed-off-by: Jingwei Hao <jingweih@lyft.com>
Signed-off-by: JP Simard <jp@jpsim.com>
jpsim pushed a commit that referenced this pull request Nov 29, 2022
Description: Replace the metrics sink from the envoy one to a custom one. The custom metrics sink comes with ack function on the grpc stream. This change is based on upstream envoy change: #13919
Risk Level: High
Testing: Local and ci unit tests and integration tests

Signed-off-by: Jingwei Hao <jingweih@lyft.com>
Signed-off-by: JP Simard <jp@jpsim.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants