ext_auth: add option to measure timeout when check request is created.#12778
ext_auth: add option to measure timeout when check request is created.#12778htuch merged 22 commits intoenvoyproxy:masterfrom
Conversation
Signed-off-by: Yuval Kohavi <yuval.kohavi@gmail.com>
Signed-off-by: Yuval Kohavi <yuval.kohavi@gmail.com>
|
/retest |
|
Retrying Azure Pipelines, to retry CircleCI checks, use |
Signed-off-by: Yuval Kohavi <yuval.kohavi@gmail.com>
Signed-off-by: Yuval Kohavi <yuval.kohavi@gmail.com>
Signed-off-by: Yuval Kohavi <yuval.kohavi@gmail.com>
| // difference between the two, if requests to the ext authz server become pending due to circuit | ||
| // breaking (i.e. by default, the duration while the request is pending is not counted towards | ||
| // the configured timeout). This is ignored for google_grpc service type, as it does not use the envoy | ||
| // io facilities. |
There was a problem hiding this comment.
Another way of approaching this would be to change the existing behavior (under runtime flag guard) to start the timeout from when the request is created, arguably this is the behavior end users expect. We've had a recent discussion with some of the gRPC folks on some other timeouts that fit this pattern (@markdroth @dfawley).
In your proposal, you add an internal timer; if we use this exclusively, this would avoid having to have end users worry about the difference between Envoy and Google gRPC implementations.
There was a problem hiding this comment.
Thanks for the feedback @htuch !
i agree that using the runtime is preferable. the only reason i added it as a config option is for backwards compatibility.
are there any other stake holders in the community that you think might object? or should I go ahead and change this?
should the new behavior (internal timer) be the default, with runtime flag to revert? or the other way around?
There was a problem hiding this comment.
My vote would be runtime flag, default on, since most folks won't be relying on behavior that occurs during circuit breaker activation. That said, it is kind of hard to reason about the ideal behavior here under circuit breaker overload; currently we queue and drop when the queue is too long, we'll be encouraging more drop due to timeouts. @mattklein123 WDYT?
There was a problem hiding this comment.
Unless there are objections - I will adjust the code to perform the new behavior by default, with runtime flag to turn off.
There was a problem hiding this comment.
My vote would be runtime flag, default on, since most folks won't be relying on behavior that occurs during circuit breaker activation. That said, it is kind of hard to reason about the ideal behavior here under circuit breaker overload; currently we queue and drop when the queue is too long, we'll be encouraging more drop due to timeouts. @mattklein123 WDYT?
Sorry. I very much missed this ping and I wished I hadn't. FWIW I would have probably figured out how to make this change in a more core way (potentially via a different optional param fed to the async client), mainly so that it would apply to other side calls also. Since I think ratelimit and other services would face the same problem. We can potentially consider doing this in a follow up.
Signed-off-by: Yuval Kohavi <yuval.kohavi@gmail.com>
Signed-off-by: Yuval Kohavi <yuval.kohavi@gmail.com>
|
@htuch, this is ready for another pass |
htuch
left a comment
There was a problem hiding this comment.
Looks great, a few comments..
/wait
| RawAsyncStreamCallbacks& callbacks, | ||
| const Http::AsyncClient::StreamOptions& options) override; | ||
| // This does not use the worker thread's dispatcher, so we must return nullptr. | ||
| Event::Dispatcher* dispatcher() override { return nullptr; } |
There was a problem hiding this comment.
I don't think you need to plumb the dispatcher from the gRPC client. The filter should be able to obtain this from stream decoder callbacks, see
envoy/include/envoy/http/filter.h
Line 174 in 419428a
There was a problem hiding this comment.
I can do this by changing the client check (here:
) and add a dispatcher argument.I thought it was cleaner to not introduce a dispatcher arg to the check request:
- interface remains the same
- to my understanding, we still want to use the google client's native timeout? if so, i won't have a way to tell them google and native client apart (i.e. they will both use a dispatcher for timeout)
Would be happy to to change this if you still think it's the better way to go.
There was a problem hiding this comment.
- It's fine to add to the
checkargs a dispatcher ref. It could also be done at client creation time, but I think we will make this TLS in the near future, so fine to add tocheck. - Not sure I fully grok how the Google client's native timeout interacts with the choice of how we plumb dispatcher.
There was a problem hiding this comment.
- sounds good
- The auth-client doesn't know if it has a google grpc client or native grpc client. if it gets a dispatcher in the check method args, it will always use it for timeout purposes (as it can't tell the grpc clients apart). If an optional dispatcher is returned from the grpc client, the auth-client will only use it if it exists (i.e. will not use it for google grpc, instead it will set the request options timeout).
There was a problem hiding this comment.
I see on (2). Two options I see:
A. Avoid using Google gRPC client timeout, just do a single local timeout in ext_authz no matter what.
B. Fix Envoy gRPC client to do a timeout consistent with Google gRPC. I'm fine punting this to a TODO for the future. It seems this would reduce confusion systematically though.
There was a problem hiding this comment.
sounds good! thank you for the feedback
| context.runtime(), context.httpContext(), stats_prefix); | ||
| Http::FilterFactoryCb callback; | ||
|
|
||
| bool internal_timeout = Runtime::runtimeFeatureEnabled( |
There was a problem hiding this comment.
Nit: const bool and maybe const bool timeout_starts_at_check_creation
| client_->cancel(); | ||
| EXPECT_EQ(timer_destroyed, true); | ||
| } | ||
|
|
There was a problem hiding this comment.
Should we have a full integration test in test/extensions/filters/http/ext_authz/ext_authz_integration_test.cc etc?
Signed-off-by: Yuval Kohavi <yuval.kohavi@gmail.com>
|
/retest |
|
Retrying Azure Pipelines, to retry CircleCI checks, use |
Signed-off-by: Yuval Kohavi <yuval.kohavi@gmail.com>
…base class Signed-off-by: Yuval Kohavi <yuval.kohavi@gmail.com>
Signed-off-by: Yuval Kohavi <yuval.kohavi@gmail.com>
Signed-off-by: Yuval Kohavi <yuval.kohavi@gmail.com>
htuch
left a comment
There was a problem hiding this comment.
Looks great, neat. A few comments/questions but seems basically ready to ship.
| if (timeoutStartsAtCheckCreation()) { | ||
| // TODO(yuval-k): We currently use dispatcher based timeout even if the underlying client is | ||
| // google gRPC client, which has it's own timeout mechanism. We may want to change that if | ||
| // using the google gRPC client timeout is more efficient. |
There was a problem hiding this comment.
I'm thinking it's not really going to be an efficiency call, maybe it should really be that one day we will hopefully converge the two implementations of gRPC in Envoy, but that's going to be far off AFAICT. FYI @markdroth
Signed-off-by: Yuval Kohavi <yuval.kohavi@gmail.com>
Signed-off-by: Yuval Kohavi <yuval.kohavi@gmail.com>
Signed-off-by: Yuval Kohavi <yuval.kohavi@gmail.com>
Signed-off-by: Yuval Kohavi <yuval.kohavi@gmail.com>
|
i'm seeing weird clang tidy failures: not sure what I'm doing wrong here :/ full log: https://dev.azure.com/cncf/4684fb3d-0389-4e0b-8251-221942316e06/_apis/build/builds/51528/logs/90 |
|
my usage of EXPECT_LOG_CONTAINS breaks tsan; tsan errorWARNING: ThreadSanitizer: data race (pid=15)
Write of size 8 at 0x7fffef749828 by thread T10 (mutexes: read M2779, write M0):
#0 std::__1::vector, std::__1::allocator >, std::__1::allocator, std::__1::allocator > > >::_ConstructTransaction::~_ConstructTransaction() /opt/llvm/bin/../include/c++/v1/vector:903:19 (ext_authz_integration_test+0x2d7287a)
#1 void std::__1::vector, std::__1::allocator >, std::__1::allocator, std::__1::allocator > > >::__construct_one_at_end, std::__1::allocator > >(std::__1::basic_string, std::__1::allocator >&&) /opt/llvm/bin/../include/c++/v1/vector:927:3 (ext_authz_integration_test+0x2d7249e)
#2 std::__1::vector, std::__1::allocator >, std::__1::allocator, std::__1::allocator > > >::push_back(std::__1::basic_string, std::__1::allocator >&&) /opt/llvm/bin/../include/c++/v1/vector:1652:9 (ext_authz_integration_test+0x2d712ae)
#3 Envoy::LogRecordingSink::log(std::__1::basic_string_view >) /proc/self/cwd/test/test_common/logging.cc:37:13 (ext_authz_integration_test+0x599f723)
#4 Envoy::Logger::DelegatingLogSink::log(spdlog::details::log_msg const&) /proc/self/cwd/source/common/common/logger.cc:90:12 (ext_authz_integration_test+0x818fde2)
#5 spdlog::logger::sink_it_(spdlog::details::log_msg const&) /proc/self/cwd/external/com_github_gabime_spdlog/include/spdlog/logger-inl.h:186:23 (ext_authz_integration_test+0x2c8b3ab)
#6 spdlog::logger::log_it_(spdlog::details::log_msg const&, bool, bool) /proc/self/cwd/external/com_github_gabime_spdlog/include/spdlog/logger-inl.h:170:9 (ext_authz_integration_test+0x2cdb37a)
#7 void spdlog::logger::log_, unsigned long, unsigned long>(spdlog::source_loc, spdlog::level::level_enum, fmt::v7::basic_string_view const&, unsigned long const&, unsigned long const&) /proc/self/cwd/external/com_github_gabime_spdlog/include/spdlog/logger.h:342:13 (ext_authz_integration_test+0x2e86b18)
#8 void spdlog::logger::log(spdlog::source_loc, spdlog::level::level_enum, fmt::v7::basic_string_view, unsigned long const&, unsigned long const&) /proc/self/cwd/external/com_github_gabime_spdlog/include/spdlog/logger.h:87:9 (ext_authz_integration_test+0x2e7522a)
#9 Envoy::Http::Legacy::Http2::ConnectionImpl::onFrameReceived(nghttp2_frame const*) /proc/self/cwd/source/common/http/http2/codec_impl_legacy.cc:656:3 (ext_authz_integration_test+0x5456057)
#10 Envoy::Http::Legacy::Http2::ConnectionImpl::Http2Callbacks::Http2Callbacks()::$_15::operator()(nghttp2_session*, nghttp2_frame const*, void*) const /proc/self/cwd/source/common/http/http2/codec_impl_legacy.cc:1158:57 (ext_authz_integration_test+0x5463620)
#11 Envoy::Http::Legacy::Http2::ConnectionImpl::Http2Callbacks::Http2Callbacks()::$_15::__invoke(nghttp2_session*, nghttp2_frame const*, void*) /proc/self/cwd/source/common/http/http2/codec_impl_legacy.cc:1157:19 (ext_authz_integration_test+0x5463598)
#12 session_call_on_frame_received /b/f/w/external/com_github_nghttp2_nghttp2/lib/nghttp2_session.c:3299:10 (ext_authz_integration_test+0x65127ae)
#13 nghttp2_session_on_settings_received /b/f/w/external/com_github_nghttp2_nghttp2/lib/nghttp2_session.c:4422:12 (ext_authz_integration_test+0x65137a6)
#14 session_process_settings_frame /b/f/w/external/com_github_nghttp2_nghttp2/lib/nghttp2_session.c:4588:10 (ext_authz_integration_test+0x652380a)
#15 nghttp2_session_mem_recv /b/f/w/external/com_github_nghttp2_nghttp2/lib/nghttp2_session.c:6332:12 (ext_authz_integration_test+0x651d088)
#16 Envoy::Http::Legacy::Http2::ConnectionImpl::innerDispatch(Envoy::Buffer::Instance&) /proc/self/cwd/source/common/http/http2/codec_impl_legacy.cc:574:9 (ext_authz_integration_test+0x54544be)
#17 Envoy::Http::Legacy::Http2::ServerConnectionImpl::innerDispatch(Envoy::Buffer::Instance&) /proc/self/cwd/source/common/http/http2/codec_impl_legacy.cc:1484:26 (ext_authz_integration_test+0x5461b34)
#18 Envoy::Http::Legacy::Http2::ServerConnectionImpl::dispatch(Envoy::Buffer::Instance&)::$_25::operator()(Envoy::Buffer::Instance&) const /proc/self/cwd/source/common/http/http2/codec_impl_legacy.cc:1470:60 (ext_authz_integration_test+0x5478d44)
#19 decltype(std::__1::forward(fp)(std::__1::forward(fp0))) std::__1::__invoke(Envoy::Http::Legacy::Http2::ServerConnectionImpl::dispatch(Envoy::Buffer::Instance&)::$_25&, Envoy::Buffer::Instance&) /opt/llvm/bin/../include/c++/v1/type_traits:3539:1 (ext_authz_integration_test+0x5478c54)
#20 absl::Status std::__1::__invoke_void_return_wrapper::__call(Envoy::Http::Legacy::Http2::ServerConnectionImpl::dispatch(Envoy::Buffer::Instance&)::$_25&, Envoy::Buffer::Instance&) /opt/llvm/bin/../include/c++/v1/__functional_base:317:16 (ext_authz_integration_test+0x5478b84)
#21 std::__1::__function::__alloc_func, absl::Status (Envoy::Buffer::Instance&)>::operator()(Envoy::Buffer::Instance&) /opt/llvm/bin/../include/c++/v1/functional:1540:16 (ext_authz_integration_test+0x5478af4)
#22 std::__1::__function::__func, absl::Status (Envoy::Buffer::Instance&)>::operator()(Envoy::Buffer::Instance&) /opt/llvm/bin/../include/c++/v1/functional:1714:12 (ext_authz_integration_test+0x5476d93)
#23 std::__1::__function::__value_func::operator()(Envoy::Buffer::Instance&) const /opt/llvm/bin/../include/c++/v1/functional:1867:16 (ext_authz_integration_test+0x4382d5b)
#24 std::__1::function::operator()(Envoy::Buffer::Instance&) const /opt/llvm/bin/../include/c++/v1/functional:2473:12 (ext_authz_integration_test+0x4382c6f)
#25 Envoy::Http::Utility::exceptionToStatus(std::__1::function, Envoy::Buffer::Instance&) /proc/self/cwd/source/common/http/utility.cc:41:14 (ext_authz_integration_test+0x5f64aa0)
#26 Envoy::Http::Legacy::Http2::ServerConnectionImpl::dispatch(Envoy::Buffer::Instance&) /proc/self/cwd/source/common/http/http2/codec_impl_legacy.cc:1469:10 (ext_authz_integration_test+0x54615e2)
#27 Envoy::Http::ConnectionManagerImpl::onData(Envoy::Buffer::Instance&, bool) /proc/self/cwd/source/common/http/conn_manager_impl.cc:281:35 (ext_authz_integration_test+0x4ec4205)
#28 Envoy::Network::FilterManagerImpl::onContinueReading(Envoy::Network::FilterManagerImpl::ActiveReadFilter*, Envoy::Network::ReadBufferSource&) /proc/self/cwd/source/common/network/filter_manager_impl.cc:66:48 (ext_authz_integration_test+0x5be5ce8)
#29 Envoy::Network::FilterManagerImpl::onRead() /proc/self/cwd/source/common/network/filter_manager_impl.cc:76:3 (ext_authz_integration_test+0x5be5fe1)
#30 Envoy::Network::ConnectionImpl::onRead(unsigned long) /proc/self/cwd/source/common/network/connection_impl.cc:290:19 (ext_authz_integration_test+0x5bba8b2)
#31 Envoy::Network::ConnectionImpl::onReadReady() /proc/self/cwd/source/common/network/connection_impl.cc:572:5 (ext_authz_integration_test+0x5bc1bba)
#32 Envoy::Network::ConnectionImpl::onFileEvent(unsigned int) /proc/self/cwd/source/common/network/connection_impl.cc:532:5 (ext_authz_integration_test+0x5bbf38f)
#33 Envoy::Network::ConnectionImpl::ConnectionImpl(Envoy::Event::Dispatcher&, std::__1::unique_ptr >&&, std::__1::unique_ptr >&&, Envoy::StreamInfo::StreamInfo&, bool)::$_6::operator()(unsigned int) const /proc/self/cwd/source/common/network/connection_impl.cc:70:54 (ext_authz_integration_test+0x5bd6707)
#34 decltype(std::__1::forward >&&, std::__1::unique_ptr >&&, Envoy::StreamInfo::StreamInfo&, bool)::$_6&>(fp)(std::__1::forward(fp0))) std::__1::__invoke >&&, std::__1::unique_ptr >&&, Envoy::StreamInfo::StreamInfo&, bool)::$_6&, unsigned int>(Envoy::Network::ConnectionImpl::ConnectionImpl(Envoy::Event::Dispatcher&, std::__1::unique_ptr >&&, std::__1::unique_ptr >&&, Envoy::StreamInfo::StreamInfo&, bool)::$_6&, unsigned int&&) /opt/llvm/bin/../include/c++/v1/type_traits:3539:1 (ext_authz_integration_test+0x5bd664c)
#35 void std::__1::__invoke_void_return_wrapper::__call >&&, std::__1::unique_ptr >&&, Envoy::StreamInfo::StreamInfo&, bool)::$_6&, unsigned int>(Envoy::Network::ConnectionImpl::ConnectionImpl(Envoy::Event::Dispatcher&, std::__1::unique_ptr >&&, std::__1::unique_ptr >&&, Envoy::StreamInfo::StreamInfo&, bool)::$_6&, unsigned int&&) /opt/llvm/bin/../include/c++/v1/__functional_base:348:9 (ext_authz_integration_test+0x5bd657d)
#36 std::__1::__function::__alloc_func >&&, std::__1::unique_ptr >&&, Envoy::StreamInfo::StreamInfo&, bool)::$_6, std::__1::allocator >&&, std::__1::unique_ptr >&&, Envoy::StreamInfo::StreamInfo&, bool)::$_6>, void (unsigned int)>::operator()(unsigned int&&) /opt/llvm/bin/../include/c++/v1/functional:1540:16 (ext_authz_integration_test+0x5bd64fd)
#37 std::__1::__function::__func >&&, std::__1::unique_ptr >&&, Envoy::StreamInfo::StreamInfo&, bool)::$_6, std::__1::allocator >&&, std::__1::unique_ptr >&&, Envoy::StreamInfo::StreamInfo&, bool)::$_6>, void (unsigned int)>::operator()(unsigned int&&) /opt/llvm/bin/../include/c++/v1/functional:1714:12 (ext_authz_integration_test+0x5bd47ac)
#38 std::__1::__function::__value_func::operator()(unsigned int&&) const /opt/llvm/bin/../include/c++/v1/functional:1867:16 (ext_authz_integration_test+0x43c1684)
#39 std::__1::function::operator()(unsigned int) const /opt/llvm/bin/../include/c++/v1/functional:2473:12 (ext_authz_integration_test+0x43c1578)
#40 Envoy::Event::FileEventImpl::mergeInjectedEventsAndRunCb(unsigned int) /proc/self/cwd/source/common/event/file_event_impl.cc:133:3 (ext_authz_integration_test+0x5ba3ea9)
#41 Envoy::Event::FileEventImpl::assignEvents(unsigned int, event_base*)::$_1::operator()(int, short, void*) const /proc/self/cwd/source/common/event/file_event_impl.cc:106:16 (ext_authz_integration_test+0x5ba419d)
#42 Envoy::Event::FileEventImpl::assignEvents(unsigned int, event_base*)::$_1::__invoke(int, short, void*) /proc/self/cwd/source/common/event/file_event_impl.cc:90:7 (ext_authz_integration_test+0x5ba3f26)
#43 event_persist_closure /b/f/w/external/com_github_libevent_libevent/event.c:1645:9 (ext_authz_integration_test+0x810c49f)
#44 event_process_active_single_queue /b/f/w/external/com_github_libevent_libevent/event.c:1704:4 (ext_authz_integration_test+0x810adec)
#45 event_process_active /b/f/w/external/com_github_libevent_libevent/event.c:1805:9 (ext_authz_integration_test+0x80ff897)
#46 event_base_loop /b/f/w/external/com_github_libevent_libevent/event.c:2047:12 (ext_authz_integration_test+0x80fd24f)
#47 Envoy::Event::LibeventScheduler::run(Envoy::Event::Dispatcher::RunType) /proc/self/cwd/source/common/event/libevent_scheduler.cc:60:3 (ext_authz_integration_test+0x64f685b)
#48 Envoy::Event::DispatcherImpl::run(Envoy::Event::Dispatcher::RunType) /proc/self/cwd/source/common/event/dispatcher_impl.cc:204:19 (ext_authz_integration_test+0x5b8e671)
#49 Envoy::Server::WorkerImpl::threadRoutine(Envoy::Server::GuardDog&) /proc/self/cwd/source/server/worker_impl.cc:133:16 (ext_authz_integration_test+0x33ac5a5)
#50 Envoy::Server::WorkerImpl::start(Envoy::Server::GuardDog&)::$_4::operator()() const /proc/self/cwd/source/server/worker_impl.cc:99:38 (ext_authz_integration_test+0x33bac87)
#51 decltype(std::__1::forward(fp)()) std::__1::__invoke(Envoy::Server::WorkerImpl::start(Envoy::Server::GuardDog&)::$_4&) /opt/llvm/bin/../include/c++/v1/type_traits:3539:1 (ext_authz_integration_test+0x33babc0)
#52 void std::__1::__invoke_void_return_wrapper::__call(Envoy::Server::WorkerImpl::start(Envoy::Server::GuardDog&)::$_4&) /opt/llvm/bin/../include/c++/v1/__functional_base:348:9 (ext_authz_integration_test+0x33bab20)
#53 std::__1::__function::__alloc_func, void ()>::operator()() /opt/llvm/bin/../include/c++/v1/functional:1540:16 (ext_authz_integration_test+0x33baac0)
#54 std::__1::__function::__func, void ()>::operator()() /opt/llvm/bin/../include/c++/v1/functional:1714:12 (ext_authz_integration_test+0x33b8d8f)
#55 std::__1::__function::__value_func::operator()() const /opt/llvm/bin/../include/c++/v1/functional:1867:16 (ext_authz_integration_test+0x3005896)
#56 std::__1::function::operator()() const /opt/llvm/bin/../include/c++/v1/functional:2473:12 (ext_authz_integration_test+0x3005528)
#57 Envoy::Thread::ThreadImplPosix::ThreadImplPosix(std::__1::function, std::__1::optional const&)::'lambda'(void*)::operator()(void*) const /proc/self/cwd/source/common/common/posix/thread_impl.cc:49:11 (ext_authz_integration_test+0x80de94b)
#58 Envoy::Thread::ThreadImplPosix::ThreadImplPosix(std::__1::function, std::__1::optional const&)::'lambda'(void*)::__invoke(void*) /proc/self/cwd/source/common/common/posix/thread_impl.cc:48:9 (ext_authz_integration_test+0x80de8d8)
I'm trying to see how to fix this. if anyone else has an idea - feel free to send it my way |
|
@yuval-k not sure, but maybe the stats approach wouldn't be too much work and be more reliable here. |
Signed-off-by: Yuval Kohavi <yuval.kohavi@gmail.com>
|
adding the timeout stat was a bit more involved than I thought, but i think it's a good idea regardless; thanks @htuch |
Signed-off-by: Yuval Kohavi <yuval.kohavi@gmail.com>
|
/retest |
|
Retrying Azure Pipelines, to retry CircleCI checks, use |
|
Yep, tihs is a flake from catching bugs and failing a test run with the bug, I'm on it now |
|
/retest |
|
Retrying Azure Pipelines, to retry CircleCI checks, use |
htuch
left a comment
There was a problem hiding this comment.
LGTM, thanks! Thanks for iterating, this looks really solid.
… created. (envoyproxy#12778)" This reverts commit dfa85e8. Signed-off-by: Yuval Kohavi <yuval.kohavi@gmail.com>
… created. (envoyproxy#12778)" This reverts commit dfa85e8. Signed-off-by: Yuval Kohavi <yuval.kohavi@gmail.com>
Commit Message:
ext_auth: add option to measure timeout when check request is created. In the current default, the timeout timer starts when the request is active. If the request is temporary pending due to the circuit breaker, the overall timeout may exceed the time out configured.
Additional Description:
Currently timeout is handled by setting the RequestOptions on the async client. The async client starts the timeout timer when the request processing starts.
When circuit breakers to the ext authz cluster are configured, a request may be temporary in pending state (if the number of active requests is above the circuit breaker threshold). this period of time is not accounted for with the current timer, which can impact over-all latency.
To address this, I've added a new config option to http ext_authz filter:
measure_timeout_on_check_created. When set, the timeout timer will be started when the check request is created.Risk Level: Low (opt-in)
Testing: Unit tests added
Docs Changes: Inline proto docs
Release Notes: Release notes added.