quic: add QUIC downstream connection close error stats.#16584
quic: add QUIC downstream connection close error stats.#16584mattklein123 merged 18 commits intoenvoyproxy:mainfrom
Conversation
Signed-off-by: Renjie Tang <renjietang@google.com>
|
/assign @alyssawilk |
alyssawilk
left a comment
There was a problem hiding this comment.
Hm.
At first pass my thought was that all our existing http3 stats are rooted in the HCM, and it'd be nice to be consistent here.
I'm not sure if that's the right thing here, because of the L4/L7 thing, but given that for TCP we don't have the stats in the connection, but we have the tcp network filter or the HCM do the accounting, I think a utility we can reuse when we add other protocols plus wiring it up in the HCM might be the right thing. Tagging @jmarantz for a second opinion as stats are more his area.
If we go that route, I think we could stash the close details in connection level stream info as one of those generic extension points, and pull it out in the HCM if it's present. We've already got a few h3 stats in the HCM which we don't bother compiling out, so as long as we added lazily I think it wouldn't be a problem.
|
I saw that HttpConnectionManagerConfig owns http3_codec_stats_ and passes it down to QuicHttpServerConnectionImpl. |
|
clang-tidy error needs attention: |
Signed-off-by: Renjie Tang <renjietang@google.com>
Signed-off-by: Renjie Tang <renjietang@google.com>
done |
jmarantz
left a comment
There was a problem hiding this comment.
This generally looks great! I just have a few comments about making the stats contexts parallel those for HTTP and GRPC for increased consistency and reduced number of args at various interface-points.
source/common/quic/quic_stats.cc
Outdated
| connection_close}); | ||
| } | ||
|
|
||
| Stats::StatName QuicStats::ConnectionCloseStatName(quic::QuicErrorCode error_code) { |
There was a problem hiding this comment.
connectionCloseStatName per Envoy style guide
source/server/server.cc
Outdated
|
|
||
| restarter_.initialize(*dispatcher_, *this); | ||
| drain_manager_ = component_factory.createDrainManager(*this); | ||
| #ifdef ENVOY_ENABLE_QUIC |
There was a problem hiding this comment.
is there also a runtime flag that should be checked? Or it's ifdef-only?
There was a problem hiding this comment.
The ifdef alone should suffice.
There was a problem hiding this comment.
Is there a configuration option for Quic? Or you have to make the decision about whether to support it at compile-time?
There was a problem hiding this comment.
The QUIC support is decided at compile time and is controlled by --//bazel:http3=True/False
source/server/listener_impl.h
Outdated
| // callback during the destroy of ListenerImpl. | ||
| Init::WatcherImpl local_init_watcher_; | ||
|
|
||
| Quic::QuicStats* quic_stats_; |
There was a problem hiding this comment.
Envoy style here is to use OptRef<Quic::QuicStats> here rather than a possibly null pointer.
See include/envoy/common/optref.h
| Network::ListenerConfig& listener_config, const quic::QuicConfig& quic_config, | ||
| Network::Socket::OptionsSharedPtr options, bool kernel_worker_routing, | ||
| const envoy::config::core::v3::RuntimeFeatureFlag& enabled) | ||
| const envoy::config::core::v3::RuntimeFeatureFlag& enabled, QuicStats* quic_stats) |
There was a problem hiding this comment.
I'd suggest bundling QuicStats and quic::QuicConfig into a class Quic::Context, which can parallel Http::Context in most ways.
That will cut down by one on the arg-count in a few interfaces in this PR, and give you other good places to put stuff that may need to be shared across workers.
Moreover, Quic::Context (or maybe OptRef<Quic::Context>) could then be returned from Server::FactoryContext, also parallel to Http::Context and Grpc::Context, cutting down on how much plumbing through needs to happen for those individually.
There was a problem hiding this comment.
I originally tried to make QuicStats behave similarly to Http::Context but found it to be quite complicated.
To have the server return QuicStats, we need a pure interface class so that Server::Instance can have a getter interface Quic::QuicStats quicStats().
However, the only public method in QuicStats, chargeQuicConnectionCloseStats(), is dependent on QuicErrorCode, which is defined in third_party library Quiche.
In my understanding, Envoy must to be able to build without quiche. And we probably shouldn't introduce external dependency in the include/ directory either.
Having this dependency makes it hard to follow existing pattern when implementing Quic stats.
There was a problem hiding this comment.
On this point, to match the pattern in the other Contexts is to forward-declare QuicStats without defining an interface for it.
So you'd have
namespace quic { class QuicConfig; }
namespace Envoy {
namespace Quic {
class QuicStats;
class Context {
public:
virtual QuicStats& stats() PURE:
virtual quic::QuicConfig& config() PURE;
};
...
So you don't need to expose the error-code type or other types at the interface.
| Network::ListenerConfig& listener_config, Server::ListenerStats& listener_stats, | ||
| Server::PerHandlerListenerStats& per_worker_stats, Event::Dispatcher& dispatcher, | ||
| Network::Socket& listen_socket) | ||
| Network::Socket& listen_socket, QuicStats* quic_stats) |
There was a problem hiding this comment.
per above, pass through QuicContext rather than QuicStats and QuicConfig separately.
| listener_stats_.downstream_cx_active_.dec(); | ||
| per_worker_stats_.downstream_cx_active_.dec(); | ||
| connection_handler_.decNumConnections(); | ||
| if (quic_stats_) { |
There was a problem hiding this comment.
I think you should just make QuicStats unconditional at this level. I don't see the problem with instantiating QuicStats in the places below where you are currently passing nullptr.
RenjieTang
left a comment
There was a problem hiding this comment.
Thanks for the review Joshua! Glad I got most of the QuicStats class right.
I left some in-place comments on the ownership and plumbing-through of QuicStats.
In general, I think it might not be the best choice to have the Server::InstanceImpl own QuicStats and pass it down all the way.
Alyssa mentioned in comments that a lot of existing QUIC stats live in HttpConnectionManagerConfig. I tried placing QuicStats there and found things to be a lot cleaner. The demo PR can be found at #16625. Let me know what you think on that.
| Network::ListenerConfig& listener_config, const quic::QuicConfig& quic_config, | ||
| Network::Socket::OptionsSharedPtr options, bool kernel_worker_routing, | ||
| const envoy::config::core::v3::RuntimeFeatureFlag& enabled) | ||
| const envoy::config::core::v3::RuntimeFeatureFlag& enabled, QuicStats* quic_stats) |
There was a problem hiding this comment.
I originally tried to make QuicStats behave similarly to Http::Context but found it to be quite complicated.
To have the server return QuicStats, we need a pure interface class so that Server::Instance can have a getter interface Quic::QuicStats quicStats().
However, the only public method in QuicStats, chargeQuicConnectionCloseStats(), is dependent on QuicErrorCode, which is defined in third_party library Quiche.
In my understanding, Envoy must to be able to build without quiche. And we probably shouldn't introduce external dependency in the include/ directory either.
Having this dependency makes it hard to follow existing pattern when implementing Quic stats.
source/server/server.cc
Outdated
|
|
||
| restarter_.initialize(*dispatcher_, *this); | ||
| drain_manager_ = component_factory.createDrainManager(*this); | ||
| #ifdef ENVOY_ENABLE_QUIC |
There was a problem hiding this comment.
The ifdef alone should suffice.
jmarantz
left a comment
There was a problem hiding this comment.
I'll check out the other PR a bit later.
| Network::ListenerConfig& listener_config, const quic::QuicConfig& quic_config, | ||
| Network::Socket::OptionsSharedPtr options, bool kernel_worker_routing, | ||
| const envoy::config::core::v3::RuntimeFeatureFlag& enabled) | ||
| const envoy::config::core::v3::RuntimeFeatureFlag& enabled, QuicStats* quic_stats) |
There was a problem hiding this comment.
On this point, to match the pattern in the other Contexts is to forward-declare QuicStats without defining an interface for it.
So you'd have
namespace quic { class QuicConfig; }
namespace Envoy {
namespace Quic {
class QuicStats;
class Context {
public:
virtual QuicStats& stats() PURE:
virtual quic::QuicConfig& config() PURE;
};
...
So you don't need to expose the error-code type or other types at the interface.
Can you explain the reasoning? |
I was mainly concerned that having QuicContext accessible in Server::Instance will require getter implementation in subclasses that don't really need it, for example Server::ValidationInstance. If having QuicStats in HCM works just as well, I'd prefer to go that route so that the code is cleaner. |
ListenerManagerImpl. Signed-off-by: Renjie Tang <renjietang@google.com>
Signed-off-by: Renjie Tang <renjietang@google.com>
Signed-off-by: Renjie Tang <renjietang@google.com>
jmarantz
left a comment
There was a problem hiding this comment.
this is looking much better. I think we can just get rid of the one OptRef.
| } | ||
|
|
||
| #ifdef ENVOY_ENABLE_QUIC | ||
| new_listener->setQuicStatNames(quic_stat_names_); |
There was a problem hiding this comment.
As above I think ListenerImpl can just grab the QuicStatNames& object from the manager.
There was a problem hiding this comment.
Or...if that involves changes to the base class you'd prefer not to do, you can just pass in the stats to the constructor rather than setting it after construction.
That just means you need to ifdef the arg-list.
There was a problem hiding this comment.
Fortunately ListenerImpl takes ListenerManagerImpl in its constructors. So I added getter in ListenerManagerImpl and let ListenerImpl grab it during constructions.
Signed-off-by: Renjie Tang <renjietang@google.com>
|
/retest |
|
Retrying Azure Pipelines: |
jmarantz
left a comment
There was a problem hiding this comment.
@envoyproxy/senior-maintainers
thanks. for pushing this all the way through. nice job!
|
/assign-from @envoyproxy/senior-maintainers |
alyssawilk
left a comment
There was a problem hiding this comment.
Awesome - these are going to be super useful!
Signed-off-by: Renjie Tang <renjietang@google.com>
Signed-off-by: Renjie Tang <renjietang@google.com>
Signed-off-by: Renjie Tang <renjietang@google.com>
alyssawilk
left a comment
There was a problem hiding this comment.
One naming nit, and I think this is good to go from my point of view.
Adding Matt for a non-google pass since stats are forever :-P
| :header: Name, Type, Description | ||
| :widths: 1, 1, 2 | ||
|
|
||
| [direction].[source].quic_connection_close_error_code_[error_code], Counter, A collection of counters that are lazily initialized to record each quic connection close error code that's present. direction could be *upstream* or *downstream*. source could be *self* or *peer*. |
There was a problem hiding this comment.
If we're doccing up listener stats, won't direction always be downstream?
And actually I hate to catch this now, but probably rather than self/peer it should be tx/rx for consistency with other Envoy stats. I find self and peer more intuitive (maybe 'cause that's what I'm used to) but I think most of the other stats (like resets tx_reset/rx_reset) are documented that way with tx for things sent to the peer and rx being codes received from the peer.
There was a problem hiding this comment.
You are right. The listener stats should be all downstreams.
I haven't figured out the details on upstream stats. So I will update the doc again when I add them.
Done changing names to tx/rx.
Signed-off-by: Renjie Tang <renjietang@google.com>
Signed-off-by: Renjie Tang <renjietang@google.com>
|
Check CI? /wait |
Signed-off-by: Renjie Tang <renjietang@google.com>
|
@mattklein123 The flaky tests are now fixed. PTAL. |
mattklein123
left a comment
There was a problem hiding this comment.
Thanks LGTM with some small comments.
/wait
| // TODO(renjietang): Currently these stats are only available in downstream. Wire it up to upstream | ||
| // QUIC also. |
There was a problem hiding this comment.
From a quick read through it's not clear to me that this won't create an unbounded number of stats controlled by downstream (for exampled if the name somehow has the code number in it and the client controls the codes). This is probably not the case, but can you add more comments about why this is safe for use with uncontrolled peers?
There was a problem hiding this comment.
Great point!
I initially had an assertion in QuicStatNames::connectionCloseStatName() to make sure we don't create unlimited amount of stats.
But after some investigation, I found that if the connection close is initiated from peers, the error_code is parsed from the wire and no enum range checking is done. So an assertion here is too strong and malicious clients might be able to attack Envoy by sending out-of-range error codes.
Instead, I now added a check in QuicStatNames::chargeQuicConnectionCloseStats() to ignore out-of-range error_codes and log a warning.
I will follow up to dig deeper into the quiche code and see if bad error code handling can be done earlier.
Signed-off-by: Renjie Tang <renjietang@google.com>
| ASSERT(&symbol_table_ == &scope.symbolTable()); | ||
|
|
||
| if (error_code >= quic::QUIC_LAST_ERROR) { | ||
| ENVOY_LOG(warn, fmt::format("Error code {} is out of range of QuicErrorCodes.", error_code)); |
There was a problem hiding this comment.
Can you add a test case for this? Also, I think warning here can log spam. I would recommend making this debug level and potentially also adding a stat for unknown error code or something like that. Thank you.
/wait
There was a problem hiding this comment.
done.
I utilized quic::QUIC_LAST_ERROR_CODE as a stat for all out of range codes.
Signed-off-by: Renjie Tang <renjietang@google.com>
Signed-off-by: Renjie Tang <renjietang@google.com>
…6584) Signed-off-by: Renjie Tang <renjietang@google.com>
Signed-off-by: Renjie Tang renjietang@google.com
Commit Message: Introduce QuicStatNames and add Quic connection close stats for downstream connections.
Risk Level: Low
Testing: unit tests
Docs Changes: n/a
Release Notes: n/a.
Platform Specific Features: n/a.