server: factor out MainCommon as a class, with a run method#2568
server: factor out MainCommon as a class, with a run method#2568htuch merged 35 commits intoenvoyproxy:masterfrom
Conversation
… in main_common.cc. This sets the stage for another PR which will use the factored out MainCommon to performance test initialization without entering a main loop. Part of this PR is where we factor out all the initialization done prior to running the server loop into MainCommon's ctor. The listener loop is started by a run() method on the class. The orgaqnization is a little more complex than ideal because we want to, for one round of import, support the existing calling pattern of main_common, which leaves some of the initialization in main(). Another part of this CL is to fix a SEGV in the Server destruction process, if you ever instantiate a Server without running the listener loop and terminating via a signal or quitquitquit. Signed-off-by: Joshua Marantz <jmarantz@google.com>
Signed-off-by: Joshua Marantz <jmarantz@google.com>
Signed-off-by: Joshua Marantz <jmarantz@google.com>
Signed-off-by: Joshua Marantz <jmarantz@google.com>
Signed-off-by: Joshua Marantz <jmarantz@google.com>
Signed-off-by: Joshua Marantz <jmarantz@google.com>
Signed-off-by: Joshua Marantz <jmarantz@google.com>
Signed-off-by: Joshua Marantz <jmarantz@google.com>
…ain. Signed-off-by: Joshua Marantz <jmarantz@google.com>
Signed-off-by: Joshua Marantz <jmarantz@google.com>
Signed-off-by: Joshua Marantz <jmarantz@google.com>
Signed-off-by: Joshua Marantz <jmarantz@google.com>
Signed-off-by: Joshua Marantz <jmarantz@google.com>
|
@dnoe can you take a pass on this? |
Signed-off-by: Joshua Marantz <jmarantz@google.com>
Signed-off-by: Joshua Marantz <jmarantz@google.com>
|
I'm iterating a little bit on some issues with main_common_test.cc across CI which I understand. The rest of the system should be stable. |
…har 0. Signed-off-by: Joshua Marantz <jmarantz@google.com>
My theory is that main_common's signal handlers are affecting the death-tests when linked into signals_test, which appears to happen during coverage tests, but not when tests are run individually. Signed-off-by: Joshua Marantz <jmarantz@google.com>
…G_COVERAGE didn't seem to work. Signed-off-by: Joshua Marantz <jmarantz@google.com>
… test during CI coverage. Also removes a filegroup for the sample config now that I added one that uses port 0. Signed-off-by: Joshua Marantz <jmarantz@google.com>
|
TWIMC: I have managed to get this through CI by #if-ing out main_common_test, which works fine in all scenarios except for the coverage tests. My hypothesis is that when MainCommon instantiates and destructing SignalAction, and that sequence alters the global signal handler state so that signals_test can no longer run properly. In coverage test, because reasons, all the _test.cc files are linked together so we don't get a new process boundary. I also attempted to use #ifndef ENVOY_CONFIG_COVERAGE which made my coverage tests kind of run test/run_envoy_bazel_coverage.sh locally, but they timed out and didn't give me any data, so I'm not sure if that resolved the issue. Committing that didn't work -- CI still failed in coverage/SignalsTest the same way. I then tried committing "#if 0" and that worked. Now I'm trying #if !ENVOY_CONFIG_COVERAGE. I'd like to be able to include main_common_test in coverage. I suspect that requires diving deep into signal_action.h to understand why it isn't restoring state fully, which feels too deep for this PR. |
…ing. Signed-off-by: Joshua Marantz <jmarantz@google.com>
Signed-off-by: Joshua Marantz <jmarantz@google.com>
configs/google_com_proxy_port_0.json
Outdated
| @@ -0,0 +1,48 @@ | |||
| { | |||
There was a problem hiding this comment.
Best practice is to write new example configs in YAML; this the preferred human readable config representation.
configs/google_com_proxy_port_0.json
Outdated
| }], | ||
| "admin": { | ||
| "access_log_path": "/tmp/admin_access.log", | ||
| "address": "tcp://127.0.0.1:0" |
There was a problem hiding this comment.
Is there a reason we're using a v1 config here? In general, unless verifying v1 regression, we're trying to turn down v1 and only add new stuff as v2.
There was a problem hiding this comment.
I was being lazy and copied the first plausible config I found. Done.
| "//source/server/config_validation:server_lib", | ||
| ], | ||
| ] + select({ | ||
| "//bazel:disable_signal_trace": [], |
There was a problem hiding this comment.
Did you test with --define signal_trace=disabled? We don't have any CI tests to catch these variants of Bazel build, so when moving things around it's probably a good idea to manually verify.
There was a problem hiding this comment.
Great idea. Doing so now. Actually, why is this a compile-time decision rather than an option? Then it would be easier to test without rebuilding the entire tree. Same with HotRestart, though that's addressed in #2568.
There was a problem hiding this comment.
Some of these options are compile-time as we want to squash dependencies (e.g. Google gRPC) or demonstrate that site-local replacement or build time removal of features is possible (e.g. we replace during import on the Google side). Hot restart is probably a better contender to be fully dynamic, as it doesn't introduce any deps.
There's a wider conversation happening right now in #2069. Basically, we need some way to provide greater build time flexibility to allow very minimal Envoy binary deployments while also support kitchen sink builds with dynamic configuration. The former might be interesting in resource constrained embedded or container environments, the latter on a traditional fat server.
There was a problem hiding this comment.
I wasn't arguing against any compile-time switches, but it wasn't obvious what dependency this one had. I was thinking that if we were compiling on Windows and the APIs didn't exist, we'd just #if them out based on platform support as opposed to config, but keep the class API above that consistent.
There was a problem hiding this comment.
Yeah, the signal_trace deps are the Backtrace library, an external dependency.
source/exe/main_common.cc
Outdated
| } | ||
| }; | ||
| Server::DrainManagerPtr ProdComponentFactory::createDrainManager(Server::Instance& server) { | ||
| return Server::DrainManagerPtr{ |
There was a problem hiding this comment.
done (this code was kind of just moved, but fixed anyway).
source/server/server.cc
Outdated
|
|
||
| void InstanceImpl::terminate() { | ||
| { | ||
| std::unique_lock<std::mutex> lock(terminate_mutex_); |
There was a problem hiding this comment.
Why is this mutex protected? If we get here and we're not on the main thread, I think we're going to have a bad day, as most of the state that is touched below is local to the main thread.
There was a problem hiding this comment.
I didn't find a race, but I noticed a similar bool in ThreadLocalImpl::shutdown_ that was declared atomic. I was then going to make shutdown_ atomic but felt like the equivalent set-if-not-already-and-run-code flow was easier to read as a mutex.
I probably need to do some more debugging of a live server to understand which structures are live in which threads. Will investigate.
There was a problem hiding this comment.
I think shutdown is atomic as it's used for cross-thread communication, between main and workers. I think InstanceImp::terminate is only used on the main thread.
| #endif | ||
|
|
||
| std::unique_ptr<Envoy::OptionsImpl> options_; | ||
| MainCommonBase base_; |
There was a problem hiding this comment.
Curious why membership vs. inheritance was chosen here.
There was a problem hiding this comment.
I needed to compute the options_ before I could instantiate the Base, which takes Options& as a param for now, due to the legacy main_common. Hopefully this can all melt away once we remove our main_common call at Google.
source/exe/main_common.cc
Outdated
| #else | ||
| MainCommonBase main_common(options, false); | ||
| #endif | ||
| return main_common.run() ? 0 : 1; |
There was a problem hiding this comment.
I'm a slight fan of EXIT_SUCCESS and EXIT_FAILURE as per POSIX, but I realize this was like this before.
There was a problem hiding this comment.
Agreed. I had an irrational desire to make the diffs less. But that failed anyway. Fixing.
| Server::ProdComponentFactory component_factory; | ||
| auto local_address = Network::Utility::getLocalAddress(options.localAddressIpVersion()); | ||
| switch (options.mode()) { | ||
| bool MainCommonBase::run() { |
There was a problem hiding this comment.
Should we have a better name for this? This capture historically what was in the common main, but I think logically it's about overall server setup, teardown and run. We already have a Server::Instance, so maybe Server::Cli, Server::Launcher, etc.?
There was a problem hiding this comment.
I'm OK renaming it but that would also lose any kind of a/b correlation in the review. AFAIK there's no way to tell git that a file was renamed; it either figures it out or it doesn't, and mostly it doesn't :)
Can this be a follow-up?
There was a problem hiding this comment.
Sure. Either later or in a final pass before merging, either works.
There was a problem hiding this comment.
Thanks; I'll put this in a follow-up.
Signed-off-by: Joshua Marantz <jmarantz@google.com>
Although it seems to work to declare main() as taking const char** argv, I wonder if there may be portability issues with that, and I decided to leave it at char**. Testing this is annoying because string literals are const. Signed-off-by: Joshua Marantz <jmarantz@google.com>
|
@jmarantz I can smoke test this at Lyft. Please let me know here that the change is ready to be plopped in a machine. If it is, I can do this tomorrow 02/13 |
htuch
left a comment
There was a problem hiding this comment.
LGTM modulo comments remaining and canary.
include/envoy/stats/stats.h
Outdated
|
|
||
| /** | ||
| * @return RawStatData* a raw stat data block for a given stat name or nullptr if there is no more | ||
| * memory available for stats. The allocator may return a reference counted data location |
There was a problem hiding this comment.
Technically, I think s/may return/should return/. We have an outstanding issue related to this at #2453
| "//source/server/config_validation:server_lib", | ||
| ], | ||
| ] + select({ | ||
| "//bazel:disable_signal_trace": [], |
There was a problem hiding this comment.
Yeah, the signal_trace deps are the Backtrace library, an external dependency.
test/exe/main_common_test.cc
Outdated
| namespace Envoy { | ||
|
|
||
| TEST(MainCommon, ConstructDestruct) { | ||
| // TODO(jmarantz): I think we may need to hack the config file to pick an unused port. |
There was a problem hiding this comment.
Why isn't port zero binding sufficient?
| dns_lookup_family: V4_ONLY | ||
| lb_policy: ROUND_ROBIN | ||
| hosts: [{ socket_address: { address: google.com, port_value: 443 }}] | ||
| tls_context: { sni: www.google.com } |
There was a problem hiding this comment.
You should be able to significantly reduce this down, since you don't need any TLS or DNS lookup, HCM etc. in a server startup example. See https://github.com/envoyproxy/envoy/blob/master/test/config/integration/server_ads.yaml for a very minimal example.
| return Envoy::main_common(*options); | ||
|
|
||
| // Run the server listener loop outside try/catch blocks, so that unexpected exceptions | ||
| // show up as a core-dumps for easier diagnostis. |
|
LGTM. It'll be a bit of work to switch our integration to this new main path, but no obvious problems. |
I attempted to use server_ads.yaml but it triggerd an unexpected exception in the yaml library. I'm adding instrumentation to track that in a different PR. Signed-off-by: Joshua Marantz <jmarantz@google.com>
|
Addressed comments from @htuch @junr03 can you kick off a smoke-test? Thanks! |
|
on it |
|
@jmarantz I put the binary on two of our machines (staging, canary) in our edge cluster. I did a full stop/start and a hot restart, and in both cases things look ok; server came up with no hiccups, serving traffic without any noticeable aberrations |
|
Thanks all! |
…nvoyproxy#2568)" This reverts commit deffec6, which was causing CI to flake today in TSAN runs. The proximate root cause is the change in envoyproxy#2568, which appears not to correctly setup logging in --mode=validate. This in turn leads to some log related races that I don't fully grok. Rolling back to unblock CI, we should make sure hotrestart_test passes TSAN with high confidence before rolling forward again. Signed-off-by: Harvey Tuch <htuch@google.com>
…2568)" (#2613) This reverts commit deffec6, which was causing CI to flake today in TSAN runs. The proximate root cause is the change in #2568, which appears not to correctly setup logging in --mode=validate. This in turn leads to some log related races that I don't fully grok. Rolling back to unblock CI, we should make sure hotrestart_test passes TSAN with high confidence before rolling forward again. Signed-off-by: Harvey Tuch <htuch@google.com>
…method (envoyproxy#2568)" (envoyproxy#2613)" This reverts commit 034135f. Signed-off-by: Joshua Marantz <jmarantz@google.com>
…ation (#2623) Description: Reverts #2613 the revert of #2568 Fixed the underlying issue which was that logging was not initialized during validation, so it ran lockless. This caused intermittent tsan errors. Although #2619 makes the failure immediate and consistent by asserting that logging has been initialized prior to spawning any threads. Risk Level: Medium -- the earlier #2568 caused intermittent tsan errors which we believe to be fixed, but #2568 was medium-risk in the first place. Release Notes: N/A
…gging before validation (envoyproxy#2623)" This reverts commit 89b1ea6.
Description: Converts EM's responseFlags to Cronet's API Errors Additional Description: Note that this change does not include ERR_NETWORK_CHANGED, ERR_INTERNET_DISCONNECT and QUIC_PROTOCOL_FAILED as they aren't implemented yet Risk Level: Low Testing: Test included Docs Changes: n/a Release Notes: n/a Signed-off-by: Chidera Olibie <colibie@google.com> Signed-off-by: JP Simard <jp@jpsim.com>
Description: Converts EM's responseFlags to Cronet's API Errors Additional Description: Note that this change does not include ERR_NETWORK_CHANGED, ERR_INTERNET_DISCONNECT and QUIC_PROTOCOL_FAILED as they aren't implemented yet Risk Level: Low Testing: Test included Docs Changes: n/a Release Notes: n/a Signed-off-by: Chidera Olibie <colibie@google.com> Signed-off-by: JP Simard <jp@jpsim.com>
Description:
This is step 2 in the plan to improve startup performance. It makes it possible instantiate the full server stack without running the event loop, so the instantiation can be timed. This also simplifies the code required in main(), though a legacy path is retained allowing existing alternate main()s to continue to work.
Risk Level: Medium: this has been extensively tested but deserves scrutiny as it is part of the startup path.
Testing:
//test/... including sanitizers.
Added tests for the new MainCommon instantiation, as well as the legacy main_common() where some of the work was delegated to main().
This is another step toward addressing #2373