Skip to content

server: factor out MainCommon as a class, with a run method#2568

Merged
htuch merged 35 commits intoenvoyproxy:masterfrom
jmarantz:main-common-as-class
Feb 13, 2018
Merged

server: factor out MainCommon as a class, with a run method#2568
htuch merged 35 commits intoenvoyproxy:masterfrom
jmarantz:main-common-as-class

Conversation

@jmarantz
Copy link
Contributor

@jmarantz jmarantz commented Feb 9, 2018

Description:
This is step 2 in the plan to improve startup performance. It makes it possible instantiate the full server stack without running the event loop, so the instantiation can be timed. This also simplifies the code required in main(), though a legacy path is retained allowing existing alternate main()s to continue to work.

Risk Level: Medium: this has been extensively tested but deserves scrutiny as it is part of the startup path.

Testing:
//test/... including sanitizers.
Added tests for the new MainCommon instantiation, as well as the legacy main_common() where some of the work was delegated to main().

This is another step toward addressing #2373

… in main_common.cc.

This sets the stage for another PR which will use the factored out
MainCommon to performance test initialization without entering a main
loop.

Part of this PR is where we factor out all the initialization done
prior to running the server loop into MainCommon's ctor.  The listener
loop is started by a run() method on the class.

The orgaqnization is a little more complex than ideal because we want
to, for one round of import, support the existing calling pattern of
main_common, which leaves some of the initialization in main().

Another part of this CL is to fix a SEGV in the Server destruction
process, if you ever instantiate a Server without running the listener
loop and terminating via a signal or quitquitquit.

Signed-off-by: Joshua Marantz <jmarantz@google.com>
Signed-off-by: Joshua Marantz <jmarantz@google.com>
Signed-off-by: Joshua Marantz <jmarantz@google.com>
Signed-off-by: Joshua Marantz <jmarantz@google.com>
Signed-off-by: Joshua Marantz <jmarantz@google.com>
Signed-off-by: Joshua Marantz <jmarantz@google.com>
Signed-off-by: Joshua Marantz <jmarantz@google.com>
Signed-off-by: Joshua Marantz <jmarantz@google.com>
…ain.

Signed-off-by: Joshua Marantz <jmarantz@google.com>
Signed-off-by: Joshua Marantz <jmarantz@google.com>
Signed-off-by: Joshua Marantz <jmarantz@google.com>
Signed-off-by: Joshua Marantz <jmarantz@google.com>
Signed-off-by: Joshua Marantz <jmarantz@google.com>
@mattklein123
Copy link
Member

@dnoe can you take a pass on this?

Signed-off-by: Joshua Marantz <jmarantz@google.com>
Signed-off-by: Joshua Marantz <jmarantz@google.com>
@jmarantz
Copy link
Contributor Author

jmarantz commented Feb 9, 2018

I'm iterating a little bit on some issues with main_common_test.cc across CI which I understand. The rest of the system should be stable.

…har 0.

Signed-off-by: Joshua Marantz <jmarantz@google.com>
My theory is that main_common's signal handlers are affecting the death-tests
when linked into signals_test, which appears to happen during coverage tests,
but not when tests are run individually.

Signed-off-by: Joshua Marantz <jmarantz@google.com>
…G_COVERAGE didn't seem to work.

Signed-off-by: Joshua Marantz <jmarantz@google.com>
… test during CI coverage.

Also removes a filegroup for the sample config now that I added one that uses port 0.

Signed-off-by: Joshua Marantz <jmarantz@google.com>
@jmarantz
Copy link
Contributor Author

TWIMC: I have managed to get this through CI by #if-ing out main_common_test, which works fine in all scenarios except for the coverage tests. My hypothesis is that when MainCommon instantiates and destructing SignalAction, and that sequence alters the global signal handler state so that signals_test can no longer run properly. In coverage test, because reasons, all the _test.cc files are linked together so we don't get a new process boundary.

I also attempted to use #ifndef ENVOY_CONFIG_COVERAGE which made my coverage tests kind of run test/run_envoy_bazel_coverage.sh locally, but they timed out and didn't give me any data, so I'm not sure if that resolved the issue. Committing that didn't work -- CI still failed in coverage/SignalsTest the same way. I then tried committing "#if 0" and that worked. Now I'm trying #if !ENVOY_CONFIG_COVERAGE.

I'd like to be able to include main_common_test in coverage. I suspect that requires diving deep into signal_action.h to understand why it isn't restoring state fully, which feels too deep for this PR.

…ing.

Signed-off-by: Joshua Marantz <jmarantz@google.com>
Signed-off-by: Joshua Marantz <jmarantz@google.com>
@htuch htuch self-assigned this Feb 11, 2018
Copy link
Member

@htuch htuch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall approach looks good.

@@ -0,0 +1,48 @@
{
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Best practice is to write new example configs in YAML; this the preferred human readable config representation.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

}],
"admin": {
"access_log_path": "/tmp/admin_access.log",
"address": "tcp://127.0.0.1:0"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason we're using a v1 config here? In general, unless verifying v1 regression, we're trying to turn down v1 and only add new stuff as v2.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was being lazy and copied the first plausible config I found. Done.

"//source/server/config_validation:server_lib",
],
] + select({
"//bazel:disable_signal_trace": [],
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you test with --define signal_trace=disabled? We don't have any CI tests to catch these variants of Bazel build, so when moving things around it's probably a good idea to manually verify.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great idea. Doing so now. Actually, why is this a compile-time decision rather than an option? Then it would be easier to test without rebuilding the entire tree. Same with HotRestart, though that's addressed in #2568.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some of these options are compile-time as we want to squash dependencies (e.g. Google gRPC) or demonstrate that site-local replacement or build time removal of features is possible (e.g. we replace during import on the Google side). Hot restart is probably a better contender to be fully dynamic, as it doesn't introduce any deps.

There's a wider conversation happening right now in #2069. Basically, we need some way to provide greater build time flexibility to allow very minimal Envoy binary deployments while also support kitchen sink builds with dynamic configuration. The former might be interesting in resource constrained embedded or container environments, the latter on a traditional fat server.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wasn't arguing against any compile-time switches, but it wasn't obvious what dependency this one had. I was thinking that if we were compiling on Windows and the APIs didn't exist, we'd just #if them out based on platform support as opposed to config, but keep the class API above that consistent.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, the signal_trace deps are the Backtrace library, an external dependency.

}
};
Server::DrainManagerPtr ProdComponentFactory::createDrainManager(Server::Instance& server) {
return Server::DrainManagerPtr{
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: std::make_unique

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done (this code was kind of just moved, but fixed anyway).


void InstanceImpl::terminate() {
{
std::unique_lock<std::mutex> lock(terminate_mutex_);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this mutex protected? If we get here and we're not on the main thread, I think we're going to have a bad day, as most of the state that is touched below is local to the main thread.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't find a race, but I noticed a similar bool in ThreadLocalImpl::shutdown_ that was declared atomic. I was then going to make shutdown_ atomic but felt like the equivalent set-if-not-already-and-run-code flow was easier to read as a mutex.

I probably need to do some more debugging of a live server to understand which structures are live in which threads. Will investigate.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think shutdown is atomic as it's used for cross-thread communication, between main and workers. I think InstanceImp::terminate is only used on the main thread.

#endif

std::unique_ptr<Envoy::OptionsImpl> options_;
MainCommonBase base_;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Curious why membership vs. inheritance was chosen here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I needed to compute the options_ before I could instantiate the Base, which takes Options& as a param for now, due to the legacy main_common. Hopefully this can all melt away once we remove our main_common call at Google.

#else
MainCommonBase main_common(options, false);
#endif
return main_common.run() ? 0 : 1;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a slight fan of EXIT_SUCCESS and EXIT_FAILURE as per POSIX, but I realize this was like this before.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. I had an irrational desire to make the diffs less. But that failed anyway. Fixing.

Server::ProdComponentFactory component_factory;
auto local_address = Network::Utility::getLocalAddress(options.localAddressIpVersion());
switch (options.mode()) {
bool MainCommonBase::run() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we have a better name for this? This capture historically what was in the common main, but I think logically it's about overall server setup, teardown and run. We already have a Server::Instance, so maybe Server::Cli, Server::Launcher, etc.?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm OK renaming it but that would also lose any kind of a/b correlation in the review. AFAIK there's no way to tell git that a file was renamed; it either figures it out or it doesn't, and mostly it doesn't :)

Can this be a follow-up?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure. Either later or in a final pass before merging, either works.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks; I'll put this in a follow-up.

Signed-off-by: Joshua Marantz <jmarantz@google.com>
Although it seems to work to declare main() as taking const char**
argv, I wonder if there may be portability issues with that, and I decided
to leave it at char**.  Testing this is annoying because string literals
are const.

Signed-off-by: Joshua Marantz <jmarantz@google.com>
@junr03
Copy link
Member

junr03 commented Feb 12, 2018

@jmarantz I can smoke test this at Lyft. Please let me know here that the change is ready to be plopped in a machine. If it is, I can do this tomorrow 02/13

Copy link
Member

@htuch htuch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM modulo comments remaining and canary.


/**
* @return RawStatData* a raw stat data block for a given stat name or nullptr if there is no more
* memory available for stats. The allocator may return a reference counted data location
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Technically, I think s/may return/should return/. We have an outstanding issue related to this at #2453

"//source/server/config_validation:server_lib",
],
] + select({
"//bazel:disable_signal_trace": [],
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, the signal_trace deps are the Backtrace library, an external dependency.

namespace Envoy {

TEST(MainCommon, ConstructDestruct) {
// TODO(jmarantz): I think we may need to hack the config file to pick an unused port.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why isn't port zero binding sufficient?

dns_lookup_family: V4_ONLY
lb_policy: ROUND_ROBIN
hosts: [{ socket_address: { address: google.com, port_value: 443 }}]
tls_context: { sni: www.google.com }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should be able to significantly reduce this down, since you don't need any TLS or DNS lookup, HCM etc. in a server startup example. See https://github.com/envoyproxy/envoy/blob/master/test/config/integration/server_ads.yaml for a very minimal example.

Copy link
Contributor

@dnoe dnoe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kmyerson Please take a look at this to make sure there aren't any obvious problems with our integration.

@jmarantz You might want to try doing a Google import and test to do our equivalent of the Lyft smoke test. There've been tricky problems around lifecycle before.

return Envoy::main_common(*options);

// Run the server listener loop outside try/catch blocks, so that unexpected exceptions
// show up as a core-dumps for easier diagnostis.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo: diagnostics

@kmyerson
Copy link
Contributor

LGTM. It'll be a bit of work to switch our integration to this new main path, but no obvious problems.

I attempted to use server_ads.yaml but it triggerd an unexpected exception in the yaml
library.  I'm adding instrumentation to track that in a different PR.

Signed-off-by: Joshua Marantz <jmarantz@google.com>
@jmarantz
Copy link
Contributor Author

Addressed comments from @htuch
@kmyerson existing calls to main_common() should work, and that is at least somewhat tested in main_common_test.cc. The intent is for us to migrate off that legacy path after integrating this, and then the legacy main_common() support can be deleted.

@junr03 can you kick off a smoke-test? Thanks!

@junr03
Copy link
Member

junr03 commented Feb 13, 2018

on it

@junr03
Copy link
Member

junr03 commented Feb 13, 2018

@jmarantz I put the binary on two of our machines (staging, canary) in our edge cluster. I did a full stop/start and a hot restart, and in both cases things look ok; server came up with no hiccups, serving traffic without any noticeable aberrations

@htuch htuch merged commit deffec6 into envoyproxy:master Feb 13, 2018
@jmarantz
Copy link
Contributor Author

Thanks all!

@jmarantz jmarantz deleted the main-common-as-class branch February 13, 2018 22:45
@htuch
Copy link
Member

htuch commented Feb 13, 2018

@junr03 @jmarantz Merged. We will roll back if there is any major breakage reported in next 48h.

htuch added a commit to htuch/envoy that referenced this pull request Feb 15, 2018
…nvoyproxy#2568)"

This reverts commit deffec6, which was
causing CI to flake today in TSAN runs. The proximate root cause is
the change in envoyproxy#2568, which appears not to correctly setup logging in
--mode=validate. This in turn leads to some log related races that I
don't fully grok.

Rolling back to unblock CI, we should make sure hotrestart_test passes
TSAN with high confidence before rolling forward again.

Signed-off-by: Harvey Tuch <htuch@google.com>
htuch added a commit that referenced this pull request Feb 15, 2018
…2568)" (#2613)

This reverts commit deffec6, which was
causing CI to flake today in TSAN runs. The proximate root cause is
the change in #2568, which appears not to correctly setup logging in
--mode=validate. This in turn leads to some log related races that I
don't fully grok.

Rolling back to unblock CI, we should make sure hotrestart_test passes
TSAN with high confidence before rolling forward again.

Signed-off-by: Harvey Tuch <htuch@google.com>
jmarantz added a commit to jmarantz/envoy that referenced this pull request Feb 15, 2018
…method (envoyproxy#2568)" (envoyproxy#2613)"

This reverts commit 034135f.

Signed-off-by: Joshua Marantz <jmarantz@google.com>
dnoe pushed a commit that referenced this pull request Feb 15, 2018
…ation (#2623)

Description:
Reverts #2613 the revert of #2568
Fixed the underlying issue which was that logging was not initialized during validation, so it ran lockless. This caused intermittent tsan errors. Although #2619 makes the failure immediate and consistent by asserting that logging has been initialized prior to spawning any threads.

Risk Level: Medium -- the earlier #2568 caused intermittent tsan errors which we believe to be fixed, but #2568 was medium-risk in the first place.

Release Notes: N/A
lita pushed a commit to lita/envoy that referenced this pull request Feb 15, 2018
Shikugawa pushed a commit to Shikugawa/envoy that referenced this pull request Mar 28, 2020
jpsim pushed a commit that referenced this pull request Nov 28, 2022
Description: Converts EM's responseFlags to Cronet's API Errors
Additional Description: Note that this change does not include ERR_NETWORK_CHANGED, ERR_INTERNET_DISCONNECT and QUIC_PROTOCOL_FAILED as they aren't implemented yet
Risk Level: Low
Testing: Test included
Docs Changes: n/a
Release Notes: n/a

Signed-off-by: Chidera Olibie <colibie@google.com>
Signed-off-by: JP Simard <jp@jpsim.com>
jpsim pushed a commit that referenced this pull request Nov 29, 2022
Description: Converts EM's responseFlags to Cronet's API Errors
Additional Description: Note that this change does not include ERR_NETWORK_CHANGED, ERR_INTERNET_DISCONNECT and QUIC_PROTOCOL_FAILED as they aren't implemented yet
Risk Level: Low
Testing: Test included
Docs Changes: n/a
Release Notes: n/a

Signed-off-by: Chidera Olibie <colibie@google.com>
Signed-off-by: JP Simard <jp@jpsim.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants