server: wait workers to start before draining parent.#14319
mattklein123 merged 15 commits into envoyproxy:master
Signed-off-by: Tong Cai <caitong93@gmail.com>
|
Trying to fix. It seems more complex than I thought, and I will spend some time figuring out the test logic of hot restart. |
|
At a high level this looks correct, so let me know if you have any questions or want me to do a further review. /wait |
Signed-off-by: Tong Cai <caitong93@gmail.com>
Signed-off-by: Tong Cai <caitong93@gmail.com>
Signed-off-by: Tong Cai <caitong93@gmail.com>
|
Basically ready for review. |
|
The new stage |
Signed-off-by: Tong Cai <caitong93@gmail.com>
|
Updated. Tests pass. |
Signed-off-by: Tong Cai <caitong93@gmail.com>
| started.WaitForNotification(); | ||
| EXPECT_TRUE(startup); | ||
| EXPECT_FALSE(shutdown); | ||
| EXPECT_TRUE(TestUtility::findGauge(stats_store_, "server.state")->used()); |
Removed the server.state check here because it is non-deterministic.
mattklein123 left a comment
At a high level this looks mostly correct but I'm confused about why some of the changes were made. Thank you!
/wait
source/server/server.h
Outdated
| // startup_ is true means Startup notifications have been called. | ||
| bool startup_{}; |
Move this up into the variables section. Also startup_lifecycle_event_raised_ or something like that?
source/server/server.cc
Outdated
| if (!startup_) { | ||
| notifyCallbacksForStage(Stage::Startup); | ||
| startup_ = true; | ||
| } |
Can you explain these changes? It's not clear to me why they were made. Please add more comments both here and below if they are necessary.
This ensures that Startup notifications are sent first. Otherwise, in the LifecycleNotifications test, the notification order would be PostInit, WorkerStarted, Startup (because with a static configuration, post_init_cb is called immediately, before the main thread dispatcher starts), and a deadlock would happen (because we block the callback at the WorkerStarted stage).
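The ordering guard described above can be sketched in isolation. This is a minimal illustration, not the code in server.cc: the class LifecycleSketch, its methods, and the simplified Stage enum are hypothetical names introduced here. It assumes that every later stage should force the Startup notification to go out first, and that Startup must be raised only once.

```cpp
#include <cassert>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical, simplified stages; not Envoy's real ServerLifecycleNotifier.
enum class Stage { Startup, PostInit, WorkersStarted };

class LifecycleSketch {
public:
  using Callback = std::function<void(Stage)>;

  void registerCallback(Callback cb) { callbacks_.push_back(std::move(cb)); }

  // Raise Startup at most once, whichever code path gets here first.
  void raiseStartupOnce() {
    if (!startup_raised_) {
      startup_raised_ = true;
      notify(Stage::Startup);
    }
  }

  // Any stage first makes sure Startup has already been delivered.
  void raise(Stage stage) {
    raiseStartupOnce();
    assert(startup_raised_);
    if (stage != Stage::Startup) {
      notify(stage);
    }
  }

private:
  void notify(Stage stage) {
    for (const auto& cb : callbacks_) {
      cb(stage);
    }
  }

  std::vector<Callback> callbacks_;
  bool startup_raised_{false};
};
```

With this guard, a PostInit that fires early (as in the static configuration case) still reaches observers after Startup, so a callback that blocks at a later stage cannot starve the Startup notification.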
| /** | ||
| * All workers have started. | ||
| */ | ||
| WorkerStarted, |
This is not used? If this is needed, it should be named WorkersStarted.
I see. My preference would be to not make prod changes like this just for tests. If you need synchronization hooks can you use https://github.com/envoyproxy/envoy/blob/master/source/server/listener_hooks.h instead? Thank you.
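To illustrate the suggestion, here is a rough sketch of a test-only synchronization hook. It does not reproduce the interface in listener_hooks.h: StartupHooks, NoopHooks, WaitableHooks, and runStartup are hypothetical names, and the sketch assumes the server accepts a hooks object that is a no-op in production and replaced by a waitable implementation in tests.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>

// Hypothetical hook interface; the real one lives in listener_hooks.h.
class StartupHooks {
public:
  virtual ~StartupHooks() = default;
  virtual void onWorkersStarted() = 0;
};

// Production default: does nothing, so no test logic leaks into prod paths.
class NoopHooks : public StartupHooks {
public:
  void onWorkersStarted() override {}
};

// Test implementation: lets a test thread block until the hook fires.
class WaitableHooks : public StartupHooks {
public:
  void onWorkersStarted() override {
    {
      std::lock_guard<std::mutex> lock(mu_);
      started_ = true;
    }
    cv_.notify_all();
  }

  void waitForWorkersStarted() {
    std::unique_lock<std::mutex> lock(mu_);
    cv_.wait(lock, [this] { return started_; });
  }

  bool started() {
    std::lock_guard<std::mutex> lock(mu_);
    return started_;
  }

private:
  std::mutex mu_;
  std::condition_variable cv_;
  bool started_{false};
};

// Hypothetical startup path: the server only ever sees the interface.
void runStartup(StartupHooks& hooks) {
  // ... workers would be started here ...
  hooks.onWorkersStarted();
}
```

A test would typically run the server on its own thread, call waitForWorkersStarted() from the test thread, and only then make assertions, while the production lifecycle stages stay unchanged.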
/wait
| std::unique_ptr<Network::MockConnectionSocket> socket_; | ||
| uint64_t listener_tag_{1}; | ||
| bool enable_dispatcher_stats_{false}; | ||
| std::function<void()> callback_; |
This should be a mock and actually verify correct calls.
Signed-off-by: Tong Cai <caitong93@gmail.com>
Signed-off-by: Tong Cai <caitong93@gmail.com>
Signed-off-by: Tong Cai <caitong93@gmail.com>
Signed-off-by: Tong Cai <caitong93@gmail.com>
Signed-off-by: Tong Cai <caitong93@gmail.com>
|
Please merge main and check format. /wait |
Signed-off-by: Tong Cai <caitong93@gmail.com>
Signed-off-by: Tong Cai <caitong93@gmail.com>
|
@mattklein123 Updated, PTAL. |
mattklein123 left a comment
Thanks, this LGTM. Just a few test questions.
/wait-any
| server_thread->join(); | ||
| } | ||
|
|
||
| TEST_P(ServerInstanceImplTest, DrainParentListenerAfterWorkersStarted) { |
Can you run this with --runs_per_test=1000 to make sure it doesn't flake?
Done.
Target //test/server:server_test up-to-date:
bazel-bin/test/server/server_test
INFO: Elapsed time: 4724.860s, Critical Path: 22.74s
INFO: 1001 processes: 1 internal, 1000 processwrapper-sandbox.
INFO: Build completed successfully, 1001 total actions
//test/server:server_test PASSED in 21.8s
Stats over 1000 runs: max = 21.8s, min = 18.5s, avg = 18.8s, dev = 0.3s
| if (isShutdown()) { | ||
| return; | ||
| } |
Do you have test coverage of this case? (Put an ASSERT in there and see if it's hit or look at coverage report)
Added a test for this in the new commit.
Signed-off-by: Tong Cai <caitong93@gmail.com>
|
/retest |
|
Retrying Azure Pipelines: |
|
It doesn't seem to be a network problem. I will check what causes CI to fail when I have time. |
|
It's an unrelated OSX issue. Will just merge. |
|
Thanks for the review! |
Signed-off-by: Tong Cai <caitong93@gmail.com>
Commit Message: server: wait workers to start before draining parent.
Additional Description:
Manual test passes: inject a 5s delay in the socket() call and keep sending traffic to Envoy during hot restarts. Everything seems good.
Risk Level: medium
Testing:
Docs Changes:
Release Notes:
Fixes #14295
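
The core idea of the change, draining the parent only after every worker has started and skipping the drain if shutdown has already begun, can be sketched as follows. This is an illustration under stated assumptions, not the merged implementation: WorkerStartTracker and ServerSketch are hypothetical names, and the parent_drained flag stands in for the real call that drains the parent's listeners.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <functional>
#include <utility>

// Hypothetical helper: runs a callback once the last worker reports in.
class WorkerStartTracker {
public:
  WorkerStartTracker(std::size_t worker_count, std::function<void()> on_all_started)
      : remaining_(worker_count), on_all_started_(std::move(on_all_started)) {
    assert(worker_count > 0);
  }

  // Called by each worker once it is running; safe to call from several threads.
  void workerStarted() {
    // fetch_sub returns the previous value, so 1 means this was the last worker.
    if (remaining_.fetch_sub(1) == 1) {
      on_all_started_();
    }
  }

private:
  std::atomic<std::size_t> remaining_;
  std::function<void()> on_all_started_;
};

// Hypothetical server state for the sketch.
struct ServerSketch {
  bool shutting_down{false};
  bool parent_drained{false};

  // Completion callback: do not drain the parent if shutdown already began.
  void onWorkersStarted() {
    if (shutting_down) {
      return;
    }
    parent_drained = true; // stands in for draining the parent's listeners
  }
};
```

For example, with three workers the callback fires only on the third workerStarted() call, so the parent keeps serving traffic until the new process is actually able to accept connections.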