Add interval to restart worker #120
Some tests fail on Windows with Ruby 3 (3.0.3, 3.1). These failures probably share the same cause. Example (on Windows with Ruby 3.0.3):
It appears that But I have no idea what caused that. |
This fix needs the fix of ServerEngine: * treasure-data/serverengine#120 Signed-off-by: Daijiro Fukuda <[email protected]>
@daipom OK. I spent this evening testing this patch. I could confirm Here are some snippets from my testing.
# Multi-Process server code retrieved from README.md
# with workers=2 + restart_worker_interval=5.
#
# Process 18222 got killed at 17:16:28, and start running
# again at 17:16:33 (as 18223).
$ bundle exec test-server.rb
I, [2022-05-31T17:16:27.304833 #18229] INFO -- : Awesome work!
I, [2022-05-31T17:16:28.304004 #18226] INFO -- : Awesome work!
E, [2022-05-31T17:16:28.306409 #18222] ERROR -- : Worker 1 finished unexpectedly with status 0
I, [2022-05-31T17:16:29.304818 #18226] INFO -- : Awesome work!
I, [2022-05-31T17:16:30.305964 #18226] INFO -- : Awesome work!
I, [2022-05-31T17:16:31.307159 #18226] INFO -- : Awesome work!
I, [2022-05-31T17:16:32.308340 #18226] INFO -- : Awesome work!
I, [2022-05-31T17:16:33.309509 #18226] INFO -- : Awesome work!
I, [2022-05-31T17:16:33.804000 #18233] INFO -- : Awesome work!
That said, the worrying part of the current patch is that it seems
$ git fetch https://github.com/daipom/serverengine add-interval-restarting-worker
From https://github.com/daipom/serverengine
* branch add-interval-restarting-worker -> FETCH_HEAD
$ git diff FETCH_HEAD --stat lib/
lib/serverengine/multi_process_server.rb | 9 ++----
lib/serverengine/multi_thread_server.rb | 7 +----
lib/serverengine/multi_worker_server.rb | 93 +++++----------------------------------------------------
3 files changed, 12 insertions(+), 97 deletions(-)
That's not necessarily a bad thing, but I do feel that the change
Here's a simplified version of the patch. It works as follows:
In that way, it works just as the current version, except it only
lib/serverengine/multi_process_server.rb | 2 ++
lib/serverengine/multi_thread_server.rb | 3 +++
lib/serverengine/multi_worker_server.rb | 15 ++++++++++++++-
3 files changed, 19 insertions(+), 1 deletion(-)
diff --git a/lib/serverengine/multi_process_server.rb b/lib/serverengine/multi_process_server.rb
index 19cd72a73842..41761b4efbb1 100644
--- a/lib/serverengine/multi_process_server.rb
+++ b/lib/serverengine/multi_process_server.rb
@@ -105,9 +105,11 @@ module ServerEngine
         @unrecoverable_exit_codes = unrecoverable_exit_codes
         @unrecoverable_exit = false
         @exitstatus = nil
+        @restart_at = nil
       end

       attr_reader :exitstatus
+      attr_accessor :restart_at

       def send_stop(stop_graceful)
         @stop = true
diff --git a/lib/serverengine/multi_thread_server.rb b/lib/serverengine/multi_thread_server.rb
index 0b3e2d121619..6615937d8f9d 100644
--- a/lib/serverengine/multi_thread_server.rb
+++ b/lib/serverengine/multi_thread_server.rb
@@ -39,8 +39,11 @@ module ServerEngine
       def initialize(worker, thread)
         @worker = worker
         @thread = thread
+        @restart_at = nil
       end

+      attr_accessor :restart_at
+
       def send_stop(stop_graceful)
         Thread.new do
           begin
diff --git a/lib/serverengine/multi_worker_server.rb b/lib/serverengine/multi_worker_server.rb
index 60b650fe1196..d8abacaf267c 100644
--- a/lib/serverengine/multi_worker_server.rb
+++ b/lib/serverengine/multi_worker_server.rb
@@ -85,6 +85,7 @@ module ServerEngine

       @start_worker_delay = @config[:start_worker_delay] || 0
       @start_worker_delay_rand = @config[:start_worker_delay_rand] || 0.2
+      @restart_worker_interval = @config[:restart_worker_interval] || 0

       scale_workers(@config[:workers] || 1)

@@ -116,7 +117,11 @@ module ServerEngine
         elsif wid < @num_workers
           # scale up or reboot
           unless @stop
-            @monitors[wid] = delayed_start_worker(wid)
+            if m and @restart_worker_interval > 0
+              restart_worker(m, wid)
+            else
+              @monitors[wid] = delayed_start_worker(wid)
+            end
             num_alive += 1
           end

@@ -129,6 +134,14 @@ module ServerEngine
       return num_alive
     end

+    def restart_worker(m, wid)
+      if m.restart_at.nil?
+        m.restart_at = Time.now() + @restart_worker_interval
+      elsif m.restart_at <= Time.now()
+        @monitors[wid] = start_worker(wid)
+      end
+    end
+
     def delayed_start_worker(wid)
       if @start_worker_delay > 0
         delay = @start_worker_delay +
|
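The scheduling idea in the simplified patch above can be tried in isolation. The following is only a rough sketch, not ServerEngine code: `FakeMonitor`, `RestartScheduler`, and the injectable `clock` are hypothetical stand-ins invented for illustration, and the real server additionally handles scaling, unrecoverable exit codes, and `start_worker_delay`. It shows the two-step behavior: the first keepalive pass after a worker exits only records `restart_at`, and a later pass starts the worker once that time has been reached.

```ruby
# Hypothetical stand-in for a worker monitor (not a ServerEngine class).
class FakeMonitor
  attr_accessor :restart_at

  def initialize
    @alive = true
    @restart_at = nil
  end

  def alive?
    @alive
  end

  # Simulates the worker exiting.
  def kill!
    @alive = false
  end
end

# Hypothetical, stripped-down keepalive loop that mirrors the patch's
# restart_worker logic. `clock` is any callable returning the current time.
class RestartScheduler
  attr_reader :monitors, :starts

  def initialize(num_workers, restart_worker_interval, clock)
    @restart_worker_interval = restart_worker_interval
    @clock = clock
    @starts = 0
    @monitors = Array.new(num_workers) { start_worker }
  end

  # One keepalive pass. Workers waiting for their restart time are still
  # counted, as in the patch where num_alive is incremented either way.
  def keepalive
    num_alive = 0
    @monitors.each_with_index do |m, wid|
      restart_worker(m, wid) unless m.alive?
      num_alive += 1
    end
    num_alive
  end

  private

  def start_worker
    @starts += 1
    FakeMonitor.new
  end

  def restart_worker(m, wid)
    if m.restart_at.nil?
      # First pass after the exit: only schedule the restart.
      m.restart_at = @clock.call + @restart_worker_interval
    elsif m.restart_at <= @clock.call
      # Interval elapsed: replace the dead monitor with a fresh worker.
      @monitors[wid] = start_worker
    end
  end
end

now = 0
scheduler = RestartScheduler.new(2, 5, -> { now })
scheduler.monitors[0].kill!
scheduler.keepalive   # schedules the restart of worker 0 for time 5
now = 4
scheduler.keepalive   # still inside the interval, worker 0 stays down
now = 5
scheduler.keepalive   # interval elapsed, worker 0 is started again
```

Passing the clock in as a lambda is a choice made for this sketch so that the interval can be exercised without sleeping; the patch itself calls `Time.now()` directly.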
@fujimotos Thank you for the feedback! I do care about maintainability, but there certainly seem to be some unnecessarily large modifications! I'll try to remove them first. |
83b450f
to
bbbd8b7
Compare
How about this, trying to keep both the small diff and the clear logic?
% git diff FETCH_HEAD --stat lib/
lib/serverengine/multi_process_server.rb | 2 ++
lib/serverengine/multi_thread_server.rb | 3 +++
lib/serverengine/multi_worker_server.rb | 41 +++++++++++++++++++++++++++++++++--------
3 files changed, 38 insertions(+), 8 deletions(-) |
I changed the test code a little to avoid test failures at 5df9c68, |
Hmm, a test still fails on Ruby 3.0 and 3.1:
|
For some reason, tests unrelated to this fix no longer fail.. |
OK, this is much better! I think this is committable once the
One possible further improvement is the configuration naming. Since we already have this option called
But this is just a casual observation, and I don't have a dog in this fight.
Talking about the test failure, it appears that something is wrong
sleep(start_worker_delay * (workers - 1))
monitors.count { |m| m.alive? }.should == workers
My psychic guess is that it's probably because we do not take |
This could be the cause.
|
It seems that the worker always stops after about 26 seconds in these tests. Example:
@@ -17,6 +17,9 @@ describe ServerEngine::MultiSpawnServer do
Timeout.timeout(5) do
sleep(0.5) until test_state(:worker_run) == 2
end
+
+ sleep(24)
+
test_state(:worker_run).should == 2
ensure
% bundle exec rspec spec\multi_spawn_server_spec.rb -e "starts worker processes"
Run options: include {:full_description=>/starts\ worker\ processes/}
I, [2022-06-01T17:00:56.481793 #18280] INFO -- : Received graceful stop
I, [2022-06-01T17:00:56.719072 #18280] INFO -- : Worker 0 finished with status 0
I, [2022-06-01T17:00:56.719695 #18280] INFO -- : Worker 1 finished with status 0
.
Finished in 25.76 seconds
1 example, 0 failures
@@ -17,6 +17,9 @@ describe ServerEngine::MultiSpawnServer do
Timeout.timeout(5) do
sleep(0.5) until test_state(:worker_run) == 2
end
+
+ sleep(28)
+
test_state(:worker_run).should == 2
ensure
% bundle exec rspec spec\multi_spawn_server_spec.rb -e "starts worker processes"
Run options: include {:full_description=>/starts\ worker\ processes/}
E, [2022-06-01T17:01:55.061765 #17836] ERROR -- : Worker 0 finished unexpectedly with status 0
E, [2022-06-01T17:01:55.081210 #17836] ERROR -- : Worker 1 finished unexpectedly with status 0
I, [2022-06-01T17:01:58.352816 #17836] INFO -- : Received graceful stop
I, [2022-06-01T17:01:58.631026 #17836] INFO -- : Worker 0 finished with status 0
I, [2022-06-01T17:01:58.631597 #17836] INFO -- : Worker 1 finished with status 0
F
Failures:
1) ServerEngine::MultiSpawnServer starts worker processes with command_sender=pipe
Failure/Error: test_state(:worker_run).should == 2
expected: 2
got: 4 (using ==)
# ./spec/multi_spawn_server_spec.rb:24:in `block (4 levels) in <top (required)>'
Finished in 29.79 seconds
1 example, 1 failure |
|
This seems to be intentional, judging by this code:
def run
  incr_test_state :worker_run
  5.times do
    # repeats 5 times because signal handlers
    # interrupts wait
    @stop_flag.wait(5.0)
  end
  @stop_flag.reset!
end
|
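The roughly 26-second lifetime is consistent with that loop: if nothing sets the stop flag, each `wait(5.0)` presumably times out after 5 seconds, so `run` returns after about 5 × 5.0 = 25 seconds. That would explain why `sleep(24)` still sees `worker_run == 2` while `sleep(28)` sees 4, since both workers have finished and been started again by then. The arithmetic can be checked with a small sketch; `RecordingFlag` is a hypothetical stub that adds up the requested timeouts instead of blocking, and it assumes that `wait(timeout)` blocks for the full timeout when the flag is never set.

```ruby
# Hypothetical stub for the test worker's stop flag: it records how long
# the caller would have waited instead of actually blocking.
class RecordingFlag
  attr_reader :waited

  def initialize
    @waited = 0.0
  end

  # Assumption: the flag is never set, so every wait runs to its timeout.
  def wait(timeout)
    @waited += timeout
    false
  end
end

# Same loop shape as the worker's run method quoted above.
def run_once(flag)
  5.times { flag.wait(5.0) }
  flag.waited
end

MAX_RUN_SECONDS = run_once(RecordingFlag.new)
puts MAX_RUN_SECONDS  # prints 25.0
```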
3a013cf
to
feeef16
Compare
I think this fix will cure the test, so could you please run the test again?
That bothered me too. I thought it would be less confusing to distinguish between However, if this seems rather more confusing, it would be better to fix it. What do you think? |
OK. I'm fine with this patch.
The patch itself seems good, but it makes the whole test run more than twice as long as before. |
I think the new test sleeps for more than a minute. |
I have added a fix to reduce the test time. It cuts the test time by about one minute. |
Fails on Windows again.. :( |
Hmm, still unstable on Windows: https://github.com/treasure-data/serverengine/actions/runs/2459932320 |
Don't worry. Tuning test timing on Windows is hard; I often have trouble with it too. |
Thank you. |
4cc76de
to
a25ab41
Compare
I've rebased this on top of the master branch. |
Thank you.
Unit testing with Ruby 2.7 on ubuntu-latest
This shows that it can take more than 5 seconds to start 3 workers on Ubuntu as well as Windows.
Unit testing with Ruby 2.7 on windows-latest
This error is completely unexpected.
Unit testing with Ruby 3.1 on windows-latest
Some tests that have nothing to do with this fix are even failing... |
Hmm, I cannot reproduce them in my local environment; all tests pass... |
|
Thank you! That is certainly true! But I don't know why this error occurs. |
Signed-off-by: Daijiro Fukuda <[email protected]>
Added a new option: `restart_worker_interval`. This option can work with the existing option `start_worker_delay`. Signed-off-by: Daijiro Fukuda <[email protected]>
This explanation was confusing with the new option `restart_worker_interval`, so I made the difference clear. Note: The behavior of `start_worker_delay` option doesn't change, this just makes the explanation clearer. Signed-off-by: Daijiro Fukuda <[email protected]>
This fix causes workers to remain stopped for a while, although they are scheduled to restart later. Workers in that state should be included in this count because it should not break if there are workers scheduled to restart. Signed-off-by: Daijiro Fukuda <[email protected]>
Signed-off-by: Daijiro Fukuda <[email protected]>
Signed-off-by: Daijiro Fukuda <[email protected]>
Signed-off-by: Daijiro Fukuda <[email protected]>
Signed-off-by: Daijiro Fukuda <[email protected]>
Some new tests need a longer timeout. Signed-off-by: Daijiro Fukuda <[email protected]>
Use `timecop` and shorten some delay options to reduce test time. `$ bundle exec rspec spec/multi_spawn_server_spec.rb -e "keepalive_workers"` - before: 1m51.6s - after: 41.6s Signed-off-by: Daijiro Fukuda <[email protected]>
Signed-off-by: Daijiro Fukuda <[email protected]>
a25ab41
to
bf88f9a
Compare
Signed-off-by: Takuro Ashie <[email protected]>
So I'll merge this. |
Thank you so much for the fix and the merge. I wonder why this solves the timeout of restarting,
Sure! Thank you! |
Issue: fluent/fluentd#3749
This fix adds a new option `restart_worker_interval` for the multithread/multiprocess server. This option prevents workers from being restarted immediately after being killed, which is useful if you want workers to stay stopped for a while.
This fix is compatible with the existing option `start_worker_delay`, which prevents multiple workers from starting/restarting at the same time.
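As a rough illustration of how the two options might be combined, here is a configuration fragment only. The values are arbitrary assumptions rather than recommendations, and `SERVER_CONFIG` is a hypothetical name; the hash is the kind of options argument the README passes to `ServerEngine.create`.

```ruby
# Illustrative values only. A hash like this would be passed as the config
# argument, e.g. ServerEngine.create(nil, MyWorker, SERVER_CONFIG).
SERVER_CONFIG = {
  worker_type: 'process',       # multiprocess server
  workers: 2,
  restart_worker_interval: 5,   # keep an exited worker stopped for 5 seconds
  start_worker_delay: 1,        # stagger worker starts so they do not coincide
}.freeze
```

With these settings, a killed worker would stay down for about 5 seconds before being started again, which matches the behavior seen in the testing log earlier in this thread (`workers=2` with `restart_worker_interval=5`).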
TODO