
box: finish client fibers on shutdown #9604

Merged
merged 1 commit into tarantool:master
Jan 29, 2024

Conversation

nshy
Contributor

@nshy nshy commented Jan 19, 2024

During graceful shutdown it is convenient to first finish all client (non-system) fibers. Otherwise every subsystem would have to be ready to handle requests from client fibers during or after its own shutdown, which would make the code more complex.

We first cancel the client fibers and then wait for them to finish. A fiber may not respond to cancellation and hang, which causes shutdown to hang, but this is the approach we already chose for iproto shutdown.

Note that as a result of this approach the application will panic if it is shut down during execution of the initialization script (in particular if the script is executing box.cfg).

There are changes in the application/tests to adapt to client fiber shutdown:

  • make code cancellable (only enough to pass the existing tests; we did not investigate all the possible places that should be made cancellable).

  • make the console stop sending echo to the client before client fiber shutdown. Otherwise, as the console server fiber is a client fiber, on shutdown we would send a message that the fiber is cancelled, which breaks a lot of existing tests. This approach is on par with iproto shutdown.

  • some tests (7743, replication-luatest/shutdown, replication/anon, replication/force_recovery, etc.) test shutdown during execution of the init script. A panic is now expected, so change them accordingly.

  • some tests (8530, 5430, errinj_vylog) use an error injection that blocks client fibers from finishing. Those tests don't need graceful shutdown, so let's just kill tarantool instead.

  • change the vinyl/errinj test for gh-3225. We don't really need to check the case when the vinyl reader is blocked, as it executes small tasks (we assume the read syscall will not hang). Also change the vinyl dump shutdown test to slow the dump down instead of blocking it entirely; this is required for the client fibers in the test to finish in time.

  • other similar changes

We can also drop the code from replication shutdown that was required to handle client requests during/after shutdown.

Part of #8423

@nshy nshy requested a review from a team as a code owner January 19, 2024 11:43
@coveralls

coveralls commented Jan 19, 2024

Coverage Status

coverage: 86.927% (+0.003%) from 86.924% when pulling ef70297 on nshy:client-fibers-shutdown into 3fcd83e on tarantool:master.

@nshy nshy requested a review from locker January 19, 2024 13:55
@nshy
Contributor Author

nshy commented Jan 24, 2024

I also added this hunk to fix a memory leak with error injections.

--- a/src/box/memtx_engine.cc
+++ b/src/box/memtx_engine.cc
@@ -1028,7 +1028,7 @@ checkpoint_f(va_list ap)
                return -1;
        }

-       struct mh_i32_t *temp_space_ids = mh_i32_new();
+       struct mh_i32_t *temp_space_ids;

        say_info("saving snapshot `%s'", snap->filename);
        ERROR_INJECT_WHILE(ERRINJ_SNAP_WRITE_DELAY, {
@@ -1040,6 +1040,7 @@ checkpoint_f(va_list ap)
        });
        ERROR_INJECT(ERRINJ_SNAP_SKIP_ALL_ROWS, goto done);
        struct space_read_view *space_rv;
+       temp_space_ids = mh_i32_new();
        read_view_foreach_space(space_rv, &ckpt->rv) {
                FiberGCChecker gc_check;
                bool skip = false;

@nshy nshy assigned locker and unassigned nshy Jan 24, 2024
@locker locker assigned nshy and unassigned locker Jan 25, 2024
@nshy nshy assigned locker and unassigned nshy Jan 26, 2024
@locker locker added the full-ci Enables all tests for a pull request label Jan 26, 2024
@locker locker assigned nshy and unassigned locker Jan 26, 2024
During graceful shutdown it is convenient to first finish all
client (non-system) fibers. Otherwise every subsystem would have to
be ready to handle requests from client fibers during or after its
own shutdown, which would make the code more complex.

We first cancel the client fibers and then wait for them to finish.
A fiber may not respond to cancellation and hang, which causes
shutdown to hang, but this is the approach we already chose for
iproto shutdown.

Note that as a result of this approach the application will panic
if it is shut down during execution of the initialization script
(in particular if the script is executing box.cfg).

There are changes in the application/tests to adapt to client fiber
shutdown:

- make code cancellable (only enough to pass the existing tests; we
  did not investigate all the possible places that should be made
  cancellable).

- make the console stop sending echo to the client before client
  fiber shutdown. Otherwise, as the console server fiber is a client
  fiber, on shutdown we would send a message that the fiber is
  cancelled, which breaks a lot of existing tests. This approach is
  on par with iproto shutdown.

- some tests (7743, replication-luatest/shutdown, replication/anon,
  replication/force_recovery, etc.) test shutdown during execution
  of the init script. A panic is now expected, so change them
  accordingly.

- some tests (8530, errinj_vylog) use an error injection that
  blocks client fibers from finishing. Those tests don't need
  graceful shutdown, so let's just kill tarantool instead.

- change the vinyl/errinj test for gh-3225. We don't really need to
  check the case when the vinyl reader is blocked, as it executes
  small tasks (we assume the read syscall will not hang). Also
  change the vinyl dump shutdown test to slow the dump down instead
  of blocking it entirely; this is required for the client fibers in
  the test to finish in time.

- other similar changes

We can also drop the code from replication shutdown that was
required to handle client requests during/after shutdown.

Part of tarantool#8423

NO_CHANGELOG=internal
NO_DOC=internal
@nshy nshy removed the full-ci Enables all tests for a pull request label Jan 26, 2024
@nshy nshy added the full-ci Enables all tests for a pull request label Jan 26, 2024
nshy added a commit to nshy/queue that referenced this pull request Jan 29, 2024
We are going to finish all client (non-system) fibers in the process
of Tarantool shutdown. First we cancel them and then wait for them to
finish. So if the client fibers do not finish, Tarantool shutdown
never finishes either.

Currently some of the queue fibers cannot be finished, and we got a
hang during integration testing of PR [1]. Let's fix it.

In `queue_state.lua` the only error that can arise from `box.ctl.wait_rw`
or `box.ctl.wait_ro` is `FiberIsCancelled`, so we can just drop the pcall.

In `utubettl.lua` we just need to fix a typo, just as was already done
for fifottl in e355387 ("fix fifottl_fiber_iteration error handling").

[1] PR with client fibers finishing on shutdown.
tarantool/tarantool#9604
nshy added a commit to nshy/memcached that referenced this pull request Jan 29, 2024
We are going to finish all client (non-system) fibers in the process
of Tarantool shutdown. First we cancel them and then wait for them to
finish. So if the client fibers do not finish, Tarantool shutdown
never finishes either.

Currently the memcached service fiber cannot be finished, and we got a
hang during integration testing of PR [1]. Let's fix it.

[1] PR with client fibers finishing on shutdown.
tarantool/tarantool#9604
LeonidVas pushed a commit to tarantool/memcached that referenced this pull request Jan 29, 2024
We are going to finish all client (non-system) fibers in the process
of Tarantool shutdown. First we cancel them and then wait for them to
finish. So if the client fibers do not finish, Tarantool shutdown
never finishes either.

Currently the memcached service fiber cannot be finished, and we got a
hang during integration testing of PR [1]. Let's fix it.

[1] PR with client fibers finishing on shutdown.
tarantool/tarantool#9604
nshy added a commit to nshy/queue that referenced this pull request Jan 29, 2024
We are going to finish all client (non-system) fibers in the process
of Tarantool shutdown. First we cancel them and then wait for them to
finish. So if the client fibers do not finish, Tarantool shutdown
never finishes either.

Currently the fiber cannot be finished, and we got a hang during
integration testing of PR tarantool/tarantool#9604.
nshy added a commit to nshy/queue that referenced this pull request Jan 29, 2024
Fix a typo, just as was already done for fifottl in e355387 ("fix
fifottl_fiber_iteration error handling").

Required to pass integration testing of PR tarantool/tarantool#9604.
LeonidVas pushed a commit to tarantool/queue that referenced this pull request Jan 29, 2024
We are going to finish all client (non-system) fibers in the process
of Tarantool shutdown. First we cancel them and then wait for them to
finish. So if the client fibers do not finish, Tarantool shutdown
never finishes either.

Currently the fiber cannot be finished, and we got a hang during
integration testing of PR tarantool/tarantool#9604.
LeonidVas pushed a commit to tarantool/queue that referenced this pull request Jan 29, 2024
Fix a typo, just as was already done for fifottl in e355387 ("fix
fifottl_fiber_iteration error handling").

Required to pass integration testing of PR tarantool/tarantool#9604.
@nshy nshy assigned locker and unassigned nshy Jan 29, 2024
@locker locker merged commit bf62065 into tarantool:master Jan 29, 2024
91 checks passed
@nshy nshy deleted the client-fibers-shutdown branch January 29, 2024 16:03