socket_manager: add feature to take over another server #146

@daipom daipom commented Aug 30, 2024

Another process can take over UDP/TCP sockets without downtime.

    server = ServerEngine::SocketManager::Server.take_over_another_server(path)

This starts a new server that has all UDP/TCP sockets of the existing server.
It receives the sockets from the existing server and stops that server after the new one has started.

This may not be the primary use case assumed by ServerEngine, but we need this feature to replace both the server and the workers with a new process without downtime.
Currently, ServerEngine does not provide this feature for network servers.

At the moment, I assume that the application side uses this feature ad hoc, but, in the future, this could be used to support live reload for entire network servers.
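
As a rough illustration of the underlying idea (not ServerEngine's actual implementation), a listening socket can be handed to another party over a UNIX domain socket with Ruby's standard `UNIXSocket#send_io` and `UNIXSocket#recv_io`. The helper name `hand_over_listener` is hypothetical, and both ends live in one process here only to keep the sketch self-contained; in a real takeover they would be two separate processes. This does not work on Windows.

```ruby
require "socket"

# Hypothetical helper (not part of ServerEngine): passes `listener` through a
# UNIX socket pair and returns the copy received on the other end.
def hand_over_listener(listener)
  old_side, new_side = UNIXSocket.pair
  begin
    old_side.send_io(listener)  # sends the file descriptor itself
    new_side.recv_io(TCPServer) # wraps the received descriptor as a TCPServer
  ensure
    old_side.close
    new_side.close
  end
end

original = TCPServer.new("127.0.0.1", 0) # port 0: let the OS pick a free port
port = original.addr[1]

taken_over = hand_over_listener(original)
original.close # the "old server" stops; the port stays open via taken_over

client = TCPSocket.new("127.0.0.1", port)
client.write("hello")
client.close

connection = taken_over.accept
puts connection.read
connection.close
taken_over.close
```

Because the received descriptor refers to the same listening socket, closing the original does not close the port, which is what makes a takeover without downtime possible.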

daipom commented Aug 30, 2024

TODO add tests.

@daipom daipom force-pushed the add-feature-to-take-over-another-server branch from 9f44eb8 to 9d45e17 Compare August 30, 2024 06:52
daipom added a commit to fluent/fluentd that referenced this pull request Aug 30, 2024
Add a new feature: Update/Reload without downtime.

1. The current supervisor receives a signal.
2. The current supervisor sends signals to its workers, and the
   workers stop all plugins that cannot run in parallel.
3. The current supervisor starts a new supervisor.
   * => Old processes and new processes run in parallel.
4. After the new supervisor and its workers start to work, the
   current supervisor and its workers stop.

ref: nginx's feature for upgrading on the fly

* http://nginx.org/en/docs/control.html#upgrade

Problem to solve:

Updating Fluentd or reloading a config causes downtime.
Plugins that receive data as a server, such as `in_udp`, `in_tcp`,
and `in_syslog`, cannot receive data during this time.
This means that the data sent by a client is lost during this
time unless the client has a re-sending feature.
This makes updating Fluentd or reloading a config difficult in
some cases.

Specific feature:

Run only limited Input plugins in parallel, such as `in_tcp`,
`in_udp`, and `in_syslog`.
Stop all plugins except those Input plugins, and prepare a
dedicated file buffer for Output.
After the new workers start, they load the file buffer and route
those events to the ROOT label.

Note: this needs treasure-data/serverengine#146

Signed-off-by: Daijiro Fukuda <[email protected]>
Another process can take over UDP/TCP sockets without downtime.

    server = ServerEngine::SocketManager::Server.take_over_another_server(path)

This starts a new server that has all UDP/TCP sockets of the
existing server.
It receives the sockets from the existing server and stops that
server after the new one has started.

This may not be the primary use case assumed by ServerEngine, but
we need this feature to replace both the server and the workers
with a new process without downtime.
Currently, ServerEngine does not provide this feature for
network servers.

At the moment, I assume that the application side uses this
feature ad hoc, but, in the future, this could be used to support
live reload for entire network servers.

ref: fluent/fluentd#4622

Signed-off-by: Daijiro Fukuda <[email protected]>
@ashie ashie force-pushed the add-feature-to-take-over-another-server branch from 9d45e17 to 4a5b1a4 Compare September 3, 2024 14:29
ashie commented Sep 3, 2024

I've rebased this on top of the current master branch.

daipom added a commit to daipom/fluentd that referenced this pull request Oct 3, 2024
Add a new feature: Update/Reload without downtime.

1. The current supervisor receives a signal.
2. The current supervisor sends signals to its workers, and the
   workers stop all plugins that cannot run in parallel.
3. The current supervisor starts a new supervisor.
   * => Old processes and new processes run in parallel.
4. After the new supervisor and its workers start to work, the
   current supervisor and its workers stop.

ref: nginx's feature for upgrading on the fly

* http://nginx.org/en/docs/control.html#upgrade

Problem to solve:

Updating Fluentd or reloading a config causes downtime.
Plugins that receive data as a server, such as `in_udp`, `in_tcp`,
and `in_syslog`, cannot receive data during this time.
This means that the data sent by a client is lost during this
time unless the client has a re-sending feature.
This makes updating Fluentd or reloading a config difficult in
some cases.

Specific feature:

Run only limited Input plugins in parallel, such as `in_tcp`,
`in_udp`, and `in_syslog`.
Stop all plugins except those Input plugins, and prepare an
agent for forwarding data to the new workers.
After the new workers start, they receive events from the old
workers.

Note: this needs treasure-data/serverengine#146

Signed-off-by: Daijiro Fukuda <[email protected]>
daipom added a commit to fluent/fluentd that referenced this pull request Oct 3, 2024
daipom added a commit to fluent/fluentd that referenced this pull request Oct 11, 2024
Add a new feature: Update/Reload without downtime.

1. The supervisor receives SIGUSR2.
2. Spawn a new supervisor.
3. Take over shared sockets.
4. Launch new workers, and stop old processes in parallel.
   * Launch new workers with source-only mode
     * Limit to restart_without_downtime_ready? input plugin
   * Send SIGTERM to the old supervisor after 10s delay from 3.
5. The old supervisor stops and sends SIGRTMIN(34) to the new one.
6. The new workers run fully.

Problem to solve:

Updating Fluentd or reloading a config causes downtime.
Plugins that receive data as a server, such as `in_udp`, `in_tcp`,
and `in_syslog`, cannot receive data during this time.
This means that the data sent by a client is lost during this
time unless the client has a re-sending feature.
This makes updating Fluentd or reloading a config difficult in
some cases.

Note: this needs the following features:

* #4661
* treasure-data/serverengine#146

Signed-off-by: Daijiro Fukuda <[email protected]>
daipom added a commit to fluent/fluentd that referenced this pull request Oct 11, 2024
Add a new feature: Update/Reload without downtime.

1. The supervisor receives SIGUSR2.
2. Spawn a new supervisor.
3. Take over shared sockets.
4. Launch new workers, and stop old processes in parallel.
   * Launch new workers with source-only mode
     * Limit to restart_without_downtime_ready? input plugin
   * Send SIGTERM to the old supervisor after 10s delay from 3.
5. The old supervisor stops and sends SIGRTMIN(34) to the new one.
6. The new workers run fully.

Problem to solve:

Updating Fluentd or reloading a config causes downtime.
Plugins that receive data as a server, such as `in_udp`, `in_tcp`,
and `in_syslog`, cannot receive data during this time.
This means that the data sent by a client is lost during this
time unless the client has a re-sending feature.
This makes updating Fluentd or reloading a config difficult in
some cases.

Note: this needs the following features:

* #4661
* treasure-data/serverengine#146

Co-authored-by: Shizuo Fujita <[email protected]>
Signed-off-by: Daijiro Fukuda <[email protected]>
daipom commented Oct 15, 2024

Sorry, I have made a new PR for this to change the branch.

@daipom daipom closed this Oct 15, 2024
daipom added a commit to fluent/fluentd that referenced this pull request Oct 31, 2024
daipom added a commit to fluent/fluentd that referenced this pull request Nov 19, 2024
Add a new feature: Zero downtime update/reload

1. The supervisor receives SIGUSR2.
2. Spawn a new supervisor.
3. Take over shared sockets.
4. Launch new workers, and stop old processes in parallel.
   * Launch new workers with source-only mode
     * Limit to restart_without_downtime_ready? input plugin
   * Send SIGTERM to the old supervisor after 10s delay from 3.
5. The old supervisor stops and sends SIGWINCH to the new one.
6. The new workers run fully.

Problem to solve:

Updating Fluentd or reloading a config causes downtime.
Plugins that receive data as a server, such as `in_udp`, `in_tcp`,
and `in_syslog`, cannot receive data during this time.
This means that the data sent by a client is lost during this
time unless the client has a re-sending feature.
This makes updating Fluentd or reloading a config difficult in
some cases.

Note: this needs the following features:

* #4661
* treasure-data/serverengine#146

Co-authored-by: Shizuo Fujita <[email protected]>
Signed-off-by: Daijiro Fukuda <[email protected]>
daipom added a commit to fluent/fluentd that referenced this pull request Nov 19, 2024
Add a new feature: Zero downtime update/reload

1. The supervisor receives SIGUSR2.
2. Spawn a new supervisor.
3. Take over shared sockets.
4. Launch new workers, and stop old processes in parallel.
   * Launch new workers with source-only mode
     * Limit to input plugins whose zero_downtime_restart_ready? is true
   * Send SIGTERM to the old supervisor 10s after step 3.
5. The old supervisor stops and sends SIGWINCH to the new one.
6. The new workers run fully.
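
Steps 4 and 5 above amount to a small signal handshake. The following is a minimal sketch of that handshake only, assuming the new supervisor is the one that sends SIGTERM. The method name `simulate_supervisor_handover` is hypothetical, the 10s delay is skipped, and no sockets or workers are involved. It relies on `fork`, so it does not run on Windows.

```ruby
require "timeout"

# Hypothetical sketch: the calling process plays the new supervisor and a
# forked child plays the old one. Returns true once SIGWINCH has arrived.
def simulate_supervisor_handover
  notified = false
  previous = Signal.trap("WINCH") { notified = true }

  ready_r, ready_w = IO.pipe
  old_supervisor = fork do
    ready_r.close
    Signal.trap("TERM") do
      Process.kill("WINCH", Process.ppid) # tell the new supervisor we are done
      exit!(0)
    end
    ready_w.close # EOF on the pipe means "TERM handler installed"
    sleep         # keep "serving" until SIGTERM arrives
  end

  ready_w.close
  ready_r.read # returns at EOF, i.e. once the child is ready
  ready_r.close

  Process.kill("TERM", old_supervisor) # the real flow waits about 10s first
  Process.wait(old_supervisor)
  Timeout.timeout(5) { sleep 0.01 until notified }
  notified
ensure
  # Restore whatever SIGWINCH handler was in place before the sketch ran.
  Signal.trap("WINCH", previous || "DEFAULT")
end

puts simulate_supervisor_handover
```

The pipe is there only to avoid a race: without it, SIGTERM could reach the child before its handler is installed.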

Problem to solve:

Updating Fluentd or reloading a config causes downtime.
Plugins that receive data as a server, such as `in_udp`, `in_tcp`,
and `in_syslog`, cannot receive data during this time.
This means that the data sent by a client is lost during this
time unless the client has a re-sending feature.
This makes updating Fluentd or reloading a config difficult in
some cases.

Note: this needs the following features:

* #4661
* treasure-data/serverengine#146

Co-authored-by: Shizuo Fujita <[email protected]>
Signed-off-by: Daijiro Fukuda <[email protected]>
daipom added a commit to fluent/fluentd that referenced this pull request Nov 21, 2024
daipom added a commit to fluent/fluentd that referenced this pull request Nov 21, 2024
daipom added a commit to fluent/fluentd that referenced this pull request Nov 22, 2024
daipom added a commit to fluent/fluentd that referenced this pull request Nov 22, 2024
daipom added a commit to fluent/fluentd that referenced this pull request Nov 22, 2024
daipom added a commit to fluent/fluentd that referenced this pull request Nov 25, 2024
daipom added a commit to fluent/fluentd that referenced this pull request Nov 25, 2024
daipom added a commit to fluent/fluentd that referenced this pull request Nov 25, 2024
daipom added a commit to daipom/fluentd that referenced this pull request Nov 25, 2024
This replaces the current `GracefulReload` (`SIGUSR2`) (fluent#2716)
with the new feature on non-Windows:

* Restart the new process with zero downtime

The primary motivation is to make it possible to update Fluentd
without losing data in plugins such as `in_udp`.

Specification:

* 2 ways to trigger this feature (non-Windows):
  * Signal: `SIGUSR2` to the supervisor.
    * Sending `SIGUSR2` to the workers triggers the traditional
      GracefulReload.
      * (Leave the traditional way, just in case)
  * RPC: `/api/config.gracefulReload`
* This starts the new supervisor and workers with zero downtime
  for some plugins.
  * Input plugins that support `zero_downtime_restart` run in
    parallel.
    * Supported input plugins:
      * `in_tcp`
      * `in_udp`
      * `in_syslog`
  * The old processes stop after 10s.
* The new supervisor works in `source-only` mode (fluent#4661)
  until the old processes stop.
  * After the old processes stop, the data handled by the new
    processes is loaded and processed.
  * If needed, you can configure `source_only_buffer` (see fluent#4661).
* Windows: Not affected at all. Remains the traditional
  GracefulReload.
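
The `source-only` behavior described above can be pictured as "accept and hold now, process later". The class below is a hypothetical stand-in written only to illustrate that ordering; it is not Fluentd's implementation, and an in-memory array takes the place of the real source-only buffer.

```ruby
# Hypothetical stand-in for a new worker; not Fluentd code.
class SourceOnlyWorker
  attr_reader :processed

  def initialize
    @source_only = true
    @pending = []   # stands in for the source-only buffer
    @processed = []
  end

  # Inputs keep accepting events in both modes.
  def receive(event)
    if @source_only
      @pending << event
    else
      process(event)
    end
  end

  # Called once the old processes have stopped.
  def run_fully
    @source_only = false
    @pending.each { |event| process(event) }
    @pending.clear
  end

  private

  def process(event)
    @processed << event.upcase # placeholder for filter and output plugins
  end
end

worker = SourceOnlyWorker.new
worker.receive("held while the old processes are still running")
worker.run_fully
worker.receive("processed immediately")
puts worker.processed
```

Events received before `run_fully` are processed first, so the order in which events arrived is preserved across the switch.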

Mechanism:

1. The supervisor receives SIGUSR2.
2. Spawn a new supervisor.
3. Take over shared sockets.
4. Launch new workers, and stop old processes in parallel.
   * Launch new workers with source-only mode
     * Limit to input plugins whose zero_downtime_restart_ready? is true
   * Send SIGTERM to the old supervisor 10s after step 3.
5. The old supervisor stops and sends SIGWINCH to the new one.
6. The new workers run fully.

Note: this needs the following features:

* fluent#4661
* treasure-data/serverengine#146

Conditions under which `zero_downtime_restart_ready?` can be enabled:

* Must be able to work in parallel with another Fluentd instance.
* Notes:
  * The sockets provided by server helper are shared with the
    new Fluentd instance.
  * Input plugins managing a position such as `in_tail` should
    not enable its `zero_downtime_restart_ready?`.
    * Such input plugins do not cause data loss on restart, so
      there is no need to enable this in the first place.
  * `in_http` and `in_forward` could also be supported.
    They are not supported this time simply because considering
    them needs more time.
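
The opt-in described above can be sketched as a predicate that is false by default and overridden by the plugins that can safely run in parallel. Only the method name `zero_downtime_restart_ready?` comes from this commit message; the classes and the `inputs_for_source_only` helper are hypothetical stand-ins, not Fluentd's plugin classes.

```ruby
# Hypothetical stand-ins; these are not Fluentd's plugin classes.
class SketchInput
  # Plugins are excluded unless they explicitly opt in.
  def zero_downtime_restart_ready?
    false
  end
end

# A server-style input whose socket can be shared with the new instance.
class SketchUdpInput < SketchInput
  def zero_downtime_restart_ready?
    true
  end
end

# A position-managing input keeps the default and stays out.
class SketchTailInput < SketchInput
end

# Selects the inputs that may start while the old instance is still running.
def inputs_for_source_only(inputs)
  inputs.select(&:zero_downtime_restart_ready?)
end

selected = inputs_for_source_only([SketchUdpInput.new, SketchTailInput.new])
puts selected.map(&:class).inspect
```

Defaulting to `false` keeps third-party inputs out of the parallel phase unless their authors have checked that running two instances at once is safe.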

The appropriateness of replacing the traditional GracefulReload:

* The traditional GracefulReload feature has some limitations
  and issues.
  * Limitations:
    1. A change to system_config is ignored because applying it
       requires restarting (killing and spawning) the process.
    2. Plugins must not use class variables when restarting.
  * Issues:
    * fluent#2259
    * fluent#3469
    * fluent#3549
* This new feature allows restarts without downtime and without
  such limitations.
  * Although supported plugins are limited, that is not a
    problem for many plugins.
    (The problem is with server-based input plugins where the
    stop results in data loss).
* This new feature has a big advantage that it can also be used
  to update Fluentd.
  * In the future, fluent-package will use this feature to allow
    update with zero downtime by default.
* If needed, we can still use the traditional feature by
  directly sending `SIGUSR2` to the workers.

Co-authored-by: Shizuo Fujita <[email protected]>
Signed-off-by: Daijiro Fukuda <[email protected]>
daipom added a commit to fluent/fluentd that referenced this pull request Nov 25, 2024
daipom added a commit to fluent/fluentd that referenced this pull request Nov 26, 2024
This replaces the current `SIGUSR2` (#2716) with the new feature.
(Not supported on Windows).

* Restart the new process with zero downtime

The primary motivation is to make it possible to update Fluentd
without losing data in plugins such as `in_udp`.

Specification:

* 2 ways to trigger this feature (non-Windows):
  * Signal: `SIGUSR2` to the supervisor.
    * Sending `SIGUSR2` to the workers triggers the traditional
      GracefulReload.
      * (Leave the traditional way, just in case)
  * RPC: `/api/processes.zeroDowntimeRestart`
    * Leave `/api/config.gracefulReload` for the traditional feature.
* This starts the new supervisor and workers with zero downtime
  for some plugins.
  * Input plugins that support `zero_downtime_restart` run in
    parallel.
    * Supported input plugins:
      * `in_tcp`
      * `in_udp`
      * `in_syslog`
  * The old processes stop after 10s.
* The new supervisor works in `source-only` mode (#4661)
  until the old processes stop.
  * After the old processes stop, the data handled by the new
    processes is loaded and processed.
  * If needed, you can configure `source_only_buffer` (see #4661).
* Windows: Not affected at all. Remains the traditional
  GracefulReload.
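
The two triggers listed above could be driven from a small script such as the following sketch. The RPC path and the signal name come from this commit message; the helper names are hypothetical, and the `127.0.0.1:24444` address is only an assumed value for the RPC endpoint configured on the Fluentd side. The use of a plain GET request is also an assumption.

```ruby
require "net/http"
require "uri"

# Builds the RPC URL. `endpoint` is an assumed "host:port" string matching
# the RPC endpoint configured for the supervisor.
def zero_downtime_restart_uri(endpoint)
  URI("http://#{endpoint}/api/processes.zeroDowntimeRestart")
end

# Trigger 1 (RPC): assumes the endpoint answers a plain GET request.
def trigger_by_rpc(endpoint)
  Net::HTTP.get_response(zero_downtime_restart_uri(endpoint))
end

# Trigger 2 (signal): must target the supervisor, because sending SIGUSR2
# to a worker triggers the traditional GracefulReload instead.
def trigger_by_signal(supervisor_pid)
  Process.kill("USR2", supervisor_pid)
end

puts zero_downtime_restart_uri("127.0.0.1:24444")
```

Neither trigger is invoked here, since both need a running supervisor; only the URL construction is exercised.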

Mechanism:

1. The supervisor receives SIGUSR2.
2. Spawn a new supervisor.
3. Take over shared sockets.
4. Launch new workers, and stop old processes in parallel.
   * Launch new workers with source-only mode
     * Limit to input plugins whose zero_downtime_restart_ready? is true
   * Send SIGTERM to the old supervisor 10s after step 3.
5. The old supervisor stops and sends SIGWINCH to the new one.
6. The new workers run fully.

Note: this needs the following features:

* #4661
* treasure-data/serverengine#146

Conditions under which `zero_downtime_restart_ready?` can be enabled:

* Must be able to work in parallel with another Fluentd instance.
* Notes:
  * The sockets provided by server helper are shared with the
    new Fluentd instance.
  * Input plugins managing a position such as `in_tail` should
    not enable its `zero_downtime_restart_ready?`.
    * Such input plugins do not cause data loss on restart, so
      there is no need to enable this in the first place.
  * `in_http` and `in_forward` could also be supported.
    They are not supported this time simply because considering
    them needs more time.

The appropriateness of replacing the traditional SIGUSR2:

* The traditional SIGUSR2 feature has some limitations and issues.
  * Limitations:
    1. A change to system_config is ignored because applying it
       requires restarting (killing and spawning) the process.
    2. Plugins must not use class variables when restarting.
  * Issues:
    * #2259
    * #3469
    * #3549
* This new feature allows restarts without downtime and without
  such limitations.
  * Although supported plugins are limited, that is not a
    problem for many plugins.
    (The problem is with server-based input plugins where the
    stop results in data loss).
* This new feature has a big advantage that it can also be used
  to update Fluentd.
  * In the future, fluent-package will use this feature to allow
    update with zero downtime by default.
* If needed, we can still use the traditional feature by RPC or
  directly sending `SIGUSR2` to the workers.

Co-authored-by: Shizuo Fujita <[email protected]>
Signed-off-by: Daijiro Fukuda <[email protected]>
daipom added a commit to fluent/fluentd that referenced this pull request Nov 26, 2024
This replaces the current `SIGUSR2` (#2716) with the new feature.
(Not supported on Windows).

* Restart the new process with zero downtime

The primary motivation is to enable the update of Fluentd
without data loss of plugins such as `in_udp`.

Specification:

* Two ways to trigger this feature (non-Windows):
  * Signal: `SIGUSR2` to the supervisor.
    * Sending `SIGUSR2` to the workers triggers the traditional
      GracefulReload.
      * (The traditional way is kept, just in case.)
  * RPC: `/api/processes.zeroDowntimeRestart`
    * `/api/config.gracefulReload` is kept for the traditional
      feature.
* This starts the new supervisor and workers with zero downtime
  for some plugins.
  * Input plugins that support `zero_downtime_restart` work in
    parallel in the old and new processes.
    * Supported input plugins:
      * `in_tcp`
      * `in_udp`
      * `in_syslog`
  * The old processes stop after 10s.
* The new supervisor works in `source-only` mode (#4661)
  until the old processes stop.
  * After the old processes stop, the data handled by the new
    processes is loaded and processed.
  * If needed, you can configure `source_only_buffer` (see #4661).
* Windows: Not affected at all. The traditional GracefulReload
  remains.
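
The two triggers above could be exercised from Ruby roughly as follows. This is a minimal sketch, not Fluentd code: the helper names are hypothetical, the `"host:port"` endpoint format is an assumption, and it assumes the RPC server is enabled and answers a plain HTTP GET on the path named in the specification.

```ruby
require "net/http"
require "uri"

# Way 1: send SIGUSR2 to the supervisor process (not to a worker, which
# would trigger the traditional GracefulReload instead).
def trigger_by_signal(supervisor_pid)
  Process.kill("USR2", supervisor_pid)
end

# Hypothetical helper: build the RPC URL from a "host:port" string.
# The path is the one given in the specification above.
def zero_downtime_restart_uri(rpc_endpoint, path = "/api/processes.zeroDowntimeRestart")
  host, port = rpc_endpoint.split(":", 2)
  URI::HTTP.build(host: host, port: Integer(port), path: path)
end

# Way 2: call the RPC. Assumption: the endpoint accepts an HTTP GET.
def trigger_by_rpc(rpc_endpoint)
  Net::HTTP.get_response(zero_downtime_restart_uri(rpc_endpoint))
end
```

For example, `trigger_by_rpc("127.0.0.1:24444")` would request the restart, where the address is only an example value for the configured RPC endpoint.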

Mechanism:

1. The supervisor receives SIGUSR2.
2. Spawn a new supervisor.
3. Take over shared sockets.
4. Launch new workers, and stop old processes in parallel.
   * Launch new workers in source-only mode.
     * Limited to input plugins with `zero_downtime_restart_ready?`
       enabled.
   * Send SIGTERM to the old supervisor 10s after step 3.
5. The old supervisor stops and sends SIGWINCH to the new one.
6. The new workers run fully.
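
The socket takeover in step 3 can be illustrated with plain file-descriptor passing over a Unix domain socket. The sketch below is an assumption-laden stand-in, not ServerEngine's or Fluentd's implementation: `take_over_demo` is a hypothetical name, `fork` stands in for spawning the new supervisor, and a `UNIXSocket.pair` stands in for the channel between the old and new servers. It only runs on Unix-like systems.

```ruby
require "socket"

# Returns the datagram that the "new" process read from the UDP socket
# it took over from the "old" process.
def take_over_demo
  udp = UDPSocket.new
  udp.bind("127.0.0.1", 0)             # the "old" process owns the socket
  port = udp.addr[1]

  old_end, new_end = UNIXSocket.pair   # stand-in control channel
  result_r, result_w = IO.pipe

  new_pid = fork do                    # stand-in for the new supervisor
    old_end.close
    result_r.close
    udp.close                          # drop the copy inherited via fork
    taken = new_end.recv_io(UDPSocket) # receive the socket from the old side
    new_end.write("ready")             # tell the old side it may stop
    msg, = taken.recvfrom(1024)
    result_w.write(msg)
    result_w.close
    exit!(0)
  end

  new_end.close
  result_w.close
  old_end.send_io(udp)                 # hand the socket over
  old_end.read(5)                      # wait for "ready"
  udp.close                            # the old side stops using the socket

  sender = UDPSocket.new
  sender.send("hello", 0, "127.0.0.1", port)
  sender.close

  msg = result_r.read                  # what the new side received
  Process.wait(new_pid)
  msg
end
```

Because the kernel keeps the socket open as long as one process holds it, a datagram sent after the old side has closed its copy is still delivered to the new side, which is the property the zero-downtime restart relies on.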

Note: this needs the following features:

* #4661
* treasure-data/serverengine#146

Conditions under which `zero_downtime_restart_ready?` can be enabled:

* Must be able to work in parallel with another Fluentd instance.
* Notes:
  * The sockets provided by server helper are shared with the
    new Fluentd instance.
  * Input plugins managing a position such as `in_tail` should
    not enable `zero_downtime_restart_ready?`.
    * Such input plugins do not cause data loss on restart, so
      there is no need to enable this in the first place.
  * `in_http` and `in_forward` could also be supported.
    They are not supported this time simply because more time
    is needed to consider them.
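
As a rough illustration of the conditions above, a plugin opts in by overriding `zero_downtime_restart_ready?`. The classes below are hypothetical stand-ins and not Fluentd's real plugin classes; the default of `false` and the selection helper are assumptions made for this sketch.

```ruby
# Hypothetical stand-in for an input plugin base class: opted out by default.
class SketchInput
  def zero_downtime_restart_ready?
    false
  end
end

# A socket-based input that can run in parallel with another instance,
# because its socket is shared with the new process: opts in.
class SketchUdpInput < SketchInput
  def zero_downtime_restart_ready?
    true
  end
end

# An input that manages a position (like `in_tail`): keeps the default,
# since it does not lose data on restart anyway.
class SketchTailInput < SketchInput
end

# Hypothetical helper: pick the inputs allowed to run while the old
# processes are still alive (step 4 of the mechanism).
def inputs_for_parallel_run(inputs)
  inputs.select(&:zero_downtime_restart_ready?)
end
```

With `[SketchUdpInput.new, SketchTailInput.new]` as input, only the UDP stand-in is selected.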

Why replacing the traditional SIGUSR2 is appropriate:

* The traditional SIGUSR2 feature has some limitations and issues.
  * Limitations:
    1. A change to system_config is ignored because applying it
       requires restarting (kill/spawn) the process.
    2. Plugins must not use class variables when restarting.
  * Issues:
    * #2259
    * #3469
    * #3549
* This new feature allows restarts without downtime and without
  such limitations.
  * Although supported plugins are limited, that is not a
    problem for many plugins.
    (The problem is with server-based input plugins, where
    stopping results in data loss.)
* This new feature has the big advantage that it can also be
  used to update Fluentd.
  * In the future, fluent-package will use this feature to allow
    updates with zero downtime by default.
* If needed, we can still use the traditional feature by RPC or
  directly sending `SIGUSR2` to the workers.

Co-authored-by: Shizuo Fujita <[email protected]>
Signed-off-by: Daijiro Fukuda <[email protected]>
daipom added a commit to fluent/fluentd that referenced this pull request Nov 28, 2024
Signed-off-by: Daijiro Fukuda <[email protected]>
Co-authored-by: Shizuo Fujita <[email protected]>
Co-authored-by: Kentaro Hayashi <[email protected]>