The reload configuration with the worker tag failed #3469

xidiandb · 2021-07-26T08:54:43Z

Describe the bug

I used a configuration with <worker> directives. It loads on startup but fails on reload.

To Reproduce

Use my configuration to start and reload

Expected behavior

fluent/log.rb:371:error: Failed to reload config file: specified worker_id<0> collisions is detected on <worker> directive. Available worker id(s): []

Your Environment

- Fluentd version: 1.11.4
- TD Agent version:
- Operating system: Ubuntu 18.04
- Kernel version: 5.4.61-050461-generic

Your Configuration

<system>
  workers 1
  rpc_endpoint "#{ENV['POD_IP']}:24444"
</system>
<label @FLUENT_LOG>
<match fluent.*>
  @type null
</match>
</label>   
<worker 0>
<source>
  @type sample
  sample {"hello borg ooo":"world"}
  rate 1
  tag sample.ni.hao
</source>
</worker>
<worker 0>
<match sample.*.*>
  @type stdout
</match>
</worker>

Your Error Log

fluent/log.rb:371:error: Failed to reload config file: specified worker_id<0> collisions is detected on <worker> directive. Available worker id(s): []

Additional context

No response


kenhys · 2021-07-26T09:37:42Z

Use the following instead; do not define multiple <worker 0> sections.

<system>
  workers 1
  rpc_endpoint "#{ENV['POD_IP']}:24444"
</system>
<label @FLUENT_LOG>
  <match fluent.*>
    @type null
  </match>
</label>   
<worker 0>
  <source>
    @type sample
    sample {"hello borg ooo":"world"}
    rate 1
    tag sample.ni.hao
  </source>
  <match sample.*.*>
    @type stdout
  </match>
</worker>

xidiandb · 2021-07-26T09:53:15Z

@kenhys But there was no problem on the first start, only on reload. Moreover, my configuration is very complex and split across multiple files, and some parts cannot be written into a single <worker> section. I would like to know whether starting with this configuration is OK, and whether the failure on reload is a bug.

xidiandb · 2021-07-26T09:57:43Z

I want to add configuration dynamically by adding files, each with its own <worker> directive. Is that not supported? And why is there no problem at startup, only on reload?

kenhys · 2021-07-27T07:51:24Z

Hmm, I've overlooked it.

kenhys · 2021-07-27T08:49:09Z

https://github.com/fluent/fluentd/blob/master/lib/fluent/supervisor.rb#L290-L303
It seems that Fluent::Engine.reload_config raises it.

ashie · 2021-10-07T09:32:53Z

I cannot reproduce it with the HUP signal, but I can reproduce it with the USR2 signal.

ashie · 2021-10-07T09:44:07Z

There are 2 places that raise this message:

fluentd/lib/fluent/static_config_analysis.rb

Line 81 in 1b46fe0

    
           raise Fluent::ConfigError, "specified worker_id<#{id}> collisions is detected on <worker> directive. Available worker id(s): #{available_worker_ids}"

fluentd/lib/fluent/root_agent.rb

Line 95 in 1b46fe0

    
           raise Fluent::ConfigError, "specified worker_id<#{worker_id}> collisions is detected on <worker> directive. Available worker id(s): #{available_worker_ids}"

and the former one is used in this case.

ashie · 2021-10-15T09:06:16Z

Fluent::StaticConfigAnalysis is used only on graceful reload.
It was added to implement the graceful-reload feature.

ashie · 2021-10-15T09:44:37Z

On startup, <worker> elements are parsed by the following code.

fluentd/lib/fluent/root_agent.rb

Lines 69 to 124 in 8f990b8

    used_worker_ids = []
    available_worker_ids = (0..Fluent::Engine.system_config.workers - 1).to_a
    # initialize <worker> elements
    conf.elements(name: 'worker').each do |e|
      target_worker_id_str = e.arg
      if target_worker_id_str.empty?
        raise Fluent::ConfigError, "Missing worker id on <worker> directive"
      end

      target_worker_ids = target_worker_id_str.split("-")
      if target_worker_ids.size == 2
        first_worker_id = target_worker_ids.first.to_i
        last_worker_id = target_worker_ids.last.to_i
        if first_worker_id > last_worker_id
          raise Fluent::ConfigError, "greater first_worker_id<#{first_worker_id}> than last_worker_id<#{last_worker_id}> specified by <worker> directive is not allowed. Available multi worker assign syntax is <smaller_worker_id>-<greater_worker_id>"
        end
        target_worker_ids = []
        first_worker_id.step(last_worker_id, 1) do |worker_id|
          target_worker_id = worker_id.to_i
          target_worker_ids << target_worker_id

          if target_worker_id < 0 || target_worker_id > (Fluent::Engine.system_config.workers - 1)
            raise Fluent::ConfigError, "worker id #{target_worker_id} specified by <worker> directive is not allowed. Available worker id is between 0 and #{(Fluent::Engine.system_config.workers - 1)}"
          end
          available_worker_ids.delete(target_worker_id) if available_worker_ids.include?(target_worker_id)
          if used_worker_ids.include?(target_worker_id)
            raise Fluent::ConfigError, "specified worker_id<#{worker_id}> collisions is detected on <worker> directive. Available worker id(s): #{available_worker_ids}"
          end
          used_worker_ids << target_worker_id

          e.elements.each do |elem|
            unless ['source', 'match', 'filter', 'label'].include?(elem.name)
              raise Fluent::ConfigError, "<worker> section cannot have <#{elem.name}> directive"
            end
          end

          unless target_worker_ids.empty?
            e.set_target_worker_ids(target_worker_ids.uniq)
          end
        end
      else
        target_worker_id = target_worker_id_str.to_i
        if target_worker_id < 0 || target_worker_id > (Fluent::Engine.system_config.workers - 1)
          raise Fluent::ConfigError, "worker id #{target_worker_id} specified by <worker> directive is not allowed. Available worker id is between 0 and #{(Fluent::Engine.system_config.workers - 1)}"
        end

        e.elements.each do |elem|
          unless ['source', 'match', 'filter', 'label'].include?(elem.name)
            raise Fluent::ConfigError, "<worker> section cannot have <#{elem.name}> directive"
          end
          elem.set_target_worker_id(target_worker_id)
        end
      end
      conf += e
    end
    conf.elements.delete_if{|e| e.name == 'worker'}

Fluent::StaticConfigAnalysis seems to be quite different from it.
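
To make the difference concrete, here is a minimal standalone sketch. It does not use Fluentd's real classes: `WorkerElement`, `startup_check`, and `strict_check` are hypothetical names. `startup_check` mirrors the start-up code quoted above, where only the `<worker n-m>` range form is checked for collisions (bounds checks are omitted for brevity). `strict_check` is only an assumption about what a check that tracks every id would look like; it is not copied from static_config_analysis.rb.

```ruby
# Simplified stand-in for a parsed <worker ...> element (hypothetical, not Fluentd's class).
WorkerElement = Struct.new(:arg)

# Mirrors the start-up logic quoted above: only the range form "n-m"
# is checked for collisions; the single-id form is never recorded.
def startup_check(worker_elements)
  used = []
  worker_elements.each do |e|
    ids = e.arg.split("-")
    next unless ids.size == 2 # single-id form: no collision check
    (ids.first.to_i..ids.last.to_i).each do |id|
      raise ArgumentError, "specified worker_id<#{id}> collisions is detected" if used.include?(id)
      used << id
    end
  end
  :ok
end

# Assumed stricter variant: every id is tracked, whichever syntax declared it.
def strict_check(worker_elements, workers)
  used = []
  available = (0...workers).to_a
  worker_elements.each do |e|
    first, last = e.arg.split("-").map(&:to_i)
    last ||= first # single-id form such as "0"
    (first..last).each do |id|
      available.delete(id)
      if used.include?(id)
        raise ArgumentError, "specified worker_id<#{id}> collisions is detected on <worker> directive. Available worker id(s): #{available}"
      end
      used << id
    end
  end
  :ok
end

# Two <worker 0> sections with "workers 1", as in the reported configuration.
elements = [WorkerElement.new("0"), WorkerElement.new("0")]
puts startup_check(elements)
begin
  strict_check(elements, 1)
rescue ArgumentError => e
  puts e.message
end
```

With two `<worker 0>` elements and `workers 1`, the lenient path accepts the configuration, while the stricter variant fails with an empty list of available ids, which has the same shape as the message in the error log.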

ashie · 2021-10-26T05:14:26Z

> I want to add configuration dynamically by adding files, each with its own <worker> directive. Is that not supported? And why is there no problem at startup, only on reload?

Hmm, it's ambiguous whether multiple <worker> directives for the same ID are supported or not.
The duplication check was introduced in #2292 only for the <worker n-m> syntax.
With the <worker n> syntax, multiple directives for the same ID were allowed both before and after that change.

ashie · 2021-10-26T05:35:28Z

I think it would be better to check for duplication with the <worker n> syntax too, but to show only a warning and not block loading, in order to keep compatibility.
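
A rough sketch of what such a warn-only check might look like. This is not Fluentd code: `warn_duplicate_worker_ids` is a hypothetical helper, and the warning text is made up for illustration. It takes the argument strings of the `<worker>` directives, logs a warning for each repeated single id, and returns the duplicated ids instead of raising.

```ruby
require "logger"

# Hypothetical helper (not part of Fluentd): warn about ids that are used by
# more than one single-id <worker n> directive, without blocking loading.
def warn_duplicate_worker_ids(worker_args, logger: Logger.new($stderr))
  seen = []
  duplicates = []
  worker_args.each do |arg|
    next if arg.include?("-") # the range form already has a strict check
    id = arg.to_i
    if seen.include?(id)
      duplicates << id unless duplicates.include?(id)
      logger.warn("worker_id<#{id}> is specified by more than one <worker> directive")
    else
      seen << id
    end
  end
  duplicates
end

# Example: two <worker 0> directives and one <worker 1> directive.
p warn_duplicate_worker_ids(["0", "0", "1"])
```

Returning the duplicates rather than raising keeps existing multi-file configurations loadable, which is the compatibility concern raised above.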

This replaces the current `GracefulReload` (`SIGUSR2`) (fluent#2716) with the new feature on non-Windows:

* Restart the new process with zero downtime

The primary motivation is to enable the update of Fluentd without data loss of plugins such as `in_udp`.

Specification:

* 2 ways to trigger this feature (non-Windows):
  * Signal: `SIGUSR2` to the supervisor.
    * Sending `SIGUSR2` to the workers triggers the traditional GracefulReload.
      * (Leave the traditional way, just in case)
  * RPC: `/api/config.gracefulReload`
* This starts the new supervisor and workers with zero downtime for some plugins.
  * Input plugins with `zero_downtime_restart` supported work in parallel.
    * Supported input plugins:
      * `in_tcp`
      * `in_udp`
      * `in_syslog`
  * The old processes stop after 10s.
* The new supervisor works in `source-only` mode (fluent#4661) until the old processes stop.
* After the old processes stop, the data handled by the new processes are loaded and processed.
  * If need, you can configure `source_only_buffer` (see fluent#4661).
* Windows: Not affected at all. Remains the traditional GracefulReload.

Mechanism:

1. The supervisor receives SIGUSR2.
2. Spawn a new supervisor.
3. Take over shared sockets.
4. Launch new workers, and stop old processes in parallel.
   * Launch new workers with source-only mode
     * Limit to zero_downtime_restart_ready? input plugin
   * Send SIGTERM to the old supervisor after 10s delay from 3.
5. The old supervisor stops and sends SIGWINCH to the new one.
6. The new workers run fully.

Note: need these feature

* fluent#4661
* treasure-data/serverengine#146

Conditions under which `zero_downtime_restart_ready?` can be enabled:

* Must be able to work in parallel with another Fluentd instance.
* Notes:
  * The sockets provided by server helper are shared with the new Fluentd instance.
  * Input plugins managing a position such as `in_tail` should not enable its `zero_downtime_restart_ready?`.
    * Such input plugins do not cause data loss on restart, so there is no need to enable this in the first place.
  * `in_http` and `in_forward` could also be supported. Not supporting them this time is simply a matter of time to consider.

The appropriateness of replacing the traditional GracefulReload:

* The traditional GracefulReload feature has some limitations and issues.
  * Limitations:
    1. A change to system_config is ignored because it needs to restart(kill/spawn) process.
    2. All plugins must not use class variable when restarting.
  * Issues:
    * fluent#2259
    * fluent#3469
    * fluent#3549
* This new feature allows restarts without downtime and such limitations.
  * Although supported plugins are limited, that is not a problem for many plugins. (The problem is with server-based input plugins where the stop results in data loss).
* This new feature has a big advantage that it can also be used to update Fluentd.
  * In the future, fluent-package will use this feature to allow update with zero downtime by default.
* If needed, we can still use the traditional feature by directly sending `SIGUSR2` to the workers.

Co-authored-by: Shizuo Fujita <[email protected]>
Signed-off-by: Daijiro Fukuda <[email protected]>



This replaces the current `SIGUSR2` (#2716) with the new feature. (Not supported on Windows).

* Restart the new process with zero downtime

The primary motivation is to enable the update of Fluentd without data loss of plugins such as `in_udp`.

Specification:

* 2 ways to trigger this feature (non-Windows):
  * Signal: `SIGUSR2` to the supervisor.
    * Sending `SIGUSR2` to the workers triggers the traditional GracefulReload.
      * (Leave the traditional way, just in case)
  * RPC: `/api/processes.zeroDowntimeRestart`
    * Leave `/api/config.gracefulReload` for the traditional feature.
* This starts the new supervisor and workers with zero downtime for some plugins.
  * Input plugins with `zero_downtime_restart` supported work in parallel.
    * Supported input plugins:
      * `in_tcp`
      * `in_udp`
      * `in_syslog`
  * The old processes stop after 10s.
* The new supervisor works in `source-only` mode (#4661) until the old processes stop.
* After the old processes stop, the data handled by the new processes are loaded and processed.
  * If need, you can configure `source_only_buffer` (see #4661).
* Windows: Not affected at all. Remains the traditional GracefulReload.

Mechanism:

1. The supervisor receives SIGUSR2.
2. Spawn a new supervisor.
3. Take over shared sockets.
4. Launch new workers, and stop old processes in parallel.
   * Launch new workers with source-only mode
     * Limit to zero_downtime_restart_ready? input plugin
   * Send SIGTERM to the old supervisor after 10s delay from 3.
5. The old supervisor stops and sends SIGWINCH to the new one.
6. The new workers run fully.

Note: need these feature

* #4661
* treasure-data/serverengine#146

Conditions under which `zero_downtime_restart_ready?` can be enabled:

* Must be able to work in parallel with another Fluentd instance.
* Notes:
  * The sockets provided by server helper are shared with the new Fluentd instance.
  * Input plugins managing a position such as `in_tail` should not enable its `zero_downtime_restart_ready?`.
    * Such input plugins do not cause data loss on restart, so there is no need to enable this in the first place.
  * `in_http` and `in_forward` could also be supported. Not supporting them this time is simply a matter of time to consider.

The appropriateness of replacing the traditional SIGUSR2:

* The traditional SIGUSR2 feature has some limitations and issues.
  * Limitations:
    1. A change to system_config is ignored because it needs to restart(kill/spawn) process.
    2. All plugins must not use class variable when restarting.
  * Issues:
    * #2259
    * #3469
    * #3549
* This new feature allows restarts without downtime and such limitations.
  * Although supported plugins are limited, that is not a problem for many plugins. (The problem is with server-based input plugins where the stop results in data loss).
* This new feature has a big advantage that it can also be used to update Fluentd.
  * In the future, fluent-package will use this feature to allow update with zero downtime by default.
* If needed, we can still use the traditional feature by RPC or directly sending `SIGUSR2` to the workers.

Signed-off-by: Daijiro Fukuda <[email protected]>
Co-authored-by: Shizuo Fujita <[email protected]>
Co-authored-by: Kentaro Hayashi <[email protected]>

kenhys closed this as completed Jul 26, 2021

kenhys reopened this Jul 27, 2021

kenhys added the bug label Jul 27, 2021

ashie mentioned this issue Oct 7, 2021

Consistent gracefulReload RPC #3415

Open

ashie self-assigned this Oct 18, 2021

daipom mentioned this issue Feb 10, 2023

system_config.workers value is wrong when validating config on launching #4051

Closed

daipom mentioned this issue Nov 25, 2024

SIGUSR2: zero downtime restart #4624

Merged

3 tasks
