Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

secondary mechanism broken #1449

Closed
sampointer opened this issue Feb 2, 2017 · 14 comments
Closed

secondary mechanism broken #1449

sampointer opened this issue Feb 2, 2017 · 14 comments
Labels
bug Something isn't working v0.14

Comments

@sampointer
Copy link

sampointer commented Feb 2, 2017

  • fluentd 0.14.11
  • Ubuntu 14.04 LTS on EC2

In the course of attempting to debug this plugin issue: uken/fluent-plugin-elasticsearch#235 I've been unable to get fluentd to fail over to the secondary using only the built-in output plugins.

The configuration I have had working under 0.12.x and that fails under 0.14.11 is below. This fluentd sits on the edge of a logging service and receives logs from host clients:

<source>
  @type tail
  path /var/log/syslog
  pos_file /var/spool/td-agent/syslog.pos
  tag locallogs
  format /^(?<time>[^\s]+) (?<facility>\w+)\.(?<priority>\w+) (?<hostname>[^\s]+) (?<process>[^:]+): *(?<message>.*)$/

  # Support ms time resolution
  time_format %Y-%m-%dT%H:%M:%S.%L%z
  keep_time_key true
</source>

# This source receives logs from remote client fluentd installations:
# the primary manner in which logs are ingested into the service
<source>
  @type secure_forward
  self_hostname xxx.xxx.com
  shared_key xxx

  secure true
  cert_path /etc/td-agent/ssl_cert.pem
  private_key_path /etc/td-agent/ssl_key.pem
  private_key_passphrase

  # Support ms time resolution
  time_as_integer false
</source>

# This source is present so that it can be used to
# ingest the files from /var/spool/td-agent/overflow
# http://docs.fluentd.org/articles/in_forward
<source>
  @type forward
  port 24224
  bind 127.0.0.1
</source>

# Filters
<filter locallogs>
  @type record_transformer
  <record>
    statsd_type increment
    statsd_key syslog.${record["facility"]}.${record["priority"]}
  </record>
  <record>
    ident_product xxx
    ident_environment xxx
    ident_role fluentd-server
    ident_hostname xxx
    host_ident i-1234567890
  </record>
</filter>

# Matched outputs
<match **>
  type copy

  <store>
    @type statsd
    host localhost
    port 8125
    flush_interval 1s
  </store>

  <store>
    @type "aws-elasticsearch-service"
    logstash_format true
    include_tag_key true
    tag_key "@log_name"

    # Implement this fix:
    # https://discuss.elastic.co/t/elasitcsearch-ruby-raises-cannot-get-new-connection-from-pool-error/36252/9
    reload_connections false

    # We'd like to gain the .raw fields mapping, like logstash:
    # http://stackoverflow.com/questions/33979226/where-do-raw-fields-come-from-when-using-logstash-with-elasticsearch-output
    #
    # This is so that we can do visualizations on non-analyzed fields.
    #
    # Further reading:
    #
    # https://github.com/uken/fluent-plugin-elasticsearch/pull/194
    # https://github.com/uken/fluent-plugin-elasticsearch/issues/33
    template_name "logstash"
    template_file "/etc/td-agent/elasticsearch-template-es2x.json"

    # Implement this fix:
    # https://github.com/fluent/fluentd/issues/607
    request_timeout 10

    # Support ms time resolution
    time_key 'time'
    time_as_integer false

    # AWS Elasticsearch has a 10M maximum request size.
    # Fluentd serializes messages sent to the Elasticsearch
    # output plugin with msgpack.
    #
    # A small subset of benchmarked messages were shown to
    # compress to between 85 and 66% of their original size.
    #
    # Therefore the buffer_chunk_limit below assumes that
    # msgpack will never be more than 50% efficient.
    buffer_type "file"
    buffer_path "/var/spool/td-agent/"
    buffer_chunk_limit 5m    # AWS ES Service max request size is 10m
    buffer_queue_limit 17800 # 5m chunk with a 90G spool volume, plus headroom
    flush_at_shutdown true
    flush_interval 5s
    retry_limit 2            # Write to secondary quickly

    <endpoint>
      url https://xxx.eu-west-1.es.amazonaws.com
      region eu-west-1
    </endpoint>

    <secondary>
      @type secondary_file
      directory /var/spool/td-agent/overflow/
      basename failed.log
    </secondary>

  </store>
</match>

Under 0.12.x all time_* options are not present (the primary reason for the upgrade is to gain ms time resolution). Furthermore the <secondary> under 0.12.x was plainly @type file (with associated configuration), however fluentd refuses to start in this configuration for us unless we amend this to secondary_file as shown.

In this configuration a failure of the elasticsearch endpoint, which previously caused logs to spool to the secondary, drops chunks on the floor once the various retry thresholds have been breached.

In order to eliminate 0.12.x plugins as being the problem I amended the <match **> configuration to be:

<match **>
  @type "forward"
  send_timeout 60s
  recover_wait 10s
  heartbeat_interval 1s
  phi_threshold 16
  hard_timeout 60s

    <server>
      name localhost
      host localhost
      port 1234
      weight 60
    </server>

    <secondary>
      @type secondary_file
      directory /var/spool/td-agent/overflow/
      basename failed.log
    </secondary>

</match>

I then ran netcat, correctly bound (nc -lk 1234) and observed msgpack output on netcat's stdout. Killing netcat started the retry procedure (as expected). Once that is exhausted I would have expected a failover to the secondary (as happens under 0.12.x). This does not happen

My only conclusion is that under 0.14.11 the secondary mechanism does not work, given that all plugins involved in this test are 0.14.x native and shipped with the gem.

Furthermore, I see that the documentation on the secondary mechanism has been removed from the 0.14.x documentation: http://docs.fluentd.org/v0.14/articles/buffer-plugin-overview

Without support for ms timestamp resolution fluentd is basically useless to us, which is a great shame given the investment we've already given it.

@repeatedly
Copy link
Member

Thanks for the report.
I will work on this issue today.

@sampointer
Copy link
Author

Let me know if I can help with any testing

repeatedly added a commit that referenced this issue Feb 10, 2017
Secondary calculation should consider retry_max_times. fix #1449
@repeatedly
Copy link
Member

master branch should fix this problem.
Could you retry your configuration with master branch?

I will release v0.14.13 soon after fixed several bugs.

@sampointer
Copy link
Author

Hi,

I'm having a hard time testing this. For multiple plugins the following is output and fluentd refuses to start:

2017-02-13 16:40:21 +0000 [error]: #0 unexpected error error_class=NameError error="uninitialized constant Fluent::Input"
  2017-02-13 16:40:21 +0000 [error]: #0 /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluent-plugin-cloudwatch-logs-0.3.3/lib/fluent/plugin/in_cloudwatch_logs.rb:4:in `<module:Fluent>'
  2017-02-13 16:40:21 +0000 [error]: #0 /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluent-plugin-cloudwatch-logs-0.3.3/lib/fluent/plugin/in_cloudwatch_logs.rb:1:in `<top (required)>'
  2017-02-13 16:40:21 +0000 [error]: #0 /opt/td-agent/embedded/lib/ruby/site_ruby/2.1.0/rubygems/core_ext/kernel_require.rb:54:in `require'
  2017-02-13 16:40:21 +0000 [error]: #0 /opt/td-agent/embedded/lib/ruby/site_ruby/2.1.0/rubygems/core_ext/kernel_require.rb:54:in `require'
  2017-02-13 16:40:21 +0000 [error]: #0 /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.14.12/lib/fluent/registry.rb:102:in `block in search'
  2017-02-13 16:40:21 +0000 [error]: #0 /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.14.12/lib/fluent/registry.rb:99:in `each'
  2017-02-13 16:40:21 +0000 [error]: #0 /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.14.12/lib/fluent/registry.rb:99:in `search'
  2017-02-13 16:40:21 +0000 [error]: #0 /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.14.12/lib/fluent/registry.rb:44:in `lookup'
  2017-02-13 16:40:21 +0000 [error]: #0 /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.14.12/lib/fluent/plugin.rb:146:in `new_impl'
  2017-02-13 16:40:21 +0000 [error]: #0 /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.14.12/lib/fluent/plugin.rb:100:in `new_input'
  2017-02-13 16:40:21 +0000 [error]: #0 /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.14.12/lib/fluent/root_agent.rb:236:in `add_source'
  2017-02-13 16:40:21 +0000 [error]: #0 /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.14.12/lib/fluent/root_agent.rb:95:in `block in configure'
  2017-02-13 16:40:21 +0000 [error]: #0 /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.14.12/lib/fluent/root_agent.rb:92:in `each'
  2017-02-13 16:40:21 +0000 [error]: #0 /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.14.12/lib/fluent/root_agent.rb:92:in `configure'
  2017-02-13 16:40:21 +0000 [error]: #0 /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.14.12/lib/fluent/engine.rb:117:in `configure'
  2017-02-13 16:40:21 +0000 [error]: #0 /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.14.12/lib/fluent/engine.rb:91:in `run_configure'
  2017-02-13 16:40:21 +0000 [error]: #0 /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.14.12/lib/fluent/supervisor.rb:743:in `run_configure'
  2017-02-13 16:40:21 +0000 [error]: #0 /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.14.12/lib/fluent/supervisor.rb:494:in `block in run_worker'
  2017-02-13 16:40:21 +0000 [error]: #0 /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.14.12/lib/fluent/supervisor.rb:671:in `call'
  2017-02-13 16:40:21 +0000 [error]: #0 /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.14.12/lib/fluent/supervisor.rb:671:in `main_process'
  2017-02-13 16:40:21 +0000 [error]: #0 /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.14.12/lib/fluent/supervisor.rb:490:in `run_worker'
  2017-02-13 16:40:21 +0000 [error]: #0 /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.14.12/lib/fluent/command/fluentd.rb:300:in `<top (required)>'
  2017-02-13 16:40:21 +0000 [error]: #0 /opt/td-agent/embedded/lib/ruby/site_ruby/2.1.0/rubygems/core_ext/kernel_require.rb:69:in `require'
  2017-02-13 16:40:21 +0000 [error]: #0 /opt/td-agent/embedded/lib/ruby/site_ruby/2.1.0/rubygems/core_ext/kernel_require.rb:69:in `require'
  2017-02-13 16:40:21 +0000 [error]: #0 /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.14.12/bin/fluentd:5:in `<top (required)>'
  2017-02-13 16:40:21 +0000 [error]: #0 /opt/td-agent/embedded/bin/fluentd:23:in `load'
  2017-02-13 16:40:21 +0000 [error]: #0 /opt/td-agent/embedded/bin/fluentd:23:in `<top (required)>'
  2017-02-13 16:40:21 +0000 [error]: #0 /usr/sbin/td-agent:7:in `load'
  2017-02-13 16:40:21 +0000 [error]: #0 /usr/sbin/td-agent:7:in `<main>'

I notice there are no "compat" classes in that stack trace to preserve the old internal 0.12.x API for plugins.

This is a retrograde step. As I've only finally today managed to get some quite trivial changes merged into a 3rd party plugin after 6 months of polite attempts I don't hold any hope that I'll be able to compel plugin maintainers to do the right thing without this layer in my present setup.

@repeatedly
Copy link
Member

repeatedly commented Feb 13, 2017

The problem is plugin itself. See this thread: https://groups.google.com/d/msg/fluentd/lrW-gRGda2c/wTinVOzIEAAJ

@sampointer
Copy link
Author

Both plugins worked on 0.14.11 without change. They have broken with master.

From memory, when I forced stack traces out of them there was a compatibility class layer being invoked, which is missing from the stack traces above.

What I am implying is that there has been an interface change on a point version within fluentd.

I understand the Fluentd::Input to Fluent:Plugin:Input hierarchy change in 0.14.x for canonical compatibility, but up until this release there was a shim layer that allowed plugins written for the 0.12.x interface to continue working.

Most of the value from fluentd comes from the 3rd-party ecosystem, and most of the plugins (from my experience) have not and will not change their interface any time soon.

@repeatedly
Copy link
Member

repeatedly commented Feb 13, 2017

Both plugins worked on 0.14.11 without change.

Really? That's weird...
Let me check.

@sampointer
Copy link
Author

sampointer commented Feb 13, 2017

require 'fluent/compat/input'

module Fluent
  Input = Fluent::Compat::Input
end

Should be being used (fluentd/lib/fluent/input.rb), but isn't

@repeatedly
Copy link
Member

repeatedly commented Feb 13, 2017

Yeah. v0.12 plugin should write require 'fluent/input' in the top of code.
This mentions in above mailing list.
In v0.14, plugin uses fluent/plugin/input instead of fluent/input

@sampointer
Copy link
Author

sampointer commented Feb 13, 2017

Right! So the 0.12.x class interface is still there, but the plugins are relying on it being implicitly loaded and it no longer is?

That's something that has changed since 0.14.11.

What I'm suggesting is that, whilst it quite obviously is a bug with the plugin(s), my experience over the last 6 months of attempting to get even trivial changes merged and released by 3rd-party authors is not good.

Whilst they should fix their plugins, practicality might dictate that this needs to be fixed in fluentd with a deprecation warning, a dummy plugin that can be loaded to account for slow-moving plugins or some other mechanism.

I'm happy to help with the development for any effort in that regard.

EDIT: what I also don't understand is that lib/fluent/load.rb still includes require 'fluent/input'. As I understand things, the plugins should not need to require it themselves ..?

@repeatedly
Copy link
Member

the plugins are relying on it being implicitly loaded and it no longer is?

In v0.12 or earlier, fluentd loads all files under fluentd so the plugin works without explicit requires.
This is bad practice. Loading unnecessary files consumes more memory and makes startup slow.
To improve this problem, v0.14 requires only required files in each file.
So this is why uninitialized constant happens.

@repeatedly
Copy link
Member

@sampointer
Copy link
Author

I was preparing the same patch myself :)

@sampointer
Copy link
Author

Success!

For my particular setup fluent-plugins-nursery/fluent-plugin-cloudwatch-logs#67 is also required.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working v0.14
Projects
None yet
Development

No branches or pull requests

2 participants