Problem
We are using td-agent v3.7.1, which ships fluentd v1.10.2 and fluent-plugin-elasticsearch v4.0.7.
We have a 3-node local Elasticsearch cluster. After td-agent starts up, everything works for around 18-20 hours, after which fluentd starts to fail with the following error:
2020-05-26 17:37:17 +0000 [warn]: #0 failed to flush the buffer. retry_time=16 next_retry_seconds=2020-05-29 17:18:06 +0000 chunk="5a6904a4700dc751015bf6f7fb2e0bc1" error_class=Fluent::Plugin::ElasticsearchOutput::RecoverableRequestFailure error="could not push logs to Elasticsearch cluster ({:host=>\"elasticsearch.mydomain.io\", :port=>9200, :scheme=>\"https\", :user=>\"my_user\", :password=>\"obfuscated\"}): hostname \"10.0.0.1\" does not match the server certificate (OpenSSL::SSL::SSLError)"
2020-05-26 17:37:17 +0000 [warn]: #0 failed to flush the buffer. retry_time=17 next_retry_seconds=2020-05-27 02:35:12 +0000 chunk="5a690483041ba29bda96202b35491072" error_class=Fluent::Plugin::ElasticsearchOutput::RecoverableRequestFailure error="could not push logs to Elasticsearch cluster ({:host=>\"elasticsearch.mydomain.io\", :port=>9200, :scheme=>\"https\", :user=>\"my_user\", :password=>\"obfuscated\"}): hostname \"10.0.0.2\" does not match the server certificate (OpenSSL::SSL::SSLError)"
2020-05-26 17:37:17 +0000 [warn]: #0 failed to flush the buffer. retry_time=18 next_retry_seconds=2020-05-27 11:17:37 +0000 chunk="5a69048c8e1d158c8826c73a15f903b0" error_class=Fluent::Plugin::ElasticsearchOutput::RecoverableRequestFailure error="could not push logs to Elasticsearch cluster ({:host=>\"elasticsearch.mydomain.io\", :port=>9200, :scheme=>\"https\", :user=>\"my_user\", :password=>\"obfuscated\"}): hostname \"10.0.0.3\" does not match the server certificate (OpenSSL::SSL::SSLError)"
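The certificate is evidently being checked against the individual node IPs (10.0.0.1-3) rather than against the configured hostname elasticsearch.mydomain.io. If that switch to bare IPs happens because the plugin periodically reloads/sniffs cluster connections (an assumption on our part, not something the logs confirm), the plugin's connection-reload settings could keep it pinned to the configured host while leaving certificate verification on. A minimal sketch of the relevant output settings, not a verified fix:

<store>
  @type elasticsearch
  host elasticsearch.mydomain.io
  port 9200
  scheme https
  ssl_version TLSv1_2
  # Assumption: disabling the periodic connection reload (and reload on
  # failure) keeps requests going to the configured hostname, so TLS
  # verification keeps matching the DNS name in the certificate.
  reload_connections false
  reload_on_failure false
  # Optionally pin verification to the CA that signed the cluster
  # certificate (hypothetical path).
  # ca_file C:/opt/td-agent/certs/ca.pem
</store>

The plugin also has a sniffer_class_name parameter for customizing how cluster hosts are discovered, which may be relevant if connection reloading cannot simply be turned off.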
Steps to replicate
Leave td-agent running with the configuration below for long enough (roughly 18-20 hours in our case).
<source>
  @type tail
  @id in_tail_my_log_log
  path "D:/Logs/my-logs-*.log"
  pos_file "C:/opt/td-agent/etc/td-agent/pos_files/logs.pos"
  tag "my-logs.*"
  enable_watch_timer false
  enable_stat_watcher true
  read_from_head true
  <parse>
    @type "json"
  </parse>
</source>

<filter **>
  @type elasticsearch_genid
  hash_id_key _hash
</filter>

<match my-logs.**>
  @type copy
  <store>
    id_key _hash
    remove_keys _hash
    @type "elasticsearch"
    @log_level debug
    host "elasticsearch.mydomain.io"
    port 9200
    scheme https
    ssl_version TLSv1_2
    logstash_format true
    logstash_prefix "my-logs"
    logstash_dateformat "%Y.%m"
    include_tag_key true
    user "fluentd_user"
    password "XXXXX"
    type_name "_doc"
    tag_key "@log_name"
    <buffer>
      flush_thread_count 8
      flush_interval 5s
    </buffer>
  </store>
</match>
Expected Behavior or What you need to ask
Fluentd should keep shipping logs indefinitely, with no need to restart it.
Using Fluentd and ES plugin versions
- Fluentd or td-agent version: td-agent 3.7.1 (failing); td-agent 3.1.1 works, see Additional context
- Operating system: Windows Server 2016
- Elasticsearch version: 6.2.0
Additional context
What is interesting is that logs are shipped consistently and then suddenly stop. Also worth noting: we have 3 separate servers each shipping logs to the same Elasticsearch cluster, and all 3 eventually fail, at roughly the same time, with exactly the same error.
Restarting the fluentd service clears the issue, but any logs still in the buffer are lost and have to be recovered manually.
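One way to avoid that manual recovery (a sketch only, not something we have verified on this setup) would be a file-backed buffer in the store section in place of the in-memory one above, so queued chunks survive a service restart:

<buffer>
  @type file
  # Hypothetical buffer directory on the Windows host; adjust as needed.
  path "C:/opt/td-agent/buffer/my-logs"
  flush_thread_count 8
  flush_interval 5s
  # Try to flush remaining chunks when the service stops; anything left
  # over is re-read from disk on the next start.
  flush_at_shutdown true
</buffer>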
Another point: a much older version, td-agent v3.1.1 (fluentd v1.0.2 and fluent-plugin-elasticsearch v2.4.0), does not show the problem; we have been running it for over a week with no issues.
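To narrow this down further, one option (a sketch; we have not captured this output yet) is to enable the transporter log on the elasticsearch output, which logs the underlying transport activity and should show when the client starts sending requests to node IPs instead of the configured hostname:

<store>
  @type elasticsearch
  @log_level debug
  host elasticsearch.mydomain.io
  port 9200
  scheme https
  # Log the underlying transport-layer activity (verbose; debugging only).
  with_transporter_log true
</store>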