Describe the bug
We're using fluentd in k8s to tail all container logs and send them to a MongoDB database. Generally, it works as expected.
After running for a long time, fluentd randomly stops tailing a file, so we lost important log data for more than a day. Other fluentd instances (on other nodes) worked as expected, and the affected node worked as expected for everything except that one tailed file. This is not the first time it has happened (we last noticed it on May 6th).
The tailed file is the log file of a pod that produces a huge amount of logs. Before the issue, the pod had been running for 3-4 days straight without any logging problems. The last log that came through from that file was at "2022-06-25T19:17:35.751Z". Before that moment, every rotation of the file produced a pair of messages: detected rotation + following tail. After it, every rotation logged only: detected rotation.
Before (the last such pair; every earlier rotation looked the same):
2022-06-25T19:13:05.741980613Z stdout F 2022-06-25 19:13:05 +0000 [info]: #0 [in_tail_container_logs] detected rotation of /var/log/containers/api-6d756f4555-jt7zz_default_api-c7e8458c243f365986db0861cb3f73de09d1d4b61a045f52276465c6c3559290.log; waiting 5 seconds
2022-06-25T19:13:05.743587912Z stdout F 2022-06-25 19:13:05 +0000 [info]: #0 [in_tail_container_logs] following tail of /var/log/containers/api-6d756f4555-jt7zz_default_api-c7e8458c243f365986db0861cb3f73de09d1d4b61a045f52276465c6c3559290.log
After (the first occurrence; every later rotation looked the same):
2022-06-25T19:17:36.744586203Z stdout F 2022-06-25 19:17:36 +0000 [info]: #0 [in_tail_container_logs] detected rotation of /var/log/containers/api-6d756f4555-jt7zz_default_api-c7e8458c243f365986db0861cb3f73de09d1d4b61a045f52276465c6c3559290.log; waiting 5 seconds
Note that the file at /var/log/containers/api-6d756f4555-jt7zz_default_api-c7e8458c243f365986db0861cb3f73de09d1d4b61a045f52276465c6c3559290.log looked healthy and was constantly receiving logs (it was rotated about once every 12 minutes). We mitigated by restarting that api pod - that changed the pod name and therefore the log file name, and fluentd followed the new file fine afterwards. The problem lasted from that point in time until the mitigation, and only for that file. There was no need to restart fluentd.
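For reference, the "waiting 5 seconds" in those messages comes from in_tail's rotate_wait parameter, which we leave at its default. A minimal sketch of the same source with it set explicitly (illustration only; the full config is under "Your Configuration" below):

<source>
  @type tail
  path /var/log/containers/*.log
  pos_file /var/log/fluentd.log.pos
  tag kubernetes.*
  # rotate_wait: how long in_tail keeps reading the rotated-out file after
  # "detected rotation"; the default is 5 seconds, hence "waiting 5 seconds".
  rotate_wait 5
  <parse>
    @type cri
  </parse>
</source>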
To Reproduce
Can't repro.
Expected behavior
Not to stop tailing.
Your Configuration
Cannot repro, so I'm putting the whole config in (the db config is replaced at deploy time).
# Inputs from container logs
<source>
  @type tail
  @id in_tail_container_logs
  path /var/log/containers/*.log
  pos_file /var/log/fluentd.log.pos
  read_from_head true
  tag kubernetes.*
  <parse>
    @type cri
  </parse>
</source>

# Merge logs split into multiple lines
<filter kubernetes.**>
  @type concat
  key message
  use_partial_cri_logtag true
  partial_cri_logtag_key logtag
  partial_cri_stream_key stream
  separator ""
</filter>

# Enriches records with Kubernetes metadata
<filter kubernetes.**>
  @type kubernetes_metadata
</filter>

# Prettify kubernetes metadata
<filter kubernetes.**>
  @type record_transformer
  enable_ruby true
  <record>
    nodeName ${record.dig("kubernetes", "host")}
    namespaceName ${record.dig("kubernetes", "namespace_name")}
    podName ${record.dig("kubernetes", "pod_name")}
    containerName ${record.dig("kubernetes", "container_name")}
    containerImage ${record.dig("kubernetes", "container_image")}
  </record>
  remove_keys docker,kubernetes
</filter>

# Expands inner json
<filter kubernetes.**>
  @type parser
  format json
  key_name message
  reserve_data true
  remove_key_name_field true
  emit_invalid_record_to_error false
  time_format %Y-%m-%dT%H:%M:%S.%NZ
  time_key time
  keep_time_key true
</filter>

# Mongodb keys should not have dollar or a dot inside
<filter kubernetes.**>
  @type rename_key
  replace_rule1 \$ [dollar]
</filter>
<filter kubernetes.**>
  @type rename_key
  replace_rule1 \. [dot]
</filter>

# Outputs to log db
<match kubernetes.**>
  @type mongo
  connection_string "#{ENV['MONGO_ANALYTICS_DB_HOST']}"
  collection logs
  <buffer>
    flush_thread_count 8
    flush_interval 1s
  </buffer>
</match>
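Beyond the config above, a hedged sketch of in_tail settings that are sometimes suggested for wildcard paths with frequent rotation (follow_inodes needs fluentd v1.12+; we have not verified that any of these prevent the stall, they are assumptions to experiment with):

<source>
  @type tail
  @id in_tail_container_logs
  path /var/log/containers/*.log
  pos_file /var/log/fluentd.log.pos
  read_from_head true
  tag kubernetes.*
  # Track files by inode so a rotated-in file is distinguished from the old one.
  follow_inodes true
  # Periodically clean stale entries out of the pos file.
  pos_file_compaction_interval 72h
  # Disable the inotify-based watcher and rely on timer-based polling,
  # in case inotify events are being missed after rotation.
  enable_stat_watcher false
  <parse>
    @type cri
  </parse>
</source>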
Your Error Log
Not sure what you mean by "all" - we have a ton of logs, most of them lost. The best logs I could find are quoted in the 'Describe the bug' section above; ask if you need anything specific.
Additional context
Running the docker image fluent/fluentd:v1.13-1 on Kubernetes 1.21 on DigitalOcean, deployed as a DaemonSet.
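To gather more evidence the next time this happens, one option (an assumption on our side, not something we run today) is to raise the log level of just the tail input so file watching and rotation handling may be logged in more detail:

<source>
  @type tail
  @id in_tail_container_logs
  # @log_level is the standard per-plugin override of fluentd's global log level.
  @log_level debug
  path /var/log/containers/*.log
  pos_file /var/log/fluentd.log.pos
  read_from_head true
  tag kubernetes.*
  <parse>
    @type cri
  </parse>
</source>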
Ping - any advice on this one? It happens unexpectedly from time to time and is very painful: it costs us a lot of logs and messes up our analytics...