Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

in_tail: Fluentd process goes unresponsive/hung and not tailing any of the log files #3882

Closed
SriramDuvvuri opened this issue Sep 1, 2022 · 16 comments

Comments

@SriramDuvvuri
Copy link

Describe the bug

As commented before in the #3614 (comment), the fluentd goes unresponsive or hung.

We tried upgrading the fluentd version from v1.11.5 to v1.15.2 on one of the server where we had a issue, but still we face the issue with the latest build.

[root@aau-lgiu06fms1 06:36:28]:/var/log/td-agent> ls -ltr
total 10024
drwxr-xr-x 2 td-agent td-agent 6 Aug 22 23:54 buffer
-rw-r--r-- 1 td-agent td-agent 59796 Aug 25 03:17 td-agent.log-20220825.gz
-rw-r----- 1 td-agent td-agent 35586 Aug 25 11:00 td-agent.log-20220826.gz
-rw-r----- 1 td-agent td-agent 129378 Aug 26 19:33 td-agent.log-20220827.gz
-rw-r----- 1 td-agent td-agent 878069 Aug 29 20:38 td-agent.log-20220830.gz
-rw-r----- 1 td-agent td-agent 53933 Aug 31 03:47 td-agent.log-20220831.gz
-rw-r--r-- 1 td-agent td-agent 306 Aug 31 15:05 t1.pos
-rw-r--r-- 1 td-agent td-agent 89 Aug 31 15:13 o1.pos
-rw-r--r-- 1 td-agent td-agent 73 Aug 31 15:14 s3.pos
-rw-r--r-- 1 td-agent td-agent 73 Aug 31 15:14 s2.pos
-rw-r--r-- 1 td-agent td-agent 96 Aug 31 15:14 b1.pos
-rw-r--r-- 1 td-agent td-agent 198 Aug 31 15:14 a1.pos
-rw-r--r-- 1 td-agent td-agent 19583 Aug 31 15:14 a2.pos
-rw-r--r-- 1 td-agent td-agent 13033 Aug 31 15:14 a3.pos
-rw-r--r-- 1 td-agent td-agent 68 Aug 31 15:14 s1.pos
-rw-r----- 1 td-agent td-agent 9027630 Aug 31 18:25 td-agent.log-20220901
-rw-r----- 1 td-agent td-agent 0 Sep 1 03:23 td-agent.log

To Reproduce

No error log message found in the logs.

Expected behavior

Logs to be pushed as usual after file rotation as fluentd

Your Environment

- Fluentd version:1.11.5/1.15.2
- TD Agent version:3.8.1/4.4.1
- Operating system:RHEL7.9
- Kernel version:3.10.0-1160.36.2.el7.x86_64

Your Configuration

<source>
@type tail
  path /home/server/xx*.log
  exclude_path ["/home/server/xxyy*.log"]
  pos_file /var/log/td-agent/xxlog.pos
  tag xx.log
  enable_stat_watcher false
  read_from_head true
  follow_inodes true
  refresh_interval 5
  rotate_wait 5
  skip_refresh_on_startup true
  pos_file_compaction_interval 30m
  <parse>
    @type multiline
    format_firstline /<20/
    format1 /<(?<logDate>.*?)>-<(?<logThread>.*?)>-<(?<level>.*?)>-<(?<logClass>.*?)>-(?<logMessage>.*)/
    time_key logDate
    time_format %Y/%m/%d-%H:%M:%S.%N
  </parse>
</source>

Your Error Log

No specific error log identified apart from the pattern not matched errors

2022-09-01 06:49:17 +0100 [info]: #0 adding forwarding server 'XXXXXXXX:24224' host="XXXXXXX" port=24224 weight=60 plugin_id="object:94c"
2022-09-01 06:49:17 +0100 [info]: adding source type="tail"
2022-09-01 06:49:17 +0100 [info]: adding source type="tail"
2022-09-01 06:49:17 +0100 [info]: adding source type="tail"
2022-09-01 06:49:17 +0100 [info]: adding source type="tail"
2022-09-01 06:49:17 +0100 [info]: adding source type="tail"
2022-09-01 06:49:17 +0100 [info]: adding source type="tail"
2022-09-01 06:49:17 +0100 [info]: adding source type="tail"
2022-09-01 06:49:17 +0100 [info]: adding source type="tail"
2022-09-01 06:49:17 +0100 [info]: adding source type="tail"
2022-09-01 06:49:17 +0100 [info]: #0 starting fluentd worker pid=10319 ppid=10316 worker=0
2022-09-01 06:49:17 +0100 [info]: #0 delayed_commit_timeout is overwritten by ack_response_timeout
2022-09-01 06:49:17 +0100 [info]: #0 following tail of /usr/ops/e1.log
2022-09-01 06:49:17 +0100 [warn]: #0 pattern not matched: "2022-08-31 15:15:01 [/usr/ops] /usr/ops/e1.log  {}  -- "
2022-09-01 06:49:17 +0100 [warn]: #0 pattern not matched: "2022-08-31 15:15:01 [/usr/ops] :: total        used        free      shared  buff/cache   available"

Additional context

We are seeing this issue on both 1.11.5 and 1.15.2 versions.

@SriramDuvvuri
Copy link
Author

@ashie request to please let me know if any workaround or remediations available for this unresponsive nature of td-agent issue.

Thanks in advance.

@ashie
Copy link
Member

ashie commented Sep 9, 2022

Could you try to get stack trace of Fluentd's worker process by sending CONT signal?

@ashie
Copy link
Member

ashie commented Sep 9, 2022

$ kill -CONT <pid of fluentd worker>
$ cat /tmp/sigdump-<pid of fluentd worker>.log

@SriramDuvvuri
Copy link
Author

@ashie Thanks for the response. here are the extracted sigdmp files from two instances. Please verify and let me know if you find any issues
sigdmp1.txt
sigdmp2.txt

Please let me know if any information needed for further investigation.

@SriramDuvvuri
Copy link
Author

@ashie did you get any chance to go thorough the above sigdmp files? Please let me know if you need any further details here to identify the issue.

@ashie
Copy link
Member

ashie commented Sep 14, 2022

It seems that your log is supervisor's one, not worker's one.
Fluentd launches 1 parent process (supervisor) and 1 child process (worker) by default.

[Fluentd Supervisor]
            |
            +-- [Fluentd Worker]

We need woker's log.
If you don't know which process is worker, please get logs for all fluentd processes in your system.

@SriramDuvvuri
Copy link
Author

sigdump-L2G.log
sigdump-N3S.log
@ashie Please find the attached sigdmps of worker nodes. Please verify and let me know if any information is needed.

@SriramDuvvuri
Copy link
Author

SriramDuvvuri commented Sep 19, 2022

sigdump-L2G.log sigdump-N3S.log @ashie Please find the attached sigdmps of worker nodes. Please verify and let me know if any information is needed.

@ashie Can you verify these files and let me know if you find any trace of issue in these files?

Thanks in advance.

@ashie
Copy link
Member

ashie commented Sep 20, 2022

It seems that fluentd process isn't stuck, probably it's just waiting next line.
Now I've noticed that you are using multiline parser.
In this case, in_tail won't emit event until next <20 line is arrived.
How about using multiline_flush_interval?

@SriramDuvvuri
Copy link
Author

It seems that fluentd process isn't stuck, probably it's just waiting next line. Now I've noticed that you are using multiline parser. In this case, in_tail won't emit event until next <20 line is arrived. How about using multiline_flush_interval?

@ashie I tried placing the multiline_flush_interval 5s, but still unresponsiveness of td-agent.log continues as shown in these snippets:

image

The log files shows this one line:

image

Do I need to modify any thing in the output config as well?

<match *.log>
@type forward
require_ack_response true
expire_dns_cache 300
dns_round_robin true
transport tls
tls_cert_path /etc/td-agent/ca-bundle.crt
keepalive true
keepalive_timeout 300s
#include_timestamp true

host XXXXXXXXXXX
port XXXXX


flush_interval 5s

@SriramDuvvuri
Copy link
Author

It seems that fluentd process isn't stuck, probably it's just waiting next line. Now I've noticed that you are using multiline parser. In this case, in_tail won't emit event until next <20 line is arrived. How about using multiline_flush_interval?

@ashie I tried placing the multiline_flush_interval 5s, but still unresponsiveness of td-agent.log continues as shown in these snippets:

image

The log files shows this one line:

image

Do I need to modify any thing in the output config as well?

<match *.log> @type forward require_ack_response true expire_dns_cache 300 dns_round_robin true transport tls tls_cert_path /etc/td-agent/ca-bundle.crt keepalive true keepalive_timeout 300s #include_timestamp true host XXXXXXXXXXX port XXXXX flush_interval 5s

@ashie any further I can try to see if that can resolve the issue.

@SriramDuvvuri
Copy link
Author

It seems that fluentd process isn't stuck, probably it's just waiting next line. Now I've noticed that you are using multiline parser. In this case, in_tail won't emit event until next <20 line is arrived. How about using multiline_flush_interval?

@ashie I tried placing the multiline_flush_interval 5s, but still unresponsiveness of td-agent.log continues as shown in these snippets:
image
The log files shows this one line:
image
Do I need to modify any thing in the output config as well?
<match *.log> @type forward require_ack_response true expire_dns_cache 300 dns_round_robin true transport tls tls_cert_path /etc/td-agent/ca-bundle.crt keepalive true keepalive_timeout 300s #include_timestamp true host XXXXXXXXXXX port XXXXX flush_interval 5s

@ashie any further I can try to see if that can resolve the issue.

@ashie Any update on this to mitigate the issue?

@github-actions
Copy link

This issue has been automatically marked as stale because it has been open 30 days with no activity. Remove stale label or comment or this issue will be closed in 7 days

@github-actions github-actions bot added the stale label Apr 11, 2023
@daipom daipom changed the title Fluentd process goes unresponsive/hung and not tailing any of the log files in_tail: Fluentd process goes unresponsive/hung and not tailing any of the log files Apr 11, 2023
@SriramDuvvuri
Copy link
Author

@ashie can we get any update on this.. is there any fix from the product end.. we still face this issue and seem the log rotate file issue..

@daipom
Copy link
Contributor

daipom commented Jan 9, 2024

Do you mean that in_tail does not collect the contents of the file at all?
Restarting Fluentd does not recover?
If so, #3614 may not be related.
It may be possible that the config is wrong or that the file format is not supported.
Could you give us the sample logs?

@daipom
Copy link
Contributor

daipom commented Apr 10, 2024

It's been a while, so I close this.
If this issue still occurs, please reopen this.

@daipom daipom closed this as completed Apr 10, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants