Skip to content

Task/Processes entering hung state/livelock in logstash 1.5.0-1 servers,resulting in data loss. #3727

@suraj-kamath

Description

@suraj-kamath

Hi ,
We observed certain processes getting into the hung state intermittently in one of our production servers running on LS 1.5.0-1 .
Redis version : redis-2.8.3-1.el6.remi.x86_64.
Redis gem : logstash-input-redis-0.1.5,logstash-output-redis-0.1.4.

The issue was observed when there was a data loss over 6 hours , upon deeper analysis this was found from the dmesg output(below)

Logs from dmesg:

[130080.754158] INFO: task jbd2/vda1-8:404 blocked for more than 120 seconds.
[130080.760642] Not tainted 2.6.32-504.3.3.el6.x86_64 #1
[130080.763900] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[130080.768831] jbd2/vda1-8 D 0000000000000000 0 404 2 0x00000000
[130080.772753] ffff8804347f7c20 0000000000000046 ffff8804347f7bb0 ffffffff81041e98
[130080.775250] 00000000ffffffff 00014aef7c132530 0000000000190490 ffff880433d83ad0
[130080.777580] 000000000488db84 ffffffffaa3742d6 ffff880433cda5f8 ffff8804347f7fd8
[130080.779830] Call Trace:
[130080.780769] [] ? pvclock_clocksource_read+0x58/0xd0
[130080.782369] [] ? ktime_get_ts+0xb1/0xf0
[130080.783821] [] ? sync_buffer+0x0/0x50
[130080.785258] [] io_schedule+0x73/0xc0
[130080.786579] [] sync_buffer+0x40/0x50
[130080.787920] [] __wait_on_bit+0x5f/0x90
[130080.789402] [] ? sync_buffer+0x0/0x50
[130080.790815] [] out_of_line_wait_on_bit+0x78/0x90
[130080.792304] [] ? wake_bit_function+0x0/0x50
[130080.793723] [] __wait_on_buffer+0x26/0x30
[130080.795212] [] jbd2_journal_commit_transaction+0x1181/0x1500 [jbd2]
[130080.797359] [] ? autoremove_wake_function+0x0/0x40
[130080.798860] [] kjournald2+0xb8/0x220 [jbd2]
[130080.800291] [] ? autoremove_wake_function+0x0/0x40
[130080.801929] [] ? kjournald2+0x0/0x220 [jbd2]
[130080.803370] [] kthread+0x9e/0xc0
[130080.804686] [] child_rip+0xa/0x20
[130080.805980] [] ? kthread+0x0/0xc0
[130080.807384] [] ? child_rip+0x0/0x20
[130080.808815] INFO: task monit:7737 blocked for more than 120 seconds.
[130080.810374] Not tainted 2.6.32-504.3.3.el6.x86_64 #1
[130080.811792] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[130080.814006] monit D 0000000000000002 0 7737 1 0x00000080
[130080.815728] ffff88043681ddc8 0000000000000082 ffff88043681dd90 ffff88043681dd8c
[130080.818115] 000000000000000e ffff88043fc24500 00007628f0511016 ffff8800282158c0
[130080.820557] 0000000000000400 0000000107b9cf25 ffff880436800638 ffff88043681dfd8
[130080.822826] Call Trace:
[130080.823715] [] jbd2_log_wait_commit+0xc5/0x140 [jbd2]
[130080.825291] [] ? autoremove_wake_function+0x0/0x40
[130080.826805] [] jbd2_complete_transaction+0x68/0xb0 [jbd2]
[130080.828434] [] ext4_sync_file+0x121/0x1d0 [ext4]
[130080.829960] [] vfs_fsync_range+0xa1/0x100
[130080.831305] [] vfs_fsync+0x1d/0x20
[130080.832593] [] do_fsync+0x3e/0x60
[130080.833877] [] sys_fsync+0x10/0x20
[130080.835203] [] system_call_fastpath+0x16/0x1b
[130080.836679] INFO: task redis-server:16342 blocked for more than 120 seconds.
[130080.838347] Not tainted 2.6.32-504.3.3.el6.x86_64 #1
[130080.839740] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[130080.841842] redis-server D 0000000000000000 0 16342 7669 0x00000080
[130080.843495] ffff8802ff523dc8 0000000000000086 ffff8802ff523e70 0000000000000001
[130080.845683] 000000000000000e ffff8802ff523de8 ffff8802ff523d88 ffffffff81123d80
[130080.847905] ffff880418442220 ffff8802ff523de8 ffff880438b9e5f8 ffff8802ff523fd8
[130080.850120] Call Trace:
[130080.850987] [] ? find_get_pages_tag+0x40/0x130
[130080.852500] [] ? prepare_to_wait+0x4e/0x80
[130080.853907] [] jbd2_log_wait_commit+0xc5/0x140 [jbd2]
[130080.855517] [] ? autoremove_wake_function+0x0/0x40
[130080.857042] [] jbd2_complete_transaction+0x68/0xb0 [jbd2]
[130080.858701] [] ext4_sync_file+0x121/0x1d0 [ext4]
[130080.860364] [] vfs_fsync_range+0xa1/0x100
[130080.861914] [] vfs_fsync+0x1d/0x20
[130080.863271] [] do_fsync+0x3e/0x60
[130080.864585] [] sys_fsync+0x10/0x20
[130080.865924] [] system_call_fastpath+0x16/0x1b
[130440.867121] INFO: task auditd:976 blocked for more than 120 seconds.
[130440.887808] Not tainted 2.6.32-504.3.3.el6.x86_64 #1
[130440.889428] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[130440.891858] auditd D 0000000000000002 0 976 1 0x00000000
[130440.893890] ffff8804345c5dc8 0000000000000086 0000000000000000 0000000000000001
[130440.896336] 000000000000000e ffff8804345c5de8 000076857ec710a1 ffffffff81123d80
[130440.898530] ffff88043658c9f0 0000000107bfe1f2 ffff880439b245f8 ffff8804345c5fd8
[130440.900691] Call Trace:
[130440.901570] [] ? find_get_pages_tag+0x40/0x130
[130440.903118] [] jbd2_log_wait_commit+0xc5/0x140 [jbd2]
[130440.904704] [] ? autoremove_wake_function+0x0/0x40
[130440.906237] [] jbd2_complete_transaction+0x68/0xb0 [jbd2]
[130440.907907] [] ext4_sync_file+0x121/0x1d0 [ext4]
[130440.909426] [] vfs_fsync_range+0xa1/0x100
[130440.910797] [] vfs_fsync+0x1d/0x20
[130440.912099] [] do_fsync+0x3e/0x60
[130440.913418] [] sys_fsync+0x10/0x20
[130440.914720] [] system_call_fastpath+0x16/0x1b
[130440.916265] INFO: task monit:7737 blocked for more than 120 seconds.
[130440.917819] Not tainted 2.6.32-504.3.3.el6.x86_64 #1
[130440.919187] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[130440.921423] monit D 0000000000000000 0 7737 1 0x00000080
[130440.923205] ffff88043681ddc8 0000000000000082 ffff88043681de70 0000000000000001
[130440.925459] 000000000000000e ffff88043681dde8 ffff88043681dd88 ffffffff81123d80
[130440.927732] ffff8803e9691dd8 ffff88043681dde8 ffff880436800638 ffff88043681dfd8
[130440.929976] Call Trace:
[130440.930840] [] ? find_get_pages_tag+0x40/0x130
[130440.932328] [] ? prepare_to_wait+0x4e/0x80
[130440.933761] [] jbd2_log_wait_commit+0xc5/0x140 [jbd2]
[130440.935380] [] ? autoremove_wake_function+0x0/0x40
[130440.936969] [] jbd2_complete_transaction+0x68/0xb0 [jbd2]
[130440.938589] [] ext4_sync_file+0x121/0x1d0 [ext4]
[130440.940136] [] vfs_fsync_range+0xa1/0x100
[130440.941543] [] vfs_fsync+0x1d/0x20
[130440.942868] [] do_fsync+0x3e/0x60
[130440.944152] [] sys_fsync+0x10/0x20
[130440.945456] [] system_call_fastpath+0x16/0x1b
[130800.947600] INFO: task monit:7737 blocked for more than 120 seconds.
[130800.952255] Not tainted 2.6.32-504.3.3.el6.x86_64 #1
[130800.955780] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[130800.961182] monit D 0000000000000000 0 7737 1 0x00000080
[130800.965362] ffff88043681ddc8 0000000000000082 0000000000000000 0000000000000001
[130800.972321] 000000000000000e ffff88043681dde8 000076b55fefce7b ffffffff81123d80
[130800.978541] ffff8803e9691dd8 0000000107c304ac ffff880436800638 ffff88043681dfd8
[130800.984111] Call Trace:
[130800.985754] [] ? find_get_pages_tag+0x40/0x130
[130800.987516] [] ? prepare_to_wait+0x4e/0x80
[130800.989063] [] jbd2_log_wait_commit+0xc5/0x140 [jbd2]
[130800.990687] [] ? autoremove_wake_function+0x0/0x40
[130800.992253] [] jbd2_complete_transaction+0x68/0xb0 [jbd2]
[130800.994023] [] ext4_sync_file+0x121/0x1d0 [ext4]
[130800.995597] [] vfs_fsync_range+0xa1/0x100
[130800.997073] [] vfs_fsync+0x1d/0x20
[130801.004995] [] do_fsync+0x3e/0x60
[130801.006273] [] sys_fsync+0x10/0x20
[130801.007618] [] system_call_fastpath+0x16/0x1b

As observed processes/task getting "blocked for more than 120 seconds" occur.We believe this stalled the logstash related processes to enter a livelock state.On restarting the logstash related processes the data flow seemed to start. Also the redis instance on that server was @ 100% CPU when this issue occured.

This kind of issue/error was never observed during the use of logstash 1.4.1 , even on a production environment and is beleived to be only observed once switched over to logstash 1.5.0-1.

The question now is whether this is a logstash related issue , or anything got to do with the redis version impacting this on logstash I/O . ? If so what can be done to avoid this issue .
Also is there a specific version of redis required while working with logstash 1.5.0-1 ?

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions