[Gluster-users] CentOS 5.5 kernel bugs can cause temporary hangs upon client access to GlusterFS

Burnash, James jburnash at knight.com
Tue Jul 12 14:33:41 UTC 2011


Got a complaint from a user - the native GlusterFS mountpoint was completely inaccessible from many (if not all) clients attempting to read from or write to it.

Apparently not the fault of GlusterFS - here's the entry from the messages file:

Jul  8 16:15:13 jc1letgfs13 kernel: [3022057.692284] INFO: task glusterfsd:12902 blocked for more than 120 seconds.
Jul  8 16:15:13 jc1letgfs13 kernel: [3022057.692544] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul  8 16:15:13 jc1letgfs13 kernel: [3022057.693037] glusterfsd    D ffffffff80151248     0 12902      1         12904 12903 (NOTLB)
Jul  8 16:15:13 jc1letgfs13 kernel: [3022057.693553]  ffff81061190bbf8 0000000000000086 ffff81061190bea8 0000000000000000
Jul  8 16:15:13 jc1letgfs13 kernel: [3022057.694099]  000000000000000c 000000000000000a ffff810627eec0c0 ffff810c27f32100
Jul  8 16:15:13 jc1letgfs13 kernel: [3022057.694660]  000abc5dc58f770c 0000000000005135 ffff810627eec2a8 000000038000b3fd

... and here's one for a non-Gluster process:

Jul  8 16:07:13 jc1letgfs13 kernel: [3021577.761299] INFO: task jbd2/cciss!c2d0:4090 blocked for more than 120 seconds.
Jul  8 16:07:13 jc1letgfs13 kernel: [3021577.761908] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul  8 16:07:13 jc1letgfs13 kernel: [3021577.762505] jbd2/cciss!c2 D ffffffff80151248     0  4090    456          4091  4085 (L-TLB)
Jul  8 16:07:13 jc1letgfs13 kernel: [3021577.763129]  ffff810617e45d60 0000000000000046 ffff810617e45da0 ffffffff8008ccb0
Jul  8 16:07:13 jc1letgfs13 kernel: [3021577.763753]  ffff810617e45cf0 000000000000000a ffff81063d22e820 ffff810c20b3c100
Jul  8 16:07:13 jc1letgfs13 kernel: [3021577.764370]  000abbf070cd535b 0000000000003c6a ffff81063d22ea08 0000000300000000
Jul  8 16:07:13 jc1letgfs13 kernel: [3021577.764693] Call Trace:
Jul  8 16:07:13 jc1letgfs13 kernel: [3021577.765247]  [<ffffffff8008ccb0>] find_busiest_group+0x20d/0x621
Jul  8 16:07:13 jc1letgfs13 kernel: [3021577.765543]  [<ffffffff88342fad>] :jbd2:jbd2_journal_commit_transaction+0x191/0x1080
Jul  8 16:07:13 jc1letgfs13 kernel: [3021577.766064]  [<ffffffff800a1ba4>] autoremove_wake_function+0x0/0x2e
Jul  8 16:07:13 jc1letgfs13 kernel: [3021577.766327]  [<ffffffff8003ddd5>] lock_timer_base+0x1b/0x3c
Jul  8 16:07:13 jc1letgfs13 kernel: [3021577.766588]  [<ffffffff8004b6b6>] try_to_del_timer_sync+0x7f/0x88
Jul  8 16:07:13 jc1letgfs13 kernel: [3021577.766853]  [<ffffffff88346d72>] :jbd2:kjournald2+0x9a/0x1ec
Jul  8 16:07:13 jc1letgfs13 kernel: [3021577.767109]  [<ffffffff800a1ba4>] autoremove_wake_function+0x0/0x2e
Jul  8 16:07:13 jc1letgfs13 kernel: [3021577.767374]  [<ffffffff88346cd8>] :jbd2:kjournald2+0x0/0x1ec
Jul  8 16:07:13 jc1letgfs13 kernel: [3021577.767627]  [<ffffffff800a198c>] keventd_create_kthread+0x0/0xc4
Jul  8 16:07:13 jc1letgfs13 kernel: [3021577.767880]  [<ffffffff80032bdc>] kthread+0xfe/0x132
Jul  8 16:07:13 jc1letgfs13 kernel: [3021577.768138]  [<ffffffff8005efb1>] child_rip+0xa/0x11
Jul  8 16:07:13 jc1letgfs13 kernel: [3021577.768399]  [<ffffffff800a198c>] keventd_create_kthread+0x0/0xc4
Jul  8 16:07:13 jc1letgfs13 kernel: [3021577.768656]  [<ffffffff80032ade>] kthread+0x0/0x132
Jul  8 16:07:13 jc1letgfs13 kernel: [3021577.768922]  [<ffffffff8005efa7>] child_rip+0x0/0x11

Haven't found the specific bug number for this (CentOS 5.5) yet.

Running GlusterFS 3.1.3 on clients and 2 servers set up as Replicated-Distribute.

Hopefully this will help others. I will be upgrading to CentOS 5.6 as soon as possible on these servers.
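
For anyone who wants to check their own servers for the same symptom, here is a minimal sketch that scans syslog lines for the kernel's "blocked for more than N seconds" warning shown in the excerpts above. The function names (find_hung_tasks, scan_log) are made up for illustration and are not part of GlusterFS or any kernel tooling. It assumes Python 3, which stock CentOS 5 does not ship, so it may be easiest to run it against a copy of the log on another machine.

```python
import re

# Matches the kernel hung-task warning, e.g.
# "INFO: task glusterfsd:12902 blocked for more than 120 seconds."
# The task name may itself contain odd characters (e.g. "jbd2/cciss!c2d0"),
# so it is captured greedily up to the last ":<pid> blocked".
HUNG_TASK_RE = re.compile(
    r"INFO: task (?P<name>.+):(?P<pid>\d+) "
    r"blocked for more than (?P<secs>\d+) seconds"
)


def find_hung_tasks(lines):
    """Return a list of (task_name, pid, seconds) for each warning found."""
    hits = []
    for line in lines:
        match = HUNG_TASK_RE.search(line)
        if match:
            hits.append(
                (match.group("name"),
                 int(match.group("pid")),
                 int(match.group("secs")))
            )
    return hits


def scan_log(path):
    """Hypothetical helper: print every hung-task warning found in a log file."""
    # errors="replace" keeps stray non-UTF-8 bytes from aborting the scan.
    with open(path, errors="replace") as log:
        for name, pid, secs in find_hung_tasks(log):
            print(f"{name} (pid {pid}) blocked for more than {secs}s")
```

Calling scan_log("/var/log/messages") would list each affected task; that path is the usual syslog location on CentOS, but adjust it if your syslog configuration writes elsewhere. Seeing non-Gluster tasks such as jbd2 in the output alongside glusterfsd is what pointed us away from GlusterFS itself.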

Kudos to my coworker Joe Collette for running this issue to ground.

James Burnash
Unix Engineer
Knight Capital Group





