[Gluster-devel] GlusterFS hangs/fails: Transport endpoint is not connected

Fred Hucht fred at thp.uni-due.de
Tue Nov 25 13:22:35 UTC 2008


Hi,

Crawling through all the /var/log/messages files, I found the following on
one of the failing nodes (node68):

Nov 25 04:04:12 node68 kernel: INFO: task pw.x:20052 blocked for more than 120 seconds.
Nov 25 04:04:12 node68 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 25 04:04:12 node68 kernel: pw.x          D ffff81027c3d5d68     0 20052      1
Nov 25 04:04:12 node68 kernel:  ffff81027c3d5d48 0000000000000086 ffff81021c0e7460 0000000000000000
Nov 25 04:04:12 node68 kernel:  ffff81041f14e800 000000038022a7ae ffff81041f314238 ffff81041f314000
Nov 25 04:04:12 node68 kernel:  0000000000000000 0000000000000001 0000000000000246 0000000000000003
Nov 25 04:04:12 node68 kernel: Call Trace:
Nov 25 04:04:12 node68 kernel:  [<ffffffff882ae9c7>] :fuse:request_send+0x2c8/0x2f0
Nov 25 04:04:12 node68 kernel:  [<ffffffff80242ab3>] autoremove_wake_function+0x0/0x2e
Nov 25 04:04:12 node68 kernel:  [<ffffffff80242ab3>] autoremove_wake_function+0x0/0x2e
Nov 25 04:04:12 node68 kernel:  [<ffffffff882ae037>] :fuse:fuse_request_init+0x2f/0x38
Nov 25 04:04:12 node68 kernel:  [<ffffffff882b1761>] :fuse:fuse_open_common+0xef/0x15e
Nov 25 04:04:12 node68 kernel:  [<ffffffff882b188e>] :fuse:fuse_open+0x0/0x7
Nov 25 04:04:12 node68 kernel:  [<ffffffff80286e30>] __dentry_open+0xe6/0x1ba
Nov 25 04:04:12 node68 kernel:  [<ffffffff80286f2a>] nameidata_to_filp+0x26/0x35
Nov 25 04:04:12 node68 kernel:  [<ffffffff80286f66>] do_filp_open+0x2d/0x3d
Nov 25 04:04:12 node68 kernel:  [<ffffffff80287180>] get_unused_fd_flags+0x104/0x113
Nov 25 04:04:12 node68 kernel:  [<ffffffff802872a3>] do_sys_open+0x46/0xc3
Nov 25 04:04:12 node68 kernel:  [<ffffffff8020b08b>] system_call_after_swapgs+0x7b/0x80
Nov 25 04:04:12 node68 kernel:
Nov 25 04:04:12 node68 kernel: INFO: task pw.x:20053 blocked for more than 120 seconds.
Nov 25 04:04:12 node68 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 25 04:04:12 node68 kernel: pw.x          D ffff8101c5083d68     0 20053      1
Nov 25 04:04:12 node68 kernel:  ffff8101c5083d48 0000000000000086 ffff81021c0e7460 0000000000000000
Nov 25 04:04:12 node68 kernel:  ffff81041f14a800 000000008022a7ae ffff81021d8b9238 ffff81021d8b9000
Nov 25 04:04:12 node68 kernel:  0000000000000000 0000000000000001 0000000000000246 0000000000000003
Nov 25 04:04:12 node68 kernel: Call Trace:
Nov 25 04:04:12 node68 kernel:  [<ffffffff882ae9c7>] :fuse:request_send+0x2c8/0x2f0
Nov 25 04:04:12 node68 kernel:  [<ffffffff80242ab3>] autoremove_wake_function+0x0/0x2e
Nov 25 04:04:12 node68 kernel:  [<ffffffff80242ab3>] autoremove_wake_function+0x0/0x2e
Nov 25 04:04:12 node68 kernel:  [<ffffffff882ae037>] :fuse:fuse_request_init+0x2f/0x38
Nov 25 04:04:12 node68 kernel:  [<ffffffff882b1761>] :fuse:fuse_open_common+0xef/0x15e
Nov 25 04:04:12 node68 kernel:  [<ffffffff882b188e>] :fuse:fuse_open+0x0/0x7
Nov 25 04:04:12 node68 kernel:  [<ffffffff80286e30>] __dentry_open+0xe6/0x1ba
Nov 25 04:04:12 node68 kernel:  [<ffffffff80286f2a>] nameidata_to_filp+0x26/0x35
Nov 25 04:04:12 node68 kernel:  [<ffffffff80286f66>] do_filp_open+0x2d/0x3d
Nov 25 04:04:12 node68 kernel:  [<ffffffff80287180>] get_unused_fd_flags+0x104/0x113
Nov 25 04:04:12 node68 kernel:  [<ffffffff802872a3>] do_sys_open+0x46/0xc3
Nov 25 04:04:12 node68 kernel:  [<ffffffff8020b08b>] system_call_after_swapgs+0x7b/0x80
Nov 25 04:04:12 node68 kernel:

The other two failing nodes had nothing related in the logs. Note that  
pw.x:20052 and pw.x:20053 are the two parallel jobs running on this  
node.

A similar error was logged during the crash two days ago on node22:

Nov 23 14:16:43 node22 kernel: INFO: task pw.x:32355 blocked for more than 120 seconds.
Nov 23 14:16:43 node22 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 23 14:16:43 node22 kernel: pw.x          D ffff8102049c1d68     0 32355      1
Nov 23 14:16:43 node22 kernel:  ffff8102049c1d48 0000000000000082 ffff81013e0e1c60 0000000000000000
Nov 23 14:16:43 node22 kernel:  ffff81021e4ea000 000000038022a7ae ffff81021f004a38 ffff81021f004800
Nov 23 14:16:43 node22 kernel:  0000000000000000 0000000000000001 0000000000000246 0000000000000003
Nov 23 14:16:43 node22 kernel: Call Trace:
Nov 23 14:16:43 node22 kernel:  [<ffffffff882ae9c7>] :fuse:request_send+0x2c8/0x2f0
Nov 23 14:16:43 node22 kernel:  [<ffffffff80242ab3>] autoremove_wake_function+0x0/0x2e
Nov 23 14:16:43 node22 kernel:  [<ffffffff80242ab3>] autoremove_wake_function+0x0/0x2e
Nov 23 14:16:43 node22 kernel:  [<ffffffff882ae037>] :fuse:fuse_request_init+0x2f/0x38
Nov 23 14:16:43 node22 kernel:  [<ffffffff882b1761>] :fuse:fuse_open_common+0xef/0x15e
Nov 23 14:16:43 node22 kernel:  [<ffffffff882b188e>] :fuse:fuse_open+0x0/0x7
Nov 23 14:16:43 node22 kernel:  [<ffffffff80286e30>] __dentry_open+0xe6/0x1ba
Nov 23 14:16:43 node22 kernel:  [<ffffffff80286f2a>] nameidata_to_filp+0x26/0x35
Nov 23 14:16:43 node22 kernel:  [<ffffffff80286f66>] do_filp_open+0x2d/0x3d
Nov 23 14:16:43 node22 kernel:  [<ffffffff80287180>] get_unused_fd_flags+0x104/0x113
Nov 23 14:16:43 node22 kernel:  [<ffffffff802872a3>] do_sys_open+0x46/0xc3
Nov 23 14:16:43 node22 kernel:  [<ffffffff8020b08b>] system_call_after_swapgs+0x7b/0x80
Nov 23 14:16:43 node22 kernel:

That's all there is in /var/log/messages. Remember that the program "pw.x"
runs without problems via NFS, and that it is the only program used for
testing at present.
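
For anyone who wants to repeat the search on their own nodes, here is a
rough Python sketch of how the hung-task reports could be pulled out of a
syslog file. It assumes each entry sits on one line in the on-disk log and
follows the "<date> <host> kernel: INFO: task <name>:<pid> blocked for more"
layout seen in the excerpts above. The names find_hung_tasks and report are
made up for this illustration, and the default path is only a guess at where
the log lives.

```python
import re
from collections import defaultdict

# Pattern modelled on the kernel hung-task lines quoted in this mail.
# Assumption: one log entry per line, in the classic syslog layout.
HUNG_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+[\d:]+)\s+(?P<host>\S+)\s+kernel:\s+"
    r"INFO: task (?P<comm>\S+):(?P<pid>\d+) blocked for more"
)

def find_hung_tasks(lines):
    """Return {host: [(timestamp, command, pid), ...]} for hung-task lines."""
    hits = defaultdict(list)
    for line in lines:
        m = HUNG_RE.match(line)
        if m:
            hits[m["host"]].append((m["stamp"], m["comm"], int(m["pid"])))
    return dict(hits)

def report(path="/var/log/messages"):
    """Print one summary line per hung task found in the given log file."""
    with open(path, errors="replace") as fh:
        for host, tasks in find_hung_tasks(fh).items():
            for stamp, comm, pid in tasks:
                print(f"{host}: {comm}[{pid}] reported hung at {stamp}")
```

Calling report() on each node (or on a collected copy of its log) should
list every task the kernel flagged, which makes it easier to compare nodes.
It only finds the "blocked for more" header line and does not parse the
call trace that follows.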

Fred

On 25.11.2008, at 13:42, Joe Landman wrote:

> Fred Hucht wrote:
>> Hi!
>> The glusterfsd.log on all nodes are virtually empty, the only entry  
>> on 2008-11-25 reads
>> 2008-11-25 03:13:48 E [io-threads.c:273:iot_flush] sc1-ioth: fd  
>> context is NULL, returning EBADFD
>> on all nodes. I don't think that this is related to our problems.
>> Regards,
>>     Fred
>
> Hi Fred
>
>  Could you post complete /var/log/messages file on pastebin?  I have  
> seen something like this before when fuse crashes.  Fuse crashing  
> could be due to a bug in fuse, the kernel, etc.  Also could be  
> hardware that is failing.
>
>  Does an unmount/remount fix the problem?
>
> Joe
>
>
> -- 
> Joseph Landman, Ph.D
> Founder and CEO
> Scalable Informatics LLC,
> email: landman at scalableinformatics.com
> web  : http://www.scalableinformatics.com
>       http://jackrabbit.scalableinformatics.com
> phone: +1 734 786 8423 x121
> fax  : +1 866 888 3112
> cell : +1 734 612 4615

Dr. Fred Hucht <fred at thp.Uni-DuE.de>
Institute for Theoretical Physics
University of Duisburg-Essen, 47048 Duisburg, Germany





