[Gluster-users] How to debug a hanging client?

Fri May 13 15:12:53 UTC 2011

Error messages on pserver12 (opt*.log)

[2011-05-13 11:41:58.812937] E [client-handshake.c:116:rpc_client_ping_timer_expired] 0-storage0-client-0: Server 10.6.0.108:24009 has not responded in the last 5 seconds, disconnecting.
[2011-05-13 12:11:57.954369] E [rpc-clnt.c:199:call_bail] 0-storage0-client-0: bailing out frame type(GlusterFS Handshake) op(PING(3)) xid = 0x210x sent = 2011-05-13 11:41:53.422855. timeout = 1800
[2011-05-13 12:11:57.954415] E [rpc-clnt.c:199:call_bail] 0-storage0-client-0: bailing out frame type(GlusterFS 3.1) op(LOOKUP(27)) xid = 0x209x sent = 2011-05-13 11:41:53.422846. timeout = 1800
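
The timestamps in the two call_bail entries above can be checked against their timeout value. The following is a minimal sketch that only does arithmetic on those two quoted lines; the names BAIL_RE and bail_delays are made up for illustration and are not part of GlusterFS, and the regular expression assumes the log format is exactly as shown here.

```python
import re
from datetime import datetime, timedelta

# The two call_bail entries from the pserver12 log, each joined onto one line.
LOG = """\
[2011-05-13 12:11:57.954369] E [rpc-clnt.c:199:call_bail] 0-storage0-client-0: bailing out frame type(GlusterFS Handshake) op(PING(3)) xid = 0x210x sent = 2011-05-13 11:41:53.422855. timeout = 1800
[2011-05-13 12:11:57.954415] E [rpc-clnt.c:199:call_bail] 0-storage0-client-0: bailing out frame type(GlusterFS 3.1) op(LOOKUP(27)) xid = 0x209x sent = 2011-05-13 11:41:53.422846. timeout = 1800
"""

# Hypothetical pattern: assumes the call_bail line layout quoted in this post.
BAIL_RE = re.compile(
    r"^\[(?P<logged>[^\]]+)\] E \[[^\]]*call_bail\].*?"
    r"op\((?P<op>\w+)\(\d+\)\).*?"
    r"sent = (?P<sent>[\d-]+ [\d:.]+?)\. timeout = (?P<timeout>\d+)"
)
FMT = "%Y-%m-%d %H:%M:%S.%f"


def bail_delays(log_text):
    """Return (op, sent + timeout, seconds between that deadline and the log entry)."""
    results = []
    for line in log_text.splitlines():
        m = BAIL_RE.match(line)
        if not m:
            continue  # not a call_bail entry
        logged = datetime.strptime(m.group("logged"), FMT)
        sent = datetime.strptime(m.group("sent"), FMT)
        deadline = sent + timedelta(seconds=int(m.group("timeout")))
        results.append((m.group("op"), deadline, (logged - deadline).total_seconds()))
    return results


if __name__ == "__main__":
    for op, deadline, late in bail_delays(LOG):
        print(f"{op}: deadline {deadline}, bailed {late:.1f} s after it")
```

Both frames were sent at 11:41:53 and bailed out a few seconds after sent + 1800 s, so the client appears to have held these requests for the full timeout before giving up on them.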

Errors on pserver8 (the peer):

[2011-05-13 14:51:26.727334] E [rdma.c:3423:rdma_handle_failed_send_completion] 0-rpc-transport/rdma: send work request on `mlx4_0' returned error wc.status = 12, wc.vendor_err = 129, post->buf = 0x43fa000, wc.byte_len = 0, post->reused = 89791
[2011-05-13 14:51:26.727374] E [rdma.c:3431:rdma_handle_failed_send_completion] 0-rdma: connection between client and server not working. check by running 'ibv_srq_pingpong'. also make sure subnet manager is running (eg: 'opensm'), or check if rdma port is valid (or active) by running 'ibv_devinfo'. contact Gluster Support Team if the problem persists.
[2011-05-13 14:51:26.727617] E [rpc-clnt.c:340:saved_frames_unwind] (-->/usr/lib/libgfrpc.so.0(rpc_clnt_notify+0x77) [0x7f397dd0ba07] (-->/usr/lib/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x7e) [0x7f397dd0b19e] (-->/usr/lib/libgfrpc.so.0(saved_frames_destroy+0xe) [0x7f397dd0b0fe]))) 0-rpc-clnt: forced unwinding frame type(GF-DUMP) op(DUMP(1)) called at 2011-05-13 14:51:22.620059
[2011-05-13 14:51:26.727670] M [client-handshake.c:1178:client_dump_version_cbk] 0-: some error, retry again later
[2011-05-13 14:51:26.727686] I [client.c:1601:client_rpc_notify] 0-storage0-client-1: disconnected

Could this be a bad IB card? After a reboot of pserver12 the system works
again; an attempt to shut down and restart just the ib0 interface failed (it hung).

Best, Martin

-----Original Message-----
From: Martin Schenker [mailto:martin.schenker at profitbricks.com] 
Sent: Friday, May 13, 2011 3:36 PM
To: 'gluster-users at gluster.org'
Subject: How to debug a hanging client?

Hi all!

We have one server/client where the client part hangs quite often.

Strace shows:
0 root at de-blnstage-c2-pserver12:~ # strace -Tfv -p 12407 (
Process 12407 attached with 6 threads - interrupt to quit
[pid 12417] futex(0x2cb98a8, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid 12412] read(12,  <unfinished ...>
[pid 12411] read(11,  <unfinished ...>
[pid 12410] futex(0x2cb9330, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid 12408] rt_sigtimedwait([HUP INT TRAP BUS USR1 USR2 PIPE ALRM TERM CHLD TTOU], NULL, NULL, 8

I can read from the server mountpoint just fine, but any access to the
FUSE-mounted glusterfs hangs and can only be killed.

Any idea how to resolve this? If I try to kill all glusterfs processes, the
kill -9 on the process

root     12407     1  0 May11 ?        00:00:01 /usr/sbin/glusterfs
--log-level=NORMAL --volfile-id=storage0 --volfile-server=localhost
/opt/profitbricks/storage

will hang as well. Just like an NFS server hang... waiting for I/O
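
A process that does not react to kill -9 is typically stuck in uninterruptible sleep (state D) inside the kernel, which is also what a hung NFS mount looks like. Whether that is the case here is an assumption. A minimal sketch for checking it on Linux reads the State field from /proc for each thread of the process; the function names parse_proc_state and thread_states are hypothetical helpers, not part of any Gluster tooling.

```python
import re
import sys
from pathlib import Path


def parse_proc_state(status_text):
    """Return the one-letter state code from /proc/<pid>/status text, or None."""
    m = re.search(r"^State:\s+(\S)", status_text, re.MULTILINE)
    return m.group(1) if m else None


def thread_states(pid):
    """Map thread id -> state letter for every thread of pid (Linux only)."""
    states = {}
    for task in Path(f"/proc/{pid}/task").iterdir():
        try:
            states[int(task.name)] = parse_proc_state((task / "status").read_text())
        except OSError:
            continue  # thread exited while we were scanning
    return states


if __name__ == "__main__":
    pid = int(sys.argv[1])  # e.g. 12407, the glusterfs PID from the ps line above
    for tid, state in sorted(thread_states(pid).items()):
        note = "  <- uninterruptible sleep" if state == "D" else ""
        print(f"{tid} {state}{note}")
```

If any thread reports D, the signal stays pending until the blocked kernel operation returns, which would match the reboot being the only thing that cleared the hang.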

Thanks, Martin