[Gluster-users] glusterfs crash when the one of replicate node restart

Changliang Chen hqucocl at gmail.com
Thu Dec 15 11:02:59 UTC 2011


Hi pranithk,

    Thanks for your reply.
    To keep the service available we did not strace the process; after
shutting down the daemon, the cluster recovered.
    In our case:
    10.1.1.64 (dfs-client-6): online node; when the other node (.65)
restarted, its user CPU usage reached 100% (glusterfsd process).
    10.1.1.65 (dfs-client-7): offline node; while it was restarting, the
client NFS mount point was unavailable.
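
For reference, if we hit this again, a rough way to see what the busy
glusterfsd is doing without taking it down (a sketch only; it assumes gdb
is installed on the brick node, and <PID> stands for the actual brick
process id) could be:

    # list glusterfsd processes and watch per-thread CPU on the busy one
    pidof glusterfsd
    top -H -p <PID>
    # take a one-shot backtrace of all threads, then detach
    gdb -p <PID> -batch -ex 'thread apply all bt'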

The nfs.log suggests the issue is caused by the high CPU usage on
client-6; it contains many errors like:

[2011-12-14 13:25:53.30308] E [rpc-clnt.c:197:call_bail]
0-19loudfs-client-6: bailing out frame type(GlusterFS 3.1) op(XATTROP(33))
xid = 0x89279937x sent = 2011-12-14 13:25:20.346007. timeout = 30
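
We also note that we run network.frame-timeout at 30 seconds (the bails
above show "timeout = 30"), so any call the overloaded brick cannot answer
within 30 seconds is dropped. If that turns out to be part of the problem,
one possible mitigation (a sketch only; the exact value needs tuning, and
I believe the GlusterFS default is 1800) would be:

    # raise the frame timeout on the volume (19loudfs, per the log prefix)
    gluster volume set 19loudfs network.frame-timeout 1800

This would not fix the 100% CPU itself, but it might stop in-flight
operations from being bailed out while the self-heal is running.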







On Wed, Dec 14, 2011 at 6:49 PM, Pranith Kumar K <pranithk at gluster.com>wrote:

>  On 12/14/2011 03:06 PM, Changliang Chen wrote:
>
> Hi, we have used glusterfs for two years. After upgrading to 3.2.5, we
> discovered that when one of the replicate nodes reboots and starts the
> glusterd daemon, gluster crashes because the CPU usage on the other
> replicate node reaches 100%.
>
> Our gluster info:
>
> Type: Distributed-Replicate
> Status: Started
> Number of Bricks: 5 x 2 = 10
> Transport-type: tcp
> Options Reconfigured:
> performance.cache-size: 3GB
> performance.cache-max-file-size: 512KB
> network.frame-timeout: 30
> network.ping-timeout: 25
> cluster.min-free-disk: 10%
>
>  Our hardware:
>
> Dell R710
> 6 x 600 GB SAS disks
> 3 x 8 GB memory modules (24 GB RAM)
>
> The error info:
>
> [2011-12-14 13:24:10.483812] E [rdma.c:4813:init] 0-rdma.management:
> Failed to initialize IB Device
> [2011-12-14 13:24:10.483828] E [rpc-transport.c:742:rpc_transport_load]
> 0-rpc-transport: 'rdma' initialization failed
> [2011-12-14 13:24:10.483841] W [rpcsvc.c:1288:rpcsvc_transport_create]
> 0-rpc-service: cannot create listener, initing the transport failed
> [2011-12-14 13:24:11.967621] E
> [glusterd-store.c:1820:glusterd_store_retrieve_volume] 0-: Unknown key:
> brick-0
> [2011-12-14 13:24:11.967665] E
> [glusterd-store.c:1820:glusterd_store_retrieve_volume] 0-: Unknown key:
> brick-1
> [2011-12-14 13:24:11.967681] E
> [glusterd-store.c:1820:glusterd_store_retrieve_volume] 0-: Unknown key:
> brick-2
> [2011-12-14 13:24:11.967695] E
> [glusterd-store.c:1820:glusterd_store_retrieve_volume] 0-: Unknown key:
> brick-3
> [2011-12-14 13:24:11.967709] E
> [glusterd-store.c:1820:glusterd_store_retrieve_volume] 0-: Unknown key:
> brick-4
> [2011-12-14 13:24:11.967723] E
> [glusterd-store.c:1820:glusterd_store_retrieve_volume] 0-: Unknown key:
> brick-5
> [2011-12-14 13:24:11.967736] E
> [glusterd-store.c:1820:glusterd_store_retrieve_volume] 0-: Unknown key:
> brick-6
> [2011-12-14 13:24:11.967750] E
> [glusterd-store.c:1820:glusterd_store_retrieve_volume] 0-: Unknown key:
> brick-7
> [2011-12-14 13:24:11.967764] E
> [glusterd-store.c:1820:glusterd_store_retrieve_volume] 0-: Unknown key:
> brick-8
> [2011-12-14 13:24:11.967777] E
> [glusterd-store.c:1820:glusterd_store_retrieve_volume] 0-: Unknown key:
> brick-9
> [2011-12-14 13:24:12.465565] W
> [socket.c:1494:__socket_proto_state_machine] 0-socket.management: reading
> from socket failed. Error (Transport endpoint is not connected), peer (
> 10.1.1.17:1013)
> [2011-12-14 13:24:12.465623] W
> [socket.c:1494:__socket_proto_state_machine] 0-socket.management: reading
> from socket failed. Error (Transport endpoint is not connected), peer (
> 10.1.1.8:1013)
> [2011-12-14 13:24:12.465656] W
> [socket.c:1494:__socket_proto_state_machine] 0-socket.management: reading
> from socket failed. Error (Transport endpoint is not connected), peer (
> 10.1.1.10:1013)
> [2011-12-14 13:24:12.465686] W
> [socket.c:1494:__socket_proto_state_machine] 0-socket.management: reading
> from socket failed. Error (Transport endpoint is not connected), peer (
> 10.1.1.11:1013)
> [2011-12-14 13:24:12.465716] W
> [socket.c:1494:__socket_proto_state_machine] 0-socket.management: reading
> from socket failed. Error (Transport endpoint is not connected), peer (
> 10.1.1.125:1013)
> [2011-12-14 13:24:12.633288] W
> [socket.c:1494:__socket_proto_state_machine] 0-socket.management: reading
> from socket failed. Error (Transport endpoint is not connected), peer (
> 10.1.1.65:1006)
> [2011-12-14 13:24:13.138150] W
> [socket.c:1494:__socket_proto_state_machine] 0-socket.management: reading
> from socket failed. Error (Transport endpoint is not connected), peer (
> 10.1.1.1:1013)
> [2011-12-14 13:24:13.284665] W
> [socket.c:1494:__socket_proto_state_machine] 0-socket.management: reading
> from socket failed. Error (Transport endpoint is not connected), peer (
> 10.1.1.3:1013)
> [2011-12-14 13:24:15.790805] W
> [socket.c:1494:__socket_proto_state_machine] 0-socket.management: reading
> from socket failed. Error (Transport endpoint is not connected), peer (
> 10.1.1.8:1013)
> [2011-12-14 13:24:16.113430] W
> [socket.c:1494:__socket_proto_state_machine] 0-socket.management: reading
> from socket failed. Error (Transport endpoint is not connected), peer (
> 10.1.1.125:1013)
> [2011-12-14 13:24:16.259040] W
> [socket.c:1494:__socket_proto_state_machine] 0-socket.management: reading
> from socket failed. Error (Transport endpoint is not connected), peer (
> 10.1.1.10:1013)
> [2011-12-14 13:24:16.392058] W
> [socket.c:1494:__socket_proto_state_machine] 0-socket.management: reading
> from socket failed. Error (Transport endpoint is not connected), peer (
> 10.1.1.17:1013)
> [2011-12-14 13:24:16.429444] W
> [socket.c:1494:__socket_proto_state_machine] 0-socket.management: reading
> from socket failed. Error (Transport endpoint is not connected), peer (
> 10.1.1.11:1013)
> [2011-12-14 13:26:05.787680] W [glusterfsd.c:727:cleanup_and_exit]
> (-->/lib64/libc.so.6(clone+0x6d) [0x37c8ed3c2d] (-->/lib64/libpthread.so.0
> [0x37c96064a7]
> (-->/opt/glusterfs/3.2.5/sbin/glusterd(glusterfs_sigwaiter+0x17c)
> [0x40477c]))) 0-: received signum (15), shutting down
>
>  --
>
> Regards,
>
> Cocl
>
>
>
> _______________________________________________
> Gluster-users mailing list
> Gluster-users at gluster.org
> http://gluster.org/cgi-bin/mailman/listinfo/gluster-users
>
>  Hi Changliang,
>         Could you specify which process crashed? Is it glusterd or
> glusterfs? Could you provide the stack trace that is present in its
> respective logfile? I don't see any stack trace in the logs you have
> provided.
>
> Pranith
>



-- 

Regards,

Cocl
OM manager
19lou Operation & Maintenance Dept
-------------- next part --------------
[2011-12-14 13:24:37.313566] E [afr-self-heal-entry.c:2201:afr_sh_post_nonblocking_entry_cbk] 0-19loudfs-replicate-3: Non Blocking entrylks failed for /sbsforum/attachment/jiaxing/2011/11/24.
[2011-12-14 13:24:37.313606] E [afr-self-heal-common.c:2074:afr_self_heal_completion_cbk] 0-19loudfs-replicate-3: background  entry self-heal failed on /sbsforum/attachment/jiaxing/2011/11/24
[2011-12-14 13:24:37.313696] E [afr-self-heal-entry.c:2201:afr_sh_post_nonblocking_entry_cbk] 0-19loudfs-replicate-3: Non Blocking entrylks failed for /sbsforum/attachment/jiaxing/2011/11/24.
[2011-12-14 13:24:37.313719] E [afr-self-heal-common.c:2074:afr_self_heal_completion_cbk] 0-19loudfs-replicate-3: background  entry self-heal failed on /sbsforum/attachment/jiaxing/2011/11/24
[2011-12-14 13:24:54.469090] E [afr-self-heal-entry.c:2201:afr_sh_post_nonblocking_entry_cbk] 0-19loudfs-replicate-3: Non Blocking entrylks failed for /sbsforum/avatar/s/0.
[2011-12-14 13:24:54.469129] E [afr-self-heal-common.c:2074:afr_self_heal_completion_cbk] 0-19loudfs-replicate-3: background  entry self-heal failed on /sbsforum/avatar/s/0
[2011-12-14 13:24:55.40255] E [afr-self-heal-entry.c:2201:afr_sh_post_nonblocking_entry_cbk] 0-19loudfs-replicate-3: Non Blocking entrylks failed for /sbsforum/attachment/taizhou/2011/12/12/17.
[2011-12-14 13:24:55.40299] E [afr-self-heal-common.c:2074:afr_self_heal_completion_cbk] 0-19loudfs-replicate-3: background  entry self-heal failed on /sbsforum/attachment/taizhou/2011/12/12/17
[2011-12-14 13:25:53.30308] E [rpc-clnt.c:197:call_bail] 0-19loudfs-client-6: bailing out frame type(GlusterFS 3.1) op(XATTROP(33)) xid = 0x89279937x sent = 2011-12-14 13:25:20.346007. timeout = 30
[2011-12-14 13:25:53.30404] E [rpc-clnt.c:197:call_bail] 0-19loudfs-client-6: bailing out frame type(GlusterFS 3.1) op(FXATTROP(34)) xid = 0x89278822x sent = 2011-12-14 13:25:20.60117. timeout = 30
[2011-12-14 13:25:53.30454] E [rpc-clnt.c:197:call_bail] 0-19loudfs-client-6: bailing out frame type(GlusterFS 3.1) op(XATTROP(33)) xid = 0x89278701x sent = 2011-12-14 13:25:19.891163. timeout = 30
[2011-12-14 13:25:53.30506] E [rpc-clnt.c:197:call_bail] 0-19loudfs-client-6: bailing out frame type(GlusterFS 3.1) op(GETXATTR(18)) xid = 0x89278356x sent = 2011-12-14 13:25:19.837275. timeout = 30
[2011-12-14 13:25:53.30562] E [rpc-clnt.c:197:call_bail] 0-19loudfs-client-6: bailing out frame type(GlusterFS 3.1) op(XATTROP(33)) xid = 0x89277981x sent = 2011-12-14 13:25:19.749399. timeout = 30
[2011-12-14 13:25:53.30625] E [rpc-clnt.c:197:call_bail] 0-19loudfs-client-6: bailing out frame type(GlusterFS 3.1) op(GETXATTR(18)) xid = 0x89277977x sent = 2011-12-14 13:25:19.748452. timeout = 30
[2011-12-14 13:25:53.30658] E [rpc-clnt.c:197:call_bail] 0-19loudfs-client-6: bailing out frame type(GlusterFS 3.1) op(FXATTROP(34)) xid = 0x89277971x sent = 2011-12-14 13:25:19.703687. timeout = 30
[2011-12-14 13:25:53.30699] E [rpc-clnt.c:197:call_bail] 0-19loudfs-client-6: bailing out frame type(GlusterFS 3.1) op(READ(12)) xid = 0x89277967x sent = 2011-12-14 13:25:19.697414. timeout = 30
[2011-12-14 13:25:53.30716] E [client3_1-fops.c:2302:client3_1_readv_cbk] 0-19loudfs-client-6: remote operation failed: Transport endpoint is not connected
[2011-12-14 13:25:53.30751] E [rpc-clnt.c:197:call_bail] 0-19loudfs-client-6: bailing out frame type(GlusterFS 3.1) op(READ(12)) xid = 0x89277966x sent = 2011-12-14 13:25:19.697402. timeout = 30
[2011-12-14 13:25:53.30766] E [client3_1-fops.c:2302:client3_1_readv_cbk] 0-19loudfs-client-6: remote operation failed: Transport endpoint is not connected
[2011-12-14 13:25:53.30799] E [rpc-clnt.c:197:call_bail] 0-19loudfs-client-6: bailing out frame type(GlusterFS 3.1) op(READ(12)) xid = 0x89277965x sent = 2011-12-14 13:25:19.697388. timeout = 30
[2011-12-14 13:25:53.30814] E [client3_1-fops.c:2302:client3_1_readv_cbk] 0-19loudfs-client-6: remote operation failed: Transport endpoint is not connected
[2011-12-14 13:25:53.30848] E [rpc-clnt.c:197:call_bail] 0-19loudfs-client-6: bailing out frame type(GlusterFS 3.1) op(READ(12)) xid = 0x89277964x sent = 2011-12-14 13:25:19.697363. timeout = 30
[2011-12-14 13:25:53.30863] E [client3_1-fops.c:2302:client3_1_readv_cbk] 0-19loudfs-client-6: remote operation failed: Transport endpoint is not connected
[2011-12-14 13:25:53.30892] E [rpc-clnt.c:197:call_bail] 0-19loudfs-client-6: bailing out frame type(GlusterFS 3.1) op(FXATTROP(34)) xid = 0x89277952x sent = 2011-12-14 13:25:19.690527. timeout = 30
[2011-12-14 13:26:03.40338] E [rpc-clnt.c:197:call_bail] 0-19loudfs-client-6: bailing out frame type(GlusterFS 3.1) op(READ(12)) xid = 0x89281996x sent = 2011-12-14 13:25:31.226789. timeout = 30
[2011-12-14 13:26:03.40424] E [rpc-clnt.c:197:call_bail] 0-19loudfs-client-6: bailing out frame type(GlusterFS 3.1) op(READ(12)) xid = 0x89281989x sent = 2011-12-14 13:25:25.67507. timeout = 30
[2011-12-14 13:26:13.50375] E [rpc-clnt.c:197:call_bail] 0-19loudfs-client-6: bailing out frame type(GlusterFS 3.1) op(GETXATTR(18)) xid = 0x89282333x sent = 2011-12-14 13:25:37.683375. timeout = 30
[2011-12-14 13:26:23.69350] E [rpc-clnt.c:197:call_bail] 0-19loudfs-client-6: bailing out frame type(GlusterFS 3.1) op(GETXATTR(18)) xid = 0x89283335x sent = 2011-12-14 13:25:52.976995. timeout = 30
[2011-12-14 13:26:33.88579] E [rpc-clnt.c:197:call_bail] 0-19loudfs-client-6: bailing out frame type(GlusterFS 3.1) op(WRITE(13)) xid = 0x89283367x sent = 2011-12-14 13:25:53.281734. timeout = 30
[2011-12-14 13:26:33.88709] E [rpc-clnt.c:197:call_bail] 0-19loudfs-client-6: bailing out frame type(GlusterFS 3.1) op(WRITE(13)) xid = 0x89283366x sent = 2011-12-14 13:25:53.281239. timeout = 30
[2011-12-14 13:26:33.88758] E [rpc-clnt.c:197:call_bail] 0-19loudfs-client-6: bailing out frame type(GlusterFS 3.1) op(WRITE(13)) xid = 0x89283365x sent = 2011-12-14 13:25:53.281168. timeout = 30
[2011-12-14 13:26:33.88804] E [rpc-clnt.c:197:call_bail] 0-19loudfs-client-6: bailing out frame type(GlusterFS 3.1) op(WRITE(13)) xid = 0x89283363x sent = 2011-12-14 13:25:53.279828. timeout = 30
[2011-12-14 13:26:33.88850] E [rpc-clnt.c:197:call_bail] 0-19loudfs-client-6: bailing out frame type(GlusterFS 3.1) op(GETXATTR(18)) xid = 0x89283362x sent = 2011-12-14 13:25:53.279735. timeout = 30
[2011-12-14 13:26:33.88947] E [rpc-clnt.c:197:call_bail] 0-19loudfs-client-6: bailing out frame type(GlusterFS 3.1) op(WRITE(13)) xid = 0x89283360x sent = 2011-12-14 13:25:53.279094. timeout = 30
[2011-12-14 13:26:33.89055] E [rpc-clnt.c:197:call_bail] 0-19loudfs-client-6: bailing out frame type(GlusterFS 3.1) op(XATTROP(33)) xid = 0x89283356x sent = 2011-12-14 13:25:53.278381. timeout = 30
[2011-12-14 13:26:33.89124] E [rpc-clnt.c:197:call_bail] 0-19loudfs-client-6: bailing out frame type(GlusterFS 3.1) op(SETATTR(38)) xid = 0x89283350x sent = 2011-12-14 13:25:53.72900. timeout = 30
[2011-12-14 13:26:33.89242] E [rpc-clnt.c:197:call_bail] 0-19loudfs-client-6: bailing out frame type(GlusterFS 3.1) op(FLUSH(15)) xid = 0x89283349x sent = 2011-12-14 13:25:53.72890. timeout = 30
[2011-12-14 13:26:33.89333] E [rpc-clnt.c:197:call_bail] 0-19loudfs-client-6: bailing out frame type(GlusterFS 3.1) op(XATTROP(33)) xid = 0x89283343x sent = 2011-12-14 13:25:53.31200. timeout = 30
[2011-12-14 13:26:33.89389] E [rpc-clnt.c:197:call_bail] 0-19loudfs-client-6: bailing out frame type(GlusterFS 3.1) op(XATTROP(33)) xid = 0x89283342x sent = 2011-12-14 13:25:53.31093. timeout = 30

