[Gluster-users] glusterfs crash when the one of replicate node restart

Changliang Chen hqucocl at gmail.com
Fri Dec 16 11:32:27 UTC 2011


Hi pranithk,

     The attachments provide three logs: the NFS log, the client-6 log, and the client-7 log.
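
     In case it helps when going through the NFS log, here is a minimal sketch (plain Python, not part of GlusterFS) that counts the call_bail entries per client translator and operation. The regular expression is an assumption based only on the single nfs.log line quoted further down in this thread, and the names BAIL_RE and count_bails are made up for illustration. Other GlusterFS versions may format the line differently.

```python
import re
from collections import Counter

# Assumed pattern, derived from the one call_bail line quoted in this thread.
BAIL_RE = re.compile(
    r"\[(?P<ts>[\d-]+ [\d:.]+)\] E \[rpc-clnt\.c:\d+:call_bail\] "
    r"(?P<xlator>\S+): bailing out frame type\((?P<prog>[^)]*)\) "
    r"op\((?P<op>[A-Z_]+)\(\d+\)\)"
)

def count_bails(text):
    """Count call_bail errors keyed by (translator, operation)."""
    # Mail clients may wrap log lines, so collapse all whitespace first.
    flat = " ".join(text.split())
    counts = Counter()
    for match in BAIL_RE.finditer(flat):
        counts[(match.group("xlator"), match.group("op"))] += 1
    return counts

# The sample is the line from nfs.log quoted in this thread.
sample = """[2011-12-14 13:25:53.30308] E [rpc-clnt.c:197:call_bail]
0-19loudfs-client-6: bailing out frame type(GlusterFS 3.1) op(XATTROP(33))
xid = 0x89279937x sent = 2011-12-14 13:25:20.346007. timeout = 30"""

for (xlator, op), n in count_bails(sample).items():
    print(xlator, op, n)
```

     To run it on a real file, pass the file contents to count_bails, for example count_bails(open("nfs.log").read()). The "timeout = 30" in the sample appears to correspond to the network.frame-timeout: 30 setting in our volume options.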

On Thu, Dec 15, 2011 at 7:46 PM, Pranith Kumar K <pranithk at gluster.com> wrote:

>  On 12/15/2011 04:32 PM, Changliang Chen wrote:
>
> Hi pranithk,
>
>      Thanks for your reply.
>     To preserve availability, we did not strace the process. After
> shutting down the daemon, the cluster recovered.
>     In our case:
>     10.1.1.64 (dfs-client-6): the online node. When the other node (65)
> restarts, its user CPU usage reaches 100% (glusterfsd process).
>     10.1.1.65 (dfs-client-7): the offline node. When it restarts, the client
> NFS mount point becomes unavailable.
>
> The nfs.log suggests that the issue is caused by high CPU usage on
> client-6. There are many errors like:
>
>  [2011-12-14 13:25:53.30308] E [rpc-clnt.c:197:call_bail]
> 0-19loudfs-client-6: bailing out frame type(GlusterFS 3.1) op(XATTROP(33))
> xid = 0x89279937x sent = 2011-12-14 13:25:20.346007. timeout = 30
>
> On Wed, Dec 14, 2011 at 6:49 PM, Pranith Kumar K <pranithk at gluster.com> wrote:
>
>>   On 12/14/2011 03:06 PM, Changliang Chen wrote:
>>
>>  Hi, we have used glusterfs for two years. After upgrading to 3.2.5, we
>> discovered that when one of the replicate nodes reboots and starts the
>> glusterd daemon, gluster crashes because CPU usage on the other replicate
>> node reaches 100%.
>>
>> Our gluster info:
>>
>> Type: Distributed-Replicate
>> Status: Started
>> Number of Bricks: 5 x 2 = 10
>> Transport-type: tcp
>> Options Reconfigured:
>> performance.cache-size: 3GB
>> performance.cache-max-file-size: 512KB
>> network.frame-timeout: 30
>> network.ping-timeout: 25
>> cluster.min-free-disk: 10%
>>
>>  Our device:
>>
>> Dell R710
>> 6 x 600 GB SAS disks
>> 3 x 8 GB memory
>>
>> The error info:
>>
>> [2011-12-14 13:24:10.483812] E [rdma.c:4813:init] 0-rdma.management:
>> Failed to initialize IB Device
>> [2011-12-14 13:24:10.483828] E [rpc-transport.c:742:rpc_transport_load]
>> 0-rpc-transport: 'rdma' initialization failed
>> [2011-12-14 13:24:10.483841] W [rpcsvc.c:1288:rpcsvc_transport_create]
>> 0-rpc-service: cannot create listener, initing the transport failed
>> [2011-12-14 13:24:11.967621] E
>> [glusterd-store.c:1820:glusterd_store_retrieve_volume] 0-: Unknown key:
>> brick-0
>> [2011-12-14 13:24:11.967665] E
>> [glusterd-store.c:1820:glusterd_store_retrieve_volume] 0-: Unknown key:
>> brick-1
>> [2011-12-14 13:24:11.967681] E
>> [glusterd-store.c:1820:glusterd_store_retrieve_volume] 0-: Unknown key:
>> brick-2
>> [2011-12-14 13:24:11.967695] E
>> [glusterd-store.c:1820:glusterd_store_retrieve_volume] 0-: Unknown key:
>> brick-3
>> [2011-12-14 13:24:11.967709] E
>> [glusterd-store.c:1820:glusterd_store_retrieve_volume] 0-: Unknown key:
>> brick-4
>> [2011-12-14 13:24:11.967723] E
>> [glusterd-store.c:1820:glusterd_store_retrieve_volume] 0-: Unknown key:
>> brick-5
>> [2011-12-14 13:24:11.967736] E
>> [glusterd-store.c:1820:glusterd_store_retrieve_volume] 0-: Unknown key:
>> brick-6
>> [2011-12-14 13:24:11.967750] E
>> [glusterd-store.c:1820:glusterd_store_retrieve_volume] 0-: Unknown key:
>> brick-7
>> [2011-12-14 13:24:11.967764] E
>> [glusterd-store.c:1820:glusterd_store_retrieve_volume] 0-: Unknown key:
>> brick-8
>> [2011-12-14 13:24:11.967777] E
>> [glusterd-store.c:1820:glusterd_store_retrieve_volume] 0-: Unknown key:
>> brick-9
>> [2011-12-14 13:24:12.465565] W
>> [socket.c:1494:__socket_proto_state_machine] 0-socket.management: reading
>> from socket failed. Error (Transport endpoint is not connected), peer (
>> 10.1.1.17:1013)
>> [2011-12-14 13:24:12.465623] W
>> [socket.c:1494:__socket_proto_state_machine] 0-socket.management: reading
>> from socket failed. Error (Transport endpoint is not connected), peer (
>> 10.1.1.8:1013)
>> [2011-12-14 13:24:12.465656] W
>> [socket.c:1494:__socket_proto_state_machine] 0-socket.management: reading
>> from socket failed. Error (Transport endpoint is not connected), peer (
>> 10.1.1.10:1013)
>> [2011-12-14 13:24:12.465686] W
>> [socket.c:1494:__socket_proto_state_machine] 0-socket.management: reading
>> from socket failed. Error (Transport endpoint is not connected), peer (
>> 10.1.1.11:1013)
>> [2011-12-14 13:24:12.465716] W
>> [socket.c:1494:__socket_proto_state_machine] 0-socket.management: reading
>> from socket failed. Error (Transport endpoint is not connected), peer (
>> 10.1.1.125:1013)
>> [2011-12-14 13:24:12.633288] W
>> [socket.c:1494:__socket_proto_state_machine] 0-socket.management: reading
>> from socket failed. Error (Transport endpoint is not connected), peer (
>> 10.1.1.65:1006)
>> [2011-12-14 13:24:13.138150] W
>> [socket.c:1494:__socket_proto_state_machine] 0-socket.management: reading
>> from socket failed. Error (Transport endpoint is not connected), peer (
>> 10.1.1.1:1013)
>> [2011-12-14 13:24:13.284665] W
>> [socket.c:1494:__socket_proto_state_machine] 0-socket.management: reading
>> from socket failed. Error (Transport endpoint is not connected), peer (
>> 10.1.1.3:1013)
>> [2011-12-14 13:24:15.790805] W
>> [socket.c:1494:__socket_proto_state_machine] 0-socket.management: reading
>> from socket failed. Error (Transport endpoint is not connected), peer (
>> 10.1.1.8:1013)
>> [2011-12-14 13:24:16.113430] W
>> [socket.c:1494:__socket_proto_state_machine] 0-socket.management: reading
>> from socket failed. Error (Transport endpoint is not connected), peer (
>> 10.1.1.125:1013)
>> [2011-12-14 13:24:16.259040] W
>> [socket.c:1494:__socket_proto_state_machine] 0-socket.management: reading
>> from socket failed. Error (Transport endpoint is not connected), peer (
>> 10.1.1.10:1013)
>> [2011-12-14 13:24:16.392058] W
>> [socket.c:1494:__socket_proto_state_machine] 0-socket.management: reading
>> from socket failed. Error (Transport endpoint is not connected), peer (
>> 10.1.1.17:1013)
>> [2011-12-14 13:24:16.429444] W
>> [socket.c:1494:__socket_proto_state_machine] 0-socket.management: reading
>> from socket failed. Error (Transport endpoint is not connected), peer (
>> 10.1.1.11:1013)
>> [2011-12-14 13:26:05.787680] W [glusterfsd.c:727:cleanup_and_exit]
>> (-->/lib64/libc.so.6(clone+0x6d) [0x37c8ed3c2d] (-->/lib64/libpthread.so.0
>> [0x37c96064a7]
>> (-->/opt/glusterfs/3.2.5/sbin/glusterd(glusterfs_sigwaiter+0x17c)
>> [0x40477c]))) 0-: received signum (15), shutting down
>>
>>  --
>>
>> Regards,
>>
>> Cocl
>>
>>
>>
>>  _______________________________________________
>> Gluster-users mailing list
>> Gluster-users at gluster.org
>> http://gluster.org/cgi-bin/mailman/listinfo/gluster-users
>>
>>  Hi Changliang,
>>         Could you specify which process crashed? Is it glusterd or
>> glusterfs? Could you provide the stack trace that is present in its
>> respective log file? I don't see any stack trace in the logs you have
>> provided.
>>
>> Pranith
>>
>
>
>
>  --
>
> Regards,
>
> Cocl
> OM manager
> 19lou Operation & Maintenance Dept
>
> Could you send the logs of all the machines? We will check and get back to
> you.
>
> Pranith
>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: etc-glusterfs-glusterd.vol.log_64
Type: application/octet-stream
Size: 16831 bytes
Desc: not available
URL: <http://supercolony.gluster.org/pipermail/gluster-users/attachments/20111216/b0956254/attachment.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: etc-glusterfs-glusterd.vol.log_65
Type: application/octet-stream
Size: 30471 bytes
Desc: not available
URL: <http://supercolony.gluster.org/pipermail/gluster-users/attachments/20111216/b0956254/attachment-0001.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: nfs_log.rar
Type: application/rar
Size: 372034 bytes
Desc: not available
URL: <http://supercolony.gluster.org/pipermail/gluster-users/attachments/20111216/b0956254/attachment.bin>

