[Gluster-users] Gluster NFS crashing

Mon May 19 07:39:11 UTC 2014

Backtrace from the core file (16GB):

#0  0x00007f0e3d924925 in raise () from /lib64/libc.so.6
#1  0x00007f0e3d926105 in abort () from /lib64/libc.so.6
#2  0x00007f0e3d962837 in __libc_message () from /lib64/libc.so.6
#3  0x00007f0e3d968166 in malloc_printerr () from /lib64/libc.so.6
#4  0x00007f0e3f0e2e0f in rpcsvc_drc_op_destroy (drc=0x21e2780, reply=0x7f0e302ee470) at rpc-drc.c:47
#5  0x00007f0e3f331bc1 in rb_destroy (tree=0x7f0e302ee02c, destroy=0x7f0e3f0e2e80 <rpcsvc_drc_rb_op_destroy>) at ../../contrib/rbtree/rb.c:876
#6  0x00007f0e3f0e2b5f in rpcsvc_remove_drc_client (drc=0x21e2780, client=0x230e540) at rpc-drc.c:84
#7  rpcsvc_drc_client_unref (drc=0x21e2780, client=0x230e540) at rpc-drc.c:316
#8  0x00007f0e3f0e2c98 in rpcsvc_drc_notify (svc=<value optimized out>, xl=<value optimized out>, event=<value optimized out>, data=0x230f670) at rpc-drc.c:683
#9  0x00007f0e3f0d9d35 in rpcsvc_handle_disconnect (svc=0x21a6990, trans=0x230f670) at rpcsvc.c:682
#10 0x00007f0e3f0db880 in rpcsvc_notify (trans=0x230f670, mydata=<value optimized out>, event=<value optimized out>, data=0x230f670) at rpcsvc.c:720
#11 0x00007f0e3f0dcf98 in rpc_transport_notify (this=<value optimized out>, event=<value optimized out>, data=<value optimized out>) at rpc-transport.c:512
#12 0x00007f0e3a93c9a1 in socket_event_poll_err (fd=<value optimized out>, idx=<value optimized out>, data=0x230f670, poll_in=<value optimized out>, poll_out=0, 
    poll_err=0) at socket.c:1071
#13 socket_event_handler (fd=<value optimized out>, idx=<value optimized out>, data=0x230f670, poll_in=<value optimized out>, poll_out=0, poll_err=0)
    at socket.c:2239
#14 0x00007f0e3f3512f7 in event_dispatch_epoll_handler (event_pool=0x2186ef0) at event-epoll.c:384
#15 event_dispatch_epoll (event_pool=0x2186ef0) at event-epoll.c:445
#16 0x00000000004075e4 in main (argc=11, argv=0x7fffabef9e38) at glusterfsd.c:1983
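
For anyone who has only the log and not the core file: the crash handler output quoted further down leaves several frames unresolved, for example /usr/lib64/libgfrpc.so.0(+0x10e0f), which gdb shows above as rpcsvc_drc_op_destroy at rpc-drc.c:47. A rough sketch of how those frames could be pulled out of the log and turned into addr2line invocations follows. The function names parse_frames and addr2line_commands are made up for illustration, and the sketch assumes the matching debuginfo packages are installed so that addr2line can resolve library offsets.

```python
import re

# Matches frames as printed in the glusterfs crash log, e.g.
#   /usr/lib64/libgfrpc.so.0(+0x10e0f)[0x7f0e3f0e2e0f]
#   /usr/lib64/libglusterfs.so.0(rb_destroy+0x51)[0x7f0e3f331bc1]
#   /usr/sbin/glusterfs[0x404679]
FRAME_RE = re.compile(
    r"^(?:>\s*)*"                                  # optional mail quote markers
    r"(?P<module>/[^\s(\[]+)"                      # path of the binary or library
    r"(?:\((?P<symbol>[^+()]*)\+(?P<offset>0x[0-9a-fA-F]+)\))?"  # (sym+0xoff)
    r"\[(?P<address>0x[0-9a-fA-F]+)\]\s*$"         # [absolute address]
)


def parse_frames(text):
    """Return one dict per backtrace frame found in the log text."""
    frames = []
    for line in text.splitlines():
        match = FRAME_RE.match(line.strip())
        if match:
            frames.append({
                "module": match.group("module"),
                # An empty symbol (as in "(+0x10e0f)") means unresolved.
                "symbol": match.group("symbol") or None,
                "offset": match.group("offset"),
                "address": match.group("address"),
            })
    return frames


def addr2line_commands(frames):
    """Build addr2line argument lists for frames that have no symbol name."""
    commands = []
    for frame in frames:
        if frame["symbol"] is None and frame["offset"] is not None:
            commands.append(
                ["addr2line", "-f", "-e", frame["module"], frame["offset"]]
            )
    return commands


if __name__ == "__main__":
    sample = "> /usr/lib64/libgfrpc.so.0(+0x10e0f)[0x7f0e3f0e2e0f]"
    for command in addr2line_commands(parse_frames(sample)):
        print(" ".join(command))
```

Frames that already carry a symbol name (rb_destroy, rpcsvc_drc_notify and so on) are skipped, since the log has resolved them.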


On Mon, 2014-05-19 at 14:39 +0800, Franco Broi wrote: 
> Just had an NFS crash on my test system running 3.5.
> 
> A load of messages like this:
> 
> [2014-05-19 06:24:59.347147] E [rpc-drc.c:499:rpcsvc_add_op_to_cache] 0-rpc-service: DRC failed to detect duplicates
> [2014-05-19 06:24:59.347240] E [rpc-drc.c:499:rpcsvc_add_op_to_cache] 0-rpc-service: DRC failed to detect duplicates
> [2014-05-19 06:24:59.347340] E [rpc-drc.c:499:rpcsvc_add_op_to_cache] 0-rpc-service: DRC failed to detect duplicates
> [2014-05-19 06:24:59.347408] E [rpc-drc.c:499:rpcsvc_add_op_to_cache] 0-rpc-service: DRC failed to detect duplicates
> 
> followed by:
> 
> ....
> frame : type(0) op(0)
> frame : type(0) op(0)
> frame : type(0) op(0)
> 
> patchset: git://git.gluster.com/glusterfs.git
> signal received: 6
> time of crash: 2014-05-19 06:25:13
> configuration details:
> argp 1
> backtrace 1
> dlfcn 1
> fdatasync 1
> libpthread 1
> llistxattr 1
> setfsid 1
> spinlock 1
> epoll.h 1
> xattr.h 1
> st_atim.tv_nsec 1
> package-string: glusterfs 3.5.0
> /lib64/libc.so.6(+0x329a0)[0x7f0e3d9249a0]
> /lib64/libc.so.6(gsignal+0x35)[0x7f0e3d924925]
> /lib64/libc.so.6(abort+0x175)[0x7f0e3d926105]
> /lib64/libc.so.6(+0x70837)[0x7f0e3d962837]
> /lib64/libc.so.6(+0x76166)[0x7f0e3d968166]
> /usr/lib64/libgfrpc.so.0(+0x10e0f)[0x7f0e3f0e2e0f]
> /usr/lib64/libglusterfs.so.0(rb_destroy+0x51)[0x7f0e3f331bc1]
> /usr/lib64/libgfrpc.so.0(+0x10b5f)[0x7f0e3f0e2b5f]
> /usr/lib64/libgfrpc.so.0(rpcsvc_drc_notify+0xe8)[0x7f0e3f0e2c98]
> /usr/lib64/libgfrpc.so.0(rpcsvc_handle_disconnect+0x105)[0x7f0e3f0d9d35]
> /usr/lib64/libgfrpc.so.0(rpcsvc_notify+0x1a0)[0x7f0e3f0db880]
> /usr/lib64/libgfrpc.so.0(rpc_transport_notify+0x28)[0x7f0e3f0dcf98]
> /usr/lib64/glusterfs/3.5.0/rpc-transport/socket.so(+0xa9a1)[0x7f0e3a93c9a1]
> /usr/lib64/libglusterfs.so.0(+0x672f7)[0x7f0e3f3512f7]
> /usr/sbin/glusterfs(main+0x564)[0x4075e4]
> /lib64/libc.so.6(__libc_start_main+0xfd)[0x7f0e3d910d1d]
> /usr/sbin/glusterfs[0x404679]
> 
> Volume Name: data2
> Type: Distribute
> Volume ID: d958423f-bd25-49f1-81f8-f12e4edc6823
> Status: Started
> Number of Bricks: 8
> Transport-type: tcp
> Bricks:
> Brick1: nas5-10g:/data17/gvol
> Brick2: nas5-10g:/data18/gvol
> Brick3: nas5-10g:/data19/gvol
> Brick4: nas5-10g:/data20/gvol
> Brick5: nas6-10g:/data21/gvol
> Brick6: nas6-10g:/data22/gvol
> Brick7: nas6-10g:/data23/gvol
> Brick8: nas6-10g:/data24/gvol
> Options Reconfigured:
> cluster.min-free-disk: 5%
> network.frame-timeout: 10800
> cluster.readdir-optimize: on
> nfs.disable: off
> nfs.export-volumes: on
> performance.readdir-ahead: off
> 
> 
> 
> On Thu, 2014-05-01 at 09:55 +0800, Franco Broi wrote: 
> > Installed 3.4.3 exactly 2 weeks ago on all our brick servers and I'm
> > happy to report that we've not had a crash since.
> > 
> > Thanks for all the good work.
> > 
> > On Tue, 2014-04-15 at 14:22 +0800, Franco Broi wrote: 
> > > The whole system came to a grinding halt today and no amount of
> > > restarting daemons would make it work again. What was really odd was
> > > that gluster vol status said everything was fine and yet all the client
> > > mount points had hung.
> > > 
> > > On the node that was exporting Gluster NFS I had zombie processes, so I
> > > decided to reboot. It took a while for the ZFS JBODs to sort themselves
> > > out, but I was relieved when it all came back up - except that the df
> > > size on the clients was wrong...
> > > 
> > > gluster vol info and gluster vol status said everything was fine but it
> > > was obvious that 2 of my bricks were missing. I restarted everything,
> > > and 2 bricks were still missing. I remounted the FUSE clients and still
> > > no good.
> > > 
> > > Just out of sheer desperation and for no good reason I disabled the
> > > Gluster NFS export and magically the missing 2 bricks reappeared and the
> > > filesystem was back to its normal size. I turned NFS exports back on and
> > > everything stayed working.
> > > 
> > > I'm not trying to belittle all the good work done by the Gluster
> > > developers but this really doesn't look like a viable big data
> > > filesystem at the moment. We've currently got 800TB and are about to add
> > > another 400TB but quite honestly the prospect terrifies me.
> > > 
> > > 
> > > On Tue, 2014-04-15 at 08:35 +0800, Franco Broi wrote: 
> > > > On Mon, 2014-04-14 at 17:29 -0700, Harshavardhana wrote: 
> > > > > >
> > > > > > Just distributed.
> > > > > >
> > > > > 
> > > > > With a pure distributed setup you have to take downtime, since the
> > > > > data isn't replicated.
> > > > 
> > > > If I shut down the server processes, won't the clients just wait for it
> > > > to come back up, i.e. like NFS hard mounts? I don't mind an interruption;
> > > > I just want to avoid killing all jobs that are currently accessing the
> > > > filesystem if at all possible. Our users have suffered a lot recently
> > > > with filesystem outages.
> > > > 
> > > > By the way, how does one shut down the glusterfs processes without
> > > > stopping a volume? It would be nice to have a quiesce or freeze option
> > > > that just stalls all access while maintenance takes place.
> > > > 
> > > > > 
> > > > > >>
> > > > > >> > 3.4.1 to 3.4.3-3 shouldn't cause problems with existing clients and
> > > > > >> > other servers, right?
> > > > > >> >
> > > > > >>
> > > > > >> You mean 3.4.1 and 3.4.3 coexisting within a cluster?
> > > > > >
> > > > > > Yes, at least for the duration of the upgrade.
> > > > > 
> > > > > Yeah, releases in the 3.4.x series are backward compatible with each
> > > > > other in any case.
> > > > > 
> > > > 
> > > > 
> > > > _______________________________________________
> > > > Gluster-users mailing list
> > > > Gluster-users at gluster.org
> > > > http://supercolony.gluster.org/mailman/listinfo/gluster-users
> > > 
> > > 