[Gluster-users] ganesha.nfsd process dies when copying files

Karli Sjöberg karli at inparadise.se
Wed Aug 15 11:14:08 UTC 2018


On Wed, 2018-08-15 at 13:42 +0800, Pui Edylie wrote:
> Hi Karli,
> 
> I think Alex is right with regard to the NFS version and state.
> 
> I am only using NFSv3 and failover is working as expected.

OK, so I've redone the test and it goes like this:

1) Start copy loop[*]
2) Power off hv02
3) Copy loop stalls indefinitely

I have attached a snippet of the ctdb log that looks interesting but
doesn't say much to me, except that something's wrong :)

[*]:
while true; do
    mount -o vers=3 hv03v.localdomain:/data /mnt/
    dd if=/var/tmp/test.bin of=/mnt/test.bin bs=1M status=progress
    rm -fv /mnt/test.bin
    umount /mnt
done
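
In case it helps, this is roughly what one could check on the surviving
nodes while the loop is stalled (just a sketch, output not captured here;
the volume name "data" is assumed from the export path):

    ctdb status                      # node states as CTDB sees them
    ctdb ip                          # which node currently holds each public IP
    gluster volume status data       # brick and self-heal daemon status
    showmount -e hv03v.localdomain   # does mountd still answer on the VIP?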

Thanks in advance!

/K

> 
> In my use case, I have 3 nodes running ESXi 6.7 and have set up one
> Gluster VM on each ESXi host using its local datastore.
> 
> Once the replica 3 volume was formed, I use the CTDB VIP to present
> NFSv3 back to vCenter and use it as shared storage.
> 
> Everything works great except that performance is not very good ... I
> am still looking for ways to improve it.
> 
> Cheers,
> Edy
> 
> On 8/15/2018 12:25 AM, Alex Chekholko wrote:
> > Hi Karli,
> > 
> > I'm not 100% sure this is related, but when I set up my ZFS NFS HA
> > per https://github.com/ewwhite/zfs-ha/wiki I was not able to get
> > the failover to work with NFS v4 but only with NFS v3.
> > 
> > From the client's point of view, it really looked like with NFSv4
> > there is an open file handle that just goes stale and hangs, or
> > something like that, whereas with NFSv3 the client retries, recovers
> > and continues. I did not investigate further; I just use v3. I think
> > it has something to do with NFSv4 being "stateful" and NFSv3 being
> > "stateless".
> > 
> > Can you re-run your test but using NFSv3 on the client mount?  Or
> > do you need to use v4.x?
> > 
> > Regards,
> > Alex
> > 
> > On Tue, Aug 14, 2018 at 6:11 AM Karli Sjöberg <karli at inparadise.se>
> > wrote:
> > > On Fri, 2018-08-10 at 09:39 -0400, Kaleb S. KEITHLEY wrote:
> > > > On 08/10/2018 09:23 AM, Karli Sjöberg wrote:
> > > > > On Fri, 2018-08-10 at 21:23 +0800, Pui Edylie wrote:
> > > > > > Hi Karli,
> > > > > > 
> > > > > > Storhaug works with glusterfs 4.1.2 and latest nfs-ganesha.
> > > > > > 
> > > > > > I just installed them last weekend ... they are working
> > > very well
> > > > > > :)
> > > > > 
> > > > > Okay, awesome!
> > > > > 
> > > > > Is there any documentation on how to do that?
> > > > > 
> > > > 
> > > > https://github.com/gluster/storhaug/wiki
> > > > 
> > > 
> > > Thanks Kaleb and Edy!
> > > 
> > > I have now redone the cluster using the latest and greatest,
> > > following the above guide, and repeated the same test I was doing
> > > before (the rsync while loop) with success. I let (forgot) it run
> > > for about a day and it was still chugging along nicely when I
> > > aborted it, so success there!
> > > 
> > > On to the next test, the catastrophic failure test where one of the
> > > servers dies; this one I'm having a more difficult time with.
> > > 
> > > 1) I start by mounting the share over NFS 4.1 and then write an
> > > 8 GiB random data file with 'dd'. When I "hard-cut" the power to
> > > the server I'm writing to, the transfer just stops indefinitely
> > > until the server comes back again. Is that supposed to happen?
> > > Like this:
> > > 
> > > # dd if=/dev/urandom of=/var/tmp/test.bin bs=1M count=8192
> > > # mount -o vers=4.1 hv03v.localdomain:/data /mnt/
> > > # dd if=/var/tmp/test.bin of=/mnt/test.bin bs=1M status=progress
> > > 2434793472 bytes (2,4 GB, 2,3 GiB) copied, 42 s, 57,9 MB/s
> > > 
> > > (here I cut the power and let it be for almost two hours before
> > > turning
> > > it on again)
> > > 
> > > dd: error writing '/mnt/test.bin': Remote I/O error
> > > 2325+0 records in
> > > 2324+0 records out
> > > 2436890624 bytes (2,4 GB, 2,3 GiB) copied, 6944,84 s, 351 kB/s
> > > # umount /mnt
> > > 
> > > Here the unmount command hung and I had to hard reset the client.
> > > 
> > > 2) Another question I have is why some files "change" as you copy
> > > them out to the Gluster storage. Is that the way it should be? This
> > > time, I deleted everything in the destination directory to start
> > > over:
> > > 
> > > # mount -o vers=4.1 hv03v.localdomain:/data /mnt/
> > > # rm -f /mnt/test.bin
> > > # dd if=/var/tmp/test.bin of=/mnt/test.bin bs=1M status=progress
> > > 8557428736 bytes (8,6 GB, 8,0 GiB) copied, 122 s, 70,1 MB/s
> > > 8192+0 records in
> > > 8192+0 records out
> > > 8589934592 bytes (8,6 GB, 8,0 GiB) copied, 123,039 s, 69,8 MB/s
> > > # md5sum /var/tmp/test.bin 
> > > 073867b68fa8eaa382ffe05adb90b583  /var/tmp/test.bin
> > > # md5sum /mnt/test.bin 
> > > 634187d367f856f3f5fb31846f796397  /mnt/test.bin
> > > # umount /mnt
> > > 
> > > Thanks in advance!
> > > 
> > > /K
> > > _______________________________________________
> > > Gluster-users mailing list
> > > Gluster-users at gluster.org
> > > https://lists.gluster.org/mailman/listinfo/gluster-users
>  
-------------- next part --------------
2018/08/15 13:04:08.223469 ctdb-eventd[3467]: monitor event timed out                 
2018/08/15 13:04:08.367973 ctdb-eventd[3467]: event_debug: ===== Start of hung script debug for PID="27179", event="monitor" =====
2018/08/15 13:04:08.368003 ctdb-eventd[3467]: event_debug: pstree -p -a 27179:      
2018/08/15 13:04:08.368013 ctdb-eventd[3467]: event_debug: 60.nfs,27179 /etc/ctdb/events.d/60.nfs monitor
2018/08/15 13:04:08.368021 ctdb-eventd[3467]: event_debug:   `-60.nfs,27235 /etc/ctdb/events.d/60.nfs monitor
2018/08/15 13:04:08.368029 ctdb-eventd[3467]: event_debug:       `-60.nfs,27236 /etc/ctdb/events.d/60.nfs monitor
2018/08/15 13:04:08.368037 ctdb-eventd[3467]: event_debug:           `-rpcinfo,27237 -T tcp 127.0.0.1 mountd 1
2018/08/15 13:04:08.368045 ctdb-eventd[3467]: event_debug: ---- Stack trace of interesting process 27237[rpcinfo] ----
2018/08/15 13:04:08.368052 ctdb-eventd[3467]: event_debug: [<ffffffff9cc310d5>] poll_schedule_timeout+0x55/0xb0
2018/08/15 13:04:08.368060 ctdb-eventd[3467]: event_debug: [<ffffffff9cc3263d>] do_sys_poll+0x4cd/0x580
2018/08/15 13:04:08.368067 ctdb-eventd[3467]: event_debug: [<ffffffff9cc327f4>] SyS_poll+0x74/0x110
2018/08/15 13:04:08.368075 ctdb-eventd[3467]: event_debug: [<ffffffff9d120795>] system_call_fastpath+0x1c/0x21
2018/08/15 13:04:08.368083 ctdb-eventd[3467]: event_debug: [<ffffffffffffffff>] 0xffffffffffffffff
2018/08/15 13:04:08.368090 ctdb-eventd[3467]: event_debug: ---- ctdb scriptstatus monitor: ----
2018/08/15 13:04:08.368098 ctdb-eventd[3467]: event_debug: 00.ctdb              OK         0.021 Wed Aug 15 13:03:38 2018
2018/08/15 13:04:08.368105 ctdb-eventd[3467]: event_debug: 01.reclock           OK         0.044 Wed Aug 15 13:03:38 2018
2018/08/15 13:04:08.368113 ctdb-eventd[3467]: event_debug: 05.system            OK         0.060 Wed Aug 15 13:03:38 2018
2018/08/15 13:04:08.368120 ctdb-eventd[3467]: event_debug: 06.nfs               OK         0.018 Wed Aug 15 13:03:38 2018
2018/08/15 13:04:08.368128 ctdb-eventd[3467]: event_debug: 10.external          DISABLED
2018/08/15 13:04:08.368135 ctdb-eventd[3467]: event_debug: 10.interface         OK         0.039 Wed Aug 15 13:03:38 2018
2018/08/15 13:04:08.368151 ctdb-eventd[3467]: event_debug: 11.natgw             OK         0.005 Wed Aug 15 13:03:38 2018
2018/08/15 13:04:08.368158 ctdb-eventd[3467]: event_debug: 11.routing           OK         0.005 Wed Aug 15 13:03:38 2018
2018/08/15 13:04:08.368165 ctdb-eventd[3467]: event_debug: 13.per_ip_routing    OK         0.005 Wed Aug 15 13:03:38 2018
2018/08/15 13:04:08.368172 ctdb-eventd[3467]: event_debug: 20.multipathd        OK         0.005 Wed Aug 15 13:03:38 2018
2018/08/15 13:04:08.368179 ctdb-eventd[3467]: event_debug: 31.clamd             OK         0.005 Wed Aug 15 13:03:38 2018
2018/08/15 13:04:08.368186 ctdb-eventd[3467]: event_debug: 40.vsftpd            OK         0.005 Wed Aug 15 13:03:38 2018
2018/08/15 13:04:08.368193 ctdb-eventd[3467]: event_debug: 41.httpd             OK         0.006 Wed Aug 15 13:03:38 2018
2018/08/15 13:04:08.368200 ctdb-eventd[3467]: event_debug: 49.winbind           OK         0.005 Wed Aug 15 13:03:38 2018
2018/08/15 13:04:08.368207 ctdb-eventd[3467]: event_debug: 50.samba             OK         0.008 Wed Aug 15 13:03:38 2018
2018/08/15 13:04:08.368214 ctdb-eventd[3467]: event_debug: 60.nfs               TIMEDOUT   Wed Aug 15 13:03:38 2018
2018/08/15 13:04:08.368221 ctdb-eventd[3467]: event_debug:   OUTPUT:              
2018/08/15 13:04:08.368228 ctdb-eventd[3467]: event_debug: ===== End of hung script debug for PID="27179", event="monitor" =====
2018/08/15 13:04:33.503924 ctdb-eventd[3467]: 60.nfs: ERROR: nfs failed RPC check:             
2018/08/15 13:04:33.504020 ctdb-eventd[3467]: 60.nfs: rpcinfo: RPC: Timed out
2018/08/15 13:04:33.504049 ctdb-eventd[3467]: 60.nfs: program 100003 version 3 is not available
2018/08/15 13:04:33.504083 ctdb-eventd[3467]: monitor event failed
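
For anyone who wants to reproduce the failing check by hand: judging from
the pstree and the error above, the 60.nfs monitor is probing the local RPC
services with rpcinfo, so something along these lines (an approximation,
not the exact script invocations) should show the same timeout directly on
the node:

    rpcinfo -T tcp 127.0.0.1 mountd 1   # the probe visible in the pstree above
    rpcinfo -T tcp 127.0.0.1 nfs 3      # program 100003 version 3 from the error
    rpcinfo -p 127.0.0.1                # list all registered RPC programs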

