<div dir='auto'><div><br><div class="gmail_extra"><br><div class="gmail_quote">On 15 Aug 2018 13:14, Karli Sjöberg &lt;karli@inparadise.se&gt; wrote:<br type="attribution"><blockquote class="quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div>On Wed, 2018-08-15 at 13:42 +0800, Pui Edylie wrote:<br>&gt; Hi Karli,<br>&gt; <br>&gt; I think Alex is right with regard to the NFS version and state.<br>&gt; <br>&gt; I am only using NFSv3 and the failover is working as expected.<br><br>OK, so I've rerun the test and it goes like this:<br><br>1) Start copy loop[*]<br>2) Power off hv02<br>3) Copy loop stalls indefinitely<br><br>I have attached a snippet of the ctdb log that looks interesting but<br>doesn't say much to me except that something's wrong :)<br><br>[*]: while true; do mount -o vers=3 hv03v.localdomain:/data /mnt/; dd<br>if=/var/tmp/test.bin of=/mnt/test.bin bs=1M status=progress; rm -fv<br>/mnt/test.bin; umount /mnt; done<br><br>Thanks in advance!<br><br>/K<br></div></blockquote></div></div></div><div dir="auto"><br></div><div dir="auto"><div dir="auto" style="font-family: sans-serif;">Could someone confirm whether this is the expected result for this scenario?</div><div dir="auto" style="font-family: sans-serif;"><br></div><div dir="auto" style="font-family: sans-serif;">Aren't you supposed to be able to reboot a host in the cluster without compromising it?</div><div dir="auto" style="font-family: sans-serif;"><br></div><div dir="auto" style="font-family: sans-serif;">/K</div><div dir="auto" style="font-family: sans-serif;"><br></div></div><div dir="auto"><div class="gmail_extra"><div class="gmail_quote"><blockquote class="quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div><br>&gt; <br>&gt; In my use case, I have 3 nodes with ESXi 6.7 as the OS and have set up one <br>&gt; Gluster VM on each ESXi host using its local datastore.<br>&gt; <br>&gt; Once I have formed the replica 3 volume, I use the 
CTDB VIP to present the<br>&gt; NFSv3 export back to vCenter and use it as shared storage.<br>&gt; <br>&gt; Everything works great, except that performance is not very good ... I<br>&gt; am still looking for ways to improve it.<br>&gt; <br>&gt; Cheers,<br>&gt; Edy<br>&gt; <br>&gt; On 8/15/2018 12:25 AM, Alex Chekholko wrote:<br>&gt; &gt; Hi Karli,<br>&gt; &gt; <br>&gt; &gt; I'm not 100% sure this is related, but when I set up my ZFS NFS HA<br>&gt; &gt; per https://github.com/ewwhite/zfs-ha/wiki I was not able to get<br>&gt; &gt; the failover to work with NFS v4 but only with NFS v3.<br>&gt; &gt; <br>&gt; &gt; From the client point of view, it really looked like with NFS v4<br>&gt; &gt; there is an open file handle and that just goes stale and hangs, or<br>&gt; &gt; something like that, whereas with NFSv3 the client retries and<br>&gt; &gt; recovers and continues.  I did not investigate further; I just use<br>&gt; &gt; v3.  I think it has something to do with NFSv4 being "stateful" and<br>&gt; &gt; NFSv3 being "stateless".<br>&gt; &gt; <br>&gt; &gt; Can you re-run your test but using NFSv3 on the client mount?  Or<br>&gt; &gt; do you need to use v4.x?<br>&gt; &gt; <br>&gt; &gt; Regards,<br>&gt; &gt; Alex<br>&gt; &gt; <br>&gt; &gt; On Tue, Aug 14, 2018 at 6:11 AM Karli Sjöberg <br>&gt; &gt; wrote:<br>&gt; &gt; &gt; On Fri, 2018-08-10 at 09:39 -0400, Kaleb S. KEITHLEY wrote:<br>&gt; &gt; &gt; &gt; On 08/10/2018 09:23 AM, Karli Sjöberg wrote:<br>&gt; &gt; &gt; &gt; &gt; On Fri, 2018-08-10 at 21:23 +0800, Pui Edylie wrote:<br>&gt; &gt; &gt; &gt; &gt; &gt; Hi Karli,<br>&gt; &gt; &gt; &gt; &gt; &gt; <br>&gt; &gt; &gt; &gt; &gt; &gt; Storhaug works with glusterfs 4.1.2 and latest nfs-ganesha.<br>&gt; &gt; &gt; &gt; &gt; &gt; <br>&gt; &gt; &gt; &gt; &gt; &gt; I just installed them last weekend ... 
they are working<br>&gt; &gt; &gt; very well<br>&gt; &gt; &gt; &gt; &gt; &gt; :)<br>&gt; &gt; &gt; &gt; &gt; <br>&gt; &gt; &gt; &gt; &gt; Okay, awesome!<br>&gt; &gt; &gt; &gt; &gt; <br>&gt; &gt; &gt; &gt; &gt; Is there any documentation on how to do that?<br>&gt; &gt; &gt; &gt; &gt; <br>&gt; &gt; &gt; &gt; <br>&gt; &gt; &gt; &gt; https://github.com/gluster/storhaug/wiki<br>&gt; &gt; &gt; &gt; <br>&gt; &gt; &gt; <br>&gt; &gt; &gt; Thanks Kaleb and Edy!<br>&gt; &gt; &gt; <br>&gt; &gt; &gt; I have now redone the cluster using the latest and greatest<br>&gt; &gt; &gt; following<br>&gt; &gt; &gt; the above guide and repeated the same test I was doing before<br>&gt; &gt; &gt; (the<br>&gt; &gt; &gt; rsync while loop) with success. I let (forgot) it run for about a<br>&gt; &gt; &gt; day<br>&gt; &gt; &gt; and it was still chugging along nicely when I aborted it, so<br>&gt; &gt; &gt; success<br>&gt; &gt; &gt; there!<br>&gt; &gt; &gt; <br>&gt; &gt; &gt; On to the next test: the catastrophic failure test, where one of<br>&gt; &gt; &gt; the<br>&gt; &gt; &gt; servers dies, is giving me a more difficult time.<br>&gt; &gt; &gt; <br>&gt; &gt; &gt; 1) I start with mounting the share over NFS 4.1 and then proceed<br>&gt; &gt; &gt; with<br>&gt; &gt; &gt; writing an 8 GiB random data file with 'dd'. When I "hard-cut"<br>&gt; &gt; &gt; the power to the server I'm writing to, the transfer just stops<br>&gt; &gt; &gt; indefinitely, until the server comes back again. Is that supposed<br>&gt; &gt; &gt; to<br>&gt; &gt; &gt; happen? 
Like this:<br>&gt; &gt; &gt; <br>&gt; &gt; &gt; # dd if=/dev/urandom of=/var/tmp/test.bin bs=1M count=8192<br>&gt; &gt; &gt; # mount -o vers=4.1 hv03v.localdomain:/data /mnt/<br>&gt; &gt; &gt; # dd if=/var/tmp/test.bin of=/mnt/test.bin bs=1M status=progress<br>&gt; &gt; &gt; 2434793472 bytes (2,4 GB, 2,3 GiB) copied, 42 s, 57,9 MB/s<br>&gt; &gt; &gt; <br>&gt; &gt; &gt; (here I cut the power and let it be for almost two hours before<br>&gt; &gt; &gt; turning<br>&gt; &gt; &gt; it on again)<br>&gt; &gt; &gt; <br>&gt; &gt; &gt; dd: error writing '/mnt/test.bin': Remote I/O error<br>&gt; &gt; &gt; 2325+0 records in<br>&gt; &gt; &gt; 2324+0 records out<br>&gt; &gt; &gt; 2436890624 bytes (2,4 GB, 2,3 GiB) copied, 6944,84 s, 351 kB/s<br>&gt; &gt; &gt; # umount /mnt<br>&gt; &gt; &gt; <br>&gt; &gt; &gt; Here the unmount command hung and I had to hard reset the client.<br>&gt; &gt; &gt; <br>&gt; &gt; &gt; 2) Another question I have is why some files "change" as you copy<br>&gt; &gt; &gt; them<br>&gt; &gt; &gt; out to the Gluster storage? Is that the way it should be? 
This<br>&gt; &gt; &gt; time, I<br>&gt; &gt; &gt; deleted everything in the destination directory to start over:<br>&gt; &gt; &gt; <br>&gt; &gt; &gt; # mount -o vers=4.1 hv03v.localdomain:/data /mnt/<br>&gt; &gt; &gt; # rm -f /mnt/test.bin<br>&gt; &gt; &gt; # dd if=/var/tmp/test.bin of=/mnt/test.bin bs=1M status=progress<br>&gt; &gt; &gt; 8557428736 bytes (8,6 GB, 8,0 GiB) copied, 122 s, 70,1 MB/s<br>&gt; &gt; &gt; 8192+0 records in<br>&gt; &gt; &gt; 8192+0 records out<br>&gt; &gt; &gt; 8589934592 bytes (8,6 GB, 8,0 GiB) copied, 123,039 s, 69,8 MB/s<br>&gt; &gt; &gt; # md5sum /var/tmp/test.bin <br>&gt; &gt; &gt; 073867b68fa8eaa382ffe05adb90b583  /var/tmp/test.bin<br>&gt; &gt; &gt; # md5sum /mnt/test.bin <br>&gt; &gt; &gt; 634187d367f856f3f5fb31846f796397  /mnt/test.bin<br>&gt; &gt; &gt; # umount /mnt<br>&gt; &gt; &gt; <br>&gt; &gt; &gt; Thanks in advance!<br>&gt; &gt; &gt; <br>&gt; &gt; &gt; /K<br>&gt; &gt; &gt; _______________________________________________<br>&gt; &gt; &gt; Gluster-users mailing list<br>&gt; &gt; &gt; Gluster-users@gluster.org<br>&gt; &gt; &gt; https://lists.gluster.org/mailman/listinfo/gluster-users</div></blockquote></div><br></div></div></div>