<div dir="ltr"><span style="color:rgb(0,0,0)">Thanks for the feedback. 7.9 is really stable; in fact, it is so stable that we might not even upgrade to 8.x for some time.</span><br></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Mon, Feb 1, 2021 at 11:56 PM Erik Jacobson <<a href="mailto:erik.jacobson@hpe.com">erik.jacobson@hpe.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">We think this fixed it. While there is random chance in there, we can't<br>
repeat it in 7.9. So I'll close this thread out for now.<br>
<br>
We'll ask for help again if needed. Thanks for all the kind responses,<br>
<br>
Erik<br>
<br>
On Fri, Jan 29, 2021 at 02:20:56PM -0600, Erik Jacobson wrote:<br>
> I updated to 7.9, rebooted everything, and it started working.<br>
> <br>
> I will have QE try to break it again and report back. I couldn't break<br>
> it but they're better at breaking things (which is hard to imagine :)<br>
> <br>
> <br>
> On Fri, Jan 29, 2021 at 01:11:50PM -0600, Erik Jacobson wrote:<br>
> > Thank you.<br>
> > <br>
> > We reproduced the problem after force-killing one of the 3 physical<br>
> > nodes 6 times in a row.<br>
> > <br>
> > At that point, grub2 loaded off the qemu virtual hard drive, but<br>
> > could not find partitions. Since there is random luck involved, we don't<br>
> > actually know if it was the force-killing that caused it to stop<br>
> > working.<br>
> > <br>
> > When I start the VM with the image in this state, there is nothing<br>
> > interesting in the fuse log for the volume in /var/log/glusterfs on the<br>
> > node hosting the image.<br>
> > <br>
> > No pending heals (all servers report 0 entries to heal).<br>
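> > (For reference, a sketch of the standard heal-state check, assuming the volume name adminvm used elsewhere in this thread:)<br>

```shell
# Check pending self-heal entries on each brick; a healthy replica-3
# volume reports "Number of entries: 0" for every brick.
gluster volume heal adminvm info

# Recent gluster releases also support a condensed summary view:
gluster volume heal adminvm info summary
```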
> > <br>
> > The same VM behavior happens on all the physical nodes when I try to<br>
> > start with the same VM image.<br>
> > <br>
> > Something from the gluster fuse mount log from earlier shows:<br>
> > <br>
> > [2021-01-28 21:24:40.814227] I [MSGID: 114018] [client.c:2347:client_rpc_notify] 0-adminvm-client-0: disconnected from adminvm-client-0. Client process will keep trying to connect to glusterd until brick's port is available<br>
> > [2021-01-28 21:24:43.815120] I [rpc-clnt.c:1963:rpc_clnt_reconfig] 0-adminvm-client-0: changing port to 49152 (from 0)<br>
> > [2021-01-28 21:24:43.815833] I [MSGID: 114057] [client-handshake.c:1376:select_server_supported_programs] 0-adminvm-client-0: Using Program GlusterFS 4.x v1, Num (1298437), Version (400)<br>
> > [2021-01-28 21:24:43.817682] I [MSGID: 114046] [client-handshake.c:1106:client_setvolume_cbk] 0-adminvm-client-0: Connected to adminvm-client-0, attached to remote volume '/data/brick_adminvm'.<br>
> > [2021-01-28 21:24:43.817709] I [MSGID: 114042] [client-handshake.c:930:client_post_handshake] 0-adminvm-client-0: 1 fds open - Delaying child_up until they are re-opened<br>
> > [2021-01-28 21:24:43.895163] I [MSGID: 114041] [client-handshake.c:318:client_child_up_reopen_done] 0-adminvm-client-0: last fd open'd/lock-self-heal'd - notifying CHILD-UP<br>
> > The message "W [MSGID: 114061] [client-common.c:2893:client_pre_lk_v2] 0-adminvm-client-0: (94695bdb-06b4-4105-9bc8-b8207270c941) remote_fd is -1. EBADFD [File descriptor in bad state]" repeated 6 times between [2021-01-28 21:23:54.395811] and [2021-01-28 21:23:54.811640]<br>
> > <br>
> > <br>
> > But that was a long time ago.<br>
> > <br>
> > Brick logs have an entry from when I first started the VM today (the<br>
> > problem was reproduced yesterday); all brick logs have something similar.<br>
> > Nothing appeared on the several other startup attempts of the VM:<br>
> > <br>
> > [2021-01-28 21:24:45.460147] I [MSGID: 115029] [server-handshake.c:549:server_setvolume] 0-adminvm-server: accepted client from CTX_ID:613f0d91-34e6-4495-859f-bca1c9f7af01-GRAPH_ID:0-PID:6287-HOST:nano-1-PC_NAME:adminvm-client-2-RECON_NO:-0 (version: 7.2) with subvol /data/brick_adminvm<br>
> > [2021-01-29 18:54:45.455558] I [addr.c:54:compare_addr_and_update] 0-/data/brick_adminvm: allowed = "*", received addr = "172.23.255.153"<br>
> > [2021-01-29 18:54:45.455802] I [login.c:110:gf_auth] 0-auth/login: allowed user names: 3b66cfab-00d5-4b13-a103-93b4cf95e144<br>
> > [2021-01-29 18:54:45.455815] I [MSGID: 115029] [server-handshake.c:549:server_setvolume] 0-adminvm-server: accepted client from CTX_ID:3774af6b-07b9-437b-a34e-9f71f3b57d03-GRAPH_ID:0-PID:45640-HOST:nano-3-PC_NAME:adminvm-client-2-RECON_NO:-0 (version: 7.2) with subvol /data/brick_adminvm<br>
> > [2021-01-29 18:54:45.494950] W [socket.c:774:__socket_rwv] 0-tcp.adminvm-server: readv on <a href="http://172.23.255.153:48551" rel="noreferrer" target="_blank">172.23.255.153:48551</a> failed (No data available)<br>
> > [2021-01-29 18:54:45.494994] I [MSGID: 115036] [server.c:501:server_rpc_notify] 0-adminvm-server: disconnecting connection from CTX_ID:3774af6b-07b9-437b-a34e-9f71f3b57d03-GRAPH_ID:0-PID:45640-HOST:nano-3-PC_NAME:adminvm-client-2-RECON_NO:-0<br>
> > [2021-01-29 18:54:45.495091] I [MSGID: 101055] [client_t.c:436:gf_client_unref] 0-adminvm-server: Shutting down connection CTX_ID:3774af6b-07b9-437b-a34e-9f71f3b57d03-GRAPH_ID:0-PID:45640-HOST:nano-3-PC_NAME:adminvm-client-2-RECON_NO:-0<br>
> > <br>
> > <br>
> > <br>
> > Like before, if I halt the VM, kpartx the image, mount the giant root<br>
> > within the image, then unmount, remove the kpartx mappings, and start the VM, it works:<br>
> > <br>
> > nano-2:/var/log/glusterfs # kpartx -a /adminvm/images/adminvm.img<br>
> > nano-2:/var/log/glusterfs # mount /dev/mapper/loop0p31 /mnt<br>
> > nano-2:/var/log/glusterfs # dmesg|tail -3<br>
> > [85528.602570] loop: module loaded<br>
> > [85535.975623] EXT4-fs (dm-3): recovery complete<br>
> > [85535.979663] EXT4-fs (dm-3): mounted filesystem with ordered data mode. Opts: (null)<br>
> > nano-2:/var/log/glusterfs # umount /mnt<br>
> > nano-2:/var/log/glusterfs # kpartx -d /adminvm/images/adminvm.img<br>
> > loop deleted : /dev/loop0<br>
> > <br>
> > VM WORKS for ONE boot cycle on one physical!<br>
> > <br>
> > nano-2:/var/log/glusterfs # virsh start adminvm<br>
> > <br>
> > However, this will work for one boot but later it will stop working again<br>
> > (INCLUDING on the physical node that booted once OK; the next boot fails<br>
> > again, as does launching it on the other two).<br>
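> > (The mount/unmount workaround above can be scripted; a rough, untested sketch using the image path and partition number from this thread:)<br>

```shell
#!/bin/sh
# Force an ext4 journal replay on the VM image from the host, which
# (for one boot cycle) makes the image bootable again.
# Image path and partition number (p31) are from the transcript above.
set -e
IMG=/adminvm/images/adminvm.img
kpartx -a "$IMG"                  # map the image's partitions to /dev/mapper/loop0pN
mount /dev/mapper/loop0p31 /mnt   # ext4 journal recovery runs during mount
umount /mnt
kpartx -d "$IMG"                  # remove the partition mappings again
virsh start adminvm
```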
> > <br>
> > Based on feedback, I will not change the shard size at this time and<br>
> > will leave that for later. Some people suggest larger sizes but it isn't<br>
> > a universal suggestion. I'll also not attempt to make a logical volume<br>
> > out of a group of smaller images as I think it should work like this.<br>
> > Those are things I will try later if I run out of runway. Since we want<br>
> > a solution we can deploy to sites, those changes would increase the maintenance<br>
> > burden of the otherwise simple solution.<br>
> > <br>
> > I am leaving the state like this and will now proceed to update to the<br>
> > latest gluster 7.<br>
> > <br>
> > I will report back after I get everything updated and services restarted<br>
> > with the newer version.<br>
> > <br>
> > THANKS FOR ALL THE HELP SO FAR!!<br>
> > <br>
> > Erik<br>
> > <br>
> > On Wed, Jan 27, 2021 at 10:55:50PM +0300, Mahdi Adnan wrote:<br>
> > > I would leave it at 64M on volumes with spindle disks, but with SSD volumes I<br>
> > > would increase it to 128M or even 256M; it varies from one workload to<br>
> > > another.<br>
> > > On Wed, Jan 27, 2021 at 10:02 PM Erik Jacobson <<a href="mailto:erik.jacobson@hpe.com" target="_blank">erik.jacobson@hpe.com</a>> wrote:<br>
> > > <br>
> > > > Also, I would like to point out that I have VMs with large disks, 1TB and<br>
> > > > 2TB, and have no issues. I would definitely upgrade the Gluster version to,<br>
> > > > let's say, at least 7.9.<br>
> > > <br>
> > > Great! Thank you! We can update but it's very sensitive due to the<br>
> > > workload. I can't officially update our gluster until we have a cluster<br>
> > > with a couple thousand nodes to test with. However, for this problem,<br>
> > > this is on my list for the test machine. I'm hoping I can reproduce it. So far,<br>
> > > no luck making it happen again. Once I hit it, I will try to collect more data<br>
> > > and, at the end, update gluster.<br>
> > > <br>
> > > What do you think about the suggestion to increase the shard size? Are<br>
> > > you using the default size on your 1TB and 2TB images?<br>
> > > <br>
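> > > (For anyone following along: the shard size is the features.shard-block-size volume option; a sketch, noting that changing it on an existing volume only affects files created afterwards, not already-written shards:)<br>

```shell
# Show the current shard size for the volume (the default is 64MB):
gluster volume get adminvm features.shard-block-size

# Raise it to 128MB; this applies only to newly created files,
# existing files keep the shard size they were written with.
gluster volume set adminvm features.shard-block-size 128MB
```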
> > > Amar also asked a question regarding enabling sharding on the volume after<br>
> > > creating the VM disks, which would certainly mess up the volume if that is<br>
> > > what happened.<br>
> > > <br>
> > > Oh, I missed this question. I basically scripted it quickly since I was<br>
> > > doing it so often. I have a similar script that tears it all down to start<br>
> > > over.<br>
> > > <br>
> > > set -x<br>
> > > pdsh -g gluster mkdir /data/brick_adminvm/<br>
> > > gluster volume create adminvm replica 3 transport tcp 172.23.255.151:/data/brick_adminvm 172.23.255.152:/data/brick_adminvm 172.23.255.153:/data/brick_adminvm<br>
> > > gluster volume set adminvm group virt<br>
> > > gluster volume set adminvm granular-entry-heal enable<br>
> > > gluster volume set adminvm storage.owner-uid 439<br>
> > > gluster volume set adminvm storage.owner-gid 443<br>
> > > gluster volume start adminvm<br>
> > > <br>
> > > pdsh -g gluster mount /adminvm<br>
> > > <br>
> > > echo -n "press enter to continue for restore tarball"<br>
> > > <br>
> > > pushd /adminvm<br>
> > > tar xvf /root/backup.tar<br>
> > > popd<br>
> > > <br>
> > > echo -n "press enter to continue for qemu-img"<br>
> > > <br>
> > > pushd /adminvm<br>
> > > qemu-img create -f raw -o preallocation=falloc /adminvm/images/adminvm.img 5T<br>
> > > popd<br>
> > > <br>
> > > <br>
> > > Thanks again for the kind responses,<br>
> > > <br>
> > > Erik<br>
> > > <br>
> > > ><br>
> > > > On Wed, Jan 27, 2021 at 5:28 PM Erik Jacobson <<a href="mailto:erik.jacobson@hpe.com" target="_blank">erik.jacobson@hpe.com</a>><br>
> > > wrote:<br>
> > > ><br>
> > > > > > Shortly after the sharded volume is made, there are some fuse mount<br>
> > > > > > messages. I'm not 100% sure if this was just before or during the<br>
> > > > > > big qemu-img command to make the 5T image<br>
> > > > > > (qemu-img create -f raw -o preallocation=falloc<br>
> > > > > > /adminvm/images/adminvm.img 5T)<br>
> > > > > Any reason to have a single disk with this size ?<br>
> > > ><br>
> > > > > Usually in any virtualization I have used, it is always recommended to<br>
> > > > > keep it lower. Have you thought about multiple disks with smaller size?<br>
> > > ><br>
> > > > Yes, because the actual virtual machine is an admin node/head node cluster<br>
> > > > manager for a supercomputer that hosts big OS images and drives<br>
> > > > multi-thousand-node clusters (boot, monitoring, image creation,<br>
> > > > distribution, sometimes NFS roots, etc.). So this VM is a biggie.<br>
> > > ><br>
> > > > We could make multiple smaller images but it would be very painful since<br>
> > > > it differs from the normal non-VM setup.<br>
> > > ><br>
> > > > So unlike many solutions where you have lots of small VMs with their own<br>
> > > > small images, this solution is one giant VM with one giant image.<br>
> > > > We're essentially using gluster in this use case (as opposed to the others I<br>
> > > > have posted about in the past) for head node failover (combined with<br>
> > > > pacemaker).<br>
> > > ><br>
> > > > > Also worth noting is that RHII is supported only when the shard size is<br>
> > > > > 512MB, so it's worth trying a bigger shard size.<br>
> > > ><br>
> > > > I have put a larger shard size and a newer gluster version on the list to<br>
> > > > try. Thank you! Hoping to get it failing again so I can try these things!<br>
> > > ><br>
> > > ><br>
> > > ><br>
> > > > --<br>
> > > > Respectfully<br>
> > > > Mahdi<br>
> > > <br>
> > > <br>
> > > <br>
> > > --<br>
> > > Respectfully<br>
> > > Mahdi<br>
</blockquote></div><br clear="all"><div><br></div>-- <br><div dir="ltr" class="gmail_signature"><div dir="ltr">Respectfully<div>Mahdi</div></div></div>