[Gluster-users] qemu raw image file - qemu and grub2 can't find boot content from VM
erik.jacobson at hpe.com
Fri Jan 29 20:20:56 UTC 2021
I updated to 7.9, rebooted everything, and it started working.
I will have QE try to break it again and report back. I couldn't break
it but they're better at breaking things (which is hard to imagine :)
On Fri, Jan 29, 2021 at 01:11:50PM -0600, Erik Jacobson wrote:
> Thank you.
> We reproduced the problem after force-killing one of the 3 physical
> nodes 6 times in a row.
> At that point, the grub2 loaded off the qemu virtual hard drive, but
> could not find partitions. Since there is random luck involved, we don't
> actually know if it was the force-killing that caused it to stop working.
> When I start the VM with the image in this state, there is nothing
> interesting in the fuse log for the volume in /var/log/glusterfs on the
> node hosting the image.
> No pending heals (all servers report 0 entries to heal).
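That heal check can be scripted; a minimal sketch, assuming the stock
`gluster volume heal <vol> info` output with a per-brick
"Number of entries: N" line:

```shell
# Sum pending heal entries across all bricks of a volume; prints the
# total and returns non-zero if anything is still pending. Assumes the
# standard "Number of entries: N" lines in heal-info output.
heal_pending() {
    gluster volume heal "$1" info | awk '
        /Number of entries:/ { total += $NF }
        END { print total + 0; exit ((total + 0) > 0) }'
}
```

With all replicas in sync this prints 0, matching the "0 entries to heal"
above.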
> The same VM behavior happens on all the physical nodes when I try to
> start with the same VM image.
> Something from the gluster fuse mount log from earlier shows:
> [2021-01-28 21:24:40.814227] I [MSGID: 114018] [client.c:2347:client_rpc_notify] 0-adminvm-client-0: disconnected from adminvm-client-0. Client process will keep trying to connect to glusterd until brick's port is available
> [2021-01-28 21:24:43.815120] I [rpc-clnt.c:1963:rpc_clnt_reconfig] 0-adminvm-client-0: changing port to 49152 (from 0)
> [2021-01-28 21:24:43.815833] I [MSGID: 114057] [client-handshake.c:1376:select_server_supported_programs] 0-adminvm-client-0: Using Program GlusterFS 4.x v1, Num (1298437), Version (400)
> [2021-01-28 21:24:43.817682] I [MSGID: 114046] [client-handshake.c:1106:client_setvolume_cbk] 0-adminvm-client-0: Connected to adminvm-client-0, attached to remote volume '/data/brick_adminvm'.
> [2021-01-28 21:24:43.817709] I [MSGID: 114042] [client-handshake.c:930:client_post_handshake] 0-adminvm-client-0: 1 fds open - Delaying child_up until they are re-opened
> [2021-01-28 21:24:43.895163] I [MSGID: 114041] [client-handshake.c:318:client_child_up_reopen_done] 0-adminvm-client-0: last fd open'd/lock-self-heal'd - notifying CHILD-UP
> The message "W [MSGID: 114061] [client-common.c:2893:client_pre_lk_v2] 0-adminvm-client-0: (94695bdb-06b4-4105-9bc8-b8207270c941) remote_fd is -1. EBADFD [File descriptor in bad state]" repeated 6 times between [2021-01-28 21:23:54.395811] and [2021-01-28 21:23:54.811640]
> But that was a long time ago.
> Brick logs have an entry from when I first started the VM today (the
> problem was reproduced yesterday); all brick logs have something similar.
> Nothing appeared on the several other startup attempts of the VM:
> [2021-01-28 21:24:45.460147] I [MSGID: 115029] [server-handshake.c:549:server_setvolume] 0-adminvm-server: accepted client from CTX_ID:613f0d91-34e6-4495-859f-bca1c9f7af01-GRAPH_ID:0-PID:6287-HOST:nano-1-PC_NAME:adminvm-client-2-RECON_NO:-0 (version: 7.2) with subvol /data/brick_adminvm
> [2021-01-29 18:54:45.455558] I [addr.c:54:compare_addr_and_update] 0-/data/brick_adminvm: allowed = "*", received addr = "172.23.255.153"
> [2021-01-29 18:54:45.455802] I [login.c:110:gf_auth] 0-auth/login: allowed user names: 3b66cfab-00d5-4b13-a103-93b4cf95e144
> [2021-01-29 18:54:45.455815] I [MSGID: 115029] [server-handshake.c:549:server_setvolume] 0-adminvm-server: accepted client from CTX_ID:3774af6b-07b9-437b-a34e-9f71f3b57d03-GRAPH_ID:0-PID:45640-HOST:nano-3-PC_NAME:adminvm-client-2-RECON_NO:-0 (version: 7.2) with subvol /data/brick_adminvm
> [2021-01-29 18:54:45.494950] W [socket.c:774:__socket_rwv] 0-tcp.adminvm-server: readv on 172.23.255.153:48551 failed (No data available)
> [2021-01-29 18:54:45.494994] I [MSGID: 115036] [server.c:501:server_rpc_notify] 0-adminvm-server: disconnecting connection from CTX_ID:3774af6b-07b9-437b-a34e-9f71f3b57d03-GRAPH_ID:0-PID:45640-HOST:nano-3-PC_NAME:adminvm-client-2-RECON_NO:-0
> [2021-01-29 18:54:45.495091] I [MSGID: 101055] [client_t.c:436:gf_client_unref] 0-adminvm-server: Shutting down connection CTX_ID:3774af6b-07b9-437b-a34e-9f71f3b57d03-GRAPH_ID:0-PID:45640-HOST:nano-3-PC_NAME:adminvm-client-2-RECON_NO:-0
> Like before, if I halt the VM, kpartx the image, mount the giant root
> within the image, then unmount, unkpartx, and start the VM - it works:
> nano-2:/var/log/glusterfs # kpartx -a /adminvm/images/adminvm.img
> nano-2:/var/log/glusterfs # mount /dev/mapper/loop0p31 /mnt
> nano-2:/var/log/glusterfs # dmesg|tail -3
> [85528.602570] loop: module loaded
> [85535.975623] EXT4-fs (dm-3): recovery complete
> [85535.979663] EXT4-fs (dm-3): mounted filesystem with ordered data mode. Opts: (null)
> nano-2:/var/log/glusterfs # umount /mnt
> nano-2:/var/log/glusterfs # kpartx -d /adminvm/images/adminvm.img
> loop deleted : /dev/loop0
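The workaround steps above can be wrapped up as one function; a sketch
only, with the p31 root-partition number and /mnt mount point taken from
the commands above:

```shell
# Map the raw image's partitions, mount the root partition once (which
# replays the ext4 journal, as in the dmesg output above), then tear the
# mappings down again. Must run as root on the node hosting the image.
recover_image() {
    img=$1
    kpartx -a "$img" || return 1      # creates /dev/mapper/loop0pN
    mount /dev/mapper/loop0p31 /mnt   # triggers ext4 journal recovery
    umount /mnt
    kpartx -d "$img"                  # removes the device mappings
}
```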
> VM WORKS for ONE boot cycle on one physical node!
> nano-2:/var/log/glusterfs # virsh start adminvm
> However, this will work for a boot, but later it will stop working again
> (INCLUDING the physical node that booted once OK. The next boot fails
> again, as does launching it on the other two).
> Based on feedback, I will not change the shard size at this time and
> will leave that for later. Some people suggest larger sizes but it isn't
> a universal suggestion. I'll also not attempt to make a logical volume
> out of a group of smaller images as I think it should work like this.
> Those are things I will try later if I run out of runway. Since we want
> a solution to deploy to sites, this would increase the maintenance of
> the otherwise simple solution.
> I am leaving the state like this and will now proceed to update to the
> latest gluster 7.
> I will report back after I get everything updated and services restarted
> with the newer version.
> THANKS FOR ALL THE HELP SO FAR!!
> On Wed, Jan 27, 2021 at 10:55:50PM +0300, Mahdi Adnan wrote:
> > I would leave it on 64M in volumes with spindle disks, but with SSD volumes, I
> > would increase it to 128M or even 256M, but it varies from one workload to
> > another.
> > On Wed, Jan 27, 2021 at 10:02 PM Erik Jacobson <erik.jacobson at hpe.com> wrote:
> > > Also, I would like to point out that I have VMs with large disks, 1TB
> > > and 2TB, and have no issues. I would definitely upgrade the Gluster
> > > version to, let's say, at least 7.9.
> > Great! Thank you! We can update but it's very sensitive due to the
> > workload. I can't officially update our gluster until we have a cluster
> > with a couple thousand nodes to test with. However, for this problem,
> > this is on my list on the test machine. I'm hoping I can reproduce it.
> > So far no luck making it happen again. Once I hit it, I will try to
> > collect more data and at the end update gluster.
> > What do you think about the suggestion to increase the shard size? Are
> > you using the default size on your 1TB and 2TB images?
> > > Amar also asked a question regarding enabling sharding on the volume
> > > after creating the VM disks, which would certainly mess up the volume
> > > if that is what happened.
> > Oh, I missed this question. I basically scripted it quickly since I was
> > doing it so often. I have a similar script that takes it all away to
> > start over.
> > set -x
> > pdsh -g gluster mkdir /data/brick_adminvm/
> > gluster volume create adminvm replica 3 transport tcp \
> >     172.23.255.151:/data/brick_adminvm \
> >     172.23.255.152:/data/brick_adminvm \
> >     172.23.255.153:/data/brick_adminvm
> > gluster volume set adminvm group virt
> > gluster volume set adminvm granular-entry-heal enable
> > gluster volume set adminvm storage.owner-uid 439
> > gluster volume set adminvm storage.owner-gid 443
> > gluster volume start adminvm
> > pdsh -g gluster mount /adminvm
> > echo -n "press enter to continue for restore tarball"
> > pushd /adminvm
> > tar xvf /root/backup.tar
> > popd
> > echo -n "press enter to continue for qemu-img"
> > pushd /adminvm
> > qemu-img create -f raw -o preallocation=falloc \
> >     /adminvm/images/adminvm.img 5T
> > popd
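Since the sharding question came up, a hedged check (stock Gluster 7
option names) that the `group virt` profile in the script above actually
enabled sharding before the image was written:

```shell
# Print the effective sharding options for a volume; "group virt" is
# expected to have turned features.shard on before qemu-img ran.
shard_settings() {
    gluster volume get "$1" features.shard
    gluster volume get "$1" features.shard-block-size
}
```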
> > Thanks again for the kind responses,
> > Erik
> > >
> > > On Wed, Jan 27, 2021 at 5:28 PM Erik Jacobson <erik.jacobson at hpe.com> wrote:
> > >
> > > > > Shortly after the sharded volume is made, there are some fuse
> > > > > mount messages. I'm not 100% sure if this was just before or
> > > > > during the big qemu-img command to make the 5T image
> > > > > (qemu-img create -f raw -o preallocation=falloc /adminvm/images/adminvm.img 5T)
> > > > Any reason to have a single disk with this size?
> > >
> > > > Usually in any virtualization I have used, it is always recommended
> > > > to keep it lower. Have you thought about multiple disks with a
> > > > smaller size?
> > >
> > > Yes, because the actual virtual machine is an admin node/head node
> > > cluster manager for a supercomputer that hosts big OS images and
> > > drives multi-thousand-node clusters (boot, monitoring, image creation,
> > > distribution, sometimes NFS roots, etc.). So this VM is a biggie.
> > >
> > > We could make multiple smaller images, but it would be very painful
> > > since it differs from the normal non-VM setup.
> > >
> > > So unlike many solutions where you have lots of small VMs with their
> > > small images, this solution is one giant VM with one giant image.
> > > We're essentially using gluster in this use case (as opposed to others
> > > I have posted about in the past) for head node failover (combined with
> > > pacemaker).
> > >
> > > > Also worth noting is that RHII is supported only when the shard
> > > > size is 512MB, so it's worth trying a bigger shard size.
> > >
> > > I have put larger shard size and newer gluster version on the list to
> > > try. Thank you! Hoping to get it failing again to try these things!
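If a larger shard size does get tried later, note it only affects newly
written shards, so it has to be set before the image is recreated; a
sketch using the stock option name:

```shell
# Change the shard size on a volume; only shards written afterwards use
# the new size, so set this on an empty volume before qemu-img create.
resize_shards() {
    vol=$1; size=$2    # e.g. resize_shards adminvm 512MB
    gluster volume set "$vol" features.shard-block-size "$size"
}
```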
> > >
> > >
> > >
> > > --
> > > Respectfully
> > > Mahdi