[Gluster-users] qemu raw image file - qemu and grub2 can't find boot content from VM
erik.jacobson at hpe.com
Fri Jan 29 20:20:56 UTC 2021
I updated to 7.9, rebooted everything, and it started working.
I will have QE try to break it again and report back. I couldn't break
it but they're better at breaking things (which is hard to imagine :)
On Fri, Jan 29, 2021 at 01:11:50PM -0600, Erik Jacobson wrote:
> Thank you.
> We reproduced the problem after force-killing one of the 3 physical
> nodes 6 times in a row.
> At that point, the grub2 loaded off the qemu virtual hard drive, but
> could not find partitions. Since there is random luck involved, we don't
> actually know if it was the force-killing that caused it to stop working.
> When I start the VM with the image in this state, there is nothing
> interesting in the fuse log for the volume in /var/log/glusterfs on the
> node hosting the image.
> No pending heals (all servers report 0 entries to heal).
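That heal check can be scripted; a minimal sketch, assuming the stock
`gluster volume heal <vol> info` output with a per-brick
"Number of entries: N" line:

```shell
# Sum pending heal entries across all bricks of a volume; prints the
# total and returns non-zero if anything is still pending. Assumes the
# standard "Number of entries: N" lines in heal-info output.
heal_pending() {
    gluster volume heal "$1" info | awk '
        /Number of entries:/ { total += $NF }
        END { print total + 0; exit ((total + 0) > 0) }'
}
```

With all replicas in sync this prints 0, matching the "0 entries to heal"
above.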
> The same VM behavior happens on all the physical nodes when I try to
> start with the same VM image.
> Something from the gluster fuse mount log from earlier shows:
> [2021-01-28 21:24:40.814227] I [MSGID: 114018] [client.c:2347:client_rpc_notify] 0-adminvm-client-0: disconnected from adminvm-client-0. Client process will keep trying to connect to glusterd until brick's port is available
> [2021-01-28 21:24:43.815120] I [rpc-clnt.c:1963:rpc_clnt_reconfig] 0-adminvm-client-0: changing port to 49152 (from 0)
> [2021-01-28 21:24:43.815833] I [MSGID: 114057] [client-handshake.c:1376:select_server_supported_programs] 0-adminvm-client-0: Using Program GlusterFS 4.x v1, Num (1298437), Version (400)
> [2021-01-28 21:24:43.817682] I [MSGID: 114046] [client-handshake.c:1106:client_setvolume_cbk] 0-adminvm-client-0: Connected to adminvm-client-0, attached to remote volume '/data/brick_adminvm'.
> [2021-01-28 21:24:43.817709] I [MSGID: 114042] [client-handshake.c:930:client_post_handshake] 0-adminvm-client-0: 1 fds open - Delaying child_up until they are re-opened
> [2021-01-28 21:24:43.895163] I [MSGID: 114041] [client-handshake.c:318:client_child_up_reopen_done] 0-adminvm-client-0: last fd open'd/lock-self-heal'd - notifying CHILD-UP
> The message "W [MSGID: 114061] [client-common.c:2893:client_pre_lk_v2] 0-adminvm-client-0: (94695bdb-06b4-4105-9bc8-b8207270c941) remote_fd is -1. EBADFD [File descriptor in bad state]" repeated 6 times between [2021-01-28 21:23:54.395811] and [2021-01-28 21:23:54.811640]
> But that was a long time ago.
> Brick logs have an entry from when I first started the VM today (the
> problem was reproduced yesterday); all brick logs have something similar.
> Nothing appeared on the several other startup attempts of the VM:
> [2021-01-28 21:24:45.460147] I [MSGID: 115029] [server-handshake.c:549:server_setvolume] 0-adminvm-server: accepted client from CTX_ID:613f0d91-34e6-4495-859f-bca1c9f7af01-GRAPH_ID:0-PID:6287-HOST:nano-1-PC_NAME:adminvm-client-2-RECON_NO:-0 (version: 7.2) with subvol /data/brick_adminvm
> [2021-01-29 18:54:45.455558] I [addr.c:54:compare_addr_and_update] 0-/data/brick_adminvm: allowed = "*", received addr = "172.23.255.153"
> [2021-01-29 18:54:45.455802] I [login.c:110:gf_auth] 0-auth/login: allowed user names: 3b66cfab-00d5-4b13-a103-93b4cf95e144
> [2021-01-29 18:54:45.455815] I [MSGID: 115029] [server-handshake.c:549:server_setvolume] 0-adminvm-server: accepted client from CTX_ID:3774af6b-07b9-437b-a34e-9f71f3b57d03-GRAPH_ID:0-PID:45640-HOST:nano-3-PC_NAME:adminvm-client-2-RECON_NO:-0 (version: 7.2) with subvol /data/brick_adminvm
> [2021-01-29 18:54:45.494950] W [socket.c:774:__socket_rwv] 0-tcp.adminvm-server: readv on 172.23.255.153:48551 failed (No data available)
> [2021-01-29 18:54:45.494994] I [MSGID: 115036] [server.c:501:server_rpc_notify] 0-adminvm-server: disconnecting connection from CTX_ID:3774af6b-07b9-437b-a34e-9f71f3b57d03-GRAPH_ID:0-PID:45640-HOST:nano-3-PC_NAME:adminvm-client-2-RECON_NO:-0
> [2021-01-29 18:54:45.495091] I [MSGID: 101055] [client_t.c:436:gf_client_unref] 0-adminvm-server: Shutting down connection CTX_ID:3774af6b-07b9-437b-a34e-9f71f3b57d03-GRAPH_ID:0-PID:45640-HOST:nano-3-PC_NAME:adminvm-client-2-RECON_NO:-0
> Like before, if I halt the VM, kpartx the image, mount the giant root
> within the image, then unmount, unkpartx, and start the VM - it works:
> nano-2:/var/log/glusterfs # kpartx -a /adminvm/images/adminvm.img
> nano-2:/var/log/glusterfs # mount /dev/mapper/loop0p31 /mnt
> nano-2:/var/log/glusterfs # dmesg|tail -3
> [85528.602570] loop: module loaded
> [85535.975623] EXT4-fs (dm-3): recovery complete
> [85535.979663] EXT4-fs (dm-3): mounted filesystem with ordered data mode. Opts: (null)
> nano-2:/var/log/glusterfs # umount /mnt
> nano-2:/var/log/glusterfs # kpartx -d /adminvm/images/adminvm.img
> loop deleted : /dev/loop0
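The workaround steps above can be wrapped up as one function; a sketch
only, with the p31 root-partition number and /mnt mount point taken from
the commands above:

```shell
# Map the raw image's partitions, mount the root partition once (which
# replays the ext4 journal, as in the dmesg output above), then tear the
# mappings down again. Must run as root on the node hosting the image.
recover_image() {
    img=$1
    kpartx -a "$img" || return 1      # creates /dev/mapper/loop0pN
    mount /dev/mapper/loop0p31 /mnt   # triggers ext4 journal recovery
    umount /mnt
    kpartx -d "$img"                  # removes the device mappings
}
```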
> VM WORKS for ONE boot cycle on one physical node!
> nano-2:/var/log/glusterfs # virsh start adminvm
> However, this will work for a boot, but later it will stop working again
> (INCLUDING the physical node that booted once OK. The next boot fails
> again, as does launching it on the other two).
> Based on feedback, I will not change the shard size at this time and
> will leave that for later. Some people suggest larger sizes but it isn't
> a universal suggestion. I'll also not attempt to make a logical volume
> out of a group of smaller images as I think it should work like this.
> Those are things I will try later if I run out of runway. Since we want
> a solution to deploy to sites, this would increase the maintenance of
> the otherwise simple solution.
> I am leaving the state like this and will now proceed to update to the
> latest gluster 7.
> I will report back after I get everything updated and services restarted
> with the newer version.
> THANKS FOR ALL THE HELP SO FAR!!
> On Wed, Jan 27, 2021 at 10:55:50PM +0300, Mahdi Adnan wrote:
> > I would leave it on 64M in volumes with spindle disks, but with SSD volumes, I
> > would increase it to 128M or even 256M, but it varies from one workload to
> > another.
> > On Wed, Jan 27, 2021 at 10:02 PM Erik Jacobson <erik.jacobson at hpe.com> wrote:
> > > Also, I would like to point out that I have VMs with large disks, 1TB
> > > and 2TB, and have no issues. I would definitely upgrade the Gluster
> > > version to, let's say, at least 7.9.
> > Great! Thank you! We can update but it's very sensitive due to the
> > workload. I can't officially update our gluster until we have a cluster
> > with a couple thousand nodes to test with. However, for this problem,
> > this is on my list on the test machine. I'm hoping I can reproduce it.
> > So far no luck making it happen again. Once I hit it, I will try to
> > collect more data and at the end update gluster.
> > What do you think about the suggestion to increase the shard size? Are
> > you using the default size on your 1TB and 2TB images?
> > > Amar also asked a question regarding enabling sharding on the volume
> > > after creating the VM disks, which would certainly mess up the volume
> > > if that is what happened.
> > Oh, I missed this question. I basically scripted it quickly since I was
> > doing it so often. I have a similar script that takes it all away to
> > start over.
> > set -x
> > pdsh -g gluster mkdir /data/brick_adminvm/
> > gluster volume create adminvm replica 3 transport tcp \
> >     172.23.255.151:/data/brick_adminvm \
> >     172.23.255.152:/data/brick_adminvm \
> >     172.23.255.153:/data/brick_adminvm
> > gluster volume set adminvm group virt
> > gluster volume set adminvm granular-entry-heal enable
> > gluster volume set adminvm storage.owner-uid 439
> > gluster volume set adminvm storage.owner-gid 443
> > gluster volume start adminvm
> > pdsh -g gluster mount /adminvm
> > echo -n "press enter to continue for restore tarball"
> > pushd /adminvm
> > tar xvf /root/backup.tar
> > popd
> > echo -n "press enter to continue for qemu-img"
> > pushd /adminvm
> > qemu-img create -f raw -o preallocation=falloc \
> >     /adminvm/images/adminvm.img 5T
> > popd
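Since the sharding question came up, a hedged check (stock Gluster 7
option names) that the `group virt` profile in the script above actually
enabled sharding before the image was written:

```shell
# Print the effective sharding options for a volume; "group virt" is
# expected to have turned features.shard on before qemu-img ran.
shard_settings() {
    gluster volume get "$1" features.shard
    gluster volume get "$1" features.shard-block-size
}
```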
> > Thanks again for the kind responses,
> > Erik
> > >
> > > On Wed, Jan 27, 2021 at 5:28 PM Erik Jacobson <erik.jacobson at hpe.com> wrote:
> > >
> > > > > Shortly after the sharded volume is made, there are some fuse
> > > > > mount messages. I'm not 100% sure if this was just before or
> > > > > during the big qemu-img command to make the 5T image
> > > > > (qemu-img create -f raw -o preallocation=falloc /adminvm/images/adminvm.img 5T)
> > > > Any reason to have a single disk with this size?
> > >
> > > > Usually in any virtualization I have used, it is always recommended
> > > > to keep it lower. Have you thought about multiple disks with a
> > > > smaller size?
> > >
> > > Yes, because the actual virtual machine is an admin node/head node
> > > cluster manager for a supercomputer that hosts big OS images and
> > > drives multi-thousand-node clusters (boot, monitoring, image creation,
> > > distribution, sometimes NFS roots, etc.). So this VM is a biggie.
> > >
> > > We could make multiple smaller images, but it would be very painful
> > > since it differs from the normal non-VM setup.
> > >
> > > So unlike many solutions where you have lots of small VMs with their
> > > small images, this solution is one giant VM with one giant image.
> > > We're essentially using gluster in this use case (as opposed to others
> > > I have posted about in the past) for head node failover (combined with
> > > pacemaker).
> > >
> > > > Also worth noting is that RHII is supported only when the shard
> > > > size is 512MB, so it's worth trying a bigger shard size.
> > >
> > > I have put larger shard size and newer gluster version on the list to
> > > try. Thank you! Hoping to get it failing again to try these things!
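If a larger shard size does get tried later, note it only affects newly
written shards, so it has to be set before the image is recreated; a
sketch using the stock option name:

```shell
# Change the shard size on a volume; only shards written afterwards use
# the new size, so set this on an empty volume before qemu-img create.
resize_shards() {
    vol=$1; size=$2    # e.g. resize_shards adminvm 512MB
    gluster volume set "$vol" features.shard-block-size "$size"
}
```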
> > >
> > >
> > >
> > > --
> > > Respectfully
> > > Mahdi