[Gluster-users] qemu raw image file - qemu and grub2 can't find boot content from VM

Fri Jan 29 19:11:50 UTC 2021

Thank you.

We reproduced the problem after force-killing one of the 3 physical
nodes 6 times in a row.

At that point, the grub2 loaded off the qemu virtual hard drive, but
could not find partitions. Since there is random luck involved, we don't
actually know if it was the force-killing that caused it to stop
working.

When I start the VM with the image in this state, there is nothing
interesting in the fuse log for the volume in /var/log/glusterfs on the
node hosting the image.

No pending heals (all servers report 0 entries to heal).
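
For anyone repeating this check, here is a minimal sketch of how the
per-brick counts could be summed. It assumes the usual
"Number of entries: N" lines printed by `gluster volume heal <vol> info`;
the helper name count_pending_heals is made up for illustration and is
not part of gluster.

```shell
#!/bin/sh
# Sum the "Number of entries:" lines from `gluster volume heal <vol> info`.
# Reads the command output on stdin so it can be tried without a cluster.
# Non-numeric values (e.g. "-" for an unreachable brick) are skipped.
count_pending_heals() {
    awk -F': *' '
        /^Number of entries:/ && $2 ~ /^[0-9]+/ { total += $2 + 0 }
        END { print total + 0 }
    '
}

# Usage on a gluster node (volume name adminvm is the one from this thread):
#   gluster volume heal adminvm info | count_pending_heals
```

A result of 0 only says that no heals are queued; it does not prove the
image content is identical on all three bricks.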

The same VM behavior happens on all the physical nodes when I try to
start with the same VM image.

Something from the gluster fuse mount log from earlier shows:

[2021-01-28 21:24:40.814227] I [MSGID: 114018] [client.c:2347:client_rpc_notify] 0-adminvm-client-0: disconnected from adminvm-client-0. Client process will keep trying to connect to glusterd until brick's port is available
[2021-01-28 21:24:43.815120] I [rpc-clnt.c:1963:rpc_clnt_reconfig] 0-adminvm-client-0: changing port to 49152 (from 0)
[2021-01-28 21:24:43.815833] I [MSGID: 114057] [client-handshake.c:1376:select_server_supported_programs] 0-adminvm-client-0: Using Program GlusterFS 4.x v1, Num (1298437), Version (400)
[2021-01-28 21:24:43.817682] I [MSGID: 114046] [client-handshake.c:1106:client_setvolume_cbk] 0-adminvm-client-0: Connected to adminvm-client-0, attached to remote volume '/data/brick_adminvm'.
[2021-01-28 21:24:43.817709] I [MSGID: 114042] [client-handshake.c:930:client_post_handshake] 0-adminvm-client-0: 1 fds open - Delaying child_up until they are re-opened
[2021-01-28 21:24:43.895163] I [MSGID: 114041] [client-handshake.c:318:client_child_up_reopen_done] 0-adminvm-client-0: last fd open'd/lock-self-heal'd - notifying CHILD-UP
The message "W [MSGID: 114061] [client-common.c:2893:client_pre_lk_v2] 0-adminvm-client-0:  (94695bdb-06b4-4105-9bc8-b8207270c941) remote_fd is -1. EBADFD [File descriptor in bad state]" repeated 6 times between [2021-01-28 21:23:54.395811] and [2021-01-28 21:23:54.811640]

But that was a long time ago.
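
To pull only the warnings and errors out of a long fuse or brick log, a
filter along these lines may help. It is a rough sketch that assumes the
log layout shown above (a bracketed timestamp followed by a one-letter
level such as I, W, E or C); the function name gluster_warnings and the
log file name in the usage comment are assumptions.

```shell
#!/bin/sh
# Keep only W (warning), E (error) and C (critical) lines from a gluster log.
# Also keeps the 'The message "W ..." repeated N times' summary lines.
gluster_warnings() {
    grep -E '^(\[[^]]+\] |The message ")[WEC] \['
}

# Usage (log file name is an assumption; check /var/log/glusterfs on your node):
#   gluster_warnings < /var/log/glusterfs/adminvm.log
```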

The brick logs have an entry from when I first started the VM today (the
problem was reproduced yesterday); all brick logs have something similar.
Nothing appeared on the several other startup attempts of the VM:

[2021-01-28 21:24:45.460147] I [MSGID: 115029] [server-handshake.c:549:server_setvolume] 0-adminvm-server: accepted client from CTX_ID:613f0d91-34e6-4495-859f-bca1c9f7af01-GRAPH_ID:0-PID:6287-HOST:nano-1-PC_NAME:adminvm-client-2-RECON_NO:-0 (version: 7.2) with subvol /data/brick_adminvm
[2021-01-29 18:54:45.455558] I [addr.c:54:compare_addr_and_update] 0-/data/brick_adminvm: allowed = "*", received addr = "172.23.255.153"
[2021-01-29 18:54:45.455802] I [login.c:110:gf_auth] 0-auth/login: allowed user names: 3b66cfab-00d5-4b13-a103-93b4cf95e144
[2021-01-29 18:54:45.455815] I [MSGID: 115029] [server-handshake.c:549:server_setvolume] 0-adminvm-server: accepted client from CTX_ID:3774af6b-07b9-437b-a34e-9f71f3b57d03-GRAPH_ID:0-PID:45640-HOST:nano-3-PC_NAME:adminvm-client-2-RECON_NO:-0 (version: 7.2) with subvol /data/brick_adminvm
[2021-01-29 18:54:45.494950] W [socket.c:774:__socket_rwv] 0-tcp.adminvm-server: readv on 172.23.255.153:48551 failed (No data available)
[2021-01-29 18:54:45.494994] I [MSGID: 115036] [server.c:501:server_rpc_notify] 0-adminvm-server: disconnecting connection from CTX_ID:3774af6b-07b9-437b-a34e-9f71f3b57d03-GRAPH_ID:0-PID:45640-HOST:nano-3-PC_NAME:adminvm-client-2-RECON_NO:-0
[2021-01-29 18:54:45.495091] I [MSGID: 101055] [client_t.c:436:gf_client_unref] 0-adminvm-server: Shutting down connection CTX_ID:3774af6b-07b9-437b-a34e-9f71f3b57d03-GRAPH_ID:0-PID:45640-HOST:nano-3-PC_NAME:adminvm-client-2-RECON_NO:-0

Like before, if I halt the VM, kpartx the image, mount the giant root
within the image, then unmount, unkpartx, and start the VM - it works:

nano-2:/var/log/glusterfs # kpartx -a /adminvm/images/adminvm.img
nano-2:/var/log/glusterfs # mount /dev/mapper/loop0p31 /mnt
nano-2:/var/log/glusterfs # dmesg|tail -3
[85528.602570] loop: module loaded
[85535.975623] EXT4-fs (dm-3): recovery complete
[85535.979663] EXT4-fs (dm-3): mounted filesystem with ordered data mode. Opts: (null)
nano-2:/var/log/glusterfs # umount /mnt
nano-2:/var/log/glusterfs # kpartx -d /adminvm/images/adminvm.img
loop deleted : /dev/loop0

VM WORKS for ONE boot cycle on one physical!

nano-2:/var/log/glusterfs # virsh start adminvm

However, this works for one boot, but later it stops working again.
(INCLUDING on the physical node that booted once OK. The next boot fails
again, as does launching it on the other two.)
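
For repeatable testing, the manual kpartx/mount/umount cycle above could
be wrapped in a small script. This is only a sketch of what I typed by
hand: it assumes `kpartx -asv` prints lines of the form
"add map loop0p31 (...)", and the function names are made up. It needs
root and must only be run while the VM is halted.

```shell
#!/bin/sh
# Print the /dev/mapper path for partition number $1, reading the output
# of `kpartx -asv <image>` on stdin (assumed format: "add map loop0p31 ...").
mapper_for_partition() {
    awk -v part="$1" '
        $1 == "add" && $2 == "map" && $3 ~ ("p" part "$") {
            print "/dev/mapper/" $3
            exit
        }
    '
}

# Map the image, mount and unmount one partition (letting the kernel replay
# the ext4 journal, as seen in dmesg above), then remove the mappings again.
mount_cycle() {
    img=$1; part=$2; mnt=$3
    dev=$(kpartx -asv "$img" | mapper_for_partition "$part")
    if [ -z "$dev" ]; then
        kpartx -d "$img"
        return 1
    fi
    status=0
    mount "$dev" "$mnt" && umount "$mnt" || status=1
    kpartx -d "$img" || status=1
    return "$status"
}

# Usage, matching the transcript above (run as root, VM halted):
#   mount_cycle /adminvm/images/adminvm.img 31 /mnt && virsh start adminvm
```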

Based on feedback, I will not change the shard size at this time and
will leave that for later. Some people suggest larger sizes but it isn't
a universal suggestion. I'll also not attempt to make a logical volume
out of a group of smaller images as I think it should work like this.
Those are things I will try later if I run out of runway. Since we want
a solution to deploy to sites, this would increase the maintenance of
the otherwise simple solution.

I am leaving the state like this and will now proceed to update to the
latest gluster 7.

I will report back after I get everything updated and services restarted
with the newer version.

THANKS FOR ALL THE HELP SO FAR!!

Erik

On Wed, Jan 27, 2021 at 10:55:50PM +0300, Mahdi Adnan wrote:
>  I would leave it on 64M in volumes with spindle disks, but with SSD volumes, I
> would increase it to 128M or even 256M, but it varies from one workload to
> another.
> On Wed, Jan 27, 2021 at 10:02 PM Erik Jacobson <erik.jacobson at hpe.com> wrote:
> 
>     > Also, I would like to point that I have VMs with large disks 1TB and 2TB, and
>     > have no issues. definitely would upgrade Gluster version like let's say at
>     > least 7.9.
> 
>     Great! Thank you! We can update but it's very sensitive due to the
>     workload. I can't officially update our gluster until we have a cluster
>     with a couple thousand nodes to test with. However, for this problem,
>     this is on my list on the test machine. I'm hoping I can reproduce it. So far
>     no luck making it happen again. Once I hit it, I will try to collect more data
>     and at the end update gluster.
> 
>     What do you think about the suggestion to increase the shard size? Are
>     you using the default size on your 1TB and 2TB images?
> 
>     > Amar also asked a question regarding enabling Sharding in the volume after
>     > creating the VMs disks, which would certainly mess up the volume if that what
>     > happened.
> 
>     Oh I missed this question. I basically scripted it quick since I was
>     doing it so often.. I have a similar script that takes it away to start
>     over.
> 
>     set -x
>     pdsh -g gluster mkdir /data/brick_adminvm/
>     gluster volume create adminvm replica 3 transport tcp \
>         172.23.255.151:/data/brick_adminvm \
>         172.23.255.152:/data/brick_adminvm \
>         172.23.255.153:/data/brick_adminvm
>     gluster volume set adminvm group virt
>     gluster volume set adminvm granular-entry-heal enable
>     gluster volume set adminvm storage.owner-uid 439
>     gluster volume set adminvm storage.owner-gid 443
>     gluster volume start adminvm
> 
>     pdsh -g gluster mount /adminvm
> 
>     echo -n "press enter to continue for restore tarball"
> 
>     pushd /adminvm
>     tar xvf /root/backup.tar
>     popd
> 
>     echo -n "press enter to continue for qemu-img"
> 
>     pushd /adminvm
>     qemu-img create -f raw -o preallocation=falloc \
>         /adminvm/images/adminvm.img 5T
>     popd
> 
> 
>     Thanks again for the kind responses,
> 
>     Erik
> 
>     >
>     > On Wed, Jan 27, 2021 at 5:28 PM Erik Jacobson <erik.jacobson at hpe.com> wrote:
>     >
>     >     > > Shortly after the sharded volume is made, there are some fuse mount
>     >     > > messages. I'm not 100% sure if this was just before or during the
>     >     > > big qemu-img command to make the 5T image
>     >     > > (qemu-img create -f raw -o preallocation=falloc
>     >     > > /adminvm/images/adminvm.img 5T)
>     >     > Any reason to have a single disk with this size ?
>     >
>     >     > Usually in any
>     >     > virtualization I have used , it is always recommended to keep it lower.
>     >     > Have you thought about multiple disks with smaller size ?
>     >
>     >     Yes, because the actual virtual machine is an admin node/head node cluster
>     >     manager for a supercomputer that hosts big OS images and drives
>     >     multi-thousand-node-clusters (boot, monitoring, image creation,
>     >     distribution, sometimes NFS roots, etc) . So this VM is a biggie.
>     >
>     >     We could make multiple smaller images but it would be very painful since
>     >     it differs from the normal non-VM setup.
>     >
>     >     So unlike many solutions where you have lots of small VMs with their
>     >     small images, this solution is one giant VM with one giant image.
>     >     We're essentially using gluster in this use case (as opposed to others I
>     >     have posted about in the past) for head node failover (combined with
>     >     pacemaker).
>     >
>     >     > Also worth
>     >     > noting is that RHII is supported only when the shard size is 512MB, so
>     >     > it's worth trying bigger shard size .
>     >
>     >     I have put larger shard size and newer gluster version on the list to
>     >     try. Thank you! Hoping to get it failing again to try these things!
>     >
>     >
>     >
>     > --
>     > Respectfully
>     > Mahdi
> 
> 
> 
> --
> Respectfully
> Mahdi