[Gluster-users] qemu raw image file - qemu and grub2 can't find boot content from VM

Erik Jacobson erik.jacobson at hpe.com
Tue Jan 26 23:36:16 UTC 2021


Shortly after the sharded volume was made, there were some fuse mount
messages. I'm not 100% sure whether this was just before or during the
big qemu-img command to make the 5T image
(qemu-img create -f raw -o preallocation=falloc
/adminvm/images/adminvm.img 5T)
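
For reference, a rough sketch of how one could sanity-check the image
after the create finishes (paths as above; 5T works out to 5 * 1024^4 =
5497558138880 bytes):

```shell
IMG=/adminvm/images/adminvm.img

# 5T = 5 * 1024^4 bytes.
EXPECTED=$((5 * 1024 * 1024 * 1024 * 1024))
echo "expected apparent size: $EXPECTED bytes"

# Apparent size (%s) vs. allocated 512-byte blocks (%b); with
# preallocation=falloc the blocks should be allocated up front.
stat -c 'apparent=%s blocks=%b' "$IMG"

# qemu-img's own view of virtual size vs. disk usage.
qemu-img info "$IMG"
```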


(from /var/log/glusterfs/adminvm.log)
[2021-01-26 19:18:21.287697] I [fuse-bridge.c:5166:fuse_init] 0-glusterfs-fuse: FUSE inited with protocol versions: glusterfs 7.24 kernel 7.31
[2021-01-26 19:18:21.287719] I [fuse-bridge.c:5777:fuse_graph_sync] 0-fuse: switched to graph 0
[2021-01-26 19:18:23.945566] W [MSGID: 114031] [client-rpc-fops_v2.c:2633:client4_0_lookup_cbk] 0-adminvm-client-2: remote operation failed. Path: /.shard/0cb55720-2288-46c2-bd7e-5d9bd23b40bd.7 (00000000-0000-0000-0000-000000000000) [No data available]
[2021-01-26 19:18:54.089721] W [MSGID: 114031] [client-rpc-fops_v2.c:2633:client4_0_lookup_cbk] 0-adminvm-client-0: remote operation failed. Path: /.shard/0cb55720-2288-46c2-bd7e-5d9bd23b40bd.85 (00000000-0000-0000-0000-000000000000) [No data available]
[2021-01-26 19:18:54.089784] W [MSGID: 114031] [client-rpc-fops_v2.c:2633:client4_0_lookup_cbk] 0-adminvm-client-1: remote operation failed. Path: /.shard/0cb55720-2288-46c2-bd7e-5d9bd23b40bd.85 (00000000-0000-0000-0000-000000000000) [No data available]
[2021-01-26 19:18:55.048613] W [MSGID: 114031] [client-rpc-fops_v2.c:2633:client4_0_lookup_cbk] 0-adminvm-client-1: remote operation failed. Path: /.shard/0cb55720-2288-46c2-bd7e-5d9bd23b40bd.88 (00000000-0000-0000-0000-000000000000) [No data available]
[2021-01-26 19:18:55.355131] W [MSGID: 114031] [client-rpc-fops_v2.c:2633:client4_0_lookup_cbk] 0-adminvm-client-0: remote operation failed. Path: /.shard/0cb55720-2288-46c2-bd7e-5d9bd23b40bd.89 (00000000-0000-0000-0000-000000000000) [No data available]
[2021-01-26 19:18:55.981094] W [MSGID: 114031] [client-rpc-fops_v2.c:2633:client4_0_lookup_cbk] 0-adminvm-client-0: remote operation failed. Path: /.shard/0cb55720-2288-46c2-bd7e-5d9bd23b40bd.91 (00000000-0000-0000-0000-000000000000) [No data available]
......


Towards the end of the qemu-img create command (or just after; it's
hard to tell), these messages showed up in adminvm.log. I've supplied
just the first few; there were many:


[2021-01-26 19:28:40.652898] W [MSGID: 101159] [inode.c:1212:__inode_unlink] 0-inode: be318638-e8a0-4c6d-977d-7a937aa84806/48bb5288-e27e-46c9-9f7c-944a804df361.1: dentry not found in 48bb5288-e27e-46c9-9f7c-944a804df361
[2021-01-26 19:28:40.652975] W [MSGID: 101159] [inode.c:1212:__inode_unlink] 0-inode: be318638-e8a0-4c6d-977d-7a937aa84806/931508ed-9368-4982-a53e-7187a9f0c1f9.3: dentry not found in 931508ed-9368-4982-a53e-7187a9f0c1f9
[2021-01-26 19:28:40.653047] W [MSGID: 101159] [inode.c:1212:__inode_unlink] 0-inode: be318638-e8a0-4c6d-977d-7a937aa84806/e808ecab-2e70-4ef3-954e-ce1b78ed8b52.4: dentry not found in e808ecab-2e70-4ef3-954e-ce1b78ed8b52
[2021-01-26 19:28:40.653102] W [MSGID: 101159] [inode.c:1212:__inode_unlink] 0-inode: be318638-e8a0-4c6d-977d-7a937aa84806/2c62c383-d869-4655-9c03-f08a86a874ba.6: dentry not found in 2c62c383-d869-4655-9c03-f08a86a874ba
[2021-01-26 19:28:40.653169] W [MSGID: 101159] [inode.c:1212:__inode_unlink] 0-inode: be318638-e8a0-4c6d-977d-7a937aa84806/556ffbc9-bcbe-445a-93f5-13784c5a6df1.2: dentry not found in 556ffbc9-bcbe-445a-93f5-13784c5a6df1
[2021-01-26 19:28:40.653218] W [MSGID: 101159] [inode.c:1212:__inode_unlink] 0-inode: be318638-e8a0-4c6d-977d-7a937aa84806/5d414e7c-335d-40da-bb96-6c427181338b.5: dentry not found in 5d414e7c-335d-40da-bb96-6c427181338b
[2021-01-26 19:28:40.653314] W [MSGID: 101159] [inode.c:1212:__inode_unlink] 0-inode: be318638-e8a0-4c6d-977d-7a937aa84806/43364dc9-2d8e-4fca-89d2-e11dee6fcfd4.8: dentry not found in 43364dc9-2d8e-4fca-89d2-e11dee6fcfd4
.....


So now I installed Linux into a VM using the above as the VM image.
There were no additional fuse messages while the admin VM was being
installed with our installer (via qemu on the same physical node where
the above messages appeared, and the same node where I ran qemu-img create).

Rebooted the virtual machine and it booted fine. No new messages in
the fuse log. So now it's officially booted. This was a 'reboot', so
qemu didn't restart.

Halted the VM with 'halt', then in virt-manager did a forced shutdown.

Started the VM from scratch.

Still no new messages and it booted fine.

Powered off a physical node and brought it back, still fine.
Reset all physical nodes and brought them back, still fine.

I am unable to trigger this problem. However, once it starts to go bad,
it stays bad, and stays bad across all the physical nodes. The trick of
using kpartx to mount the root filesystem from within the image and
then unmount it is only a temporary fix that doesn't persist beyond one
boot once we're in the bad state.
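
For clarity, the temporary workaround amounts to something like this (a
sketch; the loop device name and partition number 31 come from my setup
and may differ elsewhere):

```shell
IMG=/adminvm/images/adminvm.img
DEV=loop0   # device kpartx picks; check its -av output
PART=31     # root partition number in this image

# Map the partitions inside the raw image to /dev/mapper devices.
kpartx -av "$IMG"

# Mount the VM's root filesystem from within the image, then unmount it.
mount "/dev/mapper/${DEV}p${PART}" /mnt
umount /mnt

# Tear the mappings back down.
kpartx -dv "$IMG"
```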

So something gets into a bad state and stays that way, but we don't
know how to make it happen at will. I will continue to try to reproduce
this, as it's causing some huge problems in the field.
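
Since the one concrete clue so far is the guest seeing a 64M disk
instead of 5T, one cheap check I can script from the fuse mount side is
comparing the size the client reports against the expected value. It
may or may not reproduce what qemu sees, but it costs nothing to run (a
sketch; note 64M is exactly one shard):

```shell
IMG=/adminvm/images/adminvm.img
EXPECTED=$((5 * 1024 * 1024 * 1024 * 1024))   # 5T in bytes
SHARD=$((64 * 1024 * 1024))                   # 64M, the shard size

ACTUAL=$(stat -c %s "$IMG")
if [ "$ACTUAL" -eq "$EXPECTED" ]; then
    echo "image size looks right: $ACTUAL bytes"
elif [ "$ACTUAL" -eq "$SHARD" ]; then
    echo "image looks like a single 64M shard: $ACTUAL bytes"
else
    echo "unexpected size: $ACTUAL bytes (expected $EXPECTED)"
fi
```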




On Tue, Jan 26, 2021 at 07:40:19AM -0600, Erik Jacobson wrote:
> Thank you so much for responding! More below.
> 
> 
> >  Anything in the logs of the fuse mount? can you stat the file from the mount?
> > also, the report of an image is only 64M makes me think about Sharding as the
> > default value of Shard size is 64M.
> > Do you have any clues on when this issue start to happen? was there any
> > operation done to the Gluster cluster?
> 
> 
> - I had just created the gluster volumes within an hour of the problem
>   to test the very problem I reported. So it was a "fresh start".
> 
> - It booted one or two times, then stopped booting. Once it couldn't
>   boot, all 3 nodes were the same in that grub2 couldn't boot in the VM
>   image.
> 
> As for the fuse log, I did see a couple of these before it happened the
> first time. I'm not sure if it's a clue or not.
> 
> [2021-01-25 22:48:19.310467] I [fuse-bridge.c:5777:fuse_graph_sync] 0-fuse: switched to graph 0
> [2021-01-25 22:50:09.693958] E [fuse-bridge.c:227:check_and_dump_fuse_W] (--> /usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x17a)[0x7f914e346faa] (--> /usr/lib64/glusterfs/7.2/xlator/mount/fuse.so(+0x874a)[0x7f914a3d374a] (--> /usr/lib64/glusterfs/7.2/xlator/mount/fuse.so(+0x91cb)[0x7f914a3d41cb] (--> /lib64/libpthread.so.0(+0x84f9)[0x7f914cf184f9] (--> /lib64/libc.so.6(clone+0x3f)[0x7f914c76afbf] ))))) 0-glusterfs-fuse: writing to fuse device failed: No such file or directory
> [2021-01-25 22:50:09.694462] E [fuse-bridge.c:227:check_and_dump_fuse_W] (--> /usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x17a)[0x7f914e346faa] (--> /usr/lib64/glusterfs/7.2/xlator/mount/fuse.so(+0x874a)[0x7f914a3d374a] (--> /usr/lib64/glusterfs/7.2/xlator/mount/fuse.so(+0x91cb)[0x7f914a3d41cb] (--> /lib64/libpthread.so.0(+0x84f9)[0x7f914cf184f9] (--> /lib64/libc.so.6(clone+0x3f)[0x7f914c76afbf] ))))) 0-glusterfs-fuse: writing to fuse device failed: No such file or directory
> 
> 
> 
> I have reserved the test system again. My plans today are:
>  - Start over with the gluster volume on the machine with sles15sp2
>    updates
> 
>  - Learn whether other modifications to the image (besides using
>    kpartx to map and mount/unmount filesystems within the image) also
>    force it to work. What if I add/remove a byte at the end of the
>    image file, for example?
> 
>  - Revert the setup to sles15sp2 with no updates. My theory is the
>    updates are not making a difference and it's just random chance.
>    (re-making the gluster volume in the process)
> 
>  - The 64MB shard size made me think too!!
> 
>  - If the team feels it is worth it, I could try a newer gluster. We're
>    using the versions we've validated at scale when we have large
>    clusters in the factory but if the team thinks I should try something
>    else I'm happy to re-build it!!!  We are @ 7.2 plus afr-event-gen-changes
>    patch.
> 
> I will keep a better eye on the fuse log to tie an error to the problem
> starting.
> 
> 
> THANKS AGAIN for responding and let me know if you have any more
> clues!
> 
> Erik
> 
> 
> > 
> > On Tue, Jan 26, 2021 at 2:40 AM Erik Jacobson <erik.jacobson at hpe.com> wrote:
> > 
> >     Hello all. Thanks again for gluster. We're having a strange problem
> >     getting virtual machines started that are hosted on a gluster volume.
> > 
> >     One of the ways we use gluster now is to make a HA-ish cluster head
> >     node. A virtual machine runs in the shared storage and is backed up by 3
> >     physical servers that contribute to the gluster storage share.
> > 
> >     We're using sharding in this volume. The VM image file is around 5T and
> >     we use qemu-img with falloc to get all the blocks allocated in advance.
> > 
> >     We are not using gfapi largely because it would mean we have to build
> >     our own libvirt and qemu and we'd prefer not to do that. So we're using
> >     a glusterfs fuse mount to host the image. The virtual machine is using
> >     virtio disks but we had similar trouble using scsi emulation.
> > 
>     The issue: all seems well at first; the VM head node installs, boots, etc.
> > 
> >     However, at some point, it stops being able to boot! grub2 acts like it
> >     cannot find /boot. At the grub2 prompt, it can see the partitions, but
> >     reports no filesystem found where there are indeed filesystems.
> > 
>     If we switch qemu to use "direct kernel load" (bypassing grub2), this
>     often works around the problem, but in one case Linux gave us a clue:
>     it reported /dev/vda as being only 64 megabytes, which would explain a
>     lot. This means the Linux inside the virtual machine thought the disk
>     supplied by the disk image was tiny: 64M instead of 5T!
> > 
> >     We are using sles15sp2 and hit the problem more often with updates
> >     applied than without. I'm in the process of trying to isolate if there
> >     is a sles15sp2 update causing this, or if we're within "random chance".
> > 
>     On one of the physical nodes in the failure mode: if I use 'kpartx'
>     to create the partitions from the image file, then mount the giant
>     root filesystem (i.e. mount /dev/mapper/loop0p31 /mnt) and then
>     umount /mnt, that physical node starts the VM fine; grub2 loads and
>     the virtual machine is fully happy!  Until I try to shut it down and
>     start it up again, at which point it sticks at grub2 again! What about
>     mounting the image file makes it so qemu sees the whole disk?
> > 
> >     The problem doesn't always happen but once it starts, the same VM image has
> >     trouble starting on any of the 3 physical nodes sharing the storage.
> >     But using the trick to force-mount the root within the image with
> >     kpartx, then the machine can come up. My only guess is this changes the
> >     file just a tiny bit in the middle of the image.
> > 
> >     Once the problem starts, it keeps happening except temporarily working
> >     when I do the loop mount trick on the physical admin.
> > 
> > 
> >     Here is some info about what I have in place:
> > 
> > 
> >     nano-1:/adminvm/images # gluster volume info
> > 
> >     Volume Name: adminvm
> >     Type: Replicate
> >     Volume ID: 67de902c-8c00-4dc9-8b69-60b93b5f6104
> >     Status: Started
> >     Snapshot Count: 0
> >     Number of Bricks: 1 x 3 = 3
> >     Transport-type: tcp
> >     Bricks:
> >     Brick1: 172.23.255.151:/data/brick_adminvm
> >     Brick2: 172.23.255.152:/data/brick_adminvm
> >     Brick3: 172.23.255.153:/data/brick_adminvm
> >     Options Reconfigured:
> >     performance.client-io-threads: on
> >     nfs.disable: on
> >     storage.fips-mode-rchecksum: on
> >     transport.address-family: inet
> >     performance.quick-read: off
> >     performance.read-ahead: off
> >     performance.io-cache: off
> >     performance.low-prio-threads: 32
> >     network.remote-dio: enable
> >     cluster.eager-lock: enable
> >     cluster.quorum-type: auto
> >     cluster.server-quorum-type: server
> >     cluster.data-self-heal-algorithm: full
> >     cluster.locking-scheme: granular
> >     cluster.shd-max-threads: 8
> >     cluster.shd-wait-qlength: 10000
> >     features.shard: on
> >     user.cifs: off
> >     cluster.choose-local: off
> >     client.event-threads: 4
> >     server.event-threads: 4
> >     cluster.granular-entry-heal: enable
> >     storage.owner-uid: 439
> >     storage.owner-gid: 443
> > 
> > 
> > 
> > 
> >     libglusterfs0-7.2-4723.1520.210122T1700.a.sles15sp2hpe.x86_64
> >     glusterfs-7.2-4723.1520.210122T1700.a.sles15sp2hpe.x86_64
> >     python3-gluster-7.2-4723.1520.210122T1700.a.sles15sp2hpe.noarch
> > 
> > 
> > 
> >     nano-1:/adminvm/images # uname -a
> >     Linux nano-1 5.3.18-24.46-default #1 SMP Tue Jan 5 16:11:50 UTC 2021
> >     (4ff469b) x86_64 x86_64 x86_64 GNU/Linux
> >     nano-1:/adminvm/images # rpm -qa | grep qemu-4
> >     qemu-4.2.0-9.4.x86_64
> > 
> > 
> > 
> >     Would love any advice!!!!
> > 
> > 
> >     Erik
> >     ________
> > 
> > 
> > 
> >     Community Meeting Calendar:
> > 
> >     Schedule -
> >     Every 2nd and 4th Tuesday at 14:30 IST / 09:00 UTC
> >     Bridge: https://meet.google.com/cpu-eiue-hvk
> >     Gluster-users mailing list
> >     Gluster-users at gluster.org
> >     https://lists.gluster.org/mailman/listinfo/gluster-users
> > 
> > 
> > 
> > --
> > Respectfully
> > Mahdi

