[Gluster-users] qemu raw image file - qemu and grub2 can't find boot content from VM
Erik Jacobson
erik.jacobson at hpe.com
Tue Jan 26 13:40:19 UTC 2021
Thank you so much for responding! More below.
> Anything in the logs of the fuse mount? Can you stat the file from the mount?
> Also, the report that the image is only 64M makes me think about sharding, as
> the default shard size is 64M.
> Do you have any clues on when this issue started to happen? Was there any
> operation done on the Gluster cluster?
- I had created the gluster volumes within an hour of hitting the problem,
specifically to test the very problem I reported. So it was a "fresh start".
- It booted one or two times, then stopped booting. Once it failed to
boot, the behavior was the same on all 3 nodes: grub2 could not boot the
VM image on any of them.
As for the fuse log, I did see a couple of entries like the ones below
before it happened the first time. I'm not sure whether they are a clue or not.
[2021-01-25 22:48:19.310467] I [fuse-bridge.c:5777:fuse_graph_sync] 0-fuse: switched to graph 0
[2021-01-25 22:50:09.693958] E [fuse-bridge.c:227:check_and_dump_fuse_W] (--> /usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x17a)[0x7f914e346faa] (--> /usr/lib64/glusterfs/7.2/xlator/mount/fuse.so(+0x874a)[0x7f914a3d374a] (--> /usr/lib64/glusterfs/7.2/xlator/mount/fuse.so(+0x91cb)[0x7f914a3d41cb] (--> /lib64/libpthread.so.0(+0x84f9)[0x7f914cf184f9] (--> /lib64/libc.so.6(clone+0x3f)[0x7f914c76afbf] ))))) 0-glusterfs-fuse: writing to fuse device failed: No such file or directory
[2021-01-25 22:50:09.694462] E [fuse-bridge.c:227:check_and_dump_fuse_W] (--> /usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x17a)[0x7f914e346faa] (--> /usr/lib64/glusterfs/7.2/xlator/mount/fuse.so(+0x874a)[0x7f914a3d374a] (--> /usr/lib64/glusterfs/7.2/xlator/mount/fuse.so(+0x91cb)[0x7f914a3d41cb] (--> /lib64/libpthread.so.0(+0x84f9)[0x7f914cf184f9] (--> /lib64/libc.so.6(clone+0x3f)[0x7f914c76afbf] ))))) 0-glusterfs-fuse: writing to fuse device failed: No such file or directory
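Next time it is in the failed state, I will also stat the image from the
fuse mount and compare that with what qemu-img reports, roughly like this
(the image filename is just a placeholder from my setup):

  stat /adminvm/images/adminvm.img           # size/blocks as the fuse client sees them
  qemu-img info /adminvm/images/adminvm.img  # virtual size qemu presents to the guest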
I have reserved the test system again. My plans today are:
- Start over with the gluster volume on the machine with sles15sp2
updates
- Learn whether other modifications to the image (besides mapping its
filesystems with kpartx and mounting/unmounting them, which forces it to
work) also clear the problem. For example, what if I add or remove a byte
at the end of the image file?
- Revert the setup to sles15sp2 with no updates (re-making the gluster
volume in the process). My theory is that the updates are not making a
difference and it's just random chance.
- The 64MB shard size made me think too!! (See the shard-check commands
right after this list.)
- If the team feels it is worth it, I could try a newer gluster. We're
using the version we've validated at scale on large clusters in the
factory, but if the team thinks I should try something else I'm happy to
re-build it!!! We are at 7.2 plus the afr-event-gen-changes patch.
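For the shard check mentioned above, this is roughly what I have in mind
(the image filename is a placeholder, and the gfid-based count reflects my
understanding of how sharding lays the pieces out on the bricks):

  # confirm the shard block size the volume is really using
  gluster volume get adminvm features.shard-block-size

  # find the image's gfid via the fuse mount ...
  getfattr -n glusterfs.gfid.string /adminvm/images/adminvm.img

  # ... then count its shards on one of the bricks
  ls /data/brick_adminvm/.shard | grep <gfid> | wc -l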
I will keep a better eye on the fuse log to tie an error to the problem
starting.
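If I have the log layout right, that should just be a matter of running
something like this while reproducing (the log name is derived from the
mount point, so adjust if yours differs):

  tail -F /var/log/glusterfs/adminvm.log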
THANKS AGAIN for responding and let me know if you have any more
clues!
Erik
>
> On Tue, Jan 26, 2021 at 2:40 AM Erik Jacobson <erik.jacobson at hpe.com> wrote:
>
> Hello all. Thanks again for gluster. We're having a strange problem
> getting virtual machines started that are hosted on a gluster volume.
>
> One of the ways we use gluster now is to make an HA-ish cluster head
> node. A virtual machine runs on the shared storage and is backed by 3
> physical servers that contribute to the gluster storage share.
>
> We're using sharding in this volume. The VM image file is around 5T and
> we use qemu-img with falloc to get all the blocks allocated in advance.
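>
> For reference, the preallocation step looks roughly like this (the image
> filename is just a placeholder for this example):
>
> qemu-img create -f raw -o preallocation=falloc /adminvm/images/adminvm.img 5T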
>
> We are not using gfapi largely because it would mean we have to build
> our own libvirt and qemu and we'd prefer not to do that. So we're using
> a glusterfs fuse mount to host the image. The virtual machine is using
> virtio disks but we had similar trouble using scsi emulation.
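>
> The fuse mount itself is nothing special; something like the following,
> assuming /adminvm is the mount point (I'm writing that part from memory):
>
> mount -t glusterfs 172.23.255.151:/adminvm /adminvm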
>
> The issue: at first all seems well; the VM head node installs, boots, etc.
>
> However, at some point, it stops being able to boot! grub2 acts like it
> cannot find /boot. At the grub2 prompt, it can see the partitions, but
> reports no filesystem found where there are indeed filesystems.
>
> If we switch qemu to use "direct kernel load" (bypass grub2), this often
> works around the problem but in one case Linux gave us a clue. Linux
> reported /dev/vda as only being 64 megabytes, which would explain a lot.
> This means the Linux running in the virtual machine thought the disk
> supplied by the disk image was tiny: 64M instead of 5T!
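>
> To double-check what size the guest really sees, something like this from
> inside the VM should confirm it (just standard tools, not necessarily what
> we looked at first):
>
> lsblk /dev/vda                   # should report ~5T, not 64M
> blockdev --getsize64 /dev/vda    # disk size in bytes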
>
> We are using sles15sp2 and hit the problem more often with updates
> applied than without. I'm in the process of trying to isolate if there
> is a sles15sp2 update causing this, or if we're within "random chance".
>
> On one of the physical nodes, when it is in the failure mode, if I use
> 'kpartx' to create the partition mappings from the image file, then mount
> the giant root filesystem (i.e. mount /dev/mapper/loop0p31 /mnt) and then
> umount /mnt, that physical node starts the VM fine: grub2 loads and the
> virtual machine is fully happy! Until I try to shut it down and start it
> up again, at which point it sticks at grub2 again! What about mounting
> the image file makes it so qemu sees the whole disk?
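>
> For completeness, the workaround sequence on a physical node is roughly
> this (the image filename is a placeholder; the partition number is from
> our image layout):
>
> kpartx -av /adminvm/images/adminvm.img   # map the partitions in the image
> mount /dev/mapper/loop0p31 /mnt          # mount the giant root filesystem
> umount /mnt
> kpartx -dv /adminvm/images/adminvm.img   # tear the mappings back down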
>
> The problem doesn't always happen, but once it starts, the same VM image
> has trouble starting on any of the 3 physical nodes sharing the storage.
> However, after using the trick of force-mounting the root filesystem
> within the image with kpartx, the machine can come up. My only guess is
> that this changes the file just a tiny bit somewhere in the middle of the
> image.
>
> Once the problem starts, it keeps happening, except for temporarily
> working after I do the loop-mount trick on the physical admin node.
>
>
> Here is some info about what I have in place:
>
>
> nano-1:/adminvm/images # gluster volume info
>
> Volume Name: adminvm
> Type: Replicate
> Volume ID: 67de902c-8c00-4dc9-8b69-60b93b5f6104
> Status: Started
> Snapshot Count: 0
> Number of Bricks: 1 x 3 = 3
> Transport-type: tcp
> Bricks:
> Brick1: 172.23.255.151:/data/brick_adminvm
> Brick2: 172.23.255.152:/data/brick_adminvm
> Brick3: 172.23.255.153:/data/brick_adminvm
> Options Reconfigured:
> performance.client-io-threads: on
> nfs.disable: on
> storage.fips-mode-rchecksum: on
> transport.address-family: inet
> performance.quick-read: off
> performance.read-ahead: off
> performance.io-cache: off
> performance.low-prio-threads: 32
> network.remote-dio: enable
> cluster.eager-lock: enable
> cluster.quorum-type: auto
> cluster.server-quorum-type: server
> cluster.data-self-heal-algorithm: full
> cluster.locking-scheme: granular
> cluster.shd-max-threads: 8
> cluster.shd-wait-qlength: 10000
> features.shard: on
> user.cifs: off
> cluster.choose-local: off
> client.event-threads: 4
> server.event-threads: 4
> cluster.granular-entry-heal: enable
> storage.owner-uid: 439
> storage.owner-gid: 443
>
>
>
>
> libglusterfs0-7.2-4723.1520.210122T1700.a.sles15sp2hpe.x86_64
> glusterfs-7.2-4723.1520.210122T1700.a.sles15sp2hpe.x86_64
> python3-gluster-7.2-4723.1520.210122T1700.a.sles15sp2hpe.noarch
>
>
>
> nano-1:/adminvm/images # uname -a
> Linux nano-1 5.3.18-24.46-default #1 SMP Tue Jan 5 16:11:50 UTC 2021
> (4ff469b) x86_64 x86_64 x86_64 GNU/Linux
> nano-1:/adminvm/images # rpm -qa | grep qemu-4
> qemu-4.2.0-9.4.x86_64
>
>
>
> Would love any advice!!!!
>
>
> Erik
> --
> Respectfully
> Mahdi