[Gluster-users] qemu raw image file - qemu and grub2 can't find boot content from VM

Amar Tumballi amar at kadalu.io
Wed Jan 27 03:54:29 UTC 2021


Was a volume with existing data converted to a sharded volume?

On Wed, Jan 27, 2021 at 5:06 AM Erik Jacobson <erik.jacobson at hpe.com> wrote:

> Shortly after the sharded volume is made, there are some fuse mount
> messages. I'm not 100% sure if this was just before or during the
> big qemu-img command to make the 5T image
> (qemu-img create -f raw -o preallocation=falloc
> /adminvm/images/adminvm.img 5T)
>
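> For scale: with gluster's default 64 MiB shard size, a fully preallocated
> 5T image implies tens of thousands of shard files under /.shard, which is
> consistent with the flood of lookup messages above. Rough arithmetic only
> (a sketch; the numbers come from the qemu-img command above, not from
> gluster itself):

```shell
# Back-of-the-envelope: a 5T image carved into 64 MiB shards.
IMG_BYTES=$((5 * 1024 * 1024 * 1024 * 1024))   # 5T, as in the qemu-img create
SHARD_BYTES=$((64 * 1024 * 1024))              # gluster's default shard size
NUM_SHARDS=$((IMG_BYTES / SHARD_BYTES))        # shard files for the full image
echo "$NUM_SHARDS shards"                      # prints: 81920 shards
```

> Each of those shards gets its own lookup/create on first access, which is
> why the log fills up during the falloc preallocation.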
>
> (from /var/log/glusterfs/adminvm.log)
> [2021-01-26 19:18:21.287697] I [fuse-bridge.c:5166:fuse_init]
> 0-glusterfs-fuse: FUSE inited with protocol versions: glusterfs 7.24 kernel
> 7.31
> [2021-01-26 19:18:21.287719] I [fuse-bridge.c:5777:fuse_graph_sync]
> 0-fuse: switched to graph 0
> [2021-01-26 19:18:23.945566] W [MSGID: 114031]
> [client-rpc-fops_v2.c:2633:client4_0_lookup_cbk] 0-adminvm-client-2: remote
> operation failed. Path: /.shard/0cb55720-2288-46c2-bd7e-5d9bd23b40bd.7
> (00000000-0000-0000-0000-000000000000) [No data available]
> [2021-01-26 19:18:54.089721] W [MSGID: 114031]
> [client-rpc-fops_v2.c:2633:client4_0_lookup_cbk] 0-adminvm-client-0: remote
> operation failed. Path: /.shard/0cb55720-2288-46c2-bd7e-5d9bd23b40bd.85
> (00000000-0000-0000-0000-000000000000) [No data available]
> [2021-01-26 19:18:54.089784] W [MSGID: 114031]
> [client-rpc-fops_v2.c:2633:client4_0_lookup_cbk] 0-adminvm-client-1: remote
> operation failed. Path: /.shard/0cb55720-2288-46c2-bd7e-5d9bd23b40bd.85
> (00000000-0000-0000-0000-000000000000) [No data available]
> [2021-01-26 19:18:55.048613] W [MSGID: 114031]
> [client-rpc-fops_v2.c:2633:client4_0_lookup_cbk] 0-adminvm-client-1: remote
> operation failed. Path: /.shard/0cb55720-2288-46c2-bd7e-5d9bd23b40bd.88
> (00000000-0000-0000-0000-000000000000) [No data available]
> [2021-01-26 19:18:55.355131] W [MSGID: 114031]
> [client-rpc-fops_v2.c:2633:client4_0_lookup_cbk] 0-adminvm-client-0: remote
> operation failed. Path: /.shard/0cb55720-2288-46c2-bd7e-5d9bd23b40bd.89
> (00000000-0000-0000-0000-000000000000) [No data available]
> [2021-01-26 19:18:55.981094] W [MSGID: 114031]
> [client-rpc-fops_v2.c:2633:client4_0_lookup_cbk] 0-adminvm-client-0: remote
> operation failed. Path: /.shard/0cb55720-2288-46c2-bd7e-5d9bd23b40bd.91
> (00000000-0000-0000-0000-000000000000) [No data available]
> ......
>
>
> Towards the end of the qemu-img create command (or just after; it's hard
> to tell), these messages showed up in adminvm.log. I've supplied just
> the first few; there were many:
>
>
> [2021-01-26 19:28:40.652898] W [MSGID: 101159]
> [inode.c:1212:__inode_unlink] 0-inode:
> be318638-e8a0-4c6d-977d-7a937aa84806/48bb5288-e27e-46c9-9f7c-944a804df361.1:
> dentry not found in 48bb5288-e27e-46c9-9f7c-944a804df361
> [2021-01-26 19:28:40.652975] W [MSGID: 101159]
> [inode.c:1212:__inode_unlink] 0-inode:
> be318638-e8a0-4c6d-977d-7a937aa84806/931508ed-9368-4982-a53e-7187a9f0c1f9.3:
> dentry not found in 931508ed-9368-4982-a53e-7187a9f0c1f9
> [2021-01-26 19:28:40.653047] W [MSGID: 101159]
> [inode.c:1212:__inode_unlink] 0-inode:
> be318638-e8a0-4c6d-977d-7a937aa84806/e808ecab-2e70-4ef3-954e-ce1b78ed8b52.4:
> dentry not found in e808ecab-2e70-4ef3-954e-ce1b78ed8b52
> [2021-01-26 19:28:40.653102] W [MSGID: 101159]
> [inode.c:1212:__inode_unlink] 0-inode:
> be318638-e8a0-4c6d-977d-7a937aa84806/2c62c383-d869-4655-9c03-f08a86a874ba.6:
> dentry not found in 2c62c383-d869-4655-9c03-f08a86a874ba
> [2021-01-26 19:28:40.653169] W [MSGID: 101159]
> [inode.c:1212:__inode_unlink] 0-inode:
> be318638-e8a0-4c6d-977d-7a937aa84806/556ffbc9-bcbe-445a-93f5-13784c5a6df1.2:
> dentry not found in 556ffbc9-bcbe-445a-93f5-13784c5a6df1
> [2021-01-26 19:28:40.653218] W [MSGID: 101159]
> [inode.c:1212:__inode_unlink] 0-inode:
> be318638-e8a0-4c6d-977d-7a937aa84806/5d414e7c-335d-40da-bb96-6c427181338b.5:
> dentry not found in 5d414e7c-335d-40da-bb96-6c427181338b
> [2021-01-26 19:28:40.653314] W [MSGID: 101159]
> [inode.c:1212:__inode_unlink] 0-inode:
> be318638-e8a0-4c6d-977d-7a937aa84806/43364dc9-2d8e-4fca-89d2-e11dee6fcfd4.8:
> dentry not found in 43364dc9-2d8e-4fca-89d2-e11dee6fcfd4
> .....
>
>
> So now I installed Linux into a VM using the above as the VM image.
> There were no additional fuse messages while the admin VM was being
> installed with our installer (via qemu on the same physical node where
> the above messages appeared, the same node where I ran qemu-img create).
>
> Rebooted the virtual machine and it booted fine. No new messages in the
> fuse log. So now it's officially booted. This was a 'reboot', so qemu
> didn't restart.
>
> Halted the VM with 'halt', then in virt-manager did a forced shut down.
>
> Started the VM from scratch.
>
> Still no new messages and it booted fine.
>
> Powered off a physical node and brought it back, still fine.
> Reset all physical nodes and brought them back, still fine.
>
> I am unable to trigger this problem. However, once it starts to go bad,
> it stays bad, and stays bad across all the physical nodes. The trick of
> using kpartx to mount the root filesystem from within the image and then
> unmount it is only a temporary fix that doesn't persist beyond one boot
> once we're in the bad state.
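> For anyone following along, the temporary workaround looks roughly like
> this (a sketch, assuming the image path and partition numbering from
> earlier in the thread; the commands are guarded so they only run where
> the tools and image actually exist):

```shell
IMG=/adminvm/images/adminvm.img
if command -v kpartx >/dev/null 2>&1 && [ -f "$IMG" ]; then
  kpartx -av "$IMG"                # map the image's partitions to /dev/mapper/loop0pN
  mount /dev/mapper/loop0p31 /mnt  # mount the VM's root filesystem from the image
  umount /mnt                      # immediately unmount it again
  kpartx -dv "$IMG"                # tear the partition mappings back down
  STATUS=applied
else
  STATUS=skipped                   # kpartx or the image isn't on this host
fi
echo "workaround $STATUS"
```

> The loop device name (loop0) and partition number (p31) come from the
> report below and will vary per host.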
>
> So something gets into a bad state and stays that way, but we don't know
> how to cause it to happen at will. I will continue to try to reproduce
> this, as it's causing some huge problems in the field.
>
>
>
>
> On Tue, Jan 26, 2021 at 07:40:19AM -0600, Erik Jacobson wrote:
> > Thank you so much for responding! More below.
> >
> >
> > >  Anything in the logs of the fuse mount? Can you stat the file from
> > >  the mount? Also, the report of the image being only 64M makes me
> > >  think about sharding, as the default shard size is 64M.
> > >  Do you have any clues on when this issue started to happen? Was
> > >  there any operation done to the Gluster cluster?
> >
> >
> > - I had just created the gluster volumes within an hour of the problem,
> >   to test the very problem I reported. So it was a "fresh start".
> >
> > - It booted one or two times, then stopped booting. Once it couldn't
> >   boot, all 3 nodes were the same in that grub2 couldn't boot in the VM
> >   image.
> >
> > As for the fuse log, I did see a couple of these before it happened the
> > first time. I'm not sure if it's a clue or not.
> >
> > [2021-01-25 22:48:19.310467] I [fuse-bridge.c:5777:fuse_graph_sync]
> 0-fuse: switched to graph 0
> > [2021-01-25 22:50:09.693958] E [fuse-bridge.c:227:check_and_dump_fuse_W]
> (--> /usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x17a)[0x7f914e346faa]
> (--> /usr/lib64/glusterfs/7.2/xlator/mount/fuse.so(+0x874a)[0x7f914a3d374a]
> (--> /usr/lib64/glusterfs/7.2/xlator/mount/fuse.so(+0x91cb)[0x7f914a3d41cb]
> (--> /lib64/libpthread.so.0(+0x84f9)[0x7f914cf184f9] (-->
> /lib64/libc.so.6(clone+0x3f)[0x7f914c76afbf] ))))) 0-glusterfs-fuse:
> writing to fuse device failed: No such file or directory
> > [2021-01-25 22:50:09.694462] E [fuse-bridge.c:227:check_and_dump_fuse_W]
> (--> /usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x17a)[0x7f914e346faa]
> (--> /usr/lib64/glusterfs/7.2/xlator/mount/fuse.so(+0x874a)[0x7f914a3d374a]
> (--> /usr/lib64/glusterfs/7.2/xlator/mount/fuse.so(+0x91cb)[0x7f914a3d41cb]
> (--> /lib64/libpthread.so.0(+0x84f9)[0x7f914cf184f9] (-->
> /lib64/libc.so.6(clone+0x3f)[0x7f914c76afbf] ))))) 0-glusterfs-fuse:
> writing to fuse device failed: No such file or directory
> >
> >
> >
> > I have reserved the test system again. My plans today are:
> >  - Start over with the gluster volume on the machine with sles15sp2
> >    updates
> >
> >  - Learn if there are other modifications to the image that clear the
> >    bad state (besides mounting/umounting filesystems within the image,
> >    using kpartx to map them, to force it to work). What if I add/remove
> >    a byte from the end of the image file, for example?
> >
> >  - Revert the setup to sles15sp2 with no updates. My theory is the
> >    updates are not making a difference and it's just random chance.
> >    (re-making the gluster volume in the process)
> >
> >  - The 64MB shard size made me think too!!
> >
> >  - If the team feels it is worth it, I could try a newer gluster. We're
> >    using the versions we've validated at scale when we have large
> >    clusters in the factory but if the team thinks I should try something
> >    else I'm happy to re-build it!!!  We are @ 7.2 plus the
> >    afr-event-gen-changes patch.
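> > The add/remove-a-byte experiment from the plan above could be sketched
> > as follows (hypothetical; truncate grows and shrinks the file in place,
> > so this should only ever be run against a scratch copy, never the live
> > image):

```shell
IMG=/adminvm/images/adminvm.img     # use a scratch copy, not the live image
if [ -f "$IMG" ]; then
  truncate -s +1 "$IMG"   # grow the file by one byte past the original end
  truncate -s -1 "$IMG"   # shrink it back to the original size
  PROBED=yes
else
  PROBED=no               # nothing to probe on this host
fi
echo "probe: $PROBED"
```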
> >
> > I will keep a better eye on the fuse log to tie an error to the problem
> > starting.
> >
> >
> > THANKS AGAIN for responding and let me know if you have any more
> > clues!
> >
> > Erik
> >
> >
> > >
> > > On Tue, Jan 26, 2021 at 2:40 AM Erik Jacobson <erik.jacobson at hpe.com>
> wrote:
> > >
> > >     Hello all. Thanks again for gluster. We're having a strange problem
> > >     getting virtual machines started that are hosted on a gluster
> volume.
> > >
> > >     One of the ways we use gluster now is to make an HA-ish cluster
> > >     head node. A virtual machine runs in the shared storage and is
> > >     backed by 3 physical servers that contribute to the gluster
> > >     storage share.
> > >
> > >     We're using sharding in this volume. The VM image file is around
> > >     5T, and we use qemu-img with falloc to get all the blocks
> > >     allocated in advance.
> > >
> > >     We are not using gfapi, largely because it would mean we have to
> > >     build our own libvirt and qemu, and we'd prefer not to do that.
> > >     So we're using a glusterfs fuse mount to host the image. The
> > >     virtual machine is using virtio disks, but we had similar trouble
> > >     using scsi emulation.
> > >
> > >     The issue: all seems well at first; the VM head node installs,
> > >     boots, etc.
> > >
> > >     However, at some point, it stops being able to boot! grub2 acts
> > >     like it cannot find /boot. At the grub2 prompt, it can see the
> > >     partitions, but reports no filesystem found where there are
> > >     indeed filesystems.
> > >
> > >     If we switch qemu to use "direct kernel load" (bypass grub2),
> > >     this often works around the problem, but in one case Linux gave
> > >     us a clue. Linux reported /dev/vda as only being 64 megabytes,
> > >     which would explain a lot. This means the virtual machine's Linux
> > >     thought the disk supplied by the disk image was tiny: 64M instead
> > >     of 5T!
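> > >     One way to see where a 64M figure could come from: with sharding,
> > >     only the first 64 MiB of data lives in the base file on the
> > >     bricks, and the shard translator records the real size in an
> > >     xattr on that base file. A diagnostic sketch (the brick-side path
> > >     is my guess at where /adminvm lands under the brick directory;
> > >     both checks are guarded so they only run where the files exist):

```shell
# Compare what the fuse mount reports vs. what a brick's base file holds.
FUSE_IMG=/adminvm/images/adminvm.img              # via the glusterfs fuse mount
BRICK_IMG=/data/brick_adminvm/images/adminvm.img  # hypothetical brick-side path
for f in "$FUSE_IMG" "$BRICK_IMG"; do
  [ -f "$f" ] && stat -c '%s bytes  %n' "$f"   # fuse: ~5T; brick base file: ~64M
done
# The shard translator keeps the true file size in an xattr on the base file:
if command -v getfattr >/dev/null 2>&1 && [ -f "$BRICK_IMG" ]; then
  getfattr -n trusted.glusterfs.shard.file-size -e hex "$BRICK_IMG"
fi
CHECK_DONE=yes
```

> > >     If the guest ever sees only the base-file size instead of the
> > >     xattr-tracked size, a 64M disk is exactly what you'd expect.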
> > >
> > >     We are using sles15sp2 and hit the problem more often with
> > >     updates applied than without. I'm in the process of trying to
> > >     isolate whether there is a sles15sp2 update causing this, or if
> > >     we're within "random chance".
> > >
> > >     On one of the physical nodes, if it is in the failure mode, if I
> > >     use 'kpartx' to create the partitions from the image file, then
> > >     mount the giant root filesystem (i.e. mount /dev/mapper/loop0p31
> > >     /mnt) and then umount /mnt, that physical node starts the VM
> > >     fine: grub2 loads and the virtual machine is fully happy!  Until
> > >     I try to shut it down and start it up again, at which point it
> > >     sticks at grub2 again! What about mounting the image file makes
> > >     it so qemu sees the whole disk?
> > >
> > >     The problem doesn't always happen, but once it starts, the same
> > >     VM image has trouble starting on any of the 3 physical nodes
> > >     sharing the storage. But using the trick to force-mount the root
> > >     within the image with kpartx, the machine can come up. My only
> > >     guess is this changes the file just a tiny bit in the middle of
> > >     the image.
> > >
> > >     Once the problem starts, it keeps happening, except temporarily
> > >     working when I do the loop mount trick on the physical admin node.
> > >
> > >
> > >     Here is some info about what I have in place:
> > >
> > >
> > >     nano-1:/adminvm/images # gluster volume info
> > >
> > >     Volume Name: adminvm
> > >     Type: Replicate
> > >     Volume ID: 67de902c-8c00-4dc9-8b69-60b93b5f6104
> > >     Status: Started
> > >     Snapshot Count: 0
> > >     Number of Bricks: 1 x 3 = 3
> > >     Transport-type: tcp
> > >     Bricks:
> > >     Brick1: 172.23.255.151:/data/brick_adminvm
> > >     Brick2: 172.23.255.152:/data/brick_adminvm
> > >     Brick3: 172.23.255.153:/data/brick_adminvm
> > >     Options Reconfigured:
> > >     performance.client-io-threads: on
> > >     nfs.disable: on
> > >     storage.fips-mode-rchecksum: on
> > >     transport.address-family: inet
> > >     performance.quick-read: off
> > >     performance.read-ahead: off
> > >     performance.io-cache: off
> > >     performance.low-prio-threads: 32
> > >     network.remote-dio: enable
> > >     cluster.eager-lock: enable
> > >     cluster.quorum-type: auto
> > >     cluster.server-quorum-type: server
> > >     cluster.data-self-heal-algorithm: full
> > >     cluster.locking-scheme: granular
> > >     cluster.shd-max-threads: 8
> > >     cluster.shd-wait-qlength: 10000
> > >     features.shard: on
> > >     user.cifs: off
> > >     cluster.choose-local: off
> > >     client.event-threads: 4
> > >     server.event-threads: 4
> > >     cluster.granular-entry-heal: enable
> > >     storage.owner-uid: 439
> > >     storage.owner-gid: 443
> > >
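> > >     Aside: the reconfigured options above closely resemble gluster's
> > >     stock "virt" group profile for VM-image workloads. On a fresh
> > >     volume that profile can be applied in one step (a sketch; assumes
> > >     the profile file ships with your gluster build under
> > >     /var/lib/glusterd/groups, and is guarded so it only runs where
> > >     the gluster CLI exists):

```shell
VOL=adminvm
if command -v gluster >/dev/null 2>&1; then
  gluster volume set "$VOL" group virt   # apply the bundled virt option profile
  APPLIED=yes
else
  APPLIED=no                             # gluster CLI not on this host
fi
echo "virt profile applied: $APPLIED"
```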
> > >
> > >
> > >
> > >     libglusterfs0-7.2-4723.1520.210122T1700.a.sles15sp2hpe.x86_64
> > >     glusterfs-7.2-4723.1520.210122T1700.a.sles15sp2hpe.x86_64
> > >     python3-gluster-7.2-4723.1520.210122T1700.a.sles15sp2hpe.noarch
> > >
> > >
> > >
> > >     nano-1:/adminvm/images # uname -a
> > >     Linux nano-1 5.3.18-24.46-default #1 SMP Tue Jan 5 16:11:50 UTC
> > >     2021 (4ff469b) x86_64 x86_64 x86_64 GNU/Linux
> > >     nano-1:/adminvm/images # rpm -qa | grep qemu-4
> > >     qemu-4.2.0-9.4.x86_64
> > >
> > >
> > >
> > >     Would love any advice!!!!
> > >
> > >
> > >     Erik
> > >     ________
> > >
> > >
> > >
> > >     Community Meeting Calendar:
> > >
> > >     Schedule -
> > >     Every 2nd and 4th Tuesday at 14:30 IST / 09:00 UTC
> > >     Bridge: https://meet.google.com/cpu-eiue-hvk
> > >     Gluster-users mailing list
> > >     Gluster-users at gluster.org
> > >     https://lists.gluster.org/mailman/listinfo/gluster-users
> > >
> > >
> > >
> > > --
> > > Respectfully
> > > Mahdi
>


-- 
https://kadalu.io
Container Storage made easy!