[Gluster-users] XFS corruption reported by QEMU virtual machine with image hosted on gluster
Andreas Schwibbe
a.schwibbe at gmx.net
Tue Nov 5 16:12:26 UTC 2024
Erik,
the original problem sounds to me rather like a qemu problem 😕
When using GFAPI with libvirtd have an eye on selinux/apparmor, this is
sometimes really troublesome!
Your config looks exactly like mine, with the exception, that I don't
use scsi.
I host the plain images (raw & qcow) on the gluster volume like
<target dev="vda" bus="virtio"/>
A.
Am Montag, dem 14.10.2024 um 15:57 +0000 schrieb Jacobson, Erik:
> First a heartfelt thanks for writing back.
>
> In a solution (not having this issue) we do use nfs-ganesha to host
> filesystem squashfs root FS objects to compute nodes. It is working
> great. We also have fuse-through-LIO.
>
> The solution here is 3 servers making up with cluster admin node.
>
> The XFS issue is only observed when we try to replace an existing one
> with another XFS on top, and only with RAW, and only inside the VM.
> So it isn’t like data is being corrupted. However, it’s hard to
> replace a filesystem with another like you would do if you re-install
> one of what may be several operating systems on that disk image.
>
> I am interested in your GFAPI information. I rebuilt RHEL9.4 qemu and
> changed the spec file to produce the needed gluster block package,
> and referred to the image file via the gluster protocol. My system
> got horrible scsi errors and sometimes didn’t even boot from a live
> environment. I repeated the same failure with sles15. I did this with
> a direct setup (not volumes/pools/etc).
>
> I could experiment with Ubuntu if needed so that was a good data
> point.
>
> I am interested in your setup to see what I may have missed. If I
> simply made a mistake configuring GFAPI that would be welcome news.
>
> <devices>
> <emulator>/usr/libexec/qemu-kvm</emulator>
> <disk type='network' device='disk'>
> <driver name='qemu' type='raw' cache='none'/>
> <source protocol='gluster' name='adminvm/images/adminvm.img'
> index='2'>
> <host name='localhost' port='24007'/>
> </source>
> <backingStore/>
> <target dev='sdh' bus='scsi'/>
> <alias name='scsi1-0-0-0'/>
> <address type='drive' controller='1' bus='0' target='0'
> unit='0'/>
> </disk>
>
> From:Gluster-users <gluster-users-bounces at gluster.org> on behalf of
> Andreas Schwibbe <a.schwibbe at gmx.net>
> Date: Monday, October 14, 2024 at 4:34 AM
> To: gluster-users at gluster.org <gluster-users at gluster.org>
> Subject: Re: [Gluster-users] XFS corruption reported by QEMU virtual
> machine with image hosted on gluster
> Hey Erik,
>
> I am running a similar setup with no issues having Ubuntu Host
> Systems on HPE DL380 Gen 10.
> I however used to run libvirt/qemu via nfs-ganesha on top of gluster
> flawlessly.
> Recently I upgraded to the native GFAPI implementation, which is
> poorly documented with snippets all over the internet.
>
> Although I cannot provide a direct solution for your issue, I am
> however suggesting to try either nfs-ganesha as a replacement for
> fuse mount or GFAPI.
> Happy to share libvirt/GFAPI config hints to make it happen.
>
> Best
> A.
>
> Am Sonntag, dem 13.10.2024 um 21:59 +0000 schrieb Jacobson, Erik:
> > Hello all! We are experiencing a strange problem with QEMU virtual
> > machines where the virtual machine image is hosted on a gluster
> > volume. Access via fuse. (Our GFAPI attempt failed, it doesn’t seem
> > to work properly with current QEMU/distro/gluster). We have the
> > volume tuned for ‘virt’.
> >
> > So we use qemu-img to create a raw image. You can use sparse or
> > falloc with equal results. We start a virtual machine (libvirt,
> > qemu-kvm) and libvirt/qemu points to the fuse mount with the QEMU
> > image file we created.
> >
> > When we create partitions and filesystems – like you might do for
> > installing an operating system – all is well at first. This
> > includes a root XFS filesystem.
> >
> > When we try to re-make the XFS filesystem over the old one, it will
> > not mount and will report XFS corruption.
> > If you dig into XFS repair, you can find a UUID mismatch between
> > the superblock and the log. The log always retains the UUID of the
> > original filesystem (the one we tried to replace). Running
> > xfs_repair doesn’t truly repair, it just reports more corruption.
> > xfs_db forcing to remake the log doesn’t help.
> >
> > We can duplicate this with even a QEMU raw image of 50 megabytes.
> > As far as we can tell, XFS is the only filesystem showing this
> > behavior or at least the only one reporting a problem.
> >
> > If we take QEMU out of the picture and create partitions directly
> > on the QEMU raw image file, then use kpartx to create devices to
> > the partitions, and run a similar test – the gluster-hosted image
> > behaves as you would expect and there is no problem reported by
> > XFS. We can’t duplicate the problem outside of QEMU.
> >
> > We have observed the issue with Rocky 9.4 and SLES15 SP5
> > environments (including the matching QEMU versions). We have not
> > tested more distros yet.
> >
> > We observed the problem originally with Gluster 9.3. We reproduced
> > it with Gluster 9.6 and 10.5.
> >
> > If we switch from QEMU RAW to QCOW2, the problem disappears.
> >
> > The problem is not reproduced when we take gluster out of the
> > equation (meaning, pointing QEMU at a local disk image instead of
> > gluster-hosted one – that works fine).
> >
> > The problem can be reproduced this way:
> > * Assume /adminvm/images on a gluster sharded volume
> > * rm /adminvm/images/adminvm.img
> > * qemu-img create -f raw /adminvm/images/adminvm.img 50M
> >
> > Now start the virtual machine that refers to the above adminvm.img
> > file
> > * Boot up a rescue environment or a live mode or similar
> > * sgdisk --zap-all /dev/sda
> > * sgdisk --set-alignment=4096 --clear /dev/sda
> > * sgdisk --set-alignment=4096 --new=1:0:0 /dev/sda
> > * mkfs.xfs -L fs1 /dev/sda1
> > * mkdir -p /a
> > * mount /dev/sda1 /a
> > * umount /a
> > * # MAKE same FS again:
> > * mkfs.xfs -f -L fs1 /dev/sda1
> > * mount /dev/sda1 /a
> > * This will fail with kernel back traces and corruption reported
> > * xfs_repair will report the log vs superblock UUID mismatch I
> > mentioned
> >
> > Here are the volume settings:
> >
> > # gluster volume info adminvm
> >
> > Volume Name: adminvm
> > Type: Replicate
> > Volume ID: de655913-aad9-4e17-bac4-ff0ad9c28223
> > Status: Started
> > Snapshot Count: 0
> > Number of Bricks: 1 x 3 = 3
> > Transport-type: tcp
> > Bricks:
> > Brick1: 172.23.254.181:/data/brick_adminvm_slot2
> > Brick2: 172.23.254.182:/data/brick_adminvm_slot2
> > Brick3: 172.23.254.183:/data/brick_adminvm_slot2
> > Options Reconfigured:
> > storage.owner-gid: 107
> > storage.owner-uid: 107
> > performance.io-thread-count: 32
> > network.frame-timeout: 10800
> > cluster.lookup-optimize: off
> > server.keepalive-count: 5
> > server.keepalive-interval: 2
> > server.keepalive-time: 10
> > server.tcp-user-timeout: 20
> > network.ping-timeout: 20
> > server.event-threads: 4
> > client.event-threads: 4
> > cluster.choose-local: off
> > user.cifs: off
> > features.shard: on
> > cluster.shd-wait-qlength: 10000
> > cluster.shd-max-threads: 8
> > cluster.locking-scheme: granular
> > cluster.data-self-heal-algorithm: full
> > cluster.server-quorum-type: server
> > cluster.quorum-type: auto
> > cluster.eager-lock: enable
> > performance.strict-o-direct: on
> > network.remote-dio: disable
> > performance.low-prio-threads: 32
> > performance.io-cache: off
> > performance.read-ahead: off
> > performance.quick-read: off
> > cluster.granular-entry-heal: enable
> > storage.fips-mode-rchecksum: on
> > transport.address-family: inet
> > nfs.disable: on
> > performance.client-io-threads: on
> >
> > Any help or ideas would be appreciated. Let us know if we have a
> > setting incorrect or have made an error.
> >
> > Thank you all!
> >
> > Erik
> > ________
> >
> >
> >
> > Community Meeting Calendar:
> >
> > Schedule -
> > Every 2nd and 4th Tuesday at 14:30 IST / 09:00 UTC
> > Bridge: https://meet.google.com/cpu-eiue-hvk
> > Gluster-users mailing list
> > Gluster-users at gluster.org
> > https://lists.gluster.org/mailman/listinfo/gluster-users
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.gluster.org/pipermail/gluster-users/attachments/20241105/01a6c9e6/attachment.html>
More information about the Gluster-users
mailing list