[Gluster-users] XFS corruption reported by QEMU virtual machine with image hosted on gluster

Andreas Schwibbe a.schwibbe at gmx.net
Mon Oct 14 09:33:39 UTC 2024


Hey Erik,

I am running a similar setup with no issues, with Ubuntu host systems
on HPE DL380 Gen 10 servers.
Until recently I ran libvirt/qemu via nfs-ganesha on top of gluster,
flawlessly.
I have since switched to the native GFAPI implementation, which is
poorly documented, with configuration snippets scattered all over the
internet.

Although I cannot offer a direct solution to your issue, I suggest
trying either nfs-ganesha or GFAPI as a replacement for the fuse
mount.
Happy to share libvirt/GFAPI config hints to make it happen; a minimal
example is below.
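
A minimal GFAPI disk stanza looks roughly like this (volume name and
host taken from your volume info below; 24007 is the default glusterd
port; target/bus are illustrative):

    <disk type='network' device='disk'>
      <driver name='qemu' type='raw' cache='none'/>
      <source protocol='gluster' name='adminvm/images/adminvm.img'>
        <host name='172.23.254.181' port='24007'/>
      </source>
      <target dev='vda' bus='virtio'/>
    </disk>

Depending on the distro you may also need "server.allow-insecure on"
on the volume (and rpc-auth-allow-insecure in glusterd.vol) so that an
unprivileged qemu process is allowed to talk to the bricks.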

Best
A.

On Sunday, 13.10.2024 at 21:59 +0000, Jacobson, Erik wrote:
> Hello all! We are experiencing a strange problem with QEMU virtual
> machines where the virtual machine image is hosted on a gluster
> volume, accessed via the fuse mount. (Our GFAPI attempt failed; it
> doesn’t seem to work properly with our current QEMU/distro/gluster
> versions.) We have the volume tuned for ‘virt’.
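> 
> In other words, the setup amounts to something like this (mount point
> assumed from the image path used below; any of the three servers
> works as the mount source):
>     gluster volume set adminvm group virt
>     mount -t glusterfs 172.23.254.181:/adminvm /adminvm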
>  
> So we use qemu-img to create a raw image (sparse or falloc
> preallocation give equal results). We start a virtual machine
> (libvirt, qemu-kvm), with libvirt/qemu pointing at the image file we
> created on the fuse mount.
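> 
> The disk is attached with an ordinary file-backed libvirt stanza
> along these lines (driver and target attributes are illustrative, not
> an exact copy of our domain XML):
>     <disk type='file' device='disk'>
>       <driver name='qemu' type='raw' cache='none' io='native'/>
>       <source file='/adminvm/images/adminvm.img'/>
>       <target dev='sda' bus='scsi'/>
>     </disk>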
>  
> When we create partitions and filesystems – like you might do for
> installing an operating system – all is well at first. This includes
> a root XFS filesystem.
>  
> When we try to re-make the XFS filesystem over the old one, it will
> not mount and will report XFS corruption.
> If you dig in with xfs_repair, you can find a UUID mismatch between
> the superblock and the log. The log always retains the UUID of the
> original filesystem (the one we tried to replace). Running xfs_repair
> doesn’t truly repair anything; it just reports more corruption.
> Forcing xfs_db to remake the log doesn’t help either.
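> 
> For concreteness, the standard tools for that inspection look
> something like this (device name as in the repro steps below):
>     xfs_db -c uuid /dev/sda1   # print the superblock UUID
>     xfs_repair -n /dev/sda1    # dry run; this is where the mismatch shows
>     xfs_repair -L /dev/sda1    # zero (and thereby remake) the log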
>  
> We can duplicate this even with a QEMU raw image of only 50
> megabytes. As far as we can tell, XFS is the only filesystem showing
> this behavior, or at least the only one reporting a problem.
>  
> If we take QEMU out of the picture, create the partitions directly
> on the QEMU raw image file, use kpartx to create devices for the
> partitions, and run a similar test, the gluster-hosted image behaves
> as you would expect and XFS reports no problem. We can’t duplicate
> the problem outside of QEMU.
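> 
> That host-side test went something like this (the loop device name
> will vary):
>     sgdisk --clear --new=1:0:0 /adminvm/images/adminvm.img
>     kpartx -av /adminvm/images/adminvm.img   # maps e.g. /dev/mapper/loop0p1
>     mkfs.xfs -L fs1 /dev/mapper/loop0p1
>     mount /dev/mapper/loop0p1 /a && umount /a
>     mkfs.xfs -f -L fs1 /dev/mapper/loop0p1   # re-make the filesystem
>     mount /dev/mapper/loop0p1 /a             # mounts fine this way
>     umount /a && kpartx -dv /adminvm/images/adminvm.img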
>  
> We have observed the issue with Rocky 9.4 and SLES15 SP5 environments
> (including the matching QEMU versions). We have not tested more
> distros yet.
>  
> We observed the problem originally with Gluster 9.3. We reproduced it
> with Gluster 9.6 and 10.5.
>  
> If we switch from QEMU RAW to QCOW2, the problem disappears.
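> 
> For that comparison the image can simply be recreated in the other
> format, with the libvirt driver type changed to match, e.g.:
>     qemu-img create -f qcow2 /adminvm/images/adminvm.img 50M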
>  
> The problem is not reproduced when we take gluster out of the
> equation (meaning, pointing QEMU at a local disk image instead of a
> gluster-hosted one – that works fine).
>  
> The problem can be reproduced this way:
> * Assume /adminvm/images on a gluster sharded volume
> * rm /adminvm/images/adminvm.img
> * qemu-img create -f raw /adminvm/images/adminvm.img 50M
>  
> Now start the virtual machine that refers to the above adminvm.img
> file
> * Boot up a rescue environment or a live mode or similar
> * sgdisk --zap-all /dev/sda
> * sgdisk --set-alignment=4096 --clear /dev/sda
> * sgdisk --set-alignment=4096 --new=1:0:0 /dev/sda
> * mkfs.xfs -L fs1 /dev/sda1
> * mkdir -p /a
> * mount /dev/sda1 /a
> * umount /a
> * # MAKE same FS again:
> * mkfs.xfs -f -L fs1 /dev/sda1
> * mount /dev/sda1 /a
> * This will fail, with kernel backtraces and XFS corruption reported
> * xfs_repair will report the log vs. superblock UUID mismatch I
> mentioned above
>  
> Here are the volume settings:
>  
> # gluster volume info adminvm
>  
> Volume Name: adminvm
> Type: Replicate
> Volume ID: de655913-aad9-4e17-bac4-ff0ad9c28223
> Status: Started
> Snapshot Count: 0
> Number of Bricks: 1 x 3 = 3
> Transport-type: tcp
> Bricks:
> Brick1: 172.23.254.181:/data/brick_adminvm_slot2
> Brick2: 172.23.254.182:/data/brick_adminvm_slot2
> Brick3: 172.23.254.183:/data/brick_adminvm_slot2
> Options Reconfigured:
> storage.owner-gid: 107
> storage.owner-uid: 107
> performance.io-thread-count: 32
> network.frame-timeout: 10800
> cluster.lookup-optimize: off
> server.keepalive-count: 5
> server.keepalive-interval: 2
> server.keepalive-time: 10
> server.tcp-user-timeout: 20
> network.ping-timeout: 20
> server.event-threads: 4
> client.event-threads: 4
> cluster.choose-local: off
> user.cifs: off
> features.shard: on
> cluster.shd-wait-qlength: 10000
> cluster.shd-max-threads: 8
> cluster.locking-scheme: granular
> cluster.data-self-heal-algorithm: full
> cluster.server-quorum-type: server
> cluster.quorum-type: auto
> cluster.eager-lock: enable
> performance.strict-o-direct: on
> network.remote-dio: disable
> performance.low-prio-threads: 32
> performance.io-cache: off
> performance.read-ahead: off
> performance.quick-read: off
> cluster.granular-entry-heal: enable
> storage.fips-mode-rchecksum: on
> transport.address-family: inet
> nfs.disable: on
> performance.client-io-threads: on
>  
> Any help or ideas would be appreciated. Let us know if we have a
> setting incorrect or have made an error.
>  
> Thank you all!
>  
> Erik
