<div dir="ltr">Hi,<div><br></div><div>I am having a problem recently with Gluster disperse volumes and live merge on qemu-kvm.</div><div><br></div><div>I am using Gluster as a storage backend of an oVirt cluster; we are planning to use VM snapshots in the process of taking daily backups on the VMs and we are encountering issues when the VMs are stored in a distributed-disperse volume.</div><div><br></div><div>First of all, I am using gluster 7.5, libvirt 6.0, qemu 4.2 and oVirt 4.4.0 on CentOS 8.1</div><div><br></div><div>The sequence of events is the following:</div><div><br></div><div>1) On a running VM, create a new snapshot</div><div><br></div><div>The operation completes successfully, however I can observe the following errors on the gluster logs:</div><div><br></div><div><font face="monospace">[2020-06-29 21:54:18.942422] I [MSGID: 109066] [dht-rename.c:1951:dht_rename] 0-SSD_Storage-dht: renaming /58e8dff0-3dfd-4554-9999-b8eb7744ce1b/images/998f0b18-1904-47f3-8cfb-a73ad063ab83/64c038a4-5fe4-4f57-8b1c-bab38ae5c5bb.meta.new (a89f2ccb-be41-4ff7-bbaf-abb786e76bc7) (hash=SSD_Storage-disperse-1/cache=SSD_Storage-disperse-1) =&gt; /58e8dff0-3dfd-4554-9999-b8eb7744ce1b/images/998f0b18-1904-47f3-8cfb-a73ad063ab83/64c038a4-5fe4-4f57-8b1c-bab38ae5c5bb.meta (f55c1f35-63fa-4d27-9aa9-09b60163e565) (hash=SSD_Storage-disperse-2/cache=SSD_Storage-disperse-1)  <br>[2020-06-29 21:54:18.947273] W [MSGID: 122019] [ec-helpers.c:401:ec_loc_gfid_check] 0-SSD_Storage-disperse-2: Mismatching GFID&#39;s in loc <br>[2020-06-29 21:54:18.947290] W [MSGID: 109002] [dht-rename.c:1019:dht_rename_links_create_cbk] 0-SSD_Storage-dht: link/file /58e8dff0-3dfd-4554-9999-b8eb7744ce1b/images/998f0b18-1904-47f3-8cfb-a73ad063ab83/64c038a4-5fe4-4f57-8b1c-bab38ae5c5bb.meta on SSD_Storage-disperse-2 failed [Input/output error]<br>[2020-06-29 21:54:19.197482] I [MSGID: 109066] [dht-rename.c:1951:dht_rename] 0-SSD_Storage-dht: renaming /58e8dff0-3dfd-4554-9999-b8eb7744ce1b/images/998f0b18-1904-47f3-8cfb-a73ad063ab83/a54793c1-c804-425d-894e-2dfe7a63af4b.meta.new (b4888032-3758-4f62-a4ae-fb48902f83d2) (hash=SSD_Storage-disperse-4/cache=SSD_Storage-disperse-4) =&gt; /58e8dff0-3dfd-4554-9999-b8eb7744ce1b/images/998f0b18-1904-47f3-8cfb-a73ad063ab83/a54793c1-c804-425d-894e-2dfe7a63af4b.meta ((null)) (hash=SSD_Storage-disperse-4/cache=&lt;nul&gt;)  </font><br></div><div><br></div><div>2) Once the snapshot has been created, try to delete it while the VM is running</div><div><br></div><div>The above seems to be running for a couple of seconds and then suddenly the qemu-kvm process crashes. 
2) Once the snapshot has been created, try to delete it while the VM is running

The delete appears to run for a couple of seconds and then the qemu-kvm process suddenly crashes. In the qemu VM log I can see the following:

Unexpected error in raw_check_lock_bytes() at block/file-posix.c:811:
2020-06-29T21:56:23.933603Z qemu-kvm: Failed to get shared "write" lock

At the same time, the gluster logs report the following:

[2020-06-29 21:56:23.850417] I [MSGID: 109066] [dht-rename.c:1951:dht_rename] 0-SSD_Storage-dht: renaming /58e8dff0-3dfd-4554-9999-b8eb7744ce1b/images/998f0b18-1904-47f3-8cfb-a73ad063ab83/64c038a4-5fe4-4f57-8b1c-bab38ae5c5bb.meta.new (1999a713-a0ed-45fb-8ab7-7dbda6d02a78) (hash=SSD_Storage-disperse-1/cache=SSD_Storage-disperse-1) => /58e8dff0-3dfd-4554-9999-b8eb7744ce1b/images/998f0b18-1904-47f3-8cfb-a73ad063ab83/64c038a4-5fe4-4f57-8b1c-bab38ae5c5bb.meta (a89f2ccb-be41-4ff7-bbaf-abb786e76bc7) (hash=SSD_Storage-disperse-2/cache=SSD_Storage-disperse-1)
[2020-06-29 21:56:23.855027] W [MSGID: 122019] [ec-helpers.c:401:ec_loc_gfid_check] 0-SSD_Storage-disperse-2: Mismatching GFID's in loc
[2020-06-29 21:56:23.855045] W [MSGID: 109002] [dht-rename.c:1019:dht_rename_links_create_cbk] 0-SSD_Storage-dht: link/file /58e8dff0-3dfd-4554-9999-b8eb7744ce1b/images/998f0b18-1904-47f3-8cfb-a73ad063ab83/64c038a4-5fe4-4f57-8b1c-bab38ae5c5bb.meta on SSD_Storage-disperse-2 failed [Input/output error]
[2020-06-29 21:56:23.922638] I [MSGID: 109066] [dht-rename.c:1951:dht_rename] 0-SSD_Storage-dht: renaming /58e8dff0-3dfd-4554-9999-b8eb7744ce1b/images/998f0b18-1904-47f3-8cfb-a73ad063ab83/a54793c1-c804-425d-894e-2dfe7a63af4b.meta.new (e5c578b3-b91a-4263-a7e3-40f9c7e3628b) (hash=SSD_Storage-disperse-4/cache=SSD_Storage-disperse-4) => /58e8dff0-3dfd-4554-9999-b8eb7744ce1b/images/998f0b18-1904-47f3-8cfb-a73ad063ab83/a54793c1-c804-425d-894e-2dfe7a63af4b.meta (b4888032-3758-4f62-a4ae-fb48902f83d2) (hash=SSD_Storage-disperse-4/cache=SSD_Storage-disperse-4)
[2020-06-29 21:56:26.017309] E [fuse-bridge.c:227:check_and_dump_fuse_W] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x133)[0x7fd4fa4d6a53] (--> /usr/lib64/glusterfs/7.5/xlator/mount/fuse.so(+0x8e82)[0x7fd4f64cee82] (--> /usr/lib64/glusterfs/7.5/xlator/mount/fuse.so(+0xa072)[0x7fd4f64d0072] (--> /lib64/libpthread.so.0(+0x82de)[0x7fd4f90582de] (--> /lib64/libc.so.6(clone+0x43)[0x7fd4f88aa133] ))))) 0-glusterfs-fuse: writing to fuse device failed: No such file or directory
[2020-06-29 21:56:26.017421] E [fuse-bridge.c:227:check_and_dump_fuse_W] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x133)[0x7fd4fa4d6a53] (--> /usr/lib64/glusterfs/7.5/xlator/mount/fuse.so(+0x8e82)[0x7fd4f64cee82] (--> /usr/lib64/glusterfs/7.5/xlator/mount/fuse.so(+0xa072)[0x7fd4f64d0072] (--> /lib64/libpthread.so.0(+0x82de)[0x7fd4f90582de] (--> /lib64/libc.so.6(clone+0x43)[0x7fd4f88aa133] ))))) 0-glusterfs-fuse: writing to fuse device failed: No such file or directory
[2020-06-29 21:56:26.017524] E [fuse-bridge.c:227:check_and_dump_fuse_W] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x133)[0x7fd4fa4d6a53] (--> /usr/lib64/glusterfs/7.5/xlator/mount/fuse.so(+0x8e82)[0x7fd4f64cee82] (--> /usr/lib64/glusterfs/7.5/xlator/mount/fuse.so(+0xa072)[0x7fd4f64d0072] (--> /lib64/libpthread.so.0(+0x82de)[0x7fd4f90582de] (--> /lib64/libc.so.6(clone+0x43)[0x7fd4f88aa133] ))))) 0-glusterfs-fuse: writing to fuse device failed: No such file or directory
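For completeness: if I understand the oVirt/libvirt flow correctly, deleting the snapshot of a running VM corresponds to an active block commit of the overlay back into the base image, so the manual equivalent should be roughly (placeholder names again):

  # active live merge: commit the top overlay back into the base image
  # and pivot the running VM back onto it
  virsh blockcommit testvm vda --active --wait --verbose --pivot

This should be approximately what oVirt drives through libvirt during snapshot deletion, and the crash happens while that merge is in progress.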
sans-serif">Initially I thought this was a qemu-kvm issue; however the above works perfectly on a distributed-replicated volume on exactly the same HW, software and gluster volume options.</font></div><div><font face="arial, sans-serif">Also, the issue can be replicated 100% of the times -- every time I try to delete the snapshot the process crashes.</font></div><div><font face="arial, sans-serif"><br></font></div><div><font face="arial, sans-serif">Not sure what&#39;s the best way to proceed -- I have tried to file a bug but unfortunately didn&#39;t get any traction.</font></div><div>Gluster volume info here:</div><div><font face="monospace"><br></font></div><div><font face="monospace">Volume Name: SSD_Storage<br>Type: Distributed-Disperse<br>Volume ID: 4e1bf45d-9ecd-44f2-acde-dd338e18379c<br>Status: Started<br>Snapshot Count: 0<br>Number of Bricks: 6 x (4 + 2) = 36<br>Transport-type: tcp<br>Bricks:<br>Brick1: cld-cnvirt-h01-storage:/bricks/vm_b1/brick<br>Brick2: cld-cnvirt-h02-storage:/bricks/vm_b1/brick<br>Brick3: cld-cnvirt-h03-storage:/bricks/vm_b1/brick<br>Brick4: cld-cnvirt-h04-storage:/bricks/vm_b1/brick<br>Brick5: cld-cnvirt-h05-storage:/bricks/vm_b1/brick<br>Brick6: cld-cnvirt-h06-storage:/bricks/vm_b1/brick<br>Brick7: cld-cnvirt-h01-storage:/bricks/vm_b2/brick<br>Brick8: cld-cnvirt-h02-storage:/bricks/vm_b2/brick<br>Brick9: cld-cnvirt-h03-storage:/bricks/vm_b2/brick<br>Brick10: cld-cnvirt-h04-storage:/bricks/vm_b2/brick<br>Brick11: cld-cnvirt-h05-storage:/bricks/vm_b2/brick<br>Brick12: cld-cnvirt-h06-storage:/bricks/vm_b2/brick<br>Brick13: cld-cnvirt-h01-storage:/bricks/vm_b3/brick<br>Brick14: cld-cnvirt-h02-storage:/bricks/vm_b3/brick<br>Brick15: cld-cnvirt-h03-storage:/bricks/vm_b3/brick<br>Brick16: cld-cnvirt-h04-storage:/bricks/vm_b3/brick<br>Brick17: cld-cnvirt-h05-storage:/bricks/vm_b3/brick<br>Brick18: cld-cnvirt-h06-storage:/bricks/vm_b3/brick<br>Brick19: cld-cnvirt-h01-storage:/bricks/vm_b4/brick<br>Brick20: cld-cnvirt-h02-storage:/bricks/vm_b4/brick<br>Brick21: cld-cnvirt-h03-storage:/bricks/vm_b4/brick<br>Brick22: cld-cnvirt-h04-storage:/bricks/vm_b4/brick<br>Brick23: cld-cnvirt-h05-storage:/bricks/vm_b4/brick<br>Brick24: cld-cnvirt-h06-storage:/bricks/vm_b4/brick<br>Brick25: cld-cnvirt-h01-storage:/bricks/vm_b5/brick<br>Brick26: cld-cnvirt-h02-storage:/bricks/vm_b5/brick<br>Brick27: cld-cnvirt-h03-storage:/bricks/vm_b5/brick<br>Brick28: cld-cnvirt-h04-storage:/bricks/vm_b5/brick<br>Brick29: cld-cnvirt-h05-storage:/bricks/vm_b5/brick<br>Brick30: cld-cnvirt-h06-storage:/bricks/vm_b5/brick<br>Brick31: cld-cnvirt-h01-storage:/bricks/vm_b6/brick<br>Brick32: cld-cnvirt-h02-storage:/bricks/vm_b6/brick<br>Brick33: cld-cnvirt-h03-storage:/bricks/vm_b6/brick<br>Brick34: cld-cnvirt-h04-storage:/bricks/vm_b6/brick<br>Brick35: cld-cnvirt-h05-storage:/bricks/vm_b6/brick<br>Brick36: cld-cnvirt-h06-storage:/bricks/vm_b6/brick<br>Options Reconfigured:<br>nfs.disable: on<br>storage.fips-mode-rchecksum: on<br>performance.strict-o-direct: on<br>network.remote-dio: off<br>storage.owner-uid: 36<br>storage.owner-gid: 36<br>network.ping-timeout: 30</font><br></div><div><font face="monospace"><br></font></div><div><font face="arial, sans-serif">I have tried many different options but unfortunately have the same results. 
I have tried many different options, but unfortunately with the same results, and I have the same problem on three different clusters (same versions).

Any suggestions?

Thanks,
Marco