<div dir="ltr">Was a volume with existing data got converted to sharding volume?</div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Wed, Jan 27, 2021 at 5:06 AM Erik Jacobson <<a href="mailto:erik.jacobson@hpe.com">erik.jacobson@hpe.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">Shortly after the sharded volume is made, there are some fuse mount<br>
messages. I'm not 100% sure if this was just before or during the<br>
big qemu-img command to make the 5T image<br>
(qemu-img create -f raw -o preallocation=falloc<br>
/adminvm/images/adminvm.img 5T)<br>
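<br>
As a quick sanity check (a sketch; /adminvm is the fuse mount in our<br>
setup), the image's apparent vs. allocated size can be compared after<br>
the create:<br>
<br>
  du -h --apparent-size /adminvm/images/adminvm.img   # apparent size: should be 5.0T<br>
  du -h /adminvm/images/adminvm.img                   # space actually allocated by falloc<br>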
<br>
<br>
(from /var/log/glusterfs/adminvm.log)<br>
[2021-01-26 19:18:21.287697] I [fuse-bridge.c:5166:fuse_init] 0-glusterfs-fuse: FUSE inited with protocol versions: glusterfs 7.24 kernel 7.31<br>
[2021-01-26 19:18:21.287719] I [fuse-bridge.c:5777:fuse_graph_sync] 0-fuse: switched to graph 0<br>
[2021-01-26 19:18:23.945566] W [MSGID: 114031] [client-rpc-fops_v2.c:2633:client4_0_lookup_cbk] 0-adminvm-client-2: remote operation failed. Path: /.shard/0cb55720-2288-46c2-bd7e-5d9bd23b40bd.7 (00000000-0000-0000-0000-000000000000) [No data available]<br>
[2021-01-26 19:18:54.089721] W [MSGID: 114031] [client-rpc-fops_v2.c:2633:client4_0_lookup_cbk] 0-adminvm-client-0: remote operation failed. Path: /.shard/0cb55720-2288-46c2-bd7e-5d9bd23b40bd.85 (00000000-0000-0000-0000-000000000000) [No data available]<br>
[2021-01-26 19:18:54.089784] W [MSGID: 114031] [client-rpc-fops_v2.c:2633:client4_0_lookup_cbk] 0-adminvm-client-1: remote operation failed. Path: /.shard/0cb55720-2288-46c2-bd7e-5d9bd23b40bd.85 (00000000-0000-0000-0000-000000000000) [No data available]<br>
[2021-01-26 19:18:55.048613] W [MSGID: 114031] [client-rpc-fops_v2.c:2633:client4_0_lookup_cbk] 0-adminvm-client-1: remote operation failed. Path: /.shard/0cb55720-2288-46c2-bd7e-5d9bd23b40bd.88 (00000000-0000-0000-0000-000000000000) [No data available]<br>
[2021-01-26 19:18:55.355131] W [MSGID: 114031] [client-rpc-fops_v2.c:2633:client4_0_lookup_cbk] 0-adminvm-client-0: remote operation failed. Path: /.shard/0cb55720-2288-46c2-bd7e-5d9bd23b40bd.89 (00000000-0000-0000-0000-000000000000) [No data available]<br>
[2021-01-26 19:18:55.981094] W [MSGID: 114031] [client-rpc-fops_v2.c:2633:client4_0_lookup_cbk] 0-adminvm-client-0: remote operation failed. Path: /.shard/0cb55720-2288-46c2-bd7e-5d9bd23b40bd.91 (00000000-0000-0000-0000-000000000000) [No data available]<br>
......<br>
<br>
<br>
Towards the end of the qemu-img create command (or just after; it's hard<br>
to tell), these messages showed up in adminvm.log. I've supplied just<br>
the first few; there were many:<br>
<br>
<br>
[2021-01-26 19:28:40.652898] W [MSGID: 101159] [inode.c:1212:__inode_unlink] 0-inode: be318638-e8a0-4c6d-977d-7a937aa84806/48bb5288-e27e-46c9-9f7c-944a804df361.1: dentry not found in 48bb5288-e27e-46c9-9f7c-944a804df361<br>
[2021-01-26 19:28:40.652975] W [MSGID: 101159] [inode.c:1212:__inode_unlink] 0-inode: be318638-e8a0-4c6d-977d-7a937aa84806/931508ed-9368-4982-a53e-7187a9f0c1f9.3: dentry not found in 931508ed-9368-4982-a53e-7187a9f0c1f9<br>
[2021-01-26 19:28:40.653047] W [MSGID: 101159] [inode.c:1212:__inode_unlink] 0-inode: be318638-e8a0-4c6d-977d-7a937aa84806/e808ecab-2e70-4ef3-954e-ce1b78ed8b52.4: dentry not found in e808ecab-2e70-4ef3-954e-ce1b78ed8b52<br>
[2021-01-26 19:28:40.653102] W [MSGID: 101159] [inode.c:1212:__inode_unlink] 0-inode: be318638-e8a0-4c6d-977d-7a937aa84806/2c62c383-d869-4655-9c03-f08a86a874ba.6: dentry not found in 2c62c383-d869-4655-9c03-f08a86a874ba<br>
[2021-01-26 19:28:40.653169] W [MSGID: 101159] [inode.c:1212:__inode_unlink] 0-inode: be318638-e8a0-4c6d-977d-7a937aa84806/556ffbc9-bcbe-445a-93f5-13784c5a6df1.2: dentry not found in 556ffbc9-bcbe-445a-93f5-13784c5a6df1<br>
[2021-01-26 19:28:40.653218] W [MSGID: 101159] [inode.c:1212:__inode_unlink] 0-inode: be318638-e8a0-4c6d-977d-7a937aa84806/5d414e7c-335d-40da-bb96-6c427181338b.5: dentry not found in 5d414e7c-335d-40da-bb96-6c427181338b<br>
[2021-01-26 19:28:40.653314] W [MSGID: 101159] [inode.c:1212:__inode_unlink] 0-inode: be318638-e8a0-4c6d-977d-7a937aa84806/43364dc9-2d8e-4fca-89d2-e11dee6fcfd4.8: dentry not found in 43364dc9-2d8e-4fca-89d2-e11dee6fcfd4<br>
.....<br>
<br>
<br>
So now I installed Linux into a VM using the above as the VM image.<br>
There were no additional fuse messages while the admin VM was being<br>
installed with our installer (via qemu on the same physical node where<br>
the above messages appeared, and the same node where I ran qemu-img create).<br>
<br>
Rebooted the virtual machine and it booted fine. No new messages in the<br>
fuse log. So now it's officially booted. This was a 'reboot', so qemu<br>
didn't restart.<br>
<br>
Halted the VM with 'halt', then did a forced shutdown in virt-manager.<br>
<br>
Started the VM from scratch.<br>
<br>
Still no new messages, and it booted fine.<br>
<br>
Powered off a physical node and brought it back, still fine.<br>
Reset all physical nodes and brought them back, still fine.<br>
<br>
I am unable to trigger this problem on demand. However, once it starts<br>
to go bad, it stays bad, across all the physical nodes. The trick of<br>
using kpartx to mount the root filesystem from within the image and then<br>
umount it (sketched below) is only a temporary fix that doesn't persist<br>
beyond one boot once we're in the bad state.<br>
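<br>
For reference, the temporary workaround looks roughly like this (a<br>
sketch; loop0p31 is the root partition in our image, and the loop<br>
device number may vary):<br>
<br>
  kpartx -av /adminvm/images/adminvm.img    # map the image's partitions to /dev/mapper/loop0pNN<br>
  mount /dev/mapper/loop0p31 /mnt           # mount the root filesystem from inside the image<br>
  umount /mnt                               # unmount it again<br>
  kpartx -dv /adminvm/images/adminvm.img    # tear the partition mappings back down<br>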
<br>
So something gets into a bad state and stays that way, but we don't know<br>
how to make it happen at will. I will continue to try to reproduce<br>
this, as it's causing some huge problems in the field.<br>
<br>
<br>
<br>
<br>
On Tue, Jan 26, 2021 at 07:40:19AM -0600, Erik Jacobson wrote:<br>
> Thank you so much for responding! More below.<br>
> <br>
> <br>
> > Anything in the logs of the fuse mount? Can you stat the file from the mount?<br>
> > Also, the report of an image being only 64M makes me think about sharding, as<br>
> > the default shard size is 64M.<br>
> > Do you have any clues on when this issue started to happen? Was there any<br>
> > operation done to the Gluster cluster?<br>
> <br>
> <br>
> - I had just created the gluster volumes within an hour of the failure,<br>
> specifically to test the very problem I reported. So it was a "fresh start".<br>
> <br>
> - It booted one or two times, then stopped booting. Once it couldn't<br>
> boot, all 3 nodes behaved the same: grub2 couldn't boot from the VM<br>
> image on any of them.<br>
> <br>
> As for the fuse log, I did see a couple of these before it happened the<br>
> first time. I'm not sure if it's a clue or not.<br>
> <br>
> [2021-01-25 22:48:19.310467] I [fuse-bridge.c:5777:fuse_graph_sync] 0-fuse: switched to graph 0<br>
> [2021-01-25 22:50:09.693958] E [fuse-bridge.c:227:check_and_dump_fuse_W] (--> /usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x17a)[0x7f914e346faa] (--> /usr/lib64/glusterfs/7.2/xlator/mount/fuse.so(+0x874a)[0x7f914a3d374a] (--> /usr/lib64/glusterfs/7.2/xlator/mount/fuse.so(+0x91cb)[0x7f914a3d41cb] (--> /lib64/libpthread.so.0(+0x84f9)[0x7f914cf184f9] (--> /lib64/libc.so.6(clone+0x3f)[0x7f914c76afbf] ))))) 0-glusterfs-fuse: writing to fuse device failed: No such file or directory<br>
> [2021-01-25 22:50:09.694462] E [fuse-bridge.c:227:check_and_dump_fuse_W] (--> /usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x17a)[0x7f914e346faa] (--> /usr/lib64/glusterfs/7.2/xlator/mount/fuse.so(+0x874a)[0x7f914a3d374a] (--> /usr/lib64/glusterfs/7.2/xlator/mount/fuse.so(+0x91cb)[0x7f914a3d41cb] (--> /lib64/libpthread.so.0(+0x84f9)[0x7f914cf184f9] (--> /lib64/libc.so.6(clone+0x3f)[0x7f914c76afbf] ))))) 0-glusterfs-fuse: writing to fuse device failed: No such file or directory<br>
> <br>
> <br>
> <br>
> I have reserved the test system again. My plans today are:<br>
> - Start over with the gluster volume on the machine with sles15sp2<br>
> updates<br>
> <br>
> - Learn whether other modifications to the image force it to work<br>
> (besides using kpartx to map the filesystems within the image and<br>
> mounting/umounting them). What if I add/remove a byte from the end<br>
> of the image file, for example? (See the sketch after this list.)<br>
> <br>
> - Revert the setup to sles15sp2 with no updates. My theory is the<br>
> updates are not making a difference and it's just random chance.<br>
> (re-making the gluster volume in the process)<br>
> <br>
> - The 64MB shard size made me think too!!<br>
> <br>
> - If the team feels it is worth it, I could try a newer gluster. We're<br>
> using the versions we've validated at scale on large clusters in the<br>
> factory, but if the team thinks I should try something else, I'm happy<br>
> to re-build it!!! We are at 7.2 plus the afr-event-gen-changes<br>
> patch.<br>
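> <br>
> For the add/remove-a-byte experiment mentioned above, a minimal sketch<br>
> would be something like this (only an experiment; it briefly changes<br>
> the file size):<br>
> <br>
>   truncate -s +1 /adminvm/images/adminvm.img   # grow the image by one byte<br>
>   truncate -s -1 /adminvm/images/adminvm.img   # shrink it back to the original size<br>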
> <br>
> I will keep a better eye on the fuse log to tie an error to the problem<br>
> starting.<br>
> <br>
> <br>
> THANKS AGAIN for responding and let me know if you have any more<br>
> clues!<br>
> <br>
> Erik<br>
> <br>
> <br>
> > <br>
> > On Tue, Jan 26, 2021 at 2:40 AM Erik Jacobson <<a href="mailto:erik.jacobson@hpe.com" target="_blank">erik.jacobson@hpe.com</a>> wrote:<br>
> > <br>
> > Hello all. Thanks again for gluster. We're having a strange problem<br>
> > getting virtual machines started that are hosted on a gluster volume.<br>
> > <br>
> > One of the ways we use gluster now is to make an HA-ish cluster head<br>
> > node. A virtual machine runs on the shared storage and is backed by 3<br>
> > physical servers that contribute to the gluster storage share.<br>
> > <br>
> > We're using sharding in this volume. The VM image file is around 5T and<br>
> > we use qemu-img with falloc to get all the blocks allocated in advance.<br>
> > <br>
> > We are not using gfapi largely because it would mean we have to build<br>
> > our own libvirt and qemu and we'd prefer not to do that. So we're using<br>
> > a glusterfs fuse mount to host the image. The virtual machine is using<br>
> > virtio disks but we had similar trouble using scsi emulation.<br>
> > <br>
> > The issue: at first, all seems well; the VM head node installs, boots, etc.<br>
> > <br>
> > However, at some point, it stops being able to boot! grub2 acts like it<br>
> > cannot find /boot. At the grub2 prompt, it can see the partitions, but<br>
> > reports no filesystem found where there are indeed filesystems.<br>
> > <br>
> > If we switch qemu to use "direct kernel load" (bypassing grub2), this often<br>
> > works around the problem, but in one case Linux gave us a clue: it<br>
> > reported /dev/vda as only being 64 megabytes, which would explain a lot.<br>
> > This means the Linux inside the virtual machine thought the disk supplied<br>
> > by the disk image was tiny! 64M instead of 5T.<br>
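> > <br>
> > To confirm what size qemu is actually seeing, something like this can<br>
> > be run on the host against the fuse mount (a sketch; paths per our setup):<br>
> > <br>
> >   qemu-img info /adminvm/images/adminvm.img   # should report 5T, not 64M<br>
> >   stat -c '%s' /adminvm/images/adminvm.img    # apparent size in bytes through fuse<br>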
> > <br>
> > We are using sles15sp2 and hit the problem more often with updates<br>
> > applied than without. I'm in the process of trying to isolate whether a<br>
> > sles15sp2 update is causing this, or if we're just seeing random chance.<br>
> > <br>
> > When one of the physical nodes is in the failure mode, if I use<br>
> > 'kpartx' to create the partition mappings from the image file, mount the<br>
> > giant root filesystem (i.e. mount /dev/mapper/loop0p31 /mnt), and then<br>
> > umount /mnt, that physical node starts the VM fine: grub2 loads and<br>
> > the virtual machine is fully happy! Until I try to shut it down and<br>
> > start it up again, at which point it sticks at grub2 again! What about<br>
> > mounting the image file makes it so qemu sees the whole disk?<br>
> > <br>
> > The problem doesn't always happen, but once it starts, the same VM image has<br>
> > trouble starting on any of the 3 physical nodes sharing the storage.<br>
> > But after using the trick of force-mounting the root within the image with<br>
> > kpartx, the machine can come up. My only guess is this changes the<br>
> > file just a tiny bit in the middle of the image.<br>
> > <br>
> > Once the problem starts, it keeps happening, except that it temporarily<br>
> > works after I do the loop mount trick on the physical admin node.<br>
> > <br>
> > <br>
> > Here is some info about what I have in place:<br>
> > <br>
> > <br>
> > nano-1:/adminvm/images # gluster volume info<br>
> > <br>
> > Volume Name: adminvm<br>
> > Type: Replicate<br>
> > Volume ID: 67de902c-8c00-4dc9-8b69-60b93b5f6104<br>
> > Status: Started<br>
> > Snapshot Count: 0<br>
> > Number of Bricks: 1 x 3 = 3<br>
> > Transport-type: tcp<br>
> > Bricks:<br>
> > Brick1: 172.23.255.151:/data/brick_adminvm<br>
> > Brick2: 172.23.255.152:/data/brick_adminvm<br>
> > Brick3: 172.23.255.153:/data/brick_adminvm<br>
> > Options Reconfigured:<br>
> > performance.client-io-threads: on<br>
> > nfs.disable: on<br>
> > storage.fips-mode-rchecksum: on<br>
> > transport.address-family: inet<br>
> > performance.quick-read: off<br>
> > performance.read-ahead: off<br>
> > performance.io-cache: off<br>
> > performance.low-prio-threads: 32<br>
> > network.remote-dio: enable<br>
> > cluster.eager-lock: enable<br>
> > cluster.quorum-type: auto<br>
> > cluster.server-quorum-type: server<br>
> > cluster.data-self-heal-algorithm: full<br>
> > cluster.locking-scheme: granular<br>
> > cluster.shd-max-threads: 8<br>
> > cluster.shd-wait-qlength: 10000<br>
> > features.shard: on<br>
> > user.cifs: off<br>
> > cluster.choose-local: off<br>
> > client.event-threads: 4<br>
> > server.event-threads: 4<br>
> > cluster.granular-entry-heal: enable<br>
> > storage.owner-uid: 439<br>
> > storage.owner-gid: 443<br>
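> > <br>
> > Since features.shard is on and no shard-block-size is set above, the<br>
> > shards should be the 64MB default. A quick way to confirm (a sketch,<br>
> > run from any of the three nodes):<br>
> > <br>
> >   gluster volume get adminvm features.shard-block-size   # defaults to 64MB<br>
> >   ls /data/brick_adminvm/.shard | head   # shards stored as gfid.index files on each brick<br>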
> > <br>
> > <br>
> > <br>
> > <br>
> > libglusterfs0-7.2-4723.1520.210122T1700.a.sles15sp2hpe.x86_64<br>
> > glusterfs-7.2-4723.1520.210122T1700.a.sles15sp2hpe.x86_64<br>
> > python3-gluster-7.2-4723.1520.210122T1700.a.sles15sp2hpe.noarch<br>
> > <br>
> > <br>
> > <br>
> > nano-1:/adminvm/images # uname -a<br>
> > Linux nano-1 5.3.18-24.46-default #1 SMP Tue Jan 5 16:11:50 UTC 2021<br>
> > (4ff469b) x86_64 x86_64 x86_64 GNU/Linux<br>
> > nano-1:/adminvm/images # rpm -qa | grep qemu-4<br>
> > qemu-4.2.0-9.4.x86_64<br>
> > <br>
> > <br>
> > <br>
> > Would love any advice!!!!<br>
> > <br>
> > <br>
> > Erik<br>
> > <br>
> > <br>
> > <br>
> > --<br>
> > Respectfully<br>
> > Mahdi<br>
</blockquote></div><br clear="all"><div><br></div>-- <br><div dir="ltr" class="gmail_signature"><div dir="ltr">--<div><a href="https://kadalu.io" target="_blank">https://kadalu.io</a></div><div>Container Storage made easy!</div><div><br></div></div></div>