<div style="font-family: Arial, sans-serif; font-size: 14px;">Hello,</div><div style="font-family: Arial, sans-serif; font-size: 14px;">I've run into an issue with Gluster 11.1 and need some assistance. I have a 4+1 dispersed Gluster setup consisting of 20 nodes and 200 bricks. This setup was 15 nodes and 150 bricks until last week and was working flawlessly. We needed more space, so we expanded the volume by adding 5 more nodes and 50 bricks.</div><div style="font-family: Arial, sans-serif; font-size: 14px;"><br></div><div style="font-family: Arial, sans-serif; font-size: 14px;">We added the nodes and triggered a fix-layout command. Unknown to us at the time, one of the five new nodes had a hardware issue: the CPU cooling fan was bad. This caused the node to throttle down to 500 MHz on all cores and eventually shut itself down mid fix-layout. Due to how our ISP works, we could only replace the entire node, so we did and executed a replace-brick command.</div><div style="font-family: Arial, sans-serif; font-size: 14px;"><br></div><div style="font-family: Arial, sans-serif; font-size: 14px;">This is the state we are in now, and I'm not sure how best to fix the errors and behavior I'm seeing. I'm not sure whether running fix-layout again should be the next step, given that hundreds of objects are stuck in a persistent heal state and that running just about any command other than status, info, or heal volume info causes all client mounts to hang for ~5 minutes or bricks to start dropping. The client logs also show numerous anomalies, such as:</div><div style="font-family: Arial, sans-serif; font-size: 14px;"><br></div><div style="font-family: Arial, sans-serif; font-size: 14px;"><span>[2023-11-10 17:41:52.153423 +0000] W [MSGID: 122040] [ec-common.c:1262:ec_prepare_update_cbk] 0-media-disperse-30: Failed to get size and version : FOP : 'XATTROP' failed on '/path/to/folder' with gfid 0d295c94-5577-4445-9e57-6258f24d22c5. 
Parent FOP: OPENDIR [Input/output error]</span><br></div><div style="font-family: Arial, sans-serif; font-size: 14px;"><span><br></span></div><div style="font-family: Arial, sans-serif; font-size: 14px;"><span>[2023-11-10 17:48:46.965415 +0000] E [MSGID: 122038] [ec-dir-read.c:398:ec_manager_readdir] 0-media-disperse-36: EC is not winding readdir: FOP : 'READDIRP' failed on gfid f8ad28d0-05b4-4df3-91ea-73fabf27712c. Parent FOP: No Parent [File descriptor in bad state]</span><br></div><div style="font-family: Arial, sans-serif; font-size: 14px;"><span><br></span></div><div style="font-family: Arial, sans-serif; font-size: 14px;"><span>[2023-11-10 17:39:46.076149 +0000] I [MSGID: 109018] [dht-common.c:1840:dht_revalidate_cbk] 0-media-dht: Mismatching layouts for /path/to/folder2, gfid = f04124e5-63e6-4ddf-9b6b-aa47770f90f2 </span><br></div><div style="font-family: Arial, sans-serif; font-size: 14px;"><span><span><br></span></span></div><div style="font-family: Arial, sans-serif; font-size: 14px;"><span><span>[2023-11-10 17:39:18.463421 +0000] E [MSGID: 122034] [ec-common.c:662:ec_log_insufficient_vol] 0-media-disperse-4: Insufficient available children for this request: Have : 0, Need : 4 : Child UP : 11111 Mask: 00000, Healing : 00000 : FOP : 'XATTROP' failed on '/path/to/another/folder with gfid f04124e5-63e6-4ddf-9b6b-aa47770f90f2. Parent FOP: SETXATTR </span><br></span></div><div style="font-family: Arial, sans-serif; font-size: 14px;"><span><span><span><br></span></span></span></div><div style="font-family: Arial, sans-serif; font-size: 14px;"><span><span><span>[2023-11-10 17:36:21.565681 +0000] W [MSGID: 122006] [ec-combine.c:188:ec_iatt_combine] 0-media-disperse-39: Failed to combine iatt (inode: 13324146332441721129-13324146332441721129, links: 2-2, uid: 1000-1000, gid: 1000-1001, rdev: 0-0, size: 10-10, mode: 40775-40775), FOP : 'LOOKUP' failed on '/path/to/yet/another/folder'. 
Parent FOP: No Parent </span><br></span></span></div><div style="font-family: Arial, sans-serif; font-size: 14px;"><span><span><span><span><br></span></span></span></span></div><div style="font-family: Arial, sans-serif; font-size: 14px;"><span><span><span><span>[2023-11-10 17:39:46.147299 +0000] W [MSGID: 114031] [client-rpc-fops_v2.c:2563:client4_0_lookup_cbk] 0-media-client-1: remote operation failed. [{path=/path/to/folder3}, {gfid=00000000-0000-0000-0000-000000000000}, {errno=13}, {error=Permission denied}] </span><br></span></span></span></div><div style="font-family: Arial, sans-serif; font-size: 14px;"><span><span><span><span><br></span></span></span></span></div><div style="font-family: Arial, sans-serif; font-size: 14px;"><span><span><span><span>[2023-11-10 17:39:46.093069 +0000] W [MSGID: 114061] [client-common.c:1232:client_pre_readdirp_v2] 0-media-client-14: remote_fd is -1. EBADFD [{gfid=f04124e5-63e6-4ddf-9b6b-aa47770f90f2}, {errno=77}, {error=File descriptor in bad state}] </span><br></span></span></span></div><div style="font-family: Arial, sans-serif; font-size: 14px;"><span><span><span><span><span><br></span></span></span></span></span></div><div style="font-family: Arial, sans-serif; font-size: 14px;"><span><span><span><span><span>[2023-11-10 17:55:11.407630 +0000] E [MSGID: 122038] [ec-dir-read.c:398:ec_manager_readdir] 0-media-disperse-30: EC is not winding readdir: FOP : 'READDIRP' failed on gfid 2bba7b7e-7a4b-416a-80f0-dd50caffd2c2. 
Parent FOP: No Parent [File descriptor in bad state]</span><br></span></span></span></span></div><div style="font-family: Arial, sans-serif; font-size: 14px;"><span><span><span><span><span><span><br></span></span></span></span></span></span></div><div style="font-family: Arial, sans-serif; font-size: 14px;"><span><span><span><span><span><span>[2023-11-10 17:39:46.076179 +0000] W [MSGID: 109221] [dht-selfheal.c:2023:dht_selfheal_directory] 0-media-dht: Directory selfheal failed [{path=/path/to/folder7}, {misc=2}, {unrecoverable-errors}, {gfid=f04124e5-63e6-4ddf-9b6b-aa47770f90f2}] </span><br></span></span></span></span></span></div><div style="font-family: Arial, sans-serif; font-size: 14px;"><span><span><span><span><span><span><br></span></span></span></span></span></span></div><div style="font-family: Arial, sans-serif; font-size: 14px;"><span><span><span><span><span><span>Something about this failed expansion has caused these errors and I'm not sure how to proceed. Right now doing just about anything causes the client mounts to hang for up to 5 minutes including restarting a node, trying to use a volume set command, etc. I tried increasing a cache timeout value and ~153 bricks out of 200 dropped offline. 
Restarting a node seems to cause the mounts to hang as well.</span></span></span></span></span></span></div><div style="font-family: Arial, sans-serif; font-size: 14px;"><span><span><span><span><span><span><br></span></span></span></span></span></span></div><div style="font-family: Arial, sans-serif; font-size: 14px;"><span><span><span><span><span><span>I've tried:</span></span></span></span></span></span></div><div style="font-family: Arial, sans-serif; font-size: 14px;"><span><span><span><span><span><span>Running gluster volume heal volumename full - causes mounts to hang for 3-5 minutes but seems to proceed</span></span></span></span></span></span></div><div style="font-family: Arial, sans-serif; font-size: 14px;"><span><span><span><span><span><span>Running ls -alhR against the volume to trigger heals</span></span></span></span></span></span></div><div style="font-family: Arial, sans-serif; font-size: 14px;"><span><span><span><span><span><span>Tried removing the new bricks, which triggers a rebalance that fails almost immediately, and most of the self-heal agents go offline as well</span></span></span></span></span></span></div><div style="font-family: Arial, sans-serif; font-size: 14px;"><span><span><span><span><span><span>Turned off bit-rot to reduce load on the system</span></span></span></span></span></span></div><div style="font-family: Arial, sans-serif; font-size: 14px;"><span><span><span><span><span><span>Replaced a brick with a new brick (same drive, new dir.) 
Attempted force as well.</span></span></span></span></span></span></div><div style="font-family: Arial, sans-serif; font-size: 14px;"><span><span><span><span><span><span>Changed heal mode from diff to full</span></span></span></span></span></span></div><div style="font-family: Arial, sans-serif; font-size: 14px;"><span><span><span><span><span><span>Lowered parallel heal count to 4</span></span></span></span></span></span></div><div style="font-family: Arial, sans-serif; font-size: 14px;"><span><span><span><span><span><span><br></span></span></span></span></span></span></div><div style="font-family: Arial, sans-serif; font-size: 14px;"><span><span><span><span><span><span>When I replaced the one brick, the heal count on that brick dropped from ~100 to ~6; however, those 6 are folders in the root of the volume rather than subfolders many layers in. I suspect this is causing a lot of the issues I'm seeing, and I don't know how to resolve it without damaging any of the existing data.</span></span></span></span></span></span></div><div style="font-family: Arial, sans-serif; font-size: 14px;"><span><span><span><span><span><span><br></span></span></span></span></span></span></div><div style="font-family: Arial, sans-serif; font-size: 14px;"><span><span><span><span><span><span>I'm hoping it's just due to the fix-layout failing and that it just needs to run again, but I wanted to seek guidance from the group so as not to make things worse. I'm not opposed to losing the data already copied to the new bricks; I just need to know how to do so without damaging the data on the original 150 bricks. 
</span></span></span></span></span></span></div><div style="font-family: Arial, sans-serif; font-size: 14px;"><span><span><span><span><span><span><br></span></span></span></span></span></span></div><div style="font-family: Arial, sans-serif; font-size: 14px;"><span><span><span><span><span><span>I also noticed something else odd, which may or may not be pertinent: on one of the original 15 nodes, if I go to the /data/brick1/volume dir and do an ls -l, the ownership shows 1000:1000, which is how it is on the actual FUSE mount as well. If I do the same on one of the new bricks, it shows root:root. I didn't alter any of this, again so as not to cause more problems. </span></span></span></span></span></span></div><div style="font-family: Arial, sans-serif; font-size: 14px;"><span><span><span><span><span><span><br></span></span></span></span></span></span></div><div style="font-family: Arial, sans-serif; font-size: 14px;"><span><span><span><span><span><span>Thanks in advance for any guidance/help.</span></span></span></span></span></span></div><div style="font-family: Arial, sans-serif; font-size: 14px;"><span><span><span><span><span><span>-Ed</span></span></span></span></span></span></div><div class="protonmail_signature_block" style="font-family: Arial, sans-serif; font-size: 14px;">
</div>