<div dir="ltr">Hi Dave,<div><br></div><div>Yes, files in split brain are not migrated as we cannot figure out which is the good copy. Adding Ravi to look at this and see what can be done.</div><div>Also adding Krutika as this is a sharded volume.</div><div><br></div><div>The files with the "---------T" permissions are internal files and can be ignored. Ravi and Krutika, please take a look at the other files.</div><div><br></div><div>Regards,</div><div>Nithya</div><div><br></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Fri, 28 Jun 2019 at 19:56, Dave Sherohman <<a href="mailto:dave@sherohman.org">dave@sherohman.org</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">On Thu, Jun 27, 2019 at 12:17:10PM +0530, Nithya Balachandran wrote:<br>
> There are some edge cases that may prevent a file from being migrated<br>
> during a remove-brick. Please do the following after this:<br>
> <br>
> 1. Check the remove-brick status for any failures. If there are any,<br>
> check the rebalance log file for errors.<br>
> 2. Even if there are no failures, check the removed bricks to see if any<br>
> files have not been migrated. If there are any, please check that they are<br>
> valid files on the brick and copy them from the brick to the volume via the<br>
> mount point.<br>
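<br>
For my own reference, I read those checks as roughly the following (volume,<br>
host and brick path below are placeholders for the real ones):<br>
<br>
--- cut here ---<br>
# 1. check the remove-brick status for failures<br>
gluster volume remove-brick palantir <host>:<brick-path> status<br>
# 2. scan the rebalance log for error-level entries<br>
grep ' E ' /var/log/glusterfs/palantir-rebalance.log<br>
# 3. list regular files left behind on the removed brick, ignoring gluster-internal dirs<br>
find <brick-path> -type f -not -path '*/.glusterfs/*'<br>
--- cut here ---<br>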
<br>
Well, looks like I hit one of those edge cases. Probably because of<br>
some issues around a reboot last September which left a handful of files<br>
in a state where self-heal identified them as needing to be healed, but<br>
was incapable of actually healing them. (Check the list archives for<br>
"Kicking a stuck heal", posted on Sept 4, if you want more details.)<br>
<br>
So I'm getting 9 failures on the arbiter (merlin), 8 on one data brick<br>
(gandalf), and 3 on the other (saruman). Looking in<br>
/var/log/glusterfs/palantir-rebalance.log, I see those numbers of<br>
<br>
migrate file failed: /.shard/291e9749-2d1b-47af-ad53-3a09ad4e64c6.229: failed to lock file on palantir-replicate-1 [Stale file handle]<br>
<br>
errors.<br>
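<br>
For the record, a per-node tally like the above can be pulled with something<br>
along the lines of (assuming the standard rebalance log location):<br>
<br>
--- cut here ---<br>
# count the failed-migration entries in this node's rebalance log<br>
grep -c 'migrate file failed' /var/log/glusterfs/palantir-rebalance.log<br>
--- cut here ---<br>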
<br>
Also, merlin has four errors, and gandalf has one, of the form:<br>
<br>
Gfid mismatch detected for <gfid:be318638-e8a0-4c6d-977d-7a937aa84806>/0f500288-ff62-4f0b-9574-53f510b4159f.2898>, 9f00c0fe-58c3-457e-a2e6-f6a006d1cfc6 on palantir-client-7 and 08bb7cdc-172b-4c21-916a-2a244c095a3e on palantir-client-1.<br>
<br>
There are no gfid mismatches recorded on saruman. All of the gfid<br>
mismatches are for <gfid:be318638-e8a0-4c6d-977d-7a937aa84806> and (on<br>
saruman) appear to correspond to 0-byte files (e.g.,<br>
.shard/0f500288-ff62-4f0b-9574-53f510b4159f.2898, in the case of the<br>
gfid mismatch quoted above).<br>
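<br>
For what it's worth, the per-brick gfids can presumably be compared directly<br>
on the brick hosts with getfattr, along the lines of:<br>
<br>
--- cut here ---<br>
# prints the trusted.gfid xattr in hex; brick path here is saruman's, adjust per host<br>
getfattr -n trusted.gfid -e hex /var/local/brick0/data/.shard/0f500288-ff62-4f0b-9574-53f510b4159f.2898<br>
--- cut here ---<br>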
<br>
For both types of errors, all affected files are in .shard/ and have<br>
UUID-style names, so I have no idea which actual files they belong to.<br>
File sizes are generally either 0 bytes or 4M (exactly), although one of<br>
them has a size slightly larger than 3M. So I'm assuming they're chunks<br>
of larger files (which would be almost all the files on the volume -<br>
it's primarily holding disk image files for kvm servers).<br>
<br>
Web searches generally seem to consider gfid mismatches to be a form of<br>
split-brain, but `gluster volume heal palantir info split-brain` shows<br>
"Number of entries in split-brain: 0" for all bricks, including those<br>
bricks which are reporting gfid mismatches.<br>
<br>
<br>
Given all that, how do I proceed with cleaning up the stale handle<br>
issues? I would guess that this will involve somehow converting the<br>
shard filename to a "real" filename, then shutting down the<br>
corresponding VM and maybe doing some additional cleanup.<br>
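<br>
If I understand the sharding layout correctly (please correct me if not), the<br>
part of a shard's name before the dot is the gfid of the base file, so<br>
something like this run on a brick should map a shard back to its parent file<br>
via the .glusterfs gfid hardlink:<br>
<br>
--- cut here ---<br>
# assumption: shard name prefix == gfid of the base file, and<br>
# .glusterfs/<aa>/<bb>/<gfid> on the brick is a hardlink to that file<br>
GFID=291e9749-2d1b-47af-ad53-3a09ad4e64c6<br>
BRICK=/var/local/brick0/data<br>
find "$BRICK" -not -path '*/.glusterfs/*' -samefile "$BRICK/.glusterfs/${GFID:0:2}/${GFID:2:2}/$GFID"<br>
--- cut here ---<br>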
<br>
And then there's the gfid mismatches. Since they're for 0-byte files,<br>
is it safe to just ignore them on the assumption that they only hold<br>
metadata? Or do I need to do some kind of split-brain resolution on<br>
them (even though gluster says no files are in split-brain)?<br>
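<br>
If they do turn out to need manual resolution, my guess is that it would be<br>
via the split-brain CLI with an explicit source brick, something like the<br>
command below (source brick being whichever copy is known good), though I'd<br>
rather have confirmation before running it:<br>
<br>
--- cut here ---<br>
# guess, not yet run: pick the good copy explicitly for one of the mismatched shards<br>
gluster volume heal palantir split-brain source-brick <host>:<brick-path> /.shard/0f500288-ff62-4f0b-9574-53f510b4159f.2898<br>
--- cut here ---<br>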
<br>
<br>
Finally, a listing of /var/local/brick0/data/.shard on saruman, in case<br>
any of the information it contains (like file sizes/permissions) might<br>
provide clues to resolving the errors:<br>
<br>
--- cut here ---<br>
root@saruman:/var/local/brick0/data/.shard# ls -l<br>
total 63996<br>
-rw-rw---- 2 root libvirt-qemu 0 Sep 17 2018 0f500288-ff62-4f0b-9574-53f510b4159f.2864<br>
-rw-rw---- 2 root libvirt-qemu 0 Sep 17 2018 0f500288-ff62-4f0b-9574-53f510b4159f.2868<br>
-rw-rw---- 2 root libvirt-qemu 0 Sep 17 2018 0f500288-ff62-4f0b-9574-53f510b4159f.2879<br>
-rw-rw---- 2 root libvirt-qemu 0 Sep 17 2018 0f500288-ff62-4f0b-9574-53f510b4159f.2898<br>
-rw------- 2 root libvirt-qemu 4194304 May 17 14:42 291e9749-2d1b-47af-ad53-3a09ad4e64c6.229<br>
-rw------- 2 root libvirt-qemu 4194304 Jun 24 09:10 291e9749-2d1b-47af-ad53-3a09ad4e64c6.925<br>
-rw-rw-r-- 2 root libvirt-qemu 4194304 Jun 26 12:54 2df12cb0-6cf4-44ae-8b0a-4a554791187e.266<br>
-rw-rw-r-- 2 root libvirt-qemu 4194304 Jun 26 16:30 2df12cb0-6cf4-44ae-8b0a-4a554791187e.820<br>
-rw-r--r-- 2 root libvirt-qemu 4194304 Jun 17 20:22 323186b1-6296-4cbe-8275-b940cc9d65cf.27466<br>
-rw-r--r-- 2 root libvirt-qemu 4194304 Jun 27 05:01 323186b1-6296-4cbe-8275-b940cc9d65cf.32575<br>
-rw-r--r-- 2 root libvirt-qemu 3145728 Jun 11 13:23 323186b1-6296-4cbe-8275-b940cc9d65cf.3448<br>
---------T 2 root libvirt-qemu 0 Jun 28 14:26 4cd094f4-0344-4660-98b0-83249d5bd659.22998<br>
-rw------- 2 root libvirt-qemu 4194304 Mar 13 2018 6cdd2e5c-f49e-492b-8039-239e71577836.1302<br>
---------T 2 root libvirt-qemu 0 Jun 28 13:22 7530a2d1-d6ec-4a04-95a2-da1f337ac1ad.47131<br>
---------T 2 root libvirt-qemu 0 Jun 28 13:22 7530a2d1-d6ec-4a04-95a2-da1f337ac1ad.52615<br>
-rw-rw-r-- 2 root libvirt-qemu 4194304 Jun 27 08:56 8fefae99-ed2a-4a8f-ab87-aa94c6bb6e68.100<br>
-rw-rw-r-- 2 root libvirt-qemu 4194304 Jun 27 11:29 8fefae99-ed2a-4a8f-ab87-aa94c6bb6e68.106<br>
-rw-rw-r-- 2 root libvirt-qemu 4194304 Jun 28 02:35 8fefae99-ed2a-4a8f-ab87-aa94c6bb6e68.137<br>
-rw-rw-r-- 2 root libvirt-qemu 4194304 Nov 4 2018 9544617c-901c-4613-a94b-ccfad4e38af1.165<br>
-rw-rw-r-- 2 root libvirt-qemu 4194304 Nov 4 2018 9544617c-901c-4613-a94b-ccfad4e38af1.168<br>
-rw-rw-r-- 2 root libvirt-qemu 4194304 Nov 5 2018 9544617c-901c-4613-a94b-ccfad4e38af1.193<br>
-rw-rw-r-- 2 root libvirt-qemu 4194304 Nov 6 2018 9544617c-901c-4613-a94b-ccfad4e38af1.3800<br>
---------T 2 root libvirt-qemu 0 Jun 28 15:02 b48a5934-5e5b-4918-8193-6ff36f685f70.46559<br>
-rw-rw---- 2 root libvirt-qemu 0 Oct 12 2018 c5bde2f2-3361-4d1a-9c88-28751ef74ce6.3568<br>
-rw-r--r-- 2 root libvirt-qemu 4194304 Apr 13 2018 c953c676-152d-4826-80ff-bd307fa7f6e5.10724<br>
-rw-r--r-- 2 root libvirt-qemu 4194304 Apr 11 2018 c953c676-152d-4826-80ff-bd307fa7f6e5.3101<br>
--- cut here ---<br>
<br>
-- <br>
Dave Sherohman<br>
_______________________________________________<br>
Gluster-users mailing list<br>
<a href="mailto:Gluster-users@gluster.org" target="_blank">Gluster-users@gluster.org</a><br>
<a href="https://lists.gluster.org/mailman/listinfo/gluster-users" rel="noreferrer" target="_blank">https://lists.gluster.org/mailman/listinfo/gluster-users</a><br>
</blockquote></div>