[Gluster-users] Removing subvolume from dist/rep volume

Dave Sherohman
Fri Jun 28 14:24:54 UTC 2019

On Thu, Jun 27, 2019 at 12:17:10PM +0530, Nithya Balachandran wrote:
> There are some edge cases that may prevent a file from being migrated
> during a remove-brick. Please do the following after this:
>    1. Check the remove-brick status for any failures.  If there are any,
>    check the rebalance log file for errors.
>    2. Even if there are no failures, check the removed bricks to see if any
>    files have not been migrated. If there are any, please check that they are
>    valid files on the brick and copy them to the volume from the brick to the
>    mount point.

Well, looks like I hit one of those edge cases.  Probably because of
some issues around a reboot last September which left a handful of files
in a state where self-heal identified them as needing to be healed but
was unable to actually heal them.  (Check the list archives for
"Kicking a stuck heal", posted on Sept 4, if you want more details.)

So I'm getting 9 failures on the arbiter (merlin), 8 on one data brick
(gandalf), and 3 on the other (saruman).  Looking in
/var/log/gluster/palantir-rebalance.log, I see matching numbers of
entries of the form:

migrate file failed: /.shard/291e9749-2d1b-47af-ad53-3a09ad4e64c6.229: failed to lock file on palantir-replicate-1 [Stale file handle]


Also, merlin has four errors, and gandalf has one, of the form:

Gfid mismatch detected for <gfid:be318638-e8a0-4c6d-977d-7a937aa84806>/0f500288-ff62-4f0b-9574-53f510b4159f.2898>, 9f00c0fe-58c3-457e-a2e6-f6a006d1cfc6 on palantir-client-7 and 08bb7cdc-172b-4c21-916a-2a244c095a3e on palantir-client-1.

There are no gfid mismatches recorded on saruman.  All of the gfid
mismatches are for <gfid:be318638-e8a0-4c6d-977d-7a937aa84806> and (on
saruman) appear to correspond to 0-byte files (e.g.,
.shard/0f500288-ff62-4f0b-9574-53f510b4159f.2898, in the case of the
gfid mismatch quoted above).
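
For anyone wanting to tally these per file rather than by eye, here is
a rough sketch that counts the two message types quoted above.  The
regular expressions are modeled only on those two quoted lines, so they
are an assumption about the log format and may need adjusting for other
Gluster versions; the function name summarize() is made up for this
example.

```python
import re
from collections import Counter

# Patterns modeled on the two log messages quoted in this mail; they are
# an assumption about the format, not taken from the Gluster sources.
LOCK_FAIL = re.compile(
    r"migrate file failed: (?P<path>\S+): failed to lock file on "
    r"(?P<subvol>\S+) \[(?P<err>[^\]]+)\]"
)
GFID_MISMATCH = re.compile(
    r"Gfid mismatch detected for <gfid:(?P<parent>[0-9a-f-]{36})>/"
    r"(?P<name>[^>]+)>, (?P<gfid_a>[0-9a-f-]{36}) on (?P<client_a>\S+) "
    r"and (?P<gfid_b>[0-9a-f-]{36}) on (?P<client_b>[^\s.]+)"
)


def summarize(lines):
    """Return (lock-failure counts keyed by path, list of mismatch dicts)."""
    lock_failures = Counter()
    mismatches = []
    for line in lines:
        m = LOCK_FAIL.search(line)
        if m:
            lock_failures[m.group("path")] += 1
            continue
        m = GFID_MISMATCH.search(line)
        if m:
            mismatches.append(m.groupdict())
    return lock_failures, mismatches
```

Feeding it the rebalance log (for example, summarize(open(logfile)))
gives one counter of failed shard paths and one list of mismatch
records holding the shard name and the gfid seen on each client.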

For both types of errors, all affected files are in .shard/ and have
UUID-style names, so I have no idea which actual files they belong to.
File sizes are generally either 0 bytes or 4M (exactly), although one of
them has a size slightly larger than 3M.  So I'm assuming they're chunks
of larger files (which would be almost all the files on the volume -
it's primarily holding disk image files for kvm servers).
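
If these are indeed shards, the numeric suffix should say which part of
the disk image each one holds.  The sketch below assumes a shard block
size of 4 MiB (inferred only from the 4194304-byte file sizes, not
checked against the volume options) and assumes that block 0 lives in
the base file while .shard/<gfid>.N holds block N.  The function name
is made up for illustration.

```python
# Assumption: shard block size inferred from the 4194304-byte shard files.
SHARD_BLOCK_SIZE = 4 * 1024 * 1024


def shard_byte_range(index, block_size=SHARD_BLOCK_SIZE):
    """Return (first_byte, last_byte) of the image covered by shard `index`.

    Assumes block 0 is stored in the base file itself, so files under
    .shard/ are numbered from 1.
    """
    if index < 1:
        raise ValueError("shards under .shard/ are numbered from 1")
    start = index * block_size
    return start, start + block_size - 1
```

For example, shard_byte_range(229) gives the byte offsets inside the
image that would be backed by the .229 shard, which could help judge
how much of a VM's disk is affected.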

Web searches generally seem to consider gfid mismatches to be a form of
split-brain, but `gluster volume heal palantir info split-brain` shows
"Number of entries in split-brain: 0" for all bricks, including those
bricks which are reporting gfid mismatches.
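
One way to see the mismatch directly, independent of what heal info
reports, would be to read the trusted.gfid extended attribute of the
same shard on each brick and compare the values.  A minimal sketch,
assuming it is run as root on the brick host (trusted.* attributes are
not readable otherwise) and that the attribute holds the 16 raw bytes
of the gfid; the helper names are made up:

```python
import os
import uuid


def format_gfid(raw):
    """Render the 16 raw bytes of a trusted.gfid xattr as a UUID string."""
    return str(uuid.UUID(bytes=raw))


def read_gfid(path_on_brick):
    """Read trusted.gfid from a file on a brick (Linux only, needs root)."""
    return format_gfid(os.getxattr(path_on_brick, "trusted.gfid"))
```

Running read_gfid() on the brick-side path of a shard (on saruman that
would be somewhere under /var/local/brick0/data/.shard/) on each host
should show whether the bricks really disagree in the way the log
message claims.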

Given all that, how do I proceed with cleaning up the stale handle
issues?  I would guess that this will involve somehow converting the
shard filename to a "real" filename, then shutting down the
corresponding VM and maybe doing some additional cleanup.
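
As far as I understand the layout, the UUID part of a shard name is the
gfid of the base file, and every file on a brick has a hard link under
.glusterfs/ named after its gfid.  If that holds, the mapping from
shard name to "real" file could be sketched like this.  The function
names are made up, and the brick path in the usage note is the one on
saruman; other bricks may differ.

```python
import os
import uuid


def split_shard_name(name):
    """Split '<gfid>.<index>' (with or without a leading path) into parts."""
    base, _, index = os.path.basename(name).rpartition(".")
    gfid = str(uuid.UUID(base))  # raises ValueError if base is not a UUID
    return gfid, int(index)


def gfid_backend_path(brick_root, gfid):
    """Path of the .glusterfs hard link for `gfid` under a brick root."""
    return os.path.join(brick_root, ".glusterfs", gfid[0:2], gfid[2:4], gfid)
```

For /.shard/291e9749-2d1b-47af-ad53-3a09ad4e64c6.229 and the brick root
/var/local/brick0/data this yields a path under .glusterfs/29/1e/.  On
a brick that actually holds the base file, running find on the brick
root with -samefile and that path should then print the image's real
name.  The base file may well live on a different subvolume than the
shard, so the link will not exist on every brick.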

And then there's the gfid mismatches.  Since they're for 0-byte files,
is it safe to just ignore them on the assumption that they only hold
metadata?  Or do I need to do some kind of split-brain resolution on
them (even though gluster says no files are in split-brain)?

Finally, a listing of /var/local/brick0/data/.shard on saruman, in case
any of the information it contains (like file sizes/permissions) might
provide clues to resolving the errors:

--- cut here ---
root@saruman:/var/local/brick0/data/.shard# ls -l
total 63996
-rw-rw---- 2 root libvirt-qemu       0 Sep 17  2018 0f500288-ff62-4f0b-9574-53f510b4159f.2864
-rw-rw---- 2 root libvirt-qemu       0 Sep 17  2018 0f500288-ff62-4f0b-9574-53f510b4159f.2868
-rw-rw---- 2 root libvirt-qemu       0 Sep 17  2018 0f500288-ff62-4f0b-9574-53f510b4159f.2879
-rw-rw---- 2 root libvirt-qemu       0 Sep 17  2018 0f500288-ff62-4f0b-9574-53f510b4159f.2898
-rw------- 2 root libvirt-qemu 4194304 May 17 14:42 291e9749-2d1b-47af-ad53-3a09ad4e64c6.229
-rw------- 2 root libvirt-qemu 4194304 Jun 24 09:10 291e9749-2d1b-47af-ad53-3a09ad4e64c6.925
-rw-rw-r-- 2 root libvirt-qemu 4194304 Jun 26 12:54 2df12cb0-6cf4-44ae-8b0a-4a554791187e.266
-rw-rw-r-- 2 root libvirt-qemu 4194304 Jun 26 16:30 2df12cb0-6cf4-44ae-8b0a-4a554791187e.820
-rw-r--r-- 2 root libvirt-qemu 4194304 Jun 17 20:22 323186b1-6296-4cbe-8275-b940cc9d65cf.27466
-rw-r--r-- 2 root libvirt-qemu 4194304 Jun 27 05:01 323186b1-6296-4cbe-8275-b940cc9d65cf.32575
-rw-r--r-- 2 root libvirt-qemu 3145728 Jun 11 13:23 323186b1-6296-4cbe-8275-b940cc9d65cf.3448
---------T 2 root libvirt-qemu       0 Jun 28 14:26 4cd094f4-0344-4660-98b0-83249d5bd659.22998
-rw------- 2 root libvirt-qemu 4194304 Mar 13  2018 6cdd2e5c-f49e-492b-8039-239e71577836.1302
---------T 2 root libvirt-qemu       0 Jun 28 13:22 7530a2d1-d6ec-4a04-95a2-da1f337ac1ad.47131
---------T 2 root libvirt-qemu       0 Jun 28 13:22 7530a2d1-d6ec-4a04-95a2-da1f337ac1ad.52615
-rw-rw-r-- 2 root libvirt-qemu 4194304 Jun 27 08:56 8fefae99-ed2a-4a8f-ab87-aa94c6bb6e68.100
-rw-rw-r-- 2 root libvirt-qemu 4194304 Jun 27 11:29 8fefae99-ed2a-4a8f-ab87-aa94c6bb6e68.106
-rw-rw-r-- 2 root libvirt-qemu 4194304 Jun 28 02:35 8fefae99-ed2a-4a8f-ab87-aa94c6bb6e68.137
-rw-rw-r-- 2 root libvirt-qemu 4194304 Nov  4  2018 9544617c-901c-4613-a94b-ccfad4e38af1.165
-rw-rw-r-- 2 root libvirt-qemu 4194304 Nov  4  2018 9544617c-901c-4613-a94b-ccfad4e38af1.168
-rw-rw-r-- 2 root libvirt-qemu 4194304 Nov  5  2018 9544617c-901c-4613-a94b-ccfad4e38af1.193
-rw-rw-r-- 2 root libvirt-qemu 4194304 Nov  6  2018 9544617c-901c-4613-a94b-ccfad4e38af1.3800
---------T 2 root libvirt-qemu       0 Jun 28 15:02 b48a5934-5e5b-4918-8193-6ff36f685f70.46559
-rw-rw---- 2 root libvirt-qemu       0 Oct 12  2018 c5bde2f2-3361-4d1a-9c88-28751ef74ce6.3568
-rw-r--r-- 2 root libvirt-qemu 4194304 Apr 13  2018 c953c676-152d-4826-80ff-bd307fa7f6e5.10724
-rw-r--r-- 2 root libvirt-qemu 4194304 Apr 11  2018 c953c676-152d-4826-80ff-bd307fa7f6e5.3101
--- cut here ---
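
To make the listing above easier to read, here is a rough sketch that
groups the entries by base gfid and labels each one.  It treats a
0-byte file with mode ---------T as a probable DHT link-to pointer
(which would need confirming via the trusted.glusterfs.dht.linkto
xattr) and assumes a 4 MiB shard size as before.  The labels and
function names are made up for this example.

```python
from collections import defaultdict

# Assumption: shard block size inferred from the 4194304-byte entries.
SHARD_BLOCK_SIZE = 4194304


def classify(mode, size):
    """Label one entry by its ls -l mode string and size in bytes."""
    if mode == "---------T" and size == 0:
        return "linkto"   # probably a DHT link-to pointer, not real data
    if size == 0:
        return "empty"
    if size == SHARD_BLOCK_SIZE:
        return "full"
    return "partial"


def group_listing(text):
    """Map base gfid -> [(shard index, label), ...] from `ls -l` output."""
    groups = defaultdict(list)
    for line in text.splitlines():
        fields = line.split()
        # Skip the prompt, the "total" line and anything else that does
        # not look like a nine-column ls -l entry.
        if len(fields) < 9 or len(fields[0]) != 10:
            continue
        mode, size, name = fields[0], int(fields[4]), fields[-1]
        base, _, index = name.rpartition(".")
        groups[base].append((int(index), classify(mode, size)))
    return dict(groups)
```

By my count, applied to the listing above this gives 16 full shards,
one partial shard, five empty regular files and four probable link-to
files.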

