[Gluster-users] Does replace-brick migrate data?

Ravishankar N ravishankar at redhat.com
Fri May 31 04:57:18 UTC 2019


On 31/05/19 3:20 AM, Alan Orth wrote:
> Dear Ravi,
>
> I spent a bit of time inspecting the xattrs on some files and 
> directories on a few bricks for this volume and it looks a bit messy. 
> Even if I could make sense of it for a few and potentially heal them 
> manually, there are millions of files and directories in total so 
> that's definitely not a scalable solution. After a few missteps with 
> `replace-brick ... commit force` in the last week—one of which on a 
> brick that was dead/offline—as well as some premature `remove-brick` 
> commands, I'm unsure how to proceed and I'm getting demotivated. 
> It's scary how quickly things get out of hand in distributed systems...
Hi Alan,
The one good thing about gluster is that the data is always available 
directly on the backend bricks even if your volume has inconsistencies at 
the gluster level. So theoretically, if your cluster is FUBAR, you could 
just create a new volume and copy all of the data from the old volume's 
bricks onto it via its mount.
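
As a minimal sketch of that idea (the host name, volume name and mount 
point below are placeholders; the .glusterfs directory must be skipped, and 
files present on more than one replica brick will just be copied over with 
identical content):

    # on each server holding an old brick, copy its contents into the
    # new volume through a FUSE mount of that new volume
    mount -t glusterfs newserver:/newvol /mnt/newvol
    rsync -aAX --exclude='.glusterfs' /mnt/gluster/apps/ /mnt/newvol/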
>
> I had hoped that bringing the old brick back up would help, but by the 
> time I added it again a few days had passed and all the brick-id's had 
> changed due to the replace/remove brick commands, not to mention that 
> the trusted.afr.$volume-client-xx values were now probably pointing to 
> the wrong bricks (?).
>
> Anyways, a few hours ago I started a full heal on the volume and I see 
> that there is a sustained 100MiB/sec of network traffic going from the 
> old brick's host to the new one. The completed heals reported in the 
> logs look promising too:
>
> Old brick host:
>
> # grep '2019-05-30' /var/log/glusterfs/glustershd.log | grep -o -E 'Completed (data|metadata|entry) selfheal' | sort | uniq -c
>  281614 Completed data selfheal
>      84 Completed entry selfheal
>  299648 Completed metadata selfheal
>
> New brick host:
>
> # grep '2019-05-30' /var/log/glusterfs/glustershd.log | grep -o -E 'Completed (data|metadata|entry) selfheal' | sort | uniq -c
>  198256 Completed data selfheal
>   16829 Completed entry selfheal
>  229664 Completed metadata selfheal
>
> So that's good I guess, though I have no idea how long it will take or 
> if it will fix the "missing files" issue on the FUSE mount. I've 
> increased cluster.shd-max-threads to 8 to hopefully speed up the heal 
> process.
The afr xattrs should not cause files to disappear from the mount. If the 
xattr names do not match what each AFR subvolume expects for its children 
(e.g. in a replica 2 volume, trusted.afr.*-client-{0,1} for the 1st 
subvolume, client-{2,3} for the 2nd, and so on), then it won't heal the 
data, that is all. But in your case I see some inconsistencies, like one 
brick having the actual file (licenseserver.cfg) and the other having a 
linkto file (the one with the dht.linkto xattr) /in the same replica pair/.
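
If you want to double-check which brick each apps-client-N name refers to, 
the mapping is generated into the client volfile under 
/var/lib/glusterd/vols/apps/ (the exact file name can vary between 
versions); roughly:

    # print each protocol/client subvolume together with the host and
    # brick path it points at
    grep -E 'volume apps-client-|remote-host|remote-subvolume' \
        /var/lib/glusterd/vols/apps/trusted-apps.tcp-fuse.vol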
>
> I'd be happy for any advice or pointers,

Did you check if the .glusterfs hardlinks/symlinks exist and are in 
order for all bricks?
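
For a regular file, the entry under the brick's .glusterfs directory should 
be a hardlink to the file (same inode, link count of at least 2), stored as 
.glusterfs/<first two gfid characters>/<next two>/<full gfid>; for a 
directory it is a symlink. A quick spot check for the file above on wingu0, 
using the gfid from the trusted.gfid xattr you posted:

    # both paths should report the same Inode number and Links >= 2
    stat /mnt/gluster/apps/clcgenomics/clclicsrv/licenseserver.cfg
    stat /mnt/gluster/apps/.glusterfs/87/80/878003a2-fb52-43b6-a0d1-4d2f8b4306bd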

-Ravi

>
> On Wed, May 29, 2019 at 5:20 PM Alan Orth <alan.orth at gmail.com> wrote:
>
>     Dear Ravi,
>
>     Thank you for the link to the blog post series—it is very
>     informative and current! If I understand your blog post correctly
>     then I think the answer to your previous question about pending
>     AFRs is: no, there are no pending AFRs. I have identified one file
>     that is a good test case to try to understand what happened after
>     I issued the `gluster volume replace-brick ... commit force` a few
>     days ago and then added the same original brick back to the volume
>     later. This is the current state of the replica 2
>     distribute/replicate volume:
>
>     [root at wingu0 ~]# gluster volume info apps
>
>     Volume Name: apps
>     Type: Distributed-Replicate
>     Volume ID: f118d2da-79df-4ee1-919d-53884cd34eda
>     Status: Started
>     Snapshot Count: 0
>     Number of Bricks: 3 x 2 = 6
>     Transport-type: tcp
>     Bricks:
>     Brick1: wingu3:/mnt/gluster/apps
>     Brick2: wingu4:/mnt/gluster/apps
>     Brick3: wingu05:/data/glusterfs/sdb/apps
>     Brick4: wingu06:/data/glusterfs/sdb/apps
>     Brick5: wingu0:/mnt/gluster/apps
>     Brick6: wingu05:/data/glusterfs/sdc/apps
>     Options Reconfigured:
>     diagnostics.client-log-level: DEBUG
>     storage.health-check-interval: 10
>     nfs.disable: on
>
>     I checked the xattrs of one file that is missing from the volume's
>     FUSE mount (though I can read it if I access its full path
>     explicitly), but is present in several of the volume's bricks
>     (some with full size, others empty):
>
>     [root at wingu0 ~]# getfattr -d -m. -e hex /mnt/gluster/apps/clcgenomics/clclicsrv/licenseserver.cfg
>     getfattr: Removing leading '/' from absolute path names
>     # file: mnt/gluster/apps/clcgenomics/clclicsrv/licenseserver.cfg
>     security.selinux=0x73797374656d5f753a6f626a6563745f723a756e6c6162656c65645f743a733000
>     trusted.afr.apps-client-3=0x000000000000000000000000
>     trusted.afr.apps-client-5=0x000000000000000000000000
>     trusted.afr.dirty=0x000000000000000000000000
>     trusted.bit-rot.version=0x0200000000000000585a396f00046e15
>     trusted.gfid=0x878003a2fb5243b6a0d14d2f8b4306bd
>
>     [root at wingu05 ~]# getfattr -d -m. -e hex /data/glusterfs/sdb/apps/clcgenomics/clclicsrv/licenseserver.cfg
>     getfattr: Removing leading '/' from absolute path names
>     # file: data/glusterfs/sdb/apps/clcgenomics/clclicsrv/licenseserver.cfg
>     security.selinux=0x73797374656d5f753a6f626a6563745f723a756e6c6162656c65645f743a733000
>     trusted.gfid=0x878003a2fb5243b6a0d14d2f8b4306bd
>     trusted.gfid2path.82586deefbc539c3=0x34666437323861612d356462392d343836382d616232662d6564393031636566333561392f6c6963656e73657365727665722e636667
>     trusted.glusterfs.dht.linkto=0x617070732d7265706c69636174652d3200
>
>     [root at wingu05 ~]# getfattr -d -m. -e hex /data/glusterfs/sdc/apps/clcgenomics/clclicsrv/licenseserver.cfg
>     getfattr: Removing leading '/' from absolute path names
>     # file: data/glusterfs/sdc/apps/clcgenomics/clclicsrv/licenseserver.cfg
>     security.selinux=0x73797374656d5f753a6f626a6563745f723a756e6c6162656c65645f743a733000
>     trusted.gfid=0x878003a2fb5243b6a0d14d2f8b4306bd
>     trusted.gfid2path.82586deefbc539c3=0x34666437323861612d356462392d343836382d616232662d6564393031636566333561392f6c6963656e73657365727665722e636667
>
>     [root at wingu06 ~]# getfattr -d -m. -e hex /data/glusterfs/sdb/apps/clcgenomics/clclicsrv/licenseserver.cfg
>     getfattr: Removing leading '/' from absolute path names
>     # file: data/glusterfs/sdb/apps/clcgenomics/clclicsrv/licenseserver.cfg
>     security.selinux=0x73797374656d5f753a6f626a6563745f723a756e6c6162656c65645f743a733000
>     trusted.gfid=0x878003a2fb5243b6a0d14d2f8b4306bd
>     trusted.gfid2path.82586deefbc539c3=0x34666437323861612d356462392d343836382d616232662d6564393031636566333561392f6c6963656e73657365727665722e636667
>     trusted.glusterfs.dht.linkto=0x617070732d7265706c69636174652d3200
>
>     According to the trusted.afr.apps-client-xx xattrs this particular
>     file should be on bricks with id "apps-client-3" and
>     "apps-client-5". It took me a few hours to realize that the
>     brick-id values are recorded in the volume's volfiles in
>     /var/lib/glusterd/vols/apps/bricks. After comparing those brick-id
>     values with a volfile backup from before the replace-brick, I
>     realized that the files are simply on the wrong brick now as far
>     as Gluster is concerned. This particular file is now on the brick
>     for "apps-client-4". As an experiment I copied this one file to
>     the two bricks listed in the xattrs and I was then able to see the
>     file from the FUSE mount (yay!).
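>
>     (For what it's worth, assuming the standard /var/lib/glusterd layout,
>     that mapping can be dumped in one go with something like
>
>     grep -H 'brick-id=' /var/lib/glusterd/vols/apps/bricks/*
>
>     which prints each brick file alongside its apps-client-N id.)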
>
>     Other than replacing the brick, removing it, and then adding the
>     old brick on the original server back, there has been no change in
>     the data this entire time. Can I change the brick IDs in the
>     volfiles so they reflect where the data actually is? Or perhaps
>     script something to reset all the xattrs on the files/directories
>     to point to the correct bricks?
>
>     Thank you for any help or pointers,
>
>     On Wed, May 29, 2019 at 7:24 AM Ravishankar N
>     <ravishankar at redhat.com> wrote:
>
>
>         On 29/05/19 9:50 AM, Ravishankar N wrote:
>>
>>
>>         On 29/05/19 3:59 AM, Alan Orth wrote:
>>>         Dear Ravishankar,
>>>
>>>         I'm not sure if Brick4 had pending AFRs because I don't know
>>>         what that means and it's been a few days so I am not sure I
>>>         would be able to find that information.
>>         When you find some time, have a look at a blog
>>         <http://wp.me/peiBB-6b> series I wrote about AFR- I've tried
>>         to explain what one needs to know to debug replication
>>         related issues in it.
>
>         I made a typo. The correct URL for the blog is https://wp.me/peiBB-6b
>
>         -Ravi
>
>>>
>>>         Anyways, after wasting a few days rsyncing the old brick to
>>>         a new host I decided to just try to add the old brick back
>>>         into the volume instead of bringing it up on the new host. I
>>>         created a new brick directory on the old host, moved the old
>>>         brick's contents into that new directory (minus the
>>>         .glusterfs directory), added the new brick to the volume,
>>>         and then did Vlad's find/stat trick¹ from the brick to the
>>>         FUSE mount point.
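>>>
>>>         (Roughly, the trick amounts to walking the old brick and
>>>         stat-ing each corresponding path through the FUSE mount so
>>>         that the lookups trigger healing of the entries; a sketch of
>>>         the idea, with the brick and mount paths as placeholders:
>>>
>>>         cd /path/to/old/brick
>>>         find . -path ./.glusterfs -prune -o -print | sed 's|^\./||' | \
>>>             while read -r f; do stat "/mnt/fuse/$f" > /dev/null; done
>>>
>>>         The exact commands in the referenced post may differ.)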
>>>
>>>         The interesting problem I have now is that some files don't
>>>         appear in the FUSE mount's directory listings, but I can
>>>         actually list them directly and even read them. What could
>>>         cause that?
>>         Not sure; there are too many variables in the hacks you did to
>>         take a guess. You can check whether the contents of the .glusterfs
>>         folder are in order on the new brick (for example, that the
>>         hardlinks for files and symlinks for directories are present, etc.).
>>         Regards,
>>         Ravi
>>>
>>>         Thanks,
>>>
>>>         ¹
>>>         https://lists.gluster.org/pipermail/gluster-users/2018-February/033584.html
>>>
>>>         On Fri, May 24, 2019 at 4:59 PM Ravishankar N
>>>         <ravishankar at redhat.com> wrote:
>>>
>>>
>>>             On 23/05/19 2:40 AM, Alan Orth wrote:
>>>>             Dear list,
>>>>
>>>>             I seem to have gotten into a tricky situation. Today I
>>>>             brought up a shiny new server with new disk arrays and
>>>>             attempted to replace one brick of a replica 2
>>>>             distribute/replicate volume on an older server using
>>>>             the `replace-brick` command:
>>>>
>>>>             # gluster volume replace-brick homes wingu0:/mnt/gluster/homes wingu06:/data/glusterfs/sdb/homes commit force
>>>>
>>>>             The command was successful and I see the new brick in
>>>>             the output of `gluster volume info`. The problem is
>>>>             that Gluster doesn't seem to be migrating the data,
>>>
>>>             `replace-brick` definitely must heal (not migrate) the
>>>             data. In your case, data must have been healed from
>>>             Brick-4 to the replaced Brick-3. Are there any errors in
>>>             the self-heal daemon logs of Brick-4's node? Does
>>>             Brick-4 have pending AFR xattrs blaming Brick-3? The doc
>>>             is a bit out of date; the replace-brick command internally
>>>             does all the setfattr steps that are mentioned in it.
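>>>
>>>             (One way to check is to dump the afr xattrs on the root of
>>>             Brick-4 on wingu05; as an assumed example, with the default
>>>             numbering where the replaced Brick-3 is homes-client-2:
>>>
>>>             getfattr -d -m trusted.afr -e hex /data/glusterfs/sdb/homes
>>>
>>>             A non-zero trusted.afr.homes-client-2 value there would mean
>>>             Brick-4 is blaming the replaced brick and heals are pending.)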
>>>
>>>             -Ravi
>>>
>>>
>>>>             and now the original brick that I replaced is no longer
>>>>             part of the volume (and a few terabytes of data are
>>>>             just sitting on the old brick):
>>>>
>>>>             # gluster volume info homes | grep -E "Brick[0-9]:"
>>>>             Brick1: wingu4:/mnt/gluster/homes
>>>>             Brick2: wingu3:/mnt/gluster/homes
>>>>             Brick3: wingu06:/data/glusterfs/sdb/homes
>>>>             Brick4: wingu05:/data/glusterfs/sdb/homes
>>>>             Brick5: wingu05:/data/glusterfs/sdc/homes
>>>>             Brick6: wingu06:/data/glusterfs/sdc/homes
>>>>
>>>>             I see the Gluster docs have a more complicated
>>>>             procedure for replacing bricks that involves
>>>>             getfattr/setfattr¹. How can I tell Gluster about the
>>>>             old brick? I see that I have a backup of the old
>>>>             volfile thanks to yum's rpmsave function if that helps.
>>>>
>>>>             We are using Gluster 5.6 on CentOS 7. Thank you for any
>>>>             advice you can give.
>>>>
>>>>             ¹
>>>>             https://docs.gluster.org/en/latest/Administrator%20Guide/Managing%20Volumes/#replace-faulty-brick
>>>>
>>>>             -- 
>>>>             Alan Orth
>>>>             alan.orth at gmail.com
>>>>             https://picturingjordan.com
>>>>             https://englishbulgaria.net
>>>>             https://mjanja.ch
>>>>             "In heaven all the interesting people are missing."
>>>>             ―Friedrich Nietzsche
>>>>
>>>>             _______________________________________________
>>>>             Gluster-users mailing list
>>>>             Gluster-users at gluster.org
>>>>             https://lists.gluster.org/mailman/listinfo/gluster-users
>>>
>>>
>>>
>>>         -- 
>>>         Alan Orth
>>>         alan.orth at gmail.com
>>>         https://picturingjordan.com
>>>         https://englishbulgaria.net
>>>         https://mjanja.ch
>>>         "In heaven all the interesting people are missing."
>>>         ―Friedrich Nietzsche
>>
>>         _______________________________________________
>>         Gluster-users mailing list
>>         Gluster-users at gluster.org
>>         https://lists.gluster.org/mailman/listinfo/gluster-users
>
>
>
>     -- 
>     Alan Orth
>     alan.orth at gmail.com
>     https://picturingjordan.com
>     https://englishbulgaria.net
>     https://mjanja.ch
>     "In heaven all the interesting people are missing." ―Friedrich
>     Nietzsche
>
>
>
> -- 
> Alan Orth
> alan.orth at gmail.com
> https://picturingjordan.com
> https://englishbulgaria.net
> https://mjanja.ch
> "In heaven all the interesting people are missing." ―Friedrich Nietzsche