[Gluster-users] Does replace-brick migrate data?

Alan Orth alan.orth at gmail.com
Tue Jun 4 22:08:34 UTC 2019


Hi Ravi,

You're right that I had mentioned using rsync to copy the brick content to
a new host, but in the end I actually decided not to bring it up on a new
brick. Instead I added the original brick back into the volume. So the
xattrs and symlinks to .glusterfs on the original brick are fine. I think
the problem probably lies with a remove-brick operation that was interrupted. A
few weeks ago, during maintenance, I tried to remove a brick, and after twenty
minutes with no obvious progress I stopped it; after that the bricks were still
part of the volume.

In the last few days I have run a fix-layout, which took 26 hours and finished
successfully. I then started a full index heal; it has healed about 3.3 million
files in a few days, and I see a clear increase in network traffic from the old
brick host to the new brick host over that time. Once the full index heal
completes I will try to do a rebalance.
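
In case it's useful to anyone following along, this is roughly the sequence of
commands I've been using (with "apps" standing in here for the actual volume
name):

# gluster volume rebalance apps fix-layout start
# gluster volume heal apps full
# gluster volume heal apps info summary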

Thank you,


On Mon, Jun 3, 2019 at 7:40 PM Ravishankar N <ravishankar at redhat.com> wrote:

>
> On 01/06/19 9:37 PM, Alan Orth wrote:
>
> Dear Ravi,
>
> The .glusterfs hardlinks/symlinks should be fine. I'm not sure how I could
> verify them for six bricks and millions of files, though... :\
>
> Hi Alan,
>
> The reason I asked this is because you had mentioned in one of your
> earlier emails that when you moved content from the old brick to the new
> one, you had skipped the .glusterfs directory. So I was assuming that when
> you added back this new brick to the cluster, it might have been missing
> the .glusterfs entries. If that is the case, one way to verify could be to
> check using a script if all files on the brick have a link-count of at
> least 2 and all dirs have valid symlinks inside .glusterfs pointing to
> themselves.
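>
> Something along these lines might work as a starting point. It is an untested
> sketch, and the BRICK path is only an example; point it at the brick root you
> want to scan:
>
> #!/bin/bash
> BRICK=/mnt/gluster/apps
> # Regular files outside .glusterfs should have a link count of at least 2
> # (the extra link is the hardlink at .glusterfs/xx/yy/<gfid>); print any that
> # don't:
> find "$BRICK" -path "$BRICK/.glusterfs" -prune -o -type f -links 1 -print
> # Directories should have a symlink at .glusterfs/xx/yy/<gfid> pointing back
> # to them; print any where that symlink is missing:
> find "$BRICK" -path "$BRICK/.glusterfs" -prune -o -type d -print |
> while read -r d; do
>     gfid=$(getfattr -n trusted.gfid -e hex --absolute-names "$d" 2>/dev/null |
>            awk -F'0x' '/trusted.gfid=/{print $2}')
>     [ -z "$gfid" ] && { echo "no gfid: $d"; continue; }
>     uuid=$(echo "$gfid" | sed -r 's/^(.{8})(.{4})(.{4})(.{4})(.{12})$/\1-\2-\3-\4-\5/')
>     [ -L "$BRICK/.glusterfs/${gfid:0:2}/${gfid:2:2}/$uuid" ] ||
>         echo "missing .glusterfs symlink: $d"
> done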
>
>
> I had a small success in fixing some issues with duplicated files on the
> FUSE mount point yesterday. I read quite a bit about the elastic hashing
> algorithm that determines which files get placed on which bricks based on
> the hash of their filename and the trusted.glusterfs.dht xattr on brick
> directories (thanks to Joe Julian's blog post and Python script for showing
> how it works¹). With that knowledge I looked closer at one of the files
> that was appearing as duplicated on the FUSE mount and found that it was
> also duplicated on more than `replica 2` bricks. For this particular file I
> found two "real" files and several zero-size files with
> trusted.glusterfs.dht.linkto xattrs. Neither of the "real" files was on
> the correct brick as far as the DHT layout is concerned, so I copied one of
> them to the correct brick, deleted the others and their hard links, and then
> ran a `stat` on the file from the FUSE mount point, after which it fixed
> itself. Yay!
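>
> (For anyone curious, the layout check itself is just a matter of reading the
> trusted.glusterfs.dht xattr on each brick's copy of the parent directory,
> something like
>
> # getfattr -n trusted.glusterfs.dht -e hex /mnt/gluster/apps/clcgenomics/clclicsrv
>
> with the path here being an illustrative one from one of our bricks; the hex
> value encodes the hash range that directory is assigned on that brick.)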
>
> Could this have been caused by a replace-brick that got interrupted and
> didn't finish re-labeling the xattrs?
>
> No, replace-brick only initiates AFR self-heal, which just copies the
> contents from the other brick(s) of the *same* replica pair into the
> replaced brick.  The link-to files are created by DHT when you rename a
> file from the client. If the new name hashes to a different brick, DHT
> does not move the entire file there. Instead it creates the link-to file
> (the one with the dht.linkto xattr) on the hashed subvolume. The value of
> this xattr points to the brick where the actual data resides (use `getfattr
> -e text` to see it for yourself). Perhaps you had attempted a rebalance or
> remove-brick earlier and interrupted that?
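>
> For example, on a brick holding one of the zero-byte copies, something like
>
> # getfattr -n trusted.glusterfs.dht.linkto -e text /data/glusterfs/sdb/apps/clcgenomics/clclicsrv/licenseserver.cfg
>
> should print the name of the replicate subvolume that actually has the data
> (the hex value you pasted from that brick decodes to "apps-replicate-2").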
>
> Should I be thinking of some heuristics to identify and fix these issues
> with a script (incorrect brick placement), or is this something a fix
> layout or repeated volume heals can fix? I've already completed a whole
> heal on this particular volume this week and it did heal about 1,000,000
> files (mostly data and metadata, but about 20,000 entry heals as well).
>
> Maybe you should let the AFR self-heals complete first and then attempt a
> full rebalance to take care of the dht link-to files. But if the files number
> in the millions, it could take quite some time to complete.
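>
> That is, once `gluster volume heal apps info` no longer shows pending
> entries, something like:
>
> # gluster volume rebalance apps start
> # gluster volume rebalance apps status
>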
> Regards,
> Ravi
>
> Thanks for your support,
>
> ¹ https://joejulian.name/post/dht-misses-are-expensive/
>
> On Fri, May 31, 2019 at 7:57 AM Ravishankar N <ravishankar at redhat.com>
> wrote:
>
>>
>> On 31/05/19 3:20 AM, Alan Orth wrote:
>>
>> Dear Ravi,
>>
>> I spent a bit of time inspecting the xattrs on some files and directories
>> on a few bricks for this volume and it looks a bit messy. Even if I could
>> make sense of it for a few and potentially heal them manually, there are
>> millions of files and directories in total so that's definitely not a
>> scalable solution. After a few missteps with `replace-brick ... commit
>> force` in the last week—one of which on a brick that was dead/offline—as
>> well as some premature `remove-brick` commands, I'm unsure how to
>> proceed and I'm getting demotivated. It's scary how quickly things get out
>> of hand in distributed systems...
>>
>> Hi Alan,
>> The one good thing about gluster is that the data is always available
>> directly on the backend bricks even if your volume has inconsistencies at
>> the gluster level. So theoretically, if your cluster is FUBAR, you could
>> just create a new volume and copy all data onto it via its mount from the
>> old volume's bricks.
>>
>>
>> I had hoped that bringing the old brick back up would help, but by the
>> time I added it again a few days had passed and all the brick-ids had
>> changed due to the replace/remove brick commands, not to mention that the
>> trusted.afr.$volume-client-xx values were now probably pointing to the
>> wrong bricks (?).
>>
>> Anyways, a few hours ago I started a full heal on the volume and I see
>> that there is a sustained 100MiB/sec of network traffic going from the old
>> brick's host to the new one. The completed heals reported in the logs look
>> promising too:
>>
>> Old brick host:
>>
>> # grep '2019-05-30' /var/log/glusterfs/glustershd.log | grep -o -E
>> 'Completed (data|metadata|entry) selfheal' | sort | uniq -c
>>  281614 Completed data selfheal
>>      84 Completed entry selfheal
>>  299648 Completed metadata selfheal
>>
>> New brick host:
>>
>> # grep '2019-05-30' /var/log/glusterfs/glustershd.log | grep -o -E
>> 'Completed (data|metadata|entry) selfheal' | sort | uniq -c
>>  198256 Completed data selfheal
>>   16829 Completed entry selfheal
>>  229664 Completed metadata selfheal
>>
>> So that's good I guess, though I have no idea how long it will take or if
>> it will fix the "missing files" issue on the FUSE mount. I've increased
>> cluster.shd-max-threads to 8 to hopefully speed up the heal process.
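>>
>> (For reference, that was just:
>>
>> # gluster volume set <volume> cluster.shd-max-threads 8
>>
>> with <volume> being the volume in question.)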
>>
>> The afr xattrs should not cause files to disappear from the mount. If the
>> xattr names do not match what each AFR subvolume expects for its children
>> (e.g. in a replica 2 volume, trusted.afr.*-client-{0,1} for the 1st subvolume,
>> client-{2,3} for the 2nd, and so on), then it won't heal the data, that is
>> all. But in your case I see some inconsistencies, like one brick having the
>> actual file (licenseserver.cfg) and the other having a linkto file (the
>> one with the dht.linkto xattr) *in the same replica pair*.
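>>
>> To see what AFR itself records, you can dump just the afr xattrs for that
>> file on both bricks of the pair, for example:
>>
>> # getfattr -d -m trusted.afr -e hex /data/glusterfs/sdb/apps/clcgenomics/clclicsrv/licenseserver.cfg
>>
>> A non-zero value in trusted.afr.<volume>-client-N on one brick means that
>> brick is blaming the brick that client-N refers to, i.e. there are pending
>> heals towards it.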
>>
>>
>> I'd be happy for any advice or pointers,
>>
>> Did you check if the .glusterfs hardlinks/symlinks exist and are in order
>> for all bricks?
>>
>> -Ravi
>>
>>
>> On Wed, May 29, 2019 at 5:20 PM Alan Orth <alan.orth at gmail.com> wrote:
>>
>>> Dear Ravi,
>>>
>>> Thank you for the link to the blog post series—it is very informative
>>> and current! If I understand your blog post correctly then I think the
>>> answer to your previous question about pending AFRs is: no, there are no
>>> pending AFRs. I have identified one file that is a good test case to try to
>>> understand what happened after I issued the `gluster volume replace-brick
>>> ... commit force` a few days ago and then added the same original brick
>>> back to the volume later. This is the current state of the replica 2
>>> distribute/replicate volume:
>>>
>>> [root at wingu0 ~]# gluster volume info apps
>>>
>>> Volume Name: apps
>>> Type: Distributed-Replicate
>>> Volume ID: f118d2da-79df-4ee1-919d-53884cd34eda
>>> Status: Started
>>> Snapshot Count: 0
>>> Number of Bricks: 3 x 2 = 6
>>> Transport-type: tcp
>>> Bricks:
>>> Brick1: wingu3:/mnt/gluster/apps
>>> Brick2: wingu4:/mnt/gluster/apps
>>> Brick3: wingu05:/data/glusterfs/sdb/apps
>>> Brick4: wingu06:/data/glusterfs/sdb/apps
>>> Brick5: wingu0:/mnt/gluster/apps
>>> Brick6: wingu05:/data/glusterfs/sdc/apps
>>> Options Reconfigured:
>>> diagnostics.client-log-level: DEBUG
>>> storage.health-check-interval: 10
>>> nfs.disable: on
>>>
>>> I checked the xattrs of one file that is missing from the volume's FUSE
>>> mount (though I can read it if I access its full path explicitly), but is
>>> present in several of the volume's bricks (some with full size, others
>>> empty):
>>>
>>> [root at wingu0 ~]# getfattr -d -m. -e hex
>>> /mnt/gluster/apps/clcgenomics/clclicsrv/licenseserver.cfg
>>>
>>> getfattr: Removing leading '/' from absolute path names
>>> # file: mnt/gluster/apps/clcgenomics/clclicsrv/licenseserver.cfg
>>> security.selinux=0x73797374656d5f753a6f626a6563745f723a756e6c6162656c65645f743a733000
>>> trusted.afr.apps-client-3=0x000000000000000000000000
>>> trusted.afr.apps-client-5=0x000000000000000000000000
>>> trusted.afr.dirty=0x000000000000000000000000
>>> trusted.bit-rot.version=0x0200000000000000585a396f00046e15
>>> trusted.gfid=0x878003a2fb5243b6a0d14d2f8b4306bd
>>>
>>> [root at wingu05 ~]# getfattr -d -m. -e hex /data/glusterfs/sdb/apps/clcgenomics/clclicsrv/licenseserver.cfg
>>> getfattr: Removing leading '/' from absolute path names
>>> # file: data/glusterfs/sdb/apps/clcgenomics/clclicsrv/licenseserver.cfg
>>> security.selinux=0x73797374656d5f753a6f626a6563745f723a756e6c6162656c65645f743a733000
>>> trusted.gfid=0x878003a2fb5243b6a0d14d2f8b4306bd
>>> trusted.gfid2path.82586deefbc539c3=0x34666437323861612d356462392d343836382d616232662d6564393031636566333561392f6c6963656e73657365727665722e636667
>>> trusted.glusterfs.dht.linkto=0x617070732d7265706c69636174652d3200
>>>
>>> [root at wingu05 ~]# getfattr -d -m. -e hex /data/glusterfs/sdc/apps/clcgenomics/clclicsrv/licenseserver.cfg
>>> getfattr: Removing leading '/' from absolute path names
>>> # file: data/glusterfs/sdc/apps/clcgenomics/clclicsrv/licenseserver.cfg
>>> security.selinux=0x73797374656d5f753a6f626a6563745f723a756e6c6162656c65645f743a733000
>>> trusted.gfid=0x878003a2fb5243b6a0d14d2f8b4306bd
>>> trusted.gfid2path.82586deefbc539c3=0x34666437323861612d356462392d343836382d616232662d6564393031636566333561392f6c6963656e73657365727665722e636667
>>>
>>> [root at wingu06 ~]# getfattr -d -m. -e hex /data/glusterfs/sdb/apps/clcgenomics/clclicsrv/licenseserver.cfg
>>> getfattr: Removing leading '/' from absolute path names
>>> # file: data/glusterfs/sdb/apps/clcgenomics/clclicsrv/licenseserver.cfg
>>> security.selinux=0x73797374656d5f753a6f626a6563745f723a756e6c6162656c65645f743a733000
>>> trusted.gfid=0x878003a2fb5243b6a0d14d2f8b4306bd
>>> trusted.gfid2path.82586deefbc539c3=0x34666437323861612d356462392d343836382d616232662d6564393031636566333561392f6c6963656e73657365727665722e636667
>>> trusted.glusterfs.dht.linkto=0x617070732d7265706c69636174652d3200
>>>
>>> According to the trusted.afr.apps-client-xx xattrs this particular file
>>> should be on bricks with id "apps-client-3" and "apps-client-5". It took me
>>> a few hours to realize that the brick-id values are recorded in the
>>> volume's volfiles in /var/lib/glusterd/vols/apps/bricks. After comparing
>>> those brick-id values with a volfile backup from before the replace-brick,
>>> I realized that the files are simply on the wrong brick now as far as
>>> Gluster is concerned. This particular file is now on the brick for
>>> "apps-client-4". As an experiment I copied this one file to the two
>>> bricks listed in the xattrs and I was then able to see the file from the
>>> FUSE mount (yay!).
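>>>
>>> (In case it saves someone else a few hours: the mapping came from the files
>>> under /var/lib/glusterd/vols/apps/bricks/, where each file corresponds to one
>>> brick and contains a brick-id line, so something like
>>>
>>> # grep brick-id /var/lib/glusterd/vols/apps/bricks/*
>>>
>>> should show which "apps-client-N" id belongs to which brick.)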
>>>
>>> Other than replacing the brick, removing it, and then adding the old
>>> brick back on the original server, there has been no change in the data
>>> this entire time. Can I change the brick IDs in the volfiles so they
>>> reflect where the data actually is? Or perhaps script something to reset
>>> all the xattrs on the files/directories to point to the correct bricks?
>>>
>>> Thank you for any help or pointers,
>>>
>>> On Wed, May 29, 2019 at 7:24 AM Ravishankar N <ravishankar at redhat.com>
>>> wrote:
>>>
>>>>
>>>> On 29/05/19 9:50 AM, Ravishankar N wrote:
>>>>
>>>>
>>>> On 29/05/19 3:59 AM, Alan Orth wrote:
>>>>
>>>> Dear Ravishankar,
>>>>
>>>> I'm not sure if Brick4 had pending AFRs because I don't know what that
>>>> means and it's been a few days so I am not sure I would be able to find
>>>> that information.
>>>>
>>>> When you find some time, have a look at a blog series
>>>> <http://wp.me/peiBB-6b> I wrote about AFR; in it I've tried to explain what
>>>> one needs to know to debug replication-related issues.
>>>>
>>>> Made a typo. The URL for the blog is https://wp.me/peiBB-6b
>>>>
>>>> -Ravi
>>>>
>>>>
>>>> Anyways, after wasting a few days rsyncing the old brick to a new host
>>>> I decided to just try to add the old brick back into the volume instead of
>>>> bringing it up on the new host. I created a new brick directory on the old
>>>> host, moved the old brick's contents into that new directory (minus the
>>>> .glusterfs directory), added the new brick to the volume, and then did
>>>> Vlad's find/stat trick¹ from the brick to the FUSE mount point.
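>>>>
>>>> (Roughly this, with /path/to/new/brick and /mnt/apps standing in for the new
>>>> brick directory and the FUSE mount point:
>>>>
>>>> # cd /path/to/new/brick
>>>> # find . -path ./.glusterfs -prune -o -print | while read -r p; do stat "/mnt/apps/$p" > /dev/null; done
>>>>
>>>> i.e. stat every path that exists on the brick via the mount, so that gluster
>>>> looks each one up from the client side.)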
>>>>
>>>> The interesting problem I have now is that some files don't appear in
>>>> the FUSE mount's directory listings, but I can actually list them directly
>>>> and even read them. What could cause that?
>>>>
>>>> Not sure; there are too many variables in the hacks that you did to take a
>>>> guess. You can check whether the contents of the .glusterfs folder are in
>>>> order on the new brick (e.g. hardlinks for files and symlinks for
>>>> directories are present, etc.).
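>>>>
>>>> For a quick spot check of a single file on the new brick (path illustrative):
>>>>
>>>> # getfattr -n trusted.gfid -e hex --absolute-names /path/on/brick/to/some/file
>>>>
>>>> The entry .glusterfs/<first 2 hex chars of the gfid>/<next 2 chars>/<gfid
>>>> formatted as a uuid> on the same brick should be a hardlink to the file
>>>> (same inode number in `ls -li`); for a directory it should be a symlink.
>>>>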
>>>> Regards,
>>>> Ravi
>>>>
>>>>
>>>> Thanks,
>>>>
>>>> ¹
>>>> https://lists.gluster.org/pipermail/gluster-users/2018-February/033584.html
>>>>
>>>> On Fri, May 24, 2019 at 4:59 PM Ravishankar N <ravishankar at redhat.com>
>>>> wrote:
>>>>
>>>>>
>>>>> On 23/05/19 2:40 AM, Alan Orth wrote:
>>>>>
>>>>> Dear list,
>>>>>
>>>>> I seem to have gotten into a tricky situation. Today I brought up a
>>>>> shiny new server with new disk arrays and attempted to replace one brick of
>>>>> a replica 2 distribute/replicate volume on an older server using the
>>>>> `replace-brick` command:
>>>>>
>>>>> # gluster volume replace-brick homes wingu0:/mnt/gluster/homes
>>>>> wingu06:/data/glusterfs/sdb/homes commit force
>>>>>
>>>>> The command was successful and I see the new brick in the output of
>>>>> `gluster volume info`. The problem is that Gluster doesn't seem to be
>>>>> migrating the data,
>>>>>
>>>>> `replace-brick` definitely must heal (not migrate) the data. In your
>>>>> case, data must have been healed from Brick-4 to the replaced Brick-3. Are
>>>>> there any errors in the self-heal daemon logs of Brick-4's node? Does
>>>>> Brick-4 have pending AFR xattrs blaming Brick-3? The doc is a bit out of
>>>>> date; the replace-brick command internally does all the setfattr steps that
>>>>> are mentioned there.
>>>>>
>>>>> -Ravi
>>>>>
>>>>>
>>>>> and now the original brick that I replaced is no longer part of the
>>>>> volume (and a few terabytes of data are just sitting on the old brick):
>>>>>
>>>>> # gluster volume info homes | grep -E "Brick[0-9]:"
>>>>> Brick1: wingu4:/mnt/gluster/homes
>>>>> Brick2: wingu3:/mnt/gluster/homes
>>>>> Brick3: wingu06:/data/glusterfs/sdb/homes
>>>>> Brick4: wingu05:/data/glusterfs/sdb/homes
>>>>> Brick5: wingu05:/data/glusterfs/sdc/homes
>>>>> Brick6: wingu06:/data/glusterfs/sdc/homes
>>>>>
>>>>> I see the Gluster docs have a more complicated procedure for replacing
>>>>> bricks that involves getfattr/setfattr¹. How can I tell Gluster about the
>>>>> old brick? I see that I have a backup of the old volfile thanks to yum's
>>>>> rpmsave function if that helps.
>>>>>
>>>>> We are using Gluster 5.6 on CentOS 7. Thank you for any advice you can
>>>>> give.
>>>>>
>>>>> ¹
>>>>> https://docs.gluster.org/en/latest/Administrator%20Guide/Managing%20Volumes/#replace-faulty-brick
>>>>>
>>>>> --
>>>>> Alan Orth
>>>>> alan.orth at gmail.com
>>>>> https://picturingjordan.com
>>>>> https://englishbulgaria.net
>>>>> https://mjanja.ch
>>>>> "In heaven all the interesting people are missing." ―Friedrich
>>>>> Nietzsche
>>>>>
>>>>> _______________________________________________
>>>>> Gluster-users mailing list
>>>>> Gluster-users at gluster.org
>>>>> https://lists.gluster.org/mailman/listinfo/gluster-users
>>>>>
>>>>>
>>>>
>>>> --
>>>> Alan Orth
>>>> alan.orth at gmail.com
>>>> https://picturingjordan.com
>>>> https://englishbulgaria.net
>>>> https://mjanja.ch
>>>> "In heaven all the interesting people are missing." ―Friedrich Nietzsche
>>>>
>>>>
>>>> _______________________________________________
>>>> Gluster-users mailing list
>>>> Gluster-users at gluster.org
>>>> https://lists.gluster.org/mailman/listinfo/gluster-users
>>>>
>>>>
>>>
>>> --
>>> Alan Orth
>>> alan.orth at gmail.com
>>> https://picturingjordan.com
>>> https://englishbulgaria.net
>>> https://mjanja.ch
>>> "In heaven all the interesting people are missing." ―Friedrich Nietzsche
>>>
>>
>>
>> --
>> Alan Orth
>> alan.orth at gmail.com
>> https://picturingjordan.com
>> https://englishbulgaria.net
>> https://mjanja.ch
>> "In heaven all the interesting people are missing." ―Friedrich Nietzsche
>>
>>
>
> --
> Alan Orth
> alan.orth at gmail.com
> https://picturingjordan.com
> https://englishbulgaria.net
> https://mjanja.ch
> "In heaven all the interesting people are missing." ―Friedrich Nietzsche
>
>

-- 
Alan Orth
alan.orth at gmail.com
https://picturingjordan.com
https://englishbulgaria.net
https://mjanja.ch
"In heaven all the interesting people are missing." ―Friedrich Nietzsche