[Gluster-users] Does replace-brick migrate data?

Alan Orth alan.orth at gmail.com
Thu May 30 21:50:51 UTC 2019


Dear Ravi,

I spent a bit of time inspecting the xattrs on some files and directories
on a few bricks for this volume and it looks a bit messy. Even if I could
make sense of it for a few and potentially heal them manually, there are
millions of files and directories in total so that's definitely not a
scalable solution. After a few missteps with `replace-brick ... commit
force` in the last week—one of which on a brick that was dead/offline—as
well as some premature `remove-brick` commands, I'm unsure how to
proceed and I'm getting demotivated. It's scary how quickly things get out
of hand in distributed systems...

I had hoped that bringing the old brick back up would help, but by the time
I added it again a few days had passed and all the brick-ids had changed
due to the replace/remove brick commands, not to mention that the
trusted.afr.$volume-client-xx values were now probably pointing to the
wrong bricks (?).

Anyways, a few hours ago I started a full heal on the volume and I see that
there is a sustained 100MiB/sec of network traffic going from the old
brick's host to the new one. The completed heals reported in the logs look
promising too:

Old brick host:

# grep '2019-05-30' /var/log/glusterfs/glustershd.log | grep -o -E
'Completed (data|metadata|entry) selfheal' | sort | uniq -c
 281614 Completed data selfheal
     84 Completed entry selfheal
 299648 Completed metadata selfheal

New brick host:

# grep '2019-05-30' /var/log/glusterfs/glustershd.log | grep -o -E
'Completed (data|metadata|entry) selfheal' | sort | uniq -c
 198256 Completed data selfheal
  16829 Completed entry selfheal
 229664 Completed metadata selfheal

So that's good I guess, though I have no idea how long it will take or if
it will fix the "missing files" issue on the FUSE mount. I've increased
cluster.shd-max-threads to 8 to hopefully speed up the heal process.
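
For the record, the full heal and the thread-count bump were both plain gluster
CLI invocations, roughly the following (volume name left as a placeholder since
I haven't named it above):

# gluster volume heal <volname> full
# gluster volume set <volname> cluster.shd-max-threads 8

The outstanding work can also be queried directly with `gluster volume heal
<volname> info summary` instead of grepping glustershd.log.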

I'd be happy for any advice or pointers,

On Wed, May 29, 2019 at 5:20 PM Alan Orth <alan.orth at gmail.com> wrote:

> Dear Ravi,
>
> Thank you for the link to the blog post series—it is very informative and
> current! If I understand your blog post correctly then I think the answer
> to your previous question about pending AFRs is: no, there are no pending
> AFRs. I have identified one file that is a good test case to try to
> understand what happened after I issued the `gluster volume replace-brick
> ... commit force` a few days ago and then added the same original brick
> back to the volume later. This is the current state of the replica 2
> distribute/replicate volume:
>
> [root at wingu0 ~]# gluster volume info apps
>
> Volume Name: apps
> Type: Distributed-Replicate
> Volume ID: f118d2da-79df-4ee1-919d-53884cd34eda
> Status: Started
> Snapshot Count: 0
> Number of Bricks: 3 x 2 = 6
> Transport-type: tcp
> Bricks:
> Brick1: wingu3:/mnt/gluster/apps
> Brick2: wingu4:/mnt/gluster/apps
> Brick3: wingu05:/data/glusterfs/sdb/apps
> Brick4: wingu06:/data/glusterfs/sdb/apps
> Brick5: wingu0:/mnt/gluster/apps
> Brick6: wingu05:/data/glusterfs/sdc/apps
> Options Reconfigured:
> diagnostics.client-log-level: DEBUG
> storage.health-check-interval: 10
> nfs.disable: on
>
> I checked the xattrs of one file that is missing from the volume's FUSE
> mount (though I can read it if I access its full path explicitly), but is
> present in several of the volume's bricks (some with full size, others
> empty):
>
> [root at wingu0 ~]# getfattr -d -m. -e hex
> /mnt/gluster/apps/clcgenomics/clclicsrv/licenseserver.cfg
>
> getfattr: Removing leading '/' from absolute path names
> # file: mnt/gluster/apps/clcgenomics/clclicsrv/licenseserver.cfg
> security.selinux=0x73797374656d5f753a6f626a6563745f723a756e6c6162656c65645f743a733000
> trusted.afr.apps-client-3=0x000000000000000000000000
> trusted.afr.apps-client-5=0x000000000000000000000000
> trusted.afr.dirty=0x000000000000000000000000
> trusted.bit-rot.version=0x0200000000000000585a396f00046e15
> trusted.gfid=0x878003a2fb5243b6a0d14d2f8b4306bd
>
> [root at wingu05 ~]# getfattr -d -m. -e hex /data/glusterfs/sdb/apps/clcgenomics/clclicsrv/licenseserver.cfg
> getfattr: Removing leading '/' from absolute path names
> # file: data/glusterfs/sdb/apps/clcgenomics/clclicsrv/licenseserver.cfg
> security.selinux=0x73797374656d5f753a6f626a6563745f723a756e6c6162656c65645f743a733000
> trusted.gfid=0x878003a2fb5243b6a0d14d2f8b4306bd
> trusted.gfid2path.82586deefbc539c3=0x34666437323861612d356462392d343836382d616232662d6564393031636566333561392f6c6963656e73657365727665722e636667
> trusted.glusterfs.dht.linkto=0x617070732d7265706c69636174652d3200
>
> [root at wingu05 ~]# getfattr -d -m. -e hex /data/glusterfs/sdc/apps/clcgenomics/clclicsrv/licenseserver.cfg
> getfattr: Removing leading '/' from absolute path names
> # file: data/glusterfs/sdc/apps/clcgenomics/clclicsrv/licenseserver.cfg
> security.selinux=0x73797374656d5f753a6f626a6563745f723a756e6c6162656c65645f743a733000
> trusted.gfid=0x878003a2fb5243b6a0d14d2f8b4306bd
> trusted.gfid2path.82586deefbc539c3=0x34666437323861612d356462392d343836382d616232662d6564393031636566333561392f6c6963656e73657365727665722e636667
>
> [root at wingu06 ~]# getfattr -d -m. -e hex /data/glusterfs/sdb/apps/clcgenomics/clclicsrv/licenseserver.cfg
> getfattr: Removing leading '/' from absolute path names
> # file: data/glusterfs/sdb/apps/clcgenomics/clclicsrv/licenseserver.cfg
> security.selinux=0x73797374656d5f753a6f626a6563745f723a756e6c6162656c65645f743a733000
> trusted.gfid=0x878003a2fb5243b6a0d14d2f8b4306bd
> trusted.gfid2path.82586deefbc539c3=0x34666437323861612d356462392d343836382d616232662d6564393031636566333561392f6c6963656e73657365727665722e636667
> trusted.glusterfs.dht.linkto=0x617070732d7265706c69636174652d3200
>
> According to the trusted.afr.apps-client-xx xattrs this particular file
> should be on the bricks with ids "apps-client-3" and "apps-client-5". It took me
> a few hours to realize that the brick-id values are recorded in the
> volume's volfiles in /var/lib/glusterd/vols/apps/bricks. After comparing
> those brick-id values with a volfile backup from before the replace-brick,
> I realized that the files are simply on the wrong brick now as far as
> Gluster is concerned. This particular file is now on the brick for
> "apps-client-4". As an experiment I copied this one file to the two
> bricks listed in the xattrs and I was then able to see the file from the
> FUSE mount (yay!).
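>
> For anyone retracing this, the brick-id to brick-path mapping can be pulled
> straight out of those per-brick files. Roughly (the exact file names under
> that directory will differ per setup):
>
> # grep -H 'brick-id' /var/lib/glusterd/vols/apps/bricks/*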
>
> Other than replacing the brick, removing it, and then adding the old brick
> on the original server back, there has been no change in the data this
> entire time. Can I change the brick IDs in the volfiles so they reflect
> where the data actually is? Or perhaps script something to reset all the
> xattrs on the files/directories to point to the correct bricks?
>
> Thank you for any help or pointers,
>
> On Wed, May 29, 2019 at 7:24 AM Ravishankar N <ravishankar at redhat.com>
> wrote:
>
>>
>> On 29/05/19 9:50 AM, Ravishankar N wrote:
>>
>>
>> On 29/05/19 3:59 AM, Alan Orth wrote:
>>
>> Dear Ravishankar,
>>
>> I'm not sure if Brick4 had pending AFRs because I don't know what that
>> means, and it's been a few days, so I'm not sure I would be able to find
>> that information.
>>
>> When you find some time, have a look at a blog series <http://wp.me/peiBB-6b>
>> I wrote about AFR; in it I've tried to explain what one needs to know to
>> debug replication-related issues.
>>
>> I made a typo above; the correct URL for the blog is https://wp.me/peiBB-6b
>>
>> -Ravi
>>
>>
>> Anyways, after wasting a few days rsyncing the old brick to a new host I
>> decided to just try to add the old brick back into the volume instead of
>> bringing it up on the new host. I created a new brick directory on the old
>> host, moved the old brick's contents into that new directory (minus the
>> .glusterfs directory), added the new brick to the volume, and then did
>> Vlad's find/stat trick¹ from the brick to the FUSE mount point.
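>>
>> In essence the trick just walks the brick's contents and stats each
>> corresponding path on the FUSE mount to trigger lookups/heals. Roughly, with
>> both paths as placeholders (see the link below for the original recipe):
>>
>> # cd /path/to/old/brick && find . -path ./.glusterfs -prune -o -print | \
>>     while read -r f; do stat "/path/to/fuse/mount/$f" >/dev/null 2>&1; done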
>>
>> The interesting problem I have now is that some files don't appear in the
>> FUSE mount's directory listings, but I can actually list them directly and
>> even read them. What could cause that?
>>
>> Not sure; there are too many variables in the hacks you did to take a guess.
>> You can check whether the contents of the .glusterfs folder are in order on
>> the new brick (for example, that the hard links for files and the symlinks
>> for directories are present, etc.).
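>>
>> For example, for a file whose trusted.gfid is, say,
>> 0x878003a2fb5243b6a0d14d2f8b4306bd, the entry under the brick would be
>> .glusterfs/87/80/878003a2-fb52-43b6-a0d1-4d2f8b4306bd, and for a regular file
>> it should be a hard link to the data file, i.e. a link count of at least 2.
>> Something like this, with the brick path as a placeholder:
>>
>> # stat -c '%h %n' <brick-path>/.glusterfs/87/80/878003a2-fb52-43b6-a0d1-4d2f8b4306bd
>>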
>> Regards,
>> Ravi
>>
>>
>> Thanks,
>>
>> ¹
>> https://lists.gluster.org/pipermail/gluster-users/2018-February/033584.html
>>
>> On Fri, May 24, 2019 at 4:59 PM Ravishankar N <ravishankar at redhat.com>
>> wrote:
>>
>>>
>>> On 23/05/19 2:40 AM, Alan Orth wrote:
>>>
>>> Dear list,
>>>
>>> I seem to have gotten into a tricky situation. Today I brought up a
>>> shiny new server with new disk arrays and attempted to replace one brick of
>>> a replica 2 distribute/replicate volume on an older server using the
>>> `replace-brick` command:
>>>
>>> # gluster volume replace-brick homes wingu0:/mnt/gluster/homes
>>> wingu06:/data/glusterfs/sdb/homes commit force
>>>
>>> The command was successful and I see the new brick in the output of
>>> `gluster volume info`. The problem is that Gluster doesn't seem to be
>>> migrating the data,
>>>
>>> `replace-brick` definitely must heal (not migrate) the data. In your
>>> case, data must have been healed from Brick-4 to the replaced Brick-3. Are
>>> there any errors in the self-heal daemon logs of Brick-4's node? Does
>>> Brick-4 have pending AFR xattrs blaming Brick-3? The doc is a bit out of
>>> date; the replace-brick command internally does all the setfattr steps that
>>> are mentioned in it.
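>>>
>>> A quick way to check the latter is to dump the xattrs on Brick-4's backend
>>> (wingu05 in your case), e.g. on the brick root, and look for non-zero
>>> trusted.afr.homes-client-* values, which would indicate pending heals:
>>>
>>> # getfattr -d -m. -e hex /data/glusterfs/sdb/homes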
>>>
>>> -Ravi
>>>
>>>
>>> and now the original brick that I replaced is no longer part of the
>>> volume (and a few terabytes of data are just sitting on the old brick):
>>>
>>> # gluster volume info homes | grep -E "Brick[0-9]:"
>>> Brick1: wingu4:/mnt/gluster/homes
>>> Brick2: wingu3:/mnt/gluster/homes
>>> Brick3: wingu06:/data/glusterfs/sdb/homes
>>> Brick4: wingu05:/data/glusterfs/sdb/homes
>>> Brick5: wingu05:/data/glusterfs/sdc/homes
>>> Brick6: wingu06:/data/glusterfs/sdc/homes
>>>
>>> I see the Gluster docs have a more complicated procedure for replacing
>>> bricks that involves getfattr/setfattr¹. How can I tell Gluster about the
>>> old brick? I see that I have a backup of the old volfile thanks to yum's
>>> rpmsave function if that helps.
>>>
>>> We are using Gluster 5.6 on CentOS 7. Thank you for any advice you can
>>> give.
>>>
>>> ¹
>>> https://docs.gluster.org/en/latest/Administrator%20Guide/Managing%20Volumes/#replace-faulty-brick
>>>
>>> --
>>> Alan Orth
>>> alan.orth at gmail.com
>>> https://picturingjordan.com
>>> https://englishbulgaria.net
>>> https://mjanja.ch
>>> "In heaven all the interesting people are missing." ―Friedrich Nietzsche
>>>
>>> _______________________________________________
>>> Gluster-users mailing list
>>> Gluster-users at gluster.org
>>> https://lists.gluster.org/mailman/listinfo/gluster-users
>>>
>>>
>>
>> --
>> Alan Orth
>> alan.orth at gmail.com
>> https://picturingjordan.com
>> https://englishbulgaria.net
>> https://mjanja.ch
>> "In heaven all the interesting people are missing." ―Friedrich Nietzsche
>>
>>
>> _______________________________________________
>> Gluster-users mailing list
>> Gluster-users at gluster.org
>> https://lists.gluster.org/mailman/listinfo/gluster-users
>>
>>
>
> --
> Alan Orth
> alan.orth at gmail.com
> https://picturingjordan.com
> https://englishbulgaria.net
> https://mjanja.ch
> "In heaven all the interesting people are missing." ―Friedrich Nietzsche
>


-- 
Alan Orth
alan.orth at gmail.com
https://picturingjordan.com
https://englishbulgaria.net
https://mjanja.ch
"In heaven all the interesting people are missing." ―Friedrich Nietzsche