[Gluster-users] [EXTERNAL] Re: New Gluster volume (10.3) not healing symlinks after brick offline

Matt Rubright mrubrigh at uncc.edu
Fri Feb 24 15:57:39 UTC 2023


Hi Eli,

Thanks for the response. I had hoped for a simple fix here, but I think
perhaps there isn't one. I built this as part of a new environment that
will eventually replace a much older system built with Gluster 3.10 (yes -
that old). I appreciate the warning about 10.3 and will run some
comparative load testing against both it and 9.5.

- Matt

On Fri, Feb 24, 2023 at 8:46 AM Eli V <eliventer at gmail.com> wrote:

> I've seen issues with symlinks failing to heal as well. I never found
> a good solution on the glusterfs side of things. The most reliable fix
> I found is just to rm and recreate the symlink on the fuse mount itself.
> Also, I'd strongly suggest heavy load testing before upgrading to 10.3
> in production; after upgrading from 9.5 -> 10.3 I've seen frequent
> brick process crashes (glusterfsd), whereas 9.5 was quite stable.
>
> On Mon, Jan 23, 2023 at 3:58 PM Matt Rubright <mrubrigh at uncc.edu> wrote:
> >
> > Hi friends,
> >
> > I have recently built a new replica 3 arbiter 1 volume on 10.3 servers
> and have been putting it through its paces before getting it ready for
> production use. The volume will ultimately contain about 200G of web
> content files shared among multiple frontends. Each will use the gluster
> fuse client to connect.
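> > (On each frontend that means a native fuse mount, i.e. roughly
> > 'mount -t glusterfs glfs01-172-20-1:/cwsvol01 /var/www', with the web
> > content living under that mount point; the mount path is just an example.)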
> >
> > What I am experiencing sounds very much like this post from 9 years ago:
> https://lists.gnu.org/archive/html/gluster-devel/2013-12/msg00103.html
> >
> > In short, if I perform these steps I can reliably end up with symlinks
> > on the volume which will not heal, either by initiating a 'full heal'
> > from the cluster or by using a fuse client to read each file:
> >
> > 1) Verify that all nodes are healthy, the volume is healthy, and there
> are no items needing to be healed
> > 2) Cleanly shut down one server hosting a brick
> > 3) Copy data, including some symlinks, from a fuse client to the volume
> > 4) Bring the brick back online and observe the number and type of items
> needing to be healed
> > 5) Initiate a full heal from one of the nodes
> > 6) Confirm that while files and directories are healed, symlinks are not
> >
> > Please help me determine if I have improper expectations here. I have
> some basic knowledge of managing gluster volumes, but I may be
> misunderstanding intended behavior.
> >
> > Here is the volume info and heal data at each step of the way:
> >
> > *** Verify that all nodes are healthy, the volume is healthy, and there
> are no items needing to be healed ***
> >
> > # gluster vol info cwsvol01
> >
> > Volume Name: cwsvol01
> > Type: Replicate
> > Volume ID: 7b28e6e6-4a73-41b7-83fe-863a45fd27fc
> > Status: Started
> > Snapshot Count: 0
> > Number of Bricks: 1 x (2 + 1) = 3
> > Transport-type: tcp
> > Bricks:
> > Brick1: glfs02-172-20-1:/data/brick01/cwsvol01
> > Brick2: glfs01-172-20-1:/data/brick01/cwsvol01
> > Brick3: glfsarb01-172-20-1:/data/arb01/cwsvol01 (arbiter)
> > Options Reconfigured:
> > performance.client-io-threads: off
> > nfs.disable: on
> > transport.address-family: inet
> > storage.fips-mode-rchecksum: on
> > cluster.granular-entry-heal: on
> >
> > # gluster vol status
> > Status of volume: cwsvol01
> > Gluster process                                TCP Port  RDMA Port  Online  Pid
> > ------------------------------------------------------------------------------
> > Brick glfs02-172-20-1:/data/brick01/cwsvol01   50253     0          Y       1397
> > Brick glfs01-172-20-1:/data/brick01/cwsvol01   56111     0          Y       1089
> > Brick glfsarb01-172-20-1:/data/arb01/cwsvol01  54517     0          Y       118704
> > Self-heal Daemon on localhost                  N/A       N/A        Y       1413
> > Self-heal Daemon on glfs01-172-20-1            N/A       N/A        Y       3490
> > Self-heal Daemon on glfsarb01-172-20-1         N/A       N/A        Y       118720
> >
> > Task Status of Volume cwsvol01
> > ------------------------------------------------------------------------------
> > There are no active volume tasks
> >
> > # gluster vol heal cwsvol01 info summary
> > Brick glfs02-172-20-1:/data/brick01/cwsvol01
> > Status: Connected
> > Total Number of entries: 0
> > Number of entries in heal pending: 0
> > Number of entries in split-brain: 0
> > Number of entries possibly healing: 0
> >
> > Brick glfs01-172-20-1:/data/brick01/cwsvol01
> > Status: Connected
> > Total Number of entries: 0
> > Number of entries in heal pending: 0
> > Number of entries in split-brain: 0
> > Number of entries possibly healing: 0
> >
> > Brick glfsarb01-172-20-1:/data/arb01/cwsvol01
> > Status: Connected
> > Total Number of entries: 0
> > Number of entries in heal pending: 0
> > Number of entries in split-brain: 0
> > Number of entries possibly healing: 0
> >
> > *** Cleanly shut down one server hosting a brick ***
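> > (i.e. a normal OS shutdown of the glfs02 brick host, e.g. 'shutdown -h now'
> > on that node.)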
> >
> > *** Copy data, including some symlinks, from a fuse client to the volume
> ***
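> >
> > (The copy was an ordinary recursive copy that preserves symlinks, e.g.
> > something like 'cp -a /srv/web01-etc /mnt/cwsvol01/' run on the fuse
> > client; the source path and mount point here are only placeholders.)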
> >
> > # gluster vol heal cwsvol01 info summary
> > Brick glfs02-172-20-1:/data/brick01/cwsvol01
> > Status: Transport endpoint is not connected
> > Total Number of entries: -
> > Number of entries in heal pending: -
> > Number of entries in split-brain: -
> > Number of entries possibly healing: -
> >
> > Brick glfs01-172-20-1:/data/brick01/cwsvol01
> > Status: Connected
> > Total Number of entries: 810
> > Number of entries in heal pending: 810
> > Number of entries in split-brain: 0
> > Number of entries possibly healing: 0
> >
> > Brick glfsarb01-172-20-1:/data/arb01/cwsvol01
> > Status: Connected
> > Total Number of entries: 810
> > Number of entries in heal pending: 810
> > Number of entries in split-brain: 0
> > Number of entries possibly healing: 0
> >
> > *** Bring the brick back online and observe the number and type of
> entities needing to be healed ***
> >
> > # gluster vol heal cwsvol01 info summary
> > Brick glfs02-172-20-1:/data/brick01/cwsvol01
> > Status: Connected
> > Total Number of entries: 0
> > Number of entries in heal pending: 0
> > Number of entries in split-brain: 0
> > Number of entries possibly healing: 0
> >
> > Brick glfs01-172-20-1:/data/brick01/cwsvol01
> > Status: Connected
> > Total Number of entries: 769
> > Number of entries in heal pending: 769
> > Number of entries in split-brain: 0
> > Number of entries possibly healing: 0
> >
> > Brick glfsarb01-172-20-1:/data/arb01/cwsvol01
> > Status: Connected
> > Total Number of entries: 769
> > Number of entries in heal pending: 769
> > Number of entries in split-brain: 0
> > Number of entries possibly healing: 0
> >
> > *** Initiate a full heal from one of the nodes ***
> >
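> > (i.e. the full-heal command, run on one of the gluster nodes:)
> >
> > # gluster vol heal cwsvol01 full
> >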
> > # gluster vol heal cwsvol01 info summary
> > Brick glfs02-172-20-1:/data/brick01/cwsvol01
> > Status: Connected
> > Total Number of entries: 0
> > Number of entries in heal pending: 0
> > Number of entries in split-brain: 0
> > Number of entries possibly healing: 0
> >
> > Brick glfs01-172-20-1:/data/brick01/cwsvol01
> > Status: Connected
> > Total Number of entries: 148
> > Number of entries in heal pending: 148
> > Number of entries in split-brain: 0
> > Number of entries possibly healing: 0
> >
> > Brick glfsarb01-172-20-1:/data/arb01/cwsvol01
> > Status: Connected
> > Total Number of entries: 148
> > Number of entries in heal pending: 148
> > Number of entries in split-brain: 0
> > Number of entries possibly healing: 0
> >
> > # gluster vol heal cwsvol01 info
> > Brick glfs02-172-20-1:/data/brick01/cwsvol01
> > Status: Connected
> > Number of entries: 0
> >
> > Brick glfs01-172-20-1:/data/brick01/cwsvol01
> > /web01-etc
> > /web01-etc/nsswitch.conf
> > /web01-etc/swid/swidtags.d
> > /web01-etc/swid/swidtags.d/redhat.com
> > /web01-etc/os-release
> > /web01-etc/system-release
> > < truncated >
> >
> > *** Verify that one brick contains the symlink while the
> previously-offline one does not ***
> >
> > [root at cws-glfs01 ~]# ls -ld /data/brick01/cwsvol01/web01-etc/nsswitch.conf
> > lrwxrwxrwx 2 root root 29 Jan  4 16:00 /data/brick01/cwsvol01/web01-etc/nsswitch.conf -> /etc/authselect/nsswitch.conf
> >
> > [root at cws-glfs02 ~]# ls -ld /data/brick01/cwsvol01/web01-etc/nsswitch.conf
> > ls: cannot access '/data/brick01/cwsvol01/web01-etc/nsswitch.conf': No such file or directory
> >
> > *** Note entries in /var/log/glusterfs/glustershd.log ***
> >
> > [2023-01-23 20:34:40.939904 +0000] W [MSGID: 114031] [client-rpc-fops_v2.c:2457:client4_0_link_cbk] 0-cwsvol01-client-1: remote operation failed. [{source=<gfid:3cade471-8aba-492a-b981-d63330d2e02e>}, {target=(null)}, {errno=116}, {error=Stale file handle}]
> > [2023-01-23 20:34:40.945774 +0000] W [MSGID: 114031] [client-rpc-fops_v2.c:2457:client4_0_link_cbk] 0-cwsvol01-client-1: remote operation failed. [{source=<gfid:35102340-9409-4d88-a391-da43c00644e7>}, {target=(null)}, {errno=116}, {error=Stale file handle}]
> > [2023-01-23 20:34:40.749715 +0000] W [MSGID: 114031] [client-rpc-fops_v2.c:2457:client4_0_link_cbk] 0-cwsvol01-client-1: remote operation failed. [{source=<gfid:874406a9-9478-4b83-9e6a-09e262e4b85d>}, {target=(null)}, {errno=116}, {error=Stale file handle}]