[Gluster-users] lingering <gfid:*> entries in volume heal, gluster 3.6.3
Kingsley
gluster at gluster.dogwind.com
Fri Jul 15 16:25:45 UTC 2016
On Fri, 2016-07-15 at 21:41 +0530, Ravishankar N wrote:
> On 07/15/2016 09:32 PM, Kingsley wrote:
> > On Fri, 2016-07-15 at 21:06 +0530, Ravishankar N wrote:
> >> On 07/15/2016 08:48 PM, Kingsley wrote:
> >>> I don't have star installed so I used ls,
> >> Oops typo. I meant `stat`.
> >>> but yes they all have 2 links
> >>> to them (see below).
> >>>
> >> Everything seems to be in place for the heal to happen. Can you tailf
> >> the output of shd logs on all nodes and manually launch gluster vol heal
> >> volname?
> >> Use DEBUG log level if you have to and examine the output for clues.
> > I presume I can do that with this command:
> >
> > gluster volume set callrec diagnostics.brick-log-level DEBUG
> shd is a client process, so it is diagnostics.client-log-level. This
> would affect your mounts too.
> >
> > How can I find out what the log level is at the moment, so that I can
> > put it back afterwards?
> INFO. You can also use `gluster volume reset`.
Thanks.
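For anyone reading this in the archive, the sequence I understand Ravi to be
suggesting is roughly the following (our volume is called callrec, and I'm
assuming the default shd log location of /var/log/glusterfs/glustershd.log):

    # raise the log level used by client processes, including the shd
    gluster volume set callrec diagnostics.client-log-level DEBUG

    # watch the shd log on each node
    tail -f /var/log/glusterfs/glustershd.log

    # in another terminal, kick off the heal manually
    gluster volume heal callrec

    # afterwards, put the log level back to its default (INFO)
    gluster volume reset callrec diagnostics.client-log-level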
> >> Also, some dumb things to check: are all the bricks really up and is the
> >> shd connected to them etc.
> > All bricks are definitely up. I just created a file on a client and it
> > appeared in all 4 bricks.
> >
> > I don't know how to tell whether the shd is connected to all of them,
> > though.
> Latest messages like "connected to client-xxx" and "disconnected from
> client-xxx" in the shd logs. Just like in the mount logs.
This has revealed something. I'm now seeing lots of lines like this in
the shd log:
[2016-07-15 16:20:51.098152] D [afr-self-heald.c:516:afr_shd_index_sweep] 0-callrec-replicate-0: got entry: eaa43674-b1a3-4833-a946-de7b7121bb88
[2016-07-15 16:20:51.099346] D [client-rpc-fops.c:1523:client3_3_inodelk_cbk] 0-callrec-client-2: remote operation failed: Stale file handle
[2016-07-15 16:20:51.100683] D [client-rpc-fops.c:2686:client3_3_opendir_cbk] 0-callrec-client-2: remote operation failed: Stale file handle. Path: <gfid:eaa43674-b1a3-4833-a946-de7b7121bb88> (eaa43674-b1a3-4833-a946-de7b7121bb88)
[2016-07-15 16:20:51.101180] D [client-rpc-fops.c:1627:client3_3_entrylk_cbk] 0-callrec-client-2: remote operation failed: Stale file handle
[2016-07-15 16:20:51.101663] D [client-rpc-fops.c:1627:client3_3_entrylk_cbk] 0-callrec-client-2: remote operation failed: Stale file handle
[2016-07-15 16:20:51.102056] D [client-rpc-fops.c:1627:client3_3_entrylk_cbk] 0-callrec-client-2: remote operation failed: Stale file handle
These lines continued to be written to the log even after I manually
launched the self heal (which it told me had been launched
successfully). I also tried repeating that command on one of the bricks
that was giving those messages, but that made no difference.
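(For completeness, the lingering <gfid:...> entries from the subject line are
what keeps showing up in

    gluster volume heal callrec info

even after launching the heal. I gather that `gluster volume heal callrec full`
does a full crawl of the bricks rather than just the index, if that turns out
to be relevant.)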
Client 2 would correspond to the brick that had been offline, so how do I
get the shd to reconnect to it? I did a ps but couldn't see any process
with glustershd in its name, otherwise I'd have tried sending it a HUP.
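As far as I can tell, the per-node "Self-heal Daemon" rows of

    gluster volume status callrec

should show whether the daemon is online and what its PID is, and I gather that

    gluster volume start callrec force

will respawn it if it has died, but I'd still like to understand why it lost
the connection to that brick.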
Cheers,
Kingsley.