[Gluster-devel] gNFS service management from glusterd

Fri Feb 23 23:36:17 UTC 2018

On Fri, Feb 23, 2018 at 1:04 PM, Niels de Vos <ndevos at redhat.com> wrote:

> On Wed, Feb 21, 2018 at 08:25:21PM +0530, Atin Mukherjee wrote:
> > On Wed, Feb 21, 2018 at 4:24 PM, Xavi Hernandez <jahernan at redhat.com> wrote:
> >
> > > Hi all,
> > >
> > > currently glusterd sends a SIGKILL to stop gNFS, while all other
> > > services are stopped with a SIGTERM signal first (this can be seen in
> > > glusterd_svc_stop() function of mgmt/glusterd xlator).
> > >
> >
> > > The question is why it cannot be stopped with SIGTERM as all other
> > > services. Using SIGKILL blindly while write I/O is happening can cause
> > > multiple inconsistencies at the same time. For a replicated volume
> > > this is not a problem because it will take one of the replicas as the
> > > "good" one and continue, but for a disperse volume, if the number of
> > > inconsistencies is bigger than the redundancy value, a serious problem
> > > could appear.
> > >
> > > The probability of this is very small (I've tried to reproduce this
> > > problem on my laptop but I've been unable), but it exists.
> > >
> > > Is there any known issue that prevents gNFS to be stopped with a
> > > SIGTERM ? or can it be changed safely ?
> > >
> >
> > I firmly believe that we need to send SIGTERM as that's the right way to
> > gracefully shutdown a running process but what I'd request from NFS folks
> > to confirm if there's any background on why it was done with SIGKILL.
>
> No background about this is known to me. I had a quick look through the
> git logs, but could not find an explanation.
>
> I agree that SIGTERM would be more appropriate.
>
>

I think there were two reasons for replacing SIGTERM with SIGKILL in gNFS:

1.  To avoid races in the graceful shutdown path that would affect the
restart of the gNFS process.

2.  Graceful shutdown of gNFS might have caused clients to return errors to
applications.

Improvements made to the graceful shutdown of GlusterFS may already have
addressed (1). I am not certain whether (2) was ever a real issue or whether
it still is one. If we attempt to replace SIGKILL with SIGTERM, it would be
worth testing both scenarios carefully.
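
For testing, the "SIGTERM first, SIGKILL only as a fallback" behaviour can be
sketched roughly as below. This is only an illustration in Python, not the
glusterd code: stop_gracefully() and grace_seconds are hypothetical names, and
the sketch assumes a child process handle rather than the pidfile that
glusterd works with.

```python
import signal
import subprocess
import sys


def stop_gracefully(proc, grace_seconds=5.0):
    """Send SIGTERM, wait up to grace_seconds, then fall back to SIGKILL.

    Returns "SIGTERM" if the process exited within the grace period,
    or "SIGKILL" if it had to be killed forcibly.
    """
    proc.send_signal(signal.SIGTERM)
    try:
        proc.wait(timeout=grace_seconds)
        return "SIGTERM"
    except subprocess.TimeoutExpired:
        # The process did not exit in time, so force it down.
        proc.kill()  # sends SIGKILL on POSIX systems
        proc.wait()  # reap the child so it does not stay a zombie
        return "SIGKILL"


if __name__ == "__main__":
    # Stand-in for a service process; it keeps the default SIGTERM handling.
    child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
    print(stop_gracefully(child))
```

The return value makes it easy to log how often the fallback is actually
needed, which would help judge whether the races in (1) still show up.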

I also see references to other SIGKILLs in glusterd and other components:

xlators/mgmt/glusterd/src/glusterd-bitd-svc.c:1
xlators/mgmt/glusterd/src/glusterd-geo-rep.c:3
xlators/mgmt/glusterd/src/glusterd-nfs-svc.c:1
xlators/mgmt/glusterd/src/glusterd-proc-mgmt.c:1
xlators/mgmt/glusterd/src/glusterd-quota.c:1
xlators/mgmt/glusterd/src/glusterd-scrub-svc.c:1
xlators/mgmt/glusterd/src/glusterd-svc-helper.c:1
xlators/mgmt/glusterd/src/glusterd-utils.c:2
xlators/nfs/server/src/nlm4.c:1

It might be worth analyzing why we need these SIGKILLs and documenting the
reasons if they are indeed necessary.
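
To keep such an audit repeatable, a per-file tally like the one above could be
regenerated with a small script. The following is a minimal sketch, assuming
the numbers are per-file counts of the string "SIGKILL" in C sources;
count_token_per_file() is a hypothetical helper, and it counts occurrences,
which may differ from how the list above was produced.

```python
import os


def count_token_per_file(root, token="SIGKILL", suffixes=(".c", ".h")):
    """Return {relative_path: count} for source files under root containing token."""
    counts = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith(suffixes):
                continue  # skip anything that is not a C source or header
            path = os.path.join(dirpath, name)
            with open(path, encoding="utf-8", errors="replace") as handle:
                occurrences = handle.read().count(token)
            if occurrences:
                counts[os.path.relpath(path, root)] = occurrences
    return counts


if __name__ == "__main__":
    # Run from the top of a source checkout; prints "path:count" lines.
    for path, count in sorted(count_token_per_file(".").items()):
        print(f"{path}:{count}")
```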

HTH,
Vijay