[Bugs] [Bug 1545048] [brick-mux] process termination race while killing glusterfsd on last brick detach

Wed Feb 14 15:14:09 UTC 2018

https://bugzilla.redhat.com/show_bug.cgi?id=1545048

Jeff Darcy <jeff at pl.atyp.us> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |jeff at pl.atyp.us

--- Comment #2 from Jeff Darcy <jeff at pl.atyp.us> ---
The comment probably has nothing to do with this particular problem. The detach
code is unsafe because failing to perform all cleanup actions could lead to
memory/resource leaks or perhaps even crashes in the brick daemon. Those issues
are already being addressed in a patch by Mohit, so perhaps you should talk to
him.  https://review.gluster.org/#/c/19537/

*This* problem, on the other hand (like about a hundred similar problems that
were fixed in multiplexing before anyone but me even saw it), mostly has
to do with how glusterd interacts with brick processes. If we're trying to stop
a brick, death of the process in which it lives is a perfectly adequate
substitute for getting an RPC reply back, and the glusterd notify code should
recognize that. That's how things worked before multiplexing, when there was no
detach RPC and thus no response to wait for. IIRC, we just sent a SIGTERM to the
brick process and didn't wait for anything (or clean up anything like the
UNIX-domain socket). That was good enough then, so waiting for *any* signal
that the brick is gone is an improvement and should be good enough now.
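
To illustrate the idea, here is a rough sketch (in Python for brevity, not
glusterd's C, and not the actual notify code) of a stop request that completes
on whichever signal arrives first. All names here (BrickEvent, BrickStop,
notify, run) are hypothetical and exist only for this example; the assumption
is simply that a detach reply and a socket disconnect both reach the same
handler.

```python
# Hypothetical sketch: none of these names come from glusterd itself.
from enum import Enum, auto


class BrickEvent(Enum):
    DETACH_REPLY = auto()  # the brick answered the detach RPC
    DISCONNECT = auto()    # the socket closed, e.g. because the process died


class BrickStop:
    """Tracks one pending stop request for a single brick."""

    def __init__(self, name):
        self.name = name
        self.stopped = False

    def notify(self, event):
        """Return True only for the event that completes the stop."""
        if self.stopped:
            # A late reply or a late disconnect is harmless; ignore it.
            return False
        if event in (BrickEvent.DETACH_REPLY, BrickEvent.DISCONNECT):
            # Either event is adequate evidence that the brick is gone.
            self.stopped = True
            return True
        return False


def run(events):
    """Feed events to a fresh stop request; report (stopped, completions)."""
    stop = BrickStop("brick0")
    completions = sum(stop.notify(e) for e in events)
    return stop.stopped, completions
```

The point of the sketch is that the outcome is the same whether the reply
arrives first, the disconnect arrives first, or only one of them ever arrives,
so nothing depends on the order of events.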

If you really wanted to, you could have the brick wait for glusterd to close
its end of the socket first before going away. Somehow, though, what I suspect
would happen is that sooner or later we'd find a case where the brick daemon
and glusterd are both sitting around waiting forever for the other to close
first. Job security, I guess, but not good for our users. The way to build a
robust distributed or multi-process system is to *reduce* dependencies on
ordering of events.
