[Bugs] [Bug 1113050] Transient failures immediately after add-brick to a mounted volume

bugzilla at redhat.com bugzilla at redhat.com
Tue Sep 30 14:12:23 UTC 2014


https://bugzilla.redhat.com/show_bug.cgi?id=1113050

Krutika Dhananjay <kdhananj at redhat.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
              Flags|needinfo?(kdhananj at redhat.com) |



--- Comment #6 from Krutika Dhananjay <kdhananj at redhat.com> ---
(In reply to Niels de Vos from comment #5)
> Krutika, the root cause in your comment #2 is pretty well explained.
> 
> Do I understand correctly that *a* fix would be to have glusterd wait with
> sending the change notification to the mount, until the (new) bricks have
> signed-in at the glusterd-portmapper? (Probably stated much simpler than it
> actually is.)
> 
> Is there an ETA to have this fixed? Can this get included in 3.5.3 that
> hopefully sees a beta during next week?

Hi Niels,

That was one solution that Pranith and I had discussed with Kaushal. We
concluded that this is an intrusive change and the current infrastructure in
glusterd doesn't allow this to be implemented easily.

For example,

consider a 2-node cluster where an add-brick is issued to add a second brick,
hosted on the second node, to a volume, with the first node being the
originator of the operation.

During commit op, glusterd on node-1 first changes its volfile and notifies
clients connected to it.

Now, before node-2 performs its commit op and, as part of it, starts the
glusterfsd process for the newly added brick, client(s) that use node-1 as
their volfile server could query glusterd on node-2 for brick-2's port number
and fail for one of the following reasons:

a) the glusterd on node-2 has launched the glusterfsd associated with brick-2,
but the brick process hasn't performed a portmap sign-in yet, or
b) the glusterd on node-2 hasn't even applied the new volfile for the volume
yet, and as a result does not even recognise that there is a brick-2 in the
volume.
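The two failure modes can be sketched as follows. This is a minimal,
hypothetical simulation: the names `Glusterd`, `query_port`, `commit_op`, and
`pmap_signin` are made up for illustration and are not the actual glusterd
RPCs or internals.

```python
# Hypothetical simulation of the two ways a client's port query against
# node-2's glusterd can fail after node-1 has already notified clients.
# Names are illustrative only, not the real glusterd portmap interface.

class Glusterd:
    def __init__(self):
        self.volfile_bricks = set()   # bricks this glusterd knows about
        self.pmap = {}                # brick -> port, filled on pmap sign-in

    def commit_op(self, brick):
        # Commit op: glusterd updates its volfile and now knows the brick.
        self.volfile_bricks.add(brick)

    def pmap_signin(self, brick, port):
        # The brick process registers its listening port with glusterd.
        self.pmap[brick] = port

    def query_port(self, brick):
        if brick not in self.volfile_bricks:
            # case (b): commit op not done yet; the brick is unknown
            raise LookupError("unknown brick: %s" % brick)
        if brick not in self.pmap:
            # case (a): brick launched, but no portmap sign-in yet
            raise LookupError("brick %s has not signed in yet" % brick)
        return self.pmap[brick]

node2 = Glusterd()

# Case (b): node-1 already notified clients; node-2 has not committed yet.
try:
    node2.query_port("brick-2")
except LookupError as e:
    print("query failed:", e)

# Case (a): commit done, glusterfsd started, but no pmap sign-in yet.
node2.commit_op("brick-2")
try:
    node2.query_port("brick-2")
except LookupError as e:
    print("query failed:", e)

# Only once the brick signs in does the query succeed.
node2.pmap_signin("brick-2", 49153)
print(node2.query_port("brick-2"))
```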

Both of the above cases need to be handled by glusterd. It should also avoid
waiting indefinitely to hear from the bricks before notifying the clients, in
case a brick went down (due to a crash or disconnect) without ever performing
a pmap sign-in.
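The "don't wait indefinitely" requirement essentially means waiting on the
sign-in with a timeout rather than blocking forever. A minimal sketch of that
idea follows; the `BrickSigninWaiter` class is purely hypothetical (glusterd's
real notification path is event-driven C code, not this):

```python
# Sketch: wait for a brick's pmap sign-in with a timeout, so that glusterd
# would not block forever if the brick crashed before signing in.
import threading

class BrickSigninWaiter:
    def __init__(self):
        self._signed_in = threading.Event()
        self.port = None

    def pmap_signin(self, port):
        # Called when the brick process registers its port.
        self.port = port
        self._signed_in.set()

    def wait_for_signin(self, timeout):
        # Returns the port if the brick signed in within `timeout` seconds;
        # returns None (give up and notify clients anyway) otherwise.
        if self._signed_in.wait(timeout):
            return self.port
        return None

waiter = BrickSigninWaiter()
# Simulate the brick signing in shortly after being spawned.
threading.Timer(0.05, waiter.pmap_signin, args=(49153,)).start()
print(waiter.wait_for_signin(timeout=1.0))   # sign-in arrives in time

crashed = BrickSigninWaiter()                # brick never signs in
print(crashed.wait_for_signin(timeout=0.1))  # gives up after the timeout
```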

Getting these two problems fixed in glusterd is not a trivial task and cannot
be delivered as part of 3.5.3. :(

-Krutika

