[Gluster-devel] Mount hangs because of connection delays

Thu Jul 2 17:24:47 UTC 2015

On 07/02/2015 07:04 PM, Pranith Kumar Karampuri wrote:
> hi,
>     When the glusterfs mount process is coming up, all cluster xlators wait
> for at least one event from all the children before propagating the
> status upwards. Sometimes the client xlator takes up to 2 minutes to
> propagate this event
> (https://bugzilla.redhat.com/show_bug.cgi?id=1054694#c0). Because of
> this, Xavi implemented a timer in ec notify where we treat a child as
> down if it doesn't come up in 10 seconds. A similar patch went up for
> review at http://review.gluster.org/#/c/11113 for afr. Kritika raised an
> interesting point in the review: all cluster xlators need to have
> this logic for the mount to not hang, and the correct place to fix it
> would be the client xlator itself, i.e. add the timer logic in the client
> xlator, which seems like a better approach.

I think it makes sense to handle the change only in relevant cluster 
xlators like AFR/EC because of the notion of high availability 
associated with them. In my limited understanding, protocol-client is 
the originator (?) of the child up/down events. While it looks okay to 
allow cluster xlators to take certain decisions because the 'originator' 
did not respond within a specific time, altering the originator itself 
without giving a chance to the upper xlators to make choices seems 
incorrect to me. Perhaps I'm wrong, but setting an unconditional
10-second timer on protocol/client seems to defeat the purpose of having a
configurable `network.ping-timeout` volume set option.
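
To make the distinction concrete, here is a rough Python model (not GlusterFS
code) of the decision I would leave to a cluster xlator such as AFR or EC: wait
for an event from every child, and only once a wait timeout has elapsed treat
the silent children as down. The names `ChildState`, `settle_children` and
`can_propagate`, and the `timeout` parameter, are made up for illustration and
do not correspond to any existing xlator function or option.

```python
from enum import Enum


class ChildState(Enum):
    UNKNOWN = "unknown"  # no up/down event received from this child yet
    UP = "up"
    DOWN = "down"


def settle_children(states, elapsed, timeout):
    """Return the per-child states after applying the wait timeout.

    Children that already reported keep their state. Children that are
    still UNKNOWN are treated as DOWN once `elapsed` reaches `timeout`;
    before that they stay UNKNOWN. Both values are in seconds.
    """
    if elapsed < timeout:
        return list(states)
    return [ChildState.DOWN if s is ChildState.UNKNOWN else s for s in states]


def can_propagate(states):
    """The cluster xlator propagates status upwards only when no child is UNKNOWN."""
    return all(s is not ChildState.UNKNOWN for s in states)
```

In this model the originator never fabricates an event; the upper xlator
decides how long it is willing to wait, so the timeout can differ per xlator or
be made configurable.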

Just my two cents. :)

> I just want to get input from everyone before we go ahead in that
> direction,
> i.e. on PARENT_UP the client xlator will start a timer, and if no rpc
> notification is received within that timeout, it treats the client xlator
> as down.
>
> Pranith
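
For comparison, the client-side proposal could be modelled roughly as below.
This is again a Python sketch under my own assumptions, not the actual
protocol/client code: `ClientStartupTimer` and its methods are hypothetical,
time is passed in explicitly instead of using a real timer, and the timeout is
a constructor argument to show that it need not be a fixed 10 seconds.

```python
class ClientStartupTimer:
    """Models a timer armed on PARENT_UP that gives up waiting for an RPC notification."""

    def __init__(self, timeout):
        self.timeout = timeout    # seconds; assumed configurable, not hard-coded
        self.started_at = None    # set when PARENT_UP arrives
        self.status = None        # None until decided, then "up" or "down"

    def parent_up(self, now):
        """Arm the timer when PARENT_UP is received."""
        self.started_at = now

    def rpc_notify(self, connected):
        """A real RPC notification always overrides the current status."""
        self.status = "up" if connected else "down"

    def poll(self, now):
        """Report the status, declaring the child down if the timer has expired."""
        expired = (
            self.started_at is not None
            and now - self.started_at >= self.timeout
        )
        if self.status is None and expired:
            self.status = "down"
        return self.status
```

Note that a late connect still flips the status back to "up" in this model, so
upper xlators would have to cope with a down event followed by an up event for
the same child.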