[Gluster-devel] client reconnect

Fri May 25 22:21:45 UTC 2007

I mentioned in a previous email that client reconnection may not be 100%
reliable.  I encountered the problem again in the following scenario: one
of my servers (in a multiserver unify/afr setup) was trying to format a
bad drive, and this knocked out access to all my 3ware disks which were
being exported by
GlusterFS from that machine.  While in this condition, a couple of clients 
tried to ls directories on a filesystem that uses this server (and its 
mirror).  I suspect they were able to contact the glusterfsd of the "bad" 
machine, but glusterfsd deadlocked trying to access the disk.  I ended up 
rebooting the server, but the clients that were trying to ls never 
returned and had to be killed.  The mountpoints had to be unmounted and 
the filesystem remounted.

It seems to me (you will probably come up with something much better) 
that if the client successfully communicates a request to a server but the 
server doesn't complete the request, the client needs to time out the I/O 
request that it was waiting on and try again.  In the case of afr, it 
should also check to see if the mirror host can satisfy the request, 
instead.
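
To make the idea concrete, here is a rough sketch in Python.  It is not
GlusterFS code and does not reflect how the client or afr translators are
actually written; every name in it (call_with_timeout, read_with_failover,
hung_server, mirror_server) is made up for illustration.  It only shows
the behaviour I have in mind: give each request a deadline, and when a
server accepts the request but never answers, move on to the mirror.

```python
import queue
import threading
import time


class RequestTimeout(Exception):
    """Raised when a request gets no answer within the allowed time."""


def call_with_timeout(fn, timeout):
    """Run fn() in a daemon thread; return its result or raise RequestTimeout."""
    result = queue.Queue(maxsize=1)

    def worker():
        try:
            result.put(("ok", fn()))
        except Exception as exc:  # pass server-side errors back to the caller
            result.put(("err", exc))

    threading.Thread(target=worker, daemon=True).start()
    try:
        status, value = result.get(timeout=timeout)
    except queue.Empty:
        raise RequestTimeout(f"no reply within {timeout}s") from None
    if status == "err":
        raise value
    return value


def read_with_failover(replicas, timeout=1.0):
    """Try each (name, fn) replica in order; skip any that times out or fails."""
    errors = []
    for name, fn in replicas:
        try:
            return name, call_with_timeout(fn, timeout)
        except Exception as exc:
            errors.append((name, exc))
    raise RequestTimeout(f"all replicas failed: {errors}")


# Hypothetical stand-ins for the two servers in my scenario.
def hung_server():
    time.sleep(5)  # accepts the request, then blocks on the bad disk
    return ["stale"]


def mirror_server():
    return ["file_a", "file_b"]


answered_by, listing = read_with_failover(
    [("server1", hung_server), ("mirror", mirror_server)], timeout=0.2
)
```

Note that the sketch simply abandons the stuck call (the worker thread
stays blocked in the background).  A real client would presumably also
have to cancel or clean up the outstanding request, which I have not
tried to show here.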

Thanks,

Brent