2011/6/28 Darren Austin <darren-lists at widgit.com>:
> Also, when one of the servers disconnects, is it normal that the client "stalls" the write until the keepalive time expires and the online servers notice one has vanished?
You can lower the network.ping-timeout option from its default of
42 seconds to 5 or 10 seconds to shorten the client "stall" time.
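
For reference, the option is set per volume with "gluster volume set". A minimal sketch, assuming the volume is called data-volume as in your setup; the wrapper function and the GLUSTER variable are hypothetical conveniences of mine, not part of Gluster:

```shell
# Hypothetical wrapper around "gluster volume set".
# GLUSTER can be overridden (e.g. GLUSTER=echo) to preview the
# command instead of running it.
set_ping_timeout() {
    volume="$1"
    seconds="$2"
    # Reject empty input, non-digit characters, and a literal 0.
    case "$seconds" in
        ''|*[!0-9]*|0)
            echo "timeout must be a positive integer" >&2
            return 1
            ;;
    esac
    "${GLUSTER:-gluster}" volume set "$volume" network.ping-timeout "$seconds"
}
```

Calling "set_ping_timeout data-volume 10" on one of the servers should then apply a 10 second timeout. Be careful with very small values: a server that is only slow for a moment may then be treated as down.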

> Finally, during my testing I encountered a replicable hard lock up of the client... here's the situation:
>  Server1 and Server2 in the cluster, sharing 'data-volume' (which is /data on both servers).
>  Client mounts server1:data-volume as /mnt.
>  Client begins to write a large (1 or 2 GB) file to /mnt  (I just used random data).
>  Server1 goes down part way through the write (I simulated this by iptables -j DROP'ing everything from relevant IPs).
>  Client "stalls" writes until the keepalive timeout, and then continues to send data to Server2.
>  Server1 comes back online shortly after the keepalive timeout - but BEFORE the Client has written all the data to Server2.
>  Server1 and Server2 reconnect and the writes on the Client completely hang.

I have a similar problem with a file that I use as a virtual disk
for KVM.

> The mounted directory on the client becomes completely in-accessible when the two servers reconnect.
Actually, that is normal :-|

> I had to kill -9 the dd process doing the write (along with the glusterfs process on the client) in order to release the mountpoint.
If you don't kill the process and instead wait until all nodes are
synchronized, the whole system should become responsive again.

To force a synchronization of the whole volume, you can run this
command on the client:
find <gluster-mount> -noleaf -print0 | xargs --null stat >/dev/null

... and wait


Craig Carl told me three days ago:
 That happens because Gluster's self heal is a blocking operation. We
are working on a non-blocking self heal, we are hoping to ship it in
early September.

You can verify that directly in your client log, where you should see
a line like this:
[2011-06-28 13:28:17.484646] I [client-lk.c:617:decrement_reopen_fd_count] 0-data-volume-client-0: last fd open'd/lock-self-heal'd - notifying CHILD-UP
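
If you would rather not read the log by eye, the check can be scripted. A rough sketch, assuming each log entry occupies a single line in the log file; the function name is my own invention, and you have to supply the path to your client log yourself since it depends on the installation:

```shell
# Hypothetical helper: succeeds if the client log contains a
# "notifying CHILD-UP" entry for the given client translator.
heal_notified() {
    log_file="$1"      # path to the client log (depends on your install)
    translator="$2"    # e.g. 0-data-volume-client-0
    # First keep only the entries of this translator, then look for
    # the CHILD-UP notification among them.
    grep -F -- "$translator: " "$log_file" | grep -q -F "notifying CHILD-UP"
}
```

For example, "heal_notified <client-log> 0-data-volume-client-0 && echo done" prints "done" once the message shown above has been logged.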


