[Gluster-users] Unexpected behaviour during replication heal
Marco Agostini
comunelevico at gmail.com
Tue Jun 28 20:23:59 UTC 2011
2011/6/28 Darren Austin <darren-lists at widgit.com>:
>
> Also, when one of the servers disconnects, is it normal that the client "stalls" the write until the keepalive time expires and the online servers notice one has vanished?
>
You can reduce how long the client stalls by lowering the
network.ping-timeout parameter from its default of 42 seconds to 5 or
10 seconds.
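For example (just a sketch; I'm assuming the volume name data-volume
from your example below, adjust it to your own setup), run on one of
the servers:

  gluster volume set data-volume network.ping-timeout 10

Clients should pick up the new value automatically once the volume
configuration changes.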
> Finally, during my testing I encountered a replicable hard lock up of the client... here's the situation:
> Server1 and Server2 in the cluster, sharing 'data-volume' (which is /data on both servers).
> Client mounts server1:data-volume as /mnt.
> Client begins to write a large (1 or 2 GB) file to /mnt (I just used random data).
> Server1 goes down part way through the write (I simulated this by iptables -j DROP'ing everything from relevant IPs).
> Client "stalls" writes until the keepalive timeout, and then continues to send data to Server2.
> Server1 comes back online shortly after the keepalive timeout - but BEFORE the Client has written all the data to Server2.
> Server1 and Server2 reconnect and the writes on the Client completely hang.
>
I have a similar problem with a file that I use as a virtual disk for
a KVM guest.
> The mounted directory on the client becomes completely in-accessible when the two servers reconnect.
>
Actually, that is normal :-|
> I had to kill -9 the dd process doing the write (along with the glusterfs process on the client) in order to release the mountpoint.
>
If you don't kill the process and instead wait until all nodes are
synchronised, the whole system should become usable again.
To force a synchronisation of the whole volume you can run this
command on the client:
find <gluster-mount> -noleaf -print0 | xargs --null stat >/dev/null
... and wait
http://www.gluster.com/community/documentation/index.php/Gluster_3.2:_Triggering_Self-Heal_on_Replicate
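For example, with the volume mounted at /mnt as in your test (adjust
the path if yours differs):

  find /mnt -noleaf -print0 | xargs --null stat >/dev/null

Stat'ing every file forces the replicate translator to check, and if
necessary heal, each one.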
Craig Carl told me, three days ago:
------------------------------------------------------
That happens because Gluster's self heal is a blocking operation. We
are working on a non-blocking self heal; we are hoping to ship it in
early September.
------------------------------------------------------
You can verify this directly in your client log, where you will see a
line like:
[2011-06-28 13:28:17.484646] I
[client-lk.c:617:decrement_reopen_fd_count] 0-data-volume-client-0:
last fd open'd/lock-self-heal'd - notifying CHILD-UP
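For example, on a default install the client log usually lives under
/var/log/glusterfs/ and is named after the mount point (that path is
an assumption, check your setup):

  grep -i 'self-heal' /var/log/glusterfs/mnt.log

should show the self-heal messages, including the CHILD-UP
notification above.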
Marco