[Gluster-users] GlusterFS replication hangs (deadlock?) when 2 nodes attempts to create/delete the same file at the same time
Jonathan Demers
demers.jonathan at gmail.com
Mon Mar 28 17:35:35 UTC 2011
Hi guys,
We have set up GlusterFS replication (mirror) with 2 nodes (latest version
3.1.3). Each node runs both the server process and the client process. We
have stripped the configuration down to the minimum.
Client configuration (same for both nodes):
volume remote1
  type protocol/client
  option transport-type tcp
  option remote-host glusterfs1
  option remote-subvolume brick
end-volume

volume remote2
  type protocol/client
  option transport-type tcp
  option remote-host glusterfs2
  option remote-subvolume brick
end-volume

volume replicate
  type cluster/replicate
  subvolumes remote1 remote2
end-volume
Server configuration (same for both nodes):
volume storage
  type storage/posix
  option directory /storage
end-volume

volume brick
  type features/locks
  subvolumes storage
end-volume

volume server
  type protocol/server
  option transport-type tcp
  option auth.addr.brick.allow XXX.*
  subvolumes brick
end-volume
We start everything up, with the GlusterFS client mounted on /mnt/gluster on
each node. Replication works fine: we can create a file in /mnt/gluster on one
node and see it appear in /mnt/gluster on the other node. The file also
appears in /storage on both nodes.
However, if we go on *both* nodes and run the following script in
/mnt/gluster:
while true; do touch foo; rm foo; done
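For reference, a bounded variant of the same loop (the iteration cap and the
progress messages are our additions, not part of the original reproduction)
makes it easier to see at exactly which iteration the hang starts:

```shell
# Bounded variant of the reproduction loop. Run from /mnt/gluster on
# both nodes at the same time. Prints progress every 100 iterations so
# the point where the mount hangs is visible in the output.
i=0
while [ "$i" -lt 1000 ]; do
  touch foo
  rm -f foo
  i=$((i + 1))
  # progress marker; the last line printed shows where the hang began
  [ $((i % 100)) -eq 0 ] && echo "iteration $i ok"
done
echo "completed $i iterations"
```

On a healthy local filesystem this runs to completion in a few seconds; on
the affected GlusterFS mount it stalls partway through on both nodes.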
GlusterFS just hangs. Every call on /mnt/gluster then hangs as well, on both
nodes; even "ls -l /mnt/gluster" hangs. However, the backing filesystem is
fine: "ls -l /storage" works and shows the file "foo". "ps" shows that the
script on each node is stuck in "rm foo". We cannot stop the script, even
with Ctrl-C or "kill -9". After 30 minutes (the default frame-timeout),
GlusterFS unlocks, but the cluster is broken after that: replication no
longer works (a file created on one node is not visible on the other).
Manually restarting the GlusterFS servers and clients brings everything back.
We can reproduce this problem very easily with the simple script above.
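As a stopgap (not a fix), the client-side frame-timeout can presumably be
lowered so that stuck calls fail after minutes rather than the 30-minute
default, e.g. in each protocol/client volume of the client volfile (the
300-second value below is only an illustration):

```
volume remote1
  type protocol/client
  option transport-type tcp
  option remote-host glusterfs1
  option remote-subvolume brick
  # assumption: fail outstanding calls after 300 s instead of the 1800 s default
  option frame-timeout 300
end-volume
```

This would only shorten how long the clients stay wedged; the underlying lock
contention between the two replicate clients would still need a real fix.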
GlusterFS looked very promising and we planned to use it in our new HA
architecture, but the fact that a simple sequence of standard commands can
lock up the whole system is a big showstopper for us. Have you seen this
problem before? Is there a way to fix it (through configuration or otherwise)?
Many thanks
Jonathan