[Gluster-users] Glusterfs-2 locks/hangs on EC2/vtun setup
Simon Detheridge
simon at widgit.com
Wed May 13 14:20:44 UTC 2009
Hi all,
I'm trying to get a glusterfs cluster working inside Amazon's EC2, using the official Ubuntu 8.10 images.
I've compiled glusterfs-2, but I'm using the in-kernel fuse module, because the instances run kernel 2.6.27-3 and the fuse module shipped with glusterfs won't compile against anything that recent.
For my test setup I'm trying to get AFR working across two storage servers, with a third server acting only as a client. All three machines have the glusterfs volume mounted.
After a few hours, the mounted volume locks up on all three servers. Listing a directory on the volume, or running "df -h", hangs in a state that can't even be killed with 'kill -9'; I have to 'umount --force' the glusterfs volume to get ls or df to terminate.
The instances communicate with each other over vtun-based tunnels, which I've set up to give the nodes predictable IP addresses (the addresses Amazon assigns are effectively random).
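For reference, each tunnel is just a point-to-point vtund session, roughly along these lines (the session name, password, port and up-script here are simplified placeholders, not my exact config):
#------------------
# /etc/vtund.conf on the node that ends up as 192.168.172.10
options {
  port 5000;
}

web_tun {
  passwd  secret;
  type    tun;
  proto   tcp;
  up {
    # %% is replaced by the tun interface that vtund creates
    ifconfig "%% 192.168.172.10 pointopoint 192.168.172.11 mtu 1450";
  };
}
#------------------
One side runs vtund in server mode ("vtund -s") and the other connects to it with "vtund web_tun <server hostname>"; each side's up script assigns its own end of the link.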
The logs don't show anything useful. The last thing it tells me about is the handshake that took place a few hours ago.
I had disabled the performance translators on the clients but forgot to do so on the server, so I'm currently running the test again with io-threads disabled on the server and with "mount -o log-level=DEBUG" on the client.
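In case it matters, the client mount for the debug run looks roughly like this (the mount point is a placeholder for my real one):
#------------------
mount -t glusterfs -o log-level=DEBUG \
    /etc/glusterfs/glusterfs-client-web.vol /mnt/web
#------------------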
The volumes are not under heavy load at all when they fail. All that's happening is a script running every 30 seconds on the client that isn't a storage node, doing the following (a rough sketch of the script follows the list):
* Writes a random value to a randomly-named file on the locally mounted volume
* Connects via SSH to one of the storage nodes, and reads the file from the locally mounted volume
* Complains if the contents of the file are different
* Removes the file
* Repeats for the other node
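Roughly, the script is the following (the mount point, and the assumption that both storage nodes mount the volume at the same path, are placeholders for my actual setup):
#------------------
#!/bin/bash
# 30-second consistency check, run on the non-storage client.
for node in 192.168.172.10 192.168.172.11; do
  f=/mnt/web/check-$$-$RANDOM
  val=$RANDOM
  echo "$val" > "$f"                        # write via the local glusterfs mount
  remote=$(ssh "$node" cat "$f")            # read it back via that node's own mount
  [ "$remote" = "$val" ] || echo "mismatch on $node: got '$remote', expected '$val'"
  rm -f "$f"                                # remove before checking the next node
done
#------------------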
In order to remount the volume after failure, I have to umount --force, and then manually kill the glusterfs process. Otherwise the connection just hangs again as soon as I remount.
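Concretely, getting the mount back means something like this (mount point again a placeholder):
#------------------
umount --force /mnt/web           # unblocks the hung ls/df
pkill -9 -x glusterfs             # client process has to be killed by hand; -x so glusterfsd isn't touched
mount -t glusterfs /etc/glusterfs/glusterfs-client-web.vol /mnt/web
#------------------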
On each storage node, my glusterfs-client.vol looks like this:
#------------------
volume web_remote_1
  type protocol/client
  option transport-type tcp
  option remote-host 192.168.172.10        # storage node 1 (vtun address)
  option remote-subvolume web_brick
end-volume

volume web_remote_2
  type protocol/client
  option transport-type tcp
  option remote-host 192.168.172.11        # storage node 2 (vtun address)
  option remote-subvolume web_brick
end-volume

volume web_replicate
  type cluster/replicate                   # AFR across the two bricks
  subvolumes web_remote_1 web_remote_2
end-volume
#------------------
On the servers, my glusterfs-server.vol looks like this:
#------------------
volume web
  type storage/posix
  option directory /var/glusterfs/web      # backing store on local disk
end-volume

volume web_locks
  type features/locks
  subvolumes web
end-volume

volume web_brick
  type performance/io-threads              # the translator I'm now testing without
  option autoscaling on
  subvolumes web_locks
end-volume

volume web_server
  type protocol/server
  option transport-type tcp/server
  option client-volume-filename /etc/glusterfs/glusterfs-client-web.vol
  subvolumes web_brick
  option auth.addr.web_brick.allow *
end-volume
#------------------
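For the run with io-threads disabled that I mentioned above, the brick is just the locks volume renamed, roughly like this (so the client volfile doesn't need changing):
#------------------
volume web
  type storage/posix
  option directory /var/glusterfs/web
end-volume

# locks volume exported directly as web_brick, io-threads removed
volume web_brick
  type features/locks
  subvolumes web
end-volume

volume web_server
  type protocol/server
  option transport-type tcp/server
  option client-volume-filename /etc/glusterfs/glusterfs-client-web.vol
  subvolumes web_brick
  option auth.addr.web_brick.allow *
end-volume
#------------------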
Does anyone have any ideas why this happens?
Thanks,
Simon
--
Simon Detheridge - CTO, Widgit Software
26 Queen Street, Cubbington, CV32 7NA - Tel: +44 (0)1926 333680