[Gluster-users] Glusterfs-2 locks/hangs on EC2/vtun setup

Simon Detheridge simon at widgit.com
Wed May 13 14:20:44 UTC 2009


Hi all,

I'm trying to get a glusterfs cluster working inside Amazon's EC2. I'm using the official Ubuntu 8.10 images.

I've compiled glusterfs-2, but I'm using the in-kernel fuse module, as the instances run kernel 2.6.27-3 and the fuse module that ships with glusterfs won't compile against a kernel that recent.

For my test setup I'm trying to get AFR (replicate) working across two storage servers, with a third instance acting as a client. All three instances have the glusterfs volume mounted.

After a few hours, the mounted volume locks up on all three instances. Running ls on a directory on the volume, or typing "df -h", hangs, and the hung processes can't even be killed with 'kill -9'. I have to 'umount --force' the glusterfs volume to get ls or df to terminate.

The images communicate with each other over vtun-based tunnels, which I've set up to give the nodes a predictable IP addressing scheme. (The IPs Amazon assigns are effectively random.)

The logs don't show anything useful. The last thing it tells me about is the handshake that took place a few hours ago.

I disabled the performance translators on the clients but forgot to do so on the servers, so I'm currently running the test again with io-threads disabled on the server side, and with "mount -o log-level=DEBUG" on the client.
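
For reference, the debug mount looks roughly like this (the volfile path and mount point here are illustrative, not my exact ones):

#------------------
# client side, with debug logging -- paths are placeholders
mount -t glusterfs -o log-level=DEBUG /etc/glusterfs/glusterfs-client.vol /mnt/web

# or directly via the glusterfs binary:
glusterfs --log-level=DEBUG -f /etc/glusterfs/glusterfs-client.vol /mnt/web
#------------------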

The volumes are not under heavy load at all when they fail. All that's happening to them is a script running every 30 seconds on the client that isn't also a storage node, which does the following (a rough sketch of the script follows the list):
* Writes a random value to a randomly-named file on the locally mounted volume
* Connects via SSH to one of the storage nodes, and reads the file from the locally mounted volume
* Complains if the contents of the file are different
* Removes the file
* Repeats for the other node
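
Roughly, it looks like this -- the mount point, node names and ssh user are placeholders, not the real values from my setup:

#------------------
#!/bin/bash
# Consistency check sketch: write locally, read the file back through
# each storage node's own mount of the same volume, compare, clean up.
for node in storage1 storage2; do
    f="/mnt/web/check-$RANDOM-$RANDOM"      # randomly-named test file
    val=$RANDOM                             # random value to write
    echo "$val" > "$f"                      # write via the local mount
    remote=$(ssh root@"$node" cat "$f")     # read via the node's mount
    if [ "$remote" != "$val" ]; then
        echo "mismatch on $node: wrote '$val', read back '$remote'" >&2
    fi
    rm -f "$f"                              # remove before the next pass
done
#------------------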

In order to remount the volume after a failure, I have to 'umount --force' and then manually kill the glusterfs process; otherwise the mount just hangs again as soon as I remount it.
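
For the record, recovery currently looks something like this (mount point and volfile path are again illustrative):

#------------------
umount --force /mnt/web              # unblocks the hung ls/df
pkill -9 glusterfs                   # the client process doesn't exit on its own
mount -t glusterfs /etc/glusterfs/glusterfs-client.vol /mnt/web
#------------------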

On each storage node, my glusterfs-client.vol looks like this:

#------------------
volume web_remote_1
  type protocol/client
  option transport-type tcp
  option remote-host 192.168.172.10
  option remote-subvolume web_brick
end-volume

volume web_remote_2
  type protocol/client
  option transport-type tcp
  option remote-host 192.168.172.11
  option remote-subvolume web_brick
end-volume

volume web_replicate
  type cluster/replicate
  subvolumes web_remote_1 web_remote_2
end-volume
#------------------

On the servers, my glusterfs-server.vol looks like this:

#------------------
volume web
  type storage/posix
  option directory /var/glusterfs/web
end-volume

volume web_locks
  type features/locks
  subvolumes web
end-volume

volume web_brick
  type performance/io-threads
  option autoscaling on
  subvolumes web_locks
end-volume

volume web_server
  type protocol/server
  option transport-type tcp/server
  option client-volume-filename /etc/glusterfs/glusterfs-client-web.vol
  subvolumes web_brick
  option auth.addr.web_brick.allow *
end-volume
#------------------
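
(For completeness, the servers are started roughly like this -- the volfile path is illustrative, not necessarily my exact one:)

#------------------
glusterfsd -f /etc/glusterfs/glusterfs-server.vol
#------------------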

Does anyone have any ideas why this happens?

Thanks,
Simon

-- 
Simon Detheridge - CTO, Widgit Software
26 Queen Street, Cubbington, CV32 7NA - Tel: +44 (0)1926 333680
