[Gluster-users] Instability with large setup and infiniband

Fabricio Cannini fcannini at gmail.com
Thu Jan 6 22:55:45 UTC 2011


Hi all

I've set up glusterfs, version 3.0.5 (Debian squeeze amd64 stock packages),
as follows, with each node acting as both a server and a client:

Client config
http://pastebin.com/6d4BjQwd
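
For reference, the client volfile has roughly this shape: one protocol/client
volume per node over ib-verbs, aggregated on top. The names below are
placeholders and the aggregation translator is only an illustration; the real
config is in the pastebin above.

  volume node01
    type protocol/client
    option transport-type ib-verbs
    option remote-host node01
    option remote-subvolume brick
  end-volume

  # ... one protocol/client volume like the above for each of the 22 nodes ...

  volume scratch
    type cluster/distribute
    # the real subvolumes line lists all 22 client volumes
    subvolumes node01 node02 node03
  end-volume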

Server config
http://pastebin.com/4ZmX9ir1
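
The server volfile is roughly the usual posix / locks / io-threads stack
exported over ib-verbs; again placeholder names, the actual config is in the
pastebin above.

  volume posix
    type storage/posix
    option directory /glstfs
  end-volume

  volume locks
    type features/locks
    subvolumes posix
  end-volume

  volume brick
    type performance/io-threads
    subvolumes locks
  end-volume

  volume server
    type protocol/server
    option transport-type ib-verbs
    option auth.addr.brick.allow *
    subvolumes brick
  end-volume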


Configuration of each node:
2x Intel Xeon 5420 @ 2.5GHz, 16GB DDR2 ECC, 1 SATA2 HD of 750GB, of which
~600GB is a partition (/glstfs) dedicated to gluster. Each node also has 1
Mellanox MT25204 [InfiniHost III Lx] InfiniBand DDR HCA, used by gluster
through the 'verbs' interface.
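
In case it helps for diagnosis, the checks on the verbs side would be
something like this on each node (assuming libibverbs-utils and
infiniband-diags are installed):

  ibv_devinfo    # the MT25204 should show up with its port in PORT_ACTIVE
  ibstat         # State: Active, Rate: 20 for 4X DDR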

This cluster of 22 nodes is used for scientific computing, and glusterfs is 
used to create a scratch area for I/O intensive apps.
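
The volume is mounted as /scratch on the nodes, along these lines (the
volfile path below is indicative):

  mount -t glusterfs /etc/glusterfs/client.vol /scratch

or the equivalent line in /etc/fstab:

  /etc/glusterfs/client.vol  /scratch  glusterfs  defaults  0  0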

And this is one of the problems: *one* I/O intensive job can bring the whole
volume to its knees, with "Transport endpoint not connected" errors and so on,
until it is completely unusable, especially if the job is running in parallel
(through MPI) on more than one node.

The other problem is that gluster has been somewhat unstable, even without
I/O intensive jobs. Out of the blue, a simple 'ls -la /scratch' is answered
with a "Transport endpoint not connected" error. When this happens,
restarting all the servers brings things back to a working state.
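
By "restarting all servers" I mean something along the lines of the loop
below, using the stock Debian init script (node names are placeholders):

  for n in $(seq -w 1 22); do
      ssh node$n /etc/init.d/glusterfs-server restart
  done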

If anybody here using glusterfs with InfiniBand has been through this (or
something like it) and could share your experience, please please please do.

TIA,
Fabricio.


