[Gluster-users] Instability with large setup and infiniband
Fabricio Cannini
fcannini at gmail.com
Thu Jan 6 22:55:45 UTC 2011
Hi all
I've set up glusterfs, version 3.0.5 (Debian squeeze amd64 stock packages),
like this, with each node being both a server and a client:
Client config
http://pastebin.com/6d4BjQwd
Server config
http://pastebin.com/4ZmX9ir1
Configuration of each node:
2x Intel Xeon 5420 2.5GHz, 16GB DDR2 ECC, 1 SATA2 HD of 750GB,
of which ~600GB is a partition (/glstfs) dedicated to gluster. Each node
also has one Mellanox MT25204 [InfiniHost III Lx] InfiniBand DDR HCA, used by
gluster through the 'verbs' interface.
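In case the pastebin links go away, the general shape of the setup is the
usual 3.0-style volfiles with the transport set to ib-verbs: on the client
side, one protocol/client volume per server node, aggregated with
cluster/distribute; on the server side, a storage/posix volume on /glstfs
exported through protocol/server. The fragment below is only an illustrative
sketch of that layout, not a copy of my actual files; the hostnames (node01,
node02) and the volume names ('brick', 'scratch') are placeholders:

# client side: one of these per server node (node01 ... node22)
volume node01
  type protocol/client
  option transport-type ib-verbs
  option remote-host node01
  option remote-subvolume brick
end-volume

volume node02
  type protocol/client
  option transport-type ib-verbs
  option remote-host node02
  option remote-subvolume brick
end-volume

# aggregate the remote bricks into one namespace
# (the real file lists all 22 client volumes here)
volume scratch
  type cluster/distribute
  subvolumes node01 node02
end-volume

# server side: export the /glstfs partition over ib-verbs
volume posix
  type storage/posix
  option directory /glstfs
end-volume

volume brick
  type features/locks
  subvolumes posix
end-volume

volume server
  type protocol/server
  option transport-type ib-verbs
  option auth.addr.brick.allow *
  subvolumes brick
end-volume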
This cluster of 22 nodes is used for scientific computing, and glusterfs is
used to create a scratch area for I/O-intensive apps.
And this is one of the problems: *one* I/O-intensive job can bring the whole
volume to its knees, with "Transport endpoint not connected" errors and so on,
to the point of complete uselessness, especially if the job is running in
parallel (through MPI) on more than one node.
The other problem is that gluster has been somewhat unstable, even without
I/O-intensive jobs. Out of the blue, a simple 'ls -la /scratch' is answered
with a "Transport endpoint not connected" error. When this happens,
restarting all the servers brings things back to a working state.
If anybody here using glusterfs with InfiniBand has been through this (or
something like it) and could share their experience, please please please
do.
TIA,
Fabricio.