[Gluster-devel] GlusterFS hangs/fails: Transport endpoint is not connected

Tue Nov 25 11:27:48 UTC 2008

Hi devels!

We consider GlusterFS as parallel file server (8 server nodes) for our  
parallel Opteron cluster (88 nodes, ~500 cores), as well as for a  
unified nufa /scratch distributed over all nodes. We use the cluster  
within a scientific environment (theoretical physics) and use  
Scientific Linux with kernel 2.6.25.16. After similar problems with  
1.3.x we installed 1.4.0qa61 and set up a /scratch for testing using  
the following script "glusterconf.sh" which runs local on all nodes on  
startup and writes the two config files /usr/local/etc/glusterfs- 
{server,client}.vol:

---------------------------------- 8< snip >8  
----------------------------------
#!/bin/sh

HOST=$(hostname -s)

if [ $HOST = master ];then
     MASTER_IP=127.0.0.1
     HOST_IP=127.0.0.1
     HOST_N=0
else
     MASTER_IP=192.168.1.254
     HOST_IP=$(hostname -i)
     HOST_N=${HOST_IP##*.}
fi

LOCAL=sc$HOST_N

###################################################################
# write /usr/local/etc/glusterfs-server.vol
{

cat <<EOF
###
### Server config automatically created by $PWD/$0
###

EOF

if [ $HOST = master ];then
     SERVERVOLUMES="scns"
     cat <<EOF
volume scns
   type storage/posix
   option directory /export/scratch_ns
end-volume

EOF
else # if master
     SERVERVOLUMES=""
fi   # if master

SERVERVOLUMES="$SERVERVOLUMES $LOCAL"
cat <<EOF
volume $LOCAL-posix
   type storage/posix
   option directory /export/scratch
end-volume

volume $LOCAL-locks
   type features/posix-locks
   subvolumes $LOCAL-posix
end-volume

volume $LOCAL-ioth
   type performance/io-threads
   option thread-count 4
   subvolumes $LOCAL-locks
end-volume

volume $LOCAL
   type performance/read-ahead
   subvolumes $LOCAL-ioth
end-volume

volume server
   type protocol/server
   option transport-type tcp/server
   subvolumes $SERVERVOLUMES
EOF

for vol in $SERVERVOLUMES;do
     cat <<EOF
   option auth.addr.$vol.allow 127.0.0.1,192.168.1.*
EOF
done

cat <<EOF
end-volume

EOF

} > /usr/local/etc/glusterfs-server.vol

###################################################################
# write /usr/local/etc/glusterfs-client.vol
{
cat <<EOF
###
### Client config automatically created by $PWD/$0
###

volume scns
   type protocol/client
   option transport-type tcp/client
   option remote-host $MASTER_IP
   option remote-subvolume scns
end-volume

volume sc0
   type protocol/client
   option transport-type tcp/client
   option remote-host $MASTER_IP
   option remote-subvolume sc0
end-volume

EOF

UNIFY="sc0"

# leave out node66 at the moment...

for n in $(seq 65) $(seq 67 87);do
     VOL=sc$n
     UNIFY="$UNIFY $VOL"
	cat <<EOF
volume $VOL
   type protocol/client
   option transport-type tcp/client
   option remote-host 192.168.1.$n
   option remote-subvolume $VOL
end-volume

EOF
done

cat <<EOF
volume scratch
   type cluster/unify
   subvolumes $UNIFY
   option namespace scns
   option scheduler nufa
   option nufa.limits.min-free-disk 15
   option nufa.refresh-interval 10
   option nufa.local-volume-name $LOCAL
end-volume

volume scratch-io-threads
   type performance/io-threads
   option thread-count 4
   subvolumes scratch
end-volume

volume scratch-write-behind
   type performance/write-behind
   option aggregate-size 128kB
   option flush-behind off
   subvolumes scratch-io-threads
end-volume

volume scratch-read-ahead
   type performance/read-ahead
   option page-size 128kB # unit in bytes
   option page-count 2    # cache per file  = (page-count x page-size)
   subvolumes scratch-write-behind
end-volume

volume scratch-io-cache
   type performance/io-cache
   option cache-size 64MB
   option page-size 512kB
   subvolumes scratch-read-ahead
end-volume

EOF

} > /usr/local/etc/glusterfs-client.vol
---------------------------------- 8< snip >8  
----------------------------------

The cluster uses MPI over Infiniband, while GlusterFS runs over TCP/IP  
Gigabit Ethernet. I use FUSE 2.7.4 with patch fuse-2.7.3glfs10.diff  
(Is that OK? The patch succeeded)

Everything is fine until some nodes which are used by a job block on  
access to /scratch or, sometimes later, give

df: `/scratch': Transport endpoint is not connected

The glusterfs.log on node36 is flooded by

2008-11-25 07:30:35 E [client-protocol.c:243:call_bail] sc70:  
activating bail-out. pending frames = 3. last sent = 2008-11-25  
07:29:52. last received = 2008-11-25 07:29:49. transport-timeout = 42
2008-11-25 07:30:35 C [client-protocol.c:250:call_bail] sc70: bailing  
transport
...(~100MB)

(~2 lines for every node every 10 seconds) Furthermore, I find at the  
end of glusterfs.log:

grep -v call_bail glusterfs.log
...
2008-11-25 10:00:46 E [socket.c:1187:socket_submit] sc0: transport not  
connected to submit (priv->connected = 255)
...
2008-11-25 10:00:46 E [socket.c:1187:socket_submit] sc87: transport  
not connected to submit (priv->connected = 255)
2008-11-25 10:00:46 E [socket.c:1187:socket_submit] scns: transport  
not connected to submit (priv->connected = 255)
2008-11-25 10:05:03 E [fuse-bridge.c:1886:fuse_statfs_cbk] glusterfs- 
fuse: 1353: ERR => -1 (Transport endpoint is not connected)

On node68 I find

2008-11-24 23:20:12 W [client-protocol.c:93:this_ino_set] sc0: inode  
number(201326854) changed for inode(0x6130d0)
2008-11-24 23:20:12 W [client-protocol.c:93:this_ino_set] scns: inode  
number(37749030) changed for inode(0x6130d0)
2008-11-24 23:20:58 E [client-protocol.c:243:call_bail] scns:  
activating bail-out. pending frames = 3. last sent = 2008-11-24  
23:20:12. last received = 2008-11-24 23:20:12. transport-timeout = 42
2008-11-24 23:20:58 C [client-protocol.c:250:call_bail] scns: bailing  
transport
2008-11-24 23:20:58 E [client-protocol.c:243:call_bail] sc0:  
activating bail-out. pending frames = 3. last sent = 2008-11-24  
23:20:12. last received = 2008-11-24 23:20:12. transport-timeout = 42
2008-11-24 23:20:58 C [client-protocol.c:250:call_bail] sc0: bailing  
transport
...(~100MB)

only for scns and sc0 and then

2008-11-25 10:01:31 E [client-protocol.c:243:call_bail] sc1:  
activating bail-out. pending frames = 1. last sent = 2008-11-25  
10:00:46. last received = 2008-11-24 23:20:12. transport-timeout = 42
2008-11-25 10:01:31 C [client-protocol.c:250:call_bail] sc1: bailing  
transport
...(~100MB)

for all nodes, as well as

2008-11-25 10:00:46 E [socket.c:1187:socket_submit] sc0: transport not  
connected to submit (priv->connected = 255)
2008-11-25 10:00:46 E [socket.c:1187:socket_submit] scns: transport  
not connected to submit (priv->connected = 255)
2008-11-25 11:23:18 E [socket.c:1187:socket_submit] sc1: transport not  
connected to submit (priv->connected = 255)
2008-11-25 11:23:18 E [socket.c:1187:socket_submit] sc2: transport not  
connected to submit (priv->connected = 255)
...

The third affected node node77 says:

2008-11-24 22:07:20 W [client-protocol.c:93:this_ino_set] sc0: inode  
number(201326854) changed for inode(0x7f97d6c0ac70)
2008-11-24 22:07:20 W [client-protocol.c:93:this_ino_set] scns: inode  
number(37749030) changed for inode(0x7f97d6c0ac70)
2008-11-24 22:08:07 E [client-protocol.c:243:call_bail] sc10:  
activating bail-out. pending frames = 7. last sent = 2008-11-24  
22:07:24. last received = 2008-11-24 22:07:20. transport-timeout = 42
2008-11-24 22:08:07 C [client-protocol.c:250:call_bail] sc10: bailing  
transport
...(~100MB)

and then

2008-11-25 10:00:46 E [socket.c:1187:socket_submit] sc0: transport not  
connected to submit (priv->connected = 255)
...
2008-11-25 10:00:46 E [socket.c:1187:socket_submit] sc87: transport  
not connected to submit (priv->connected = 255)
2008-11-25 10:00:46 E [socket.c:1187:socket_submit] scns: transport  
not connected to submit (priv->connected = 255)

As I said, similar problems occurred with version 1.3.x. If these  
problems cannot be solved, we have to use a different file system, so  
any help is very appreciated.

Have fun,

      Fred

Dr. Fred Hucht <fred at thp.Uni-DuE.de>
Institute for Theoretical Physics
University of Duisburg-Essen, 47048 Duisburg, Germany