[Gluster-devel] GlusterFS hangs/fails: Transport endpoint is not connected

Tue Nov 25 14:27:46 UTC 2008

Fred Hucht wrote:
> Hi,
> 
> crawling through all /var/log/messages, I found on one of the failing 
> nodes (node68)

Does your setup use local disk?  Is it possible that the backing store 
is failing?

If you run

	mcelog > /tmp/mce.log 2>&1

on the failing node, do you get any output in /tmp/mce.log ?
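
Since you are already going through /var/log/messages, the same log can be searched for block-device I/O errors, which would point at the backing store.  A minimal sketch, assuming a syslog-style file; count_io_errors is a hypothetical helper name, and the pattern is my guess at typical wording that may need adjusting for your kernel and drivers:

```shell
#!/bin/sh
# Count log lines that look like block-device I/O errors.
# count_io_errors is a hypothetical helper; the search pattern is an
# assumption and may not cover every driver's message wording.
count_io_errors() {
    # $1: path to a syslog-style file, e.g. /var/log/messages
    # grep -c prints the number of matching lines (0 if none match).
    grep -c -i -E 'I/O error|medium error' "$1"
}
```

Running "count_io_errors /var/log/messages" on the failing node and on a healthy node gives a quick comparison; a nonzero count on only the failing nodes would be worth following up.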

My current thoughts, in no particular order:

Hardware based (failures concentrated on a few specific nodes, and 
repeatable only on those nodes):

a) failing local hard drive:  a failing backing store *could* impact the 
file system.  You would see this as NFS working against a remote FS while 
access fails on an FS that stores part of its data locally.

b) network issue:  possibly a bad driver/flaky port/overloaded switch 
backplane.  This is IMO less likely, as NFS works.  Could you post 
output of "ifconfig" so we can look for error indicators in the port state?
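
On Linux, the error and drop counters that ifconfig prints are also available in /proc/net/dev, so the check in b) can be scripted across nodes.  A rough sketch, assuming the usual /proc/net/dev layout (two header lines, then one line per interface); report_if_errors is a hypothetical helper name:

```shell
#!/bin/sh
# Print interfaces whose RX/TX error or drop counters are nonzero.
# report_if_errors is a hypothetical helper; it reads /proc/net/dev-style
# text on stdin and assumes the standard column order.
report_if_errors() {
    awk 'NR > 2 {
        sub(/:/, " ")   # turn "eth0:123" into "eth0 123" so fields line up
        # After the split: $1 = interface, $2..$9 = RX columns, $10..$17 = TX columns.
        rx_errs = $4; rx_drop = $5; tx_errs = $12; tx_drop = $13
        if (rx_errs + rx_drop + tx_errs + tx_drop > 0)
            printf "%s rx_errs=%s rx_drop=%s tx_errs=%s tx_drop=%s\n", \
                $1, rx_errs, rx_drop, tx_errs, tx_drop
    }'
}
```

Usage would be "report_if_errors < /proc/net/dev" on each node.  No output means all four counters are zero on every interface; counters that keep climbing on the failing nodes only would make the network explanation more likely.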

Software based:

c) fuse bugs:  I have run into a few in the past, and they have caused 
errors like this.  But umount/mount rarely fixes a hung fuse process, so 
this is, again, IMO, less likely.

d) GlusterFS bugs:  I think the devels would recognize it if it were 
one.  I doubt this at this moment.

e) kernel bug:  We are using 2.6.27.5 right now, and are about to update 
to .7 due to some CERT advisories.  We have had (stability) issues with 
kernels from 2.6.24 to 2.6.26.x (low values of x) under intense loads.  It 
wouldn't surprise me if what you are observing is actually just a 
symptom of a real problem somewhere else in the kernel.  That the state 
gets resolved when you umount/mount suggests that this could be the case.

Joe

-- 
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: landman at scalableinformatics.com
web  : http://www.scalableinformatics.com
        http://jackrabbit.scalableinformatics.com
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615