[Gluster-devel] GlusterFS hangs/fails: Transport endpoint is not connected
Joe Landman
landman at scalableinformatics.com
Tue Nov 25 14:27:46 UTC 2008
Fred Hucht wrote:
> Hi,
>
> crawling through all /var/log/messages, I found on one of the failing
> nodes (node68)
Does your setup use local disk? Is it possible that the backing store
is failing?
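One quick way to look for a failing backing store is to scan the syslog for block-device errors. The following is only a sketch: the function name scan_disk_errors is made up for illustration, and the patterns are my assumption about what kernel disk errors typically look like, so adjust them for your drivers.

```shell
# Print lines from a syslog file that look like block-device I/O errors.
# $1: log file to scan (defaults to /var/log/messages)
# The patterns are illustrative assumptions, not an exhaustive list.
scan_disk_errors() {
    log="${1:-/var/log/messages}"
    grep -i -E 'I/O error|medium error' "$log"
}
```

Running scan_disk_errors /var/log/messages on node68 and on a healthy node, then comparing the two, should show whether the disk is a suspect. No matches does not prove the drive is healthy.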
If you run

    mcelog > /tmp/mce.log 2>&1

on the failing node, do you get any output in /tmp/mce.log?
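If you want to script that check across several nodes, something like the following could work. It is a minimal sketch: summarize_mce_log is a hypothetical helper name, and it only reports whether the captured file has any content.

```shell
# Report whether a captured mcelog output file has any content.
# $1: path to the captured file, e.g. /tmp/mce.log
# Returns 0 if the file is empty, 1 if it has content, 2 if it is missing.
summarize_mce_log() {
    log="$1"
    if [ ! -e "$log" ]; then
        echo "missing: $log"
        return 2
    fi
    if [ -s "$log" ]; then
        echo "non-empty: $log"
        return 1
    fi
    echo "empty: $log"
    return 0
}
```

Since stderr is redirected into the same file, a non-empty result may also be an error message from mcelog itself (for example, if it was not run as root), so read the file before drawing conclusions.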
My current thoughts, in no particular order, are:
Hardware based: failures always concentrated on a few specific nodes
(always repeatable only on those nodes)
a) failing local hard drive: a failing backing store *could* impact the
file system, and you would see this as NFS working on a remote FS while
an FS that stores part of its data locally fails.
b) network issue: possibly a bad driver/flaky port/overloaded switch
backplane. This is IMO less likely, as NFS works. Could you post
output of "ifconfig" so we can look for error indicators in the port state?
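As an alternative to eyeballing the "ifconfig" output, the same counters can be read from sysfs on Linux. This is a sketch under assumptions: nonzero_net_errors is a hypothetical helper name, and the list of counters is my choice of the usual error indicators.

```shell
# Print "iface counter value" for every non-zero error or drop counter.
# $1: sysfs network directory (defaults to /sys/class/net)
nonzero_net_errors() {
    base="${1:-/sys/class/net}"
    for dir in "$base"/*/statistics; do
        [ -d "$dir" ] || continue
        iface=$(basename "$(dirname "$dir")")
        for counter in rx_errors tx_errors rx_dropped tx_dropped collisions; do
            file="$dir/$counter"
            [ -r "$file" ] || continue
            value=$(cat "$file")
            # Skip values that are zero or not numeric.
            if [ "$value" -gt 0 ] 2>/dev/null; then
                echo "$iface $counter $value"
            fi
        done
    done
}
```

Calling nonzero_net_errors with no argument on a failing node and on a good node gives a quick comparison. A few drops are normal under load; counters that keep climbing on the failing nodes only would point at the port, cable, or driver.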
Software based:
c) fuse bugs: I have run into a few in the past, and they have caused
errors like this. But umount/mount rarely fixes a hung fuse process, so
this is, again, IMO, less likely.
d) GlusterFS bugs: I think the devels would recognize it if it were
one. I doubt this at this moment.
e) kernel bug: We are using 2.6.27.5 right now, and are about to update
to .7 due to some CERT advisories. We have had (stability) issues with
kernels from 2.6.24 to 2.6.26.x (low values of x) under intense loads. It
wouldn't surprise me if what you are observing is actually just a
symptom of a real problem somewhere else in the kernel. That the state
gets resolved when you umount/mount suggests that this could be the case.
Joe
--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: landman at scalableinformatics.com
web : http://www.scalableinformatics.com
http://jackrabbit.scalableinformatics.com
phone: +1 734 786 8423 x121
fax : +1 866 888 3112
cell : +1 734 612 4615