[Gluster-users] single problematic node (brick)

Tue May 20 18:16:21 UTC 2014

Hello,

I have a rather simple Gluster configuration that consists of 85 TB 
distributed across six nodes. One particular node seems to fail roughly 
once a week, and I can't figure out why.

I have attached my Gluster configuration and a recent log file from the 
problematic node. From a user's point of view, the symptom of the failure 
is that any attempt to access the Gluster volume from the problematic node 
fails with a "Transport endpoint is not connected" error.

Restarting the Gluster daemons and remounting the volume on the failed 
node always fixes the problem. But by that point some number of jobs in 
our batch queue have usually already failed because of this issue, and 
it's becoming a headache.
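
One way to catch the failure before jobs are dispatched to the node would 
be to probe the mount point first. The following is only a minimal sketch 
under assumptions: the function name mount_is_connected is made up for 
illustration, and it relies on the fact that "Transport endpoint is not 
connected" is the message for errno ENOTCONN on Linux, so a stat() of the 
mount point should fail with that errno when the client has dropped.

```python
import errno
import os


def mount_is_connected(path):
    """Return False if stat() on path fails with ENOTCONN, True if it succeeds.

    Hypothetical helper, not part of Gluster. Any other OSError (for
    example a missing directory) is re-raised so it is not mistaken
    for a dropped mount.
    """
    try:
        os.stat(path)
    except OSError as exc:
        if exc.errno == errno.ENOTCONN:
            return False
        raise
    return True
```

A batch prologue script could call mount_is_connected("/global") and take 
the node out of the queue when it returns False, instead of letting jobs 
start and fail.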

It could be a FUSE issue, since I see many FUSE-related error messages in 
the Gluster log, but I can't disentangle the various errors. The relevant 
line in my /etc/fstab file is:

server:global /global glusterfs defaults,direct-io-mode=disable,log-level=WARNING,log-file=/var/log/gluster.log 0 0
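
To help disentangle the errors in the log file named in the mount options, 
the messages could be tallied by severity and originating function. This 
is a rough sketch, not a definitive parser: it assumes each log line 
starts with "[timestamp] LEVEL [source.c:line:function]", which may not 
hold for every Gluster version, and the names LOG_LINE and tally_log are 
made up for illustration. Lines that do not match are skipped.

```python
import re
from collections import Counter

# Assumed layout: [timestamp] LEVEL [source.c:line:function] translator: message
LOG_LINE = re.compile(
    r"^\[(?P<ts>[^\]]+)\]\s+(?P<level>[A-Z])\s+\[(?P<src>[^\]]+)\]"
)


def tally_log(lines):
    """Count log lines per (level, function) pair; skip unparseable lines."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match is None:
            continue
        # The last colon-separated field of the source tag is the function name.
        function = match.group("src").split(":")[-1]
        counts[(match.group("level"), function)] += 1
    return counts
```

Passing an open handle on /var/log/gluster.log to tally_log and printing 
the result of most_common(10) would show which messages dominate, which 
may make it easier to see what precedes each disconnect.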

Any ideas on the source of the problem? Could it be a hardware (network) 
glitch? The fact that it happens on only one node, which is configured 
identically to the other nodes (with the same hardware), points to 
something like that.

thanks! Doug
-------------- next part --------------
A non-text attachment was scrubbed...
Name: gluster.log.gz
Type: application/gzip
Size: 19765 bytes
Desc: not available
URL: <http://supercolony.gluster.org/pipermail/gluster-users/attachments/20140520/4eff4638/attachment.bin>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: gluster.cfg.gz
Type: application/gzip
Size: 429 bytes
Desc: not available
URL: <http://supercolony.gluster.org/pipermail/gluster-users/attachments/20140520/4eff4638/attachment-0001.bin>