[Gluster-devel] On Gluster resiliency

Fri Dec 23 16:40:47 UTC 2016

The last few days have been tense because a replica-3 (R3) Gluster
3.8.5 cluster that I built has been plagued by problems.

The first symptom was a continuous stream of errors like this one in
the client logs:

[2016-12-17 15:55:02.047508] E [MSGID: 108009]
[afr-open.c:187:afr_openfd_fix_open_cbk]
0-hisap-prod-1-replicate-0: Failed to open
/home/galaxy/HISAP/java/lib/java/jre1.7.0_51/jre/lib/rt.jar on subvolume
hisap-prod-1-client-2 [Transport endpoint is not connected]

followed by very frequent peer disconnections/reconnections and a
continuous stream of files to be healed on several volumes.
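
For anyone chasing a similar symptom, here is a minimal sketch of how
these errors could be tallied per subvolume to see which brick
connection is failing. It is not a Gluster tool: the function name
count_open_failures is made up for illustration, and it assumes that
each entry sits on a single line in the real client log (the excerpt
above is wrapped by the mailer) and that paths contain no spaces.

```python
import re
from collections import Counter

# Matches the AFR "Failed to open" error quoted above.
# Assumption: one log entry per line, no whitespace inside the path.
PATTERN = re.compile(
    r"\[MSGID: 108009\].*Failed to open (?P<path>\S+) on subvolume "
    r"(?P<subvol>\S+) \[(?P<reason>[^\]]+)\]"
)


def count_open_failures(lines):
    """Return a Counter mapping subvolume name to number of failed opens."""
    counts = Counter()
    for line in lines:
        match = PATTERN.search(line)
        if match:
            counts[match.group("subvol")] += 1
    return counts


if __name__ == "__main__":
    # The log entry from this mail, joined back onto a single line.
    sample = [
        "[2016-12-17 15:55:02.047508] E [MSGID: 108009] "
        "[afr-open.c:187:afr_openfd_fix_open_cbk] "
        "0-hisap-prod-1-replicate-0: Failed to open "
        "/home/galaxy/HISAP/java/lib/java/jre1.7.0_51/jre/lib/rt.jar "
        "on subvolume hisap-prod-1-client-2 "
        "[Transport endpoint is not connected]",
    ]
    for subvol, n in count_open_failures(sample).most_common():
        print(subvol, n)
```

If one subvolume dominates the tally, the peer hosting that brick is
the first place to look.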

The problem has been traced back to a flaky X540-T2 10GbE NIC embedded
in one peer's motherboard, which was unable to keep the correct
10 Gbit/s speed negotiation with the switch.

The motherboard on that peer has been replaced, and the volumes then
healed quickly to complete health.  All of this happened while the
users kept running some heavy-duty bioinformatics applications (NGS
data analysis) on top of Gluster.  No user noticed ANYTHING despite a
major hardware problem and the off-lining of a peer.

This is a RESILIENT system, in my book.

Gluster people, despite the constant stream of problems and requests
for help that you see on the ML and IRC, rest assured that you are
building a nice piece of software, at least IMHO.

Keep up the good work and Merry Christmas.

Ivan Rossi