[Gluster-users] Gluster 3.7.17 distributed-replicated volume experiences almost regular Gluster internal NFS subprocess crash (CentOS 7.2)

Fri Dec 2 20:12:32 UTC 2016

Hi all,

I've added more info and full /var/log/gluster contents from cluster nodes to https://bugzilla.redhat.com/show_bug.cgi?id=1381970

After many attempts at configuration tweaks, the crash problem remains and can be reliably reproduced within a few minutes by generating NFS load.

Any advice whatsoever would be very welcome!

(I started thinking about migrating to NFS-Ganesha, but I don't think we can safely introduce a full Pacemaker/Corosync stack into a hyperconverged GlusterFS/oVirt setup.)

Many thanks in advance.

Best regards,
Giuseppe Ragusa

________________________________________
From: gluster-users-bounces at gluster.org <gluster-users-bounces at gluster.org> on behalf of Giuseppe Ragusa <giuseppe.ragusa at hotmail.com>
Sent: Tuesday, 29 November 2016 23:36
To: gluster-users at gluster.org
Subject: [Gluster-users] Gluster 3.7.17 distributed-replicated volume experiences almost regular Gluster internal NFS subprocess crash (CentOS 7.2)

Hi all,

I'm writing to kindly ask for help with the issue in the subject line above, which is documented in:

https://bugzilla.redhat.com/show_bug.cgi?id=1381970

Brief recap:

A 3-node distributed-replicated volume cluster (replica with arbiter, with the arbiter confined to the same dedicated node for all volumes) experiences regular NFS crashes on at least one non-arbiter node at a time (both non-arbiter nodes crash if given enough time without applying the workaround described below). There are no Gluster native clients, only NFS ones, all on a dedicated network.

Simply restarting an NFS-enabled volume restarts the NFS services on all (non-arbiter) nodes for all volumes, and all seems well until the next crash (crashes happen many times a day under our normal workload).
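For reference, the workaround above could be scripted roughly as follows. This is only a minimal sketch, assuming that "restarting" means a plain stop/start cycle through the gluster CLI; the volume name "myvolume" and the helper names are hypothetical, and the script defaults to a dry run that only prints the commands.

```python
import subprocess


def restart_commands(volume):
    """Return the gluster CLI invocations for a stop/start cycle of one volume."""
    # --mode=script suppresses the interactive confirmation that "volume stop" asks for
    base = ["gluster", "--mode=script"]
    return [
        base + ["volume", "stop", volume],
        base + ["volume", "start", volume],
    ]


def restart_volume(volume, dry_run=True):
    """Print the commands (dry run) or execute them; raises if a command fails."""
    for cmd in restart_commands(volume):
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # "myvolume" is a placeholder, not one of the volumes in this cluster
    restart_volume("myvolume")
```

Note that stopping a volume makes it unavailable to its clients until it is started again, so this is a stopgap rather than a fix.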

An almost sure way of making NFS crash immediately is to recreate the yum metadata directory on a CentOS 7 OS mirror repo hosted on an NFS-enabled volume.

Since it is a production cluster and we had to disable various cron jobs that were regularly crashing the Gluster internal NFS server (no NFS-Ganesha in use here), I am almost ready to accept even an upgrade to 3.8.x as a solution. I dare say so because I've seen various fixes in Gerrit that were not backported to 3.7; for one of them I even filed a Bugzilla report, cloning the 3.8 bug and kindly asking for a backport, given that the patch applied cleanly. This raises the question: is the backporting of patches to 3.7 being phased out unless explicitly requested?

The only caveat could be that the cluster is a hyperconverged setup with oVirt 3.6.7 (the oVirt part, with its dedicated Gluster volumes, is working flawlessly and is not being used to manage Gluster at all, only to monitor it), so I would need to check for 3.8 compatibility before upgrading.

Many thanks in advance to anyone who can offer any advice on this issue.

Best regards,

Giuseppe