[Bugs] [Bug 1776264] New: RFE: systemd should restart glusterd on crash

Mon Nov 25 11:41:51 UTC 2019

https://bugzilla.redhat.com/show_bug.cgi?id=1776264

            Bug ID: 1776264
           Summary: RFE: systemd should restart glusterd on crash
           Product: GlusterFS
           Version: mainline
          Hardware: x86_64
                OS: Linux
            Status: NEW
         Component: glusterd
          Keywords: Improvement, ZStream
          Severity: high
          Assignee: bugs at gluster.org
          Reporter: srakonde at redhat.com
                CC: bmekala at redhat.com, bugs at gluster.org,
                    jstrunk at redhat.com, mchangir at redhat.com, pasik at iki.fi,
                    puebele at redhat.com, rhs-bugs at redhat.com,
                    sheggodu at redhat.com, storage-qa-internal at redhat.com,
                    vbellur at redhat.com
        Depends On: 1663557
  Target Milestone: ---
    Classification: Community

Description of problem:
Currently, systemd is used to manage glusterd, but after the initial start, it
does not ensure glusterd continues to run. Within limits, systemd should
attempt to restart glusterd if it crashes in order to better handle transient
failures.

Version-Release number of selected component (if applicable):
glusterfs-fuse-3.12.2-25.el7rhgs.x86_64
python2-gluster-3.12.2-25.el7rhgs.x86_64
gluster-nagios-common-0.2.4-1.el7rhgs.noarch
glusterfs-libs-3.12.2-25.el7rhgs.x86_64
glusterfs-client-xlators-3.12.2-25.el7rhgs.x86_64
glusterfs-cli-3.12.2-25.el7rhgs.x86_64
glusterfs-api-3.12.2-25.el7rhgs.x86_64
glusterfs-3.12.2-25.el7rhgs.x86_64
vdsm-gluster-4.19.43-2.3.el7rhgs.noarch
glusterfs-server-3.12.2-25.el7rhgs.x86_64
gluster-nagios-addons-0.2.10-2.el7rhgs.x86_64
pcp-pmda-gluster-4.3.0-0.201812061439.git24488c63.el7.x86_64
glusterfs-geo-replication-3.12.2-25.el7rhgs.x86_64
libvirt-daemon-driver-storage-gluster-4.5.0-10.el7_6.3.x86_64
glusterfs-rdma-3.12.2-25.el7rhgs.x86_64

How reproducible:
100%... if glusterd crashes, it stays down.

Steps to Reproduce:
1. Encounter glusterd SEGV
2. Observe the lack of restart

Actual results:
glusterd is not automatically restarted after it crashes.

Expected results:
For occasional crashes, we should use systemd to restart glusterd

Additional info:
This request comes from my experience maintaining openshift.io. We encounter
periodic crashes of glusterd, usually due to monitoring operations. To recover
from these crashes automatically, I have adjusted the unit file by adding the
following to the [Service] section:

StartLimitBurst=3
StartLimitIntervalSec=3600
StartLimitInterval=3600
Restart=on-abnormal
RestartSec=60

With these settings, systemd automatically restarts glusterd if it crashes,
waiting 60 seconds (RestartSec) before each attempt. It will restart up to 3
times within a one-hour window (StartLimitBurst, StartLimitInterval). This
masks the occasional failure, but leaves the daemon down if failures exceed
the threshold (at which point other monitoring will raise an alert).
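
For anyone who wants to try this without editing the packaged unit file, the
same settings could be carried in a drop-in. The following is only a sketch
under some assumptions: the unit is named glusterd.service, the drop-in file
name "restart.conf" is made up for illustration, and the systemd version is
recent enough (230 or later, to my knowledge) that the start rate limit is
read from the [Unit] section. On older systemd, such as the version shipped
with el7, the equivalent settings are StartLimitInterval= and StartLimitBurst=
in [Service], as in the lines quoted above.

```ini
# Hypothetical drop-in: /etc/systemd/system/glusterd.service.d/restart.conf
# (the file name is an assumption; any *.conf name in that directory works)

[Unit]
# Start rate limit: allow at most 3 starts within a 3600-second window.
# Once the limit is hit, systemd stops trying and the unit stays down.
StartLimitIntervalSec=3600
StartLimitBurst=3

[Service]
# Restart only on abnormal termination (an unclean signal such as SEGV,
# a watchdog timeout, or an operation timeout), not when the process
# exits on its own with an exit code.
Restart=on-abnormal
# Wait 60 seconds before each restart attempt.
RestartSec=60
```

After adding the file, "systemctl daemon-reload" makes systemd pick it up, and
"systemctl show glusterd -p Restart" should report the effective restart
policy.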

We should consider incorporating the above (or a variant thereof) into the
standard distribution.

Referenced Bugs:

https://bugzilla.redhat.com/show_bug.cgi?id=1663557
[Bug 1663557] RFE: systemd should restart glusterd on crash