[Bugs] [Bug 1145000] New: Spec %post server does not wait for the old glusterd to exit

bugzilla at redhat.com bugzilla at redhat.com
Mon Sep 22 08:08:00 UTC 2014


https://bugzilla.redhat.com/show_bug.cgi?id=1145000

            Bug ID: 1145000
           Summary: Spec %post server does not wait for the old glusterd
                    to exit
           Product: GlusterFS
           Version: 3.5.2
         Component: build
          Severity: low
          Assignee: gluster-bugs at redhat.com
          Reporter: ndevos at redhat.com
                CC: bugs at gluster.org, fabian.arrotin at arrfab.net,
                    gliverma at westga.edu, gluster-bugs at redhat.com,
                    kkeithle at redhat.com, kparthas at redhat.com,
                    lmohanty at redhat.com, ndevos at redhat.com,
                    pkarampu at redhat.com, puiterwijk at redhat.com,
                    vpvainio at iki.fi
        Depends On: 1113543
            Blocks: 1125231 (glusterfs-3.5.3)



+++ This bug was initially created as a clone of Bug #1113543 +++
+++                                                           +++
+++ This bug is for release-3.5                               +++
+++                                                           +++

Description of problem:
The %post server of gluster.spec says:
killall glusterd &> /dev/null
glusterd --xlator-option *.upgrade=on -N
This does not wait for the old glusterd to actually exit. The new glusterd then
finds it cannot bind to the interface and quits, and when the original process
finally exits as well, no glusterd is left running.


Version-Release number of selected component (if applicable):
glusterfs-3.5.1


How reproducible:
Every time


Steps to Reproduce:
1. Run glusterd
2. Upgrade from 3.5.0 to 3.5.1
3.

Actual results:
No glusterd running anymore


Expected results:
An upgraded glusterd running


Additional info:

--- Additional comment from Niels de Vos on 2014-06-26 15:35:37 CEST ---

The post-installation script for the glusterfs-server package handles the
restarting of glusterd incorrectly. This caused an outage when the
glusterfs-server package was automatically updated.

After checking the logs together with Patrick, we came to the conclusion that
the running glusterd should have received a signal and would be exiting.
However, the script does not wait for the running glusterd to exit, and starts
a new glusterd process immediately after sending the SIGTERM. If the first
glusterd process has not exited yet, the new glusterd process cannot listen on
port 24007 and exits. The first glusterd will eventually exit too, leaving the
service unavailable.

Snippet from the .spec:

 %post server
 ...
 pidof -c -o %PPID -x glusterd &> /dev/null
 if [ $? -eq 0 ]; then
 ...
     killall glusterd &> /dev/null
     glusterd --xlator-option *.upgrade=on -N
 else
     glusterd --xlator-option *.upgrade=on -N
 fi
 ...
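The race described above could be closed by polling until the old process is
really gone before starting the new one. A minimal sketch, assuming POSIX sh;
the `wait_for_exit` helper and the default 30-second timeout are illustrative,
not necessarily what the merged patch uses:

```shell
#!/bin/sh
# Poll until the process with PID $1 has exited, or give up after
# $2 seconds (default 30). Returns 0 once the process is gone.
wait_for_exit() {
    pid=$1
    timeout=${2:-30}
    for _ in $(seq 1 "$timeout"); do
        kill -0 "$pid" 2>/dev/null || return 0   # process has exited
        sleep 1
    done
    return 1                                     # still alive; timed out
}

# How the %post server snippet could use it (sketch):
#   pid=$(pidof -c -o %PPID -x glusterd)
#   if [ -n "$pid" ]; then
#       killall glusterd &> /dev/null
#       wait_for_exit "$pid" 30
#   fi
#   glusterd --xlator-option *.upgrade=on -N
```

Alternatively, psmisc's killall supports a -w/--wait flag that blocks until
the signalled processes have died.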


I am not sure what the best way is to start glusterd with these specific
options just once. Maybe they should be listed in /etc/sysconfig/glusterd so
that the standard init script or systemd job handles it?
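A sketch of what such a sysconfig file could look like; the GLUSTERD_OPTIONS
variable name is an assumption, and the init script or systemd unit would have
to be taught to read it:

```shell
# /etc/sysconfig/glusterd -- hypothetical sketch
# Extra options passed to glusterd by the init script / systemd unit.
# For a one-shot upgrade run, a scriptlet could write this line and
# remove it again after the first successful start.
GLUSTERD_OPTIONS="--xlator-option *.upgrade=on -N"
```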

--- Additional comment from Kaleb KEITHLEY on 2014-06-26 17:31:35 CEST ---

Which is the primary concern, that the new glusterd was started too soon? That
we need a cleaner solution for starting glusterd with the *.upgrade=on option?
Or both?

--- Additional comment from Anand Avati on 2014-06-26 23:18:17 CEST ---

REVIEW: http://review.gluster.org/8185 (build/glusterfs.spec.in: %post server
doesn't wait for old glusterd) posted (#1) for review on master by Kaleb
KEITHLEY (kkeithle at redhat.com)

--- Additional comment from Anand Avati on 2014-06-27 12:36:21 CEST ---

REVIEW: http://review.gluster.org/8185 (build/glusterfs.spec.in: %post server
doesn't wait for old glusterd) posted (#2) for review on master by Kaleb
KEITHLEY (kkeithle at redhat.com)

--- Additional comment from Anand Avati on 2014-06-27 13:08:14 CEST ---

REVIEW: http://review.gluster.org/8185 (build/glusterfs.spec.in: %post server
doesn't wait for old glusterd) posted (#3) for review on master by Kaleb
KEITHLEY (kkeithle at redhat.com)

--- Additional comment from Anand Avati on 2014-06-30 17:15:08 CEST ---

REVIEW: http://review.gluster.org/8185 (build/glusterfs.spec.in: %post server
doesn't wait for old glusterd) posted (#4) for review on master by Kaleb
KEITHLEY (kkeithle at redhat.com)

--- Additional comment from Anand Avati on 2014-07-02 10:39:39 CEST ---

COMMIT: http://review.gluster.org/8185 committed in master by Vijay Bellur
(vbellur at redhat.com) 
------
commit 858b570a0c62d31416f0aee8c385b3118a1fad43
Author: Kaleb S. KEITHLEY <kkeithle at redhat.com>
Date:   Thu Jun 26 17:14:39 2014 -0400

    build/glusterfs.spec.in: %post server doesn't wait for old glusterd

    'killall glusterd' needs to wait for the old glusterd to exit
    before starting the updated one, otherwise the new process can't
    bind to its socket ports

    Change-Id: Ib43c76f232e0ea6f7f8469fb12be7f2b907fb7c8
    BUG: 1113543
    Signed-off-by: Kaleb S. KEITHLEY <kkeithle at redhat.com>
    Reviewed-on: http://review.gluster.org/8185
    Reviewed-by: Niels de Vos <ndevos at redhat.com>
    Reviewed-by: Lalatendu Mohanty <lmohanty at redhat.com>
    Tested-by: Gluster Build System <jenkins at build.gluster.com>
    Reviewed-by: Humble Devassy Chirammal <humble.devassy at gmail.com>
    Reviewed-by: Vijay Bellur <vbellur at redhat.com>

--- Additional comment from  on 2014-08-06 18:54:22 CEST ---

I posted the below message to the gluster-users list and was asked to also post
it here:

When updates were applied a couple of nights ago, all my Gluster nodes went
down and "service glusterd status" reported it dead on all 3 nodes in my
replicated setup. This seems very similar to a bug that was recently fixed
(https://bugzilla.redhat.com/show_bug.cgi?id=1113543). Any ideas what's up
with this?

[root at eapps-gluster01 ~]# rpm -qa |grep gluster
glusterfs-libs-3.5.2-1.el6.x86_64
glusterfs-cli-3.5.2-1.el6.x86_64
glusterfs-geo-replication-3.5.2-1.el6.x86_64
glusterfs-3.5.2-1.el6.x86_64
glusterfs-fuse-3.5.2-1.el6.x86_64
glusterfs-server-3.5.2-1.el6.x86_64
glusterfs-api-3.5.2-1.el6.x86_64

--- Additional comment from Lalatendu Mohanty on 2014-08-07 08:53:34 CEST ---

Looks like the fix is not working as expected. Hence, I am moving the bug to
the ASSIGNED state.

--- Additional comment from Niels de Vos on 2014-08-07 10:47:03 CEST ---

I guess it could fail if not all packages have been updated yet. There could
be library mismatches of some kind.

Instead of doing the kill+restart in %post, it may be safer to do it in
%posttrans (or whatever the name is)?
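If the restart moved to %posttrans, the scriptlet could look roughly like
this; a sketch only, reusing the commands already quoted from %post, not an
actual proposed change:

```shell
# Hypothetical %posttrans scriptlet: runs after the entire RPM
# transaction, when all updated packages and libraries are in place.
%posttrans server
if pidof -c -o %PPID -x glusterd &> /dev/null; then
    killall glusterd &> /dev/null
fi
glusterd --xlator-option *.upgrade=on -N
```

Note that a wait between the killall and the restart would still be needed;
%posttrans only removes the risk of restarting against half-updated libraries.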


Referenced Bugs:

https://bugzilla.redhat.com/show_bug.cgi?id=1113543
[Bug 1113543] Spec %post server does not wait for the old glusterd to exit
https://bugzilla.redhat.com/show_bug.cgi?id=1125231
[Bug 1125231] GlusterFS 3.5.3 Tracker

