[Bugs] [Bug 1636088] New: ocf:glusterfs: volume resource agent for pacemaker fails to stop gluster volume processes glusterfsd

bugzilla at redhat.com bugzilla at redhat.com
Thu Oct 4 12:45:40 UTC 2018


https://bugzilla.redhat.com/show_bug.cgi?id=1636088

            Bug ID: 1636088
           Summary: ocf:glusterfs:volume resource agent for pacemaker
                    fails to stop gluster volume processes glusterfsd
           Product: GlusterFS
           Version: 3.12
         Component: common-ha
          Assignee: bugs at gluster.org
          Reporter: erik.dobak at gmail.com
                CC: bugs at gluster.org



Description of problem:

I am using Pacemaker to run GlusterFS. After setting it up I tested it with
'crm node standby node01' but got a timeout from the volume agent:
crmd:    error: process_lrm_event:    Result of stop operation for
p_volume_gluster on node02: Timed Out | call=559 key=p_volume_gluster_stop_0
timeout=20000ms

When checking the processes with ps -ef, I could still see gluster processes
running on the node.

Version-Release number of selected component (if applicable):

Name        : glusterfs-resource-agents
Arch        : noarch
Version     : 3.12.14
Release     : 1.el6
Size        : 13 k
Repo        : installed
From repo   : centos-gluster312


How reproducible:
Configure gluster in Pacemaker (2 nodes):

primitive glusterd ocf:glusterfs:glusterd \
    op monitor interval=10 timeout=120s \
    op start timeout=120s interval=0 \
    op stop timeout=120s interval=0

primitive p_volume_gluster ocf:glusterfs:volume \
    params volname=gv0 \
    op stop interval=0 trace_ra=1 \
    op monitor interval=0 timeout=120s \
    op start timeout=120s interval=0

clone cl_glusterd glusterd \
    meta interleave=true clone-max=2 clone-node-max=1 target-role=Started

clone cl_glustervol p_volume_gluster \
    meta interleave=true clone-max=2 clone-node-max=1

Run gluster in the cluster, then put a node on standby.
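
Assuming crmsh, the configuration above can be loaded and exercised roughly
like this (the file name gluster.crm is only an example):

# load the primitives and clones above into the CIB
crm configure load update gluster.crm

# check that both clones are running on both nodes
crm_mon -1

# trigger the failing stop
crm node standby node01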

Steps to Reproduce:
1. Start gluster in Pacemaker.
2. Put a node on standby: crm node standby node01
3. Wait for the error messages.

Actual results:
Getting a timeout error for the volume primitive. The processes are still
running: /usr/sbin/glusterfsd

Expected results:
Gluster should shut down and no errors should appear in corosync.log.

Additional info:
I debugged the volume resource agent
(/usr/lib/ocf/resource.d/glusterfs/volume) and found two issues that
prevented the agent from stopping the processes.

1. SHORTHOSTNAME=`hostname -s`
On my system only the full hostname is used, so I had to change this line to:
SHORTHOSTNAME=`hostname`
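
To show why the hostname matters, here is a simplified sketch of the kind of
lookup the stop action does (not the literal agent code; paths and file names
are illustrative): the bricks belonging to the local node are selected by
matching the hostname against the volume's info file, and the matching pid
files are then used to kill the glusterfsd processes. With 'hostname -s' on a
system where glusterd recorded the bricks under the full hostname, nothing
matches and no pid is ever found.

# simplified sketch, not the literal agent code
SHORTHOSTNAME=`hostname`            # was: hostname -s
voldir="/var/lib/glusterd/vols/${OCF_RESKEY_volname}"

# bricks on this node, as recorded by glusterd in the volume info file
bricks=`sed -n -e "s/^brick-.*=${SHORTHOSTNAME}://p" < ${voldir}/info`

# each brick has a pid file named after host and brick path (illustrative);
# if the hostname key is wrong, the loop never runs and glusterfsd survives
for brick in ${bricks}; do
    pidfile="${voldir}/run/${SHORTHOSTNAME}${brick}.pid"
    [ -f "${pidfile}" ] && kill -TERM `cat "${pidfile}"`
done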

2. The function volume_getdir() had the wrong path hardcoded:

volume_getdir() {
    local voldir
    voldir="/etc/glusterd/vols/${OCF_RESKEY_volname}"
    [ -d ${voldir} ] || return 1

    echo "${voldir}"
    return 0
}

I had to change /etc/glusterd to /var/lib/glusterd:

volume_getdir() {
    local voldir
    voldir="/var/lib/glusterd/vols/${OCF_RESKEY_volname}"
    [ -d ${voldir} ] || return 1

    echo "${voldir}"
    return 0
}
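
A more defensive variant (just a sketch of what I mean, not what ships with
the package) could probe both locations instead of hardcoding one, so the
same agent would work with either layout:

volume_getdir() {
    local voldir
    # prefer the current location, fall back to the legacy one
    for voldir in "/var/lib/glusterd/vols/${OCF_RESKEY_volname}" \
                  "/etc/glusterd/vols/${OCF_RESKEY_volname}"; do
        if [ -d "${voldir}" ]; then
            echo "${voldir}"
            return 0
        fi
    done
    return 1
}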

I am not sure if this is because I am running CentOS 6; maybe the paths and
hostnames differ on CentOS 7.

-- 
You are receiving this mail because:
You are on the CC list for the bug.
You are the assignee for the bug.

