[Bugs] [Bug 1785756] New: Mount of gluster volumes fails

bugzilla at redhat.com bugzilla at redhat.com
Fri Dec 20 21:12:54 UTC 2019


https://bugzilla.redhat.com/show_bug.cgi?id=1785756

            Bug ID: 1785756
           Summary: Mount of gluster volumes fails
           Product: GlusterFS
           Version: mainline
            Status: NEW
         Component: glusterd
          Severity: medium
          Assignee: bugs at gluster.org
          Reporter: zhigwang at redhat.com
                CC: bugs at gluster.org
  Target Milestone: ---
    Classification: Community



Created attachment 1646904
  --> https://bugzilla.redhat.com/attachment.cgi?id=1646904&action=edit
gluster pods log

Description of problem:

The customer has an issue where scaling up or deploying a pod with persistent
storage configured fails because the gluster volume cannot be mounted.

MountVolume.SetUp failed for volume 
"pvc-8488da95-5d1d-11e9-b5ae-fa163eb4c591" : mount failed: mount failed:
 exit status 1
Mounting command: systemd-run
Mounting arguments: --description=Kubernetes transient mount for
/var/lib/origin/openshift.local.volumes/pods/e918668b-7d60-11e9-b46b-fa163e4b654c/volumes/kubernetes.io~glusterfs/pvc-8488da95-5d1d-11e9-b5ae-fa163eb4c591
 --scope -- mount -t glusterfs -o 
log-level=ERROR,log-file=/var/lib/origin/openshift.local.volumes/plugins/kubernetes.io/glusterfs/pvc-8488da95-5d1d-11e9-b5ae-fa163eb4c591/prometheus-k8s-0-glusterfs.log,backup-volfile-servers=192.168.8.16:192.168.8.27:192.168.8.35,auto_unmount
 192.168.8.16:vol_b2a95e6f787279137369573ba044254b 
/var/lib/origin/openshift.local.volumes/pods/e918668b-7d60-11e9-b46b-fa163e4b654c/volumes/kubernetes.io~glusterfs/pvc-8488da95-5d1d-11e9-b5ae-fa163eb4c591
Output: Running scope as unit run-128353.scope.
Mount failed. Please check the log file for more details.
the following error information was pulled from the glusterfs log to 
help diagnose this issue: 
[2019-05-23 14:11:21.535947] E [fuse-bridge.c:900:fuse_getattr_resume] 
0-glusterfs-fuse: 3: GETATTR 1 (00000000-0000-0000-0000-000000000001) 
resolution failed
The message "E [MSGID: 108006] 
[afr-common.c:4944:__afr_handle_child_down_event] 
0-vol_b2a95e6f787279137369573ba044254b-replicate-0: All subvolumes are down.
Going offline until atleast one of them comes back up." repeated 2 times
between [2019-05-23 14:11:21.516352] and [2019-05-23 14:11:21.518178]
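
To narrow this down, the brick state of the affected volume can be checked from
inside one of the gluster pods, and the mount can be retried by hand from the
node with more verbose logging. This is only a rough sketch: the /mnt/pvc-test
mount point and the DEBUG log level are my own choices, while the volume name
and volfile servers are taken from the kubelet output above.

    sh-4.2# gluster volume status vol_b2a95e6f787279137369573ba044254b
    sh-4.2# gluster volume info vol_b2a95e6f787279137369573ba044254b

    # mount -t glusterfs \
        -o log-level=DEBUG,backup-volfile-servers=192.168.8.16:192.168.8.27:192.168.8.35 \
        192.168.8.16:vol_b2a95e6f787279137369573ba044254b /mnt/pvc-test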

A Red Hat employee who has encountered this issue in many customer environments
(including Sky Italia, I believe) suggested a manual workaround. It works, and
he told me it is related to a known multiplexing bug that has not yet been
fixed.
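
If the multiplexing in question is gluster's brick multiplexing (my
assumption), it may be worth confirming whether it is actually enabled on the
cluster. A possible check, assuming a gluster version that can query global
options:

    sh-4.2# gluster volume get all cluster.brick-multiplex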

Here is the workaround I apply each time (a scripted version is sketched after
the steps):

    If possible, scale the pod to 0 replicas

    Identify the volume name of the corrupted pv:
    $ oc get pv -o
custom-columns=pvc:.spec.claimRef.name,pv:.metadata.name,vol:.spec.glusterfs.path,size:.spec.capacity.storage,status:.status.phase,namespace:.spec.claimRef.namespace
| grep alert | grep "main-0"
    alertmanager-main-db-alertmanager-main-0  
pvc-ca80af42-3079-11e9-b3be-fa163e0f08ad   vol_f1e1da4a01aa384771e1c659a3bf6cda
  2Gi       Bound     openshift-monitoring

    Connect to one of the 3 glusterfs-storage-* pods in the storage project:
    $ oc project storage
    $ oc rsh glusterfs-storage-929l6

    Execute a force start:
    sh-4.2# gluster
    gluster> volume start vol_f1e1da4a01aa384771e1c659a3bf6cda force  
    volume start: vol_f1e1da4a01aa384771e1c659a3bf6cda: success

    Scale the pod back up, or restart it by deleting it.
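
For completeness, here is a rough script that puts the same steps together for
a single PVC. The PVC name, namespace, and glusterfs-storage pod name are
placeholders taken from my example above and have to be adjusted; it also
assumes the PVC is already bound to a PV.

    #!/bin/bash
    # Rough automation of the manual workaround above.
    # The PVC name, namespace, and glusterfs-storage pod are placeholders
    # from my example; adjust them for the affected volume.
    PVC=alertmanager-main-db-alertmanager-main-0
    NS=openshift-monitoring
    GLUSTER_POD=glusterfs-storage-929l6

    # Resolve the PV bound to the PVC, then the gluster volume behind it.
    PV=$(oc get pvc "$PVC" -n "$NS" -o jsonpath='{.spec.volumeName}')
    VOL=$(oc get pv "$PV" -o jsonpath='{.spec.glusterfs.path}')

    # Force-start the volume from inside one of the gluster pods.
    oc rsh -n storage "$GLUSTER_POD" gluster volume start "$VOL" force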

Any idea how to fix this? Can I provide more info or output?
A sosreport from a storage node is attached.

Thanks

Where did this type of behavior occur? In what environment?

OpenShift Container Platform 3.11.69 with Gluster installed in converged mode.
We have 3 storage nodes with 8 GB RAM and 4 vCPUs each.

When does this type of behavior occur? Frequently? Repeatedly?
At specific times?

Almost always; it affects more than 90% of starting pods.

What information can you provide about the time frames involved and the impact
on the business?

The impact is very high for internal development: self-service provisioning is
effectively impossible, because we have to apply the workaround manually from
the backend every time.


Additional info:

[Note]: As mentioned above, the workaround was suggested by a Red Hat employee
who attributed the failure to a known multiplexing bug that has not yet been
fixed. However, I have not been able to find that bug in the Bugzilla database.
