[Bugs] [Bug 1785756] New: Mount of gluster volumes fails
bugzilla at redhat.com
Fri Dec 20 21:12:54 UTC 2019
https://bugzilla.redhat.com/show_bug.cgi?id=1785756
Bug ID: 1785756
Summary: Mount of gluster volumes fails
Product: GlusterFS
Version: mainline
Status: NEW
Component: glusterd
Severity: medium
Assignee: bugs at gluster.org
Reporter: zhigwang at redhat.com
CC: bugs at gluster.org
Target Milestone: ---
Classification: Community
Created attachment 1646904
--> https://bugzilla.redhat.com/attachment.cgi?id=1646904&action=edit
gluster pods log
Description of problem:
The customer has an issue where scaling up or deploying a pod with persistent
storage configured fails because of a mount failure of the gluster volume.
MountVolume.SetUp failed for volume
"pvc-8488da95-5d1d-11e9-b5ae-fa163eb4c591" : mount failed: mount failed:
exit status 1
Mounting command: systemd-run
Mounting arguments: --description=Kubernetes transient mount for
/var/lib/origin/openshift.local.volumes/pods/e918668b-7d60-11e9-b46b-fa163e4b654c/volumes/kubernetes.io~glusterfs/pvc-8488da95-5d1d-11e9-b5ae-fa163eb4c591
--scope -- mount -t glusterfs -o
log-level=ERROR,log-file=/var/lib/origin/openshift.local.volumes/plugins/kubernetes.io/glusterfs/pvc-8488da95-5d1d-11e9-b5ae-fa163eb4c591/prometheus-k8s-0-glusterfs.log,backup-volfile-servers=192.168.8.16:192.168.8.27:192.168.8.35,auto_unmount
192.168.8.16:vol_b2a95e6f787279137369573ba044254b
/var/lib/origin/openshift.local.volumes/pods/e918668b-7d60-11e9-b46b-fa163e4b654c/volumes/kubernetes.io~glusterfs/pvc-8488da95-5d1d-11e9-b5ae-fa163eb4c591
Output: Running scope as unit run-128353.scope.
Mount failed. Please check the log file for more details.
The following error information was pulled from the glusterfs log to
help diagnose this issue:
[2019-05-23 14:11:21.535947] E [fuse-bridge.c:900:fuse_getattr_resume]
0-glusterfs-fuse: 3: GETATTR 1 (00000000-0000-0000-0000-000000000001)
resolution failed
The message "E [MSGID: 108006]
[afr-common.c:4944:__afr_handle_child_down_event]
0-vol_b2a95e6f787279137369573ba044254b-replicate-0: All subvolumes are down.
Going offline until atleast one of them comes back up." repeated 2 times
between [2019-05-23 14:11:21.516352] and [2019-05-23 14:11:21.518178]
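For reference, the two error signatures above can be pulled out of the per-PV FUSE log with a simple grep. This is an illustrative sketch: the log path and sample lines below are stand-ins for the real per-mount log file named in the mount arguments.

```shell
# Hypothetical sketch: scan a glusterfs FUSE mount log for the two
# error signatures shown above. LOGFILE is an assumed path; the sample
# lines are illustrative stand-ins for the real log contents.
LOGFILE=/tmp/sample-glusterfs.log

cat > "$LOGFILE" <<'EOF'
[2019-05-23 14:11:21.516352] E [MSGID: 108006] 0-vol_b2a95e6f787279137369573ba044254b-replicate-0: All subvolumes are down. Going offline until atleast one of them comes back up.
[2019-05-23 14:11:21.535947] E [fuse-bridge.c:900:fuse_getattr_resume] 0-glusterfs-fuse: 3: GETATTR 1 (00000000-0000-0000-0000-000000000001) resolution failed
EOF

# Print only the lines that indicate this failure mode
grep -E "All subvolumes are down|resolution failed" "$LOGFILE"
```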
A Red Hat employee who has encountered this issue in many customer
environments (including, I believe, Sky Italia) suggested a manual
workaround. It works, and he told me it is related to a known
multiplexing bug that has not yet been fixed.
Here is the workaround I apply each time:
If possible, scale the pod down to 0 replicas.
Identify the volume name of the corrupted PV:
$ oc get pv -o
custom-columns=pvc:.spec.claimRef.name,pv:.metadata.name,vol:.spec.glusterfs.path,size:.spec.capacity.storage,status:.status.phase,namespace:.spec.claimRef.namespace
| grep alert | grep "main-0"
alertmanager-main-db-alertmanager-main-0
pvc-ca80af42-3079-11e9-b3be-fa163e0f08ad vol_f1e1da4a01aa384771e1c659a3bf6cda
2Gi Bound openshift-monitoring
Connect to one of the 3 glusterfs-storage-* pods in the storage project:
$ oc project storage
$ oc rsh glusterfs-storage-929l6
Execute a force start:
sh-4.2# gluster
gluster> volume start vol_f1e1da4a01aa384771e1c659a3bf6cda force
volume start: vol_f1e1da4a01aa384771e1c659a3bf6cda: success
Scale the pod back up, or restart it by deleting it.
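The steps above can be wrapped in a small script. This is only a sketch under assumptions: the workload name, volume name, and gluster pod name are placeholders taken from the examples above, and every command is echoed rather than executed (drop the echo wrapper to run it for real).

```shell
#!/bin/sh
# Sketch of the manual workaround. POD names and the volume are
# hypothetical placeholders; "run" only prints each command so the
# script is safe to inspect. Remove the echo to actually execute.
run() { echo "+ $*"; }

VOL=vol_f1e1da4a01aa384771e1c659a3bf6cda   # volume found via 'oc get pv'
GLUSTER_POD=glusterfs-storage-929l6        # any of the 3 gluster pods

# 1. Scale the workload down (a statefulset is assumed here;
#    adjust for a deployment)
run oc scale statefulset/alertmanager-main --replicas=0

# 2. Force-start the volume from inside a gluster pod
run oc rsh -n storage "$GLUSTER_POD" gluster volume start "$VOL" force

# 3. Scale back up
run oc scale statefulset/alertmanager-main --replicas=1
```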
Any idea how to fix this? Can I provide more information or output?
A sosreport from a storage node is attached.
Thanks
Where did this type of behavior occur? In what environment?
OpenShift Container Platform 3.11.69 with Gluster installed on converged mode.
We have 3 storage nodes with 8 GB RAM and 4 vCPUs each.
When does this type of behavior occur? Frequently? Repeatedly? During
certain periods?
Almost always, on more than 90% of starting pods.
What information can you provide about the time frames and the impact
on the business?
The impact is very high for internal development: self-service operation
is now impossible, because we have to apply the workaround manually from
the backend every time.
Additional info:
[Note]: As mentioned above, the workaround was suggested by a Red Hat
employee who told me it is related to a known multiplexing bug that has
not yet been fixed. However, I have not found that bug in the Bugzilla
database.
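If brick multiplexing is the suspect, its status can be checked from inside one of the gluster pods. This is a sketch only: the commands are echoed rather than executed, and the exact output format varies by GlusterFS version. `cluster.brick-multiplex` is the global option that controls multiplexing; disabling it only affects bricks started afterwards.

```shell
# Print-only wrapper so this sketch is safe to inspect; drop the echo
# to actually execute the commands inside a glusterfs-storage pod.
run() { echo "+ $*"; }

# Check whether brick multiplexing is enabled cluster-wide
run gluster volume get all cluster.brick-multiplex

# To rule multiplexing out, it can be disabled cluster-wide
# (takes effect only for bricks started after the change)
run gluster volume set all cluster.brick-multiplex off
```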
--
You are receiving this mail because:
You are on the CC list for the bug.
You are the assignee for the bug.