[Gluster-users] [Errno 107] Transport endpoint is not connected

Olaf Buitelaar olaf.buitelaar at gmail.com
Mon Jan 27 21:49:08 UTC 2020


Dear Gluster users,

I'm a bit at a loss here, and any help would be appreciated.

I've lost a couple of virtual machines, since their disks suffered from severe
XFS errors, and some won't boot because the size of the image cannot be
resolved, as reported by vdsm:
"VM kube-large-01 is down with error. Exit message: Unable to get volume
size for domain 5f17d41f-d617-48b8-8881-a53460b02829 volume
f16492a6-2d0e-4657-88e3-a9f4d8e48e74."

which is also reported by vdsm-client:  vdsm-client Volume getSize
storagepoolID=59cd53a9-0003-02d7-00eb-0000000001e3
storagedomainID=5f17d41f-d617-48b8-8881-a53460b02829
imageID=2f96fd46-1851-49c8-9f48-78bb50dbdffd
volumeID=f16492a6-2d0e-4657-88e3-a9f4d8e48e74
vdsm-client: Command Volume.getSize with args {'storagepoolID':
'59cd53a9-0003-02d7-00eb-0000000001e3', 'storagedomainID':
'5f17d41f-d617-48b8-8881-a53460b02829', 'volumeID':
'f16492a6-2d0e-4657-88e3-a9f4d8e48e74', 'imageID':
'2f96fd46-1851-49c8-9f48-78bb50dbdffd'} failed:
(code=100, message=[Errno 107] Transport endpoint is not connected)

with the corresponding gluster mount log:
[2020-01-27 19:42:22.678793] W [MSGID: 114031]
[client-rpc-fops_v2.c:2634:client4_0_lookup_cbk] 0-ovirt-data-client-14:
remote operation failed. Path:
/5f17d41f-d617-48b8-8881-a53460b02829/images/2f96fd46-1851-49c8-9f48-78bb50dbdffd/f16492a6-2d0e-4657-88e3-a9f4d8e48e74
(a19abb2f-8e7e-42f0-a3c1-dad1eeb3a851) [Permission denied]
[2020-01-27 19:42:22.678828] W [MSGID: 114031]
[client-rpc-fops_v2.c:2634:client4_0_lookup_cbk] 0-ovirt-data-client-13:
remote operation failed. Path:
/5f17d41f-d617-48b8-8881-a53460b02829/images/2f96fd46-1851-49c8-9f48-78bb50dbdffd/f16492a6-2d0e-4657-88e3-a9f4d8e48e74
(a19abb2f-8e7e-42f0-a3c1-dad1eeb3a851) [Permission denied]
[2020-01-27 19:42:22.679806] W [MSGID: 114031]
[client-rpc-fops_v2.c:2634:client4_0_lookup_cbk] 0-ovirt-data-client-14:
remote operation failed. Path: (null)
(00000000-0000-0000-0000-000000000000) [Permission denied]
[2020-01-27 19:42:22.679862] W [MSGID: 114031]
[client-rpc-fops_v2.c:2634:client4_0_lookup_cbk] 0-ovirt-data-client-13:
remote operation failed. Path: (null)
(00000000-0000-0000-0000-000000000000) [Permission denied]
[2020-01-27 19:42:22.679981] W [MSGID: 108027]
[afr-common.c:2274:afr_attempt_readsubvol_set] 0-ovirt-data-replicate-3: no
read subvols for
/5f17d41f-d617-48b8-8881-a53460b02829/images/2f96fd46-1851-49c8-9f48-78bb50dbdffd/f16492a6-2d0e-4657-88e3-a9f4d8e48e74
[2020-01-27 19:42:22.680606] W [MSGID: 114031]
[client-rpc-fops_v2.c:2634:client4_0_lookup_cbk] 0-ovirt-data-client-14:
remote operation failed. Path:
/5f17d41f-d617-48b8-8881-a53460b02829/images/2f96fd46-1851-49c8-9f48-78bb50dbdffd/f16492a6-2d0e-4657-88e3-a9f4d8e48e74
(00000000-0000-0000-0000-000000000000) [Permission denied]
[2020-01-27 19:42:22.680622] W [MSGID: 114031]
[client-rpc-fops_v2.c:2634:client4_0_lookup_cbk] 0-ovirt-data-client-13:
remote operation failed. Path:
/5f17d41f-d617-48b8-8881-a53460b02829/images/2f96fd46-1851-49c8-9f48-78bb50dbdffd/f16492a6-2d0e-4657-88e3-a9f4d8e48e74
(00000000-0000-0000-0000-000000000000) [Permission denied]
[2020-01-27 19:42:22.681742] W [MSGID: 114031]
[client-rpc-fops_v2.c:2634:client4_0_lookup_cbk] 0-ovirt-data-client-13:
remote operation failed. Path: (null)
(00000000-0000-0000-0000-000000000000) [Permission denied]
[2020-01-27 19:42:22.681871] W [MSGID: 108027]
[afr-common.c:2274:afr_attempt_readsubvol_set] 0-ovirt-data-replicate-3: no
read subvols for
/5f17d41f-d617-48b8-8881-a53460b02829/images/2f96fd46-1851-49c8-9f48-78bb50dbdffd/f16492a6-2d0e-4657-88e3-a9f4d8e48e74
[2020-01-27 19:42:22.682344] W [MSGID: 114031]
[client-rpc-fops_v2.c:2634:client4_0_lookup_cbk] 0-ovirt-data-client-14:
remote operation failed. Path:
/5f17d41f-d617-48b8-8881-a53460b02829/images/2f96fd46-1851-49c8-9f48-78bb50dbdffd/f16492a6-2d0e-4657-88e3-a9f4d8e48e74
(00000000-0000-0000-0000-000000000000) [Permission denied]
The message "W [MSGID: 114031]
[client-rpc-fops_v2.c:2634:client4_0_lookup_cbk] 0-ovirt-data-client-14:
remote operation failed. Path: (null)
(00000000-0000-0000-0000-000000000000) [Permission denied]" repeated 2
times between [2020-01-27 19:42:22.679806] and [2020-01-27 19:42:22.683308]
[2020-01-27 19:42:22.683327] W [MSGID: 114031]
[client-rpc-fops_v2.c:2634:client4_0_lookup_cbk] 0-ovirt-data-client-13:
remote operation failed. Path: (null)
(00000000-0000-0000-0000-000000000000) [Permission denied]
[2020-01-27 19:42:22.683438] W [MSGID: 108027]
[afr-common.c:2274:afr_attempt_readsubvol_set] 0-ovirt-data-replicate-3: no
read subvols for
/5f17d41f-d617-48b8-8881-a53460b02829/images/2f96fd46-1851-49c8-9f48-78bb50dbdffd/f16492a6-2d0e-4657-88e3-a9f4d8e48e74
[2020-01-27 19:42:22.683495] I [dict.c:560:dict_get]
(-->/usr/lib64/glusterfs/6.7/xlator/cluster/replicate.so(+0x6e92b)
[0x7faaaadeb92b]
-->/usr/lib64/glusterfs/6.7/xlator/cluster/distribute.so(+0x45c78)
[0x7faaaab08c78] -->/lib64/libglusterfs.so.0(dict_get+0x94)
[0x7faab36ac254] ) 0-dict: !this || key=trusted.glusterfs.dht.mds [Invalid
argument]
[2020-01-27 19:42:22.683506] W [fuse-bridge.c:942:fuse_entry_cbk]
0-glusterfs-fuse: 176728: LOOKUP()
/5f17d41f-d617-48b8-8881-a53460b02829/images/2f96fd46-1851-49c8-9f48-78bb50dbdffd/f16492a6-2d0e-4657-88e3-a9f4d8e48e74
=> -1 (Transport endpoint is not connected)

In addition to this, vdsm also reported that it couldn't find the image of the
HostedEngine, which then refused to boot:
2020-01-25 10:03:45,345+0000 ERROR (vm/20d69acd) [storage.TaskManager.Task]
(Task='ffdc4242-17ae-4ea1-9535-0e6fcb81944d') Unexpected error (task:875)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 882,
in _run
    return fn(*args, **kargs)
  File "<string>", line 2, in prepareImage
  File "/usr/lib/python2.7/site-packages/vdsm/common/api.py", line 50, in
method
    ret = func(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/hsm.py", line 3203,
in prepareImage
    raise se.VolumeDoesNotExist(leafUUID)
VolumeDoesNotExist: Volume does not exist:
('38e4fba7-f140-4630-afab-0f744ebe3b57',)

2020-01-25 10:03:45,345+0000 ERROR (vm/20d69acd) [virt.vm]
(vmId='20d69acd-edfd-4aeb-a2ae-49e9c121b7e9') The vm start process failed
(vm:933)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 867, in
_startUnderlyingVm
    self._run()
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 2795, in
_run
    self._devices = self._make_devices()
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 2635, in
_make_devices
    disk_objs = self._perform_host_local_adjustment()
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 2708, in
_perform_host_local_adjustment
    self._preparePathsForDrives(disk_params)
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 1036, in
_preparePathsForDrives
    drive, self.id, path=path
  File "/usr/lib/python2.7/site-packages/vdsm/clientIF.py", line 426, in
prepareVolumePath
    raise vm.VolumeError(drive)
VolumeError: Bad volume specification {'protocol': 'gluster', 'address':
{'function': '0x0', 'bus': '0x00', 'domain': '0x0000', 'type': 'pci',
'slot': '0x06'}, 'serial': '9191ca25-536f-42cd-8373-c04ff9cc1a64', 'index':
0, 'iface': 'virtio', 'apparentsize': '62277025792', 'specParams': {},
'cache': 'none', 'imageID': '9191ca25-536f-42cd-8373-c04ff9cc1a64',
'shared': 'exclusive', 'truesize': '50591027712', 'type': 'disk',
'domainID': '313f5d25-76af-4ecd-9a20-82a2fe815a3c', 'reqsize': '0',
'format': 'raw', 'poolID': '00000000-0000-0000-0000-000000000000',
'device': 'disk', 'path':
'ovirt-engine/313f5d25-76af-4ecd-9a20-82a2fe815a3c/images/9191ca25-536f-42cd-8373-c04ff9cc1a64/38e4fba7-f140-4630-afab-0f744ebe3b57',
'propagateErrors': 'off', 'name': 'vda', 'volumeID':
'38e4fba7-f140-4630-afab-0f744ebe3b57', 'diskType': 'network', 'alias':
'ua-9191ca25-536f-42cd-8373-c04ff9cc1a64', 'hosts': [{'name': '10.201.0.9',
'port': '0'}], 'discard': False}

And lastly, there is a storage domain which refuses to activate (from the
vdsm.log):
2020-01-25 10:01:11,750+0000 ERROR (check/loop) [storage.Monitor] Error
checking path /rhev/data-center/mnt/glusterSD/10.201.0.11:_ovirt-mon-2/47edf8ee-83c4-4bd2-b275-20ccd9de4458/dom_md/metadata
(monitor:499)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/monitor.py", line
497, in _pathChecked
    delay = result.delay()
  File "/usr/lib/python2.7/site-packages/vdsm/storage/check.py", line 391,
in delay
    raise exception.MiscFileReadException(self.path, self.rc, self.err)
MiscFileReadException: Internal file read failure:
(u'/rhev/data-center/mnt/glusterSD/10.201.0.11:_ovirt-mon-2/47edf8ee-83c4-4bd2-b275-20ccd9de4458/dom_md/metadata',
1, bytearray(b"/usr/bin/dd: failed to open
\'/rhev/data-center/mnt/glusterSD/10.201.0.11:_ovirt-mon-2/47edf8ee-83c4-4bd2-b275-20ccd9de4458/dom_md/metadata\':
Transport endpoint is not connected\n"))

corresponding gluster mount log:
The message "W [MSGID: 114031]
[client-rpc-fops_v2.c:2634:client4_0_lookup_cbk] 0-ovirt-mon-2-client-0:
remote operation failed. Path: /47edf8ee-83c4-4bd2-b275-20ccd9de4458/dom_md
(00000000-0000-0000-0000-000000000000) [Permission denied]" repeated 4
times between [2020-01-27 19:58:33.063826] and [2020-01-27 19:59:21.690134]
The message "W [MSGID: 114031]
[client-rpc-fops_v2.c:2634:client4_0_lookup_cbk] 0-ovirt-mon-2-client-1:
remote operation failed. Path: /47edf8ee-83c4-4bd2-b275-20ccd9de4458/dom_md
(00000000-0000-0000-0000-000000000000) [Permission denied]" repeated 4
times between [2020-01-27 19:58:33.063734] and [2020-01-27 19:59:21.690150]
The message "W [MSGID: 114031]
[client-rpc-fops_v2.c:2634:client4_0_lookup_cbk] 0-ovirt-mon-2-client-0:
remote operation failed. Path: (null)
(00000000-0000-0000-0000-000000000000) [Permission denied]" repeated 4
times between [2020-01-27 19:58:33.065027] and [2020-01-27 19:59:21.691313]
The message "W [MSGID: 114031]
[client-rpc-fops_v2.c:2634:client4_0_lookup_cbk] 0-ovirt-mon-2-client-1:
remote operation failed. Path: (null)
(00000000-0000-0000-0000-000000000000) [Permission denied]" repeated 4
times between [2020-01-27 19:58:33.065106] and [2020-01-27 19:59:21.691328]
The message "W [MSGID: 108027]
[afr-common.c:2274:afr_attempt_readsubvol_set] 0-ovirt-mon-2-replicate-0:
no read subvols for /47edf8ee-83c4-4bd2-b275-20ccd9de4458/dom_md" repeated
4 times between [2020-01-27 19:58:33.065163] and [2020-01-27
19:59:21.691369]
[2020-01-27 19:59:50.539315] W [MSGID: 114031]
[client-rpc-fops_v2.c:2634:client4_0_lookup_cbk] 0-ovirt-mon-2-client-0:
remote operation failed. Path: /47edf8ee-83c4-4bd2-b275-20ccd9de4458/dom_md
(00000000-0000-0000-0000-000000000000) [Permission denied]
[2020-01-27 19:59:50.539321] W [MSGID: 114031]
[client-rpc-fops_v2.c:2634:client4_0_lookup_cbk] 0-ovirt-mon-2-client-1:
remote operation failed. Path: /47edf8ee-83c4-4bd2-b275-20ccd9de4458/dom_md
(00000000-0000-0000-0000-000000000000) [Permission denied]
[2020-01-27 19:59:50.540412] W [MSGID: 114031]
[client-rpc-fops_v2.c:2634:client4_0_lookup_cbk] 0-ovirt-mon-2-client-1:
remote operation failed. Path: (null)
(00000000-0000-0000-0000-000000000000) [Permission denied]
[2020-01-27 19:59:50.540477] W [MSGID: 114031]
[client-rpc-fops_v2.c:2634:client4_0_lookup_cbk] 0-ovirt-mon-2-client-0:
remote operation failed. Path: (null)
(00000000-0000-0000-0000-000000000000) [Permission denied]
[2020-01-27 19:59:50.540533] W [MSGID: 108027]
[afr-common.c:2274:afr_attempt_readsubvol_set] 0-ovirt-mon-2-replicate-0:
no read subvols for /47edf8ee-83c4-4bd2-b275-20ccd9de4458/dom_md
[2020-01-27 19:59:50.540604] W [fuse-bridge.c:942:fuse_entry_cbk]
0-glusterfs-fuse: 99: LOOKUP() /47edf8ee-83c4-4bd2-b275-20ccd9de4458/dom_md
=> -1 (Transport endpoint is not connected)
[2020-01-27 19:59:51.488775] W [fuse-bridge.c:942:fuse_entry_cbk]
0-glusterfs-fuse: 105: LOOKUP()
/47edf8ee-83c4-4bd2-b275-20ccd9de4458/dom_md => -1 (Transport endpoint is
not connected)
[2020-01-27 19:59:58.713818] W [fuse-bridge.c:942:fuse_entry_cbk]
0-glusterfs-fuse: 112: LOOKUP()
/47edf8ee-83c4-4bd2-b275-20ccd9de4458/dom_md => -1 (Transport endpoint is
not connected)
[2020-01-27 19:59:59.007467] W [fuse-bridge.c:942:fuse_entry_cbk]
0-glusterfs-fuse: 118: LOOKUP()
/47edf8ee-83c4-4bd2-b275-20ccd9de4458/dom_md => -1 (Transport endpoint is
not connected)
[2020-01-27 20:00:00.136599] W [fuse-bridge.c:942:fuse_entry_cbk]
0-glusterfs-fuse: 125: LOOKUP()
/47edf8ee-83c4-4bd2-b275-20ccd9de4458/dom_md => -1 (Transport endpoint is
not connected)
[2020-01-27 20:00:00.781763] W [fuse-bridge.c:942:fuse_entry_cbk]
0-glusterfs-fuse: 131: LOOKUP()
/47edf8ee-83c4-4bd2-b275-20ccd9de4458/dom_md => -1 (Transport endpoint is
not connected)
[2020-01-27 20:00:00.878852] W [fuse-bridge.c:942:fuse_entry_cbk]
0-glusterfs-fuse: 137: LOOKUP()
/47edf8ee-83c4-4bd2-b275-20ccd9de4458/dom_md => -1 (Transport endpoint is
not connected)
[2020-01-27 20:00:01.580272] W [fuse-bridge.c:942:fuse_entry_cbk]
0-glusterfs-fuse: 144: LOOKUP()
/47edf8ee-83c4-4bd2-b275-20ccd9de4458/dom_md => -1 (Transport endpoint is
not connected)
[2020-01-27 20:00:01.686464] W [fuse-bridge.c:942:fuse_entry_cbk]
0-glusterfs-fuse: 150: LOOKUP()
/47edf8ee-83c4-4bd2-b275-20ccd9de4458/dom_md => -1 (Transport endpoint is
not connected)
[2020-01-27 20:00:01.757087] W [fuse-bridge.c:942:fuse_entry_cbk]
0-glusterfs-fuse: 156: LOOKUP()
/47edf8ee-83c4-4bd2-b275-20ccd9de4458/dom_md => -1 (Transport endpoint is
not connected)
[2020-01-27 20:00:03.061635] W [fuse-bridge.c:942:fuse_entry_cbk]
0-glusterfs-fuse: 163: LOOKUP()
/47edf8ee-83c4-4bd2-b275-20ccd9de4458/dom_md => -1 (Transport endpoint is
not connected)
[2020-01-27 20:00:03.161894] W [fuse-bridge.c:942:fuse_entry_cbk]
0-glusterfs-fuse: 169: LOOKUP()
/47edf8ee-83c4-4bd2-b275-20ccd9de4458/dom_md => -1 (Transport endpoint is
not connected)
[2020-01-27 20:00:04.801107] W [fuse-bridge.c:942:fuse_entry_cbk]
0-glusterfs-fuse: 176: LOOKUP()
/47edf8ee-83c4-4bd2-b275-20ccd9de4458/dom_md => -1 (Transport endpoint is
not connected)
[2020-01-27 20:00:07.251125] W [fuse-bridge.c:942:fuse_entry_cbk]
0-glusterfs-fuse: 183: LOOKUP()
/47edf8ee-83c4-4bd2-b275-20ccd9de4458/dom_md => -1 (Transport endpoint is
not connected)

and some apps connecting directly to gluster mounts report these errors:
2020-01-27  1:10:48 0 [ERROR] mysqld: File '/binlog/binlog.~rec~' not found
(Errcode: 107 "Transport endpoint is not connected")
2020-01-27  3:28:01 0 [ERROR] mysqld: File '/binlog/binlog.000113' not
found (Errcode: 107 "Transport endpoint is not connected")

So the errors seem to hint at either a connection issue or a quorum loss of
some sort. However, gluster is running on its own private and separate
network, with no firewall rules or anything else which could obstruct the
connection.
In addition, gluster volume status reports all bricks and nodes are up, and
gluster volume heal reports no pending heals.
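(For reference, these checks were along the lines of the following, using ovirt-data as an example; the split-brain listing is just an extra sanity check:)
gluster volume status ovirt-data
gluster volume heal ovirt-data info
gluster volume heal ovirt-data info split-brain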
What makes this issue even more interesting is that when I manually check the
files, everything seems fine:

For the first issue, where the machine won't start because vdsm cannot
determine the size, qemu-img is able to report it just fine:
qemu-img info /rhev/data-center/mnt/glusterSD/10.201.0.7:_ovirt-data/5f17d41f-d617-48b8-8881-a53460b02829/images/2f96fd46-1851-49c8-9f48-78bb50dbdffd/f16492a6-2d0e-4657-88e3-a9f4d8e48e74
image: /rhev/data-center/mnt/glusterSD/10.201.0.7:_ovirt-data/5f17d41f-d617-48b8-8881-a53460b02829/images/2f96fd46-1851-49c8-9f48-78bb50dbdffd/f16492a6-2d0e-4657-88e3-a9f4d8e48e74
file format: raw
virtual size: 34T (37580963840000 bytes)
disk size: 7.1T
In addition, I'm able to mount the image using a loop device:
# attach the image to a loop device
losetup /dev/loop0 /rhev/data-center/mnt/glusterSD/10.201.0.7:_ovirt-data/5f17d41f-d617-48b8-8881-a53460b02829/images/2f96fd46-1851-49c8-9f48-78bb50dbdffd/f16492a6-2d0e-4657-88e3-a9f4d8e48e74
# map its partitions
kpartx -av /dev/loop0
# scan for and activate the LVM volume group inside the image
vgscan
vgchange -ay
# mount the logical volume
mount /dev/mapper/cl--data5-data5 /data5/
After this I'm able to see all contents of the disk, and in fact write to it.
So the earlier reported connection error doesn't seem to apply here?
This is actually how I'm currently running the VM: I detached the disk and
mounted it in the VM via the loop device. The disk is a data disk for a
heavily loaded mysql instance, and mysql is reporting no errors and has been
running for about a day now.
Of course this is not the way it should run, but it is at least working; only
the performance seems a bit off. So I would like to solve the issue and be
able to attach the image as a disk again.
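(For completeness, the rough teardown before re-attaching the image as a disk would be something like the following; the VG name cl-data5 is only my reading of the /dev/mapper/cl--data5-data5 name above, so adjust if needed:)
umount /data5
vgchange -an cl-data5     # deactivate the VG that lives inside the image
kpartx -dv /dev/loop0     # remove the partition mappings again
losetup -d /dev/loop0     # detach the loop device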

For the second issue, where the image of the HostedEngine couldn't be found,
everything also seems correct.
The file is there and has the correct permissions:
ls -la /rhev/data-center/mnt/glusterSD/10.201.0.9:_ovirt-engine/313f5d25-76af-4ecd-9a20-82a2fe815a3c/images/9191ca25-536f-42cd-8373-c04ff9cc1a64/
total 49406333
drwxr-xr-x.  2 vdsm kvm        4096 Jan 25 12:03 .
drwxr-xr-x. 13 vdsm kvm        4096 Jan 25 14:16 ..
-rw-rw----.  1 vdsm kvm 62277025792 Jan 23 03:04 38e4fba7-f140-4630-afab-0f744ebe3b57
-rw-rw----.  1 vdsm kvm     1048576 Jan 25 21:48 38e4fba7-f140-4630-afab-0f744ebe3b57.lease
-rw-r--r--.  1 vdsm kvm         285 Jan 27  2018 38e4fba7-f140-4630-afab-0f744ebe3b57.meta
And I'm able to mount the image using a loop device and access its contents.
Unfortunately the VM wouldn't boot due to XFS errors. After tinkering with
this for about a day to make it boot, I gave up and restored from a recent
backup. But I took the PostgreSQL data directory from the mounted old image
over to the new VM, and PostgreSQL was perfectly fine with it, also indicating
the image wasn't completely toast.

And the last issue, where the storage domain wouldn't activate: the file the
log claims it cannot read is perfectly readable and writable;
cat /rhev/data-center/mnt/glusterSD/10.201.0.11:_ovirt-mon-2/47edf8ee-83c4-4bd2-b275-20ccd9de4458/dom_md/metadata
CLASS=Data
DESCRIPTION=ovirt-mon-2
IOOPTIMEOUTSEC=10
LEASERETRIES=3
LEASETIMESEC=60
LOCKPOLICY=
LOCKRENEWALINTERVALSEC=5
POOL_UUID=59cd53a9-0003-02d7-00eb-0000000001e3
REMOTE_PATH=10.201.0.11:/ovirt-mon-2
ROLE=Regular
SDUUID=47edf8ee-83c4-4bd2-b275-20ccd9de4458
TYPE=GLUSTERFS
VERSION=4
_SHA_CKSUM=d49b4a74e70a22a1b816519e8ed4167994672807
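(Note: cat doesn't open the file with O_DIRECT, whereas the dd run by the domain monitor, as seen in the vdsm error above, does use direct I/O if I'm not mistaken; so reproducing that exact check manually would be something like the following, with block size and count just as an example:)
dd if=/rhev/data-center/mnt/glusterSD/10.201.0.11:_ovirt-mon-2/47edf8ee-83c4-4bd2-b275-20ccd9de4458/dom_md/metadata of=/dev/null bs=4096 count=1 iflag=direct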

So I've no clue where these "Transport endpoint is not connected" errors are
coming from, nor how to resolve them.
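(Given the [Permission denied] entries in the client logs, checking the ownership and gluster xattrs of the affected file directly on a brick might help narrow it down; e.g. on 10.201.0.11, with the brick path taken from the volume info below:)
stat /data0/gfs/bricks/brick1/ovirt-mon-2/47edf8ee-83c4-4bd2-b275-20ccd9de4458/dom_md/metadata
getfattr -d -m . -e hex /data0/gfs/bricks/brick1/ovirt-mon-2/47edf8ee-83c4-4bd2-b275-20ccd9de4458/dom_md/metadata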

I think there are 4 possible causes for this issue:
1) I was trying to optimize the throughput of gluster on some volumes, since
we recently gained some additional write load which we had difficulty keeping
up with. So I tried to incrementally increase server.event-threads via:
gluster v set ovirt-data server.event-threads X
Since this didn't seem to improve performance, I changed it back to its
original value. But when I did that, the VMs running on these volumes all
locked up and required a reboot, which was by then still possible. Please
note that for the volumes ovirt-engine and ovirt-mon-2 this setting wasn't
changed.

2) I had a mix of gluster 6.6 and 6.7 running, since I was in the middle of
upgrading everything to 6.7.

3) On one of the physical brick nodes, XFS errors were reported after a
reboot and were resolved by xfs_repair, which removed some inodes in the
process. I wasn't too worried about this, since I expected the gluster
self-heal daemon to resolve them. That seemed true for all volumes except
one, where a single gfid was pending heal for about 2 days; in this case it
was exactly the image for which vdsm reports it cannot resolve the size.
(There are other VM images with the same issue, which I left out for
brevity.) However, the pending heal of that single gfid resolved once I
mounted the image via the loop device and started writing to it, which is
probably due to the way gluster determines what needs healing, despite a
"gluster volume heal X full" having been issued before (see the commands
after this list).
I could also confirm that the pending gfid was in fact missing in the
underlying brick directory on that brick node while the heal was still
pending.

4) I did some brick replaces (only on the ovirt-data volume), but only of
arbiter bricks of the volume affected in the first issue.
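Regarding 3): roughly how the pending gfid can be checked on a brick and a full heal triggered, with the volume and brick paths as in the volume info below; <gfid> is the one reported by heal info, and XX/YY are its first and second pair of hex characters:
gluster volume heal ovirt-data info
ls -l /data5/gfs/bricks/brick1/ovirt-data/.glusterfs/XX/YY/<gfid>
gluster volume heal ovirt-data full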

The volume info of the affected volumes looks like this:

Volume Name: ovirt-data
Type: Distributed-Replicate
Volume ID: 2775dc10-c197-446e-a73f-275853d38666
Status: Started
Snapshot Count: 0
Number of Bricks: 4 x (2 + 1) = 12
Transport-type: tcp
Bricks:
Brick1: 10.201.0.5:/data5/gfs/bricks/brick1/ovirt-data
Brick2: 10.201.0.1:/data5/gfs/bricks/brick1/ovirt-data
Brick3: 10.201.0.9:/data0/gfs/bricks/bricka/ovirt-data (arbiter)
Brick4: 10.201.0.7:/data5/gfs/bricks/brick1/ovirt-data
Brick5: 10.201.0.9:/data5/gfs/bricks/brick1/ovirt-data
Brick6: 10.201.0.11:/data0/gfs/bricks/bricka/ovirt-data (arbiter)
Brick7: 10.201.0.6:/data5/gfs/bricks/brick1/ovirt-data
Brick8: 10.201.0.8:/data5/gfs/bricks/brick1/ovirt-data
Brick9: 10.201.0.12:/data0/gfs/bricks/bricka/ovirt-data (arbiter)
Brick10: 10.201.0.12:/data5/gfs/bricks/brick1/ovirt-data
Brick11: 10.201.0.11:/data5/gfs/bricks/brick1/ovirt-data
Brick12: 10.201.0.10:/data0/gfs/bricks/bricka/ovirt-data (arbiter)
Options Reconfigured:
cluster.choose-local: off
server.outstanding-rpc-limit: 1024
storage.owner-gid: 36
storage.owner-uid: 36
transport.address-family: inet
performance.readdir-ahead: on
nfs.disable: on
performance.quick-read: off
performance.read-ahead: off
performance.io-cache: off
performance.stat-prefetch: off
performance.low-prio-threads: 32
network.remote-dio: off
cluster.eager-lock: enable
cluster.quorum-type: auto
cluster.server-quorum-type: server
cluster.data-self-heal-algorithm: full
cluster.locking-scheme: granular
cluster.shd-max-threads: 8
cluster.shd-wait-qlength: 10000
features.shard: on
user.cifs: off
performance.write-behind-window-size: 512MB
performance.cache-size: 384MB
server.event-threads: 5
performance.strict-o-direct: on
cluster.brick-multiplex: on

Volume Name: ovirt-engine
Type: Distributed-Replicate
Volume ID: 9cc4dade-ef2e-4112-bcbf-e0fbc5df4ebc
Status: Started
Snapshot Count: 0
Number of Bricks: 3 x 3 = 9
Transport-type: tcp
Bricks:
Brick1: 10.201.0.5:/data5/gfs/bricks/brick1/ovirt-engine
Brick2: 10.201.0.1:/data5/gfs/bricks/brick1/ovirt-engine
Brick3: 10.201.0.2:/data5/gfs/bricks/brick1/ovirt-engine
Brick4: 10.201.0.8:/data5/gfs/bricks/brick1/ovirt-engine
Brick5: 10.201.0.9:/data5/gfs/bricks/brick1/ovirt-engine
Brick6: 10.201.0.3:/data5/gfs/bricks/brick1/ovirt-engine
Brick7: 10.201.0.12:/data5/gfs/bricks/brick1/ovirt-engine
Brick8: 10.201.0.11:/data5/gfs/bricks/brick1/ovirt-engine
Brick9: 10.201.0.7:/data5/gfs/bricks/brick1/ovirt-engine
Options Reconfigured:
performance.strict-o-direct: on
performance.write-behind-window-size: 512MB
features.shard-block-size: 64MB
performance.cache-size: 128MB
nfs.disable: on
transport.address-family: inet
performance.quick-read: off
performance.read-ahead: off
performance.io-cache: off
performance.low-prio-threads: 32
network.remote-dio: enable
cluster.eager-lock: enable
cluster.quorum-type: auto
cluster.server-quorum-type: server
cluster.data-self-heal-algorithm: full
cluster.locking-scheme: granular
cluster.shd-max-threads: 8
cluster.shd-wait-qlength: 10000
features.shard: on
user.cifs: off
storage.owner-uid: 36
storage.owner-gid: 36
cluster.brick-multiplex: on

Volume Name: ovirt-mon-2
Type: Replicate
Volume ID: 111ff79a-565a-4d31-9f31-4c839749bafd
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x (2 + 1) = 3
Transport-type: tcp
Bricks:
Brick1: 10.201.0.10:/data0/gfs/bricks/brick1/ovirt-mon-2
Brick2: 10.201.0.11:/data0/gfs/bricks/brick1/ovirt-mon-2
Brick3: 10.201.0.12:/data0/gfs/bricks/bricka/ovirt-mon-2 (arbiter)
Options Reconfigured:
performance.client-io-threads: on
nfs.disable: on
transport.address-family: inet
performance.quick-read: off
performance.read-ahead: off
performance.io-cache: off
performance.low-prio-threads: 32
network.remote-dio: off
cluster.eager-lock: enable
cluster.quorum-type: auto
cluster.server-quorum-type: server
cluster.data-self-heal-algorithm: full
cluster.locking-scheme: granular
cluster.shd-max-threads: 8
cluster.shd-wait-qlength: 10000
features.shard: on
user.cifs: off
cluster.choose-local: off
client.event-threads: 4
server.event-threads: 4
storage.owner-uid: 36
storage.owner-gid: 36
performance.strict-o-direct: on
performance.cache-size: 64MB
performance.write-behind-window-size: 128MB
features.shard-block-size: 64MB
cluster.brick-multiplex: on

Thanks Olaf