[Gluster-users] [Errno 107] Transport endpoint is not connected

Thu Jan 30 15:44:37 UTC 2020

On January 29, 2020 1:43:03 PM GMT+02:00, Olaf Buitelaar <olaf.buitelaar at gmail.com> wrote:
>Hi Strahil,
>
>Thank you for your reply. I found the issue: the "not connected" errors
>seem to originate from the ACL layer. Somehow it received a permission
>denied, and this was translated into a "not connected" error.
>While the file permissions were listed as owner=vdsm and group=kvm,
>somehow the ACL layer saw this differently. I ran "chown -R vdsm.kvm
>/rhev/data-center/mnt/glusterSD/10.201.0.11\:_ovirt-mon-2/" on the
>mount, and suddenly things started working again.
>
>I indeed have (or now had, since for the restore procedure I needed to
>provide an empty domain) 1 other VM on the HostedEngine domain. This
>other VM ran other critical services like VPN. Since I see the
>HostedEngine domain as one of the most reliable domains, I used it for
>critical services.
>All other VMs have their own domains.
>
>I'm a bit surprised by your comment about brick multiplexing; I
>understood this should actually improve performance by sharing
>resources? Would you have some extra information about this?
>
>To answer your questions;
>
>We currently have 15 physical hosts.
>
>1) There are no pending heals.
>2) Yes, I'm able to connect to the ports.
>3) All peers report as connected.
>4) Actually I had a setup like this before: I had multiple smaller qcow
>disks in a raid0 with LVM. But this appeared not to be reliable, so I
>switched to 1 single large disk. Would you know if there is some
>documentation about this?
>5) I'm running about the latest and greatest stable: 4.3.7.2-1.el7.
>I only had trouble with the restore, because the cluster was still in
>compatibility mode 4.2 and there were 2 older VMs which had snapshots
>from prior versions, while the leaf was in compatibility level 4.2.
>Note: the backup was taken on the engine running 4.3.
>
>Thanks Olaf
>
>
>
>Op di 28 jan. 2020 om 17:31 schreef Strahil Nikolov
><hunter86_bg at yahoo.com>:
>
>> On January 27, 2020 11:49:08 PM GMT+02:00, Olaf Buitelaar <
>> olaf.buitelaar at gmail.com> wrote:
>> >Dear Gluster users,
>> >
>> >I'm a bit at a loss here, and any help would be appreciated.
>> >
>> >I've lost a couple of virtual machines, since the disks suffered from
>> >severe XFS errors, and some won't boot because they can't resolve the
>> >size of the image as reported by vdsm:
>> >"VM kube-large-01 is down with error. Exit message: Unable to get
>> >volume
>> >size for domain 5f17d41f-d617-48b8-8881-a53460b02829 volume
>> >f16492a6-2d0e-4657-88e3-a9f4d8e48e74."
>> >
>> >which is also reported by the vdsm-client;
>> >vdsm-client Volume getSize
>> >storagepoolID=59cd53a9-0003-02d7-00eb-0000000001e3
>> >storagedomainID=5f17d41f-d617-48b8-8881-a53460b02829
>> >imageID=2f96fd46-1851-49c8-9f48-78bb50dbdffd
>> >volumeID=f16492a6-2d0e-4657-88e3-a9f4d8e48e74
>> >vdsm-client: Command Volume.getSize with args {'storagepoolID':
>> >'59cd53a9-0003-02d7-00eb-0000000001e3', 'storagedomainID':
>> >'5f17d41f-d617-48b8-8881-a53460b02829', 'volumeID':
>> >'f16492a6-2d0e-4657-88e3-a9f4d8e48e74', 'imageID':
>> >'2f96fd46-1851-49c8-9f48-78bb50dbdffd'} failed:
>> >(code=100, message=[Errno 107] Transport endpoint is not connected)
>> >
>> >with corresponding gluster mount log;
>> >[2020-01-27 19:42:22.678793] W [MSGID: 114031]
>> >[client-rpc-fops_v2.c:2634:client4_0_lookup_cbk]
>> >0-ovirt-data-client-14:
>> >remote operation failed. Path:
>>
>>
>>/5f17d41f-d617-48b8-8881-a53460b02829/images/2f96fd46-1851-49c8-9f48-78bb50dbdffd/f16492a6-2d0e-4657-88e3-a9f4d8e48e74
>> >(a19abb2f-8e7e-42f0-a3c1-dad1eeb3a851) [Permission denied]
>> >[2020-01-27 19:42:22.678828] W [MSGID: 114031]
>> >[client-rpc-fops_v2.c:2634:client4_0_lookup_cbk]
>> >0-ovirt-data-client-13:
>> >remote operation failed. Path:
>>
>>
>>/5f17d41f-d617-48b8-8881-a53460b02829/images/2f96fd46-1851-49c8-9f48-78bb50dbdffd/f16492a6-2d0e-4657-88e3-a9f4d8e48e74
>> >(a19abb2f-8e7e-42f0-a3c1-dad1eeb3a851) [Permission denied]
>> >[2020-01-27 19:42:22.679806] W [MSGID: 114031]
>> >[client-rpc-fops_v2.c:2634:client4_0_lookup_cbk]
>> >0-ovirt-data-client-14:
>> >remote operation failed. Path: (null)
>> >(00000000-0000-0000-0000-000000000000) [Permission denied]
>> >[2020-01-27 19:42:22.679862] W [MSGID: 114031]
>> >[client-rpc-fops_v2.c:2634:client4_0_lookup_cbk]
>> >0-ovirt-data-client-13:
>> >remote operation failed. Path: (null)
>> >(00000000-0000-0000-0000-000000000000) [Permission denied]
>> >[2020-01-27 19:42:22.679981] W [MSGID: 108027]
>> >[afr-common.c:2274:afr_attempt_readsubvol_set]
>> >0-ovirt-data-replicate-3: no
>> >read subvols for
>>
>>
>>/5f17d41f-d617-48b8-8881-a53460b02829/images/2f96fd46-1851-49c8-9f48-78bb50dbdffd/f16492a6-2d0e-4657-88e3-a9f4d8e48e74
>> >[2020-01-27 19:42:22.680606] W [MSGID: 114031]
>> >[client-rpc-fops_v2.c:2634:client4_0_lookup_cbk]
>> >0-ovirt-data-client-14:
>> >remote operation failed. Path:
>>
>>
>>/5f17d41f-d617-48b8-8881-a53460b02829/images/2f96fd46-1851-49c8-9f48-78bb50dbdffd/f16492a6-2d0e-4657-88e3-a9f4d8e48e74
>> >(00000000-0000-0000-0000-000000000000) [Permission denied]
>> >[2020-01-27 19:42:22.680622] W [MSGID: 114031]
>> >[client-rpc-fops_v2.c:2634:client4_0_lookup_cbk]
>> >0-ovirt-data-client-13:
>> >remote operation failed. Path:
>>
>>
>>/5f17d41f-d617-48b8-8881-a53460b02829/images/2f96fd46-1851-49c8-9f48-78bb50dbdffd/f16492a6-2d0e-4657-88e3-a9f4d8e48e74
>> >(00000000-0000-0000-0000-000000000000) [Permission denied]
>> >[2020-01-27 19:42:22.681742] W [MSGID: 114031]
>> >[client-rpc-fops_v2.c:2634:client4_0_lookup_cbk]
>> >0-ovirt-data-client-13:
>> >remote operation failed. Path: (null)
>> >(00000000-0000-0000-0000-000000000000) [Permission denied]
>> >[2020-01-27 19:42:22.681871] W [MSGID: 108027]
>> >[afr-common.c:2274:afr_attempt_readsubvol_set]
>> >0-ovirt-data-replicate-3: no
>> >read subvols for
>>
>>
>>/5f17d41f-d617-48b8-8881-a53460b02829/images/2f96fd46-1851-49c8-9f48-78bb50dbdffd/f16492a6-2d0e-4657-88e3-a9f4d8e48e74
>> >[2020-01-27 19:42:22.682344] W [MSGID: 114031]
>> >[client-rpc-fops_v2.c:2634:client4_0_lookup_cbk]
>> >0-ovirt-data-client-14:
>> >remote operation failed. Path:
>>
>>
>>/5f17d41f-d617-48b8-8881-a53460b02829/images/2f96fd46-1851-49c8-9f48-78bb50dbdffd/f16492a6-2d0e-4657-88e3-a9f4d8e48e74
>> >(00000000-0000-0000-0000-000000000000) [Permission denied]
>> >The message "W [MSGID: 114031]
>> >[client-rpc-fops_v2.c:2634:client4_0_lookup_cbk]
>> >0-ovirt-data-client-14:
>> >remote operation failed. Path: (null)
>> >(00000000-0000-0000-0000-000000000000) [Permission denied]" repeated
>2
>> >times between [2020-01-27 19:42:22.679806] and [2020-01-27
>> >19:42:22.683308]
>> >[2020-01-27 19:42:22.683327] W [MSGID: 114031]
>> >[client-rpc-fops_v2.c:2634:client4_0_lookup_cbk]
>> >0-ovirt-data-client-13:
>> >remote operation failed. Path: (null)
>> >(00000000-0000-0000-0000-000000000000) [Permission denied]
>> >[2020-01-27 19:42:22.683438] W [MSGID: 108027]
>> >[afr-common.c:2274:afr_attempt_readsubvol_set]
>> >0-ovirt-data-replicate-3: no
>> >read subvols for
>>
>>
>>/5f17d41f-d617-48b8-8881-a53460b02829/images/2f96fd46-1851-49c8-9f48-78bb50dbdffd/f16492a6-2d0e-4657-88e3-a9f4d8e48e74
>> >[2020-01-27 19:42:22.683495] I [dict.c:560:dict_get]
>> >(-->/usr/lib64/glusterfs/6.7/xlator/cluster/replicate.so(+0x6e92b)
>> >[0x7faaaadeb92b]
>> >-->/usr/lib64/glusterfs/6.7/xlator/cluster/distribute.so(+0x45c78)
>> >[0x7faaaab08c78] -->/lib64/libglusterfs.so.0(dict_get+0x94)
>> >[0x7faab36ac254] ) 0-dict: !this || key=trusted.glusterfs.dht.mds
>> >[Invalid
>> >argument]
>> >[2020-01-27 19:42:22.683506] W [fuse-bridge.c:942:fuse_entry_cbk]
>> >0-glusterfs-fuse: 176728: LOOKUP()
>>
>>
>>/5f17d41f-d617-48b8-8881-a53460b02829/images/2f96fd46-1851-49c8-9f48-78bb50dbdffd/f16492a6-2d0e-4657-88e3-a9f4d8e48e74
>> >=> -1 (Transport endpoint is not connected)
>> >
>> >In addition to this, vdsm also reported it couldn't find the image of
>> >the HostedEngine, and refused to boot;
>> >2020-01-25 10:03:45,345+0000 ERROR (vm/20d69acd)
>> >[storage.TaskManager.Task]
>> >(Task='ffdc4242-17ae-4ea1-9535-0e6fcb81944d') Unexpected error
>> >(task:875)
>> >Traceback (most recent call last):
>> >File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line
>882,
>> >in _run
>> >    return fn(*args, **kargs)
>> >  File "<string>", line 2, in prepareImage
>> >File "/usr/lib/python2.7/site-packages/vdsm/common/api.py", line 50,
>in
>> >method
>> >    ret = func(*args, **kwargs)
>> >File "/usr/lib/python2.7/site-packages/vdsm/storage/hsm.py", line
>3203,
>> >in prepareImage
>> >    raise se.VolumeDoesNotExist(leafUUID)
>> >VolumeDoesNotExist: Volume does not exist:
>> >('38e4fba7-f140-4630-afab-0f744ebe3b57',)
>> >
>> >2020-01-25 10:03:45,345+0000 ERROR (vm/20d69acd) [virt.vm]
>> >(vmId='20d69acd-edfd-4aeb-a2ae-49e9c121b7e9') The vm start process
>> >failed
>> >(vm:933)
>> >Traceback (most recent call last):
>> >  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 867,
>in
>> >_startUnderlyingVm
>> >    self._run()
>> > File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 2795,
>in
>> >_run
>> >    self._devices = self._make_devices()
>> > File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 2635,
>in
>> >_make_devices
>> >    disk_objs = self._perform_host_local_adjustment()
>> > File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 2708,
>in
>> >_perform_host_local_adjustment
>> >    self._preparePathsForDrives(disk_params)
>> > File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 1036,
>in
>> >_preparePathsForDrives
>> >    drive, self.id, path=path
>> > File "/usr/lib/python2.7/site-packages/vdsm/clientIF.py", line 426,
>in
>> >prepareVolumePath
>> >    raise vm.VolumeError(drive)
>> >VolumeError: Bad volume specification {'protocol': 'gluster',
>> >'address':
>> >{'function': '0x0', 'bus': '0x00', 'domain': '0x0000', 'type':
>'pci',
>> >'slot': '0x06'}, 'serial': '9191ca25-536f-42cd-8373-c04ff9cc1a64',
>> >'index':
>> >0, 'iface': 'virtio', 'apparentsize': '62277025792', 'specParams':
>{},
>> >'cache': 'none', 'imageID': '9191ca25-536f-42cd-8373-c04ff9cc1a64',
>> >'shared': 'exclusive', 'truesize': '50591027712', 'type': 'disk',
>> >'domainID': '313f5d25-76af-4ecd-9a20-82a2fe815a3c', 'reqsize': '0',
>> >'format': 'raw', 'poolID': '00000000-0000-0000-0000-000000000000',
>> >'device': 'disk', 'path':
>>
>>
>>'ovirt-engine/313f5d25-76af-4ecd-9a20-82a2fe815a3c/images/9191ca25-536f-42cd-8373-c04ff9cc1a64/38e4fba7-f140-4630-afab-0f744ebe3b57',
>> >'propagateErrors': 'off', 'name': 'vda', 'volumeID':
>> >'38e4fba7-f140-4630-afab-0f744ebe3b57', 'diskType': 'network',
>'alias':
>> >'ua-9191ca25-536f-42cd-8373-c04ff9cc1a64', 'hosts': [{'name':
>> >'10.201.0.9',
>> >'port': '0'}], 'discard': False}
>> >
>> >And last, there is a storage domain which refuses to activate (from
>> >the vdsm.log);
>> >2020-01-25 10:01:11,750+0000 ERROR (check/loop) [storage.Monitor]
>Error
>> >checking path
>> >/rhev/data-center/mnt/glusterSD/10.201.0.11:
>> _ovirt-mon-2/47edf8ee-83c4-4bd2-b275-20ccd9de4458/dom_md/metadata
>> >(monitor:499)
>> >Traceback (most recent call last):
>> >  File "/usr/lib/python2.7/site-packages/vdsm/storage/monitor.py",
>line
>> >497, in _pathChecked
>> >    delay = result.delay()
>> >File "/usr/lib/python2.7/site-packages/vdsm/storage/check.py", line
>> >391,
>> >in delay
>> >    raise exception.MiscFileReadException(self.path, self.rc,
>self.err)
>> >MiscFileReadException: Internal file read failure:
>> >(u'/rhev/data-center/mnt/glusterSD/10.201.0.11:
>> _ovirt-mon-2/47edf8ee-83c4-4bd2-b275-20ccd9de4458/dom_md/metadata',
>> >1, bytearray(b"/usr/bin/dd: failed to open
>> >\'/rhev/data-center/mnt/glusterSD/10.201.0.11:
>> _ovirt-mon-2/47edf8ee-83c4-4bd2-b275-20ccd9de4458/dom_md/metadata\':
>> >Transport endpoint is not connected\n"))
>> >
>> >corresponding gluster mount log;
>> >The message "W [MSGID: 114031]
>> >[client-rpc-fops_v2.c:2634:client4_0_lookup_cbk]
>> >0-ovirt-mon-2-client-0:
>> >remote operation failed. Path:
>> >/47edf8ee-83c4-4bd2-b275-20ccd9de4458/dom_md
>> >(00000000-0000-0000-0000-000000000000) [Permission denied]" repeated
>4
>> >times between [2020-01-27 19:58:33.063826] and [2020-01-27
>> >19:59:21.690134]
>> >The message "W [MSGID: 114031]
>> >[client-rpc-fops_v2.c:2634:client4_0_lookup_cbk]
>> >0-ovirt-mon-2-client-1:
>> >remote operation failed. Path:
>> >/47edf8ee-83c4-4bd2-b275-20ccd9de4458/dom_md
>> >(00000000-0000-0000-0000-000000000000) [Permission denied]" repeated
>4
>> >times between [2020-01-27 19:58:33.063734] and [2020-01-27
>> >19:59:21.690150]
>> >The message "W [MSGID: 114031]
>> >[client-rpc-fops_v2.c:2634:client4_0_lookup_cbk]
>> >0-ovirt-mon-2-client-0:
>> >remote operation failed. Path: (null)
>> >(00000000-0000-0000-0000-000000000000) [Permission denied]" repeated
>4
>> >times between [2020-01-27 19:58:33.065027] and [2020-01-27
>> >19:59:21.691313]
>> >The message "W [MSGID: 114031]
>> >[client-rpc-fops_v2.c:2634:client4_0_lookup_cbk]
>> >0-ovirt-mon-2-client-1:
>> >remote operation failed. Path: (null)
>> >(00000000-0000-0000-0000-000000000000) [Permission denied]" repeated
>4
>> >times between [2020-01-27 19:58:33.065106] and [2020-01-27
>> >19:59:21.691328]
>> >The message "W [MSGID: 108027]
>> >[afr-common.c:2274:afr_attempt_readsubvol_set]
>> >0-ovirt-mon-2-replicate-0:
>> >no read subvols for /47edf8ee-83c4-4bd2-b275-20ccd9de4458/dom_md"
>> >repeated
>> >4 times between [2020-01-27 19:58:33.065163] and [2020-01-27
>> >19:59:21.691369]
>> >[2020-01-27 19:59:50.539315] W [MSGID: 114031]
>> >[client-rpc-fops_v2.c:2634:client4_0_lookup_cbk]
>> >0-ovirt-mon-2-client-0:
>> >remote operation failed. Path:
>> >/47edf8ee-83c4-4bd2-b275-20ccd9de4458/dom_md
>> >(00000000-0000-0000-0000-000000000000) [Permission denied]
>> >[2020-01-27 19:59:50.539321] W [MSGID: 114031]
>> >[client-rpc-fops_v2.c:2634:client4_0_lookup_cbk]
>> >0-ovirt-mon-2-client-1:
>> >remote operation failed. Path:
>> >/47edf8ee-83c4-4bd2-b275-20ccd9de4458/dom_md
>> >(00000000-0000-0000-0000-000000000000) [Permission denied]
>> >[2020-01-27 19:59:50.540412] W [MSGID: 114031]
>> >[client-rpc-fops_v2.c:2634:client4_0_lookup_cbk]
>> >0-ovirt-mon-2-client-1:
>> >remote operation failed. Path: (null)
>> >(00000000-0000-0000-0000-000000000000) [Permission denied]
>> >[2020-01-27 19:59:50.540477] W [MSGID: 114031]
>> >[client-rpc-fops_v2.c:2634:client4_0_lookup_cbk]
>> >0-ovirt-mon-2-client-0:
>> >remote operation failed. Path: (null)
>> >(00000000-0000-0000-0000-000000000000) [Permission denied]
>> >[2020-01-27 19:59:50.540533] W [MSGID: 108027]
>> >[afr-common.c:2274:afr_attempt_readsubvol_set]
>> >0-ovirt-mon-2-replicate-0:
>> >no read subvols for /47edf8ee-83c4-4bd2-b275-20ccd9de4458/dom_md
>> >[2020-01-27 19:59:50.540604] W [fuse-bridge.c:942:fuse_entry_cbk]
>> >0-glusterfs-fuse: 99: LOOKUP()
>> >/47edf8ee-83c4-4bd2-b275-20ccd9de4458/dom_md
>> >=> -1 (Transport endpoint is not connected)
>> >[2020-01-27 19:59:51.488775] W [fuse-bridge.c:942:fuse_entry_cbk]
>> >0-glusterfs-fuse: 105: LOOKUP()
>> >/47edf8ee-83c4-4bd2-b275-20ccd9de4458/dom_md => -1 (Transport
>endpoint
>> >is
>> >not connected)
>> >[2020-01-27 19:59:58.713818] W [fuse-bridge.c:942:fuse_entry_cbk]
>> >0-glusterfs-fuse: 112: LOOKUP()
>> >/47edf8ee-83c4-4bd2-b275-20ccd9de4458/dom_md => -1 (Transport
>endpoint
>> >is
>> >not connected)
>> >[2020-01-27 19:59:59.007467] W [fuse-bridge.c:942:fuse_entry_cbk]
>> >0-glusterfs-fuse: 118: LOOKUP()
>> >/47edf8ee-83c4-4bd2-b275-20ccd9de4458/dom_md => -1 (Transport
>endpoint
>> >is
>> >not connected)
>> >[2020-01-27 20:00:00.136599] W [fuse-bridge.c:942:fuse_entry_cbk]
>> >0-glusterfs-fuse: 125: LOOKUP()
>> >/47edf8ee-83c4-4bd2-b275-20ccd9de4458/dom_md => -1 (Transport
>endpoint
>> >is
>> >not connected)
>> >[2020-01-27 20:00:00.781763] W [fuse-bridge.c:942:fuse_entry_cbk]
>> >0-glusterfs-fuse: 131: LOOKUP()
>> >/47edf8ee-83c4-4bd2-b275-20ccd9de4458/dom_md => -1 (Transport
>endpoint
>> >is
>> >not connected)
>> >[2020-01-27 20:00:00.878852] W [fuse-bridge.c:942:fuse_entry_cbk]
>> >0-glusterfs-fuse: 137: LOOKUP()
>> >/47edf8ee-83c4-4bd2-b275-20ccd9de4458/dom_md => -1 (Transport
>endpoint
>> >is
>> >not connected)
>> >[2020-01-27 20:00:01.580272] W [fuse-bridge.c:942:fuse_entry_cbk]
>> >0-glusterfs-fuse: 144: LOOKUP()
>> >/47edf8ee-83c4-4bd2-b275-20ccd9de4458/dom_md => -1 (Transport
>endpoint
>> >is
>> >not connected)
>> >[2020-01-27 20:00:01.686464] W [fuse-bridge.c:942:fuse_entry_cbk]
>> >0-glusterfs-fuse: 150: LOOKUP()
>> >/47edf8ee-83c4-4bd2-b275-20ccd9de4458/dom_md => -1 (Transport
>endpoint
>> >is
>> >not connected)
>> >[2020-01-27 20:00:01.757087] W [fuse-bridge.c:942:fuse_entry_cbk]
>> >0-glusterfs-fuse: 156: LOOKUP()
>> >/47edf8ee-83c4-4bd2-b275-20ccd9de4458/dom_md => -1 (Transport
>endpoint
>> >is
>> >not connected)
>> >[2020-01-27 20:00:03.061635] W [fuse-bridge.c:942:fuse_entry_cbk]
>> >0-glusterfs-fuse: 163: LOOKUP()
>> >/47edf8ee-83c4-4bd2-b275-20ccd9de4458/dom_md => -1 (Transport
>endpoint
>> >is
>> >not connected)
>> >[2020-01-27 20:00:03.161894] W [fuse-bridge.c:942:fuse_entry_cbk]
>> >0-glusterfs-fuse: 169: LOOKUP()
>> >/47edf8ee-83c4-4bd2-b275-20ccd9de4458/dom_md => -1 (Transport
>endpoint
>> >is
>> >not connected)
>> >[2020-01-27 20:00:04.801107] W [fuse-bridge.c:942:fuse_entry_cbk]
>> >0-glusterfs-fuse: 176: LOOKUP()
>> >/47edf8ee-83c4-4bd2-b275-20ccd9de4458/dom_md => -1 (Transport
>endpoint
>> >is
>> >not connected)
>> >[2020-01-27 20:00:07.251125] W [fuse-bridge.c:942:fuse_entry_cbk]
>> >0-glusterfs-fuse: 183: LOOKUP()
>> >/47edf8ee-83c4-4bd2-b275-20ccd9de4458/dom_md => -1 (Transport
>endpoint
>> >is
>> >not connected)
>> >
>> >and some apps directly connecting to gluster mounts report these
>> >errors;
>> >2020-01-27  1:10:48 0 [ERROR] mysqld: File '/binlog/binlog.~rec~'
>not
>> >found
>> >(Errcode: 107 "Transport endpoint is not connected")
>> >2020-01-27  3:28:01 0 [ERROR] mysqld: File '/binlog/binlog.000113'
>not
>> >found (Errcode: 107 "Transport endpoint is not connected")
>> >
>> >So the errors seem to hint at either a connection issue or a quorum
>> >loss of some sort. However, gluster is running on its own private and
>> >separate network, with no firewall rules or anything else which could
>> >obstruct the connection.
>> >In addition, gluster volume status reports all bricks and nodes are
>> >up, and gluster volume heal reports no pending heals.
>> >What makes this issue even more interesting is that when I manually
>> >check the files, all seems fine;
>> >
>> >For the first issue, where the machine won't start because vdsm
>> >cannot determine the size,
>> >qemu is able to report the size;
>> >qemu-img info /rhev/data-center/mnt/glusterSD/10.201.0.7:_ovirt-data/5f17d41f-d617-48b8-8881-a53460b02829/images/2f96fd46-1851-49c8-9f48-78bb50dbdffd/f16492a6-2d0e-4657-88e3-a9f4d8e48e74
>> >image: /rhev/data-center/mnt/glusterSD/10.201.0.7:_ovirt-data/5f17d41f-d617-48b8-8881-a53460b02829/images/2f96fd46-1851-49c8-9f48-78bb50dbdffd/f16492a6-2d0e-4657-88e3-a9f4d8e48e74
>> >file format: raw
>> >virtual size: 34T (37580963840000 bytes)
>> >disk size: 7.1T
>> >In addition I'm able to mount the volume using a loop device;
>> >losetup /dev/loop0 /rhev/data-center/mnt/glusterSD/10.201.0.7:_ovirt-data/5f17d41f-d617-48b8-8881-a53460b02829/images/2f96fd46-1851-49c8-9f48-78bb50dbdffd/f16492a6-2d0e-4657-88e3-a9f4d8e48e74
>> >kpartx -av /dev/loop0
>> >vgscan
>> >vgchange -ay
>> >mount /dev/mapper/cl--data5-data5 /data5/
>> >After this I'm able to see all contents of the disk, and in fact write
>> >to it. So the earlier reported connection error doesn't seem to apply
>> >here?
>> >This is actually how I'm currently running the VM: I detached the
>> >disk and mounted it in the VM via the loop device. The disk is a data
>> >disk for a heavily loaded mysql instance, and mysql is reporting no
>> >errors, and has been running for about a day now.
>> >Of course this is not the way it should run, but it is at least
>> >working; only performance seems a bit off. So I would like to solve
>> >the issue and be able to attach the image as a disk again.
>> >
>> >For the second issue, where the image of the HostedEngine couldn't be
>> >found, all seems correct as well;
>> >The file is there and has the correct permissions;
>> > ls -la /rhev/data-center/mnt/glusterSD/10.201.0.9
>>
>>
>>\:_ovirt-engine/313f5d25-76af-4ecd-9a20-82a2fe815a3c/images/9191ca25-536f-42cd-8373-c04ff9cc1a64/
>> >total 49406333
>> >drwxr-xr-x.  2 vdsm kvm        4096 Jan 25 12:03 .
>> >drwxr-xr-x. 13 vdsm kvm        4096 Jan 25 14:16 ..
>> >-rw-rw----.  1 vdsm kvm 62277025792 Jan 23 03:04
>> >38e4fba7-f140-4630-afab-0f744ebe3b57
>> >-rw-rw----.  1 vdsm kvm     1048576 Jan 25 21:48
>> >38e4fba7-f140-4630-afab-0f744ebe3b57.lease
>> >-rw-r--r--.  1 vdsm kvm         285 Jan 27  2018
>> >38e4fba7-f140-4630-afab-0f744ebe3b57.meta
>> >And I'm able to mount the image using a loop device and access its
>> >contents.
>> >Unfortunately the VM wouldn't boot due to XFS errors. After tinkering
>> >with this for about a day to make it boot, I gave up and restored from
>> >a recent backup. But I took the postgres data dir from the mounted old
>> >image to the new VM, and postgres was perfectly fine with it, also
>> >indicating the image wasn't completely toast.
>> >
>> >And the last issue, where the storage domain wouldn't activate. The
>> >file it claims it cannot read in the log is perfectly readable and
>> >writable;
>> >cat /rhev/data-center/mnt/glusterSD/10.201.0.11:_ovirt-mon-2/47edf8ee-83c4-4bd2-b275-20ccd9de4458/dom_md/metadata
>> >CLASS=Data
>> >DESCRIPTION=ovirt-mon-2
>> >IOOPTIMEOUTSEC=10
>> >LEASERETRIES=3
>> >LEASETIMESEC=60
>> >LOCKPOLICY=
>> >LOCKRENEWALINTERVALSEC=5
>> >POOL_UUID=59cd53a9-0003-02d7-00eb-0000000001e3
>> >REMOTE_PATH=10.201.0.11:/ovirt-mon-2
>> >ROLE=Regular
>> >SDUUID=47edf8ee-83c4-4bd2-b275-20ccd9de4458
>> >TYPE=GLUSTERFS
>> >VERSION=4
>> >_SHA_CKSUM=d49b4a74e70a22a1b816519e8ed4167994672807
>> >
>> >So I've no clue where these "Transport endpoint is not connected"
>> >errors are coming from, and how to resolve them?
>> >
>> >I think there are 4 possible causes for this issue;
>> >1) I was trying to optimize the throughput of gluster on some volumes,
>> >since we recently gained some additional write load, which we had
>> >difficulty keeping up with. So I tried to incrementally
>> >increase server.event-threads, via;
>> >gluster v set ovirt-data server.event-threads X
>> >Since this didn't seem to improve the performance I changed it back to
>> >its original value. But when I did that, the VMs running on these
>> >volumes all locked up and required a reboot, which was by then still
>> >possible. Please note that for the volumes ovirt-engine and
>> >ovirt-mon-2 this setting wasn't changed.
>> >
>> >2) I had a mix of gluster 6.6 and 6.7 running, since I was in the
>> >middle of upgrading all to 6.7.
>> >
>> >3) On one of the physical brick nodes, XFS errors were reported after
>> >a reboot and resolved by xfs_repair, which did remove some inodes in
>> >the process. I wasn't too worried about this, since I would expect the
>> >gluster self-heal daemon to restore them, which seemed true for all
>> >volumes except 1, where 1 gfid was pending for about 2 days. In this
>> >case it was exactly the image for which vdsm reports it cannot resolve
>> >the size. But there are other VM images with the same issue, which I
>> >left out for brevity.
>> >However, the pending heal of the single gfid resolved once I mounted
>> >the image via the loop device and started writing to it. This is
>> >probably due to the way gluster determines what needs healing, even
>> >though a "gluster heal X full" had been issued before.
>> >I could also confirm the pending gfid was in fact missing from the
>> >underlying brick directory on the brick node while the heal was still
>> >pending.
>> >
>> >4) I did some brick replacements (only on the ovirt-data volume), but
>> >only of arbiter bricks of the volume affected in the first issue.
>> >
>> >The volume info of the affected volumes looks like this;
>> >
>> >Volume Name: ovirt-data
>> >Type: Distributed-Replicate
>> >Volume ID: 2775dc10-c197-446e-a73f-275853d38666
>> >Status: Started
>> >Snapshot Count: 0
>> >Number of Bricks: 4 x (2 + 1) = 12
>> >Transport-type: tcp
>> >Bricks:
>> >Brick1: 10.201.0.5:/data5/gfs/bricks/brick1/ovirt-data
>> >Brick2: 10.201.0.1:/data5/gfs/bricks/brick1/ovirt-data
>> >Brick3: 10.201.0.9:/data0/gfs/bricks/bricka/ovirt-data (arbiter)
>> >Brick4: 10.201.0.7:/data5/gfs/bricks/brick1/ovirt-data
>> >Brick5: 10.201.0.9:/data5/gfs/bricks/brick1/ovirt-data
>> >Brick6: 10.201.0.11:/data0/gfs/bricks/bricka/ovirt-data (arbiter)
>> >Brick7: 10.201.0.6:/data5/gfs/bricks/brick1/ovirt-data
>> >Brick8: 10.201.0.8:/data5/gfs/bricks/brick1/ovirt-data
>> >Brick9: 10.201.0.12:/data0/gfs/bricks/bricka/ovirt-data (arbiter)
>> >Brick10: 10.201.0.12:/data5/gfs/bricks/brick1/ovirt-data
>> >Brick11: 10.201.0.11:/data5/gfs/bricks/brick1/ovirt-data
>> >Brick12: 10.201.0.10:/data0/gfs/bricks/bricka/ovirt-data (arbiter)
>> >Options Reconfigured:
>> >cluster.choose-local: off
>> >server.outstanding-rpc-limit: 1024
>> >storage.owner-gid: 36
>> >storage.owner-uid: 36
>> >transport.address-family: inet
>> >performance.readdir-ahead: on
>> >nfs.disable: on
>> >performance.quick-read: off
>> >performance.read-ahead: off
>> >performance.io-cache: off
>> >performance.stat-prefetch: off
>> >performance.low-prio-threads: 32
>> >network.remote-dio: off
>> >cluster.eager-lock: enable
>> >cluster.quorum-type: auto
>> >cluster.server-quorum-type: server
>> >cluster.data-self-heal-algorithm: full
>> >cluster.locking-scheme: granular
>> >cluster.shd-max-threads: 8
>> >cluster.shd-wait-qlength: 10000
>> >features.shard: on
>> >user.cifs: off
>> >performance.write-behind-window-size: 512MB
>> >performance.cache-size: 384MB
>> >server.event-threads: 5
>> >performance.strict-o-direct: on
>> >cluster.brick-multiplex: on
>> >
>> >Volume Name: ovirt-engine
>> >Type: Distributed-Replicate
>> >Volume ID: 9cc4dade-ef2e-4112-bcbf-e0fbc5df4ebc
>> >Status: Started
>> >Snapshot Count: 0
>> >Number of Bricks: 3 x 3 = 9
>> >Transport-type: tcp
>> >Bricks:
>> >Brick1: 10.201.0.5:/data5/gfs/bricks/brick1/ovirt-engine
>> >Brick2: 10.201.0.1:/data5/gfs/bricks/brick1/ovirt-engine
>> >Brick3: 10.201.0.2:/data5/gfs/bricks/brick1/ovirt-engine
>> >Brick4: 10.201.0.8:/data5/gfs/bricks/brick1/ovirt-engine
>> >Brick5: 10.201.0.9:/data5/gfs/bricks/brick1/ovirt-engine
>> >Brick6: 10.201.0.3:/data5/gfs/bricks/brick1/ovirt-engine
>> >Brick7: 10.201.0.12:/data5/gfs/bricks/brick1/ovirt-engine
>> >Brick8: 10.201.0.11:/data5/gfs/bricks/brick1/ovirt-engine
>> >Brick9: 10.201.0.7:/data5/gfs/bricks/brick1/ovirt-engine
>> >Options Reconfigured:
>> >performance.strict-o-direct: on
>> >performance.write-behind-window-size: 512MB
>> >features.shard-block-size: 64MB
>> >performance.cache-size: 128MB
>> >nfs.disable: on
>> >transport.address-family: inet
>> >performance.quick-read: off
>> >performance.read-ahead: off
>> >performance.io-cache: off
>> >performance.low-prio-threads: 32
>> >network.remote-dio: enable
>> >cluster.eager-lock: enable
>> >cluster.quorum-type: auto
>> >cluster.server-quorum-type: server
>> >cluster.data-self-heal-algorithm: full
>> >cluster.locking-scheme: granular
>> >cluster.shd-max-threads: 8
>> >cluster.shd-wait-qlength: 10000
>> >features.shard: on
>> >user.cifs: off
>> >storage.owner-uid: 36
>> >storage.owner-gid: 36
>> >cluster.brick-multiplex: on
>> >
>> >Volume Name: ovirt-mon-2
>> >Type: Replicate
>> >Volume ID: 111ff79a-565a-4d31-9f31-4c839749bafd
>> >Status: Started
>> >Snapshot Count: 0
>> >Number of Bricks: 1 x (2 + 1) = 3
>> >Transport-type: tcp
>> >Bricks:
>> >Brick1: 10.201.0.10:/data0/gfs/bricks/brick1/ovirt-mon-2
>> >Brick2: 10.201.0.11:/data0/gfs/bricks/brick1/ovirt-mon-2
>> >Brick3: 10.201.0.12:/data0/gfs/bricks/bricka/ovirt-mon-2 (arbiter)
>> >Options Reconfigured:
>> >performance.client-io-threads: on
>> >nfs.disable: on
>> >transport.address-family: inet
>> >performance.quick-read: off
>> >performance.read-ahead: off
>> >performance.io-cache: off
>> >performance.low-prio-threads: 32
>> >network.remote-dio: off
>> >cluster.eager-lock: enable
>> >cluster.quorum-type: auto
>> >cluster.server-quorum-type: server
>> >cluster.data-self-heal-algorithm: full
>> >cluster.locking-scheme: granular
>> >cluster.shd-max-threads: 8
>> >cluster.shd-wait-qlength: 10000
>> >features.shard: on
>> >user.cifs: off
>> >cluster.choose-local: off
>> >client.event-threads: 4
>> >server.event-threads: 4
>> >storage.owner-uid: 36
>> >storage.owner-gid: 36
>> >performance.strict-o-direct: on
>> >performance.cache-size: 64MB
>> >performance.write-behind-window-size: 128MB
>> >features.shard-block-size: 64MB
>> >cluster.brick-multiplex: on
>> >
>> >Thanks Olaf
>>
>> Hi Olaf,
>>
>> Thanks for the detailed output.
>> At first glance I noticed that you use the HostedEngine domain both
>> for oVirt's engine VM and for other VMs, is that right?
>> If yes, that's against best practices and not recommended.
>> Second, you use brick multiplexing, but according to the RH
>> documentation that feature is not supported for your workload - so in
>> your case it's drawing attention but should not be a problem.
>>
>> Can you specify how many physical hosts you have?
>>
>> I will try to check the output deeper, but I think you need to check:
>> 1. Check gluster heal status - any pending heals should be resolved.
>> 2. Use telnet/nc/ncat/netcat to verify that each host sees the peers'
>> brick ports.
>> 3. gluster volume heal <volume> info should report all bricks are
>> connected;
>> gluster volume status must report all bricks have a pid.
>> 4. OPTIONAL - Try to create smaller disks (it's not a good idea to have
>> large qcow2 disks) via oVirt and assign them to your mysql. Then try to
>> pvmove the LVs from the disk (mounted with loop) to the new disks -
>> that way you can get rid of the old qcow disk.
>> 5. What is your oVirt version? Could it be an old 3.x?
>>
>> Don't forget to backup :)
>>
>> Best Regards,
>> Strahil Nikolov
>>

Hi Olaf,

I had an issue with Gluster and ACL.
The devs mentioned 2 approaches to fix it:
1. Run a find that sets a dummy ACL. Something like:
find /rhev/full/path/to/share -exec setfacl -m u:root:rwx {} \;
2. Stop and start the volume.

As you have already fixed it, you won't have to do it.
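
A minimal sketch of approach 1 with a dry-run switch, so the list of affected paths can be reviewed before any ACL is changed. The function name refresh_acls and the DRY_RUN variable are made up for this illustration; the find/setfacl invocation itself is the one quoted above, and /rhev/full/path/to/share remains a placeholder.

```shell
#!/bin/sh
# Hypothetical helper: set a dummy ACL entry on every path below a
# directory. DRY_RUN defaults to 1, in which case nothing is modified.
refresh_acls() {
    dir=$1
    if [ "${DRY_RUN:-1}" = "1" ]; then
        # Only print the commands that would be executed.
        find "$dir" -exec echo setfacl -m u:root:rwx {} \;
    else
        # Really apply the dummy ACL entry to each path.
        find "$dir" -exec setfacl -m u:root:rwx {} \;
    fi
}
```

With DRY_RUN left at its default the function only prints the setfacl commands; something like "DRY_RUN=0 refresh_acls /rhev/full/path/to/share" would apply them.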

Multiplexing is good if you have a lot of bricks on 1 node - you don't have that many.

Here is a short quotation from the docs:

21.2.1. Many Bricks per Node

By default, for every brick configured on a Red Hat Gluster Storage server node, one process is created and one port is consumed. If you have a large number of bricks configured on a single server, enabling brick multiplexing reduces port and memory consumption by allowing compatible bricks to use the same process and port. Red Hat recommends restarting all volumes after enabling or disabling brick multiplexing.

As of Red Hat Gluster Storage 3.3, brick multiplexing is supported only for Container-Native Storage (CNS) and Container-Ready Storage (CRS) use cases.

Source: https://access.redhat.com/documentation/en-us/red_hat_gluster_storage/3.3/html/administration_guide/brick_configuration
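
To judge whether a node really carries "a lot of bricks", the brick lines of gluster volume info can be counted per host. A rough sketch, assuming the "BrickN: host:/path" layout shown in the volume info earlier in this thread; bricks_per_host is a hypothetical helper name, not a gluster command.

```shell
#!/bin/sh
# Hypothetical helper: read `gluster volume info` output on stdin and
# print "<brick count> <host>" for every host that carries a brick.
bricks_per_host() {
    awk '/^Brick[0-9]+:/ {
             split($2, part, ":")   # $2 looks like host:/path/to/brick
             count[part[1]]++
         }
         END { for (host in count) print count[host], host }' | sort -k2
}
```

It could be fed with "gluster volume info | bricks_per_host" to cover all volumes at once, or with the info of a single volume.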

Smaller disks are easier to handle during storage migration. As you have multiple nodes in Gluster, shards will be scattered all over the bricks - so in your case you will be able to read the shards from up to 15 bricks.
For me 2 TB is enough, but it's up to you.
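
For a feeling of how many pieces a sharded image is split into, the image size can be divided by the shard block size and rounded up. A small sketch; shard_count is a hypothetical helper, and the 64MB block size for ovirt-data is an assumption (that volume does not list features.shard-block-size, so the default is presumed to apply).

```shell
#!/bin/sh
# Hypothetical helper: number of shard-sized pieces needed for an image.
shard_count() {
    size_bytes=$1
    shard_bytes=$2
    # Ceiling division: a partially filled last piece still counts.
    echo $(( (size_bytes + shard_bytes - 1) / shard_bytes ))
}

# Virtual size of the 34T image from the qemu-img output, 64MB pieces (assumed).
shard_count 37580963840000 $((64 * 1024 * 1024))
# A 2 TB disk (taken here as 2 * 1024^4 bytes) with the same piece size.
shard_count $((2 * 1024 * 1024 * 1024 * 1024)) $((64 * 1024 * 1024))
```

Under these assumptions the first call prints 560000 and the second 32768. These are upper bounds based on the virtual size; a sparse image such as the one above (7.1T on disk) has fewer pieces actually allocated.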

You still need to separate the volume for the HostedEngine from the rest - there was a discussion on users at ovirt.org about that and the risks of using a shared volume.

Best Regards,
Strahil Nikolov