[Gluster-users] [Errno 107] Transport endpoint is not connected

Olaf Buitelaar olaf.buitelaar at gmail.com
Wed Jan 29 11:43:03 UTC 2020


Hi Strahil,

Thank you for your reply. I found the issue: the "not connected" errors seem
to originate in the ACL layer. Somewhere a permission denied was raised, and
this was translated into a "not connected" error.
While the file permissions were listed as owner=vdsm and group=kvm, the ACL
layer apparently saw this differently. I ran "chown -R vdsm.kvm
/rhev/data-center/mnt/glusterSD/10.201.0.11\:_ovirt-mon-2/" on the mount,
and suddenly things started working again.
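
For anyone who hits the same symptom: the way I would compare what the ACL
layer sees against the plain owner/group bits is roughly the following (a
minimal sketch, using my ovirt-mon-2 mount as the example path; I haven't
verified that stripping ACLs is safe on a live storage domain):

getfacl /rhev/data-center/mnt/glusterSD/10.201.0.11\:_ovirt-mon-2/
# lists any extended ACL entries (mask, named users/groups) next to owner/group
setfacl -R -b /rhev/data-center/mnt/glusterSD/10.201.0.11\:_ovirt-mon-2/
# would recursively remove all extended ACL entries, leaving only the mode bits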

I indeed have (or now had, since the restore procedure required an empty
domain) one other VM on the HostedEngine domain; that VM hosted other
critical services such as the VPN. Since I consider the HostedEngine domain
one of the most reliable domains, I used it for critical services.
All other VMs have their own domains.

I'm a bit surprised by your comment about brick multiplexing; I understood
it should actually improve performance by sharing resources. Would you
have some extra information about this?
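
For reference, this is how I understand the setting can be inspected and,
if needed, turned off again; just a sketch, I haven't tried toggling it on
this running cluster:

gluster volume get all cluster.brick-multiplex
# shows the current cluster-wide value
gluster volume set all cluster.brick-multiplex off
# as far as I know, already-running bricks keep their multiplexed processes
# until the volumes are restarted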

To answer your questions:

We currently have 15 physical hosts.

1) There are no pending heals.
2) Yes, I'm able to connect to the brick ports.
3) All peers report as connected (the checks I used for 1-3 are sketched
below the list).
4) I actually had a setup like this before, with multiple smaller qcow
disks in a RAID0 with LVM. That turned out not to be reliable, so I
switched to a single large disk. Would you know if there is some
documentation about this?
5) I'm running about the latest and greatest stable: 4.3.7.2-1.el7. I only
had trouble with the restore, because the cluster was still in
compatibility level 4.2 and there were two older VMs that had snapshots
from prior versions while the leaf was at compatibility level 4.2. Note:
the backup was taken on the engine running 4.3.
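
For completeness, these are roughly the checks behind answers 1-3 (the host
and volume name are just examples from this setup):

gluster volume heal ovirt-data info
# 1) should list 'Number of entries: 0' for every brick
nc -zv 10.201.0.5 49152
# 2) the actual port per brick is shown by 'gluster volume status ovirt-data'
gluster peer status
# 3) every peer should show 'State: Peer in Cluster (Connected)'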

Thanks Olaf



On Tue, Jan 28, 2020 at 17:31, Strahil Nikolov <hunter86_bg at yahoo.com> wrote:

> On January 27, 2020 11:49:08 PM GMT+02:00, Olaf Buitelaar <
> olaf.buitelaar at gmail.com> wrote:
> >Dear Gluster users,
> >
> >I'm a bit at a loss here, and any help would be appreciated.
> >
> >I've lost a couple of virtual machines, since the disks suffered from
> >severe XFS errors, and some won't boot because they can't resolve the
> >size of the image as reported by vdsm:
> >"VM kube-large-01 is down with error. Exit message: Unable to get
> >volume
> >size for domain 5f17d41f-d617-48b8-8881-a53460b02829 volume
> >f16492a6-2d0e-4657-88e3-a9f4d8e48e74."
> >
> >which is also reported by the vdsm-client: vdsm-client Volume getSize
> >storagepoolID=59cd53a9-0003-02d7-00eb-0000000001e3
> >storagedomainID=5f17d41f-d617-48b8-8881-a53460b02829
> >imageID=2f96fd46-1851-49c8-9f48-78bb50dbdffd
> >volumeID=f16492a6-2d0e-4657-88e3-a9f4d8e48e74
> >vdsm-client: Command Volume.getSize with args {'storagepoolID':
> >'59cd53a9-0003-02d7-00eb-0000000001e3', 'storagedomainID':
> >'5f17d41f-d617-48b8-8881-a53460b02829', 'volumeID':
> >'f16492a6-2d0e-4657-88e3-a9f4d8e48e74', 'imageID':
> >'2f96fd46-1851-49c8-9f48-78bb50dbdffd'} failed:
> >(code=100, message=[Errno 107] Transport endpoint is not connected)
> >
> >with the corresponding gluster mount log:
> >[2020-01-27 19:42:22.678793] W [MSGID: 114031]
> >[client-rpc-fops_v2.c:2634:client4_0_lookup_cbk]
> >0-ovirt-data-client-14:
> >remote operation failed. Path:
>
> >/5f17d41f-d617-48b8-8881-a53460b02829/images/2f96fd46-1851-49c8-9f48-78bb50dbdffd/f16492a6-2d0e-4657-88e3-a9f4d8e48e74
> >(a19abb2f-8e7e-42f0-a3c1-dad1eeb3a851) [Permission denied]
> >[2020-01-27 19:42:22.678828] W [MSGID: 114031]
> >[client-rpc-fops_v2.c:2634:client4_0_lookup_cbk]
> >0-ovirt-data-client-13:
> >remote operation failed. Path:
>
> >/5f17d41f-d617-48b8-8881-a53460b02829/images/2f96fd46-1851-49c8-9f48-78bb50dbdffd/f16492a6-2d0e-4657-88e3-a9f4d8e48e74
> >(a19abb2f-8e7e-42f0-a3c1-dad1eeb3a851) [Permission denied]
> >[2020-01-27 19:42:22.679806] W [MSGID: 114031]
> >[client-rpc-fops_v2.c:2634:client4_0_lookup_cbk]
> >0-ovirt-data-client-14:
> >remote operation failed. Path: (null)
> >(00000000-0000-0000-0000-000000000000) [Permission denied]
> >[2020-01-27 19:42:22.679862] W [MSGID: 114031]
> >[client-rpc-fops_v2.c:2634:client4_0_lookup_cbk]
> >0-ovirt-data-client-13:
> >remote operation failed. Path: (null)
> >(00000000-0000-0000-0000-000000000000) [Permission denied]
> >[2020-01-27 19:42:22.679981] W [MSGID: 108027]
> >[afr-common.c:2274:afr_attempt_readsubvol_set]
> >0-ovirt-data-replicate-3: no
> >read subvols for
>
> >/5f17d41f-d617-48b8-8881-a53460b02829/images/2f96fd46-1851-49c8-9f48-78bb50dbdffd/f16492a6-2d0e-4657-88e3-a9f4d8e48e74
> >[2020-01-27 19:42:22.680606] W [MSGID: 114031]
> >[client-rpc-fops_v2.c:2634:client4_0_lookup_cbk]
> >0-ovirt-data-client-14:
> >remote operation failed. Path:
>
> >/5f17d41f-d617-48b8-8881-a53460b02829/images/2f96fd46-1851-49c8-9f48-78bb50dbdffd/f16492a6-2d0e-4657-88e3-a9f4d8e48e74
> >(00000000-0000-0000-0000-000000000000) [Permission denied]
> >[2020-01-27 19:42:22.680622] W [MSGID: 114031]
> >[client-rpc-fops_v2.c:2634:client4_0_lookup_cbk]
> >0-ovirt-data-client-13:
> >remote operation failed. Path:
>
> >/5f17d41f-d617-48b8-8881-a53460b02829/images/2f96fd46-1851-49c8-9f48-78bb50dbdffd/f16492a6-2d0e-4657-88e3-a9f4d8e48e74
> >(00000000-0000-0000-0000-000000000000) [Permission denied]
> >[2020-01-27 19:42:22.681742] W [MSGID: 114031]
> >[client-rpc-fops_v2.c:2634:client4_0_lookup_cbk]
> >0-ovirt-data-client-13:
> >remote operation failed. Path: (null)
> >(00000000-0000-0000-0000-000000000000) [Permission denied]
> >[2020-01-27 19:42:22.681871] W [MSGID: 108027]
> >[afr-common.c:2274:afr_attempt_readsubvol_set]
> >0-ovirt-data-replicate-3: no
> >read subvols for
>
> >/5f17d41f-d617-48b8-8881-a53460b02829/images/2f96fd46-1851-49c8-9f48-78bb50dbdffd/f16492a6-2d0e-4657-88e3-a9f4d8e48e74
> >[2020-01-27 19:42:22.682344] W [MSGID: 114031]
> >[client-rpc-fops_v2.c:2634:client4_0_lookup_cbk]
> >0-ovirt-data-client-14:
> >remote operation failed. Path:
>
> >/5f17d41f-d617-48b8-8881-a53460b02829/images/2f96fd46-1851-49c8-9f48-78bb50dbdffd/f16492a6-2d0e-4657-88e3-a9f4d8e48e74
> >(00000000-0000-0000-0000-000000000000) [Permission denied]
> >The message "W [MSGID: 114031]
> >[client-rpc-fops_v2.c:2634:client4_0_lookup_cbk]
> >0-ovirt-data-client-14:
> >remote operation failed. Path: (null)
> >(00000000-0000-0000-0000-000000000000) [Permission denied]" repeated 2
> >times between [2020-01-27 19:42:22.679806] and [2020-01-27
> >19:42:22.683308]
> >[2020-01-27 19:42:22.683327] W [MSGID: 114031]
> >[client-rpc-fops_v2.c:2634:client4_0_lookup_cbk]
> >0-ovirt-data-client-13:
> >remote operation failed. Path: (null)
> >(00000000-0000-0000-0000-000000000000) [Permission denied]
> >[2020-01-27 19:42:22.683438] W [MSGID: 108027]
> >[afr-common.c:2274:afr_attempt_readsubvol_set]
> >0-ovirt-data-replicate-3: no
> >read subvols for
>
> >/5f17d41f-d617-48b8-8881-a53460b02829/images/2f96fd46-1851-49c8-9f48-78bb50dbdffd/f16492a6-2d0e-4657-88e3-a9f4d8e48e74
> >[2020-01-27 19:42:22.683495] I [dict.c:560:dict_get]
> >(-->/usr/lib64/glusterfs/6.7/xlator/cluster/replicate.so(+0x6e92b)
> >[0x7faaaadeb92b]
> >-->/usr/lib64/glusterfs/6.7/xlator/cluster/distribute.so(+0x45c78)
> >[0x7faaaab08c78] -->/lib64/libglusterfs.so.0(dict_get+0x94)
> >[0x7faab36ac254] ) 0-dict: !this || key=trusted.glusterfs.dht.mds
> >[Invalid
> >argument]
> >[2020-01-27 19:42:22.683506] W [fuse-bridge.c:942:fuse_entry_cbk]
> >0-glusterfs-fuse: 176728: LOOKUP()
>
> >/5f17d41f-d617-48b8-8881-a53460b02829/images/2f96fd46-1851-49c8-9f48-78bb50dbdffd/f16492a6-2d0e-4657-88e3-a9f4d8e48e74
> >=> -1 (Transport endpoint is not connected)
> >
> >In addition to this, vdsm also reported that it couldn't find the image
> >of the HostedEngine, and refused to boot it:
> >2020-01-25 10:03:45,345+0000 ERROR (vm/20d69acd)
> >[storage.TaskManager.Task]
> >(Task='ffdc4242-17ae-4ea1-9535-0e6fcb81944d') Unexpected error
> >(task:875)
> >Traceback (most recent call last):
> >File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 882,
> >in _run
> >    return fn(*args, **kargs)
> >  File "<string>", line 2, in prepareImage
> >File "/usr/lib/python2.7/site-packages/vdsm/common/api.py", line 50, in
> >method
> >    ret = func(*args, **kwargs)
> >File "/usr/lib/python2.7/site-packages/vdsm/storage/hsm.py", line 3203,
> >in prepareImage
> >    raise se.VolumeDoesNotExist(leafUUID)
> >VolumeDoesNotExist: Volume does not exist:
> >('38e4fba7-f140-4630-afab-0f744ebe3b57',)
> >
> >2020-01-25 10:03:45,345+0000 ERROR (vm/20d69acd) [virt.vm]
> >(vmId='20d69acd-edfd-4aeb-a2ae-49e9c121b7e9') The vm start process
> >failed
> >(vm:933)
> >Traceback (most recent call last):
> >  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 867, in
> >_startUnderlyingVm
> >    self._run()
> > File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 2795, in
> >_run
> >    self._devices = self._make_devices()
> > File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 2635, in
> >_make_devices
> >    disk_objs = self._perform_host_local_adjustment()
> > File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 2708, in
> >_perform_host_local_adjustment
> >    self._preparePathsForDrives(disk_params)
> > File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 1036, in
> >_preparePathsForDrives
> >    drive, self.id, path=path
> > File "/usr/lib/python2.7/site-packages/vdsm/clientIF.py", line 426, in
> >prepareVolumePath
> >    raise vm.VolumeError(drive)
> >VolumeError: Bad volume specification {'protocol': 'gluster',
> >'address':
> >{'function': '0x0', 'bus': '0x00', 'domain': '0x0000', 'type': 'pci',
> >'slot': '0x06'}, 'serial': '9191ca25-536f-42cd-8373-c04ff9cc1a64',
> >'index':
> >0, 'iface': 'virtio', 'apparentsize': '62277025792', 'specParams': {},
> >'cache': 'none', 'imageID': '9191ca25-536f-42cd-8373-c04ff9cc1a64',
> >'shared': 'exclusive', 'truesize': '50591027712', 'type': 'disk',
> >'domainID': '313f5d25-76af-4ecd-9a20-82a2fe815a3c', 'reqsize': '0',
> >'format': 'raw', 'poolID': '00000000-0000-0000-0000-000000000000',
> >'device': 'disk', 'path':
>
> >'ovirt-engine/313f5d25-76af-4ecd-9a20-82a2fe815a3c/images/9191ca25-536f-42cd-8373-c04ff9cc1a64/38e4fba7-f140-4630-afab-0f744ebe3b57',
> >'propagateErrors': 'off', 'name': 'vda', 'volumeID':
> >'38e4fba7-f140-4630-afab-0f744ebe3b57', 'diskType': 'network', 'alias':
> >'ua-9191ca25-536f-42cd-8373-c04ff9cc1a64', 'hosts': [{'name':
> >'10.201.0.9',
> >'port': '0'}], 'discard': False}
> >
> >And last, there is a storage domain which refuses to activate (from the
> >vdsm.log):
> >2020-01-25 10:01:11,750+0000 ERROR (check/loop) [storage.Monitor] Error
> >checking path
> >/rhev/data-center/mnt/glusterSD/10.201.0.11:
> _ovirt-mon-2/47edf8ee-83c4-4bd2-b275-20ccd9de4458/dom_md/metadata
> >(monitor:499)
> >Traceback (most recent call last):
> >  File "/usr/lib/python2.7/site-packages/vdsm/storage/monitor.py", line
> >497, in _pathChecked
> >    delay = result.delay()
> >File "/usr/lib/python2.7/site-packages/vdsm/storage/check.py", line
> >391,
> >in delay
> >    raise exception.MiscFileReadException(self.path, self.rc, self.err)
> >MiscFileReadException: Internal file read failure:
> >(u'/rhev/data-center/mnt/glusterSD/10.201.0.11:
> _ovirt-mon-2/47edf8ee-83c4-4bd2-b275-20ccd9de4458/dom_md/metadata',
> >1, bytearray(b"/usr/bin/dd: failed to open
> >\'/rhev/data-center/mnt/glusterSD/10.201.0.11:
> _ovirt-mon-2/47edf8ee-83c4-4bd2-b275-20ccd9de4458/dom_md/metadata\':
> >Transport endpoint is not connected\n"))
> >
> >The corresponding gluster mount log:
> >The message "W [MSGID: 114031]
> >[client-rpc-fops_v2.c:2634:client4_0_lookup_cbk]
> >0-ovirt-mon-2-client-0:
> >remote operation failed. Path:
> >/47edf8ee-83c4-4bd2-b275-20ccd9de4458/dom_md
> >(00000000-0000-0000-0000-000000000000) [Permission denied]" repeated 4
> >times between [2020-01-27 19:58:33.063826] and [2020-01-27
> >19:59:21.690134]
> >The message "W [MSGID: 114031]
> >[client-rpc-fops_v2.c:2634:client4_0_lookup_cbk]
> >0-ovirt-mon-2-client-1:
> >remote operation failed. Path:
> >/47edf8ee-83c4-4bd2-b275-20ccd9de4458/dom_md
> >(00000000-0000-0000-0000-000000000000) [Permission denied]" repeated 4
> >times between [2020-01-27 19:58:33.063734] and [2020-01-27
> >19:59:21.690150]
> >The message "W [MSGID: 114031]
> >[client-rpc-fops_v2.c:2634:client4_0_lookup_cbk]
> >0-ovirt-mon-2-client-0:
> >remote operation failed. Path: (null)
> >(00000000-0000-0000-0000-000000000000) [Permission denied]" repeated 4
> >times between [2020-01-27 19:58:33.065027] and [2020-01-27
> >19:59:21.691313]
> >The message "W [MSGID: 114031]
> >[client-rpc-fops_v2.c:2634:client4_0_lookup_cbk]
> >0-ovirt-mon-2-client-1:
> >remote operation failed. Path: (null)
> >(00000000-0000-0000-0000-000000000000) [Permission denied]" repeated 4
> >times between [2020-01-27 19:58:33.065106] and [2020-01-27
> >19:59:21.691328]
> >The message "W [MSGID: 108027]
> >[afr-common.c:2274:afr_attempt_readsubvol_set]
> >0-ovirt-mon-2-replicate-0:
> >no read subvols for /47edf8ee-83c4-4bd2-b275-20ccd9de4458/dom_md"
> >repeated
> >4 times between [2020-01-27 19:58:33.065163] and [2020-01-27
> >19:59:21.691369]
> >[2020-01-27 19:59:50.539315] W [MSGID: 114031]
> >[client-rpc-fops_v2.c:2634:client4_0_lookup_cbk]
> >0-ovirt-mon-2-client-0:
> >remote operation failed. Path:
> >/47edf8ee-83c4-4bd2-b275-20ccd9de4458/dom_md
> >(00000000-0000-0000-0000-000000000000) [Permission denied]
> >[2020-01-27 19:59:50.539321] W [MSGID: 114031]
> >[client-rpc-fops_v2.c:2634:client4_0_lookup_cbk]
> >0-ovirt-mon-2-client-1:
> >remote operation failed. Path:
> >/47edf8ee-83c4-4bd2-b275-20ccd9de4458/dom_md
> >(00000000-0000-0000-0000-000000000000) [Permission denied]
> >[2020-01-27 19:59:50.540412] W [MSGID: 114031]
> >[client-rpc-fops_v2.c:2634:client4_0_lookup_cbk]
> >0-ovirt-mon-2-client-1:
> >remote operation failed. Path: (null)
> >(00000000-0000-0000-0000-000000000000) [Permission denied]
> >[2020-01-27 19:59:50.540477] W [MSGID: 114031]
> >[client-rpc-fops_v2.c:2634:client4_0_lookup_cbk]
> >0-ovirt-mon-2-client-0:
> >remote operation failed. Path: (null)
> >(00000000-0000-0000-0000-000000000000) [Permission denied]
> >[2020-01-27 19:59:50.540533] W [MSGID: 108027]
> >[afr-common.c:2274:afr_attempt_readsubvol_set]
> >0-ovirt-mon-2-replicate-0:
> >no read subvols for /47edf8ee-83c4-4bd2-b275-20ccd9de4458/dom_md
> >[2020-01-27 19:59:50.540604] W [fuse-bridge.c:942:fuse_entry_cbk]
> >0-glusterfs-fuse: 99: LOOKUP()
> >/47edf8ee-83c4-4bd2-b275-20ccd9de4458/dom_md
> >=> -1 (Transport endpoint is not connected)
> >[2020-01-27 19:59:51.488775] W [fuse-bridge.c:942:fuse_entry_cbk]
> >0-glusterfs-fuse: 105: LOOKUP()
> >/47edf8ee-83c4-4bd2-b275-20ccd9de4458/dom_md => -1 (Transport endpoint
> >is
> >not connected)
> >[2020-01-27 19:59:58.713818] W [fuse-bridge.c:942:fuse_entry_cbk]
> >0-glusterfs-fuse: 112: LOOKUP()
> >/47edf8ee-83c4-4bd2-b275-20ccd9de4458/dom_md => -1 (Transport endpoint
> >is
> >not connected)
> >[2020-01-27 19:59:59.007467] W [fuse-bridge.c:942:fuse_entry_cbk]
> >0-glusterfs-fuse: 118: LOOKUP()
> >/47edf8ee-83c4-4bd2-b275-20ccd9de4458/dom_md => -1 (Transport endpoint
> >is
> >not connected)
> >[2020-01-27 20:00:00.136599] W [fuse-bridge.c:942:fuse_entry_cbk]
> >0-glusterfs-fuse: 125: LOOKUP()
> >/47edf8ee-83c4-4bd2-b275-20ccd9de4458/dom_md => -1 (Transport endpoint
> >is
> >not connected)
> >[2020-01-27 20:00:00.781763] W [fuse-bridge.c:942:fuse_entry_cbk]
> >0-glusterfs-fuse: 131: LOOKUP()
> >/47edf8ee-83c4-4bd2-b275-20ccd9de4458/dom_md => -1 (Transport endpoint
> >is
> >not connected)
> >[2020-01-27 20:00:00.878852] W [fuse-bridge.c:942:fuse_entry_cbk]
> >0-glusterfs-fuse: 137: LOOKUP()
> >/47edf8ee-83c4-4bd2-b275-20ccd9de4458/dom_md => -1 (Transport endpoint
> >is
> >not connected)
> >[2020-01-27 20:00:01.580272] W [fuse-bridge.c:942:fuse_entry_cbk]
> >0-glusterfs-fuse: 144: LOOKUP()
> >/47edf8ee-83c4-4bd2-b275-20ccd9de4458/dom_md => -1 (Transport endpoint
> >is
> >not connected)
> >[2020-01-27 20:00:01.686464] W [fuse-bridge.c:942:fuse_entry_cbk]
> >0-glusterfs-fuse: 150: LOOKUP()
> >/47edf8ee-83c4-4bd2-b275-20ccd9de4458/dom_md => -1 (Transport endpoint
> >is
> >not connected)
> >[2020-01-27 20:00:01.757087] W [fuse-bridge.c:942:fuse_entry_cbk]
> >0-glusterfs-fuse: 156: LOOKUP()
> >/47edf8ee-83c4-4bd2-b275-20ccd9de4458/dom_md => -1 (Transport endpoint
> >is
> >not connected)
> >[2020-01-27 20:00:03.061635] W [fuse-bridge.c:942:fuse_entry_cbk]
> >0-glusterfs-fuse: 163: LOOKUP()
> >/47edf8ee-83c4-4bd2-b275-20ccd9de4458/dom_md => -1 (Transport endpoint
> >is
> >not connected)
> >[2020-01-27 20:00:03.161894] W [fuse-bridge.c:942:fuse_entry_cbk]
> >0-glusterfs-fuse: 169: LOOKUP()
> >/47edf8ee-83c4-4bd2-b275-20ccd9de4458/dom_md => -1 (Transport endpoint
> >is
> >not connected)
> >[2020-01-27 20:00:04.801107] W [fuse-bridge.c:942:fuse_entry_cbk]
> >0-glusterfs-fuse: 176: LOOKUP()
> >/47edf8ee-83c4-4bd2-b275-20ccd9de4458/dom_md => -1 (Transport endpoint
> >is
> >not connected)
> >[2020-01-27 20:00:07.251125] W [fuse-bridge.c:942:fuse_entry_cbk]
> >0-glusterfs-fuse: 183: LOOKUP()
> >/47edf8ee-83c4-4bd2-b275-20ccd9de4458/dom_md => -1 (Transport endpoint
> >is
> >not connected)
> >
> >And some apps directly connecting to gluster mounts report these
> >errors:
> >2020-01-27  1:10:48 0 [ERROR] mysqld: File '/binlog/binlog.~rec~' not
> >found
> >(Errcode: 107 "Transport endpoint is not connected")
> >2020-01-27  3:28:01 0 [ERROR] mysqld: File '/binlog/binlog.000113' not
> >found (Errcode: 107 "Transport endpoint is not connected")
> >
> >So the errors seem to hint at either a connection issue or a quorum
> >loss of some sort. However, gluster is running on its own private and
> >separate network, with no firewall rules or anything else that could
> >obstruct the connection.
> >In addition, gluster volume status reports that all bricks and nodes are
> >up, and gluster volume heal reports no pending heals.
> >What makes this issue even more interesting is that when I manually
> >check the files, all seems fine:
> >
> >For the first issue, where the machine won't start because vdsm cannot
> >determine the size, qemu is able to report the size:
> >qemu-img info /rhev/data-center/mnt/glusterSD/10.201.0.7:
>
> >_ovirt-data/5f17d41f-d617-48b8-8881-a53460b02829/images/2f96fd46-1851-49c8-9f48-78bb50dbdffd/f16492a6-2d0e-46
> >57-88e3-a9f4d8e48e74
> >image: /rhev/data-center/mnt/glusterSD/10.201.0.7:
>
> >_ovirt-data/5f17d41f-d617-48b8-8881-a53460b02829/images/2f96fd46-1851-49c8-9f48-78bb50dbdffd/f16492a6-2d0e-4657-88e3-a9f4d8e48e74
> >file format: raw
> >virtual size: 34T (37580963840000 bytes)
> >disk size: 7.1T
> >In addition, I'm able to mount the image using a loop device:
> >losetup /dev/loop0 /rhev/data-center/mnt/glusterSD/10.201.0.7:
>
> >_ovirt-data/5f17d41f-d617-48b8-8881-a53460b02829/images/2f96fd46-1851-49c8-9f48-78bb50dbdffd/f16492a6-2d0e-4657-88e3-a9f4d8e48e74
> >kpartx -av /dev/loop0
> >vgscan
> >vgchange -ay
> >mount /dev/mapper/cl--data5-data5 /data5/
> >After this I'm able to see all contents of the disk, and in fact write
> >to it, so the earlier reported connection error doesn't seem to apply
> >here.
> >This is actually how I'm currently running the VM: I detached the disk
> >and mounted it in the VM via the loop device. The disk is a data disk
> >for a heavily loaded mysql instance; mysql is reporting no errors and
> >has been running for about a day now.
> >Of course this is not the way it should run, but it is at least working,
> >only performance seems a bit off. So I would like to solve the issue and
> >be able to attach the image as a disk again.
> >
> >For the second issue, where the image of the HostedEngine couldn't be
> >found, everything also seems correct; the file is there and has the
> >correct permissions:
> > ls -la /rhev/data-center/mnt/glusterSD/10.201.0.9
>
> >\:_ovirt-engine/313f5d25-76af-4ecd-9a20-82a2fe815a3c/images/9191ca25-536f-42cd-8373-c04ff9cc1a64/
> >total 49406333
> >drwxr-xr-x.  2 vdsm kvm        4096 Jan 25 12:03 .
> >drwxr-xr-x. 13 vdsm kvm        4096 Jan 25 14:16 ..
> >-rw-rw----.  1 vdsm kvm 62277025792 Jan 23 03:04
> >38e4fba7-f140-4630-afab-0f744ebe3b57
> >-rw-rw----.  1 vdsm kvm     1048576 Jan 25 21:48
> >38e4fba7-f140-4630-afab-0f744ebe3b57.lease
> >-rw-r--r--.  1 vdsm kvm         285 Jan 27  2018
> >38e4fba7-f140-4630-afab-0f744ebe3b57.meta
> >And I'm able to mount the image using a loop device and access its
> >contents.
> >Unfortunately the VM wouldn't boot due to XFS errors. After tinkering
> >with this for about a day to make it boot, I gave up and restored from a
> >recent backup. But I copied the postgres data directory from the mounted
> >old image to the new VM, and postgres was perfectly fine with it, also
> >indicating the image wasn't completely toast.
> >
> >And the last issue, where the storage domain wouldn't activate: the file
> >the log claims it cannot read is perfectly readable and writable;
> >cat /rhev/data-center/mnt/glusterSD/10.201.0.11:
> >_ovirt-mon-2/47edf8ee-83c4-4bd2-b275-20ccd9de4458/dom_md/metadata
> >CLASS=Data
> >DESCRIPTION=ovirt-mon-2
> >IOOPTIMEOUTSEC=10
> >LEASERETRIES=3
> >LEASETIMESEC=60
> >LOCKPOLICY=
> >LOCKRENEWALINTERVALSEC=5
> >POOL_UUID=59cd53a9-0003-02d7-00eb-0000000001e3
> >REMOTE_PATH=10.201.0.11:/ovirt-mon-2
> >ROLE=Regular
> >SDUUID=47edf8ee-83c4-4bd2-b275-20ccd9de4458
> >TYPE=GLUSTERFS
> >VERSION=4
> >_SHA_CKSUM=d49b4a74e70a22a1b816519e8ed4167994672807
> >
> >So I've no clue where these "Transport endpoint is not connected" errors
> >are coming from, or how to resolve them.
> >
> >I think there are 4 possible causes for this issue:
> >1) I was trying to optimize the throughput of gluster on some volumes,
> >since we recently gained some additional write load which we had
> >difficulty keeping up with. So I tried to incrementally increase
> >server.event-threads via:
> >gluster v set ovirt-data server.event-threads X
> >Since this didn't seem to improve the performance, I changed it back to
> >its original value. But when I did that, the VMs running on these
> >volumes all locked up and required a reboot, which was by then still
> >possible. Please note that for the volumes ovirt-engine and ovirt-mon-2
> >this setting wasn't changed.
> >
> >2) I had a mix of gluster 6.6 and 6.7 running, since I was in the
> >middle of upgrading everything to 6.7.
> >
> >3) On one of the physical brick nodes, XFS errors were reported after a
> >reboot and resolved by xfs_repair, which removed some inodes in the
> >process. I wasn't too worried about this, since I expected the gluster
> >self-heal daemon would resolve them, which seemed true for all volumes
> >except one, where one gfid was pending for about 2 days; in this case it
> >was exactly the image for which vdsm reports it cannot resolve the size.
> >(There are other VM images with the same issue, which I left out for
> >brevity.) However, the pending heal of the single gfid resolved once I
> >mounted the image via the loop device and started writing to it, which
> >is probably due to the way gluster determines what needs healing, even
> >though a gluster volume heal X full had been issued before.
> >I could also confirm that the pending gfid was in fact missing in the
> >underlying brick directory on that brick node while the heal was still
> >pending.
> >
> >4) I did some brick replaces (only on the ovirt-data volume), but only
> >of arbiter bricks of the volume affected in the first issue.
> >
> >The volume info of the affected volumes looks like this:
> >
> >Volume Name: ovirt-data
> >Type: Distributed-Replicate
> >Volume ID: 2775dc10-c197-446e-a73f-275853d38666
> >Status: Started
> >Snapshot Count: 0
> >Number of Bricks: 4 x (2 + 1) = 12
> >Transport-type: tcp
> >Bricks:
> >Brick1: 10.201.0.5:/data5/gfs/bricks/brick1/ovirt-data
> >Brick2: 10.201.0.1:/data5/gfs/bricks/brick1/ovirt-data
> >Brick3: 10.201.0.9:/data0/gfs/bricks/bricka/ovirt-data (arbiter)
> >Brick4: 10.201.0.7:/data5/gfs/bricks/brick1/ovirt-data
> >Brick5: 10.201.0.9:/data5/gfs/bricks/brick1/ovirt-data
> >Brick6: 10.201.0.11:/data0/gfs/bricks/bricka/ovirt-data (arbiter)
> >Brick7: 10.201.0.6:/data5/gfs/bricks/brick1/ovirt-data
> >Brick8: 10.201.0.8:/data5/gfs/bricks/brick1/ovirt-data
> >Brick9: 10.201.0.12:/data0/gfs/bricks/bricka/ovirt-data (arbiter)
> >Brick10: 10.201.0.12:/data5/gfs/bricks/brick1/ovirt-data
> >Brick11: 10.201.0.11:/data5/gfs/bricks/brick1/ovirt-data
> >Brick12: 10.201.0.10:/data0/gfs/bricks/bricka/ovirt-data (arbiter)
> >Options Reconfigured:
> >cluster.choose-local: off
> >server.outstanding-rpc-limit: 1024
> >storage.owner-gid: 36
> >storage.owner-uid: 36
> >transport.address-family: inet
> >performance.readdir-ahead: on
> >nfs.disable: on
> >performance.quick-read: off
> >performance.read-ahead: off
> >performance.io-cache: off
> >performance.stat-prefetch: off
> >performance.low-prio-threads: 32
> >network.remote-dio: off
> >cluster.eager-lock: enable
> >cluster.quorum-type: auto
> >cluster.server-quorum-type: server
> >cluster.data-self-heal-algorithm: full
> >cluster.locking-scheme: granular
> >cluster.shd-max-threads: 8
> >cluster.shd-wait-qlength: 10000
> >features.shard: on
> >user.cifs: off
> >performance.write-behind-window-size: 512MB
> >performance.cache-size: 384MB
> >server.event-threads: 5
> >performance.strict-o-direct: on
> >cluster.brick-multiplex: on
> >
> >Volume Name: ovirt-engine
> >Type: Distributed-Replicate
> >Volume ID: 9cc4dade-ef2e-4112-bcbf-e0fbc5df4ebc
> >Status: Started
> >Snapshot Count: 0
> >Number of Bricks: 3 x 3 = 9
> >Transport-type: tcp
> >Bricks:
> >Brick1: 10.201.0.5:/data5/gfs/bricks/brick1/ovirt-engine
> >Brick2: 10.201.0.1:/data5/gfs/bricks/brick1/ovirt-engine
> >Brick3: 10.201.0.2:/data5/gfs/bricks/brick1/ovirt-engine
> >Brick4: 10.201.0.8:/data5/gfs/bricks/brick1/ovirt-engine
> >Brick5: 10.201.0.9:/data5/gfs/bricks/brick1/ovirt-engine
> >Brick6: 10.201.0.3:/data5/gfs/bricks/brick1/ovirt-engine
> >Brick7: 10.201.0.12:/data5/gfs/bricks/brick1/ovirt-engine
> >Brick8: 10.201.0.11:/data5/gfs/bricks/brick1/ovirt-engine
> >Brick9: 10.201.0.7:/data5/gfs/bricks/brick1/ovirt-engine
> >Options Reconfigured:
> >performance.strict-o-direct: on
> >performance.write-behind-window-size: 512MB
> >features.shard-block-size: 64MB
> >performance.cache-size: 128MB
> >nfs.disable: on
> >transport.address-family: inet
> >performance.quick-read: off
> >performance.read-ahead: off
> >performance.io-cache: off
> >performance.low-prio-threads: 32
> >network.remote-dio: enable
> >cluster.eager-lock: enable
> >cluster.quorum-type: auto
> >cluster.server-quorum-type: server
> >cluster.data-self-heal-algorithm: full
> >cluster.locking-scheme: granular
> >cluster.shd-max-threads: 8
> >cluster.shd-wait-qlength: 10000
> >features.shard: on
> >user.cifs: off
> >storage.owner-uid: 36
> >storage.owner-gid: 36
> >cluster.brick-multiplex: on
> >
> >Volume Name: ovirt-mon-2
> >Type: Replicate
> >Volume ID: 111ff79a-565a-4d31-9f31-4c839749bafd
> >Status: Started
> >Snapshot Count: 0
> >Number of Bricks: 1 x (2 + 1) = 3
> >Transport-type: tcp
> >Bricks:
> >Brick1: 10.201.0.10:/data0/gfs/bricks/brick1/ovirt-mon-2
> >Brick2: 10.201.0.11:/data0/gfs/bricks/brick1/ovirt-mon-2
> >Brick3: 10.201.0.12:/data0/gfs/bricks/bricka/ovirt-mon-2 (arbiter)
> >Options Reconfigured:
> >performance.client-io-threads: on
> >nfs.disable: on
> >transport.address-family: inet
> >performance.quick-read: off
> >performance.read-ahead: off
> >performance.io-cache: off
> >performance.low-prio-threads: 32
> >network.remote-dio: off
> >cluster.eager-lock: enable
> >cluster.quorum-type: auto
> >cluster.server-quorum-type: server
> >cluster.data-self-heal-algorithm: full
> >cluster.locking-scheme: granular
> >cluster.shd-max-threads: 8
> >cluster.shd-wait-qlength: 10000
> >features.shard: on
> >user.cifs: off
> >cluster.choose-local: off
> >client.event-threads: 4
> >server.event-threads: 4
> >storage.owner-uid: 36
> >storage.owner-gid: 36
> >performance.strict-o-direct: on
> >performance.cache-size: 64MB
> >performance.write-behind-window-size: 128MB
> >features.shard-block-size: 64MB
> >cluster.brick-multiplex: on
> >
> >Thanks Olaf
>
> Hi Olaf,
>
> Thanks for the detailed output.
> At first glance I noticed that you have a HostedEngine domain for both
> oVirt's engine VM and for other VMs, is that right?
> If yes, that's against best practices and not recommended.
> Second, you use brick multiplexing, but according to RH documentation
> that feature is not supported for your workload, so in your case it's
> drawing attention but should not be a problem.
>
> Can you specify how many physical hosts you have?
>
> I will try to check the output more deeply, but I think you need to check:
> 1. Check gluster heal status - any pending heals should be resolved.
> 2. Use telnet/nc/ncat/netcat to verify that each host can reach the peers'
> brick ports.
> 3. gluster volume heal <volume> info should report all bricks as connected;
> gluster volume status must report that all bricks have a PID.
> 4. OPTIONAL - Try to create smaller disks via oVirt (it's not a good idea
> to have large qcow2 disks) and assign them to your mysql VM. Then try to
> pvmove the LVs from the disk (mounted with loop) to the new disks - that
> way you can get rid of the old qcow disk.
> 5. What is your oVirt version? Could it be an old 3.x?
>
> Don't forget to backup :)
>
> Best Regards,
> Strahil Nikolov
>

