[Gluster-users] [ovirt-users] Re: Announcing Gluster release 5.5

Fri Mar 29 07:16:33 UTC 2019

Questions/comments inline ...

On Thu, Mar 28, 2019 at 10:18 PM <olaf.buitelaar at gmail.com> wrote:

> Dear All,
>
> I wanted to share my experience upgrading from 4.2.8 to 4.3.1. While
> previous upgrades (4.1 to 4.2, etc.) went rather smoothly, this one was a
> different experience. After a test upgrade on a 3-node setup went fine, I
> went on to upgrade the 9-node production platform, unaware of the
> backward-compatibility issues between gluster 3.12.15 and 5.3. After
> upgrading 2 nodes, the HA engine stopped and wouldn't start. Vdsm wasn't
> able to mount the engine storage domain, since /dom_md/metadata was
> missing or couldn't be accessed. I restored this file by taking a good
> copy from the underlying bricks, removing the file (and the corresponding
> gfids) from the bricks where it was 0 bytes and marked with the sticky
> bit, removing the file from the mount point, and copying the good copy
> back onto the mount point. After manually mounting the engine domain,
> manually creating the corresponding symbolic links in /rhev/data-center
> and /var/run/vdsm/storage, and fixing the ownership back to vdsm.kvm (it
> was root.root), I was able to start the HA engine again.
>
> With the engine up again but things still rather unstable, I decided to
> continue the upgrade on the other nodes: suspecting an incompatibility
> between gluster versions, I thought it best to get them all on the same
> version soon. However, things went from bad to worse: the engine stopped
> again, and all VMs stopped working as well. So, on a machine outside the
> setup, I restored a backup of the engine taken on version 4.2.8 just
> before the upgrade. With this engine I was at least able to start some VMs
> again and finalize the upgrade. Once upgraded, things didn’t stabilize,
> and I also lost 2 VMs during the process due to image corruption.
>
> After figuring out that gluster 5.3 had quite a few issues, I was lucky to
> see that gluster 5.5 was about to be released; the moment the RPMs were
> available I installed them. This helped a lot in terms of stability, for
> which I’m very grateful! However, the performance is unfortunately
> terrible: about 15% of what it was on gluster 3.12.15. That is strange,
> since a simple dd shows OK performance but our actual workload doesn’t,
> and I would have expected performance to be better given all the
> improvements made since gluster 3.12. Does anybody share the same
> experience?
> I really hope gluster 6 will soon be tested with oVirt and released, and
> that things start to perform and stabilize again, like the good old days.
> Of course, if there is anything I can do, I’m happy to help.
>
> I think the following is a short list of the issues we have after the
> migration:
> Gluster 5.5:
> -       Poor performance for our workload (mostly write dependent)
>

For this, could you share the volume-profile output specifically for the
affected volume(s)? Here's what you need to do -

1. # gluster volume profile $VOLNAME stop
2. # gluster volume profile $VOLNAME start
3. Run the test inside the VM in which you see bad performance
4. # gluster volume profile $VOLNAME info
   (save the output of this command into a file)
5. # gluster volume profile $VOLNAME stop
6. Attach the output file from step 4
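
The steps above could be wrapped in a small shell function along these lines. This is only a sketch: `capture_profile` and its two arguments are hypothetical names, it assumes it runs as root on a node where the `gluster` CLI is available, and it pauses so that the test can be run inside the VM between `start` and `info`.

```shell
# Sketch only: capture_profile is a hypothetical helper, not part of gluster.
# $1 = volume name, $2 = file that receives the "profile info" output.
capture_profile() {
    local volname="$1" outfile="$2"

    # Step 1: stop any earlier profiling session so the counters start clean.
    # This fails harmlessly when profiling was not running.
    gluster volume profile "$volname" stop >/dev/null 2>&1 || true

    # Step 2: start a fresh profiling session.
    gluster volume profile "$volname" start || return 1

    # Step 3: wait while the slow workload is run inside the VM.
    printf 'Run the test inside the VM, then press Enter: ' >&2
    read -r _reply || true

    # Step 4: save the profile output to the file to be attached.
    gluster volume profile "$volname" info > "$outfile"

    # Step 5: stop profiling again.
    gluster volume profile "$volname" stop
}
```

For example, `capture_profile "$VOLNAME" /tmp/profile.txt` would leave the output to attach in /tmp/profile.txt (the path is just an example).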

> -       VMs randomly pause on unknown storage errors, which are “stale
> file” errors. Corresponding log: Lookup on shard 797 failed. Base file
> gfid = 8a27b91a-ff02-42dc-bd4c-caa019424de8 [Stale file handle]
>

Could you share the complete gluster client log file (its filename would
match the pattern rhev-data-center-mnt-glusterSD-*)?
Please also share the output of `gluster volume info $VOLNAME`.
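
While collecting the log, a quick summary of which base files are hit by the stale-handle errors may also be useful. The sketch below assumes each message sits on a single line in the client log and has the wording quoted above; `stale_shard_gfids` is a hypothetical helper name, and /var/log/glusterfs is assumed to be the client log directory.

```shell
# Sketch only: stale_shard_gfids is a hypothetical helper.
# Reads client log files given as arguments (or stdin when none are given)
# and prints "<base file gfid> <number of stale-handle shard lookups>".
stale_shard_gfids() {
    grep -F 'Stale file handle' "$@" \
        | sed -n 's/.*Base file gfid = \([0-9a-f-]\{36\}\).*/\1/p' \
        | sort | uniq -c \
        | awk '{ print $2, $1 }'
}

# Assumed usage (log directory is an assumption, pattern is from above):
#   stale_shard_gfids /var/log/glusterfs/rhev-data-center-mnt-glusterSD-*
#   gluster volume info "$VOLNAME" > volume-info.txt
```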

> -       Some files are listed twice in a directory (probably related to
> the stale file issue?)
> Example:
> ls -la /rhev/data-center/59cd53a9-0003-02d7-00eb-0000000001e3/313f5d25-76af-4ecd-9a20-82a2fe815a3c/images/4add6751-3731-4bbd-ae94-aaeed12ea450/
> total 3081
> drwxr-x---.  2 vdsm kvm    4096 Mar 18 11:34 .
> drwxr-xr-x. 13 vdsm kvm    4096 Mar 19 09:42 ..
> -rw-rw----.  1 vdsm kvm 1048576 Mar 28 12:55 1a7cf259-6b29-421d-9688-b25dfaafb13c
> -rw-rw----.  1 vdsm kvm 1048576 Mar 28 12:55 1a7cf259-6b29-421d-9688-b25dfaafb13c
> -rw-rw----.  1 vdsm kvm 1048576 Jan 27  2018 1a7cf259-6b29-421d-9688-b25dfaafb13c.lease
> -rw-r--r--.  1 vdsm kvm     290 Jan 27  2018 1a7cf259-6b29-421d-9688-b25dfaafb13c.meta
> -rw-r--r--.  1 vdsm kvm     290 Jan 27  2018 1a7cf259-6b29-421d-9688-b25dfaafb13c.meta
>

Adding DHT and readdir-ahead maintainers regarding entries getting listed
twice.
@Nithya Balachandran <nbalacha at redhat.com> ^^
@Gowdappa, Raghavendra <rgowdapp at redhat.com> ^^
@Poornima Gurusiddaiah <pgurusid at redhat.com> ^^
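
To find out how many directories are affected, the names returned by a listing can be checked for repeats. This is a rough sketch: `duplicate_names` and `duplicate_entries` are hypothetical helpers, and file names containing newlines are not handled.

```shell
# Sketch only: both function names are hypothetical.

# Print every name that occurs more than once on stdin (one name per line).
duplicate_names() {
    sort | uniq -d
}

# Print the entries that a listing of directory $1 returns more than once.
duplicate_entries() {
    ls -a "$1" | duplicate_names
}
```

Run against the image directory from the example above, `duplicate_entries` would be expected to print the image name and its .meta file, and nothing on a directory that lists cleanly.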

>
> - Brick processes sometimes start multiple times; sometimes I have 5 brick
> processes for a single volume. Killing all glusterfsd processes for the
> volume on the machine and running gluster v start <vol> force usually
> starts just one, and from then on things look all right.
>

Did you mean 5 brick processes for a single brick directory?
+Mohit Agrawal <moagrawa at redhat.com> ^^
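
One way to answer that question is to count glusterfsd processes per brick directory. The sketch below assumes each glusterfsd command line carries a `--brick-name <path>` argument and that brick multiplexing is not in use; `bricks_with_multiple_processes` is a hypothetical helper name.

```shell
# Sketch only: bricks_with_multiple_processes is a hypothetical helper.
# Reads process command lines on stdin (one per line) and prints
# "<brick path> <process count>" for every brick path that is served by
# more than one glusterfsd process.
bricks_with_multiple_processes() {
    awk '/glusterfsd/ {
             for (i = 1; i < NF; i++)
                 if ($i == "--brick-name") count[$(i + 1)]++
         }
         END {
             for (b in count)
                 if (count[b] > 1) print b, count[b]
         }' | sort
}

# Assumed usage on a gluster node:
#   ps -eo args | bricks_with_multiple_processes
```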

-Krutika

> oVirt 4.3.2.1-1.el7:
> -       Ownership of all VM images is changed to root.root after the VM is
> shut down, probably related to
> https://bugzilla.redhat.com/show_bug.cgi?id=1666795 but not scoped only
> to the HA engine. I’m still in compatibility mode 4.2 for the cluster and
> for the VMs, but upgraded to oVirt version 4.3.2.
> -       The network provider is set to ovn, which is fine (actually cool),
> but “ovs-vswitchd” is a CPU hog and utilizes 100%.
> -       It seems that on all nodes vdsm tries to get the stats for the HA
> engine, which fills the logs with (not sure if this is new):
> [api.virt] FINISH getStats return={'status': {'message': "Virtual machine
> does not exist: {'vmId': u'20d69acd-edfd-4aeb-a2ae-49e9c121b7e9'}", 'code':
> 1}} from=::1,59290, vmId=20d69acd-edfd-4aeb-a2ae-49e9c121b7e9 (api:54)
> -       The os_brick package seems to be missing: “[root] managedvolume not
> supported: Managed Volume Not Supported. Missing package os-brick.:
> ('Cannot import os_brick',) (caps:149)” fills the vdsm.log. I also saw
> another message about this, so I suspect it will be resolved shortly.
> -       The machine I used to run the backup HA engine cannot be removed
> from the hosted-engine --vm-status output, not even after running
> hosted-engine --clean-metadata --host-id=10 --force-clean, or
> hosted-engine --clean-metadata --force-clean from the machine itself.
>
> Think that's about it.
>
> Don’t get me wrong, I don’t want to rant; I just wanted to share my
> experience and see where things can be made better.
>
>
> Best Olaf