<div dir="ltr"><div dir="ltr"><div dir="ltr"><br></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Fri, Mar 29, 2019 at 12:47 PM Krutika Dhananjay &lt;<a href="mailto:kdhananj@redhat.com">kdhananj@redhat.com</a>&gt; wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div dir="ltr"><div>Questions/comments inline ...<br></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Thu, Mar 28, 2019 at 10:18 PM &lt;<a href="mailto:olaf.buitelaar@gmail.com" target="_blank">olaf.buitelaar@gmail.com</a>&gt; wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">Dear All,<br>

<br>

I wanted to share my experience upgrading from 4.2.8 to 4.3.1. While previous upgrades (4.1 to 4.2, etc.) went rather smoothly, this one was a different experience. A first test upgrade on a 3-node setup went fine, so I went ahead and upgraded the 9-node production platform, unaware of the backward compatibility issues between gluster 3.12.15 -&gt; 5.3. After upgrading 2 nodes, the HA engine stopped and wouldn&#39;t start. Vdsm wasn&#39;t able to mount the engine storage domain, since /dom_md/metadata was missing or couldn&#39;t be accessed. I restored this file by getting a good copy from the underlying bricks, removing the file (together with the corresponding gfid&#39;s) from the underlying bricks where it was 0 bytes and marked with the sticky bit, removing the file from the mount point, and copying the good copy back onto the mount point. After manually mounting the engine domain, manually creating the corresponding symbolic links in /rhev/data-center and /var/run/vdsm/storage, and fixing the ownership back to vdsm.kvm (it was root.root), I was able to start the HA engine again.<br>

<br>

Since the engine was up again and things seemed rather unstable, I decided to continue the upgrade on the other nodes. Suspecting an incompatibility between gluster versions, I thought it would be best to have them all on the same version rather soon. However, things went from bad to worse: the engine stopped again, and all VMs stopped working as well. So, on a machine outside the setup, I restored a backup of the engine taken from version 4.2.8 just before the upgrade. With this engine I was at least able to start some VMs again and finalize the upgrade. Once upgraded, things didn’t stabilize, and I also lost 2 VMs during the process due to image corruption. After figuring out that gluster 5.3 had quite some issues, I was lucky to see that gluster 5.5 was about to be released; the moment the RPMs were available, I installed them.<br>

<br>

This helped a lot in terms of stability, for which I’m very grateful! However, the performance is unfortunately terrible: it’s about 15% of what it was when running gluster 3.12.15. It’s strange, since a simple dd shows OK performance but our actual workload doesn’t. I would have expected the performance to be better, given all the improvements made since gluster version 3.12. Does anybody share the same experience?<br>

I really hope gluster 6 will soon be tested with ovirt and released, and that things start to perform and stabilize again, like the good old days. Of course, if I can do anything, I’m happy to help.<br>

<br>

I think the following is a short list of the issues we have after the migration:<br>

Gluster 5.5:<br>

-       Poor performance for our workload (mostly write dependent)<br></blockquote><div><br></div><div>For this, could you share the volume-profile output specifically for the affected volume(s)? Here&#39;s what you need to do -</div><div><br></div><div>1. # gluster volume profile $VOLNAME stop</div><div>2. # gluster volume profile $VOLNAME start</div><div>3. Run the test inside the vm wherein you see bad performance</div><div>4. # gluster volume profile $VOLNAME info # save the output of this command into a file</div><div>5. # gluster volume profile $VOLNAME stop</div><div>6. and attach the output file gotten in step 4<br></div><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
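</blockquote><div>The profiling steps above could be wrapped in a small script. This is only a sketch under assumptions: the function name capture_profile, the GLUSTER override variable and the output path are made up for illustration and are not part of the gluster CLI; only the `gluster volume profile` subcommands are taken from the steps above.</div>

```shell
#!/bin/sh
# Sketch only: capture_profile and the GLUSTER variable are hypothetical helpers.
# GLUSTER can be overridden (e.g. GLUSTER=echo) for a dry run without a cluster.
GLUSTER="${GLUSTER:-gluster}"

capture_profile() {
    volname="$1"   # volume to profile ($VOLNAME in the steps above)
    outfile="$2"   # file that receives the output of step 4

    # Step 1: stop any earlier profiling session; ignore the error
    # returned when profiling was not running.
    $GLUSTER volume profile "$volname" stop >/dev/null 2>&1 || true

    # Step 2: start a fresh profiling session.
    $GLUSTER volume profile "$volname" start || return 1

    # Step 3: the slow workload has to be run by hand inside the VM.
    printf 'Run the test inside the VM, then press Enter: ' >&2
    read -r reply || true

    # Step 4: save the profile output into a file.
    $GLUSTER volume profile "$volname" info > "$outfile"

    # Step 5: stop profiling again.
    $GLUSTER volume profile "$volname" stop
}
```

<div>For example, `capture_profile $VOLNAME /tmp/profile.txt` (the path is arbitrary) would leave the step 4 output in /tmp/profile.txt, ready to attach as in step 6.</div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">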

-       VMs randomly pause on unknown storage errors, which are “stale file” errors. Corresponding log: Lookup on shard 797 failed. Base file gfid = 8a27b91a-ff02-42dc-bd4c-caa019424de8 [Stale file handle]<br></blockquote><div><br></div><div>Could you share the complete gluster client log file (it would be a filename matching the pattern rhev-data-center-mnt-glusterSD-*)?</div><div>Also the output of `gluster volume info $VOLNAME`.</div><div><br></div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

-       Some files are listed twice in a directory (probably related to the stale file issue?)<br>

Example:<br>

ls -la  /rhev/data-center/59cd53a9-0003-02d7-00eb-0000000001e3/313f5d25-76af-4ecd-9a20-82a2fe815a3c/images/4add6751-3731-4bbd-ae94-aaeed12ea450/<br>

total 3081<br>

drwxr-x---.  2 vdsm kvm    4096 Mar 18 11:34 .<br>

drwxr-xr-x. 13 vdsm kvm    4096 Mar 19 09:42 ..<br>

-rw-rw----.  1 vdsm kvm 1048576 Mar 28 12:55 1a7cf259-6b29-421d-9688-b25dfaafb13c<br>

-rw-rw----.  1 vdsm kvm 1048576 Mar 28 12:55 1a7cf259-6b29-421d-9688-b25dfaafb13c<br>

-rw-rw----.  1 vdsm kvm 1048576 Jan 27  2018 1a7cf259-6b29-421d-9688-b25dfaafb13c.lease<br>

-rw-r--r--.  1 vdsm kvm     290 Jan 27  2018 1a7cf259-6b29-421d-9688-b25dfaafb13c.meta<br>

-rw-r--r--.  1 vdsm kvm     290 Jan 27  2018 1a7cf259-6b29-421d-9688-b25dfaafb13c.meta<br></blockquote><div><br></div><div>Adding DHT and readdir-ahead maintainers regarding entries getting listed twice.</div><div><a class="gmail_plusreply" id="gmail-m_-6527618216840402220plusReplyChip-0" href="mailto:nbalacha@redhat.com" target="_blank">@Nithya Balachandran</a> ^^</div><div><a class="gmail_plusreply" id="gmail-m_-6527618216840402220plusReplyChip-1" href="mailto:rgowdapp@redhat.com" target="_blank">@Gowdappa, Raghavendra</a> ^^</div><div><a class="gmail_plusreply" id="gmail-m_-6527618216840402220plusReplyChip-2" href="mailto:pgurusid@redhat.com" target="_blank">@Poornima Gurusiddaiah</a> ^^<br></div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

<br>

- Brick processes sometimes start multiple times. Sometimes I have 5 brick processes for a single volume. Killing all glusterfsd processes for the volume on the machine and running gluster v start &lt;vol&gt; force usually starts just one afterwards; from then on things look all right. <br></blockquote><div><br></div><div>Did you mean 5 brick processes for a single brick directory?</div><div><a class="gmail_plusreply" id="gmail-m_-6527618216840402220plusReplyChip-5" href="mailto:moagrawa@redhat.com" target="_blank">+Mohit Agrawal</a> ^^</div></div></div></div></blockquote><div><br></div><div>Mohit - Could this be because the following commit is missing from the release-5 branch? It might be worth backporting this fix.<br></div><div><br></div><div>commit 66986594a9023c49e61b32769b7e6b260b600626<br>Author: Mohit Agrawal &lt;<a href="mailto:moagrawal@redhat.com">moagrawal@redhat.com</a>&gt;<br>Date:   Fri Mar 1 13:41:24 2019 +0530<br><br>    glusterfsd: Multiple shd processes are spawned on brick_mux environment<br>    <br>    Problem: Multiple shd processes are spawned while starting volumes<br>             in the loop on brick_mux environment.glusterd spawn a process<br>             based on a pidfile and shd daemon is taking some time to<br>             update pid in pidfile due to that glusterd is not able to<br>             get shd pid<br>    <br>    Solution: Commit cd249f4cb783f8d79e79468c455732669e835a4f changed<br>              the code to update pidfile in parent for any gluster daemon<br>              after getting the status of forking child in parent.To resolve<br>              the same correct the condition update pidfile in parent only<br>              for glusterd and for rest of the daemon pidfile is updated in<br>              child<br>    <br>    Change-Id: Ifd14797fa949562594a285ec82d58384ad717e81<br>    fixes: bz#1684404<br>    Signed-off-by: Mohit Agrawal &lt;<a href="mailto:moagrawal@redhat.com">moagrawal@redhat.com</a>&gt;<br></div><div><br></div><div> 
</div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div dir="ltr"><div class="gmail_quote"><div><br></div><div>-Krutika</div><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

<br>

Ovirt 4.3.2.1-1.el7<br>

-       The ownership of all VM images is changed to root.root after the VM is shut down, probably related to <a href="https://bugzilla.redhat.com/show_bug.cgi?id=1666795" rel="noreferrer" target="_blank">https://bugzilla.redhat.com/show_bug.cgi?id=1666795</a>, but not scoped only to the HA engine. I’m still in compatibility mode 4.2 for the cluster and for the VMs, but upgraded to ovirt version 4.3.2.<br>

-       The network provider is set to ovn, which is fine..actually cool, only the “ovs-vswitchd” is a CPU hog, and utilizes 100%<br>

-       It seems that on all nodes vdsm tries to get the stats for the HA engine, which fills the logs with the following (not sure if this is new):<br>

[api.virt] FINISH getStats return={&#39;status&#39;: {&#39;message&#39;: &quot;Virtual machine does not exist: {&#39;vmId&#39;: u&#39;20d69acd-edfd-4aeb-a2ae-49e9c121b7e9&#39;}&quot;, &#39;code&#39;: 1}} from=::1,59290, vmId=20d69acd-edfd-4aeb-a2ae-49e9c121b7e9 (api:54)<br>

-       It seems the package os_brick is missing, which fills the vdsm.log with: [root] managedvolume not supported: Managed Volume Not Supported. Missing package os-brick.: (&#39;Cannot import os_brick&#39;,) (caps:149). I also saw another message about this, so I suspect it will be resolved shortly.<br>

-       The machine I used to run the backup HA engine can’t be removed from the hosted-engine --vm-status output, not even after running hosted-engine --clean-metadata --host-id=10 --force-clean, or hosted-engine --clean-metadata --force-clean from the machine itself.<br>

<br>

Think that&#39;s about it.<br>

<br>

Don’t get me wrong, I don’t want to rant; I just wanted to share my experience and see where things can be made better. <br>

<br>

<br>

Best Olaf<br>

List Archives: <a href="https://lists.ovirt.org/archives/list/users@ovirt.org/message/3CO35Q7VZMWNHS4LPUJNO7S47MGLSKS5/" rel="noreferrer" target="_blank">https://lists.ovirt.org/archives/list/users@ovirt.org/message/3CO35Q7VZMWNHS4LPUJNO7S47MGLSKS5/</a><br>

</blockquote></div></div></div>

</blockquote></div></div></div>