<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<style type="text/css" style="display:none;"><!-- P {margin-top:0;margin-bottom:0;} --></style>
</head>
<body dir="ltr">
<div id="divtagdefaultwrapper" style="font-size:12pt;color:#000000;font-family:Calibri,Helvetica,sans-serif;" dir="ltr">
<p style="margin-top:0;margin-bottom:0">Hello,&nbsp; I've been struggling to figure out a few issues that I've been having with my 3 node glusterfs setup.&nbsp; We have been experiencing an issue where one of the gluster nodes decides that it can't communicate with another
 node and it seems like it restarts glusterd which then causes glusterd to restart on the node it can't communicate with and in the end causes the 3rd node to loose all communication with any of the other 2 nodes which causes it to restart glusterd due to quorum
 being lost.&nbsp; Which means I end of with ovirt VM that is critical to our business to stop responding while all of this is taking place.&nbsp; I've checked with our network team and they can't seem to find any issues on the 10gbe switch these systems are all connected
 to for glusterfs communications so I'm at a lost as to whats causing this.&nbsp; Our setup is as follows.</p>
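<p style="margin-top:0;margin-bottom:0"><br></p>
<p style="margin-top:0;margin-bottom:0">When it happens, about all I know to capture on each SAN is the standard CLI state, so roughly the following (just a sketch of the commands I'd run, not output from the incident):</p>
<div># on each of san1/san2/san3 while the problem is occurring</div>
<div>gluster peer status</div>
<div>gluster volume status gv1</div>
<div>gluster volume heal gv1 info</div>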
<p style="margin-top:0;margin-bottom:0"><br>
</p>
<p style="margin-top:0;margin-bottom:0">san1 10.4.16.11 (10Gbe IP)&nbsp; These all have a 1Gb public facing interface with access to the web.</p>
<p style="margin-top:0;margin-bottom:0">san2 10.4.16.12<span>&nbsp;(10Gbe IP)&nbsp; These all have a 1Gb public facing interface with access to the web.</span></p>
<p style="margin-top:0;margin-bottom:0">san3 10.4.16.19<span>&nbsp;(10Gbe IP)&nbsp; These all have a 1Gb public facing interface with access to the web.</span></p>
<p style="margin-top:0;margin-bottom:0"><span><br>
</span></p>
<p style="margin-top:0;margin-bottom:0"><span>hv1-7&nbsp; All communicating on the same 10Gbe switch to the glusterfs sans</span></p>
<p style="margin-top:0;margin-bottom:0"><span><br>
</span></p>
<p style="margin-top:0;margin-bottom:0"><span>The first log entry around the time of glusterfsd restart is on san2.</span></p>
<p style="margin-top:0;margin-bottom:0"><span><span>[2018-07-11 19:16:09.130303] W [socket.c:593:__socket_rwv] 0-management: readv on 10.4.16.11:24007 failed (Connection timed out)</span><br>
</span></p>
<p style="margin-top:0;margin-bottom:0"><span><span><br>
</span></span></p>
<p style="margin-top:0;margin-bottom:0"><span><span>Followed by san1</span></span></p>
<p style="margin-top:0;margin-bottom:0"><span><span><span>[2018-07-11 19:16:09.169704] I [MSGID: 106004] [glusterd-handler.c:6317:__glusterd_peer_rpc_notify] 0-management: Peer &lt;10.4.16.12&gt;</span><br>
</span></span></p>
<p style="margin-top:0;margin-bottom:0"><span><span><span><br>
</span></span></span></p>
<p style="margin-top:0;margin-bottom:0"><span><span><span>Back on san2</span></span></span></p>
<p style="margin-top:0;margin-bottom:0"><span><span><span><span>[2018-07-11 19:16:09.172360] I [MSGID: 106004] [glusterd-handler.c:6317:__glusterd_peer_rpc_notify] 0-management: Peer &lt;10.4.16.11&gt; (&lt;0f3090ee-080b-4a6b-9964-0ca86d801469&gt;), in state &lt;Peer in Cluster&gt;,
 has disconnected from glusterd.</span><br>
</span></span></span></p>
<p style="margin-top:0;margin-bottom:0"><span><span><span><span><br>
</span></span></span></span></p>
<p style="margin-top:0;margin-bottom:0"><span><span><span><span>on san1</span></span></span></span></p>
<p style="margin-top:0;margin-bottom:0"><span><span><span><span><span>[2018-07-11 19:16:09.194170] W [glusterd-locks.c:843:glusterd_mgmt_v3_unlock] (--&gt;/usr/lib64/glusterfs/3.12.6/xlator/mgmt/glusterd.so(&#43;0x2322a) [0x7f041e73a22a] --&gt;/usr/lib64/glusterfs/3.12.6/xlator/mgmt/glusterd.so(&#43;0x2d198)
 [0x7f041e744198] --&gt;/usr/lib64/glusterfs/3.12.6/xlator/mgmt/glusterd.so(&#43;0xe4765) [0x7f041e7fb765] ) 0-management: Lock for vol EXPORTB not held</span><br>
</span></span></span></span></p>
<p style="margin-top:0;margin-bottom:0"><span><span><span><span><span><br>
</span></span></span></span></span></p>
<p style="margin-top:0;margin-bottom:0"><span><span><span><span><span>and it seems to spiral from there at around 19:16:12.488534 san3 chimes in about san2 connection failed.&nbsp; Then a few seconds later decides san1 is also not accessible causing it to also restart
 glusterd.</span></span></span></span></span></p>
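<p style="margin-top:0;margin-bottom:0"><br></p>
<p style="margin-top:0;margin-bottom:0">To line the three SANs up I'm just looking at the glusterd and self-heal daemon logs on each one for that window, something like the below (default RPM log locations assumed, adjust if yours differ):</p>
<div># on each SAN</div>
<div>grep '2018-07-11 19:16' /var/log/glusterfs/glusterd.log</div>
<div>grep '2018-07-11 19:16' /var/log/glusterfs/glustershd.log</div>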
<p style="margin-top:0;margin-bottom:0"><span><span><span><span><span><br>
</span></span></span></span></span></p>
<p style="margin-top:0;margin-bottom:0"><span><span><span><span><span>Mean while on one of the HV (one with the critical vm running on it) I see this log entry around the time this all starts.</span></span></span></span></span></p>
<p style="margin-top:0;margin-bottom:0"><span><span><span><span><span><br>
</span></span></span></span></span></p>
<p style="margin-top:0;margin-bottom:0"><span><span><span><span><span>rhev-data-center-mnt-glusterSD-10.4.16.11\:gv1.log</span></span></span></span></span></p>
<p style="margin-top:0;margin-bottom:0"><span><span><span><span><span><br>
</span></span></span></span></span></p>
<p style="margin-top:0;margin-bottom:0"><span><span><span><span><span></p>
<div>[2018-07-11 19:16:14.918389] W [socket.c:593:__socket_rwv] 0-gv1-client-2: readv on 10.4.16.19:49153 failed (No data available)</div>
<div><br>
</div>
</span></span></span></span></span>
<p></p>
<p style="margin-top:0;margin-bottom:0"><span><span><span><span><span>I've attached the full logs from the 19:16 time frame.</span></span></span></span></span></p>
<p style="margin-top:0;margin-bottom:0"><span><span><span><span><span><br>
</span></span></span></span></span></p>
<p style="margin-top:0;margin-bottom:0"><span><span><span><span><span>I need some help figuring out what could be causing this issue and what to check next.</span></span></span></span></span></p>
<p style="margin-top:0;margin-bottom:0"><span><span><span><span><span><br>
</span></span></span></span></span></p>
<p style="margin-top:0;margin-bottom:0"><span><span><span><span><span>Additonal information:</span></span></span></span></span></p>
<p style="margin-top:0;margin-bottom:0"><span><span><span><span><span>Glusterfs v3.12.6-1 is running on all 3 sans.&nbsp; I had to stop patching them due to the fact that if I did so one at a time and rebooted and allowed them to completely heal before patching
 and rebooting&nbsp; the next would 100% cause a similar issue.</span></span></span></span></span></p>
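<p style="margin-top:0;margin-bottom:0"><br></p>
<p style="margin-top:0;margin-bottom:0">For what it's worth, "completely heal" above just means waiting for the heal queues to drain before touching the next SAN, i.e. something like:</p>
<div># run after each SAN comes back; wait until both report zero pending entries</div>
<div>gluster volume heal gv1 info</div>
<div>gluster volume heal gv1 statistics heal-count</div>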
<p style="margin-top:0;margin-bottom:0"><span><span><span><span><span><br>
</span></span></span></span></span></p>
<p style="margin-top:0;margin-bottom:0"><span><span><span><span><span>Glusterfs v3.12.11-1 is running on all 7 hv in the ovirt cloud.</span></span></span></span></span></p>
<p style="margin-top:0;margin-bottom:0"><span><span><span><span><span><br>
</span></span></span></span></span></p>
<p style="margin-top:0;margin-bottom:0"><span><span><span><span><span></p>
<div># gluster volume info gv1</div>
<div>&nbsp;</div>
<div>Volume Name: gv1</div>
<div>Type: Replicate</div>
<div>Volume ID: ea12f72d-a228-43ba-a360-4477cada292a</div>
<div>Status: Started</div>
<div>Snapshot Count: 0</div>
<div>Number of Bricks: 1 x 3 = 3</div>
<div>Transport-type: tcp</div>
<div>Bricks:</div>
<div>Brick1: 10.4.16.19:/glusterfs/data1/gv1</div>
<div>Brick2: 10.4.16.11:/glusterfs/data1/gv1</div>
<div>Brick3: 10.4.16.12:/glusterfs/data1/gv1</div>
<div>Options Reconfigured:</div>
<div>network.ping-timeout: 50</div>
<div>nfs.register-with-portmap: on</div>
<div>nfs.export-volumes: on</div>
<div>nfs.addr-namelookup: off</div>
<div>cluster.server-quorum-type: server</div>
<div>cluster.quorum-type: auto</div>
<div>network.remote-dio: enable</div>
<div>cluster.eager-lock: enable</div>
<div>performance.stat-prefetch: off</div>
<div>performance.io-cache: off</div>
<div>performance.read-ahead: off</div>
<div>performance.quick-read: off</div>
<div>storage.owner-uid: 36</div>
<div>storage.owner-gid: 36</div>
<div>server.allow-insecure: on</div>
<div>nfs.disable: off</div>
<div>nfs.rpc-auth-allow: 10.4.16.*</div>
<div>auth.allow: 10.4.16.*</div>
<div>cluster.self-heal-daemon: enable</div>
<div>diagnostics.latency-measurement: on</div>
<div>diagnostics.count-fop-hits: on</div>
<div>cluster.server-quorum-ratio: 51%</div>
<div><br>
</div>
<p style="margin-top:0;margin-bottom:0">We have another gluster volume, EXPORTB, that is configured similarly, but since it isn't used for much it isn't running into this issue.</p>
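<p style="margin-top:0;margin-bottom:0"><br></p>
<p style="margin-top:0;margin-bottom:0">My reading of the quorum options above (please correct me if I have this wrong): with cluster.server-quorum-type set to server and cluster.server-quorum-ratio at 51%, a SAN whose glusterd loses contact with both of its peers only sees 1 of 3 nodes (about 33%), so it falls below quorum and stops its local bricks, and with network.ping-timeout at 50 the clients on the HVs hang for up to 50 seconds before giving up on a brick, which would match the VM appearing to freeze while this plays out. I've been double-checking the effective values with something like:</p>
<div>gluster volume get gv1 all | grep -E 'quorum|ping-timeout'</div>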
</div>
<div></div>
<p><span style="font-size:10.5pt;font-family:&quot;Arial&quot;,&quot;sans-serif&quot;;color:black">Edward Clay</span>
<br>
<span style="font-size:8.5pt;font-family:&quot;Arial&quot;,&quot;sans-serif&quot;;color:black">Systems Administrator</span><br>
<span style="font-size:8.5pt;font-family:&quot;Arial&quot;,&quot;sans-serif&quot;;color:#019EEB"><a href="http://www.thehutgroup.com/" target="_blank"><span style="color:#575a5d;
text-decoration:none;text-underline:none">The Hut Group</span></a></span>
<br>
<br>
<span style="font-size:8.5pt;font-family:&quot;Arial&quot;,&quot;sans-serif&quot;;color:black">Tel: </span>
<br>
<span style="font-size:8.5pt;font-family:&quot;Arial&quot;,&quot;sans-serif&quot;;color:black">Email:
<a href="mailto:edward.clay@uk2group.com">edward.clay@uk2group.com</a></span></p>
<p style="margin-bottom:12.0pt"><br>
<span style="font-size:7.5pt;font-family:&quot;Arial&quot;,&quot;sans-serif&quot;;color:#666666">For the purposes of this email, the &quot;company&quot; means The Hut Group Limited, a company registered in England and Wales (company number 6539496) whose registered office is at Fifth Floor,
 Voyager House, Chicago Avenue, Manchester Airport, M90 3DQ and/or any of its respective subsidiaries.</span>
<br>
<br>
<b><span style="font-size:7.5pt;font-family:&quot;Arial&quot;,&quot;sans-serif&quot;;color:#666666">Confidentiality Notice</span></b>
<br>
<span style="font-size:7.5pt;font-family:&quot;Arial&quot;,&quot;sans-serif&quot;;color:#666666">This e-mail is confidential and intended for the use of the named recipient only. If you are not the intended recipient please notify us by telephone immediately on &#43;44(0)1606 811888
 or return it to us by e-mail. Please then delete it from your system and note that any use, dissemination, forwarding, printing or copying is strictly prohibited. Any views or opinions are solely those of the author and do not necessarily represent those of
 the company.</span> <br>
<br>
<b><span style="font-size:7.5pt;font-family:&quot;Arial&quot;,&quot;sans-serif&quot;;color:#666666">Encryptions and Viruses</span></b>
<br>
<span style="font-size:7.5pt;font-family:&quot;Arial&quot;,&quot;sans-serif&quot;;color:#666666">Please note that this e-mail and any attachments have not been encrypted. They may therefore be liable to be compromised. Please also note that it is your responsibility to scan this
 e-mail and any attachments for viruses. We do not, to the extent permitted by law, accept any liability (whether in contract, negligence or otherwise) for any virus infection and/or external compromise of security and/or confidentiality in relation to transmissions
 sent by e-mail.</span> <br>
<br>
<b><span style="font-size:7.5pt;font-family:&quot;Arial&quot;,&quot;sans-serif&quot;;color:#666666">Monitoring</span></b>
<br>
<span style="font-size:7.5pt;font-family:&quot;Arial&quot;,&quot;sans-serif&quot;;color:#666666">Activity and use of the company's systems is monitored to secure its effective use and operation and for other lawful business purposes. Communications using these systems will also
 be monitored and may be recorded to secure effective use and operation and for other lawful business purposes.</span>
</p>
<span style="font-size:4pt;color:#FFFFFF">hgvyjuv</span>
<div></div>
</body>
</html>