<div dir="ltr"><br><div class="gmail_extra"><br><div class="gmail_quote">On Fri, Jun 30, 2017 at 1:31 AM, Jan <span dir="ltr">&lt;<a href="mailto:jan.h.zak@gmail.com" target="_blank">jan.h.zak@gmail.com</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div>Hi all,</div><div><br></div><div>Gluster and Ganesha are amazing. Thank you for this great work!</div><div><br></div><div>I’m struggling with one issue, and I think you might be able to help me.</div><div><br></div><div>I spent some time playing with Gluster and Ganesha, and after gaining some experience I decided to go into production, but one issue remains.</div><div> </div><div>I have a 3-node CentOS 7.3 cluster with the most current Gluster and Ganesha from the centos-gluster310 repository (3.10.2-1.el7), with replicated bricks.</div><div><br></div><div>The servers have plenty of resources and run in a subnet on a stable network.</div><div><br></div><div>I didn’t have any issues when I tested a single brick. But now I’d like to set up 17 replicated bricks, and I realized that when I restart one of the nodes, the result looks like this:</div><div><br></div><div>sudo gluster volume status | grep &#39; N &#39;</div><div><br></div><div>Brick glunode0:/st/brick3/dir          N/A       N/A        N       N/A  </div><div>Brick glunode1:/st/brick2/dir          N/A       N/A        N       N/A  </div><div><br></div><div>Some bricks just don’t go online. 
Sometimes it’s one brick, sometimes three, and it’s not the same brick – it’s a random issue.</div><div><br></div><div>I checked the logs on the affected servers; here is an example:</div><div><br></div><div>sudo tail /var/log/glusterfs/bricks/st-<wbr>brick3-0.log </div><div><br></div><div>[2017-06-29 17:59:48.651581] W [socket.c:593:__socket_rwv] 0-glusterfs: readv on <a href="http://10.2.44.23:24007" target="_blank">10.2.44.23:24007</a> failed (No data available)</div><div>[2017-06-29 17:59:48.651622] E [glusterfsd-mgmt.c:2114:mgmt_<wbr>rpc_notify] 0-glusterfsd-mgmt: failed to connect with remote-host: glunode0 (No data available)</div><div>[2017-06-29 17:59:48.651638] I [glusterfsd-mgmt.c:2133:mgmt_<wbr>rpc_notify] 0-glusterfsd-mgmt: Exhausted all volfile servers</div><div>[2017-06-29 17:59:49.944103] W [glusterfsd.c:1332:cleanup_<wbr>and_exit] (--&gt;/lib64/libpthread.so.0(+<wbr>0x7dc5) [0x7f3158032dc5] --&gt;/usr/sbin/glusterfsd(<wbr>glusterfs_sigwaiter+0xe5) [0x7f31596cbfd5] --&gt;/usr/sbin/glusterfsd(<wbr>cleanup_and_exit+0x6b) [0x7f31596cbdfb] ) 0-:received signum (15), shutting down</div><div>[2017-06-29 17:59:50.397107] E [socket.c:3203:socket_connect] 0-glusterfs: connection attempt on <a href="http://10.2.44.23:24007" target="_blank">10.2.44.23:24007</a> failed, (Network is unreachable)</div></div></blockquote><div><br></div><div>This happens when the connect() syscall fails with ENETUNREACH errno, as per the following code:<br><br>                if (ign_enoent) {                                               <br>                        ret = connect_loop (priv-&gt;sock,                         <br>                                            SA (&amp;this-&gt;peerinfo.sockaddr),      <br>                                            this-&gt;peerinfo.sockaddr_len);       <br>                } else {                                                        <br>                        ret = connect (priv-&gt;sock,                              <br>                                
       SA (&amp;this-&gt;peerinfo.sockaddr),           <br>                                       this-&gt;peerinfo.sockaddr_len);            <br>                }                                                                   <br>                                                                                    <br>                if (ret == -1 &amp;&amp; errno == ENOENT &amp;&amp; ign_enoent) {                   <br>                        gf_log (this-&gt;name, GF_LOG_WARNING,                         <br>                               &quot;Ignore failed connection attempt on %s, (%s) &quot;, <br>                                this-&gt;peerinfo.identifier, strerror (errno));   <br>                                                                                    <br>                        /* connect failed with some other error than EINPROGRESS<br>                        so, getsockopt (... SO_ERROR ...), will not catch any   <br>                        errors and return them to us, we need to remember this  <br>                        state, and take actions in socket_event_handler             <br>                        appropriately */                                            <br>                        /* TBD: What about ENOENT, we will do getsockopt there  <br>                        as well, so how is that exempt from such a problem? 
*/  <br>                        priv-&gt;connect_failed = 1;                                   <br>                        this-&gt;connect_failed = _gf_true;                            <br>                                                                                    <br>                        goto handler;                                               <br>                }                                                                   <br>                                                                                    <br>                if (ret == -1 &amp;&amp; ((errno != EINPROGRESS) &amp;&amp; (errno != ENOENT))) {<br>                        /* For unix path based sockets, the socket path is          <br>                         * cryptic (md5sum of path) and may not be useful for   <br>                         * the user in debugging so log it in DEBUG                 <br>                         */                                                         <br>                        gf_log (this-&gt;name, ((sa_family == AF_UNIX) ?      &lt;===== this is the log which gets generated         <br>                                GF_LOG_DEBUG : GF_LOG_ERROR),                       <br>                                &quot;connection attempt on %s failed, (%s)&quot;,            <br>                                this-&gt;peerinfo.identifier, strerror (errno));   <br><br></div><div>IMO, this can only happen if there is an intermittent n/w failure? 
<br><br></div><div>@Raghavendra G/ Mohit - do you have any other opinion?<br></div><div><br>[2017-06-29 17:59:50.397138] I [socket.c:3507:socket_submit_<wbr>request] 0-glusterfs: not connected (priv-&gt;connected = 0)</div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div>[2017-06-29 17:59:50.397162] W [rpc-clnt.c:1693:rpc_clnt_<wbr>submit] 0-glusterfs: failed to submit rpc-request (XID: 0x3 Program: Gluster Portmap, ProgVers: 1, Proc: 5) to rpc-transport (glusterfs)</div><div><br></div><div>I think the important message is “Network is unreachable”.</div><div><br></div><div>Questions</div><div>1. Could you please tell me, is this normal when you have many bricks? The network is definitely stable, other servers use it without problems, and all servers run on the same pair of switches. My assumption is that many bricks try to connect at the same time and that doesn’t work.</div><div><br></div><div>2. Is there an option to configure a brick to enable some kind of autoreconnect or add some timeout?</div><div>gluster volume set brick123 option456 abc ??</div><div><br></div><div>3. What is the recommended way to fix an offline brick on the affected server? I don’t want to use “gluster volume stop/start” since the affected bricks are online on the other servers and there is no reason to completely turn them off.</div><div><br></div><div>Thank you,</div><div>Jan</div></div>
<br>______________________________<wbr>_________________<br>
Gluster-users mailing list<br>
<a href="mailto:Gluster-users@gluster.org">Gluster-users@gluster.org</a><br>
<a href="http://lists.gluster.org/mailman/listinfo/gluster-users" rel="noreferrer" target="_blank">http://lists.gluster.org/<wbr>mailman/listinfo/gluster-users</a><br></blockquote></div><br></div></div>
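On question 3 above: to the best of my knowledge you don't need a full stop/start. "gluster volume start VOLNAME force" respawns only the brick processes that are down and leaves the running bricks (including the replicas on the other servers) untouched. A hedged sketch, where the volume name is a placeholder and the ' N ' match mirrors the grep from the status output earlier in the thread:

```shell
#!/bin/sh
# Sketch: respawn offline bricks without stopping the volume.
# Assumes the " N " marker in the Online column of
# `gluster volume status`, as in the grep used earlier in the thread.

# is_offline STATUS_LINE -> succeeds when the line reports a brick
# that is not online
is_offline() {
    case "$1" in
        *' N '*) return 0 ;;
        *)       return 1 ;;
    esac
}

vol=myvol   # placeholder volume name

# "start ... force" only starts bricks that are down; bricks that are
# already running are not interrupted, so client mounts stay up.
if gluster volume status "$vol" 2>/dev/null | grep -q ' N '; then
    gluster volume start "$vol" force
fi
```

After the restart, "gluster volume status" should show a PID and Y in the Online column for the previously dead bricks; self-heal then catches the replicas up.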