[Gluster-users] Some bricks are offline after restart, how to bring them online gracefully?
Atin Mukherjee
amukherj at redhat.com
Fri Jun 30 11:07:56 UTC 2017
On Fri, Jun 30, 2017 at 1:31 AM, Jan <jan.h.zak at gmail.com> wrote:
> Hi all,
>
> Gluster and Ganesha are amazing. Thank you for this great work!
>
> I’m struggling with one issue and I think that you might be able to help
> me.
>
> I spent some time playing with Gluster and Ganesha, and after gaining
> some experience I decided to go into production, but I'm still
> struggling with one issue.
>
> I have 3x node CentOS 7.3 with the most current Gluster and Ganesha from
> centos-gluster310 repository (3.10.2-1.el7) with replicated bricks.
>
> Servers have a lot of resources and they run in a subnet on a stable
> network.
>
> I didn’t have any issues when I tested a single brick. But now I’d like to
> set up 17 replicated bricks, and I realized that when I restart one of the
> nodes the result looks like this:
>
> sudo gluster volume status | grep ' N '
>
> Brick glunode0:/st/brick3/dir N/A N/A N N/A
> Brick glunode1:/st/brick2/dir N/A N/A N N/A
>
> Some bricks just don’t come online. Sometimes it’s one brick, sometimes
> three, and it’s not the same brick – it’s a random issue.
>
> I checked log on affected servers and this is an example:
>
> sudo tail /var/log/glusterfs/bricks/st-brick3-0.log
>
> [2017-06-29 17:59:48.651581] W [socket.c:593:__socket_rwv] 0-glusterfs:
> readv on 10.2.44.23:24007 failed (No data available)
> [2017-06-29 17:59:48.651622] E [glusterfsd-mgmt.c:2114:mgmt_rpc_notify]
> 0-glusterfsd-mgmt: failed to connect with remote-host: glunode0 (No data
> available)
> [2017-06-29 17:59:48.651638] I [glusterfsd-mgmt.c:2133:mgmt_rpc_notify]
> 0-glusterfsd-mgmt: Exhausted all volfile servers
> [2017-06-29 17:59:49.944103] W [glusterfsd.c:1332:cleanup_and_exit]
> (-->/lib64/libpthread.so.0(+0x7dc5) [0x7f3158032dc5]
> -->/usr/sbin/glusterfsd(glusterfs_sigwaiter+0xe5) [0x7f31596cbfd5]
> -->/usr/sbin/glusterfsd(cleanup_and_exit+0x6b) [0x7f31596cbdfb] )
> 0-:received signum (15), shutting down
> [2017-06-29 17:59:50.397107] E [socket.c:3203:socket_connect] 0-glusterfs:
> connection attempt on 10.2.44.23:24007 failed, (Network is unreachable)
>
This happens when the connect () syscall fails with ENETUNREACH errno, as per
the following code:

        if (ign_enoent) {
                ret = connect_loop (priv->sock,
                                    SA (&this->peerinfo.sockaddr),
                                    this->peerinfo.sockaddr_len);
        } else {
                ret = connect (priv->sock,
                               SA (&this->peerinfo.sockaddr),
                               this->peerinfo.sockaddr_len);
        }

        if (ret == -1 && errno == ENOENT && ign_enoent) {
                gf_log (this->name, GF_LOG_WARNING,
                        "Ignore failed connection attempt on %s, (%s) ",
                        this->peerinfo.identifier, strerror (errno));

                /* connect failed with some other error than EINPROGRESS
                   so, getsockopt (... SO_ERROR ...), will not catch any
                   errors and return them to us, we need to remember this
                   state, and take actions in socket_event_handler
                   appropriately */
                /* TBD: What about ENOENT, we will do getsockopt there
                   as well, so how is that exempt from such a problem? */
                priv->connect_failed = 1;
                this->connect_failed = _gf_true;
                goto handler;
        }

        if (ret == -1 && ((errno != EINPROGRESS) && (errno != ENOENT))) {
                /* For unix path based sockets, the socket path is
                 * cryptic (md5sum of path) and may not be useful for
                 * the user in debugging so log it in DEBUG
                 */
                gf_log (this->name, ((sa_family == AF_UNIX) ?   <===== this is the log which gets generated
                        GF_LOG_DEBUG : GF_LOG_ERROR),
                        "connection attempt on %s failed, (%s)",
                        this->peerinfo.identifier, strerror (errno));
        }

IMO, this can only happen if there is an intermittent n/w failure.
@Raghavendra G / Mohit - do you have any other opinion?
> [2017-06-29 17:59:50.397138] I [socket.c:3507:socket_submit_request]
> 0-glusterfs: not connected (priv->connected = 0)
> [2017-06-29 17:59:50.397162] W [rpc-clnt.c:1693:rpc_clnt_submit]
> 0-glusterfs: failed to submit rpc-request (XID: 0x3 Program: Gluster
> Portmap, ProgVers: 1, Proc: 5) to rpc-transport (glusterfs)
>
> I think that important message is “Network is unreachable”.
>
> Question
> 1. Could you please tell me, is that normal when you have many bricks?
> The network is definitely stable, other servers use it without problems, and
> all servers run on the same pair of switches. My assumption is that many
> bricks try to connect at the same time and that doesn’t work.
>
> 2. Is there an option to configure a brick to enable some kind of
> autoreconnect or add some timeout?
> gluster volume set brick123 option456 abc ??
>
> 3. What is the recommended way to fix an offline brick on the affected
> server? I don’t want to use “gluster volume stop/start” since the affected
> bricks are online on the other servers and there is no reason to completely
> turn them off.
>
> Thank you,
> Jan
>
> _______________________________________________
> Gluster-users mailing list
> Gluster-users at gluster.org
> http://lists.gluster.org/mailman/listinfo/gluster-users
>
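On the last question: offline bricks can usually be respawned without
stopping the volume. "gluster volume start <VOLNAME> force" starts only the
brick processes that are down and leaves running bricks untouched. A sketch
(the volume name "myvol" is a placeholder; adjust to your setup):

```shell
# List bricks the cluster considers offline ('N' in the Online column).
gluster volume status myvol | grep ' N '

# Respawn only the dead brick processes; bricks that are already
# running are left alone, so connected clients are not disturbed.
gluster volume start myvol force

# Verify everything is back online.
gluster volume status myvol
```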