[Gluster-users] Some bricks are offline after restart, how to bring them online gracefully?

Thu Jun 29 20:01:22 UTC 2017

Hi all,

Gluster and Ganesha are amazing. Thank you for this great work!

I’m struggling with one issue and I think you might be able to help me.

After spending some time experimenting with Gluster and Ganesha and gaining
some experience, I decided to go into production, but one problem remains.

I have three CentOS 7.3 nodes running the most current Gluster and Ganesha
from the centos-gluster310 repository (3.10.2-1.el7), with replicated bricks.

The servers have plenty of resources and run in a subnet on a stable
network.

I didn’t have any issues when I tested a single brick. But now I’d like to
set up 17 replicated bricks, and I noticed that when I restart one of the
nodes, the result looks like this:

sudo gluster volume status | grep ' N '

Brick glunode0:/st/brick3/dir          N/A       N/A        N       N/A
Brick glunode1:/st/brick2/dir          N/A       N/A        N       N/A

Some bricks just don’t come online. Sometimes it’s one brick, sometimes
three, and it’s not always the same brick; the issue appears to be random.
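
To keep track of which bricks are affected after each restart, the grep
above could be scripted. This is only a minimal sketch: the function name
offline_bricks is made up for illustration, and it assumes every brick is
printed on a single line with the same six columns as in the output above
(Brick, host:/path, TCP port, RDMA port, Online, Pid).

```python
def offline_bricks(status_text):
    """Return the host:/path of every brick whose Online column is 'N'.

    Assumes the single-line layout shown in the excerpt:
    Brick <host:/path> <TCP port> <RDMA port> <Online> <Pid>
    """
    offline = []
    for line in status_text.splitlines():
        fields = line.split()
        # Skip headers, separators and non-brick processes.
        if len(fields) == 6 and fields[0] == "Brick" and fields[4] == "N":
            offline.append(fields[1])
    return offline
```

The text passed in would be the captured output of "gluster volume status";
how it is captured (a file, a pipe, a subprocess call) is left open here.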

I checked the logs on the affected servers, and this is an example:

sudo tail /var/log/glusterfs/bricks/st-brick3-0.log

[2017-06-29 17:59:48.651581] W [socket.c:593:__socket_rwv] 0-glusterfs:
readv on 10.2.44.23:24007 failed (No data available)
[2017-06-29 17:59:48.651622] E [glusterfsd-mgmt.c:2114:mgmt_rpc_notify]
0-glusterfsd-mgmt: failed to connect with remote-host: glunode0 (No data
available)
[2017-06-29 17:59:48.651638] I [glusterfsd-mgmt.c:2133:mgmt_rpc_notify]
0-glusterfsd-mgmt: Exhausted all volfile servers
[2017-06-29 17:59:49.944103] W [glusterfsd.c:1332:cleanup_and_exit]
(-->/lib64/libpthread.so.0(+0x7dc5) [0x7f3158032dc5]
-->/usr/sbin/glusterfsd(glusterfs_sigwaiter+0xe5) [0x7f31596cbfd5]
-->/usr/sbin/glusterfsd(cleanup_and_exit+0x6b) [0x7f31596cbdfb] )
0-:received signum (15), shutting down
[2017-06-29 17:59:50.397107] E [socket.c:3203:socket_connect] 0-glusterfs:
connection attempt on 10.2.44.23:24007 failed, (Network is unreachable)
[2017-06-29 17:59:50.397138] I [socket.c:3507:socket_submit_request]
0-glusterfs: not connected (priv->connected = 0)
[2017-06-29 17:59:50.397162] W [rpc-clnt.c:1693:rpc_clnt_submit]
0-glusterfs: failed to submit rpc-request (XID: 0x3 Program: Gluster
Portmap, ProgVers: 1, Proc: 5) to rpc-transport (glusterfs)

I think the important message is “Network is unreachable”.
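
To pick out such messages from a longer brick log, the error-level records
can be filtered. The following is a rough sketch, not a Gluster tool: the
name error_records is hypothetical, and the record layout (a bracketed
timestamp followed by a one-letter level such as E, W or I) is inferred
only from the excerpt above. Lines that do not start with a timestamp are
treated as wrapped continuations of the previous record.

```python
import re

# Start of a record as seen in the excerpt: "[YYYY-MM-DD HH:MM:SS.ffffff] L "
RECORD_START = re.compile(
    r"^\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+)\] ([A-Z]) "
)

def error_records(lines):
    """Return (timestamp, message) pairs for records logged at level E."""
    records = []
    for raw in lines:
        line = raw.rstrip("\n")
        if RECORD_START.match(line):
            records.append(line)
        elif records and line.strip():
            # Wrapped continuation: append it to the current record.
            records[-1] += " " + line.strip()
    errors = []
    for record in records:
        match = RECORD_START.match(record)
        if match.group(2) == "E":
            errors.append((match.group(1), record[match.end():]))
    return errors
```

Passing it the lines of a brick log file would return only the E records,
with each wrapped record joined back into a single message.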

Questions
1. Could you please tell me whether this is normal when you have many bricks?
The network is definitely stable, other servers use it without problems, and
all servers are connected to the same pair of switches. My assumption is that
many bricks try to connect at the same time and that this fails.

2. Is there an option to configure a brick to reconnect automatically, or to
add some timeout?
gluster volume set brick123 option456 abc ??

3. What is the recommended way to fix an offline brick on the affected
server? I don’t want to use “gluster volume stop/start”, since the affected
bricks are online on the other servers and there is no reason to take the
volume down completely.

Thank you,
Jan