[Gluster-users] Some bricks are offline after restart, how to bring them online gracefully?

Jan jan.h.zak at gmail.com
Sun Jul 2 10:26:56 UTC 2017


Thank you, I created a bug with all the logs:
https://bugzilla.redhat.com/show_bug.cgi?id=1467050

During testing I found a second bug:
https://bugzilla.redhat.com/show_bug.cgi?id=1467057
There is something wrong with Ganesha when Gluster bricks are named "w0" or
"sw0".


On Fri, Jun 30, 2017 at 11:36 AM, Hari Gowtham <hgowtham at redhat.com> wrote:

> Hi,
>
> Jan, by multiple times I meant whether you were able to do the whole
> setup multiple times and hit the same issue, so that we have a
> consistent reproducer to work on.
>
> As grepping shows that the process doesn't exist, the bug I mentioned
> doesn't hold good. It seems to be another issue, unrelated to the bug I
> mentioned (the link is provided below now).
>
> When you say it happens too often, that implies there is a way to
> reproduce it. Please do let us know the steps you performed to check,
> but this shouldn't happen if you try again.
>
> You shouldn't hit this issue often, and as Mani mentioned, do not write
> a script to force-start the bricks.
> If the issue persists and there is a proper reproducer, we will take a
> look at it.
>
> Sorry, I forgot to provide the link for the fix:
> patch : https://review.gluster.org/#/c/17101/
>
> If you find a reproducer, do file a bug at
> https://bugzilla.redhat.com/enter_bug.cgi?product=GlusterFS
>
>
> On Fri, Jun 30, 2017 at 3:33 PM, Manikandan Selvaganesh
> <manikandancs333 at gmail.com> wrote:
> > Hi Jan,
> >
> > It is not recommended to automate 'volume start force' with a script.
> > Bricks do not go offline just like that; there is usually some genuine
> > issue which triggers it. Could you please attach the entire glusterd log
> > and the brick logs from around that time so that someone can take a look?
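> >
> > (For reference – on a default install these typically live under
> > /var/log/glusterfs/; the exact glusterd log file name can differ between
> > versions, and the brick log name below is just the one from your example:)
> >
> > tail -n 500 /var/log/glusterfs/glusterd.log
> > tail -n 500 /var/log/glusterfs/bricks/st-brick3-0.log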
> >
> > Just to make sure, please check whether you have any network outage
> > (using iperf or some other standard tool).
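> >
> > (For example – the hostname below is just a placeholder for one of your
> > nodes, and iperf may need to be installed first:)
> >
> > # on one node, start the iperf server
> > iperf -s
> > # on another node, measure throughput to it for 10 seconds
> > iperf -c glunode0 -t 10
> > # a basic reachability/latency check also helps
> > ping -c 5 glunode0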
> >
> > @Hari, I think you forgot to provide the bug link; please share it so
> > that Jan or someone else can check whether it is related.
> >
> >
> > --
> > Thanks & Regards,
> > Manikandan Selvaganesan.
> > (@Manikandan Selvaganesh on Web)
> >
> > On Fri, Jun 30, 2017 at 3:19 PM, Jan <jan.h.zak at gmail.com> wrote:
> >>
> >> Hi Hari,
> >>
> >> thank you for your support!
> >>
> >> Did I try to check the offline bricks multiple times?
> >> Yes – I gave it enough time (at least 20 minutes) to recover, but it
> >> stayed offline.
> >>
> >> Version?
> >> All nodes are 100% equal – I did a fresh installation several times
> >> during my testing. Every time it is a CentOS Minimal install with all
> >> updates and without any additional software:
> >>
> >> uname -r
> >> 3.10.0-514.21.2.el7.x86_64
> >>
> >> yum list installed | egrep 'gluster|ganesha'
> >> centos-release-gluster310.noarch     1.0-1.el7.centos     @extras
> >> glusterfs.x86_64                     3.10.2-1.el7         @centos-gluster310
> >> glusterfs-api.x86_64                 3.10.2-1.el7         @centos-gluster310
> >> glusterfs-cli.x86_64                 3.10.2-1.el7         @centos-gluster310
> >> glusterfs-client-xlators.x86_64      3.10.2-1.el7         @centos-gluster310
> >> glusterfs-fuse.x86_64                3.10.2-1.el7         @centos-gluster310
> >> glusterfs-ganesha.x86_64             3.10.2-1.el7         @centos-gluster310
> >> glusterfs-libs.x86_64                3.10.2-1.el7         @centos-gluster310
> >> glusterfs-server.x86_64              3.10.2-1.el7         @centos-gluster310
> >> libntirpc.x86_64                     1.4.3-1.el7          @centos-gluster310
> >> nfs-ganesha.x86_64                   2.4.5-1.el7          @centos-gluster310
> >> nfs-ganesha-gluster.x86_64           2.4.5-1.el7          @centos-gluster310
> >> userspace-rcu.x86_64                 0.7.16-3.el7         @centos-gluster310
> >>
> >> Grepping for the brick process?
> >> I’ve just tried it again. The process doesn’t exist when the brick is offline.
> >>
> >> Force start command?
> >> sudo gluster volume start MyVolume force
> >>
> >> That works! Thank you.
> >>
> >> If I have this issue too often, I could create a simple script that
> >> checks all bricks on the local server and force-starts the volume when
> >> one is offline, and schedule it to run once, for example 5 minutes
> >> after boot – roughly like the sketch below.
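> >>
> >> (Just a rough sketch of what I have in mind – the volume name is a
> >> placeholder, and I haven’t decided to actually deploy this:)
> >>
> >> #!/bin/bash
> >> # force-start the volume if any local brick is reported offline
> >> VOLUME=MyVolume
> >> HOST=$(hostname)
> >> if sudo gluster volume status "$VOLUME" | grep "^Brick $HOST:" | grep -q ' N ' ; then
> >>     sudo gluster volume start "$VOLUME" force
> >> fi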
> >>
> >> But I’m not sure if it’s a good idea to automate it. I’d be worried
> >> that I could force a brick up even when the node doesn’t “see” the
> >> other nodes and cause a split-brain issue.
> >>
> >> Thank you!
> >>
> >> Kind regards,
> >> Jan
> >>
> >>
> >> On Fri, Jun 30, 2017 at 8:01 AM, Hari Gowtham <hgowtham at redhat.com>
> wrote:
> >>>
> >>> Hi Jan,
> >>>
> >>> comments inline.
> >>>
> >>> On Fri, Jun 30, 2017 at 1:31 AM, Jan <jan.h.zak at gmail.com> wrote:
> >>> > Hi all,
> >>> >
> >>> > Gluster and Ganesha are amazing. Thank you for this great work!
> >>> >
> >>> > I’m struggling with one issue and I think that you might be able to
> >>> > help me.
> >>> >
> >>> > I spent some time playing with Gluster and Ganesha, and after I gained
> >>> > some experience I decided to go into production, but I’m still
> >>> > struggling with one issue.
> >>> >
> >>> > I have a 3-node CentOS 7.3 setup with the most current Gluster and
> >>> > Ganesha from the centos-gluster310 repository (3.10.2-1.el7) with
> >>> > replicated bricks.
> >>> >
> >>> > Servers have a lot of resources and they run in a subnet on a stable
> >>> > network.
> >>> >
> >>> > I didn’t have any issues when I tested a single brick. But now I’d like
> >>> > to set up 17 replicated bricks, and I realized that when I restart one
> >>> > of the nodes, the result looks like this:
> >>> >
> >>> > sudo gluster volume status | grep ' N '
> >>> >
> >>> > Brick glunode0:/st/brick3/dir          N/A       N/A        N       N/A
> >>> > Brick glunode1:/st/brick2/dir          N/A       N/A        N       N/A
> >>> >
> >>>
> >>> Did you try it multiple times?
> >>>
> >>> > Some bricks just don’t go online. Sometimes it’s one brick, sometimes
> >>> > three, and it’s not the same brick – it’s a random issue.
> >>> >
> >>> > I checked the logs on the affected servers and this is an example:
> >>> >
> >>> > sudo tail /var/log/glusterfs/bricks/st-brick3-0.log
> >>> >
> >>> > [2017-06-29 17:59:48.651581] W [socket.c:593:__socket_rwv] 0-glusterfs: readv on 10.2.44.23:24007 failed (No data available)
> >>> > [2017-06-29 17:59:48.651622] E [glusterfsd-mgmt.c:2114:mgmt_rpc_notify] 0-glusterfsd-mgmt: failed to connect with remote-host: glunode0 (No data available)
> >>> > [2017-06-29 17:59:48.651638] I [glusterfsd-mgmt.c:2133:mgmt_rpc_notify] 0-glusterfsd-mgmt: Exhausted all volfile servers
> >>> > [2017-06-29 17:59:49.944103] W [glusterfsd.c:1332:cleanup_and_exit] (-->/lib64/libpthread.so.0(+0x7dc5) [0x7f3158032dc5] -->/usr/sbin/glusterfsd(glusterfs_sigwaiter+0xe5) [0x7f31596cbfd5] -->/usr/sbin/glusterfsd(cleanup_and_exit+0x6b) [0x7f31596cbdfb] ) 0-:received signum (15), shutting down
> >>> > [2017-06-29 17:59:50.397107] E [socket.c:3203:socket_connect] 0-glusterfs: connection attempt on 10.2.44.23:24007 failed, (Network is unreachable)
> >>> > [2017-06-29 17:59:50.397138] I [socket.c:3507:socket_submit_request] 0-glusterfs: not connected (priv->connected = 0)
> >>> > [2017-06-29 17:59:50.397162] W [rpc-clnt.c:1693:rpc_clnt_submit] 0-glusterfs: failed to submit rpc-request (XID: 0x3 Program: Gluster Portmap, ProgVers: 1, Proc: 5) to rpc-transport (glusterfs)
> >>> >
> >>> > I think the important message is “Network is unreachable”.
> >>> >
> >>> > Question
> >>> > 1. Could you please tell me, is this normal when you have many bricks?
> >>> > The network is definitely stable, other servers use it without problems,
> >>> > and all servers run on the same pair of switches. My assumption is that
> >>> > many bricks try to connect at the same time and that doesn’t work.
> >>>
> >>> No, it shouldn't happen just because there are multiple bricks.
> >>> There was a bug related to this [1].
> >>> To verify whether that was the issue I need to know a few things:
> >>> 1) Are all the nodes on the same version?
> >>> 2) Did you check for the brick process using the ps command (see the
> >>> example below)? We need to verify whether the brick is still up but
> >>> just not connected to glusterd.
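> >>>
> >>> (For example, something along these lines – the brick path and volume
> >>> name are only placeholders:)
> >>>
> >>> ps aux | grep glusterfsd | grep '/st/brick3/dir'
> >>> # compare with what glusterd reports for the same brick
> >>> gluster volume status MyVolume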
> >>>
> >>>
> >>> >
> >>> > 2. Is there an option to configure a brick to enable some kind of
> >>> > autoreconnect or add some timeout?
> >>> > gluster volume set brick123 option456 abc ??
> >>> If the brick process is not seen in ps aux | grep glusterfsd,
> >>> the way to start the brick is to use the volume start force command.
> >>> If the brick is not started there is no point configuring it, and we
> >>> can't use a configuration (volume set) option to start a brick.
> >>>
> >>> >
> >>> > 3. What is the recommended way to fix an offline brick on the affected
> >>> > server? I don’t want to use “gluster volume stop/start” since the
> >>> > affected bricks are online on the other servers and there is no reason
> >>> > to completely turn them off.
> >>> gluster volume start force will not bring down the bricks that are
> >>> already up and running.
> >>>
> >>> >
> >>> > Thank you,
> >>> > Jan
> >>> >
> >>> > _______________________________________________
> >>> > Gluster-users mailing list
> >>> > Gluster-users at gluster.org
> >>> > http://lists.gluster.org/mailman/listinfo/gluster-users
> >>>
> >>>
> >>>
> >>> --
> >>> Regards,
> >>> Hari Gowtham.
> >>
> >>
> >>
> >> _______________________________________________
> >> Gluster-users mailing list
> >> Gluster-users at gluster.org
> >> http://lists.gluster.org/mailman/listinfo/gluster-users
> >
> >
>
>
>
> --
> Regards,
> Hari Gowtham.
>

