[Gluster-devel] Gluster Brick Offline after reboot!!

ABHISHEK PALIWAL abhishpaliwal at gmail.com
Wed May 4 12:48:05 UTC 2016


I am talking about the time GlusterD takes to mark the brick process
offline, since GlusterD is responsible for marking a brick
online/offline.

Is that time configurable?
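
To make the question concrete, here is a rough sketch of how the interval between a brick connecting and GlusterD marking it disconnected could be measured from glusterd.log. It only matches the two message texts quoted further down in this thread ("Connected to ..." and "Brick ... has disconnected from glusterd"). The function names are my own, and it assumes every log entry sits on a single line, as in the unwrapped log file.

```python
import re
from datetime import datetime

# Matches the two glusterd messages quoted in this thread. Assumption: each
# log entry is on one line, as in an unwrapped glusterd.log.
EVENT_RE = re.compile(
    r"\[(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+)\].*?"
    r"(?:Connected to (?P<up>\S+)"
    r"|Brick (?P<down>\S+) has disconnected from glusterd)"
)


def brick_events(log_text):
    """Return (timestamp, brick, state) tuples in log order."""
    events = []
    for line in log_text.splitlines():
        m = EVENT_RE.search(line)
        if not m:
            continue  # not a brick connect/disconnect message
        ts = datetime.strptime(m.group("ts"), "%Y-%m-%d %H:%M:%S.%f")
        if m.group("up"):
            events.append((ts, m.group("up"), "connected"))
        else:
            events.append((ts, m.group("down"), "disconnected"))
    return events


def connect_to_disconnect_gaps(events):
    """Seconds from the latest connect of a brick to each later disconnect."""
    last_connect = {}
    gaps = []
    for ts, brick, state in events:
        if state == "connected":
            last_connect[brick] = ts
        elif brick in last_connect:
            gaps.append((brick, (ts - last_connect[brick]).total_seconds()))
    return gaps
```

Feeding the contents of glusterd.log to brick_events() and the result to connect_to_disconnect_gaps() lists, per brick, how long each connection lasted before GlusterD saw the disconnect.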

On Wed, May 4, 2016 at 5:53 PM, Atin Mukherjee <amukherj at redhat.com> wrote:

> Abhishek,
>
> See the response inline.
>
>
> On 05/04/2016 05:43 PM, ABHISHEK PALIWAL wrote:
> > Hi Atin,
> >
> > Please reply: is there a configurable timeout parameter for the brick
> > process going offline that we can increase?
> >
> > Regards,
> > Abhishek
> >
> > On Thu, Apr 21, 2016 at 12:34 PM, ABHISHEK PALIWAL
> > <abhishpaliwal at gmail.com> wrote:
> >
> >     Hi Atin,
> >
> >     Please answer the following questions as well:
> >
> >     1. If there is a temporary glitch in the network, will that affect
> >     the Gluster brick process in any way? Is there a timeout for the
> >     brick process to go offline when the network glitches?
>       If there is a disconnection, GlusterD will receive the event and
> mark the brick as disconnected even if the brick process is still running.
> So the answer to this question is both yes and no. From the process
> perspective the bricks are still up, but not to the other components/layers,
> and that may impact operations (both mgmt & I/O, given that there is a
> disconnect between the client and the brick processes too).
> >
> >     2. Is there any configurable timeout parameter which we can
> >     increase?
> I don't get this question. Which timeout are you talking about?
> >
> >     3. The brick and glusterd are connected by a Unix domain socket. It
> >     is just a local socket, so why does it disconnect in the logs below?
>       This is not true, it's over a TCP socket.
> >
> >      1667 [2016-04-03 10:12:32.984331] I [MSGID: 106005]
> >     [glusterd-handler.c:4908:__glusterd_brick_rpc_notify] 0-management:
> >     Brick 10.32.1.144:/opt/lvmdir/c2/brick has disconnected from
> >     glusterd.
> >      1668 [2016-04-03 10:12:32.984366] D [MSGID: 0]
> >     [glusterd-utils.c:4872:glusterd_set_brick_status] 0-glusterd: Setting
> >     brick 10.32.1.144:/opt/lvmdir/c2/brick status to stopped
> >
> >     Regards,
> >     Abhishek
> >
> >
> >     On Tue, Apr 19, 2016 at 1:12 PM, ABHISHEK PALIWAL
> >     <abhishpaliwal at gmail.com> wrote:
> >
> >         Hi Atin,
> >
> >         Thanks.
> >
> >         I have more questions here.
> >
> >         The brick and glusterd are connected by a Unix domain socket. It
> >         is just a local socket, so why does it disconnect in the logs
> >         below?
> >
> >          1667 [2016-04-03 10:12:32.984331] I [MSGID: 106005]
> >         [glusterd-handler.c:4908:__glusterd_brick_rpc_notify]
> >         0-management:
> >         Brick 10.32.1.144:/opt/lvmdir/c2/brick has disconnected from
> >         glusterd.
> >          1668 [2016-04-03 10:12:32.984366] D [MSGID: 0]
> >         [glusterd-utils.c:4872:glusterd_set_brick_status] 0-glusterd:
> >         Setting
> >         brick 10.32.1.144:/opt/lvmdir/c2/brick status to stopped
> >
> >
> >         Regards,
> >         Abhishek
> >
> >
> >         On Fri, Apr 15, 2016 at 9:14 AM, Atin Mukherjee
> >         <amukherj at redhat.com> wrote:
> >
> >
> >
> >             On 04/14/2016 04:07 PM, ABHISHEK PALIWAL wrote:
> >             >
> >             >
> >             > On Thu, Apr 14, 2016 at 2:33 PM, Atin Mukherjee
> >             > <amukherj at redhat.com> wrote:
> >             >
> >             >
> >             >
> >             >     On 04/05/2016 03:35 PM, ABHISHEK PALIWAL wrote:
> >             >     >
> >             >     >
> >             >     > On Tue, Apr 5, 2016 at 2:22 PM, Atin Mukherjee
> >             >     > <amukherj at redhat.com> wrote:
> >             >     >
> >             >     >
> >             >     >
> >             >     >     On 04/05/2016 01:04 PM, ABHISHEK PALIWAL wrote:
> >             >     >     > Hi Team,
> >             >     >     >
> >             >     >     > We are using Gluster 3.7.6 and are facing a problem in which a
> >             >     >     > brick does not come online after the board is restarted.
> >             >     >     >
> >             >     >     > To understand our setup, please look at the following steps:
> >             >     >     > 1. We have two boards, A and B, on which a Gluster volume is
> >             >     >     > running in replicated mode, with one brick on each board.
> >             >     >     > 2. The Gluster mount point is present on Board A and is shared
> >             >     >     > between a number of processes.
> >             >     >     > 3. Up to this point our volume is in sync and everything is
> >             >     >     > working fine.
> >             >     >     > 4. Now we have a test case in which we stop glusterd, reboot
> >             >     >     > Board B, and when this board comes up, start glusterd again on it.
> >             >     >     > 5. We repeated Step 4 multiple times to check the reliability of
> >             >     >     > the system.
> >             >     >     > 6. After Step 4, sometimes the system comes up in a working state
> >             >     >     > (i.e. in sync), but sometimes the brick of Board B is listed by
> >             >     >     > the “gluster volume status” command but is not online even after
> >             >     >     > waiting for more than a minute.
> >             >     >     As I mentioned in another email thread, unless the log shows
> >             >     >     evidence that there was a reboot, nothing can be concluded. The
> >             >     >     last log you shared with us a few days back didn't give any
> >             >     >     indication that the brick process wasn't running.
> >             >     >
> >             >     > How can we identify from the brick logs that the brick process
> >             >     > is running?
> >             >     >
> >             >     >     > 7. While Step 4 is executing, some processes on Board A start
> >             >     >     > accessing files from the Gluster mount point at the same time.
> >             >     >     >
> >             >     >     > As a solution to bring this brick online, we found some existing
> >             >     >     > issues on the Gluster mailing list suggesting the use of
> >             >     >     > “gluster volume start <vol_name> force” to take the brick from
> >             >     >     > 'offline' to 'online'.
> >             >     >     >
> >             >     >     > If we use the “gluster volume start <vol_name> force” command, it
> >             >     >     > will kill the existing volume process and start a new one. What
> >             >     >     > will happen if other processes are accessing the same volume at
> >             >     >     > the time the volume process is killed internally by this command?
> >             >     >     > Will it cause any failures in these processes?
> >             >     >     This is not true: volume start force will start the brick
> >             >     >     processes only if they are not running. Running brick processes
> >             >     >     will not be interrupted.
> >             >     >
> >             >     > We have tried this and checked the PID of the process before and
> >             >     > after the force start.
> >             >     > The PID had changed after the force start.
> >             >     >
> >             >     > Please find the logs from the time of the failure attached once
> >             >     > again, with log-level=debug.
> >             >     >
> >             >     > If you can point to the exact line in the brick log file that shows
> >             >     > the brick process is running, please give me its line number.
> >             >
> >             >     Here is the sequence in which glusterd and the respective brick
> >             >     process were restarted.
> >             >
> >             >     1. glusterd restart trigger - line number 1014 in glusterd.log file:
> >             >
> >             >     [2016-04-03 10:12:29.051735] I [MSGID: 100030]
> >             >     [glusterfsd.c:2318:main] 0-/usr/sbin/glusterd: Started running
> >             >     /usr/sbin/glusterd version 3.7.6 (args: /usr/sbin/glusterd -p
> >             >     /var/run/glusterd.pid --log-level DEBUG)
> >             >
> >             >     2. brick start trigger - line number 190 in opt-lvmdir-c2-brick.log
> >             >
> >             >     [2016-04-03 10:14:25.268833] I [MSGID: 100030]
> >             >     [glusterfsd.c:2318:main] 0-/usr/sbin/glusterfsd: Started running
> >             >     /usr/sbin/glusterfsd version 3.7.6 (args: /usr/sbin/glusterfsd -s
> >             >     10.32.1.144 --volfile-id c_glusterfs.10.32.1.144.opt-lvmdir-c2-brick
> >             >     -p /system/glusterd/vols/c_glusterfs/run/10.32.1.144-opt-lvmdir-c2-brick.pid
> >             >     -S /var/run/gluster/697c0e4a16ebc734cd06fd9150723005.socket
> >             >     --brick-name /opt/lvmdir/c2/brick -l
> >             >     /var/log/glusterfs/bricks/opt-lvmdir-c2-brick.log --xlator-option
> >             >     *-posix.glusterd-uuid=2d576ff8-0cea-4f75-9e34-a5674fbf7256
> >             >     --brick-port 49329 --xlator-option
> >             >     c_glusterfs-server.listen-port=49329)
> >             >
> >             >     3. The following log indicates that the brick is up and has now
> >             >     started. Refer to line 16123 in glusterd.log
> >             >
> >             >     [2016-04-03 10:14:25.336855] D [MSGID: 0]
> >             >     [glusterd-handler.c:4897:__glusterd_brick_rpc_notify] 0-management:
> >             >     Connected to 10.32.1.144:/opt/lvmdir/c2/brick
> >             >
> >             >     This clearly indicates that the brick is up and running, as after
> >             >     that I do not see any disconnect event being processed by glusterd
> >             >     for the brick process.
> >             >
> >             >
> >             > Thanks for the detailed reply, but please also clear up some more
> >             > doubts:
> >             >
> >             > 1. At 10:14:25 the brick is available because we had removed the
> >             > brick and added it again to bring it online. The following are the
> >             > logs from the cmd-history.log file of 000300:
> >             >
> >             > [2016-04-03 10:14:21.446570]  : volume status : SUCCESS
> >             > [2016-04-03 10:14:21.665889]  : volume remove-brick c_glusterfs
> >             > replica 1 10.32.1.144:/opt/lvmdir/c2/brick force : SUCCESS
> >             > [2016-04-03 10:14:21.764270]  : peer detach 10.32.1.144 : SUCCESS
> >             > [2016-04-03 10:14:23.060442]  : peer probe 10.32.1.144 : SUCCESS
> >             > [2016-04-03 10:14:25.649525]  : volume add-brick c_glusterfs
> >             > replica 2 10.32.1.144:/opt/lvmdir/c2/brick force : SUCCESS
> >             >
> >             > Also, 10:12:29 was the last reboot time before this failure, so I
> >             > totally agree with what you said earlier.
> >             >
> >             > 2. As you said, glusterd restarted at 10:12:29. Then why are we not
> >             > getting 'brick start trigger' logs like the one below between
> >             > 10:12:29 and 10:14:25, an interval of about two minutes?
> >             So here is the culprit:
> >
> >              1667 [2016-04-03 10:12:32.984331] I [MSGID: 106005]
> >             [glusterd-handler.c:4908:__glusterd_brick_rpc_notify] 0-management:
> >             Brick 10.32.1.144:/opt/lvmdir/c2/brick has disconnected from
> >             glusterd.
> >              1668 [2016-04-03 10:12:32.984366] D [MSGID: 0]
> >             [glusterd-utils.c:4872:glusterd_set_brick_status] 0-glusterd:
> >             Setting brick 10.32.1.144:/opt/lvmdir/c2/brick status to stopped
> >
> >
> >             GlusterD received a disconnect event for this brick process and
> >             marked it as stopped. This can happen for two reasons: 1. the brick
> >             process goes down, or 2. a network issue. In this case I believe it
> >             is the latter, since the brick process was running at that time.
> >             I'd request you to check this from the N/W side.
> >
> >
> >             >
> >             > [2016-04-03 10:14:25.268833] I [MSGID: 100030]
> >             > [glusterfsd.c:2318:main] 0-/usr/sbin/glusterfsd: Started running
> >             > /usr/sbin/glusterfsd version 3.7.6 (args: /usr/sbin/glusterfsd -s
> >             > 10.32.1.144 --volfile-id c_glusterfs.10.32.1.144.opt-lvmdir-c2-brick
> >             > -p /system/glusterd/vols/c_glusterfs/run/10.32.1.144-opt-lvmdir-c2-brick.pid
> >             > -S /var/run/gluster/697c0e4a16ebc734cd06fd9150723005.socket
> >             > --brick-name /opt/lvmdir/c2/brick -l
> >             > /var/log/glusterfs/bricks/opt-lvmdir-c2-brick.log --xlator-option
> >             > *-posix.glusterd-uuid=2d576ff8-0cea-4f75-9e34-a5674fbf7256
> >             > --brick-port 49329 --xlator-option
> >             > c_glusterfs-server.listen-port=49329)
> >             >
> >             > 3. We were continuously checking the brick status during the above
> >             > interval using "gluster volume status"; refer to the
> >             > cmd-history.log file from 000300.
> >             >
> >             > In the glusterd.log file we also see the logs below
> >             >
> >             > [2016-04-03 10:12:31.771051] D [MSGID: 0]
> >             > [glusterd-handler.c:4897:__glusterd_brick_rpc_notify] 0-management:
> >             > Connected to 10.32.1.144:/opt/lvmdir/c2/brick
> >             >
> >             > [2016-04-03 10:12:32.981152] D [MSGID: 0]
> >             > [glusterd-handler.c:4897:__glusterd_brick_rpc_notify] 0-management:
> >             > Connected to 10.32.1.144:/opt/lvmdir/c2/brick
> >             >
> >             > twice between 10:12:29 and 10:14:25. As you said, these logs
> >             > "clearly indicates that the brick is up and running", so why is the
> >             > brick not online in the "gluster volume status" output?
> >             >
> >             > [2016-04-03 10:12:33.990487]  : volume status : SUCCESS
> >             > [2016-04-03 10:12:34.007469]  : volume status : SUCCESS
> >             > [2016-04-03 10:12:35.095918]  : volume status : SUCCESS
> >             > [2016-04-03 10:12:35.126369]  : volume status : SUCCESS
> >             > [2016-04-03 10:12:36.224018]  : volume status : SUCCESS
> >             > [2016-04-03 10:12:36.251032]  : volume status : SUCCESS
> >             > [2016-04-03 10:12:37.352377]  : volume status : SUCCESS
> >             > [2016-04-03 10:12:37.374028]  : volume status : SUCCESS
> >             > [2016-04-03 10:12:38.446148]  : volume status : SUCCESS
> >             > [2016-04-03 10:12:38.468860]  : volume status : SUCCESS
> >             > [2016-04-03 10:12:39.534017]  : volume status : SUCCESS
> >             > [2016-04-03 10:12:39.553711]  : volume status : SUCCESS
> >             > [2016-04-03 10:12:40.616610]  : volume status : SUCCESS
> >             > [2016-04-03 10:12:40.636354]  : volume status : SUCCESS
> >             > ......
> >             > ......
> >             > ......
> >             > [2016-04-03 10:14:21.446570]  : volume status : SUCCESS
> >             > [2016-04-03 10:14:21.665889]  : volume remove-brick c_glusterfs
> >             > replica 1 10.32.1.144:/opt/lvmdir/c2/brick force : SUCCESS
> >             > [2016-04-03 10:14:21.764270]  : peer detach 10.32.1.144 : SUCCESS
> >             > [2016-04-03 10:14:23.060442]  : peer probe 10.32.1.144 : SUCCESS
> >             > [2016-04-03 10:14:25.649525]  : volume add-brick c_glusterfs
> >             > replica 2 10.32.1.144:/opt/lvmdir/c2/brick force : SUCCESS
> >             >
> >             > In the above logs we are continuously checking the brick status,
> >             > but when the brick status was still not 'online' after ~2 minutes,
> >             > we removed the brick and added it again to bring it online.
> >             >
> >             > [2016-04-03 10:14:21.665889]  : volume remove-brick c_glusterfs
> >             > replica 1 10.32.1.144:/opt/lvmdir/c2/brick force : SUCCESS
> >             > [2016-04-03 10:14:21.764270]  : peer detach 10.32.1.144 : SUCCESS
> >             > [2016-04-03 10:14:23.060442]  : peer probe 10.32.1.144 : SUCCESS
> >             > [2016-04-03 10:14:25.649525]  : volume add-brick c_glusterfs
> >             > replica 2 10.32.1.144:/opt/lvmdir/c2/brick force : SUCCESS
> >             >
> >             > That is why we are getting the "brick start trigger" logs at
> >             > timestamp 10:14:25
> >             >
> >             > [2016-04-03 10:14:25.268833] I [MSGID: 100030]
> >             > [glusterfsd.c:2318:main] 0-/usr/sbin/glusterfsd: Started running
> >             > /usr/sbin/glusterfsd version 3.7.6 (args: /usr/sbin/glusterfsd -s
> >             > 10.32.1.144 --volfile-id c_glusterfs.10.32.1.144.opt-lvmdir-c2-brick
> >             > -p /system/glusterd/vols/c_glusterfs/run/10.32.1.144-opt-lvmdir-c2-brick.pid
> >             > -S /var/run/gluster/697c0e4a16ebc734cd06fd9150723005.socket
> >             > --brick-name /opt/lvmdir/c2/brick -l
> >             > /var/log/glusterfs/bricks/opt-lvmdir-c2-brick.log --xlator-option
> >             > *-posix.glusterd-uuid=2d576ff8-0cea-4f75-9e34-a5674fbf7256
> >             > --brick-port 49329 --xlator-option
> >             > c_glusterfs-server.listen-port=49329)
> >             >
> >             >
> >             > Regards,
> >             > Abhishek
> >             >
> >             >
> >             >     Please note that all the logs referred to and pasted are from
> >             >     002500.
> >             >
> >             >     ~Atin
> >             >     >
> >             >     > 002500 - Board B, whose brick is offline
> >             >     > 000300 - Board A logs
> >             >     >
> >             >     >     >
> >             >     >     > *Question: What could be causing the brick to go offline?*
> >             >     >     >
> >             >     >     >
> >             >     >     > --
> >             >     >     >
> >             >     >     > Regards
> >             >     >     > Abhishek Paliwal
> >             >     >     >
> >             >     >     >
> >             >     >     >
> >             >     >
> >             >     >
> >             >     >
> >             >     >
> >             >
> >             >
> >             >
> >             >
> >             > --
> >             >
> >             >
> >             >
> >             >
> >
> >
> >
> >
> >
> >
> > --
> >
> >
> >
> >
> > Regards
> > Abhishek Paliwal
>



-- 




Regards
Abhishek Paliwal