[Gluster-users] [Gluster-devel] Gluster Brick Offline after reboot!!
ABHISHEK PALIWAL
abhishpaliwal at gmail.com
Wed May 4 12:48:05 UTC 2016
I am talking about the time taken by GlusterD to mark the process
offline, because here GlusterD is responsible for marking the brick
online/offline.
Is that time configurable?
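
For context, a minimal sketch of how one might inspect or set
timeout-related volume options; network.ping-timeout is a client-side
option, and whether any tunable applies to GlusterD's own brick
connection is exactly the open question here:

    # show the volume's reconfigured options
    gluster volume info c_glusterfs

    # example: the client-side ping timeout (default 42 seconds)
    gluster volume set c_glusterfs network.ping-timeout 42
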
On Wed, May 4, 2016 at 5:53 PM, Atin Mukherjee <amukherj at redhat.com> wrote:
> Abhishek,
>
> See the response inline.
>
>
> On 05/04/2016 05:43 PM, ABHISHEK PALIWAL wrote:
> > Hi Atin,
> >
> > Please reply: is there any configurable timeout parameter for the
> > brick process going offline which we can increase?
> >
> > Regards,
> > Abhishek
> >
> > On Thu, Apr 21, 2016 at 12:34 PM, ABHISHEK PALIWAL
> > <abhishpaliwal at gmail.com> wrote:
> >
> > Hi Atin,
> >
> > Please answer the following doubts as well:
> >
> > 1. If there is a temporary glitch in the network, will that affect
> > the gluster brick process in any way? Is there any timeout after
> > which the brick process is marked offline in case of such a glitch?
> If there is a disconnection, GlusterD will receive the event and mark
> the brick as disconnected even if the brick process is online. So the
> answer to this question is both yes and no. From the process
> perspective the bricks are still up, but not to the other
> components/layers, and that may impact operations (both mgmt & I/O,
> given there is a disconnect between the client and brick processes too)
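>
> A quick way to see that divergence is to compare the process state with
> GlusterD's view; a minimal sketch, using the brick path and volume name
> from the logs in this thread:
>
>     # is the brick process itself alive?
>     pgrep -af glusterfsd | grep '/opt/lvmdir/c2/brick'
>
>     # what does glusterd report? The "Online" column should show Y.
>     gluster volume status c_glusterfs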
> >
> > 2. Is there any configurable timeout parameter which we can
> > increase?
> I don't get this question. What timeout are you talking about?
> >
> > 3. Brick and glusterd are connected by a unix domain socket. If it is
> > just a local socket, why does it disconnect in the logs below?
> This is not true; it's over a TCP socket.
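>
> You can verify the transport yourself; a minimal sketch, assuming ss
> from iproute2 is available on the board:
>
>     # list TCP sockets held by glusterd and the brick process; the
>     # brick <-> glusterd management connection shows up here
>     ss -tnp | grep -E 'glusterd|glusterfsd'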
> >
> > 1667 [2016-04-03 10:12:32.984331] I [MSGID: 106005]
> > [glusterd-handler.c:4908:__glusterd_brick_rpc_notify] 0-management:
> > Brick 10.32.1.144:/opt/lvmdir/c2/brick has disconnected from glusterd.
> > 1668 [2016-04-03 10:12:32.984366] D [MSGID: 0]
> > [glusterd-utils.c:4872:glusterd_set_brick_status] 0-glusterd: Setting
> > brick 10.32.1.144:/opt/lvmdir/c2/brick status to stopped
> >
> > Regards,
> > Abhishek
> >
> >
> > On Tue, Apr 19, 2016 at 1:12 PM, ABHISHEK PALIWAL
> > <abhishpaliwal at gmail.com> wrote:
> >
> > Hi Atin,
> >
> > Thanks.
> >
> > I have some more doubts here.
> >
> > Brick and glusterd are connected by a unix domain socket. If it is
> > just a local socket, why does it disconnect in the logs below?
> >
> > 1667 [2016-04-03 10:12:32.984331] I [MSGID: 106005]
> > [glusterd-handler.c:4908:__glusterd_brick_rpc_notify] 0-management:
> > Brick 10.32.1.144:/opt/lvmdir/c2/brick has disconnected from glusterd.
> > 1668 [2016-04-03 10:12:32.984366] D [MSGID: 0]
> > [glusterd-utils.c:4872:glusterd_set_brick_status] 0-glusterd: Setting
> > brick 10.32.1.144:/opt/lvmdir/c2/brick status to stopped
> >
> >
> > Regards,
> > Abhishek
> >
> >
> > On Fri, Apr 15, 2016 at 9:14 AM, Atin Mukherjee
> > <amukherj at redhat.com> wrote:
> >
> >
> >
> > On 04/14/2016 04:07 PM, ABHISHEK PALIWAL wrote:
> > >
> > >
> > > On Thu, Apr 14, 2016 at 2:33 PM, Atin Mukherjee
> > > <amukherj at redhat.com> wrote:
> > >
> > >
> > >
> > > On 04/05/2016 03:35 PM, ABHISHEK PALIWAL wrote:
> > > >
> > > >
> > > > On Tue, Apr 5, 2016 at 2:22 PM, Atin Mukherjee
> > > > <amukherj at redhat.com> wrote:
> > > >
> > > >
> > > >
> > > > On 04/05/2016 01:04 PM, ABHISHEK PALIWAL wrote:
> > > > > Hi Team,
> > > > >
> > > > > We are using Gluster 3.7.6 and facing a problem in which a brick
> > > > > does not come online after a restart of the board.
> > > > >
> > > > > To understand our setup, please see the following steps:
> > > > > 1. We have two boards, A and B, on which a Gluster volume is
> > > > > running in replicated mode with one brick on each board.
> > > > > 2. The Gluster mount point is present on Board A and is shared
> > > > > between a number of processes.
> > > > > 3. Up to this point the volume is in sync and everything is
> > > > > working fine.
> > > > > 4. Now we have a test case in which we stop glusterd, reboot
> > > > > Board B, and when the board comes up, start glusterd on it again
> > > > > (see the sketch after this list).
> > > > > 5. We repeated Step 4 multiple times to check the reliability of
> > > > > the system.
> > > > > 6. After Step 4, the system sometimes comes back to a working
> > > > > state (i.e. in sync), but sometimes the brick of Board B is
> > > > > present in the "gluster volume status" output yet does not come
> > > > > online even after waiting for more than a minute.
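> > > > >
> > > > > A minimal shell sketch of the Step 4 cycle; the init-script path
> > > > > is an assumption, so substitute however glusterd is managed on
> > > > > the board:
> > > > >
> > > > >     # on Board B
> > > > >     /etc/init.d/glusterd stop   # assumption: SysV init script
> > > > >     reboot
> > > > >     # ... once the board is back up:
> > > > >     /etc/init.d/glusterd start
> > > > >     gluster volume status       # watch the brick's Online column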
> > > > As I mentioned in another email thread, until and unless the log
> > > > shows evidence that there was a reboot, nothing can be concluded.
> > > > The last log you shared with us a few days back gave no indication
> > > > that the brick process wasn't running.
> > > >
> > > > How can we identify from the brick logs that the brick process is
> > > > running?
> > > >
> > > > > 7. While Step 4 is executing, some processes on Board A start
> > > > > accessing files from the Gluster mount point.
> > > > >
> > > > > As a solution to bring this brick online, we found some existing
> > > > > issues on the gluster mailing list suggesting the use of
> > > > > "gluster volume start <vol_name> force" to take the brick from
> > > > > 'offline' to 'online'.
> > > > >
> > > > > If the "gluster volume start <vol_name> force" command kills the
> > > > > existing volume process and starts a new one, what will happen
> > > > > to other processes that are accessing the same volume at the
> > > > > moment the volume process is killed internally by this command?
> > > > > Will it cause any failure in those processes?
> > > > This is not true; volume start force will start the brick
> > > > processes only if they are not running. Running brick processes
> > > > will not be interrupted.
> > > >
> > > > We tried this and checked the pid of the process before and after
> > > > the force start; the pid had changed after the force start.
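> > > >
> > > > A minimal way to reproduce that check, using the brick path from
> > > > the logs in this thread:
> > > >
> > > >     # pid before the force start
> > > >     pgrep -f 'glusterfsd.*opt-lvmdir-c2-brick'
> > > >     gluster volume start c_glusterfs force
> > > >     # pid after; a changed pid means the brick was (re)started
> > > >     pgrep -f 'glusterfsd.*opt-lvmdir-c2-brick'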
> > > >
> > > > Please find the logs from the time of failure attached once
> > > > again, with log-level=debug.
> > > >
> > > > If you can find the exact line in the brick log file showing that
> > > > the brick process is running, please give me the line number in
> > > > that file.
> > >
> > > Here is the sequence in which glusterd and the respective brick
> > > process are restarted.
> > >
> > > 1. glusterd restart trigger - line number 1014 in glusterd.log file:
> > >
> > > [2016-04-03 10:12:29.051735] I [MSGID: 100030] [glusterfsd.c:2318:main]
> > > 0-/usr/sbin/glusterd: Started running /usr/sbin/glusterd
> > > version 3.7.6 (args: /usr/sbin/glusterd -p /var/run/glusterd.pid
> > > --log-level DEBUG)
> > >
> > > 2. brick start trigger - line number 190 in opt-lvmdir-c2-brick.log
> > >
> > > [2016-04-03 10:14:25.268833] I [MSGID: 100030] [glusterfsd.c:2318:main]
> > > 0-/usr/sbin/glusterfsd: Started running /usr/sbin/glusterfsd
> > > version 3.7.6 (args: /usr/sbin/glusterfsd -s 10.32.1.144 --volfile-id
> > > c_glusterfs.10.32.1.144.opt-lvmdir-c2-brick -p
> > > /system/glusterd/vols/c_glusterfs/run/10.32.1.144-opt-lvmdir-c2-brick.pid
> > > -S /var/run/gluster/697c0e4a16ebc734cd06fd9150723005.socket
> > > --brick-name /opt/lvmdir/c2/brick -l
> > > /var/log/glusterfs/bricks/opt-lvmdir-c2-brick.log --xlator-option
> > > *-posix.glusterd-uuid=2d576ff8-0cea-4f75-9e34-a5674fbf7256
> > > --brick-port 49329 --xlator-option c_glusterfs-server.listen-port=49329)
> > >
> > > 3. The following log indicates that the brick is up and has now
> > > started. Refer to line 16123 in glusterd.log:
> > >
> > > [2016-04-03 10:14:25.336855] D [MSGID: 0]
> > > [glusterd-handler.c:4897:__glusterd_brick_rpc_notify] 0-management:
> > > Connected to 10.32.1.144:/opt/lvmdir/c2/brick
> > >
> > > This clearly indicates that the brick is up and running, as after
> > > that I do not see any disconnect event being processed by glusterd
> > > for the brick process.
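> > >
> > > A quick way to pull this sequence out of the logs; a minimal sketch
> > > using the log file names referred to in this thread (adjust paths to
> > > wherever the board writes them):
> > >
> > >     # when did glusterd and the brick process (re)start?
> > >     grep -n 'Started running' glusterd.log opt-lvmdir-c2-brick.log
> > >
> > >     # connect/disconnect events glusterd processed for the brick
> > >     grep -n '__glusterd_brick_rpc_notify' glusterd.log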
> > >
> > >
> > > Thanks for the detailed reply, but please also clear up some more
> > > doubts:
> > >
> > > 1. At the 10:14:25 timestamp the brick is available because we had
> > > removed the brick and added it again to bring it online. The
> > > following are the logs from the cmd-history.log file of 000300:
> > >
> > > [2016-04-03 10:14:21.446570] : volume status : SUCCESS
> > > [2016-04-03 10:14:21.665889] : volume remove-brick
> > c_glusterfs replica
> > > 1 10.32.1.144:/opt/lvmdir/c2/brick force : SUCCESS
> > > [2016-04-03 10:14:21.764270] : peer detach 10.32.1.144 :
> > SUCCESS
> > > [2016-04-03 10:14:23.060442] : peer probe 10.32.1.144 :
> > SUCCESS
> > > [2016-04-03 10:14:25.649525] : volume add-brick
> > c_glusterfs replica 2
> > > 10.32.1.144:/opt/lvmdir/c2/brick force : SUCCESS
> > >
> > > Also, 10:12:29 was the last reboot time before this failure, so I
> > > fully agree with what you said earlier.
> > >
> > > 2. As you said, glusterd restarted at 10:12:29; then why are we not
> > > getting 'brick start trigger' logs like the ones below between the
> > > 10:12:29 and 10:14:25 timestamps, which is roughly a two-minute
> > > interval?
> > So here is the culprit:
> >
> > 1667 [2016-04-03 10:12:32.984331] I [MSGID: 106005]
> > [glusterd-handler.c:4908:__glusterd_brick_rpc_notify] 0-management:
> > Brick 10.32.1.144:/opt/lvmdir/c2/brick has disconnected from glusterd.
> > 1668 [2016-04-03 10:12:32.984366] D [MSGID: 0]
> > [glusterd-utils.c:4872:glusterd_set_brick_status] 0-glusterd: Setting
> > brick 10.32.1.144:/opt/lvmdir/c2/brick status to stopped
> >
> >
> > GlusterD received a disconnect event for this brick process and marked
> > it as stopped. This could happen for two reasons: 1. the brick process
> > goes down, or 2. a network issue. In this case I believe it's the
> > latter, since the brick process was running at that time. I'd request
> > you to check this from the N/W side.
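> >
> > A couple of quick checks for the network theory; a minimal sketch
> > (addresses from this thread; the dmesg pattern is only a guess at how
> > the driver words link events):
> >
> >     # basic reachability of the peer board
> >     ping -c 5 10.32.1.144
> >
> >     # look for link flaps in the kernel log around the failure window
> >     dmesg | grep -iE 'link.*(up|down)'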
> >
> >
> > >
> > > [2016-04-03 10:14:25.268833] I [MSGID: 100030] [glusterfsd.c:2318:main]
> > > 0-/usr/sbin/glusterfsd: Started running /usr/sbin/glusterfsd
> > > version 3.7.6 (args: /usr/sbin/glusterfsd -s 10.32.1.144 --volfile-id
> > > c_glusterfs.10.32.1.144.opt-lvmdir-c2-brick -p
> > > /system/glusterd/vols/c_glusterfs/run/10.32.1.144-opt-lvmdir-c2-brick.pid
> > > -S /var/run/gluster/697c0e4a16ebc734cd06fd9150723005.socket
> > > --brick-name /opt/lvmdir/c2/brick -l
> > > /var/log/glusterfs/bricks/opt-lvmdir-c2-brick.log --xlator-option
> > > *-posix.glusterd-uuid=2d576ff8-0cea-4f75-9e34-a5674fbf7256
> > > --brick-port 49329 --xlator-option c_glusterfs-server.listen-port=49329)
> > >
> > > 3. We were continuously checking the brick status during the above
> > > time window using "gluster volume status"; refer to the
> > > cmd-history.log file from 000300.
> > >
> > > In the glusterd.log file we are also getting the logs below:
> > >
> > > [2016-04-03 10:12:31.771051] D [MSGID: 0]
> > > [glusterd-handler.c:4897:__glusterd_brick_rpc_notify] 0-management:
> > > Connected to 10.32.1.144:/opt/lvmdir/c2/brick
> > >
> > > [2016-04-03 10:12:32.981152] D [MSGID: 0]
> > > [glusterd-handler.c:4897:__glusterd_brick_rpc_notify] 0-management:
> > > Connected to 10.32.1.144:/opt/lvmdir/c2/brick
> > >
> > > These appear twice between 10:12:29 and 10:14:25, and as you said,
> > > such logs "clearly indicate that the brick is up and running"; then
> > > why is the brick not online in the "gluster volume status" output?
> > >
> > > [2016-04-03 10:12:33.990487] : volume status : SUCCESS
> > > [2016-04-03 10:12:34.007469] : volume status : SUCCESS
> > > [2016-04-03 10:12:35.095918] : volume status : SUCCESS
> > > [2016-04-03 10:12:35.126369] : volume status : SUCCESS
> > > [2016-04-03 10:12:36.224018] : volume status : SUCCESS
> > > [2016-04-03 10:12:36.251032] : volume status : SUCCESS
> > > [2016-04-03 10:12:37.352377] : volume status : SUCCESS
> > > [2016-04-03 10:12:37.374028] : volume status : SUCCESS
> > > [2016-04-03 10:12:38.446148] : volume status : SUCCESS
> > > [2016-04-03 10:12:38.468860] : volume status : SUCCESS
> > > [2016-04-03 10:12:39.534017] : volume status : SUCCESS
> > > [2016-04-03 10:12:39.553711] : volume status : SUCCESS
> > > [2016-04-03 10:12:40.616610] : volume status : SUCCESS
> > > [2016-04-03 10:12:40.636354] : volume status : SUCCESS
> > > ......
> > > ......
> > > ......
> > > [2016-04-03 10:14:21.446570] : volume status : SUCCESS
> > > [2016-04-03 10:14:21.665889] : volume remove-brick
> > c_glusterfs replica
> > > 1 10.32.1.144:/opt/lvmdir/c2/brick force : SUCCESS
> > > [2016-04-03 10:14:21.764270] : peer detach 10.32.1.144 :
> > SUCCESS
> > > [2016-04-03 10:14:23.060442] : peer probe 10.32.1.144 :
> > SUCCESS
> > > [2016-04-03 10:14:25.649525] : volume add-brick
> > c_glusterfs replica 2
> > > 10.32.1.144:/opt/lvmdir/c2/brick force : SUCCESS
> > >
> > > In the logs above we are continuously checking the brick status, but
> > > when we do not find the brick 'online' even after ~2 minutes, we
> > > remove it and add it again to bring it online (see the polling
> > > sketch below).
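> > >
> > > For reference, a minimal sketch of that polling step before falling
> > > back to remove-brick/add-brick; matching " Y " against the Online
> > > column of the status line is an assumption about the output format:
> > >
> > >     # poll for up to ~2 minutes
> > >     for i in $(seq 1 24); do
> > >         gluster volume status c_glusterfs \
> > >             | grep '/opt/lvmdir/c2/brick' | grep -q ' Y ' && break
> > >         sleep 5
> > >     done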
> > >
> > > [2016-04-03 10:14:21.665889] : volume remove-brick
> > c_glusterfs replica
> > > 1 10.32.1.144:/opt/lvmdir/c2/brick force : SUCCESS
> > > [2016-04-03 10:14:21.764270] : peer detach 10.32.1.144 :
> > SUCCESS
> > > [2016-04-03 10:14:23.060442] : peer probe 10.32.1.144 :
> > SUCCESS
> > > [2016-04-03 10:14:25.649525] : volume add-brick
> > c_glusterfs replica 2
> > > 10.32.1.144:/opt/lvmdir/c2/brick force : SUCCESS
> > >
> > > That is why in the logs we are getting the "brick start trigger"
> > > logs at timestamp 10:14:25:
> > >
> > > [2016-04-03 10:14:25.268833] I [MSGID: 100030] [glusterfsd.c:2318:main]
> > > 0-/usr/sbin/glusterfsd: Started running /usr/sbin/glusterfsd
> > > version 3.7.6 (args: /usr/sbin/glusterfsd -s 10.32.1.144 --volfile-id
> > > c_glusterfs.10.32.1.144.opt-lvmdir-c2-brick -p
> > > /system/glusterd/vols/c_glusterfs/run/10.32.1.144-opt-lvmdir-c2-brick.pid
> > > -S /var/run/gluster/697c0e4a16ebc734cd06fd9150723005.socket
> > > --brick-name /opt/lvmdir/c2/brick -l
> > > /var/log/glusterfs/bricks/opt-lvmdir-c2-brick.log --xlator-option
> > > *-posix.glusterd-uuid=2d576ff8-0cea-4f75-9e34-a5674fbf7256
> > > --brick-port 49329 --xlator-option c_glusterfs-server.listen-port=49329)
> > >
> > >
> > > Regards,
> > > Abhishek
> > >
> > >
> > > Please note that all the logs referred to and pasted are from 002500.
> > >
> > > ~Atin
> > > >
> > > > 002500 - Board B, whose brick is offline
> > > > 000300 - Board A logs
> > > >
> > > > >
> > > > > *Question: What could be contributing to the brick going
> > > > > offline?*
> > > > >
> > > > >
> > > > > --
> > > > >
> > > > > Regards
> > > > > Abhishek Paliwal
> > > > >
> > > > >
> > > > > _______________________________________________
> > > > > Gluster-devel mailing list
> > > > > Gluster-devel at gluster.org
> > > > > http://www.gluster.org/mailman/listinfo/gluster-devel
> > > > >
> >
> > --
> > Regards
> > Abhishek Paliwal
>
--
Regards
Abhishek Paliwal