[Gluster-devel] Gluster Brick Offline after reboot!!

ABHISHEK PALIWAL abhishpaliwal at gmail.com
Thu Apr 14 10:37:45 UTC 2016


On Thu, Apr 14, 2016 at 2:33 PM, Atin Mukherjee <amukherj at redhat.com> wrote:

>
>
> On 04/05/2016 03:35 PM, ABHISHEK PALIWAL wrote:
> >
> >
> > On Tue, Apr 5, 2016 at 2:22 PM, Atin Mukherjee <amukherj at redhat.com
> > <mailto:amukherj at redhat.com>> wrote:
> >
> >
> >
> >     On 04/05/2016 01:04 PM, ABHISHEK PALIWAL wrote:
> >     > Hi Team,
> >     >
> >     > We are using Gluster 3.7.6 and facing a problem in which a brick
> >     > does not come online after the board is restarted.
> >     >
> >     > To understand our setup, please look at the following steps:
> >     > 1. We have two boards, A and B, on which a Gluster volume is
> >     > running in replicated mode with one brick on each board.
> >     > 2. The Gluster mount point is present on Board A and is shared
> >     > among a number of processes.
> >     > 3. Until now our volume has been in sync and everything has
> >     > worked fine.
> >     > 4. Now we have a test case in which we stop glusterd, reboot
> >     > Board B and, when the board comes up, start glusterd on it again.
> >     > 5. We repeated Step 4 multiple times to check the reliability of
> >     > the system.
> >     > 6. After Step 4, sometimes the system comes up in a working state
> >     > (i.e. in sync), but sometimes the brick of Board B is present in
> >     > the "gluster volume status" output yet does not come online even
> >     > after waiting for more than a minute.
> >     As I mentioned in another email thread, until and unless the log
> >     shows evidence that there was a reboot, nothing can be concluded.
> >     The last log you shared with us a few days back didn't give any
> >     indication that the brick process wasn't running.
> >
> > How can we identify from the brick logs that the brick process is running?
> >
> >     > 7. While Step 4 is executing, some processes on Board A start
> >     > accessing files from the Gluster mount point.
> >     >
> >     > As a solution to bring this brick online, we found some existing
> >     > issues on the Gluster mailing list suggesting the use of "gluster
> >     > volume start <vol_name> force" to bring the brick from 'offline'
> >     > to 'online'.
> >     >
> >     > If we use the "gluster volume start <vol_name> force" command, it
> >     > will kill the existing volume process and start a new one. What
> >     > will happen if other processes are accessing the same volume at
> >     > the time the volume process is killed internally by this command?
> >     > Will it cause any failure in those processes?
> >     This is not true; volume start force will start the brick processes
> >     only if they are not running. Running brick processes will not be
> >     interrupted.
> >
> > We tried this and checked the pid of the process before and after force
> > start; the pid had changed after force start.
> >
> > Please find the logs at the time of failure attached once again with
> > log-level=debug.
> >
> > If you can find the exact line in the brick log file that shows the
> > brick process is running, please give me the line number in that file.
>
> Here is the sequence in which glusterd and the respective brick process
> are restarted.
>
> 1. glusterd restart trigger - line number 1014 in glusterd.log file:
>
> [2016-04-03 10:12:29.051735] I [MSGID: 100030] [glusterfsd.c:2318:main]
> 0-/usr/sbin/glusterd: Started running /usr/sbin/glusterd version 3.7.6
> (args: /usr/sbin/glusterd -p /var/run/glusterd.pid --log-level DEBUG)
>
> 2. brick start trigger - line number 190 in opt-lvmdir-c2-brick.log
>
> [2016-04-03 10:14:25.268833] I [MSGID: 100030] [glusterfsd.c:2318:main]
> 0-/usr/sbin/glusterfsd: Started running /usr/sbin/glusterfsd
> version 3.7.6 (args: /usr/sbin/glusterfsd -s 10.32.1.144 --volfile-id
> c_glusterfs.10.32.1.144.opt-lvmdir-c2-brick -p
> /system/glusterd/vols/c_glusterfs/run/10.32.1.144-opt-lvmdir-c2-brick.pid
> -S /var/run/gluster/697c0e4a16ebc734cd06fd9150723005.socket
> --brick-name /opt/lvmdir/c2/brick -l
> /var/log/glusterfs/bricks/opt-lvmdir-c2-brick.log --xlator-option
> *-posix.glusterd-uuid=2d576ff8-0cea-4f75-9e34-a5674fbf7256
> --brick-port 49329 --xlator-option c_glusterfs-server.listen-port=49329)
>
> 3. The following log entry indicates that the brick is up and has now
> started. Refer to line 16123 in glusterd.log:
>
> [2016-04-03 10:14:25.336855] D [MSGID: 0]
> [glusterd-handler.c:4897:__glusterd_brick_rpc_notify] 0-management:
> Connected to 10.32.1.144:/opt/lvmdir/c2/brick
>
> This clearly indicates that the brick is up and running, as after that I
> do not see any disconnect event processed by glusterd for the brick
> process.
>

Thanks for the detailed reply, but please also clarify a few more doubts:

1. At the 10:14:25 timestamp the brick is available because we had removed
the brick and added it again to bring it online. The following are the
logs from the cmd-history.log file of 000300:

[2016-04-03 10:14:21.446570]  : volume status : SUCCESS
[2016-04-03 10:14:21.665889]  : volume remove-brick c_glusterfs replica 1
10.32.1.144:/opt/lvmdir/c2/brick force : SUCCESS
[2016-04-03 10:14:21.764270]  : peer detach 10.32.1.144 : SUCCESS
[2016-04-03 10:14:23.060442]  : peer probe 10.32.1.144 : SUCCESS
[2016-04-03 10:14:25.649525]  : volume add-brick c_glusterfs replica 2
10.32.1.144:/opt/lvmdir/c2/brick force : SUCCESS

Also, 10:12:29 was the last reboot time before this failure, so I totally
agree with what you said earlier.

2. As you said, glusterd restarted at 10:12:29; then why do we not get the
'brick start trigger' logs like the ones below between the 10:12:29 and
10:14:25 timestamps, which is roughly a two-minute interval?

[2016-04-03 10:14:25.268833] I [MSGID: 100030] [glusterfsd.c:2318:main]
0-/usr/sbin/glusterfsd: Started running /usr/sbin/glusterfsd
version 3.7.6 (args: /usr/sbin/glusterfsd -s 10.32.1.144 --volfile-id
c_glusterfs.10.32.1.144.opt-lvmdir-c2-brick -p
/system/glusterd/vols/c_glusterfs/run/10.32.1.144-opt-lvmdir-c2-brick.pid
-S /var/run/gluster/697c0e4a16ebc734cd06fd9150723005.socket
--brick-name /opt/lvmdir/c2/brick -l
/var/log/glusterfs/bricks/opt-lvmdir-c2-brick.log --xlator-option
*-posix.glusterd-uuid=2d576ff8-0cea-4f75-9e34-a5674fbf7256
--brick-port 49329 --xlator-option c_glusterfs-server.listen-port=49329)

3. We were continuously checking the brick status during the above time
interval using "gluster volume status"; refer to the cmd-history.log file
from 000300.

In the glusterd.log file we are also getting the logs below:

[2016-04-03 10:12:31.771051] D [MSGID: 0]
[glusterd-handler.c:4897:__glusterd_brick_rpc_notify] 0-management:
Connected to 10.32.1.144:/opt/lvmdir/c2/brick

[2016-04-03 10:12:32.981152] D [MSGID: 0]
[glusterd-handler.c:4897:__glusterd_brick_rpc_notify] 0-management:
Connected to 10.32.1.144:/opt/lvmdir/c2/brick

two times between 10:12:29 and 10:14:25, and as you said these logs
"clearly indicate that the brick is up and running", then why is the brick
not online in the "gluster volume status" output?

[2016-04-03 10:12:33.990487]  : volume status : SUCCESS
[2016-04-03 10:12:34.007469]  : volume status : SUCCESS
[2016-04-03 10:12:35.095918]  : volume status : SUCCESS
[2016-04-03 10:12:35.126369]  : volume status : SUCCESS
[2016-04-03 10:12:36.224018]  : volume status : SUCCESS
[2016-04-03 10:12:36.251032]  : volume status : SUCCESS
[2016-04-03 10:12:37.352377]  : volume status : SUCCESS
[2016-04-03 10:12:37.374028]  : volume status : SUCCESS
[2016-04-03 10:12:38.446148]  : volume status : SUCCESS
[2016-04-03 10:12:38.468860]  : volume status : SUCCESS
[2016-04-03 10:12:39.534017]  : volume status : SUCCESS
[2016-04-03 10:12:39.553711]  : volume status : SUCCESS
[2016-04-03 10:12:40.616610]  : volume status : SUCCESS
[2016-04-03 10:12:40.636354]  : volume status : SUCCESS
......
......
......
[2016-04-03 10:14:21.446570]  : volume status : SUCCESS
[2016-04-03 10:14:21.665889]  : volume remove-brick c_glusterfs replica 1
10.32.1.144:/opt/lvmdir/c2/brick force : SUCCESS
[2016-04-03 10:14:21.764270]  : peer detach 10.32.1.144 : SUCCESS
[2016-04-03 10:14:23.060442]  : peer probe 10.32.1.144 : SUCCESS
[2016-04-03 10:14:25.649525]  : volume add-brick c_glusterfs replica 2
10.32.1.144:/opt/lvmdir/c2/brick force : SUCCESS

In the above logs we were continuously checking the brick status, but when
we did not find the brick status 'online' even after ~2 minutes, we
removed the brick and added it again to bring it online.

[2016-04-03 10:14:21.665889]  : volume remove-brick c_glusterfs replica 1
10.32.1.144:/opt/lvmdir/c2/brick force : SUCCESS
[2016-04-03 10:14:21.764270]  : peer detach 10.32.1.144 : SUCCESS
[2016-04-03 10:14:23.060442]  : peer probe 10.32.1.144 : SUCCESS
[2016-04-03 10:14:25.649525]  : volume add-brick c_glusterfs replica 2
10.32.1.144:/opt/lvmdir/c2/brick force : SUCCESS

That is why we are getting the 'brick start trigger' logs at timestamp
10:14:25:

[2016-04-03 10:14:25.268833] I [MSGID: 100030] [glusterfsd.c:2318:main]
0-/usr/sbin/glusterfsd: Started running /usr/sbin/glusterfsd
version 3.7.6 (args: /usr/sbin/glusterfsd -s 10.32.1.144 --volfile-id
c_glusterfs.10.32.1.144.opt-lvmdir-c2-brick -p
/system/glusterd/vols/c_glusterfs/run/10.32.1.144-opt-lvmdir-c2-brick.pid
-S /var/run/gluster/697c0e4a16ebc734cd06fd9150723005.socket
--brick-name /opt/lvmdir/c2/brick -l
/var/log/glusterfs/bricks/opt-lvmdir-c2-brick.log --xlator-option
*-posix.glusterd-uuid=2d576ff8-0cea-4f75-9e34-a5674fbf7256
--brick-port 49329 --xlator-option c_glusterfs-server.listen-port=49329)
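As an aside, the pid change we observed after "volume start force" (mentioned earlier in this thread) can be checked mechanically. A minimal sketch, assuming the brick pid file path taken from the glusterfsd args in the log above; the `pid_changed` helper is a hypothetical name:

```shell
#!/bin/sh
# Sketch: detect whether a brick process was respawned by comparing the
# contents of its pid file before and after `gluster volume start ... force`.
# The pid file path is taken from the glusterfsd args in the log above.
PIDFILE=/system/glusterd/vols/c_glusterfs/run/10.32.1.144-opt-lvmdir-c2-brick.pid

pid_changed() {
    # $1 = pid before, $2 = pid after; succeeds if the brick was respawned
    [ -n "$1" ] && [ -n "$2" ] && [ "$1" != "$2" ]
}

# On a live node one would run something like:
#   before=$(cat "$PIDFILE")
#   gluster volume start c_glusterfs force
#   after=$(cat "$PIDFILE")
#   pid_changed "$before" "$after" && echo "brick was respawned"
```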


Regards,
Abhishek


> Please note that all the logs referred and pasted are from 002500.
>
> ~Atin
> >
> > 002500 - Board B logs (the board whose brick is offline)
> > 000300 - Board A logs
> >
> >     >
> >     > *Question : What could be contributing to brick offline?*
> >     >
> >     >
> >     > --
> >     >
> >     > Regards
> >     > Abhishek Paliwal
> >     >
> >     >
> >     > _______________________________________________
> >     > Gluster-devel mailing list
> >     > Gluster-devel at gluster.org <mailto:Gluster-devel at gluster.org>
> >     > http://www.gluster.org/mailman/listinfo/gluster-devel
> >     >
> >
> >
> >
> >
>



-- 

Regards
Abhishek Paliwal

