[Gluster-devel] Gluster Brick Offline after reboot!!

Atin Mukherjee amukherj at redhat.com
Wed May 4 12:57:39 UTC 2016



On 05/04/2016 06:18 PM, ABHISHEK PALIWAL wrote:
> I am talking about the time taken by GlusterD to mark the process
> offline, because here GlusterD is responsible for marking the brick
> online/offline.
> 
> Is it configurable?
No, there is no such configuration option.
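
Since there is nothing to tune here, the practical check is what GlusterD
last recorded for the brick. Below is a minimal Python sketch (not part of
GlusterFS) that scans glusterd.log lines for the two
__glusterd_brick_rpc_notify messages pasted in this thread, "Connected to
<brick>" and "Brick <brick> has disconnected from glusterd", and reports the
most recent one. The names brick_events and last_state are made up for
illustration, and the message wording is assumed to be as logged by 3.7.6 in
these excerpts; other versions may differ.

```python
import re

# Message wording copied from the glusterd.log excerpts in this thread
# (GlusterFS 3.7.6); treat it as an assumption for any other version.
EVENT_RE = re.compile(
    r"\[(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+)\]"
    r".*?(?:Connected to (?P<up>\S+)"
    r"|Brick (?P<down>\S+) has disconnected from glusterd)"
)


def brick_events(lines, brick):
    """Return [(timestamp, state)] for one brick, in log order."""
    events = []
    for line in lines:
        m = EVENT_RE.search(line)
        if m is None:
            continue
        if m.group("up") == brick:
            events.append((m.group("ts"), "connected"))
        elif m.group("down") == brick:
            events.append((m.group("ts"), "disconnected"))
    return events


def last_state(lines, brick):
    """Return the most recent (timestamp, state), or None if no event."""
    events = brick_events(lines, brick)
    return events[-1] if events else None


if __name__ == "__main__":
    brick = "10.32.1.144:/opt/lvmdir/c2/brick"
    # Three lines taken from the glusterd.log excerpts in this thread.
    sample = [
        "[2016-04-03 10:12:31.771051] D [MSGID: 0] "
        "[glusterd-handler.c:4897:__glusterd_brick_rpc_notify] "
        "0-management: Connected to 10.32.1.144:/opt/lvmdir/c2/brick",
        "[2016-04-03 10:12:32.981152] D [MSGID: 0] "
        "[glusterd-handler.c:4897:__glusterd_brick_rpc_notify] "
        "0-management: Connected to 10.32.1.144:/opt/lvmdir/c2/brick",
        "[2016-04-03 10:12:32.984331] I [MSGID: 106005] "
        "[glusterd-handler.c:4908:__glusterd_brick_rpc_notify] "
        "0-management: Brick 10.32.1.144:/opt/lvmdir/c2/brick has "
        "disconnected from glusterd.",
    ]
    print(last_state(sample, brick))
```

With the excerpts from this thread, the last event recorded for the brick
before the add-brick at 10:14:25 is the disconnect at 10:12:32.984331, which
matches the brick being shown offline in volume status.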
> 
> On Wed, May 4, 2016 at 5:53 PM, Atin Mukherjee <amukherj at redhat.com> wrote:
> 
>     Abhishek,
> 
>     See the response inline.
> 
> 
>     On 05/04/2016 05:43 PM, ABHISHEK PALIWAL wrote:
>     > Hi Atin,
>     >
>     > Please reply: is there any configurable timeout parameter for the
>     > brick process to go offline that we can increase?
>     >
>     > Regards,
>     > Abhishek
>     >
>     > On Thu, Apr 21, 2016 at 12:34 PM, ABHISHEK PALIWAL
>     > <abhishpaliwal at gmail.com> wrote:
>     >
>     >     Hi Atin,
>     >
>     >     Please answer the following doubts as well:
>     >
>     >     1. If there is a temporary glitch in the network, will that affect
>     >     the gluster brick process in any way? Is there any timeout for the
>     >     brick process to go offline in case of a glitch in the network?
>     If there is a disconnection, GlusterD will receive the event and mark
>     the brick as disconnected even if the brick process is online. So the
>     answer to this question is both yes and no. From the process
>     perspective the bricks are still up, but not to the other
>     components/layers, and that may impact operations (both mgmt & I/O,
>     given there is a disconnect between the client and brick processes too).
>     >
>     >     2. Is there any configurable timeout parameter which we can
>     >     increase?
>     I don't get this question. What timeout are you talking about?
>     >
>     >     3. Brick and glusterd are connected by a unix domain socket. It is
>     >     just a local socket, so why is it disconnected in the logs below?
>     This is not true; it's over a TCP socket.
>     >
>     >      1667 [2016-04-03 10:12:32.984331] I [MSGID: 106005]
>     >     [glusterd-handler.c:4908:__glusterd_brick_rpc_notify] 0-management:
>     >     Brick 10.32.1.144:/opt/lvmdir/c2/brick has disconnected from
>     >     glusterd.
>     >      1668 [2016-04-03 10:12:32.984366] D [MSGID: 0]
>     >     [glusterd-utils.c:4872:glusterd_set_brick_status] 0-glusterd: Setting
>     >     brick 10.32.1.144:/opt/lvmdir/c2/brick status to stopped
>     >
>     >     Regards,
>     >     Abhishek
>     >
>     >
>     >     On Tue, Apr 19, 2016 at 1:12 PM, ABHISHEK PALIWAL
>     >     <abhishpaliwal at gmail.com> wrote:
>     >
>     >         Hi Atin,
>     >
>     >         Thanks.
>     >
>     >         I have more doubts here.
>     >
>     >         Brick and glusterd are connected by a unix domain socket. It is
>     >         just a local socket, so why is it disconnected in the logs below?
>     >
>     >          1667 [2016-04-03 10:12:32.984331] I [MSGID: 106005]
>     >         [glusterd-handler.c:4908:__glusterd_brick_rpc_notify] 0-management:
>     >         Brick 10.32.1.144:/opt/lvmdir/c2/brick has disconnected from
>     >         glusterd.
>     >          1668 [2016-04-03 10:12:32.984366] D [MSGID: 0]
>     >         [glusterd-utils.c:4872:glusterd_set_brick_status] 0-glusterd:
>     >         Setting brick 10.32.1.144:/opt/lvmdir/c2/brick status to stopped
>     >
>     >
>     >         Regards,
>     >         Abhishek
>     >
>     >
>     >         On Fri, Apr 15, 2016 at 9:14 AM, Atin Mukherjee
>     >         <amukherj at redhat.com> wrote:
>     >
>     >
>     >
>     >             On 04/14/2016 04:07 PM, ABHISHEK PALIWAL wrote:
>     >             >
>     >             >
>     >             > On Thu, Apr 14, 2016 at 2:33 PM, Atin Mukherjee
>     >             > <amukherj at redhat.com> wrote:
>     >             >
>     >             >
>     >             >
>     >             >     On 04/05/2016 03:35 PM, ABHISHEK PALIWAL wrote:
>     >             >     >
>     >             >     >
>     >             >     > On Tue, Apr 5, 2016 at 2:22 PM, Atin Mukherjee
>     >             >     > <amukherj at redhat.com> wrote:
>     >             >     >
>     >             >     >
>     >             >     >
>     >             >     >     On 04/05/2016 01:04 PM, ABHISHEK PALIWAL wrote:
>     >             >     >     > Hi Team,
>     >             >     >     >
>     >             >     >     > We are using Gluster 3.7.6 and are facing a problem in
>     >             >     >     > which a brick does not come online after restarting the
>     >             >     >     > board.
>     >             >     >     >
>     >             >     >     > To understand our setup, please look at the following
>     >             >     >     > steps:
>     >             >     >     > 1. We have two boards, A and B, on which a Gluster volume
>     >             >     >     > is running in replicated mode, with one brick on each
>     >             >     >     > board.
>     >             >     >     > 2. The Gluster mount point is present on Board A and is
>     >             >     >     > shared between a number of processes.
>     >             >     >     > 3. Up to this point our volume is in sync and everything
>     >             >     >     > is working fine.
>     >             >     >     > 4. Now we have a test case in which we stop glusterd,
>     >             >     >     > reboot Board B and, when this board comes up, start
>     >             >     >     > glusterd again on it.
>     >             >     >     > 5. We repeated Step 4 multiple times to check the
>     >             >     >     > reliability of the system.
>     >             >     >     > 6. After Step 4, sometimes the system comes up in a
>     >             >     >     > working state (i.e. in sync), but sometimes the brick of
>     >             >     >     > Board B is present in the “gluster volume status” output
>     >             >     >     > but is not online even after waiting for more than a
>     >             >     >     > minute.
>     >             >     >     As I mentioned in another email thread, unless the log
>     >             >     >     shows evidence that there was a reboot, nothing can be
>     >             >     >     concluded. The last log you shared with us a few days
>     >             >     >     back didn't give any indication that the brick process
>     >             >     >     wasn't running.
>     >             >     >
>     >             >     > How can we identify from the brick logs that the brick
>     >             >     > process is running?
>     >             >     >
>     >             >     >     > 7. While Step 4 is executing, some processes on Board A
>     >             >     >     > start accessing files from the Gluster mount point.
>     >             >     >     >
>     >             >     >     > As a solution to bring this brick online, we found some
>     >             >     >     > existing threads on the gluster mailing list suggesting
>     >             >     >     > “gluster volume start <vol_name> force” to move the brick
>     >             >     >     > from 'offline' to 'online'.
>     >             >     >     >
>     >             >     >     > If we use the “gluster volume start <vol_name> force”
>     >             >     >     > command, it will kill the existing volume process and
>     >             >     >     > start a new process. What will happen if other processes
>     >             >     >     > are accessing the same volume at the time the volume
>     >             >     >     > process is killed by this command internally? Will it
>     >             >     >     > cause any failure in these processes?
>     >             >     >     This is not true; volume start force will start the brick
>     >             >     >     processes only if they are not running. Running brick
>     >             >     >     processes will not be interrupted.
>     >             >     >
>     >             >     > We have tried this and checked the pid of the process before
>     >             >     > and after the force start.
>     >             >     > The pid had changed after the force start.
>     >             >     >
>     >             >     > Please find the logs from the time of the failure attached
>     >             >     > once again, with log-level=debug.
>     >             >     >
>     >             >     > If you can point to the exact line where you are able to
>     >             >     > find out that the brick process is running in the brick log
>     >             >     > file, please give me the line number in that file.
>     >             >
>     >             >     Here is the sequence in which glusterd and the respective
>     >             >     brick process were restarted.
>     >             >
>     >             >     1. glusterd restart trigger - line number 1014 in the
>     >             >     glusterd.log file:
>     >             >
>     >             >     [2016-04-03 10:12:29.051735] I [MSGID: 100030]
>     >             >     [glusterfsd.c:2318:main] 0-/usr/sbin/glusterd: Started running
>     >             >     /usr/sbin/glusterd version 3.7.6 (args: /usr/sbin/glusterd -p
>     >             >     /var/run/glusterd.pid --log-level DEBUG)
>     >             >
>     >             >     2. brick start trigger - line number 190 in
>     >             >     opt-lvmdir-c2-brick.log:
>     >             >
>     >             >     [2016-04-03 10:14:25.268833] I [MSGID: 100030]
>     >             >     [glusterfsd.c:2318:main] 0-/usr/sbin/glusterfsd: Started running
>     >             >     /usr/sbin/glusterfsd version 3.7.6 (args: /usr/sbin/glusterfsd -s
>     >             >     10.32.1.144 --volfile-id
>     >             >     c_glusterfs.10.32.1.144.opt-lvmdir-c2-brick -p
>     >             >     /system/glusterd/vols/c_glusterfs/run/10.32.1.144-opt-lvmdir-c2-brick.pid
>     >             >     -S /var/run/gluster/697c0e4a16ebc734cd06fd9150723005.socket
>     >             >     --brick-name /opt/lvmdir/c2/brick -l
>     >             >     /var/log/glusterfs/bricks/opt-lvmdir-c2-brick.log --xlator-option
>     >             >     *-posix.glusterd-uuid=2d576ff8-0cea-4f75-9e34-a5674fbf7256
>     >             >     --brick-port 49329 --xlator-option
>     >             >     c_glusterfs-server.listen-port=49329)
>     >             >
>     >             >     3. The following log indicates that the brick is up and is now
>     >             >     started. Refer to line 16123 in glusterd.log:
>     >             >
>     >             >     [2016-04-03 10:14:25.336855] D [MSGID: 0]
>     >             >     [glusterd-handler.c:4897:__glusterd_brick_rpc_notify] 0-management:
>     >             >     Connected to 10.32.1.144:/opt/lvmdir/c2/brick
>     >             >
>     >             >     This clearly indicates that the brick is up and running, as after
>     >             >     that I do not see any disconnect event being processed by glusterd
>     >             >     for the brick process.
>     >             >
>     >             >
>     >             > Thanks for the detailed reply, but please also clear up some
>     >             > more doubts:
>     >             >
>     >             > 1. At 10:14:25 the brick is available because we had removed
>     >             > the brick and added it again to bring it online. The following
>     >             > logs are from the cmd-history.log file of 000300:
>     >             >
>     >             > [2016-04-03 10:14:21.446570]  : volume status : SUCCESS
>     >             > [2016-04-03 10:14:21.665889]  : volume remove-brick c_glusterfs
>     >             > replica 1 10.32.1.144:/opt/lvmdir/c2/brick force : SUCCESS
>     >             > [2016-04-03 10:14:21.764270]  : peer detach 10.32.1.144 : SUCCESS
>     >             > [2016-04-03 10:14:23.060442]  : peer probe 10.32.1.144 : SUCCESS
>     >             > [2016-04-03 10:14:25.649525]  : volume add-brick c_glusterfs
>     >             > replica 2 10.32.1.144:/opt/lvmdir/c2/brick force : SUCCESS
>     >             >
>     >             > Also, 10:12:29 was the last reboot time before this failure, so
>     >             > I totally agree with what you said earlier.
>     >             >
>     >             > 2. As you said, glusterd restarted at 10:12:29. Then why are we
>     >             > not getting 'brick start trigger' logs like the one below
>     >             > between 10:12:29 and 10:14:25, an interval of about two
>     >             > minutes?
>     >             So here is the culprit:
>     >
>     >              1667 [2016-04-03 10:12:32.984331] I [MSGID: 106005]
>     >             [glusterd-handler.c:4908:__glusterd_brick_rpc_notify] 0-management:
>     >             Brick 10.32.1.144:/opt/lvmdir/c2/brick has disconnected from
>     >             glusterd.
>     >              1668 [2016-04-03 10:12:32.984366] D [MSGID: 0]
>     >             [glusterd-utils.c:4872:glusterd_set_brick_status] 0-glusterd:
>     >             Setting brick 10.32.1.144:/opt/lvmdir/c2/brick status to stopped
>     >
>     >
>     >             GlusterD received a disconnect event for this brick process and
>     >             marked it as stopped. This can happen for two reasons: 1. the
>     >             brick process goes down, or 2. a network issue. In this case I
>     >             believe it is the latter, since the brick process was running at
>     >             that time. I'd request you to check this from the N/W side.
>     >
>     >
>     >             >
>     >             > [2016-04-03 10:14:25.268833] I [MSGID: 100030]
>     >             > [glusterfsd.c:2318:main] 0-/usr/sbin/glusterfsd: Started running
>     >             > /usr/sbin/glusterfsd version 3.7.6 (args: /usr/sbin/glusterfsd -s
>     >             > 10.32.1.144 --volfile-id
>     >             > c_glusterfs.10.32.1.144.opt-lvmdir-c2-brick -p
>     >             > /system/glusterd/vols/c_glusterfs/run/10.32.1.144-opt-lvmdir-c2-brick.pid
>     >             > -S /var/run/gluster/697c0e4a16ebc734cd06fd9150723005.socket
>     >             > --brick-name /opt/lvmdir/c2/brick -l
>     >             > /var/log/glusterfs/bricks/opt-lvmdir-c2-brick.log --xlator-option
>     >             > *-posix.glusterd-uuid=2d576ff8-0cea-4f75-9e34-a5674fbf7256
>     >             > --brick-port 49329 --xlator-option
>     >             > c_glusterfs-server.listen-port=49329)
>     >             >
>     >             > 3. We are continuously checking the brick status during the above
>     >             > period using "gluster volume status"; refer to the
>     >             > cmd-history.log file from 000300.
>     >             >
>     >             > In the glusterd.log file we are also getting the logs below:
>     >             >
>     >             > [2016-04-03 10:12:31.771051] D [MSGID: 0]
>     >             > [glusterd-handler.c:4897:__glusterd_brick_rpc_notify] 0-management:
>     >             > Connected to 10.32.1.144:/opt/lvmdir/c2/brick
>     >             >
>     >             > [2016-04-03 10:12:32.981152] D [MSGID: 0]
>     >             > [glusterd-handler.c:4897:__glusterd_brick_rpc_notify] 0-management:
>     >             > Connected to 10.32.1.144:/opt/lvmdir/c2/brick
>     >             >
>     >             > twice between 10:12:29 and 10:14:25. As you said, these logs
>     >             > "clearly indicate that the brick is up and running", so why is
>     >             > the brick not online in the "gluster volume status" output?
>     >             >
>     >             > [2016-04-03 10:12:33.990487]  : volume status : SUCCESS
>     >             > [2016-04-03 10:12:34.007469]  : volume status : SUCCESS
>     >             > [2016-04-03 10:12:35.095918]  : volume status : SUCCESS
>     >             > [2016-04-03 10:12:35.126369]  : volume status : SUCCESS
>     >             > [2016-04-03 10:12:36.224018]  : volume status : SUCCESS
>     >             > [2016-04-03 10:12:36.251032]  : volume status : SUCCESS
>     >             > [2016-04-03 10:12:37.352377]  : volume status : SUCCESS
>     >             > [2016-04-03 10:12:37.374028]  : volume status : SUCCESS
>     >             > [2016-04-03 10:12:38.446148]  : volume status : SUCCESS
>     >             > [2016-04-03 10:12:38.468860]  : volume status : SUCCESS
>     >             > [2016-04-03 10:12:39.534017]  : volume status : SUCCESS
>     >             > [2016-04-03 10:12:39.553711]  : volume status : SUCCESS
>     >             > [2016-04-03 10:12:40.616610]  : volume status : SUCCESS
>     >             > [2016-04-03 10:12:40.636354]  : volume status : SUCCESS
>     >             > ......
>     >             > ......
>     >             > ......
>     >             > [2016-04-03 10:14:21.446570]  : volume status : SUCCESS
>     >             > [2016-04-03 10:14:21.665889]  : volume remove-brick c_glusterfs
>     >             > replica 1 10.32.1.144:/opt/lvmdir/c2/brick force : SUCCESS
>     >             > [2016-04-03 10:14:21.764270]  : peer detach 10.32.1.144 : SUCCESS
>     >             > [2016-04-03 10:14:23.060442]  : peer probe 10.32.1.144 : SUCCESS
>     >             > [2016-04-03 10:14:25.649525]  : volume add-brick c_glusterfs
>     >             > replica 2 10.32.1.144:/opt/lvmdir/c2/brick force : SUCCESS
>     >             >
>     >             > In the above logs we are continuously checking the brick status,
>     >             > but when we did not find the brick status 'online' even after
>     >             > ~2 minutes, we removed it and added it again to bring it online.
>     >             >
>     >             > [2016-04-03 10:14:21.665889]  : volume remove-brick c_glusterfs
>     >             > replica 1 10.32.1.144:/opt/lvmdir/c2/brick force : SUCCESS
>     >             > [2016-04-03 10:14:21.764270]  : peer detach 10.32.1.144 : SUCCESS
>     >             > [2016-04-03 10:14:23.060442]  : peer probe 10.32.1.144 : SUCCESS
>     >             > [2016-04-03 10:14:25.649525]  : volume add-brick c_glusterfs
>     >             > replica 2 10.32.1.144:/opt/lvmdir/c2/brick force : SUCCESS
>     >             >
>     >             > That is why we are getting the "brick start trigger" logs at
>     >             > timestamp 10:14:25:
>     >             >
>     >             > [2016-04-03 10:14:25.268833] I [MSGID: 100030]
>     >             > [glusterfsd.c:2318:main] 0-/usr/sbin/glusterfsd: Started running
>     >             > /usr/sbin/glusterfsd version 3.7.6 (args: /usr/sbin/glusterfsd -s
>     >             > 10.32.1.144 --volfile-id
>     >             > c_glusterfs.10.32.1.144.opt-lvmdir-c2-brick -p
>     >             > /system/glusterd/vols/c_glusterfs/run/10.32.1.144-opt-lvmdir-c2-brick.pid
>     >             > -S /var/run/gluster/697c0e4a16ebc734cd06fd9150723005.socket
>     >             > --brick-name /opt/lvmdir/c2/brick -l
>     >             > /var/log/glusterfs/bricks/opt-lvmdir-c2-brick.log --xlator-option
>     >             > *-posix.glusterd-uuid=2d576ff8-0cea-4f75-9e34-a5674fbf7256
>     >             > --brick-port 49329 --xlator-option
>     >             > c_glusterfs-server.listen-port=49329)
>     >             >
>     >             >
>     >             > Regards,
>     >             > Abhishek
>     >             >
>     >             >
>     >             >     Please note that all the logs referred to and pasted are
>     >             >     from 002500.
>     >             >
>     >             >     ~Atin
>     >             >     >
>     >             >     > 002500 - Board B, whose brick is offline
>     >             >     > 000300 - Board A logs
>     >             >     >
>     >             >     >     >
>     >             >     >     > *Question: What could be contributing to the brick going
>     >             >     >     > offline?*
>     >             >     >     >
>     >             >     >     >
>     >             >     >     > --
>     >             >     >     >
>     >             >     >     > Regards
>     >             >     >     > Abhishek Paliwal
>     >             >     >     >
>     >             >     >     >
>     >             >     >     >
>     >             >     >     > _______________________________________________
>     >             >     >     > Gluster-devel mailing list
>     >             >     >     > Gluster-devel at gluster.org
>     >             >     >     > http://www.gluster.org/mailman/listinfo/gluster-devel
>     >             >     >     >
>     >             >     >
>     >             >     >
>     >             >     >
>     >             >     >
>     >             >
>     >             >
>     >             >
>     >             >
>     >             > --
>     >             >
>     >             >
>     >             >
>     >             >
>     >
>     >
>     >
>     >
>     >
>     >
>     > --
>     >
>     >
>     >
>     >
>     > Regards
>     > Abhishek Paliwal
> 
> 
> 
> 
> -- 
> 
> 
> 
> 
> Regards
> Abhishek Paliwal
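
On the earlier question in this thread of how to tell whether the brick
process itself is running: independent of what GlusterD reports, the pid file
passed to glusterfsd with -p (visible in the brick start log line quoted
above) can be checked directly. A minimal Python sketch, assuming a Linux
host; read_pid and process_alive are illustrative names, not GlusterFS tools,
and the pid file path has to be taken from your own brick log.

```python
import os
import sys


def read_pid(pid_file):
    """Return the pid stored in pid_file, or None if it is missing/invalid."""
    try:
        with open(pid_file) as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        return None


def process_alive(pid):
    """Return True if a process with this pid currently exists."""
    if pid is None or pid <= 0:
        return False
    try:
        # Signal 0 performs the existence/permission check only;
        # nothing is delivered to the process.
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        # The process exists but belongs to another user.
        return True
    return True


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python check_brick.py <path to the brick pid file>
    pid = read_pid(sys.argv[1])
    state = "running" if process_alive(pid) else "not running"
    print("pid", pid, state)
```

If this reports the process as running while volume status shows the brick
offline, that is consistent with the disconnect case described earlier in the
thread rather than with the brick process having gone down.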

