[Bugs] [Bug 1323564] New: [scale] Brick process does not start after node reboot

bugzilla at redhat.com bugzilla at redhat.com
Mon Apr 4 05:57:04 UTC 2016


https://bugzilla.redhat.com/show_bug.cgi?id=1323564

            Bug ID: 1323564
           Summary: [scale] Brick process does not start after node reboot
           Product: GlusterFS
           Version: 3.7.10
         Component: glusterd
          Assignee: bugs at gluster.org
          Reporter: amukherj at redhat.com
                CC: bugs at gluster.org, nerawat at redhat.com,
                    sasundar at redhat.com, storage-qa-internal at redhat.com
        Depends On: 1322306, 1322805



+++ This bug was initially created as a clone of Bug #1322805 +++

+++ This bug was initially created as a clone of Bug #1322306 +++

Description of problem:
Brick processes do not start automatically after a node reboot.

Starting the bricks with 'start force' also failed at first; it only worked
after running the command 2-3 times.
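
For reference, the workaround was along these lines (a minimal sketch; vol4
stands in for any affected volume):

    # Force-start the affected volume; in this case it had to be
    # repeated 2-3 times before the brick process stayed up
    gluster volume start vol4 force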

Setup:

A 4-node cluster running 120 distributed-replicate (2 x 2) volumes.

I have seen this issue with only 3 volumes: vol4, vol31, and vol89, even
though other volumes also have bricks running on the same node.

Version-Release number of selected component (if applicable):
3.7.5-19

How reproducible:
Not always; the issue did not appear until the cluster had more than 100
volumes.

Steps to Reproduce:
1. Reboot a Gluster node
2. Check the status of the gluster volumes
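
For step 2, the brick state can be checked as below (a sketch; offline bricks
show 'N' in the Online column and no PID):

    # List bricks, ports, online state and PIDs for all volumes
    gluster volume status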

Actual results:
Brick process is not running

Expected results:
Brick process should come up after node reboot

Additional info:
Will attach sosreport and setup details

--- Additional comment from Red Hat Bugzilla Rules Engine on 2016-03-30
06:38:39 EDT ---

This bug is automatically being proposed for the current z-stream release of
Red Hat Gluster Storage 3 by setting the release flag 'rhgs-3.1.z' to '?'.

If this bug should be proposed for a different release, please manually change
the proposed release flag.

--- Additional comment from SATHEESARAN on 2016-03-30 08:14:52 EDT ---

Neha,

Could you attach sosreports from both the nodes for analysis?

--- Additional comment from Neha on 2016-03-31 01:00:51 EDT ---

Atin has already checked the setup. After the reboot, another process is
consuming the port that was assigned to the brick process.
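
One way to confirm this is to check which process currently owns the port the
brick tried to bind; a sketch, using port 49161 from the brick log below:

    # Show the listener bound to the brick's persisted port
    ss -tlnp | grep 49161

    # netstat works as well on older systems
    netstat -tlnp | grep 49161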

Brick logs:

[2016-03-30 12:21:33.014717] I [MSGID: 100030] [glusterfsd.c:2318:main]
0-/usr/sbin/glusterfsd: Started running /usr/sbin/glusterfsd version 3.7.5
(args: /usr/sbin/glusterfsd -s 10.70.36.3 --volfile-id
vol6.10.70.36.3.var-lib-heketi-mounts-vg_1d43e80bda3b78bff5f2bd4e788c5bb3-brick_24db93a1ba7c2021d30db0bb7523f1d4-brick
-p
/var/lib/glusterd/vols/vol6/run/10.70.36.3-var-lib-heketi-mounts-vg_1d43e80bda3b78bff5f2bd4e788c5bb3-brick_24db93a1ba7c2021d30db0bb7523f1d4-brick.pid
-S /var/run/gluster/4d936da2aa9f690e51cb86f5cf49740a.socket --brick-name
/var/lib/heketi/mounts/vg_1d43e80bda3b78bff5f2bd4e788c5bb3/brick_24db93a1ba7c2021d30db0bb7523f1d4/brick
-l
/var/log/glusterfs/bricks/var-lib-heketi-mounts-vg_1d43e80bda3b78bff5f2bd4e788c5bb3-brick_24db93a1ba7c2021d30db0bb7523f1d4-brick.log
--xlator-option *-posix.glusterd-uuid=b5b78ebd-94f4-4a96-a9ba-6621e730a411
--brick-port 49161 --xlator-option vol6-server.listen-port=49161)
[2016-03-30 12:21:33.022880] I [MSGID: 101190]
[event-epoll.c:632:event_dispatch_epoll_worker] 0-epoll: Started thread with
index 1
[2016-03-30 12:21:48.059423] I [graph.c:269:gf_add_cmdline_options]
0-vol6-server: adding option 'listen-port' for volume 'vol6-server' with value
'49161'
[2016-03-30 12:21:48.059447] I [graph.c:269:gf_add_cmdline_options]
0-vol6-posix: adding option 'glusterd-uuid' for volume 'vol6-posix' with value
'b5b78ebd-94f4-4a96-a9ba-6621e730a411'
[2016-03-30 12:21:48.059643] I [MSGID: 101190]
[event-epoll.c:632:event_dispatch_epoll_worker] 0-epoll: Started thread with
index 2
[2016-03-30 12:21:48.059666] I [MSGID: 115034]
[server.c:403:_check_for_auth_option]
0-/var/lib/heketi/mounts/vg_1d43e80bda3b78bff5f2bd4e788c5bb3/brick_24db93a1ba7c2021d30db0bb7523f1d4/brick:
skip format check for non-addr auth option
auth.login./var/lib/heketi/mounts/vg_1d43e80bda3b78bff5f2bd4e788c5bb3/brick_24db93a1ba7c2021d30db0bb7523f1d4/brick.allow
[2016-03-30 12:21:48.059687] I [MSGID: 115034]
[server.c:403:_check_for_auth_option]
0-/var/lib/heketi/mounts/vg_1d43e80bda3b78bff5f2bd4e788c5bb3/brick_24db93a1ba7c2021d30db0bb7523f1d4/brick:
skip format check for non-addr auth option
auth.login.11c450ad-efc3-49a8-952f-68f8b37eb539.password
[2016-03-30 12:21:48.060695] I [rpcsvc.c:2215:rpcsvc_set_outstanding_rpc_limit]
0-rpc-service: Configured rpc.outstanding-rpc-limit with value 64
[2016-03-30 12:21:48.060762] W [MSGID: 101002] [options.c:957:xl_opt_validate]
0-vol6-server: option 'listen-port' is deprecated, preferred is
'transport.socket.listen-port', continuing with correction
[2016-03-30 12:21:48.060850] E [socket.c:769:__socket_server_bind]
0-tcp.vol6-server: binding to  failed: Address already in use
[2016-03-30 12:21:48.060861] E [socket.c:772:__socket_server_bind]
0-tcp.vol6-server: Port is already in use
[2016-03-30 12:21:48.060871] W [rpcsvc.c:1604:rpcsvc_transport_create]
0-rpc-service: listening on transport failed
[2016-03-30 12:21:48.060877] W [MSGID: 115045] [server.c:1060:init]
0-vol6-server: creation of listener failed
[2016-03-30 12:21:48.060892] E [MSGID: 101019] [xlator.c:428:xlator_init]
0-vol6-server: Initialization of volume 'vol6-server' failed, review your
volfile again
[2016-03-30 12:21:48.060898] E [graph.c:322:glusterfs_graph_init]
0-vol6-server: initializing translator failed
[2016-03-30 12:21:48.060902] E [graph.c:661:glusterfs_graph_activate] 0-graph:
init failed
[2016-03-30 12:21:48.061387] W [glusterfsd.c:1236:cleanup_and_exit]
(-->/usr/sbin/glusterfsd(mgmt_getspec_cbk+0x331) [0x7f07cc09e801]
-->/usr/sbin/glusterfsd(glusterfs_process_volfp+0x126) [0x7f07cc0991a6]
-->/usr/sbin/glusterfsd(cleanup_and_exit+0x69) [0x7f07cc098789] ) 0-: received
signum (0), shutting down

--- Additional comment from Vijay Bellur on 2016-03-31 07:15:11 EDT ---

REVIEW: http://review.gluster.org/13865 (glusterd: Do not persist brickinfo
ports) posted (#2) for review on master by Atin Mukherjee (amukherj at redhat.com)

--- Additional comment from Vijay Bellur on 2016-03-31 08:56:58 EDT ---

REVIEW: http://review.gluster.org/13865 (glusterd: Allocate fresh port on brick
(re)start) posted (#3) for review on master by Atin Mukherjee
(amukherj at redhat.com)

--- Additional comment from Vijay Bellur on 2016-04-01 00:33:04 EDT ---

REVIEW: http://review.gluster.org/13865 (glusterd: Allocate fresh port on brick
(re)start) posted (#4) for review on master by Atin Mukherjee
(amukherj at redhat.com)

--- Additional comment from Vijay Bellur on 2016-04-01 02:37:02 EDT ---

REVIEW: http://review.gluster.org/13865 (glusterd: Allocate fresh port on brick
(re)start) posted (#5) for review on master by Atin Mukherjee
(amukherj at redhat.com)

--- Additional comment from Vijay Bellur on 2016-04-01 03:05:09 EDT ---

REVIEW: http://review.gluster.org/13865 (glusterd: Allocate fresh port on brick
(re)start) posted (#6) for review on master by Atin Mukherjee
(amukherj at redhat.com)

--- Additional comment from Vijay Bellur on 2016-04-01 16:39:01 EDT ---

COMMIT: http://review.gluster.org/13865 committed in master by Jeff Darcy
(jdarcy at redhat.com) 
------
commit 34899d71f21fd2b4c523b68ffb2d7c655c776641
Author: Atin Mukherjee <amukherj at redhat.com>
Date:   Thu Mar 31 11:01:53 2016 +0530

    glusterd: Allocate fresh port on brick (re)start

    There is no point in using the same port through the entire volume life
    cycle for a particular brick process, since there is no guarantee that the
    same port will still be free and that no other application will have
    consumed it in between a glusterd/volume restart.

    We hit a race where, on glusterd restart, the daemon services start
    followed by the brick processes, and by the time a brick process tries to
    bind to the port that glusterd allocated before the restart, that port has
    already been consumed by some other client like NFS/SHD/...

    Note: This is a short-term solution; it reduces the race window but does
    not eliminate it completely. As a long-term solution, the port allocation
    has to be done by glusterfsd and communicated back to glusterd for
    bookkeeping.

    Change-Id: Ibbd1e7ca87e51a7cd9cf216b1fe58ef7783aef24
    BUG: 1322805
    Signed-off-by: Atin Mukherjee <amukherj at redhat.com>
    Reviewed-on: http://review.gluster.org/13865
    Smoke: Gluster Build System <jenkins at build.gluster.com>
    NetBSD-regression: NetBSD Build System <jenkins at build.gluster.org>
    CentOS-regression: Gluster Build System <jenkins at build.gluster.com>
    Reviewed-by: Jeff Darcy <jdarcy at redhat.com>
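
With this change, a brick is no longer tied to the port recorded before the
restart, so the port can differ across restarts. A quick way to observe this
(a sketch, not part of the patch; volume name and address are taken from the
log above, actual port numbers will vary):

    # Note the brick's port before rebooting the node
    gluster volume status vol6 | grep 10.70.36.3

    # After the reboot, once the brick is back online, the Port column
    # may show a different, freshly allocated value
    gluster volume status vol6 | grep 10.70.36.3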


Referenced Bugs:

https://bugzilla.redhat.com/show_bug.cgi?id=1322306
[Bug 1322306] [scale] Brick process does not start after node reboot
https://bugzilla.redhat.com/show_bug.cgi?id=1322805
[Bug 1322805] [scale] Brick process does not start after node reboot
-- 
You are receiving this mail because:
You are on the CC list for the bug.
You are the assignee for the bug.

