[Bugs] [Bug 1331934] New: glusterd restart is failing if volume brick is down due to underlying FS crash.

bugzilla at redhat.com bugzilla at redhat.com
Sat Apr 30 06:42:43 UTC 2016


https://bugzilla.redhat.com/show_bug.cgi?id=1331934

            Bug ID: 1331934
           Summary: glusterd restart is failing if volume brick is down
                    due to underlying FS crash.
           Product: GlusterFS
           Version: 3.7.11
         Component: glusterd
          Keywords: Triaged
          Severity: high
          Assignee: bugs at gluster.org
          Reporter: amukherj at redhat.com
                CC: bsrirama at redhat.com, bugs at gluster.org
        Depends On: 1330385, 1330481



+++ This bug was initially created as a clone of Bug #1330481 +++

+++ This bug was initially created as a clone of Bug #1330385 +++

Description of problem:
=======================
glusterd restart is failing if a volume brick is down due to an underlying
filesystem (XFS) crash.


Version-Release number of selected component (if applicable):
============================================================
mainline


How reproducible:
=================
Always


Steps to Reproduce:
===================
1. Have a one- or two-node cluster.
2. Create a 1x2 volume and start it.
3. Crash the underlying filesystem of one of the volume's bricks using the
"godown" tool OR any other method (see the sketch below).
4. Check that the brick is down using "gluster volume status".
5. Try restarting glusterd.  //the restart will fail.
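
For reference, the xfstests "godown" tool mentioned in step 3 forces an XFS
shutdown through an ioctl on the mounted filesystem, after which all further
I/O on that filesystem fails with EIO. Below is a minimal sketch of that
approach, not the actual tool: it assumes an XFS brick mount, the ioctl number
and flag are intended to match XFS_IOC_GOINGDOWN from xfs_fs.h (verify against
your kernel headers), and the usage is only an example (run as root).

/* Sketch of a godown-style XFS shutdown: after the ioctl succeeds, further
 * I/O on the filesystem fails with EIO, simulating an FS crash for the brick. */
#include <stdio.h>
#include <stdint.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>

#ifndef XFS_IOC_GOINGDOWN
#define XFS_IOC_GOINGDOWN _IOR('X', 125, uint32_t)  /* as defined in xfs_fs.h */
#endif
#define XFS_FSOP_GOING_FLAGS_DEFAULT 0x0            /* flush the log, then shut down */

int main(int argc, char *argv[])
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <xfs-mount-point>\n", argv[0]);
        return 1;
    }

    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    uint32_t flags = XFS_FSOP_GOING_FLAGS_DEFAULT;
    if (ioctl(fd, XFS_IOC_GOINGDOWN, &flags) < 0) {
        perror("ioctl(XFS_IOC_GOINGDOWN)");
        close(fd);
        return 1;
    }

    close(fd);
    return 0;
}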

Actual results:
===============
glusterd restart fails if a volume brick is down due to an underlying FS crash.


Expected results:
=================
glusterd restart should succeed even when a brick's underlying filesystem has
crashed.


Additional info:

--- Additional comment from Red Hat Bugzilla Rules Engine on 2016-04-26
01:57:25 EDT ---

This bug is automatically being proposed for the current z-stream release of
Red Hat Gluster Storage 3 by setting the release flag 'rhgs-3.1.z' to '?'.

If this bug should be proposed for a different release, please manually change
the proposed release flag.

--- Additional comment from Byreddy on 2016-04-26 02:13:45 EDT ---

Additional info:
================

[root at dhcp42-82 ~]# gluster volume status
Status of volume: Dis
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.70.42.82:/bricks/brick2/br2        49155     0          Y       23291
Brick 10.70.42.82:/bricks/brick1/br1        N/A       N/A        N       N/A  
Brick 10.70.42.82:/bricks/brick2/br3        49154     0          Y       23329
NFS Server on localhost                     2049      0          Y       23354
NFS Server on dhcp43-136.lab.eng.blr.redhat
.com                                        2049      0          Y       8049 

Task Status of Volume Dis
------------------------------------------------------------------------------
There are no active volume tasks

[root at dhcp42-82 ~]# 
[root at dhcp42-82 ~]# systemctl restart glusterd
Job for glusterd.service failed because the control process exited with error
code. See "systemctl status glusterd.service" and "journalctl -xe" for details.
[root at dhcp42-82 ~]# 


glusterd logs:
=============


pid --log-level INFO)
[2016-04-26 06:08:47.439960] I [MSGID: 106478] [glusterd.c:1337:init]
0-management: Maximum allowed open file descriptors set to 65536
[2016-04-26 06:08:47.440044] I [MSGID: 106479] [glusterd.c:1386:init]
0-management: Using /var/lib/glusterd as working directory
[2016-04-26 06:08:47.453605] W [MSGID: 103071]
[rdma.c:4594:__gf_rdma_ctx_create] 0-rpc-transport/rdma: rdma_cm event channel
creation failed [No such device]
[2016-04-26 06:08:47.453658] W [MSGID: 103055] [rdma.c:4901:init]
0-rdma.management: Failed to initialize IB Device
[2016-04-26 06:08:47.453677] W [rpc-transport.c:359:rpc_transport_load]
0-rpc-transport: 'rdma' initialization failed
[2016-04-26 06:08:47.453885] W [rpcsvc.c:1597:rpcsvc_transport_create]
0-rpc-service: cannot create listener, initing the transport failed
[2016-04-26 06:08:47.453924] E [MSGID: 106243] [glusterd.c:1610:init]
0-management: creation of 1 listeners failed, continuing with succeeded
transport
[2016-04-26 06:08:52.512606] I [MSGID: 106513]
[glusterd-store.c:2065:glusterd_restore_op_version] 0-glusterd: retrieved
op-version: 30712
[2016-04-26 06:08:53.671078] I [MSGID: 106544]
[glusterd.c:159:glusterd_uuid_init] 0-management: retrieved UUID:
eac322e5-ef82-47db-b88b-2449c0164482
[2016-04-26 06:08:53.671466] C [MSGID: 106425]
[glusterd-store.c:2434:glusterd_store_retrieve_bricks] 0-management: realpath()
failed for brick /bricks/brick1/br1. The underlying file system may be in bad
state [Input/output error]
[2016-04-26 06:08:53.671847] E [MSGID: 106201]
[glusterd-store.c:3092:glusterd_store_retrieve_volumes] 0-management: Unable to
restore volume: Dis
[2016-04-26 06:08:53.671888] E [MSGID: 101019] [xlator.c:433:xlator_init]
0-management: Initialization of volume 'management' failed, review your volfile
again
[2016-04-26 06:08:53.671900] E [graph.c:322:glusterfs_graph_init] 0-management:
initializing translator failed
[2016-04-26 06:08:53.671907] E [graph.c:661:glusterfs_graph_activate] 0-graph:
init failed
[2016-04-26 06:08:53.672475] W [glusterfsd.c:1251:cleanup_and_exit]
(-->/usr/sbin/glusterd(glusterfs_volumes_init+0xfd) [0x7fe6e9e2b2ad]
-->/usr/sbin/glusterd(glusterfs_process_volfp+0x120) [0x7fe6e9e2b150]
-->/usr/sbin/glusterd(cleanup_and_exit+0x69) [0x7fe6e9e2a739] ) 0-: received
signum (0), shutting down
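
The critical log line above shows where startup stops: during store restore,
glusterd_store_retrieve_bricks() calls realpath() on the brick directory, the
call fails with EIO because the underlying XFS has been shut down, and the
error propagates up through xlator_init() of the management volume, so
glusterd exits. A minimal standalone check of just that realpath() behaviour
is sketched below; the brick path is an example and must point into a mount
whose filesystem has been shut down to reproduce the error.

/* Minimal check of the failure mode seen in the log above: on a crashed or
 * shut-down filesystem, realpath() on the brick directory fails with EIO,
 * matching "realpath() failed for brick ... [Input/output error]". */
#include <errno.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    const char *brick_path = "/bricks/brick1/br1";  /* example brick path */
    char resolved[PATH_MAX];

    if (realpath(brick_path, resolved) == NULL) {
        fprintf(stderr, "realpath(%s) failed: %s\n",
                brick_path, strerror(errno));
        return 1;
    }

    printf("resolved: %s\n", resolved);
    return 0;
}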

--- Additional comment from Vijay Bellur on 2016-04-26 06:26:29 EDT ---

REVIEW: http://review.gluster.org/14075 (glusterd: glusterd should restart on a
underlying file system crash) posted (#1) for review on master by Atin
Mukherjee (amukherj at redhat.com)

--- Additional comment from Vijay Bellur on 2016-04-26 13:27:58 EDT ---

REVIEW: http://review.gluster.org/14075 (glusterd: glusterd should restart on a
underlying file system crash) posted (#2) for review on master by Atin
Mukherjee (amukherj at redhat.com)

--- Additional comment from Vijay Bellur on 2016-04-29 02:47:44 EDT ---

REVIEW: http://review.gluster.org/14075 (glusterd: persist
brickinfo->real_path) posted (#3) for review on master by Atin Mukherjee
(amukherj at redhat.com)

--- Additional comment from Vijay Bellur on 2016-04-29 12:17:26 EDT ---

COMMIT: http://review.gluster.org/14075 committed in master by Jeff Darcy
(jdarcy at redhat.com) 
------
commit f0fb05d2cefae08c143f2bfdef151084f5ddb498
Author: Atin Mukherjee <amukherj at redhat.com>
Date:   Tue Apr 26 15:27:43 2016 +0530

    glusterd: persist brickinfo->real_path

    Since real_path was not persisted and gets constructed at every glusterd
    restart, glusterd will fail to come up if one of the brick's underlying
    file system is crashed.

    Solution is to construct real_path only once and get it persisted.

    Change-Id: I97abc30372c1ffbbb2d43b716d7af09172147b47
    BUG: 1330481
    Signed-off-by: Atin Mukherjee <amukherj at redhat.com>
    Reviewed-on: http://review.gluster.org/14075
    CentOS-regression: Gluster Build System <jenkins at build.gluster.com>
    Smoke: Gluster Build System <jenkins at build.gluster.com>
    Reviewed-by: Kaushal M <kaushal at redhat.com>
    NetBSD-regression: NetBSD Build System <jenkins at build.gluster.org>
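
The commit message only describes the fix at a high level. The sketch below
illustrates the "resolve once, persist, reuse" pattern it describes; the
struct and function names (brick_info_t, brick_create, brick_restore) are
hypothetical stand-ins for the actual glusterd_brickinfo_t handling in
glusterd-store.c, not the real implementation.

/* Hypothetical sketch of the pattern the fix describes: resolve the brick's
 * real path once at creation time (when the filesystem is healthy), persist
 * it, and reuse the stored value on restore instead of calling realpath()
 * again at every glusterd restart. */
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>

typedef struct {
    char path[PATH_MAX];       /* brick path as configured */
    char real_path[PATH_MAX];  /* resolved once at creation, then persisted */
} brick_info_t;

/* At brick creation time the filesystem is expected to be healthy, so
 * resolving (and persisting) the real path here is safe. */
int brick_create(brick_info_t *b, const char *path)
{
    snprintf(b->path, sizeof(b->path), "%s", path);
    if (realpath(path, b->real_path) == NULL)
        return -1;   /* fatal only at creation time */
    /* ... persist both path and real_path in the on-disk brick store ... */
    return 0;
}

/* At restore time (glusterd restart) the persisted value is used as-is, so a
 * crashed underlying filesystem no longer prevents glusterd from starting. */
int brick_restore(brick_info_t *b, const char *stored_path,
                  const char *stored_real_path)
{
    snprintf(b->path, sizeof(b->path), "%s", stored_path);
    snprintf(b->real_path, sizeof(b->real_path), "%s", stored_real_path);
    return 0;
}

int main(void)
{
    brick_info_t b;
    if (brick_create(&b, "/bricks/brick1/br1") != 0)   /* example path */
        perror("realpath");
    return 0;
}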


Referenced Bugs:

https://bugzilla.redhat.com/show_bug.cgi?id=1330385
[Bug 1330385] glusterd restart is failing if volume brick is down due to
underlying FS crash.
https://bugzilla.redhat.com/show_bug.cgi?id=1330481
[Bug 1330481] glusterd restart is failing if volume brick is down due to
underlying FS crash.