[Bugs] [Bug 1231646] New: [glusterd] glusterd crashed while trying to remove a bricks - one selected from each replica set - after shrinking nX3 to nX2 to nX1

Mon Jun 15 07:18:07 UTC 2015

https://bugzilla.redhat.com/show_bug.cgi?id=1231646

            Bug ID: 1231646
           Summary: [glusterd] glusterd crashed while trying to remove a
                    bricks - one selected from each replica set - after
                    shrinking nX3 to nX2 to nX1
           Product: GlusterFS
           Version: 3.7.0
         Component: glusterd
          Keywords: Triaged
          Severity: high
          Assignee: bugs at gluster.org
          Reporter: ggarg at redhat.com
                CC: bugs at gluster.org, ggarg at redhat.com,
                    gluster-bugs at redhat.com, sasundar at redhat.com
        Depends On: 1230101, 1230121

+++ This bug was initially created as a clone of Bug #1230121 +++

+++ This bug was initially created as a clone of Bug #1230101 +++

Description of problem:
=======================

While running remove-brick with replica count 2 on an existing replica 2
volume, glusterd crashes with the following backtrace:

#0  0x00007fcdd03e681c in subvol_matcher_update (req=0x25989cc) at glusterd-brick-ops.c:662
#1  __glusterd_handle_remove_brick (req=0x25989cc) at glusterd-brick-ops.c:985
#2  0x00007fcdd03542bf in glusterd_big_locked_handler (req=0x25989cc, actor_fn=0x7fcdd03e5f90 <__glusterd_handle_remove_brick>) at glusterd-handler.c:83
#3  0x0000003b0d8655b2 in synctask_wrap (old_task=<value optimized out>) at syncop.c:375
#4  0x0000003b028438f0 in ?? () from /lib64/libc.so.6
#5  0x0000000000000000 in ?? ()
(gdb) 

Logs suggest:
=============

[2015-06-10 14:18:01.134630] I [glusterd-handler.c:1404:__glusterd_handle_cli_get_volume] 0-glusterd: Received get vol req
[2015-06-10 14:18:01.137158] I [glusterd-handler.c:1404:__glusterd_handle_cli_get_volume] 0-glusterd: Received get vol req
[2015-06-10 14:18:28.239515] I [glusterd-brick-ops.c:779:__glusterd_handle_remove_brick] 0-management: Received rem brick req
[2015-06-10 14:18:28.239593] I [glusterd-brick-ops.c:849:__glusterd_handle_remove_brick] 0-management: request to change replica-count to 2
pending frames:
frame : type(0) op(0)
patchset: git://git.gluster.com/glusterfs.git
signal received: 11
time of crash: 
2015-06-10 14:18:28
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.7.1
/usr/lib64/libglusterfs.so.0(_gf_msg_backtrace_nomem+0xb6)[0x3b0d824b66]
/usr/lib64/libglusterfs.so.0(gf_print_trace+0x33f)[0x3b0d84359f]
/lib64/libc.so.6[0x3b028326a0]
/usr/lib64/glusterfs/3.7.1/xlator/mgmt/glusterd.so(__glusterd_handle_remove_brick+0x88c)[0x7fcdd03e681c]
/usr/lib64/glusterfs/3.7.1/xlator/mgmt/glusterd.so(glusterd_big_locked_handler+0x3f)[0x7fcdd03542bf]
/usr/lib64/libglusterfs.so.0(synctask_wrap+0x12)[0x3b0d8655b2]
/lib64/libc.so.6[0x3b028438f0]
---------

How reproducible:
==================

Always

Steps to Reproduce:
===================
1. Create 2X3 distributed-replicate volume
2. Start the volume
3. Shrink it to 2X2 distributed-replicate volume by explicitly mentioning
replica 2 in 'remove-brick force'
4. Shrink the volume again to 2X1 distribute volume by explicitly mentioning
replica 1 in 'remove-brick force'

Actual results:
===============

Glusterd crash

Expected results:
=================

Removing a brick with replica count 2 from a replica 2 volume is a failure
case. glusterd should print the usage message or fail gracefully.
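
As an illustration of "fail gracefully", the expectation above can be
sketched as a small check that rejects the request with an error message
instead of proceeding. This is only a sketch under assumptions: the function
name check_replica_shrink and its messages are hypothetical and are not
glusterd code, and glusterd's real validation also has to look at the brick
list and the volume type.

```c
#include <stdio.h>

/*
 * Hypothetical helper, not part of glusterd.
 * Returns 0 when requested_replica is a valid shrink target for a volume
 * whose replica count is current_replica. Returns -1 and writes a message
 * into errbuf (which must be non-NULL) otherwise.
 */
int check_replica_shrink(int current_replica, int requested_replica,
                         char *errbuf, size_t errlen)
{
    if (requested_replica < 1) {
        snprintf(errbuf, errlen,
                 "replica count (%d) must be at least 1",
                 requested_replica);
        return -1;
    }

    /* Per the report: asking for replica 2 on a replica 2 volume is a
     * failure case and should be rejected, not processed. */
    if (requested_replica >= current_replica) {
        snprintf(errbuf, errlen,
                 "replica count (%d) must be less than the volume's "
                 "current replica count (%d)",
                 requested_replica, current_replica);
        return -1;
    }

    return 0;
}
```

With this sketch, a request for replica 2 on a replica 2 volume returns -1
with a message, while replica 2 on a replica 3 volume returns 0.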

--- Additional comment from Red Hat Bugzilla Rules Engine on 2015-06-10
05:05:33 EDT ---

This bug is automatically being proposed for Red Hat Gluster Storage 3.1.0 by
setting the release flag 'rhgs-3.1.0' to '?'.

If this bug should be proposed for a different release, please manually change
the proposed release flag.

--- Additional comment from Rahul Hinduja on 2015-06-10 05:07:22 EDT ---

[root@georep1 scripts]# gluster volume info

Volume Name: master
Type: Distributed-Replicate
Volume ID: 7156c64c-a44b-40a4-98db-247a06d1f41e
Status: Started
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: 10.70.46.96:/rhs/brick1/b1
Brick2: 10.70.46.97:/rhs/brick1/b1
Brick3: 10.70.46.96:/rhs/brick2/b2
Brick4: 10.70.46.97:/rhs/brick2/b2
Options Reconfigured:
changelog.changelog: on
geo-replication.ignore-pid-check: on
geo-replication.indexing: on
performance.readdir-ahead: on
[root@georep1 scripts]# gluster volume remove-brick master replica 2 10.70.46.97:/rhs/brick1/b1 10.70.46.97:/rhs/brick2/b2 start
Connection failed. Please check if gluster daemon is operational.
[root@georep1 scripts]#

--- Additional comment from Rahul Hinduja on 2015-06-10 05:15:20 EDT ---

sosreport @ http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/1230101/

Additional Info: this volume is part of geo-rep master cluster.

--- Additional comment from SATHEESARAN on 2015-06-10 05:44:58 EDT ---

I have tried to reproduce the issue.

It is reproducible only in the following case:

1. Create a 2X3 distributed-replicate volume
2. Shrink it to a 2X2 distributed-replicate volume
3. Shrink it from 2X2 to a 2X1 distribute volume

Here are a few more observations:
1. No crash is observed when creating a 2X2 volume and shrinking it to
2X1
2. No crash is observed when creating a 2X3 volume and shrinking it to
2X2
3. No crash is observed when trying to remove one brick from each replica
set, and a proper error message is shown

--- Additional comment from Anand Avati on 2015-06-10 11:21:02 EDT ---

REVIEW: http://review.gluster.org/11165 (glusterd: subvol_count value for
replicate volume should be calculate correctly) posted (#1) for review on
master by Gaurav Kumar Garg (ggarg at redhat.com)

--- Additional comment from Anand Avati on 2015-06-11 02:48:09 EDT ---

REVIEW: http://review.gluster.org/11165 (glusterd: subvol_count value for
replicate volume should be calculate correctly) posted (#2) for review on
master by Gaurav Kumar Garg (ggarg at redhat.com)

--- Additional comment from Anand Avati on 2015-06-11 08:21:22 EDT ---

REVIEW: http://review.gluster.org/11165 (glusterd: subvol_count value for
replicate volume should be calculate correctly) posted (#4) for review on
master by Gaurav Kumar Garg (ggarg at redhat.com)

--- Additional comment from Anand Avati on 2015-06-15 03:16:09 EDT ---

COMMIT: http://review.gluster.org/11165 committed in master by Krishnan
Parthasarathi (kparthas at redhat.com) 
------
commit 3fb18451311c34aeced1054472b6f81fc13dd679
Author: Gaurav Kumar Garg <ggarg at redhat.com>
Date:   Wed Jun 10 15:11:39 2015 +0530

    glusterd: subvol_count value for replicate volume should be calculate
    correctly

    glusterd was crashing while trying to remove bricks from replica set
    after shrinking nx3 replica to nx2 replica to nx1 replica.

    This is because volinfo->subvol_count was being calculated from the
    old replica count value.

    Change-Id: I1084a71e29c9cfa1cd85bdb4e82b943b1dc44372
    BUG: 1230121
    Signed-off-by: Gaurav Kumar Garg <ggarg at redhat.com>
    Reviewed-on: http://review.gluster.org/11165
    Reviewed-by: Atin Mukherjee <amukherj at redhat.com>
    Reviewed-by: Ravishankar N <ravishankar at redhat.com>
    Tested-by: Gluster Build System <jenkins at build.gluster.com>
    Tested-by: NetBSD Build System <jenkins at build.gluster.org>
    Reviewed-by: Krishnan Parthasarathi <kparthas at redhat.com>
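
The commit message names the cause but does not show the arithmetic. The
following is a simplified model, not glusterd's code: the function names are
hypothetical, and it assumes that the subvolume count of an n x r volume is
brick count / replica count, and that the 2X3 volume had already dropped to
4 bricks while the count was still derived from the old replica count of 3.
How exactly the stale value leads to the segfault in subvol_matcher_update
is not stated in this report.

```c
/* Number of replica sets in an n x r volume: bricks / replica count. */
int subvol_count(int brick_count, int replica_count)
{
    return brick_count / replica_count;
}

/* Index of the replica set holding the brick at list position pos. */
int subvol_index(int pos, int replica_count)
{
    return pos / replica_count;
}

/* Returns 1 if index is valid for a per-subvolume array of size count. */
int index_in_range(int index, int count)
{
    return index >= 0 && index < count;
}
```

In this model, subvol_count(4, 3) gives 1 where subvol_count(4, 2) gives the
correct 2. The last brick of the 2X2 volume sits at position 3, so
subvol_index(3, 2) is 1, which is out of range for any per-subvolume array
sized with the stale count of 1.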

Referenced Bugs:

https://bugzilla.redhat.com/show_bug.cgi?id=1230101
[Bug 1230101] [glusterd] glusterd crashed while trying to remove a bricks -
one selected from each replica set - after shrinking nX3 to nX2 to nX1
https://bugzilla.redhat.com/show_bug.cgi?id=1230121
[Bug 1230121] [glusterd] glusterd crashed while trying to remove a bricks -
one selected from each replica set - after shrinking nX3 to nX2 to nX1