[Bugs] [Bug 1230121] New: [glusterd] glusterd crashed while trying to remove a bricks - one selected from each replica set - after shrinking nX3 to nX2 to nX1

Wed Jun 10 09:48:25 UTC 2015

https://bugzilla.redhat.com/show_bug.cgi?id=1230121

            Bug ID: 1230121
           Summary: [glusterd] glusterd crashed while trying to remove a
                    bricks - one selected from each replica set - after
                    shrinking nX3 to nX2 to nX1
           Product: GlusterFS
           Version: 3.7.0
         Component: glusterd
          Keywords: Triaged
          Severity: high
          Assignee: bugs at gluster.org
          Reporter: sasundar at redhat.com
                CC: bugs at gluster.org, ggarg at redhat.com,
                    gluster-bugs at redhat.com
        Depends On: 1230101

+++ This bug was initially created as a clone of Bug #1230101 +++

Description of problem:
=======================

While running remove-brick with replica count 2 on an existing replica 2
volume, glusterd crashes with the following backtrace:

#0  0x00007fcdd03e681c in subvol_matcher_update (req=0x25989cc) at glusterd-brick-ops.c:662
#1  __glusterd_handle_remove_brick (req=0x25989cc) at glusterd-brick-ops.c:985
#2  0x00007fcdd03542bf in glusterd_big_locked_handler (req=0x25989cc, actor_fn=0x7fcdd03e5f90 <__glusterd_handle_remove_brick>) at glusterd-handler.c:83
#3  0x0000003b0d8655b2 in synctask_wrap (old_task=<value optimized out>) at syncop.c:375
#4  0x0000003b028438f0 in ?? () from /lib64/libc.so.6
#5  0x0000000000000000 in ?? ()
(gdb) 

Logs suggest:
=============

[2015-06-10 14:18:01.134630] I [glusterd-handler.c:1404:__glusterd_handle_cli_get_volume] 0-glusterd: Received get vol req
[2015-06-10 14:18:01.137158] I [glusterd-handler.c:1404:__glusterd_handle_cli_get_volume] 0-glusterd: Received get vol req
[2015-06-10 14:18:28.239515] I [glusterd-brick-ops.c:779:__glusterd_handle_remove_brick] 0-management: Received rem brick req
[2015-06-10 14:18:28.239593] I [glusterd-brick-ops.c:849:__glusterd_handle_remove_brick] 0-management: request to change replica-count to 2
pending frames:
frame : type(0) op(0)
patchset: git://git.gluster.com/glusterfs.git
signal received: 11
time of crash: 
2015-06-10 14:18:28
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.7.1
/usr/lib64/libglusterfs.so.0(_gf_msg_backtrace_nomem+0xb6)[0x3b0d824b66]
/usr/lib64/libglusterfs.so.0(gf_print_trace+0x33f)[0x3b0d84359f]
/lib64/libc.so.6[0x3b028326a0]
/usr/lib64/glusterfs/3.7.1/xlator/mgmt/glusterd.so(__glusterd_handle_remove_brick+0x88c)[0x7fcdd03e681c]
/usr/lib64/glusterfs/3.7.1/xlator/mgmt/glusterd.so(glusterd_big_locked_handler+0x3f)[0x7fcdd03542bf]
/usr/lib64/libglusterfs.so.0(synctask_wrap+0x12)[0x3b0d8655b2]
/lib64/libc.so.6[0x3b028438f0]
---------
(END) 

How reproducible:
==================

Always

Steps to Reproduce:
===================
1. Create 2X3 distributed-replicate volume
2. Start the volume
3. Shrink it to 2X2 distributed-replicate volume by explicitly mentioning
replica 2 in 'remove-brick force'
4. Shrink the volume again to 2X1 distribute volume by explicitly mentioning
replica 1 in 'remove-brick force'
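
The steps above can be scripted. The sketch below only builds the gluster
command lines and does not run them. The volume name, host names and brick
directories are placeholders, not values from this report, and
--mode=script is added on the assumption that the interactive confirmation
of 'remove-brick force' should be skipped.

```python
def build_repro_commands(volname, hosts, brick_dirs):
    """Return the gluster CLI invocations for the 2x3 -> 2x2 -> 2x1 shrink.

    hosts:      three placeholder host names, one per replica.
    brick_dirs: two placeholder brick directories, one per replica set.
    """
    # Consecutive bricks on the create command line form one replica set.
    sets = [[f"{host}:{path}" for host in hosts] for path in brick_dirs]
    all_bricks = [brick for s in sets for brick in s]
    base = ["gluster", "--mode=script", "volume"]
    return [
        # Step 1: create the 2x3 distributed-replicate volume.
        base + ["create", volname, "replica", "3"] + all_bricks,
        # Step 2: start it.
        base + ["start", volname],
        # Step 3: drop the third brick of each replica set (2x3 -> 2x2).
        base + ["remove-brick", volname, "replica", "2"]
        + [s[2] for s in sets] + ["force"],
        # Step 4: drop the second brick of each replica set (2x2 -> 2x1).
        base + ["remove-brick", volname, "replica", "1"]
        + [s[1] for s in sets] + ["force"],
    ]


if __name__ == "__main__":
    commands = build_repro_commands(
        "testvol", ["host1", "host2", "host3"], ["/bricks/b1", "/bricks/b2"]
    )
    for cmd in commands:
        print(" ".join(cmd))
```

Run on a disposable test cluster only: per this report, the last command
is expected to crash glusterd on affected builds.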

Actual results:
===============

Glusterd crash

Expected results:
=================

Removing a brick with replica count 2 from a replica 2 volume is a failure
case; glusterd should print usage or fail gracefully instead of crashing.
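
The kind of check this expectation implies can be modelled in a few lines.
This is a simplified illustration, not glusterd's actual logic: the function
name, its arguments and the error strings are all hypothetical, and the rule
it encodes (whole replica sets when the count is unchanged, an equal number
of bricks from every set when it is reduced) is an assumption about the
intended behaviour.

```python
from collections import Counter


def check_remove_brick(replica_sets, to_remove, new_replica):
    """Return None if the request looks consistent, else an error string.

    replica_sets: list of lists of brick names, one inner list per replica set.
    to_remove:    brick names given on the remove-brick command line.
    new_replica:  replica count given in the request.
    """
    old_replica = len(replica_sets[0])
    if not to_remove:
        return "no bricks specified"
    if new_replica < 1 or new_replica > old_replica:
        return f"invalid replica count {new_replica} (volume has {old_replica})"

    # Map every brick to the index of the replica set it belongs to.
    owner = {b: i for i, s in enumerate(replica_sets) for b in s}
    unknown = [b for b in to_remove if b not in owner]
    if unknown:
        return f"bricks not part of the volume: {unknown}"
    if len(set(to_remove)) != len(to_remove):
        return "a brick is listed more than once"

    per_set = Counter(owner[b] for b in to_remove)

    if new_replica == old_replica:
        # Replica count unchanged: only complete replica sets may be removed.
        if any(n != old_replica for n in per_set.values()):
            return "bricks must form complete replica sets"
        return None

    # Replica count reduced: every set must lose the same number of bricks.
    drop = old_replica - new_replica
    if any(per_set.get(i, 0) != drop for i in range(len(replica_sets))):
        return f"expected {drop} brick(s) from each replica set"
    return None


if __name__ == "__main__":
    # The 2 x 2 layout shown in the comments of this report.
    sets = [
        ["10.70.46.96:/rhs/brick1/b1", "10.70.46.97:/rhs/brick1/b1"],
        ["10.70.46.96:/rhs/brick2/b2", "10.70.46.97:/rhs/brick2/b2"],
    ]
    request = ["10.70.46.97:/rhs/brick1/b1", "10.70.46.97:/rhs/brick2/b2"]
    print(check_remove_brick(sets, request, 2))  # rejected under this model
    print(check_remove_brick(sets, request, 1))  # accepted under this model
```

Under this model, the 'replica 2' request against the 2 x 2 volume is
rejected with a message, which is the graceful failure asked for here.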

--- Additional comment from Red Hat Bugzilla Rules Engine on 2015-06-10 05:05:33 EDT ---

This bug is automatically being proposed for Red Hat Gluster Storage 3.1.0 by
setting the release flag 'rhgs-3.1.0' to '?'.

If this bug should be proposed for a different release, please manually change
the proposed release flag.

--- Additional comment from Rahul Hinduja on 2015-06-10 05:07:22 EDT ---

[root@georep1 scripts]# gluster volume info

Volume Name: master
Type: Distributed-Replicate
Volume ID: 7156c64c-a44b-40a4-98db-247a06d1f41e
Status: Started
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: 10.70.46.96:/rhs/brick1/b1
Brick2: 10.70.46.97:/rhs/brick1/b1
Brick3: 10.70.46.96:/rhs/brick2/b2
Brick4: 10.70.46.97:/rhs/brick2/b2
Options Reconfigured:
changelog.changelog: on
geo-replication.ignore-pid-check: on
geo-replication.indexing: on
performance.readdir-ahead: on
[root@georep1 scripts]# gluster volume remove-brick master replica 2 10.70.46.97:/rhs/brick1/b1 10.70.46.97:/rhs/brick2/b2 start
Connection failed. Please check if gluster daemon is operational.
[root@georep1 scripts]#

--- Additional comment from Rahul Hinduja on 2015-06-10 05:15:20 EDT ---

sosreport @ http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/1230101/

Additional Info: this volume is part of geo-rep master cluster.

--- Additional comment from SATHEESARAN on 2015-06-10 05:44:58 EDT ---

I have tried to reproduce the issue.

It's reproducible only in the following case:

1. Create a 2X3 distributed-replicate volume
2. Shrink it to a 2X2 distributed-replicate volume
3. Shrink it from 2X2 to a 2X1 distribute volume

Here are a few more observations:
1. No crash is observed when creating a 2X2 volume and shrinking it to
2X1
2. No crash is observed when creating a 2X3 volume and shrinking it to
2X2
3. No crash is observed when trying to remove a brick from each replica
set; a proper error message is shown

Referenced Bugs:

https://bugzilla.redhat.com/show_bug.cgi?id=1230101
[Bug 1230101] [glusterd] glusterd crashed while trying to remove a bricks -
one selected from each replica set - after shrinking nX3 to nX2 to nX1