[Bugs] [Bug 1184191] New: DHT: Rebalance- Rebalance process crash after remove-brick

bugzilla at redhat.com
Tue Jan 20 18:58:05 UTC 2015


https://bugzilla.redhat.com/show_bug.cgi?id=1184191

            Bug ID: 1184191
           Summary: DHT: Rebalance- Rebalance process crash after
                    remove-brick
           Product: GlusterFS
           Version: 3.6.1
         Component: distribute
          Severity: high
          Assignee: bugs at gluster.org
          Reporter: rabhat at redhat.com
                CC: bugs at gluster.org, gluster-bugs at redhat.com,
                    nbalacha at redhat.com, rabhat at redhat.com,
                    rhs-bugs at redhat.com, shmohan at redhat.com,
                    storage-qa-internal at redhat.com
        Depends On: 1159571, 1162767, 1159280



+++ This bug was initially created as a clone of Bug #1159571 +++

+++ This bug was initially created as a clone of Bug #1159280 +++

Description of problem:


Version-Release number of selected component (if applicable):
glusterfs-3.6.1

How reproducible:


Steps to Reproduce:
1. Created a 6x2 distributed-replicate volume
2. Created some data on the mount point
3. Started remove-brick

Actual results:
rebalance process crashed

Expected results:


Additional info:
Core was generated by `/usr/sbin/glusterfs
--volfile-server=rhs-client4.lab.eng.blr.redhat.com --volfi'.
Program terminated with signal 11, Segmentation fault.
#0  0x00007fbf389e9bcf in dht_lookup_everywhere_done (frame=0x7fbf3cb2c0f8,
this=0x1016db0) at dht-common.c:1189
1189                                   gf_log (this->name, GF_LOG_DEBUG,
Missing separate debuginfos, use: debuginfo-install
glibc-2.12-1.107.el6_4.6.x86_64 keyutils-libs-1.4-4.el6.x86_64
krb5-libs-1.10.3-10.el6_4.6.x86_64 libcom_err-1.41.12-14.el6_4.4.x86_64
libgcc-4.4.7-3.el6.x86_64 libselinux-2.0.94-5.3.el6_4.1.x86_64
openssl-1.0.1e-16.el6_5.15.x86_64 zlib-1.2.3-29.el6.x86_64
(gdb) bt
#0  0x00007fbf389e9bcf in dht_lookup_everywhere_done (frame=0x7fbf3cb2c0f8,
    this=0x1016db0) at dht-common.c:1189
#1  0x00007fbf389ede1b in dht_lookup_everywhere_cbk (frame=0x7fbf3cb2c0f8,
    cookie=<value optimized out>, this=0x1016db0,
    op_ret=<value optimized out>, op_errno=<value optimized out>,
    inode=0x7fbf30afe0c8, buf=0x7fbf3354085c, xattr=0x7fbf3c5271ac,
    postparent=0x7fbf335408cc) at dht-common.c:1515
#2  0x00007fbf38c6a298 in afr_lookup_done (frame=0x7fbeffffffc6,
    cookie=0x7ffff2e0b8e8, this=0x1016320, op_ret=<value optimized out>,
    op_errno=8, inode=0x11ce7e0, buf=0x7ffff2e0bb40, xattr=0x7fbf3c527238,
    postparent=0x7ffff2e0bad0) at afr-common.c:2223
#3  afr_lookup_cbk (frame=0x7fbeffffffc6, cookie=0x7ffff2e0b8e8,
    this=0x1016320, op_ret=<value optimized out>, op_errno=8,
    inode=0x11ce7e0, buf=0x7ffff2e0bb40, xattr=0x7fbf3c527238,
    postparent=0x7ffff2e0bad0) at afr-common.c:2454
#4  0x00007fbf38ea6a33 in client3_3_lookup_cbk (req=<value optimized out>,
    iov=<value optimized out>, count=<value optimized out>,
    myframe=0x7fbf3cb2ba40) at client-rpc-fops.c:2610
#5  0x00000035cac0e005 in rpc_clnt_handle_reply (clnt=0x1076630,
    pollin=0x10065e0) at rpc-clnt.c:773
#6  0x00000035cac0f5c7 in rpc_clnt_notify (trans=<value optimized out>,
    mydata=0x1076660, event=<value optimized out>,
    data=<value optimized out>) at rpc-clnt.c:906
#7  0x00000035cac0ae48 in rpc_transport_notify (this=<value optimized out>,
    event=<value optimized out>, data=<value optimized out>)
    at rpc-transport.c:512
#8  0x00007fbf3a105e36 in socket_event_poll_in (this=0x1086060)
    at socket.c:2136
#9  0x00007fbf3a10775d in socket_event_handler (fd=<value optimized out>,
    idx=<value optimized out>, data=0x1086060, poll_in=1,
    poll_out=0, poll_err=0) at socket.c:2246
#10 0x00000035ca462997 in event_dispatch_epoll_handler (event_pool=0xfe4ee0)
    at event-epoll.c:384
#11 event_dispatch_epoll (event_pool=0xfe4ee0) at event-epoll.c:445
#12 0x00000000004069d7 in main (argc=4, argv=0x7ffff2e0d7e8)
    at glusterfsd.c:2050


(gdb) bt
#0  0x00007f4d8d522bcf in dht_lookup_everywhere_done (frame=0x7f4d9145f85c, 
    this=0x22c1470) at dht-common.c:1189
#1  0x00007f4d8d526e1b in dht_lookup_everywhere_cbk (frame=0x7f4d9145f85c, 
    cookie=<value optimized out>, this=0x22c1470, 
    op_ret=<value optimized out>, op_errno=<value optimized out>, 
    inode=0x7f4d8396b53c, buf=0x7f4d8c100d38, xattr=0x7f4d90e5ab84, 
    postparent=0x7f4d8c100da8) at dht-common.c:1515
#2  0x00007f4d8d7a3298 in afr_lookup_done (frame=0x7f4cffffffc6, 
    cookie=0x7fffb7202cd8, this=0x22c09e0, op_ret=<value optimized out>, 
    op_errno=8, inode=0x25ccd70, buf=0x7fffb7202f30, xattr=0x7f4d90e5ab84, 
    postparent=0x7fffb7202ec0) at afr-common.c:2223
#3  afr_lookup_cbk (frame=0x7f4cffffffc6, cookie=0x7fffb7202cd8, 
    this=0x22c09e0, op_ret=<value optimized out>, op_errno=8, inode=0x25ccd70, 
    buf=0x7fffb7202f30, xattr=0x7f4d90e5ab84, postparent=0x7fffb7202ec0)
    at afr-common.c:2454
#4  0x00007f4d8d9dfa33 in client3_3_lookup_cbk (req=<value optimized out>, 
    iov=<value optimized out>, count=<value optimized out>, 
    myframe=0x7f4d9145eef4) at client-rpc-fops.c:2610
#5  0x00000035cac0e005 in rpc_clnt_handle_reply (clnt=0x22fc990, 
    pollin=0x230d380) at rpc-clnt.c:773
#6  0x00000035cac0f5c7 in rpc_clnt_notify (trans=<value optimized out>, 
    mydata=0x22fc9c0, event=<value optimized out>, data=<value optimized out>)
    at rpc-clnt.c:906
#7  0x00000035cac0ae48 in rpc_transport_notify (this=<value optimized out>, 
    event=<value optimized out>, data=<value optimized out>)
    at rpc-transport.c:512
#8  0x00007f4d8ea38e36 in socket_event_poll_in (this=0x230c420)
    at socket.c:2136
#9  0x00007f4d8ea3a75d in socket_event_handler (fd=<value optimized out>, 
    idx=<value optimized out>, data=0x230c420, poll_in=1, poll_out=0, 
    poll_err=0) at socket.c:2246
#10 0x00000035ca462997 in event_dispatch_epoll_handler (event_pool=0x2288ee0)
    at event-epoll.c:384
#11 event_dispatch_epoll (event_pool=0x2288ee0) at event-epoll.c:445
#12 0x00000000004069d7 in main (argc=11, argv=0x7fffb7204bd8)
    at glusterfsd.c:2050



(gdb) l
1184                                    goto unwind_hashed_and_cached;
1185                            } else {
1186
1187                                    local->skip_unlink.handle_valid_link = _gf_false;
1188
1189                                    gf_log (this->name, GF_LOG_DEBUG,
1190                                            "Linkto file found on hashed subvol "
1191                                            "and data file found on cached "
1192                                            "subvolume. But linkto points to "
1193                                            "different cached subvolume (%s) "
1194                                            "path %s",
1195                                            local->skip_unlink.hash_links_to->name,
1196                                            local->loc.path);
1197
1198                                    if (local->skip_unlink.opend_fd_count == 0) {
1199

(gdb) p local->skip_unlink.hash_links_to
$2 = (xlator_t *) 0x0
(gdb) p local->skip_unlink.hash_links_to->name
Cannot access memory at address 0x0
(gdb) p local->loc.path
$1 = 0x7f4d7c019000 "/test/f411"


(gdb) p *(dht_conf_t *)this->private
$4 = {subvolume_lock = 1, subvolume_cnt = 8, subvolumes = 0x22d57c0, 
  subvolume_status = 0x22d5810 "\001\001\001\001\001\001\001\001", 
  last_event = 0x22d5830, file_layouts = 0x22d6650, dir_layouts = 0x0, 
....


The trusted.glusterfs.dht.linkto xattr for "/test/f411" is set to
"qtest-replicate-8". That name refers to the replica subvolume that was
removed, so it is no longer present in the conf->subvolumes list (a sketch of
the failing name lookup follows the subvolume listing below):

(gdb) p ((dht_conf_t *)this->private)->subvolumes[0]->name
$18 = 0x22ba0a0 "qtest-replicate-0"
(gdb) p ((dht_conf_t *)this->private)->subvolumes[1]->name
$19 = 0x22bb780 "qtest-replicate-1"
(gdb) p ((dht_conf_t *)this->private)->subvolumes[2]->name
$20 = 0x22bc9b0 "qtest-replicate-2"
(gdb) p ((dht_conf_t *)this->private)->subvolumes[3]->name
$21 = 0x22bd420 "qtest-replicate-3"
(gdb) p ((dht_conf_t *)this->private)->subvolumes[4]->name
$22 = 0x22bdeb0 "qtest-replicate-4"
(gdb) p ((dht_conf_t *)this->private)->subvolumes[5]->name
$23 = 0x22be940 "qtest-replicate-5"
(gdb) p ((dht_conf_t *)this->private)->subvolumes[6]->name
$24 = 0x22bf3d0 "qtest-replicate-6"
(gdb) p ((dht_conf_t *)this->private)->subvolumes[7]->name
$25 = 0x22bfe60 "qtest-replicate-7"
(gdb) p ((dht_conf_t *)this->private)->subvolumes[8]->name
Cannot access memory at address 0x0
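
For illustration only, here is a self-contained sketch (hypothetical stand-in
types and helper, not the actual dht_linkfile_subvol() code) of why a by-name
lookup against conf->subvolumes returns NULL once the linkto target has been
removed:

#include <stdio.h>
#include <string.h>

/* Hypothetical stand-ins for the relevant fields of xlator_t and
 * dht_conf_t; the real structures live in the GlusterFS sources. */
typedef struct { const char *name; } xlator_t;
typedef struct {
        int        subvolume_cnt;
        xlator_t **subvolumes;
} dht_conf_t;

/* Resolve a linkto xattr value by subvolume name.  Returns NULL when the
 * name (here "qtest-replicate-8") is no longer part of the graph. */
static xlator_t *
resolve_linkto_subvol (dht_conf_t *conf, const char *linkto)
{
        int i;

        for (i = 0; i < conf->subvolume_cnt; i++)
                if (strcmp (conf->subvolumes[i]->name, linkto) == 0)
                        return conf->subvolumes[i];

        return NULL;
}

int
main (void)
{
        xlator_t s0 = { "qtest-replicate-0" };
        xlator_t s7 = { "qtest-replicate-7" };
        xlator_t *subvols[] = { &s0, &s7 };
        dht_conf_t conf = { 2, subvols };

        /* Mirrors the crashed process: the linkto target is gone, so the
         * value that ends up in hash_links_to is NULL. */
        xlator_t *hash_links_to =
                resolve_linkto_subvol (&conf, "qtest-replicate-8");
        printf ("hash_links_to = %p\n", (void *) hash_links_to);
        return 0;
}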



The local->skip_unlink.hash_links_to value is set in
dht_lookup_everywhere_cbk() without checking whether it is NULL. The
link_subvol returned by dht_linkfile_subvol() is NULL here because the linkto
target "qtest-replicate-8" is no longer part of the graph, and
dht_lookup_everywhere_done() then dereferences hash_links_to in the gf_log()
call at dht-common.c:1189 shown above:


                if (is_linkfile) {
                        link_subvol = dht_linkfile_subvol (this, inode, buf,
                                                           xattr);
                        gf_msg_debug (this->name, 0,
                                      "found on %s linkfile %s (-> %s)",
                                      subvol->name, loc->path,
                                      link_subvol ? link_subvol->name : "''");
                        goto unlock;
                }
 ...
 ...
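
A minimal sketch, assuming the surrounding dht_lookup_everywhere_done()
context (local, this), of the kind of NULL guard that would avoid the
dereference; this is not the actual upstream patch, just an illustration:

        if (local->skip_unlink.hash_links_to) {
                gf_log (this->name, GF_LOG_DEBUG,
                        "Linkto file found on hashed subvol "
                        "and data file found on cached "
                        "subvolume. But linkto points to "
                        "different cached subvolume (%s) "
                        "path %s",
                        local->skip_unlink.hash_links_to->name,
                        local->loc.path);
        } else {
                /* The linkto target (e.g. "qtest-replicate-8") is no longer
                 * in the graph, so do not dereference the NULL pointer. */
                gf_log (this->name, GF_LOG_DEBUG,
                        "Linkto target for path %s is not part of the "
                        "volume any more", local->loc.path);
        }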

======================================================================================================
On the bricks:

[root at rhs-client4 ~]# getfattr  -d -m . /home/qtest*/test/f411
getfattr: Removing leading '/' from absolute path names
# file: home/qtest12/test/f411
trusted.gfid=0sJg43JQHHRJST/cXjXyY0wg==
trusted.glusterfs.dht.linkto="qtest-replicate-8"  <----------THIS!!!
trusted.glusterfs.quota.f3874c91-e295-45d9-a95a-252d54b15ba0.contri=0sAAAAAAAAAAA=
trusted.pgfid.f3874c91-e295-45d9-a95a-252d54b15ba0=0sAAAAAQ==

# file: home/qtest17/test/f411
trusted.afr.qtest-client-16=0sAAAAAAAAAAAAAAAA
trusted.afr.qtest-client-17=0sAAAAAAAAAAAAAAAA
trusted.gfid=0sJg43JQHHRJST/cXjXyY0wg==
trusted.glusterfs.quota.f3874c91-e295-45d9-a95a-252d54b15ba0.contri=0sAAAAAAAQAAA=
trusted.pgfid.f3874c91-e295-45d9-a95a-252d54b15ba0=0sAAAAAQ==


======================================================================================================


Referenced Bugs:

https://bugzilla.redhat.com/show_bug.cgi?id=1159280
[Bug 1159280] DHT: Rebalance- Rebalance process crash after remove-brick
https://bugzilla.redhat.com/show_bug.cgi?id=1159571
[Bug 1159571] DHT: Rebalance- Rebalance process crash after remove-brick
https://bugzilla.redhat.com/show_bug.cgi?id=1162767
[Bug 1162767] DHT: Rebalance- Rebalance process crash after remove-brick