[Bugs] [Bug 1184191] New: DHT: Rebalance- Rebalance process crash after remove-brick
bugzilla at redhat.com
Tue Jan 20 18:58:05 UTC 2015
https://bugzilla.redhat.com/show_bug.cgi?id=1184191
Bug ID: 1184191
Summary: DHT: Rebalance- Rebalance process crash after remove-brick
Product: GlusterFS
Version: 3.6.1
Component: distribute
Severity: high
Assignee: bugs at gluster.org
Reporter: rabhat at redhat.com
CC: bugs at gluster.org, gluster-bugs at redhat.com,
nbalacha at redhat.com, rabhat at redhat.com,
rhs-bugs at redhat.com, shmohan at redhat.com,
storage-qa-internal at redhat.com
Depends On: 1159571, 1162767, 1159280
+++ This bug was initially created as a clone of Bug #1159571 +++
+++ This bug was initially created as a clone of Bug #1159280 +++
Description of problem:
The rebalance process crashes with SIGSEGV during a remove-brick operation. The crash is a NULL pointer dereference in dht_lookup_everywhere_done() at dht-common.c:1189 (analysis below).
Version-Release number of selected component (if applicable):
glusterfs-3.6.1
How reproducible:
Steps to Reproduce:
1. created 6x2 dist-rep volume
2. created some data on the mount point
3. started remove-brick (see the command sketch below)
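For context, a hedged sketch of the commands these steps correspond to (hostnames and brick paths are hypothetical; the volume name "qtest" is taken from the subvolume names in the core analysis below):

# 1. create a 6x2 distributed-replicate volume (12 bricks, replica 2)
gluster volume create qtest replica 2 \
    host1:/bricks/b1 host2:/bricks/b1 \
    host1:/bricks/b2 host2:/bricks/b2 \
    host1:/bricks/b3 host2:/bricks/b3 \
    host1:/bricks/b4 host2:/bricks/b4 \
    host1:/bricks/b5 host2:/bricks/b5 \
    host1:/bricks/b6 host2:/bricks/b6
gluster volume start qtest
# 2. populate data on a FUSE mount of the volume, then
# 3. remove one replica pair, which starts the remove-brick rebalance
gluster volume remove-brick qtest host1:/bricks/b6 host2:/bricks/b6 start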
Actual results:
rebalance process crashed
Expected results:
The remove-brick rebalance should complete without crashing.
Additional info:
Core was generated by `/usr/sbin/glusterfs --volfile-server=rhs-client4.lab.eng.blr.redhat.com --volfi'.
Program terminated with signal 11, Segmentation fault.
#0 0x00007fbf389e9bcf in dht_lookup_everywhere_done (frame=0x7fbf3cb2c0f8,
this=0x1016db0) at dht-common.c:1189
1189 gf_log (this->name, GF_LOG_DEBUG,
Missing separate debuginfos, use: debuginfo-install
glibc-2.12-1.107.el6_4.6.x86_64 keyutils-libs-1.4-4.el6.x86_64
krb5-libs-1.10.3-10.el6_4.6.x86_64 libcom_err-1.41.12-14.el6_4.4.x86_64
libgcc-4.4.7-3.el6.x86_64 libselinux-2.0.94-5.3.el6_4.1.x86_64
openssl-1.0.1e-16.el6_5.15.x86_64 zlib-1.2.3-29.el6.x86_64
(gdb) bt
#0 0x00007fbf389e9bcf in dht_lookup_everywhere_done (frame=0x7fbf3cb2c0f8,
this=0x1016db0) at dht-common.c:1189
#1 0x00007fbf389ede1b in dht_lookup_everywhere_cbk (frame=0x7fbf3cb2c0f8,
cookie=<value optimized out>, this=0x1016db0,
op_ret=<value optimized out>, op_errno=<value optimized out>,
inode=0x7fbf30afe0c8, buf=0x7fbf3354085c, xattr=0x7fbf3c5271ac,
postparent=0x7fbf335408cc) at dht-common.c:1515
#2 0x00007fbf38c6a298 in afr_lookup_done (frame=0x7fbeffffffc6,
cookie=0x7ffff2e0b8e8, this=0x1016320,
op_ret=<value optimized out>, op_errno=8, inode=0x11ce7e0,
buf=0x7ffff2e0bb40, xattr=0x7fbf3c527238, postparent=0x7ffff2e0bad0)
at afr-common.c:2223
#3 afr_lookup_cbk (frame=0x7fbeffffffc6, cookie=0x7ffff2e0b8e8,
this=0x1016320, op_ret=<value optimized out>, op_errno=8,
inode=0x11ce7e0, buf=0x7ffff2e0bb40, xattr=0x7fbf3c527238,
postparent=0x7ffff2e0bad0) at afr-common.c:2454
#4 0x00007fbf38ea6a33 in client3_3_lookup_cbk (req=<value optimized out>,
iov=<value optimized out>, count=<value optimized out>,
myframe=0x7fbf3cb2ba40) at client-rpc-fops.c:2610
#5 0x00000035cac0e005 in rpc_clnt_handle_reply (clnt=0x1076630,
pollin=0x10065e0) at rpc-clnt.c:773
#6 0x00000035cac0f5c7 in rpc_clnt_notify (trans=<value optimized out>,
mydata=0x1076660, event=<value optimized out>,
data=<value optimized out>) at rpc-clnt.c:906
#7 0x00000035cac0ae48 in rpc_transport_notify (this=<value optimized out>,
event=<value optimized out>, data=<value optimized out>)
at rpc-transport.c:512
#8 0x00007fbf3a105e36 in socket_event_poll_in (this=0x1086060) at socket.c:2136
#9 0x00007fbf3a10775d in socket_event_handler (fd=<value optimized out>,
    idx=<value optimized out>, data=0x1086060, poll_in=1,
    poll_out=0, poll_err=0) at socket.c:2246
#10 0x00000035ca462997 in event_dispatch_epoll_handler (event_pool=0xfe4ee0)
    at event-epoll.c:384
#11 event_dispatch_epoll (event_pool=0xfe4ee0) at event-epoll.c:445
#12 0x00000000004069d7 in main (argc=4, argv=0x7ffff2e0d7e8) at glusterfsd.c:2050
Backtrace from a second core of the same crash:
(gdb) bt
#0 0x00007f4d8d522bcf in dht_lookup_everywhere_done (frame=0x7f4d9145f85c,
this=0x22c1470) at dht-common.c:1189
#1 0x00007f4d8d526e1b in dht_lookup_everywhere_cbk (frame=0x7f4d9145f85c,
cookie=<value optimized out>, this=0x22c1470,
op_ret=<value optimized out>, op_errno=<value optimized out>,
inode=0x7f4d8396b53c, buf=0x7f4d8c100d38, xattr=0x7f4d90e5ab84,
postparent=0x7f4d8c100da8) at dht-common.c:1515
#2 0x00007f4d8d7a3298 in afr_lookup_done (frame=0x7f4cffffffc6,
cookie=0x7fffb7202cd8, this=0x22c09e0, op_ret=<value optimized out>,
op_errno=8, inode=0x25ccd70, buf=0x7fffb7202f30, xattr=0x7f4d90e5ab84,
postparent=0x7fffb7202ec0) at afr-common.c:2223
#3 afr_lookup_cbk (frame=0x7f4cffffffc6, cookie=0x7fffb7202cd8,
this=0x22c09e0, op_ret=<value optimized out>, op_errno=8, inode=0x25ccd70,
buf=0x7fffb7202f30, xattr=0x7f4d90e5ab84, postparent=0x7fffb7202ec0)
at afr-common.c:2454
#4 0x00007f4d8d9dfa33 in client3_3_lookup_cbk (req=<value optimized out>,
iov=<value optimized out>, count=<value optimized out>,
myframe=0x7f4d9145eef4) at client-rpc-fops.c:2610
#5 0x00000035cac0e005 in rpc_clnt_handle_reply (clnt=0x22fc990,
pollin=0x230d380) at rpc-clnt.c:773
#6 0x00000035cac0f5c7 in rpc_clnt_notify (trans=<value optimized out>,
mydata=0x22fc9c0, event=<value optimized out>, data=<value optimized out>)
at rpc-clnt.c:906
#7 0x00000035cac0ae48 in rpc_transport_notify (this=<value optimized out>,
event=<value optimized out>, data=<value optimized out>)
at rpc-transport.c:512
#8 0x00007f4d8ea38e36 in socket_event_poll_in (this=0x230c420)
at socket.c:2136
#9 0x00007f4d8ea3a75d in socket_event_handler (fd=<value optimized out>,
idx=<value optimized out>, data=0x230c420, poll_in=1, poll_out=0,
poll_err=0) at socket.c:2246
#10 0x00000035ca462997 in event_dispatch_epoll_handler (event_pool=0x2288ee0)
at event-epoll.c:384
#11 event_dispatch_epoll (event_pool=0x2288ee0) at event-epoll.c:445
#12 0x00000000004069d7 in main (argc=11, argv=0x7fffb7204bd8)
at glusterfsd.c:2050
(gdb) l
1184                                goto unwind_hashed_and_cached;
1185                        } else {
1186
1187                                local->skip_unlink.handle_valid_link = _gf_false;
1188
1189                                gf_log (this->name, GF_LOG_DEBUG,
1190                                        "Linkto file found on hashed subvol "
1191                                        "and data file found on cached "
1192                                        "subvolume. But linkto points to "
1193                                        "different cached subvolume (%s) "
(gdb)
1194                                        "path %s",
1195                                        local->skip_unlink.hash_links_to->name,
1196                                        local->loc.path);
1197
1198                                if (local->skip_unlink.opend_fd_count == 0) {
1199
(gdb) p local->skip_unlink.hash_links_to
$2 = (xlator_t *) 0x0
(gdb) p local->skip_unlink.hash_links_to->name
Cannot access memory at address 0x0
(gdb) p local->loc.path
$1 = 0x7f4d7c019000 "/test/f411"
(gdb) p *(dht_conf_t *)this->private
$4 = {subvolume_lock = 1, subvolume_cnt = 8, subvolumes = 0x22d57c0,
subvolume_status = 0x22d5810 "\001\001\001\001\001\001\001\001",
last_event = 0x22d5830, file_layouts = 0x22d6650, dir_layouts = 0x0,
....
The trusted.glusterfs.dht.linkto xattr for "/test/f411" is "qtest-replicate-8". This names the subvolume whose bricks were removed, so it is no longer found in the conf->subvolumes list:
(gdb) p ((dht_conf_t *)this->private)->subvolumes[0]->name
$18 = 0x22ba0a0 "qtest-replicate-0"
(gdb) p ((dht_conf_t *)this->private)->subvolumes[1]->name
$19 = 0x22bb780 "qtest-replicate-1"
(gdb) p ((dht_conf_t *)this->private)->subvolumes[2]->name
$20 = 0x22bc9b0 "qtest-replicate-2"
(gdb) p ((dht_conf_t *)this->private)->subvolumes[3]->name
$21 = 0x22bd420 "qtest-replicate-3"
(gdb) p ((dht_conf_t *)this->private)->subvolumes[4]->name
$22 = 0x22bdeb0 "qtest-replicate-4"
(gdb) p ((dht_conf_t *)this->private)->subvolumes[5]->name
$23 = 0x22be940 "qtest-replicate-5"
(gdb) p ((dht_conf_t *)this->private)->subvolumes[6]->name
$24 = 0x22bf3d0 "qtest-replicate-6"
(gdb) p ((dht_conf_t *)this->private)->subvolumes[7]->name
$25 = 0x22bfe60 "qtest-replicate-7"
(gdb) p ((dht_conf_t *)this->private)->subvolumes[8]->name
Cannot access memory at address 0x0
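This confirms the mechanism: the linkto xattr names a subvolume that no longer exists, so resolving the name against conf->subvolumes yields NULL. A minimal illustrative sketch of such a name lookup (an assumption about the behavior of dht_linkfile_subvol(), not its actual code; lookup_subvol_by_name is a hypothetical helper):

#include <string.h>

/* xlator_t and dht_conf_t are GlusterFS types (xlator.h,
 * dht-common.h); this is a sketch only. */
static xlator_t *
lookup_subvol_by_name (dht_conf_t *conf, const char *linkto_name)
{
        int i = 0;

        for (i = 0; i < conf->subvolume_cnt; i++) {
                /* compares against qtest-replicate-0 .. qtest-replicate-7 */
                if (strcmp (conf->subvolumes[i]->name, linkto_name) == 0)
                        return conf->subvolumes[i];
        }

        /* "qtest-replicate-8" was removed, so it falls through to here */
        return NULL;
}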
The local->skip_unlink.hash_links_to value is set in dht_lookup_everywhere_cbk() without checking whether it is NULL:
        if (is_linkfile) {
                link_subvol = dht_linkfile_subvol (this, inode, buf, xattr);
                gf_msg_debug (this->name, 0,
                              "found on %s linkfile %s (-> %s)",
                              subvol->name, loc->path,
                              link_subvol ? link_subvol->name : "''");
                goto unlock;
        }
...
...
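The immediate crash, then, is the unconditional local->skip_unlink.hash_links_to->name dereference inside the gf_log() call at dht-common.c:1189. A hedged sketch of the kind of NULL guard that log call needs, mirroring the link_subvol ? ... : "''" pattern already used in the debug message above (the actual upstream fix may differ):

/* sketch of a NULL-safe variant of the dht-common.c:1189 log call */
gf_log (this->name, GF_LOG_DEBUG,
        "Linkto file found on hashed subvol "
        "and data file found on cached "
        "subvolume. But linkto points to "
        "different cached subvolume (%s) "
        "path %s",
        local->skip_unlink.hash_links_to ?
                local->skip_unlink.hash_links_to->name : "<NULL>",
        local->loc.path);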
======================================================================================================
On the bricks:
[root at rhs-client4 ~]# getfattr -d -m . /home/qtest*/test/f411
getfattr: Removing leading '/' from absolute path names
# file: home/qtest12/test/f411
trusted.gfid=0sJg43JQHHRJST/cXjXyY0wg==
trusted.glusterfs.dht.linkto="qtest-replicate-8" <----------THIS!!!
trusted.glusterfs.quota.f3874c91-e295-45d9-a95a-252d54b15ba0.contri=0sAAAAAAAAAAA=
trusted.pgfid.f3874c91-e295-45d9-a95a-252d54b15ba0=0sAAAAAQ==
# file: home/qtest17/test/f411
trusted.afr.qtest-client-16=0sAAAAAAAAAAAAAAAA
trusted.afr.qtest-client-17=0sAAAAAAAAAAAAAAAA
trusted.gfid=0sJg43JQHHRJST/cXjXyY0wg==
trusted.glusterfs.quota.f3874c91-e295-45d9-a95a-252d54b15ba0.contri=0sAAAAAAAQAAA=
trusted.pgfid.f3874c91-e295-45d9-a95a-252d54b15ba0=0sAAAAAQ==
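The copy on qtest12 is a DHT linkto file: an empty placeholder whose permission bits are just the sticky bit and which carries the trusted.glusterfs.dht.linkto xattr, while the real data lives on qtest17 (the zero quota contri on qtest12 versus the nonzero contri on qtest17 is consistent with this). An illustrative check of that shape, assuming the usual linkfile convention (in GlusterFS the real check is DHT's check_is_linkfile() helper):

#include <stdbool.h>
#include <sys/types.h>
#include <sys/stat.h>

/* illustrative only, not the DHT implementation */
static bool
looks_like_linkto (mode_t mode, off_t size, bool has_linkto_xattr)
{
        /* an empty file whose permission bits are exactly the sticky
         * bit, carrying trusted.glusterfs.dht.linkto */
        return (size == 0) &&
               ((mode & 07777) == S_ISVTX) &&
               has_linkto_xattr;
}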
======================================================================================================
Referenced Bugs:
https://bugzilla.redhat.com/show_bug.cgi?id=1159280
[Bug 1159280] DHT: Rebalance- Rebalance process crash after remove-brick
https://bugzilla.redhat.com/show_bug.cgi?id=1159571
[Bug 1159571] DHT: Rebalance- Rebalance process crash after remove-brick
https://bugzilla.redhat.com/show_bug.cgi?id=1162767
[Bug 1162767] DHT: Rebalance- Rebalance process crash after remove-brick