[Bugs] [Bug 1486538] New: [geo-rep+qr]: Crashes observed at slave from qr_lookup_sbk during rename/hardlink/rebalance cases

bugzilla at redhat.com bugzilla at redhat.com
Wed Aug 30 04:26:25 UTC 2017


https://bugzilla.redhat.com/show_bug.cgi?id=1486538

            Bug ID: 1486538
           Summary: [geo-rep+qr]: Crashes observed at slave from
                    qr_lookup_sbk during rename/hardlink/rebalance cases
           Product: GlusterFS
           Version: 3.12
         Component: distribute
          Severity: high
          Assignee: bugs at gluster.org
          Reporter: nbalacha at redhat.com
                CC: atumball at redhat.com, bugs at gluster.org,
                    nbalacha at redhat.com, pkarampu at redhat.com,
                    ravishankar at redhat.com, rhinduja at redhat.com,
                    rhs-bugs at redhat.com
        Depends On: 1484113, 1484709



+++ This bug was initially created as a clone of Bug #1484709 +++

+++ This bug was initially created as a clone of Bug #1484113 +++

Description of problem:
=======================

Running rebalance as part of a remove-brick operation, on a volume with a
hardlink/rename IO pattern, resulted in multiple crashes at the slave with the
following backtrace:

(gdb) bt
#0  0x00007f2152cebaeb in __memcpy_sse2 () from /lib64/libc.so.6
#1  0x00007f2145b9b824 in memcpy (__len=<optimized out>, __src=<optimized out>,
__dest=<optimized out>) at /usr/include/bits/string3.h:51
#2  qr_content_extract (xdata=xdata at entry=0x7f21340658c0) at quick-read.c:279
#3  0x00007f2145b9bdaa in qr_lookup_cbk (frame=0x7f212c05fff0,
cookie=<optimized out>, this=0x7f214001e4b0, op_ret=0, op_errno=117,
inode_ret=0x7f212c05bb60, 
    buf=0x7f212c002b58, xdata=0x7f21340658c0, postparent=0x7f212c002d88) at
quick-read.c:423
#4  0x00007f2145da7de7 in ioc_lookup_cbk (frame=0x7f212c057f00,
cookie=<optimized out>, this=<optimized out>, op_ret=<optimized out>,
op_errno=<optimized out>, 
    inode=0x7f212c05bb60, stbuf=0x7f212c002b58, xdata=0x7f21340658c0,
postparent=0x7f212c002d88) at io-cache.c:267
#5  0x00007f2146614627 in dht_discover_complete
(this=this at entry=0x7f2140016f40,
discover_frame=discover_frame at entry=0x7f212c061500) at dht-common.c:577
#6  0x00007f214661534a in dht_discover_cbk (frame=0x7f212c061500,
cookie=0x7f21400135b0, this=0x7f2140016f40, op_ret=<optimized out>,
op_errno=117, 
    inode=0x7f212c05bb60, stbuf=0x7f212c05e648, xattr=0x7f21340658c0,
postparent=0x7f212c05e6b8) at dht-common.c:700
#7  0x00007f21468c4d10 in afr_discover_done (this=<optimized out>,
frame=0x7f212c0546c0) at afr-common.c:2624
#8  afr_discover_cbk (frame=frame at entry=0x7f212c0546c0, cookie=<optimized out>,
this=<optimized out>, op_ret=<optimized out>, op_errno=<optimized out>, 
    inode=inode at entry=0x7f212c05bb60, buf=buf at entry=0x7f2147537920,
xdata=0x7f21400adca0, postparent=postparent at entry=0x7f2147537990) at
afr-common.c:2669
#9  0x00007f2146b012c7 in client3_3_lookup_cbk (req=<optimized out>,
iov=<optimized out>, count=<optimized out>, myframe=0x7f212c061380) at
client-rpc-fops.c:2947
#10 0x00007f21543f2840 in rpc_clnt_handle_reply
(clnt=clnt at entry=0x7f2140080400, pollin=pollin at entry=0x7f21400ee830) at
rpc-clnt.c:794
#11 0x00007f21543f2b27 in rpc_clnt_notify (trans=<optimized out>,
mydata=0x7f2140080430, event=<optimized out>, data=0x7f21400ee830) at
rpc-clnt.c:987
#12 0x00007f21543ee9e3 in rpc_transport_notify (this=this at entry=0x7f21400805a0,
event=event at entry=RPC_TRANSPORT_MSG_RECEIVED, data=data at entry=0x7f21400ee830)
    at rpc-transport.c:538
#13 0x00007f2148fef3d6 in socket_event_poll_in (this=this at entry=0x7f21400805a0,
notify_handled=<optimized out>) at socket.c:2306
#14 0x00007f2148ff197c in socket_event_handler (fd=13, idx=3, gen=1,
data=0x7f21400805a0, poll_in=1, poll_out=0, poll_err=0) at socket.c:2458
#15 0x00007f21546841e6 in event_dispatch_epoll_handler (event=0x7f2147537e80,
event_pool=0x56504ff3cee0) at event-epoll.c:572
#16 event_dispatch_epoll_worker (data=0x56504ff95d60) at event-epoll.c:648
#17 0x00007f2153488e25 in start_thread () from /lib64/libpthread.so.0
#18 0x00007f2152d5534d in clone () from /lib64/libc.so.6
(gdb) 

A similar bug was seen when the Master volume was tiered and the slave was DR,
as mentioned in BZ 1291617. However, that bug was verified after retrying 3
times on the same setup.

Raising this as a new bug, without a tiered volume and with a heavy IO pattern.




How reproducible:
=================

2/2


Steps Carried:
==============
1. Create Master volume (5x2) and Slave volume (4x2)
2. Setup geo-rep session between master and slave
3. Mount the Master volume and start the following IO:

for i in
{create,chmod,hardlink,chgrp,symlink,hardlink,truncate,hardlink,rename,hardlink,symlink,hardlink,rename,chown,rename,create,hardlink,hardlink,symlink,rename};
do crefi --multi -n 5 -b 10 -d 10 --max=10K --min=500 --random -T 10 -t text
--fop=$i /mnt/master_nfs/b ; sleep 10 ; done

4. Wait a couple of minutes for geo-rep to start syncing these files to the slave.
5. While syncing is in progress, remove one subvolume from the slave volume
(making it 3x2) via rebalance start.
6. After about 10 minutes, remove one subvolume from the Master volume too
(making it 4x2).
7. Keep checking the rebalance status; once it completes at the slave, commit
the remove-brick.
8. Keep checking the rebalance status; once it completes at the Master, stop
geo-replication, commit the remove-brick, and start geo-replication again.


Actual results:
===============

Crashes observed at slave


Expected results:
=================

No crashes should be seen


--- Additional comment from Nithya Balachandran on 2017-08-23 06:09:56 EDT ---

Here is what I have so far:

gdb -c /core.4537 /usr/sbin/glusterfs

Core was generated by `/usr/sbin/glusterfs --aux-gfid-mount --acl
--log-file=/var/log/glusterfs/geo-re'.
Program terminated with signal 11, Segmentation fault.
#0  0x00007f2152cebaeb in __memcpy_sse2 () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install glibc-2.17-196.el7.x86_64
keyutils-libs-1.5.8-3.el7.x86_64 krb5-libs-1.15.1-8.el7.x86_64
libcom_err-1.42.9-10.el7.x86_64 libgcc-4.8.5-16.el7.x86_64
libselinux-2.5-11.el7.x86_64 libuuid-2.23.2-43.el7.x86_64
openssl-libs-1.0.2k-8.el7.x86_64 pcre-8.32-17.el7.x86_64
sssd-client-1.15.2-50.el7.x86_64 zlib-1.2.7-17.el7.x86_64
(gdb) bt
#0  0x00007f2152cebaeb in __memcpy_sse2 () from /lib64/libc.so.6
#1  0x00007f2145b9b824 in memcpy (__len=<optimized out>, __src=<optimized out>,
__dest=<optimized out>) at /usr/include/bits/string3.h:51
#2  qr_content_extract (xdata=xdata at entry=0x7f21340658c0) at quick-read.c:279
#3  0x00007f2145b9bdaa in qr_lookup_cbk (frame=0x7f212c05fff0,
cookie=<optimized out>, this=0x7f214001e4b0, op_ret=0, op_errno=117, 
    inode_ret=0x7f212c05bb60, buf=0x7f212c002b58, xdata=0x7f21340658c0,
postparent=0x7f212c002d88) at quick-read.c:423
#4  0x00007f2145da7de7 in ioc_lookup_cbk (frame=0x7f212c057f00,
cookie=<optimized out>, this=<optimized out>, op_ret=<optimized out>, 
    op_errno=<optimized out>, inode=0x7f212c05bb60, stbuf=0x7f212c002b58,
xdata=0x7f21340658c0, postparent=0x7f212c002d88) at io-cache.c:267
#5  0x00007f2146614627 in dht_discover_complete
(this=this at entry=0x7f2140016f40,
discover_frame=discover_frame at entry=0x7f212c061500) at dht-common.c:577
#6  0x00007f214661534a in dht_discover_cbk (frame=0x7f212c061500,
cookie=0x7f21400135b0, this=0x7f2140016f40, op_ret=<optimized out>,
op_errno=117, 
    inode=0x7f212c05bb60, stbuf=0x7f212c05e648, xattr=0x7f21340658c0,
postparent=0x7f212c05e6b8) at dht-common.c:700
#7  0x00007f21468c4d10 in afr_discover_done (this=<optimized out>,
frame=0x7f212c0546c0) at afr-common.c:2624
#8  afr_discover_cbk (frame=frame at entry=0x7f212c0546c0, cookie=<optimized out>,
this=<optimized out>, op_ret=<optimized out>, op_errno=<optimized out>, 
    inode=inode at entry=0x7f212c05bb60, buf=buf at entry=0x7f2147537920,
xdata=0x7f21400adca0, postparent=postparent at entry=0x7f2147537990) at
afr-common.c:2669
#9  0x00007f2146b012c7 in client3_3_lookup_cbk (req=<optimized out>,
iov=<optimized out>, count=<optimized out>, myframe=0x7f212c061380)
    at client-rpc-fops.c:2947
#10 0x00007f21543f2840 in rpc_clnt_handle_reply
(clnt=clnt at entry=0x7f2140080400, pollin=pollin at entry=0x7f21400ee830) at
rpc-clnt.c:794
#11 0x00007f21543f2b27 in rpc_clnt_notify (trans=<optimized out>,
mydata=0x7f2140080430, event=<optimized out>, data=0x7f21400ee830) at
rpc-clnt.c:987
#12 0x00007f21543ee9e3 in rpc_transport_notify (this=this at entry=0x7f21400805a0,
event=event at entry=RPC_TRANSPORT_MSG_RECEIVED, data=data at entry=0x7f21400ee830)
    at rpc-transport.c:538
#13 0x00007f2148fef3d6 in socket_event_poll_in (this=this at entry=0x7f21400805a0,
notify_handled=<optimized out>) at socket.c:2306
#14 0x00007f2148ff197c in socket_event_handler (fd=13, idx=3, gen=1,
data=0x7f21400805a0, poll_in=1, poll_out=0, poll_err=0) at socket.c:2458
#15 0x00007f21546841e6 in event_dispatch_epoll_handler (event=0x7f2147537e80,
event_pool=0x56504ff3cee0) at event-epoll.c:572
#16 event_dispatch_epoll_worker (data=0x56504ff95d60) at event-epoll.c:648
#17 0x00007f2153488e25 in start_thread () from /lib64/libpthread.so.0
#18 0x00007f2152d5534d in clone () from /lib64/libc.so.6
(gdb) 

The process crashed in __memcpy_sse2, which was called from
qr_content_extract.

(gdb) f 2
#2  qr_content_extract (xdata=xdata at entry=0x7f21340658c0) at quick-read.c:279
279        memcpy (content, data->data, data->len);
(gdb) p *data
$1 = {is_static = 0 '\000', is_const = 0 '\000', len = -1162167622, data =
0x7f2134000cf0 "0", refcount = 0, lock = {spinlock = 0, mutex = {__data = {
        __lock = 0, __count = 0, __owner = 0, __nusers = 0, __kind = -1,
__spins = 0, __elision = 0, __list = {__prev = 0x0, __next = 0x0}}, 
      __size = '\000' <repeats 16 times>, "\377\377\377\377", '\000' <repeats
19 times>, __align = 0}}}
(gdb) p/x *data
$2 = {is_static = 0x0, is_const = 0x0, len = 0xbabababa, data = 0x7f2134000cf0,
refcount = 0x0, lock = {spinlock = 0x0, mutex = {__data = {__lock = 0x0, 
        __count = 0x0, __owner = 0x0, __nusers = 0x0, __kind = 0xffffffff,
__spins = 0x0, __elision = 0x0, __list = {__prev = 0x0, __next = 0x0}}, __size
= {
        0x0 <repeats 16 times>, 0xff, 0xff, 0xff, 0xff, 0x0 <repeats 20
times>}, __align = 0x0}}}
(gdb)

This data struct looks like it has already been freed. The memcpy was passed a
bogus length (data->len = 0xbabababa), which caused the crash.

Analysing the core
-------------------

First, check if the client returns the correct xdata:

(gdb) f 9
#9  0x00007f2146b012c7 in client3_3_lookup_cbk (req=<optimized out>,
iov=<optimized out>, count=<optimized out>, myframe=0x7f212c061380)
    at client-rpc-fops.c:2947
2947            CLIENT_STACK_UNWIND (lookup, frame, rsp.op_ret, rsp.op_errno,
inode,

(gdb) p rsp.op_ret
$3 = 0
(gdb) p *xdata
$4 = {is_static = 0 '\000', hash_size = 1, count = 5, refcount = 2, members =
0x7f21400adcf8, members_list = 0x7f21400ce980, extra_free = 0x0, 
  extra_stdfree = 0x0, lock = {spinlock = 0, mutex = {__data = {__lock = 0,
__count = 0, __owner = 0, __nusers = 0, __kind = 0, __spins = 0, __elision = 0, 
        __list = {__prev = 0x0, __next = 0x0}}, __size = '\000' <repeats 39
times>, __align = 0}}, members_internal = 0x7f21400ce980, free_pair = {
    hash_next = 0x0, prev = 0x7f2140003b70, next = 0x0, value = 0x7f21400da6c0,
key = 0x7f21400ea0f0 "link-count"}, free_pair_in_use = _gf_true}

(gdb) p *xdata->members[0]
$5 = {hash_next = 0x7f2140097640, prev = 0x0, next = 0x7f2140097640, value =
0x7f2140070740, key = 0x7f21400f0400 "glusterfs.content"}
(gdb) p *xdata->members[0]->value
$6 = {is_static = 0 '\000', is_const = 0 '\000', len = 5223, 
  data = 0x7f21400f6aa0 "# /etc/services:\n# $Id: services,v 1.55 2013/04/14
ovasik Exp $\n#\n# Network services, Internet style\n# IANA services version:
last updated 2013-04-10\n#\n# Note that it is presently the policy of IANA
t"..., refcount = 1, lock = {spinlock = 0, mutex = {__data = {__lock = 0, 
        __count = 0, __owner = 0, __nusers = 0, __kind = 0, __spins = 0,
__elision = 0, __list = {__prev = 0x0, __next = 0x0}}, 
      __size = '\000' <repeats 39 times>, __align = 0}}}
(gdb) 


The value of "glusterfs.content" in the xdata returned by this client is
correct. However afr chooses a different subvol as the read subvol and returns
a different xdata to the higher layer as can be seen by the xdata returned to
dht_discover_cbk in f6 (xdata=0x7f21340658c0) versus the xdata this client
returned to afr in f8 (xdata=0x7f21400adca0). 

xdata=0x7f21340658c0 does not have the file contents.


(gdb) f 7
#7  0x00007f21468c4d10 in afr_discover_done (this=<optimized out>,
frame=0x7f212c0546c0) at afr-common.c:2624
2624        AFR_STACK_UNWIND (lookup, frame, local->op_ret, local->op_errno,

(gdb) p local->replies[0]->xdata
$16 = (dict_t *) 0x7f21400adca0   <---- good one
(gdb) p local->replies[1]->xdata
$17 = (dict_t *) 0x7f21340658c0   <---- "bad" one


AFR returns local->replies[1]->xdata.

(gdb) p *local->replies[1]->xdata
$18 = {is_static = 0 '\000', hash_size = 1, count = 5, refcount = 2, members =
0x7f2134065918, members_list = 0x7f21340387f0, extra_free = 0x0, 
  extra_stdfree = 0x0, lock = {spinlock = 0, mutex = {__data = {__lock = 0,
__count = 0, __owner = 0, __nusers = 0, __kind = 0, __spins = 0, __elision = 0, 
        __list = {__prev = 0x0, __next = 0x0}}, __size = '\000' <repeats 39
times>, __align = 0}}, members_internal = 0x7f21340387f0, free_pair = {
    hash_next = 0x0, prev = 0x7f213406d630, next = 0x0, value = 0x7f21400f54e0,
key = 0x7f213401b230 "link-count"}, free_pair_in_use = _gf_true}

(gdb) p *local->replies[1]->xdata->members[0]
$19 = {hash_next = 0x7f21340398d0, prev = 0x0, next = 0x7f21340398d0, value =
0x7f2140002b50, key = 0x7f213401af10 "glusterfs.content"}
(gdb) p *local->replies[1]->xdata->members[0]->value
$20 = {is_static = 0 '\000', is_const = 0 '\000', len = 0, data =
0x7f21400c35e0 <incomplete sequence \360\255\272>, refcount = 1, lock =
{spinlock = 0, 
    mutex = {__data = {__lock = 0, __count = 0, __owner = 0, __nusers = 0,
__kind = 0, __spins = 0, __elision = 0, __list = {__prev = 0x0, __next = 0x0}}, 
      __size = '\000' <repeats 39 times>, __align = 0}}}

(gdb) p local->replies[1]->xdata->members[0]->value
$21 = (data_t *) 0x7f2140002b50

(gdb) p *local->replies[1]->xdata->members[0]->value
$24 = {is_static = 0 '\000', is_const = 0 '\000', len = 0, data =
0x7f21400c35e0 <incomplete sequence \360\255\272>, refcount = 1, lock =
{spinlock = 0, 
    mutex = {__data = {__lock = 0, __count = 0, __owner = 0, __nusers = 0,
__kind = 0, __spins = 0, __elision = 0, __list = {__prev = 0x0, __next = 0x0}}, 
      __size = '\000' <repeats 39 times>, __align = 0}}}
(gdb) 



qr_content_extract should have got this data ((data_t *) 0x7f2140002b50) when
it called

data = dict_get (xdata, GF_CONTENT_KEY);


However, the data it has here is a completely different value, one that does
not exist in this xdata.


(gdb) f 2
#2  qr_content_extract (xdata=xdata at entry=0x7f21340658c0) at quick-read.c:279
279        memcpy (content, data->data, data->len);
(gdb) p data
$22 = (data_t *) 0x7f2134015300



Why it got this value for data is still not clear. Will continue to look into
this.

--- Additional comment from Nithya Balachandran on 2017-08-23 06:22:16 EDT ---

I see a lot of such messages in the brick logs:
rhs-brick1-b2.log:[2017-08-19 12:36:12.886812] W [MSGID: 113096]
[posix-handle.c:793:posix_handle_hard] 0-Slave-posix: mismatching ino/dev
between file
/rhs/brick1/b2/b/thread6/level00/level10/level20/level30/level40/level50/level60/level70/level80/5997ff7f%%3F00F9OEG1
(165686/64774) and handle
/rhs/brick1/b2/.glusterfs/cc/de/ccde18b3-e913-40b8-9183-ae39457d2e62
(10535691/64774)


Are these a cause for concern?

--- Additional comment from Amar Tumballi on 2017-08-23 09:18:50 EDT ---

>> I see a lot of such messages in the brick logs:
rhs-brick1-b2.log:[2017-08-19 12:36:12.886812] W [MSGID: 113096]
[posix-handle.c:793:posix_handle_hard] 0-Slave-posix: mismatching ino/dev
between file
/rhs/brick1/b2/b/thread6/level00/level10/level20/level30/level40/level50/level60/level70/level80/5997ff7f%%3F00F9OEG1
(165686/64774) and handle
/rhs/brick1/b2/.glusterfs/cc/de/ccde18b3-e913-40b8-9183-ae39457d2e62
(10535691/64774)

Looks concerning. This happens when there is a separate mount point inside...
but how did the hardlink succeed?

--- Additional comment from Nithya Balachandran on 2017-08-24 01:34:24 EDT ---

Ravi,

Any idea why AFR returned the xdata with the zero-length glusterfs.content? In
other words, why is the file with no content considered the good copy?
Unfortunately the setup is no longer available, so we cannot check the files on
the bricks.

Regards,
Nithya

--- Additional comment from Nithya Balachandran on 2017-08-24 03:01:13 EDT ---

I think the underlying cause of this is that qr_content_extract uses dict_get()
to get the data value. As the data structure has not been ref-ed, it is
possible that it was deleted elsewhere, causing qr to use a freed structure.

I see afr_lookup_done calls 

dict_del (replies[read_subvol].xdata, GF_CONTENT_KEY); 


I need to check with the AFR team if this can be called in this code path. In
the meantime, use dict_get_with_ref in qr_content_extract.
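
For reference, a minimal sketch of what the qr_content_extract change could
look like (illustrative only; the allocation macro and memory type below are
placeholders, not the exact patch):

        int     ret     = -1;
        data_t *data    = NULL;
        void   *content = NULL;

        /* Take a ref on the data so a concurrent dict_del/dict_set cannot
         * free it while we are still copying from it. */
        ret = dict_get_with_ref (xdata, GF_CONTENT_KEY, &data);
        if (ret < 0 || !data)
                return NULL;

        content = GF_MALLOC (data->len, gf_common_mt_char); /* placeholder mem type */
        if (content)
                memcpy (content, data->data, data->len);

        data_unref (data);
        return content;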

--- Additional comment from Nithya Balachandran on 2017-08-24 03:43:05 EDT ---

It looks like there are some nasty races in the dict code:


static int
dict_get_with_ref (dict_t *this, char *key, data_t **data)
{
...

        hash = SuperFastHash (key, strlen (key));

        LOCK (&this->lock);
        {
                pair = dict_lookup_common (this, key, hash);
        }
        UNLOCK (&this->lock);


<-- at this point, pair might have been deleted as the dictionary is no longer
locked. However, the function proceeds to use it anyway and we could end up
accessing a different key-value pair if it has been reused in the meantime. 

        if (pair) {
                ret = 0;
                *data = data_ref (pair->value);
        }

err:
        return ret;
}
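
One way to close this window, sketched here under the assumption that it
matches the direction of the eventual fix, is to take the ref while the dict
lock is still held:

        LOCK (&this->lock);
        {
                pair = dict_lookup_common (this, key, hash);
                if (pair) {
                        ret = 0;
                        /* ref under the lock so the pair cannot be freed or
                         * reused between the lookup and the ref */
                        *data = data_ref (pair->value);
                }
        }
        UNLOCK (&this->lock);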


For functions like 

int
dict_get_ptr (dict_t *this, char *key, void **ptr)
{
        data_t * data = NULL;
        int      ret  = 0;

        if (!this || !key || !ptr) {
                ret = -EINVAL;
                goto err;
        }

        ret = dict_get_with_ref (this, key, &data);
        if (ret != 0) {
                goto err;
        }

        ret = data_to_ptr_common (data, ptr);
        if (ret != 0) {
                goto err;
        }

err:
        if (data)
                data_unref (data);
<-- if this key is deleted here, the data can be unrefed and the pointers being
returned are no longer valid.

        return ret;
}

--- Additional comment from Worker Ant on 2017-08-24 04:17:09 EDT ---

REVIEW: https://review.gluster.org/18115 (perf/qr: Use a ref-ed data to extract
content) posted (#1) for review on master by N Balachandran
(nbalacha at redhat.com)

--- Additional comment from Worker Ant on 2017-08-28 02:42:31 EDT ---

COMMIT: https://review.gluster.org/18115 committed in master by Raghavendra G
(rgowdapp at redhat.com) 
------
commit 414d3e92fc56f08e320a3aa65b6b18e65b384551
Author: N Balachandran <nbalacha at redhat.com>
Date:   Thu Aug 24 13:39:07 2017 +0530

    perf/qr: Use a ref-ed data to extract content

    qr_content_extract used dict_get to get the value of
    the GF_CONTENT_KEY key. dict_get does not ref the data
    before returning it so QR could be acting on freed memory
    if another thread deletes the key before then.
    This patch also fixes a race in dict_get_with_ref.

    Fix: Use dict_get_with_ref to retrieve the file contents.

    Change-Id: Ib1a7a70bb92eed7e70747ec530e0b3edc53127ec
    BUG: 1484709
    Signed-off-by: N Balachandran <nbalacha at redhat.com>
    Reviewed-on: https://review.gluster.org/18115
    Smoke: Gluster Build System <jenkins at build.gluster.org>
    Reviewed-by: Amar Tumballi <amarts at redhat.com>
    Tested-by: Raghavendra G <rgowdapp at redhat.com>
    CentOS-regression: Gluster Build System <jenkins at build.gluster.org>
    Reviewed-by: Raghavendra G <rgowdapp at redhat.com>

--- Additional comment from Raghavendra G on 2017-08-28 02:45:44 EDT ---

> I think the underlying cause of this is that qr_content_extract uses dict_get() to get the data value. As the data structure has not been ref-ed , it is possible that it was deleted elsewhere causing qr to use a freed structure.

> I see afr_lookup_done calls 

> dict_del (replies[read_subvol].xdata, GF_CONTENT_KEY); 


> I need to check with the AFR team if this can be called in this code path. In the meantime, use dict_get_with_ref in qr_content_extract.

Since the bug describes a distribute-replicate setup, I checked the code of
afr_lookup_done. GF_CONTENT_KEY is deleted before afr calls STACK_UNWIND. So,
this can't happen in parallel to qr_content_extract.

--- Additional comment from Nithya Balachandran on 2017-08-29 10:48:50 EDT ---

RCA:

(gdb) bt
#0  0x00007f2152cebaeb in __memcpy_sse2 () from /lib64/libc.so.6
#1  0x00007f2145b9b824 in memcpy (__len=<optimized out>, __src=<optimized out>,
__dest=<optimized out>) at /usr/include/bits/string3.h:51
#2  qr_content_extract (xdata=xdata at entry=0x7f21340658c0) at quick-read.c:279
#3  0x00007f2145b9bdaa in qr_lookup_cbk (frame=0x7f212c05fff0,
cookie=<optimized out>, this=0x7f214001e4b0, op_ret=0, op_errno=117,
inode_ret=0x7f212c05bb60, 
    buf=0x7f212c002b58, xdata=0x7f21340658c0, postparent=0x7f212c002d88) at
quick-read.c:423
#4  0x00007f2145da7de7 in ioc_lookup_cbk (frame=0x7f212c057f00,
cookie=<optimized out>, this=<optimized out>, op_ret=<optimized out>,
op_errno=<optimized out>, 
    inode=0x7f212c05bb60, stbuf=0x7f212c002b58, xdata=0x7f21340658c0,
postparent=0x7f212c002d88) at io-cache.c:267
#5  0x00007f2146614627 in dht_discover_complete
(this=this at entry=0x7f2140016f40,
discover_frame=discover_frame at entry=0x7f212c061500) at dht-common.c:577
#6  0x00007f214661534a in dht_discover_cbk (frame=0x7f212c061500,
cookie=0x7f21400135b0, this=0x7f2140016f40, op_ret=<optimized out>,
op_errno=117, 
    inode=0x7f212c05bb60, stbuf=0x7f212c05e648, xattr=0x7f21340658c0,
postparent=0x7f212c05e6b8) at dht-common.c:700
#7  0x00007f21468c4d10 in afr_discover_done (this=<optimized out>,
frame=0x7f212c0546c0) at afr-common.c:2624
#8  afr_discover_cbk (frame=frame at entry=0x7f212c0546c0, cookie=<optimized out>,
this=<optimized out>, op_ret=<optimized out>, op_errno=<optimized out>, 
    inode=inode at entry=0x7f212c05bb60, buf=buf at entry=0x7f2147537920,
xdata=0x7f21400adca0, postparent=postparent at entry=0x7f2147537990) at
afr-common.c:2669
#9  0x00007f2146b012c7 in client3_3_lookup_cbk (req=<optimized out>,
iov=<optimized out>, count=<optimized out>, myframe=0x7f212c061380)
    at client-rpc-fops.c:2947
#10 0x00007f21543f2840 in rpc_clnt_handle_reply
(clnt=clnt at entry=0x7f2140080400, pollin=pollin at entry=0x7f21400ee830) at
rpc-clnt.c:794
#11 0x00007f21543f2b27 in rpc_clnt_notify (trans=<optimized out>,
mydata=0x7f2140080430, event=<optimized out>, data=0x7f21400ee830) at
rpc-clnt.c:987
#12 0x00007f21543ee9e3 in rpc_transport_notify (this=this at entry=0x7f21400805a0,
event=event at entry=RPC_TRANSPORT_MSG_RECEIVED, data=data at entry=0x7f21400ee830)
    at rpc-transport.c:538
#13 0x00007f2148fef3d6 in socket_event_poll_in (this=this at entry=0x7f21400805a0,
notify_handled=<optimized out>) at socket.c:2306
#14 0x00007f2148ff197c in socket_event_handler (fd=13, idx=3, gen=1,
data=0x7f21400805a0, poll_in=1, poll_out=0, poll_err=0) at socket.c:2458
#15 0x00007f21546841e6 in event_dispatch_epoll_handler (event=0x7f2147537e80,
event_pool=0x56504ff3cee0) at event-epoll.c:572
#16 event_dispatch_epoll_worker (data=0x56504ff95d60) at event-epoll.c:648
#17 0x00007f2153488e25 in start_thread () from /lib64/libpthread.so.0
#18 0x00007f2152d5534d in clone () from /lib64/libc.so.6




(gdb) f 6
#6  0x00007f214661534a in dht_discover_cbk (frame=0x7f212c061500,
cookie=0x7f21400135b0, this=0x7f2140016f40, op_ret=<optimized out>,
op_errno=117, 
    inode=0x7f212c05bb60, stbuf=0x7f212c05e648, xattr=0x7f21340658c0,
postparent=0x7f212c05e6b8) at dht-common.c:700
700                    dht_discover_complete (this, frame);
(gdb) l
695             * which calls STACK_DESTROY (frame). In the case of "attempt_unwind",
696             * this makes sure that the thread don't call dht_frame_return, till
697             * call to dht_discover_complete is done.
698             */
699            if (attempt_unwind) {
700                    dht_discover_complete (this, frame);  <--- Here
701            }
702    
703            this_call_cnt = dht_frame_return (frame);
704


dht_discover_cbk has found a file, so it attempts to unwind before all subvols
have responded; the remaining subvols may still respond after the unwind.




Checking the frame for dht_discover_cbk:

gdb) f 6
#6  0x00007f214661534a in dht_discover_cbk (frame=0x7f212c061500,
cookie=0x7f21400135b0, this=0x7f2140016f40, op_ret=<optimized out>,
op_errno=117, 
    inode=0x7f212c05bb60, stbuf=0x7f212c05e648, xattr=0x7f21340658c0,
postparent=0x7f212c05e6b8) at dht-common.c:700
700                    dht_discover_complete (this, frame);
(gdb) p local->file_count
$6 = 3

Strangely, a file with this gfid was found on _all_ 3 children. As
local->file_count does not distinguish between data files and linkto files, it
seems likely that there was a stale (?) linkto file on at least one subvol.

So at this point, all 3 subvols have already unwound to dht_discover_cbk.

(gdb) p local->dir_count
$7 = 0

(gdb) p is_linkfile
$8 = <optimized out>
(gdb) p local->cached_subvol
$9 = (xlator_t *) 0x7f2140014d20
(gdb) p local->hashed_subvol
$10 = (xlator_t *) 0x0
(gdb) p *local->cached_subvol
$11 = {name = 0x7f21400158c0 "Slave-replicate-1", type = 0x7f2140015a40
"cluster/replicate", instance_name = 0x0, next = 0x7f21400135b0, prev =
0x7f2140015e30, 
  parents = 0x7f2140018630, children = 0x7f2140015d10, options =
0x7f2140015910, dlhandle = 0x7f2140014320, fops = 0x7f2146ad91a0 <fops>, 
  cbks = 0x7f2146ad90e0 <cbks>, dumpops = 0x7f2146ad9140 <dumpops>,
volume_options = {next = 0x7f2140015b40, prev = 0x7f2140015b40}, 
  fini = 0x7f21468c6ac0 <fini>, init = 0x7f21468c5e10 <init>, reconfigure =
0x7f21468c5740 <reconfigure>, mem_acct_init = 0x7f21468c5670 <mem_acct_init>, 
  notify = 0x7f21468c5610 <notify>, loglevel = GF_LOG_NONE, latencies = {{min =
0, max = 0, total = 0, std = 0, mean = 0, count = 0} <repeats 55 times>}, 
  history = 0x0, ctx = 0x56504ff06010, graph = 0x7f2140003f30, itable =
0x7f214005e0a0, init_succeeded = 1 '\001', private = 0x7f214005db00, 
  mem_acct = 0x7f214005a0b0, winds = 0, switched = 0 '\000', local_pool =
0x7f21548f2ab8 <pools+280>, is_autoloaded = _gf_false, volfile_id = 0x0}
(gdb) p *prev
$12 = {name = 0x7f2140012740 "Slave-replicate-0", type = 0x7f2140014260
"cluster/replicate", instance_name = 0x0, next = 0x7f21400118a0, prev =
0x7f2140014d20, 
  parents = 0x7f2140018480, children = 0x7f2140014c00, options =
0x7f2140014130, dlhandle = 0x7f2140014320, fops = 0x7f2146ad91a0 <fops>, 
  cbks = 0x7f2146ad90e0 <cbks>, dumpops = 0x7f2146ad9140 <dumpops>,
volume_options = {next = 0x7f2140014a30, prev = 0x7f2140014a30}, 
  fini = 0x7f21468c6ac0 <fini>, init = 0x7f21468c5e10 <init>, reconfigure =
0x7f21468c5740 <reconfigure>, mem_acct_init = 0x7f21468c5670 <mem_acct_init>, 
  notify = 0x7f21468c5610 <notify>, loglevel = GF_LOG_NONE, latencies = {{min =
0, max = 0, total = 0, std = 0, mean = 0, count = 0} <repeats 55 times>}, 
  history = 0x0, ctx = 0x56504ff06010, graph = 0x7f2140003f30, itable =
0x7f2140065080, init_succeeded = 1 '\001', private = 0x7f2140064ae0, 
  mem_acct = 0x7f2140061090, winds = 0, switched = 0 '\000', local_pool =
0x7f21548f2ab8 <pools+280>, is_autoloaded = _gf_false, volfile_id = 0x0}


As local->cached_subvol does not match prev, it is clear that
local->cached_subvol was updated after this thread proceeded to unwind to the
higher xlators. This means a data file was found on at least 2 subvols; this
could be because a linkto file was in the process of being created when the
lookup hit the brick.




In dht_discover_complete () :

        local = discover_frame->local;   
        layout = local->layout;
        conf = this->private;

        LOCK(&discover_frame->lock);
        {
                main_frame = local->main_frame;
                local->main_frame = NULL;
        }
        UNLOCK(&discover_frame->lock);

        if (!main_frame)
                return 0;    <--- So far so good - only one thread will unwind here.

However, discover_frame->local is valid and _still_ accessible to other threads
unwinding to dht_discover_cbk.




In dht_discover_cbk:
...

        local = frame->local;

...
        LOCK (&frame->lock);
        {

...
                if (is_dir) {
                        local->dir_count ++;
                } else {
                        local->file_count ++;

                        if (!is_linkfile) {
                                /* real file */
                                local->cached_subvol = prev;
                                attempt_unwind = 1;
                        } else {
                                goto unlock;
                        }
                }


...

                if (local->xattr == NULL) {
                        local->xattr = dict_ref (xattr); <--- say thread 1 (the one that crashed) did this
                } else {
                        dht_aggregate_xattr (local->xattr, xattr); <--- and thread 2 called this (this should _only_ be done for directories)
                }


In dht_aggregate_xattr () -> dht_aggregate ():

        ret = dict_set (dst, key, value);
        if (ret) {
                gf_msg ("dht", GF_LOG_WARNING, 0,
                        DHT_MSG_DICT_SET_FAILED,
                        "Failed to set dictionary value: key = %s",
                        key);
        }



dict_set _replaces_ the value for the key, which unrefs the old data, causing
its refcount to drop to 0 and hence destroying it.
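
A sketch of the fix this analysis points to (aggregate xattrs only for
directories in dht_discover_cbk; illustrative, not the exact patch):

                if (local->xattr == NULL) {
                        local->xattr = dict_ref (xattr);
                } else if (is_dir) {
                        /* aggregate only for directories; for files, keep the
                         * first reply's xattr as-is so a later dict_set cannot
                         * free data another thread may still be using */
                        dht_aggregate_xattr (local->xattr, xattr);
                }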

So here is probably what happened:

Time    subvol-0 (thread 1)                        subvol-1 (thread 2)

t0      dht_discover_cbk
        local->xattr = dict_ref (dict0);
        dict_set (dict0, GF_CONTENT_KEY, data0)
        attempt_unwind = 1

t1      dht_discover_complete

t2      qr_content_extract gets data0

t3                                                 dht_discover_cbk
                                                   dht_aggregate_xattr (local->xattr, xattr)
                                                     |-> dict_set (dict0, GF_CONTENT_KEY, data1)
                                                     |-> data_unref (data0)
                                                         <-- refcount drops to 0, data0 is destroyed
                                                   attempt_unwind = 1

t4      qr_content_extract accesses data0 and crashes

--- Additional comment from Worker Ant on 2017-08-29 11:51:49 EDT ---

REVIEW: https://review.gluster.org/18137 (cluster/dht: Aggregate xattrs only
for dirs in dht_discover_cbk) posted (#1) for review on master by N
Balachandran (nbalacha at redhat.com)


Referenced Bugs:

https://bugzilla.redhat.com/show_bug.cgi?id=1484113
[Bug 1484113] [geo-rep+qr]: Crashes observed at slave from qr_lookup_sbk
during rename/hardlink/rebalance cases
https://bugzilla.redhat.com/show_bug.cgi?id=1484709
[Bug 1484709] [geo-rep+qr]: Crashes observed at slave from qr_lookup_sbk
during rename/hardlink/rebalance cases
-- 
You are receiving this mail because:
You are on the CC list for the bug.
You are the assignee for the bug.

