[Gluster-devel] Core by test case : georep-tarssh-hybrid.t

Susant Palai spalai at redhat.com
Fri Apr 24 13:47:33 UTC 2015


Hi,
  Here is a speculation:

  With the introduction of multi-threaded epoll we process multiple responses at the same time. The crash happened in _gf_free, originating from dht_getxattr_cbk (as seen in the backtrace). In the current state we do not take frame->lock inside dht_getxattr_cbk, so this path is prone to races.

Here is a code-snippet from dht_getxattr_cbk.
===============================================
        this_call_cnt = dht_frame_return (frame);
        ...


        if (!local->xattr) {
                local->xattr = dict_copy_with_ref (xattr, NULL);
        } else {
                dht_aggregate_xattr (local->xattr, xattr);
        }
out:
        if (is_last_call (this_call_cnt)) {
                DHT_STACK_UNWIND (getxattr, frame, local->op_ret, op_errno,
                                  local->xattr, NULL);
        }
        return 0;

===============================================
Here is a timeline of the responses from two cbks in a two-subvol cluster.
      
                    Thread:1  CBK1                                                   Thread:2  CBK2
                  ====================                                              =====================
time:1      this_call_cnt = 1 (2 - 1)

time:2                                                                             this_call_cnt = 0 (1 - 1)

time:3      enters dict_copy_with_ref

time:4                                                                             dht_aggregate_xattr

time:5                                                                             DHT_STACK_UNWIND [leading to dict_unref and destroy]

time:6      still busy in dict_copy_with_ref
            and tries to unref the dict, leading
            to a free of memory already freed in
            the other thread. Hence, a double free.


I will compose a patch that brings this critical section under frame->lock.


Regards,
Susant

----- Original Message -----
> From: "Kotresh Hiremath Ravishankar" <khiremat at redhat.com>
> To: "Venky Shankar" <vshankar at redhat.com>, "Pranith Kumar Karampuri" <pkarampu at redhat.com>
> Cc: gluster-devel at gluster.org
> Sent: Friday, April 24, 2015 11:04:09 AM
> Subject: Re: [Gluster-devel] Core by test case : georep-tarssh-hybrid.t
> 
> I apologize, I thought it was the same issue that we assumed. I just
> looked into the stack trace and it is a different issue. This crash
> happened during an stime getxattr.
> 
> Pranith,
> You were working on min stime for ec, do you know about this?
> 
> The trace looks like this.
> 
> (gdb) bt
> #0  0x00007f4d89c41380 in pthread_spin_lock () from /lib64/libpthread.so.0
> #1  0x00007f4d8a714438 in __gf_free (free_ptr=0x7f4d70023550) at
> /home/jenkins/root/workspace/smoke/libglusterfs/src/mem-pool.c:303
> #2  0x00007f4d8a6ca1fb in data_destroy (data=0x7f4d87f27488) at
> /home/jenkins/root/workspace/smoke/libglusterfs/src/dict.c:148
> #3  0x00007f4d8a6caf46 in data_unref (this=0x7f4d87f27488) at
> /home/jenkins/root/workspace/smoke/libglusterfs/src/dict.c:549
> #4  0x00007f4d8a6cde55 in dict_get_bin (this=0x7f4d88108be8,
>     key=0x7f4d78131230
>     "trusted.glusterfs.2e9a9aed-0389-4ead-ad39-8196f875cd56.6fe2b66c-0f08-40c2-8a5b-93ce6daf8d32.stime",
>     bin=0x7f4d7de276d8)
>     at /home/jenkins/root/workspace/smoke/libglusterfs/src/dict.c:2231
> #5  0x00007f4d7cfa0d19 in gf_get_min_stime (this=0x7f4d7800d690,
> dst=0x7f4d88108be8,
>     key=0x7f4d78131230
>     "trusted.glusterfs.2e9a9aed-0389-4ead-ad39-8196f875cd56.6fe2b66c-0f08-40c2-8a5b-93ce6daf8d32.stime",
>     value=0x7f4d87f271b0)
>     at
>     /home/jenkins/root/workspace/smoke/xlators/cluster/afr/src/../../../../xlators/lib/src/libxlator.c:330
> #6  0x00007f4d7cd16419 in dht_aggregate (this=0x7f4d88108d8c,
>     key=0x7f4d78131230
>     "trusted.glusterfs.2e9a9aed-0389-4ead-ad39-8196f875cd56.6fe2b66c-0f08-40c2-8a5b-93ce6daf8d32.stime",
>     value=0x7f4d87f271b0, data=0x7f4d88108be8)
>     at
>     /home/jenkins/root/workspace/smoke/xlators/cluster/dht/src/dht-common.c:116
> #7  0x00007f4d8a6cc3b1 in dict_foreach_match (dict=0x7f4d88108d8c,
> match=0x7f4d8a6cc244 <dict_match_everything>, match_data=0x0,
>     action=0x7f4d7cd16330 <dht_aggregate>, action_data=0x7f4d88108be8) at
>     /home/jenkins/root/workspace/smoke/libglusterfs/src/dict.c:1182
> #8  0x00007f4d8a6cc2a4 in dict_foreach (dict=0x7f4d88108d8c,
> fn=0x7f4d7cd16330 <dht_aggregate>, data=0x7f4d88108be8)
>     at /home/jenkins/root/workspace/smoke/libglusterfs/src/dict.c:1141
> #9  0x00007f4d7cd165ae in dht_aggregate_xattr (dst=0x7f4d88108be8,
> src=0x7f4d88108d8c) at
> /home/jenkins/root/workspace/smoke/xlators/cluster/dht/src/dht-common.c:153
> #10 0x00007f4d7cd2415e in dht_getxattr_cbk (frame=0x7f4d8870d118,
> cookie=0x7f4d8870d1c4, this=0x7f4d7800d690, op_ret=0, op_errno=0,
> xattr=0x7f4d88108d8c, xdata=0x0)
>     at
>     /home/jenkins/root/workspace/smoke/xlators/cluster/dht/src/dht-common.c:2710
> #11 0x00007f4d7cf81293 in afr_getxattr_cbk (frame=0x7f4d8870d1c4, cookie=0x0,
> this=0x7f4d7800b560, op_ret=0, op_errno=0, dict=0x7f4d88108d8c, xdata=0x0)
>     at
>     /home/jenkins/root/workspace/smoke/xlators/cluster/afr/src/afr-inode-read.c:500
> #12 0x00007f4d7d1fd829 in client3_3_getxattr_cbk (req=0x7f4d75e59504,
> iov=0x7f4d75e59544, count=1, myframe=0x7f4d8870d270)
>     at
>     /home/jenkins/root/workspace/smoke/xlators/protocol/client/src/client-rpc-fops.c:1093
> #13 0x00007f4d8a4a0d1c in rpc_clnt_handle_reply (clnt=0x7f4d7811a100,
> pollin=0x7f4d7812c660) at
> /home/jenkins/root/workspace/smoke/rpc/rpc-lib/src/rpc-clnt.c:766
> #14 0x00007f4d8a4a113c in rpc_clnt_notify (trans=0x7f4d78129d70,
> mydata=0x7f4d7811a130, event=RPC_TRANSPORT_MSG_RECEIVED,
> data=0x7f4d7812c660)
>     at /home/jenkins/root/workspace/smoke/rpc/rpc-lib/src/rpc-clnt.c:894
> #15 0x00007f4d8a49d66c in rpc_transport_notify (this=0x7f4d78129d70,
> event=RPC_TRANSPORT_MSG_RECEIVED, data=0x7f4d7812c660)
>     at /home/jenkins/root/workspace/smoke/rpc/rpc-lib/src/rpc-transport.c:543
> #16 0x00007f4d7f44e311 in socket_event_poll_in (this=0x7f4d78129d70) at
> /home/jenkins/root/workspace/smoke/rpc/rpc-transport/socket/src/socket.c:2290
> #17 0x00007f4d7f44e7cc in socket_event_handler (fd=15, idx=4,
> data=0x7f4d78129d70, poll_in=1, poll_out=0, poll_err=0)
>     at
>     /home/jenkins/root/workspace/smoke/rpc/rpc-transport/socket/src/socket.c:2403
> #18 0x00007f4d8a747e2d in event_dispatch_epoll_handler (event_pool=0x1cc9ba0,
> event=0x7f4d7de27e70)
>     at /home/jenkins/root/workspace/smoke/libglusterfs/src/event-epoll.c:572
> #19 0x00007f4d8a748186 in event_dispatch_epoll_worker (data=0x1d11cc0) at
> /home/jenkins/root/workspace/smoke/libglusterfs/src/event-epoll.c:674
> #20 0x00007f4d89c3c9d1 in start_thread () from /lib64/libpthread.so.0
> #21 0x00007f4d895a68fd in clone () from /lib64/libc.so.6
> 
> 
> Thanks and Regards,
> Kotresh H R
> 
> ----- Original Message -----
> > From: "Venky Shankar" <vshankar at redhat.com>
> > To: gluster-devel at gluster.org
> > Sent: Friday, April 24, 2015 10:53:45 AM
> > Subject: Re: [Gluster-devel] Core by test case : georep-tarssh-hybrid.t
> > 
> > 
> > On 04/24/2015 10:22 AM, Kotresh Hiremath Ravishankar wrote:
> > > Hi Atin,
> > >
> > > It is not spurious; there is an issue with the 'this' pointer, I think.
> > > All changelog consumers, such as bitrot and geo-rep, would see this.
> > > Since it's a race, it happened to occur with gsyncd.
> > 
> > Correct. Jeff has mentioned this a while ago. I'll help out Kotresh in
> > fixing this issue. In the meantime is it possible to disable
> > geo-replication regression test cases until this gets fixed?
> > 
> > >   
> > > No, the patch http://review.gluster.org/#/c/10340/ will not
> > > take care of it. It just improves the time taken for geo-rep
> > > regression.
> > >
> > > I am looking into it.
> > >
> > > Thanks and Regards,
> > > Kotresh H R
> > >
> > > ----- Original Message -----
> > >> From: "Atin Mukherjee" <amukherj at redhat.com>
> > >> To: "kotresh Hiremath Ravishankar" <khiremat at redhat.com>, "Aravinda
> > >> Vishwanathapura Krishna Murthy"
> > >> <avishwan at redhat.com>
> > >> Cc: "Gluster Devel" <gluster-devel at gluster.org>
> > >> Sent: Friday, April 24, 2015 9:35:00 AM
> > >> Subject: Core by test case : georep-tarssh-hybrid.t
> > >>
> > >> [1] has core file generated by tests/geo-rep/georep-tarssh-hybrid.t. Is
> > >> it something alarming or http://review.gluster.org/#/c/10340/ would take
> > >> care of it?
> > >>
> > >> [1]
> > >> http://build.gluster.org/job/rackspace-regression-2GB-triggered/7345/consoleFull
> > >> --
> > >> ~Atin
> > >>
> > > _______________________________________________
> > > Gluster-devel mailing list
> > > Gluster-devel at gluster.org
> > > http://www.gluster.org/mailman/listinfo/gluster-devel
> > 
> 

