[Gluster-devel] Spurious regression of tests/basic/mgmt_v3-locks.t

Atin Mukherjee amukherj at redhat.com
Fri Dec 12 06:02:31 UTC 2014


The issue is cracked now!!

http://review.gluster.org/#/c/9269/ should solve it. The commit message
is self-explanatory, but here is a brief summary of the problem and the
solution:

If we look at mgmt_v3-locks.t, it performs multiple syncop transactions
in parallel. Since these operations are attempted on two different
volumes, they are allowed to run concurrently. The current
implementation uses xaction_peers to store the list of peers which are
part of the cluster before the syncop transaction begins; this is just
to keep the peer list static even if a new peer is added while the
transaction is in progress. However, xaction_peers was stored as a
global, which means multiple transactions were manipulating the same
list and could end up corrupting it.
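
To make this concrete, here is a small standalone sketch (not the
actual glusterd code; all the names below, such as peer_t and txn_run,
are made up) of the idea behind the fix: every syncop transaction
builds and owns its own snapshot of the peer list instead of sharing
one global xaction_peers list, so two concurrent transactions can never
step on each other's list.

/* Illustrative sketch only, not glusterd code. */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Cluster-wide peer list, protected by its own lock. */
static const char      *cluster_peers[8] = { "peer-a", "peer-b" };
static pthread_mutex_t  cluster_lock = PTHREAD_MUTEX_INITIALIZER;

typedef struct peer {
        char         hostname[64];
        struct peer *next;
} peer_t;

/* Per-transaction state: the peer snapshot lives here, not in a
 * global, so two concurrent transactions can never manipulate the
 * same list. */
typedef struct txn {
        const char *volname;
        peer_t     *xaction_peers;   /* local list head */
} txn_t;

/* Snapshot the peers that are in the cluster when the transaction
 * begins, so the list stays static even if a peer is probed later. */
static void txn_build_peer_list(txn_t *txn)
{
        pthread_mutex_lock(&cluster_lock);
        for (int i = 0; i < 8 && cluster_peers[i]; i++) {
                peer_t *copy = calloc(1, sizeof(*copy));
                if (!copy)
                        break;
                strncpy(copy->hostname, cluster_peers[i],
                        sizeof(copy->hostname) - 1);
                copy->next = txn->xaction_peers;
                txn->xaction_peers = copy;
        }
        pthread_mutex_unlock(&cluster_lock);
}

/* Stand-in for the lock/stage/commit/unlock phases of one syncop
 * transaction. */
static void *txn_run(void *arg)
{
        txn_t *txn = arg;

        txn_build_peer_list(txn);
        for (peer_t *p = txn->xaction_peers; p; p = p->next)
                printf("txn(%s): staging on %s\n", txn->volname,
                       p->hostname);

        while (txn->xaction_peers) {          /* drop the snapshot */
                peer_t *p = txn->xaction_peers;
                txn->xaction_peers = p->next;
                free(p);
        }
        return NULL;
}

int main(void)
{
        pthread_t t1, t2;
        txn_t     txn1 = { .volname = "vol0" };
        txn_t     txn2 = { .volname = "vol1" };

        /* Two transactions for two different volumes run concurrently,
         * as in mgmt_v3-locks.t; each only ever touches its own list. */
        pthread_create(&t1, NULL, txn_run, &txn1);
        pthread_create(&t2, NULL, txn_run, &txn2);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        return 0;
}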

Please note that the above patch only fixes the syncop framework. I've
identified the same problem in the op-sm framework as well, and it will
be addressed in a separate patch.

~Atin

On 12/09/2014 09:46 AM, Atin Mukherjee wrote:
> Justin,
> 
> I only got a chance to look into this yesterday, as I was in RHEL 7
> training last week. I have some interesting findings to share:
> 
> When I run the script, which sets options for two different volumes
> (one in the background) in a loop, the memory consumption of glusterd
> jumps to more than 1G after a few iterations. The iteration at which
> this happens is not fixed, but memory consumption increases on every
> iteration, which shows there is certainly a memory leak here. I am
> going to spend some time today debugging it further and hope to get
> to the root cause.
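> 
> The loop is essentially the following (a rough C rendition only; the
> actual script is a shell loop, and the volume names and the option
> below are placeholders, not the ones I used):
> 
> #include <stdio.h>
> #include <stdlib.h>
> 
> int main(void)
> {
>         char cmd[256];
>         for (int i = 0; i < 1000; i++) {
>                 /* set an option on the first volume in the background */
>                 snprintf(cmd, sizeof(cmd),
>                          "gluster volume set patchy1 "
>                          "performance.write-behind %s &",
>                          (i % 2) ? "on" : "off");
>                 system(cmd);   /* backgrounded, returns immediately */
>                 /* set an option on the second volume in the foreground */
>                 snprintf(cmd, sizeof(cmd),
>                          "gluster volume set patchy2 "
>                          "performance.write-behind %s",
>                          (i % 2) ? "on" : "off");
>                 system(cmd);   /* waits for the CLI to return */
>         }
>         return 0;
> }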
> 
> ~Atin
> 
> On 12/09/2014 03:43 AM, Justin Clift wrote:
>> How's this investigation stuff going? :)
>>
>> + Justin
>>
>>
>> ----- Original Message -----
>>> Agreed, Justin. It does look like a bug.
>>> I have found something significant now. I wrote a small script that
>>> executes two volume set commands in parallel (for two different
>>> volumes) in a loop of 1000 iterations, and glusterd hung after a few
>>> iterations. Glusterd was not dead, but it did not respond at all.
>>>
>>> Following is the backtrace:
>>>
>>> Thread 6 (Thread 0x7f6f6ecf5700 (LWP 30417)):
>>> #0  0x00007f6f76d67fbd in nanosleep () from /lib64/libpthread.so.0
>>> #1  0x00007f6f77a04014 in gf_timer_proc (ctx=0x1396010) at timer.c:170
>>> #2  0x00007f6f76d60f33 in start_thread () from /lib64/libpthread.so.0
>>> #3  0x00007f6f766a7ded in clone () from /lib64/libc.so.6
>>>
>>> Thread 5 (Thread 0x7f6f6e4f4700 (LWP 30418)):
>>> #0  0x00007f6f76d684f1 in sigwait () from /lib64/libpthread.so.0
>>> #1  0x00000000004079e7 in glusterfs_sigwaiter (arg=<optimized out>) at
>>> glusterfsd.c:1728
>>> #2  0x00007f6f76d60f33 in start_thread () from /lib64/libpthread.so.0
>>> #3  0x00007f6f766a7ded in clone () from /lib64/libc.so.6
>>>
>>> Thread 4 (Thread 0x7f6f6dcf3700 (LWP 30419)):
>>> #0  0x00007f6f76d6759d in __lll_lock_wait () from /lib64/libpthread.so.0
>>> #1  0x00007f6f76d63194 in _L_lock_874 () from /lib64/libpthread.so.0
>>> #2  0x00007f6f76d63093 in pthread_mutex_lock () from /lib64/libpthread.so.0
>>> #3  0x00007f6f766e2612 in _dl_addr () from /lib64/libc.so.6
>>> #4  0x00007f6f766b67d5 in backtrace_symbols_fd () from /lib64/libc.so.6
>>> #5  0x00007f6f779fdc88 in gf_backtrace_fillframes (
>>>     buf=buf@entry=0x13eb280 "(-->
>>> /usr/local/lib/libglusterfs.so.0(synctask_yield+0x2c)[0x7f6f77a21cac]
>>> (--> /usr/local/lib/libglusterfs.so.0(+0x57dc9)[0x7f6f77a21dc9] (-->
>>> /usr/local/lib/libglusterfs.so.0(synclock_lock+0x16)[0x7"...) at
>>> common-utils.c:3335
>>> #6  0x00007f6f77a03631 in gf_backtrace_save (
>>>     buf=buf@entry=0x13eb280 "(-->
>>> /usr/local/lib/libglusterfs.so.0(synctask_yield+0x2c)[0x7f6f77a21cac]
>>> (--> /usr/local/lib/libglusterfs.so.0(+0x57dc9)[0x7f6f77a21dc9] (-->
>>> /usr/local/lib/libglusterfs.so.0(synclock_lock+0x16)[0x7"...) at
>>> common-utils.c:3391
>>> #7  0x00007f6f77a21cac in synctask_yield (task=task@entry=0x13eadf0) at
>>> syncop.c:296
>>> #8  0x00007f6f77a21dc9 in __synclock_lock (lock=lock@entry=0x13e0138) at
>>> syncop.c:760
>>> #9  0x00007f6f77a24586 in synclock_lock (lock=lock@entry=0x13e0138) at
>>> syncop.c:784
>>> #10 0x00007f6f6ca3242d in glusterd_big_locked_cbk (req=0x13e34ec,
>>> iov=0x0, count=0,
>>>     myframe=0x7f6f759ff24c, fn=0x7f6f6ca766d0 <_gd_syncop_stage_op_cbk>)
>>>     at glusterd-rpc-ops.c:207
>>> #11 0x00007f6f777be9ff in rpc_clnt_submit (rpc=rpc@entry=0x13e3260,
>>>     prog=prog@entry=0x7f6f6cce9ca0 <gd_mgmt_prog>, procnum=procnum@entry=3,
>>>     cbkfn=cbkfn@entry=0x7f6f6ca75330 <gd_syncop_stage_op_cbk>,
>>>     proghdr=proghdr@entry=0x17eb970, proghdrcount=proghdrcount@entry=1,
>>>     progpayload=progpayload@entry=0x0,
>>> progpayloadcount=progpayloadcount@entry=0,
>>>     iobref=iobref@entry=0x7f6f4bd4a9c0, frame=frame@entry=0x7f6f759ff24c,
>>>     rsphdr=rsphdr@entry=0x0, rsphdr_count=rsphdr_count@entry=0,
>>>     rsp_payload=rsp_payload@entry=0x0,
>>> rsp_payload_count=rsp_payload_count@entry=0,
>>>     rsp_iobref=rsp_iobref@entry=0x0) at rpc-clnt.c:1601
>>> #12 0x00007f6f6ca76470 in gd_syncop_submit_request (rpc=0x13e3260,
>>>     req=req@entry=0x7f6f60239830, local=local@entry=0x17ebaa0,
>>>     cookie=cookie@entry=0x13e2770, prog=0x7f6f6cce9ca0 <gd_mgmt_prog>,
>>>     procnum=procnum@entry=3, cbkfn=cbkfn@entry=0x7f6f6ca75330
>>> <gd_syncop_stage_op_cbk>,
>>>     xdrproc=0x7f6f775a4910 <xdr_gd1_mgmt_stage_op_req>) at
>>> glusterd-syncop.c:198
>>> #13 0x00007f6f6ca773d6 in gd_syncop_mgmt_stage_op
>>> (peerinfo=peerinfo@entry=0x13e2770,
>>>     args=args@entry=0x17ebaa0,
>>>     my_uuid=my_uuid@entry=0x13deff8
>>> "y\205mKLMF\340\241\366\340\037#\227$x/var/lib/glusterd",
>>> recv_uuid=recv_uuid@entry=0x17eba90 "", op=op@entry=11,
>>>     dict_out=dict_out@entry=0x7f6f754045e4,
>>> op_ctx=op_ctx@entry=0x7f6f754045e4)
>>>     at glusterd-syncop.c:749
>>> #14 0x00007f6f6ca77d4b in gd_stage_op_phase (peers=peers@entry=0x13defe0,
>>>     op=<optimized out>, op_ctx=op_ctx@entry=0x7f6f754045e4,
>>> req_dict=0x7f6f754045e4,
>>>     op_errstr=op_errstr@entry=0x17ec2a8, npeers=npeers@entry=1) at
>>> glusterd-syncop.c:1169
>>> #15 0x00007f6f6ca790b8 in gd_sync_task_begin
>>> (op_ctx=op_ctx@entry=0x7f6f754045e4,
>>>     req=req@entry=0x13da238) at glusterd-syncop.c:1619
>>> #16 0x00007f6f6ca7928c in glusterd_op_begin_synctask
>>> (req=req@entry=0x13da238,
>>>     op=op@entry=GD_OP_SET_VOLUME, dict=0x7f6f754045e4) at
>>> glusterd-syncop.c:1680
>>> #17 0x00007f6f6c9e52e6 in __glusterd_handle_set_volume
>>> (req=req@entry=0x13da238)
>>>     at glusterd-handler.c:1762
>>> #18 0x00007f6f6c9f1d60 in glusterd_big_locked_handler (req=0x13da238,
>>>     actor_fn=0x7f6f6c9e4dd0 <__glusterd_handle_set_volume>) at
>>> glusterd-handler.c:79
>>> #19 0x00007f6f77a21d32 in synctask_wrap (old_task=<optimized out>) at
>>> syncop.c:335
>>> #20 0x00007f6f765faa00 in ?? () from /lib64/libc.so.6
>>> #21 0x0000000000000000 in ?? ()
>>>
>>> Thread 3 (Thread 0x7f6f6d4f2700 (LWP 30420)):
>>> #0  0x00007f6f76d650c9 in pthread_cond_timedwait@@GLIBC_2.3.2 () from
>>> /lib64/libpthread.so.0
>>> #1  0x00007f6f77a237ef in syncenv_task (proc=proc@entry=0x13bfe90) at
>>> syncop.c:535
>>> #2  0x00007f6f77a24370 in syncenv_processor (thdata=0x13bfe90) at
>>> syncop.c:613
>>> #3  0x00007f6f76d60f33 in start_thread () from /lib64/libpthread.so.0
>>> #4  0x00007f6f766a7ded in clone () from /lib64/libc.so.6
>>>
>>> Thread 2 (Thread 0x7f6f6564a700 (LWP 30546)):
>>> #0  0x00007f6f76d64d20 in pthread_cond_wait@@GLIBC_2.3.2 () from
>>> /lib64/libpthread.so.0
>>> #1  0x00007f6f6ca7a183 in hooks_worker (args=<optimized out>) at
>>> glusterd-hooks.c:502
>>> #2  0x00007f6f76d60f33 in start_thread () from /lib64/libpthread.so.0
>>> #3  0x00007f6f766a7ded in clone () from /lib64/libc.so.6
>>>
>>> Thread 1 (Thread 0x7f6f77e7c740 (LWP 30416)):
>>> #0  0x00007f6f766e26bf in _dl_addr () from /lib64/libc.so.6
>>> #1  0x00007f6f766b67d5 in backtrace_symbols_fd () from /lib64/libc.so.6
>>> #2  0x00007f6f779fdc88 in gf_backtrace_fillframes (
>>>     buf=buf@entry=0x1396378 "(-->
>>> /usr/local/lib/libglusterfs.so.0(_gf_log_callingfn+0x15c)[0x7f6f779ea1bc]
>>> (-->
>>> /usr/local/lib/libgfrpc.so.0(saved_frames_unwind+0x1c2)[0x7f6f777bd942]
>>> (--> /usr/local/lib/libgfrpc.so.0(saved_fram"...) at common-utils.c:3335
>>> #3  0x00007f6f77a03631 in gf_backtrace_save (buf=buf@entry=0x0) at
>>> common-utils.c:3391
>>> #4  0x00007f6f779ea1bc in _gf_log_callingfn (domain=0x13e62b0 "management",
>>>     file=file@entry=0x7f6f777c4746 "rpc-clnt.c",
>>>     function=function@entry=0x7f6f777c4ff0 <__FUNCTION__.13438>
>>> "saved_frames_unwind",
>>>     line=line@entry=362, level=level@entry=GF_LOG_ERROR,
>>>     fmt=fmt@entry=0x7f6f777c4930 "forced unwinding frame type(%s)
>>> op(%s(%d)) called at %s (xid=0x%x)") at logging.c:822
>>> #5  0x00007f6f777bd942 in saved_frames_unwind
>>> (saved_frames=saved_frames@entry=0x13e71e0)
>>>     at rpc-clnt.c:353
>>> #6  0x00007f6f777bda3e in saved_frames_destroy (frames=0x13e71e0) at
>>> rpc-clnt.c:383
>>> #7  0x00007f6f777bf22c in rpc_clnt_connection_cleanup
>>> (conn=conn@entry=0x13e3290)
>>>     at rpc-clnt.c:536
>>> #8  0x00007f6f777bf9c8 in rpc_clnt_notify (trans=<optimized out>,
>>> mydata=0x13e3290,
>>>     event=<optimized out>, data=<optimized out>) at rpc-clnt.c:843
>>> #9  0x00007f6f777bbd33 in rpc_transport_notify (this=this@entry=0x13e62f0,
>>>     event=event@entry=RPC_TRANSPORT_DISCONNECT, data=data@entry=0x13e62f0)
>>>     at rpc-transport.c:516
>>> #10 0x00007f6f66850962 in socket_event_poll_err (this=0x13e62f0) at
>>> socket.c:1161
>>> #11 socket_event_handler (fd=<optimized out>, idx=3,
>>> data=data@entry=0x13e62f0, poll_in=1,
>>>     poll_out=4, poll_err=<optimized out>) at socket.c:2353
>>> #12 0x00007f6f77a3dd82 in event_dispatch_epoll_handler (i=<optimized
>>> out>, events=0x13e0700,
>>>     event_pool=0x13b43c0) at event-epoll.c:384
>>> #13 event_dispatch_epoll (event_pool=0x13b43c0) at event-epoll.c:445
>>> #14 0x0000000000404e89 in main (argc=1, argv=0x7fff7a929408) at
>>> glusterfsd.c:2046
>>>
>>> From the backtrace I could figure out that the problematic area is in
>>> Threads 1 & 4, probably a deadlock. It needs further analysis; I will
>>> be in a training for the entire week, so I will jump into it early
>>> next week.
>>>
>>> ~Atin
>>>
>>
> _______________________________________________
> Gluster-devel mailing list
> Gluster-devel at gluster.org
> http://supercolony.gluster.org/mailman/listinfo/gluster-devel
> 

