[Gluster-devel] netbsd regression update : georep-setup.t

Atin Mukherjee amukherj at redhat.com
Sat May 2 15:07:48 UTC 2015


On 05/02/2015 11:59 AM, Atin Mukherjee wrote:
> 
> 
> On 05/02/2015 09:08 AM, Atin Mukherjee wrote:
>>
>>
>> On 05/02/2015 08:54 AM, Emmanuel Dreyfus wrote:
>>> Pranith Kumar Karampuri <pkarampu at redhat.com> wrote:
>>>
>>>> Seems like a glusterd failure from the looks of it: +glusterd folks.
>>>>
>>>> Running tests in file ./tests/basic/cdc.t
>>>> volume delete: patchy: failed: Another transaction is in progress for
>>>> patchy. Please try again after sometime.
>>>> [18:16:40] ./tests/basic/cdc.t ..
>>>> not ok 52
>>>
>>> This is a volume stop that fails. The logs say a lock is held by a
>>> UUID which happens to be the volume's own UUID.
>>>
>>> I tried git bisect and it seems to be related to
>>> http://review.gluster.org/9918, but I am not completely sure (I may
>>> have botched my git bisect).
>>
>> I'm looking into this.
> Looking at the logs, here are my findings:
> 
> - gluster volume stop timed out at the CLI, because of which
> cmd_history.log didn't capture it.
> - glusterd acquired the volume lock in volume stop but somehow didn't
> release it, as gluster v delete failed saying another transaction is
> in progress.
> - For the gluster volume stop transaction I could see that
> glusterd_nfssvc_stop was triggered, but after that nothing was logged
> for almost two minutes. The catch here is that by this time
> volinfo->status should have been marked as stopped and persisted on
> disk, but gluster v info didn't reflect that.
> 
> Is this reproducible on NetBSD every time? If yes, I would need a VM
> to debug it further. I am guessing that the reason for the other
> failure, in tests/geo-rep/georep-setup.t, is the same. Is it a new
> regression failure?
I couldn't reproduce the cdc.t failure, but georep-setup.t failed
consistently, and the glusterd backtrace showed that it hangs in
gverify.sh when gsync_create is executed. Since this script was invoked
through the runner framework, the big lock had been released by then;
the same thread never reacquired the big lock and consequently never
released the cluster-wide lock. Because of this, subsequent glusterd
commands failed with "another transaction is in progress".
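
To illustrate the deadlock pattern described above, here is a minimal
standalone sketch. It uses plain pthread mutexes and made-up names
(run_blocking_cmd, gsync_create_txn), not glusterd's actual synctask and
locking code; it only mimics the sequence: the big lock is dropped
around a blocking external command, the command hangs, and the
cluster-wide transaction lock is never released.

/* Illustrative sketch only -- NOT the actual glusterd code. */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

static pthread_mutex_t big_lock = PTHREAD_MUTEX_INITIALIZER; /* stand-in for glusterd's big lock */
static pthread_mutex_t txn_lock = PTHREAD_MUTEX_INITIALIZER; /* stand-in for the cluster-wide lock */

/* Like the runner framework: the big lock is released while the external
 * command runs and reacquired only after it exits. */
static int run_blocking_cmd(const char *cmd)
{
    int ret;
    pthread_mutex_unlock(&big_lock);
    ret = system(cmd);             /* gverify.sh hung here */
    pthread_mutex_lock(&big_lock); /* never reached while the command hangs */
    return ret;
}

static void *gsync_create_txn(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&big_lock);
    pthread_mutex_lock(&txn_lock);    /* transaction takes the cluster lock */
    run_blocking_cmd("sleep 30");     /* stands in for the hung gverify.sh */
    pthread_mutex_unlock(&txn_lock);  /* unreachable while the command hangs */
    pthread_mutex_unlock(&big_lock);
    return NULL;
}

int main(void)
{
    pthread_t t;
    pthread_create(&t, NULL, gsync_create_txn, NULL);
    sleep(1); /* let the first "transaction" start and block */

    /* Any later transaction now fails to take the lock -- glusterd's
     * "Another transaction is in progress". */
    if (pthread_mutex_trylock(&txn_lock) != 0)
        printf("another transaction is in progress\n");
    return 0;
}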

CCing the Geo-rep team for further analysis. Backtrace for your reference:

Thread 3 (LWP 5):
#0  0xbb35e577 in _sys___wait450 () from /usr/lib/libc.so.12
#1  0xbb689e71 in __wait450 () from /usr/lib/libpthread.so.1
#2  0xbb3cba3b in waitpid () from /usr/lib/libc.so.12
#3  0xbb798f0f in runner_end_reuse (runner=0xb86fd828) at run.c:345
#4  0xbb798fa4 in runner_end (runner=0xb86fd828) at run.c:366
#5  0xbb799043 in runner_run_generic (runner=0xb86fd828, rfin=0xbb798f72 <runner_end>) at run.c:386
#6  0xbb799088 in runner_run (runner=0xb86fd828) at run.c:392
#7  0xb922d1dc in glusterd_verify_slave (volname=0xb8216cb0 "master", slave_url=0xb8201e90 "nbslave70.cloud.gluster.org", slave_vol=0xb821b170 "slave", op_errstr=0xb86ff5ec, is_force_blocker=0xb86fd92c) at glusterd-geo-rep.c:2075
#8  0xb922ddfb in glusterd_op_stage_gsync_create (dict=0xb9c07ad0, op_errstr=0xb86ff5ec) at glusterd-geo-rep.c:2300
#9  0xb91cfcb6 in glusterd_op_stage_validate (op=GD_OP_GSYNC_CREATE, dict=0xb9c07ad0, op_errstr=0xb86ff5ec, rsp_dict=0xb9c07b80) at glusterd-op-sm.c:4932
#10 0xb9255a34 in gd_stage_op_phase (op=GD_OP_GSYNC_CREATE, op_ctx=0xb9c077b8, req_dict=0xb9c07ad0, op_errstr=0xb86ff5ec, txn_opinfo=0xb86ff598) at glusterd-syncop.c:1182
#11 0xb92570d3 in gd_sync_task_begin (op_ctx=0xb9c077b8, req=0xb8f40040) at glusterd-syncop.c:1745
#12 0xb9257309 in glusterd_op_begin_synctask (req=0xb8f40040, op=GD_OP_GSYNC_CREATE, dict=0xb9c077b8) at glusterd-syncop.c:1804
#13 0xb9227bbc in __glusterd_handle_gsync_set (req=0xb8f40040) at glusterd-geo-rep.c:334
#14 0xb91b29c1 in glusterd_big_locked_handler (req=0xb8f40040, actor_fn=0xb92275e4 <__glusterd_handle_gsync_set>) at glusterd-handler.c:83
#15 0xb9227cb9 in glusterd_handle_gsync_set (req=0xb8f40040) at glusterd-geo-rep.c:362
#16 0xbb78992f in synctask_wrap (old_task=0xb8d3d000) at syncop.c:375
#17 0xbb385630 in ?? () from /usr/lib/libc.so.12
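
Frames #0-#2 show where the thread is parked: runner_end() sits in
waitpid() until the child (gverify.sh) exits. As a rough sketch (this is
not the actual run.c implementation, which also wires up pipes, logging
and argument handling), the runner framework boils down to fork + exec +
waitpid, so a hung child pins the calling glusterd thread indefinitely:

#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

/* Simplified sketch of what runner_run()/runner_end() reduce to. */
static int run_cmd(char *const argv[])
{
    int status;
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0) {                   /* child: become the external command */
        execvp(argv[0], argv);
        _exit(127);                   /* exec failed */
    }
    /* parent blocks here until the child exits -- frames #0-#2 above */
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}

int main(void)
{
    char *const argv[] = { "/bin/sh", "-c", "echo hello from the child", NULL };
    printf("child exit status: %d\n", run_cmd(argv));
    return 0;
}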


Emmanuel,

If you happen to see the cdc.t failure again, please ring a bell :)

~Atin
> 
> ~Atin

-- 
~Atin

