[Gluster-devel] Regression-test-burn-in crash in EC test
Xavier Hernandez
xhernandez at datalab.es
Fri Apr 29 08:06:36 UTC 2016
Hi Jeff,
On 27/04/16 20:01, Jeff Darcy wrote:
> One of the "rewards" of reviewing and merging people's patches is getting email if the next regression-test-burn-in should fail - even if it fails for a completely unrelated reason. Today I got one that's not among the usual suspects. The failure was a core dump in tests/bugs/disperse/bug-1304988.t, weighing in at a respectable 42 frames.
>
> #0 0x00007fef25976cb9 in dht_rename_lock_cbk
> #1 0x00007fef25955f62 in dht_inodelk_done
> #2 0x00007fef25957352 in dht_blocking_inodelk_cbk
> #3 0x00007fef32e02f8f in default_inodelk_cbk
> #4 0x00007fef25c029a3 in ec_manager_inodelk
> #5 0x00007fef25bf9802 in __ec_manager
> #6 0x00007fef25bf990c in ec_manager
> #7 0x00007fef25c03038 in ec_inodelk
> #8 0x00007fef25bee7ad in ec_gf_inodelk
> #9 0x00007fef25957758 in dht_blocking_inodelk_rec
> #10 0x00007fef25957b2d in dht_blocking_inodelk
> #11 0x00007fef2597713f in dht_rename_lock
> #12 0x00007fef25977835 in dht_rename
> #13 0x00007fef32e0f032 in default_rename
> #14 0x00007fef32e0f032 in default_rename
> #15 0x00007fef32e0f032 in default_rename
> #16 0x00007fef32e0f032 in default_rename
> #17 0x00007fef32e0f032 in default_rename
> #18 0x00007fef32e07c29 in default_rename_resume
> #19 0x00007fef32d8ed40 in call_resume_wind
> #20 0x00007fef32d98b2f in call_resume
> #21 0x00007fef24cfc568 in open_and_resume
> #22 0x00007fef24cffb99 in ob_rename
> #23 0x00007fef24aee482 in mdc_rename
> #24 0x00007fef248d68e5 in io_stats_rename
> #25 0x00007fef32e0f032 in default_rename
> #26 0x00007fef2ab1b2b9 in fuse_rename_resume
> #27 0x00007fef2ab12c47 in fuse_fop_resume
> #28 0x00007fef2ab107cc in fuse_resolve_done
> #29 0x00007fef2ab108a2 in fuse_resolve_all
> #30 0x00007fef2ab10900 in fuse_resolve_continue
> #31 0x00007fef2ab0fb7c in fuse_resolve_parent
> #32 0x00007fef2ab1077d in fuse_resolve
> #33 0x00007fef2ab10879 in fuse_resolve_all
> #34 0x00007fef2ab10900 in fuse_resolve_continue
> #35 0x00007fef2ab0fb7c in fuse_resolve_parent
> #36 0x00007fef2ab1077d in fuse_resolve
> #37 0x00007fef2ab10824 in fuse_resolve_all
> #38 0x00007fef2ab1093e in fuse_resolve_and_resume
> #39 0x00007fef2ab1b40e in fuse_rename
> #40 0x00007fef2ab2a96a in fuse_thread_proc
> #41 0x00007fef3204daa1 in start_thread
>
> In other words we started at FUSE, went through a bunch of performance translators, through DHT to EC, and then crashed on the way back. It seems a little odd that we turn the fop around immediately in EC, and that we have default_inodelk_cbk at frame 3. Could one of the DHT or EC people please take a look at it? Thanks!
The part regarding ec seems ok. This is uncommon, but it can happen.
When ec_gf_inodelk() is called, it sends an inodelk request to all its
subvolumes. It may happen that the callbacks of all these requests are
received before ec_gf_inodelk() itself returns. In that case the final
callback runs in the caller's own thread, before the call has returned.
default_inodelk_cbk() shows up in the trace because ec uses this
function to report the result back to the caller (instead of calling
STACK_UNWIND() itself).
This seems to be what happened here.
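The synchronous completion described above can be modeled in a few
lines. This is only a Python sketch of the control flow, not GlusterFS
code: wind_inodelk(), immediate_subvolume() and caller_cbk() are
hypothetical names, and the six subvolumes are assumed to answer before
their own call returns.

```python
import threading

events = []  # ordered log of what ran, and on which thread


def wind_inodelk(subvolumes, on_done):
    """Hypothetical stand-in for a fan-out lock request (not the ec API)."""
    pending = len(subvolumes)
    answers = []

    def child_cbk(op_ret):
        nonlocal pending
        answers.append(op_ret)
        pending -= 1
        if pending == 0:
            # Last answer arrived: report upwards right away.
            on_done(answers)

    for subvol in subvolumes:
        subvol(child_cbk)  # each subvolume decides when to answer
    events.append(("wind returned", threading.get_ident()))


def immediate_subvolume(cbk):
    """Assumed behaviour: answers before its own call returns."""
    cbk(0)


def caller_cbk(answers):
    events.append(("callback ran", threading.get_ident()))


wind_inodelk([immediate_subvolume] * 6, caller_cbk)
```

The caller's callback is logged before "wind returned", and on the same
thread. That matches the shape of the backtrace, where
dht_rename_lock_cbk (frame 0) runs while ec_gf_inodelk (frame 8) is
still on the stack.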
The frames ec returns to the upper xlators are the same ones they
passed down (the frame in dht_blocking_lock() is the same one that
dht_blocking_inodelk_cbk() receives), and ec doesn't touch them.
However, the frame at 0x7fef1003ca5c is completely corrupted.
We can see the call state from the core:
(gdb) f 4
#4 0x00007fef25c029a3 in ec_manager_inodelk (fop=0x7fef1000d37c,
state=5) at
/home/jenkins/root/workspace/regression-test-burn-in/xlators/cluster/ec/src/ec-locks.c:645
645 fop->cbks.inodelk(fop->req_frame, fop, fop->xl,
(gdb) print fop->answer
$30 = (ec_cbk_data_t *) 0x7fef180094ac
(gdb) print fop->answer->op_ret
$31 = 0
(gdb) print fop->answer->op_errno
$32 = 0
(gdb) print fop->answer->count
$33 = 6
(gdb) print fop->answer->mask
$34 = 63
As we can see, there's an actual answer to the request with a success
result (op_ret == 0 and op_errno == 0), built by combining the answers
from 6 subvolumes (count == 6).
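The mask value printed by gdb is consistent with that count. The sketch
below assumes that mask is a bitmap with one bit per subvolume that
contributed to the answer. That reading is an assumption, not something
stated in the gdb output, and subvolumes_in_mask() is a hypothetical
helper.

```python
def subvolumes_in_mask(mask):
    """Return the indices of the bits set in mask, lowest first."""
    return [i for i in range(mask.bit_length()) if mask & (1 << i)]


mask = 63   # fop->answer->mask from the gdb session
count = 6   # fop->answer->count from the gdb session

indices = subvolumes_in_mask(mask)
print(indices)                # [0, 1, 2, 3, 4, 5]
print(len(indices) == count)  # True
```

63 is 0x3f, so under that assumption all six low bits are set and no
subvolume is missing from the combined answer.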
Looking at the dht code, I have been unable to find any possible cause there either.
The test is doing renames where the source and target directories are
different. At the same time, a new ec-set is added and a rebalance is
started. The rebalance will cause dht to also move files between
bricks. Maybe this is causing some race in dht?
I'll try to continue investigating when I have some time.
Xavi
>
>
> https://build.gluster.org/job/regression-test-burn-in/868/console