[Bugs] [Bug 1159269] New: Random crashes when generating an internal state dump with signal USR1
bugzilla at redhat.com
Fri Oct 31 10:47:54 UTC 2014
https://bugzilla.redhat.com/show_bug.cgi?id=1159269
Bug ID: 1159269
Summary: Random crashes when generating an internal state dump
with signal USR1
Product: GlusterFS
Version: mainline
Component: core
Severity: high
Assignee: bugs at gluster.org
Reporter: xhernandez at datalab.es
CC: bugs at gluster.org, gluster-bugs at redhat.com
Description of problem:
Sometimes a segmentation fault is generated while dumping internal state. An
analysis of the core dump seems to indicate that the bug is caused by an
unaligned structure:
In gf_proc_dump_call_frame() a copy of the frame is made inside a locked
region:
ret = TRY_LOCK(&call_frame->lock);
if (ret)
        goto out;

memcpy(&my_frame, call_frame, sizeof(my_frame));
UNLOCK(&call_frame->lock);
However, call_frame->lock does not protect most of the updates to fields inside
the call_frame_t structure, especially the wind_from, wind_to, unwind_from and
unwind_to pointers, which are modified in the STACK_WIND and STACK_UNWIND macros.
This would not be a problem if all these updates were atomic; however, it seems
that the memory pool framework can return unaligned pointers (at least on
64-bit architectures):
(gdb) print call_frame
$19 = (call_frame_t *) 0x7f4609a141c4
This means that all pointers inside the structure can be unaligned:
(gdb) print &call_frame->unwind_from
$20 = (const char **) 0x7f4609a14244
At the microprocessor level, this means that a modification of the unwind_from
field will need two memory access cycles, making the update non-atomic and
prone to partial reads by other threads.
In fact this seems to be what happened:
(gdb) print *call_frame
$21 = {root = 0x7f460984a280, parent = 0x7f460984a8e8,
next = 0x7f4609a13454, prev = 0x7f4609a15540, local = 0x0,
this = 0xae2470, ret = 0x7f45fec75311 <ec_lookup_cbk>, ref_count = 0,
lock = 1, cookie = 0x9, complete = _gf_true, op = GF_FOP_NULL,
begin = {tv_sec = 0, tv_usec = 0}, end = {tv_sec = 0, tv_usec = 0},
wind_from = 0x7f45fecdc082 <__FUNCTION__.13893> "ec_wind_lookup",
wind_to = 0x7f45fecdbd20 "ec->xl_list[idx]->fops->lookup",
unwind_from = 0x7f45fef26c80 <__FUNCTION__.19453> "client3_3_lookup_cbk",
unwind_to = 0x7f45fecdbd3f "ec_lookup_cbk"}
(gdb) print my_frame
$22 = {root = 0x7f460984a280, parent = 0x7f460984a8e8,
next = 0x7f4609a13454, prev = 0x7f4609a15540, local = 0xb6a0b4,
this = 0xae2470, ret = 0x7f45fec75311 <ec_lookup_cbk>, ref_count = 0,
lock = 0, cookie = 0x9, complete = _gf_false, op = GF_FOP_NULL,
begin = {tv_sec = 0, tv_usec = 0}, end = {tv_sec = 0, tv_usec = 0},
wind_from = 0x7f45fecdc082 <__FUNCTION__.13893> "ec_wind_lookup",
wind_to = 0x7f45fecdbd20 "ec->xl_list[idx]->fops->lookup",
unwind_from = 0x7f4500000000 <error: Cannot access memory at address 0x7f4500000000>,
unwind_to = 0x7f45fecdbd3f "ec_lookup_cbk"}
The copy in my_frame contains only half of the unwind_from pointer because the
field was being updated by another thread while it was copied. If we check the
current contents of call_frame, we can see that the pointer update had
completed by the time of the crash, but the copy in my_frame remains incorrect:
(gdb) print call_frame->unwind_from
$23 = 0x7f45fef26c80 <__FUNCTION__.19453> "client3_3_lookup_cbk"
(gdb) print my_frame.unwind_from
$24 = 0x7f4500000000 <error: Cannot access memory at address 0x7f4500000000>
Version-Release number of selected component (if applicable): master