[Bugs] [Bug 1159284] New: Random crashes when generating an internal state dump with signal USR1

Fri Oct 31 11:41:39 UTC 2014

https://bugzilla.redhat.com/show_bug.cgi?id=1159284

            Bug ID: 1159284
           Summary: Random crashes when generating an internal state dump
                    with signal USR1
           Product: GlusterFS
           Version: 3.6.0
         Component: core
          Severity: high
          Assignee: bugs at gluster.org
          Reporter: xhernandez at datalab.es
                CC: bugs at gluster.org, gluster-bugs at redhat.com

+++ This bug was initially created as a clone of Bug #1159269 +++

Description of problem:

Sometimes a segmentation fault is generated while dumping internal state. An
analysis of the core dump seems to indicate that the bug is caused by an
unaligned structure:

In gf_proc_dump_call_frame() a copy of the frame is made inside a locked
region:

        ret = TRY_LOCK(&call_frame->lock);
        if (ret)
                goto out;

        memcpy(&my_frame, call_frame, sizeof(my_frame));
        UNLOCK(&call_frame->lock);

call_frame->lock does not protect most of the updates to fields inside the
call_frame_t structure, especially the wind_from, wind_to, unwind_from and
unwind_to pointers, which are modified by the STACK_WIND and STACK_UNWIND
macros.

This wouldn't be a problem if all these updates were atomic. However, it seems
that the memory pool framework can return unaligned pointers (at least on
64-bit architectures):

(gdb) print call_frame
$19 = (call_frame_t *) 0x7f4609a141c4

This means that all pointers inside the structure can be unaligned:

(gdb) print &call_frame->unwind_from
$20 = (const char **) 0x7f4609a14244

At the processor level, this means that a modification of the unwind_from
field will need two memory access cycles, making the update non-atomic and
prone to partial reads by other threads.
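
The misalignment can be checked directly from the addresses in the gdb output.
The following is a minimal sketch, not GlusterFS code: the helper names
misalignment() and straddles() are hypothetical, and an 8-byte pointer with
8-byte natural alignment is assumed (as on x86-64).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Number of bytes by which addr overshoots the previous multiple of align.
 * align must be a power of two. A result of 0 means naturally aligned. */
static size_t misalignment(uint64_t addr, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (size_t)(addr & (uint64_t)(align - 1));
}

/* Nonzero if an object of `size` bytes stored at addr spans two
 * consecutive align-sized words instead of fitting inside one. */
static int straddles(uint64_t addr, size_t size, size_t align)
{
    return misalignment(addr, align) + size > align;
}
```

With the address of unwind_from shown above, misalignment(0x7f4609a14244, 8)
is 4 and straddles(0x7f4609a14244, 8, 8) is nonzero: the pointer occupies the
last 4 bytes of one aligned 8-byte word and the first 4 bytes of the next.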

In fact this seems to be what happened:

(gdb) print *call_frame
$21 = {root = 0x7f460984a280, parent = 0x7f460984a8e8,
next = 0x7f4609a13454, prev = 0x7f4609a15540, local = 0x0,
this = 0xae2470, ret = 0x7f45fec75311 <ec_lookup_cbk>, ref_count = 0,
lock = 1, cookie = 0x9, complete = _gf_true, op = GF_FOP_NULL,
begin = {tv_sec = 0, tv_usec = 0}, end = {tv_sec = 0, tv_usec = 0},
wind_from = 0x7f45fecdc082 <__FUNCTION__.13893> "ec_wind_lookup",
wind_to = 0x7f45fecdbd20 "ec->xl_list[idx]->fops->lookup",
unwind_from = 0x7f45fef26c80 <__FUNCTION__.19453> "client3_3_lookup_cbk",
unwind_to = 0x7f45fecdbd3f "ec_lookup_cbk"}
(gdb) print my_frame
$22 = {root = 0x7f460984a280, parent = 0x7f460984a8e8,
next = 0x7f4609a13454, prev = 0x7f4609a15540, local = 0xb6a0b4,
this = 0xae2470, ret = 0x7f45fec75311 <ec_lookup_cbk>, ref_count = 0,
lock = 0, cookie = 0x9, complete = _gf_false, op = GF_FOP_NULL,
begin = {tv_sec = 0, tv_usec = 0}, end = {tv_sec = 0, tv_usec = 0},
wind_from = 0x7f45fecdc082 <__FUNCTION__.13893> "ec_wind_lookup",
wind_to = 0x7f45fecdbd20 "ec->xl_list[idx]->fops->lookup",
unwind_from = 0x7f4500000000 <error: Cannot access memory at address
0x7f4500000000>,
unwind_to = 0x7f45fecdbd3f "ec_lookup_cbk"}

The copy in my_frame contains only half of the unwind_from pointer because the
field was being updated by another thread while it was copied. If we check the
current contents of call_frame, we can see that the update completed before the
crash, but the copy in my_frame remains incorrect:

(gdb) print call_frame->unwind_from
$23 = 0x7f45fef26c80 <__FUNCTION__.19453> "client3_3_lookup_cbk"
(gdb) print my_frame.unwind_from
$24 = 0x7f4500000000 <error: Cannot access memory at address 0x7f4500000000> 
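
The bad value is consistent with a torn read. The sketch below only models
the arithmetic; it is not GlusterFS code. It assumes that the previous value
of unwind_from was NULL (the report does not show it) and that the reader saw
the low 32 bits before the update and the high 32 bits after it. The name
torn_read() is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Model a reader that observes a 64-bit store only partially: the lowest
 * stale_bits bits still come from old_val, the remaining high bits already
 * come from new_val. */
static uint64_t torn_read(uint64_t old_val, uint64_t new_val,
                          unsigned stale_bits)
{
    uint64_t stale_mask;

    assert(stale_bits > 0 && stale_bits < 64);
    stale_mask = (UINT64_C(1) << stale_bits) - 1;
    return (old_val & stale_mask) | (new_val & ~stale_mask);
}
```

Under those assumptions, torn_read(0, 0x7f45fef26c80, 32) yields
0x7f4500000000, which is the value seen in my_frame.unwind_from.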

Version-Release number of selected component (if applicable): master

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:

--- Additional comment from Anand Avati on 2014-10-31 12:33:40 CET ---

REVIEW: http://review.gluster.org/9031 (mem-pool: Fix memory block alignments)
posted (#1) for review on master by Xavier Hernandez (xhernandez at datalab.es)
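
The contents of the patch are not shown in this report. As an illustration
only, one common way to keep every block of a pool aligned is to round the
per-object size up to a multiple of the required alignment, so that
consecutive objects all start on an aligned address. The helper align_up()
below is hypothetical and is not taken from the GlusterFS sources.

```c
#include <assert.h>
#include <stddef.h>

/* Round size up to the next multiple of align.
 * align must be a power of two. */
static size_t align_up(size_t size, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (size + align - 1) & ~(align - 1);
}
```

For instance, a hypothetical 132-byte element would be padded to
align_up(132, 8) == 136 bytes, so that the next element in the pool again
starts on an 8-byte boundary (provided the pool itself starts on one).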
