[Bugs] [Bug 1465861] Removal of io threads from graph causes segfault in quota enable volume

Wed Jun 28 11:44:44 UTC 2017

https://bugzilla.redhat.com/show_bug.cgi?id=1465861

--- Comment #1 from Sanoj Unnikrishnan <sunnikri at redhat.com> ---


While trying to reproduce this, I found that on each run some random
structures were getting corrupted, which then led to a segfault.
To confirm that the stack was indeed growing through all of the allocated
space and beyond, I set a guard page at the end of the allocated stack space
(so that we hit a segfault before overrunning the space).
The code changes are below.

@@ -443,6 +443,8 @@ synctask_create (struct syncenv *env, size_t stacksize, synctask_fn_t fn,
         struct synctask *newtask = NULL;
         xlator_t        *this    = THIS;
         int             destroymode = 0;
+        int                     r=0;
+        char                    *v;

         VALIDATE_OR_GOTO (env, err);
         VALIDATE_OR_GOTO (fn, err);
@@ -498,9 +500,15 @@ synctask_create (struct syncenv *env, size_t stacksize, synctask_fn_t fn,
                                             gf_common_mt_syncstack);
                 newtask->ctx.uc_stack.ss_size = env->stacksize;
         } else {
-                newtask->stack = GF_CALLOC (1, stacksize,
+               newtask->stack = GF_CALLOC (1, stacksize,
                                             gf_common_mt_syncstack);
                 newtask->ctx.uc_stack.ss_size = stacksize;
+                if (stacksize == 16*1024) {
+                        v = (unsigned long)((char *)(newtask->stack) + 4095) & (~4095);
+                        r = mprotect(v, 4096, PROT_NONE);
+                       gf_msg ("syncop", GF_LOG_ERROR, errno,
+                                LG_MSG_GETCONTEXT_FAILED, "SKU: using 16k stack starting at %p, mprotect returned %d, guard page: %p", newtask->stack, r, v);
+               }
         }

(gdb) where
#0  0x00007f8a92c51204 in _dl_lookup_symbol_x () from /lib64/ld-linux-x86-64.so.2
#1  0x00007f8a92c561e3 in _dl_fixup () from /lib64/ld-linux-x86-64.so.2
#2  0x00007f8a92c5dd33 in _dl_runtime_resolve_avx () from /lib64/ld-linux-x86-64.so.2
#3  0x0000000000000000 in ?? ()


(gdb) info reg

rdi            0x7f8a92946188    140233141412232
rbp            0x7f8a800b4000    0x7f8a800b4000
rsp            0x7f8a800b4000    0x7f8a800b4000
r8             0x7f8a92e4ba60    140233146677856

(gdb) layout asm

  >│0x7f8a92c51204 <_dl_lookup_symbol_x+4>          push   %r15    <== push onto the stack at the guarded page caused the segfault

From the brick log we have:

[syncop.c:515:synctask_create] 0-syncop: SKU: using 16k stack starting at 0x7f8a800b28f0, mprotect returned 0, guard page: 0x7f8a800b3000 [No data available]

The stack grows downward from 0x7f8a800b68f0 to 0x7f8a800b28f0, and the page
0x7f8a800b3000 - 0x7f8a800b4000 is guarded. With rsp at 0x7f8a800b4000, the
push writes into the top 8 bytes of the guarded page, which is where the
segfault hit, as seen in gdb.
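
The address arithmetic above can be checked with a small sketch. The helper
names in_guard_page and stack_bytes_used are hypothetical; the addresses
plugged in below are the ones from the brick log and the gdb register dump.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: true if addr lies in [guard, guard + page_size). */
static bool
in_guard_page (uintptr_t addr, uintptr_t guard, uintptr_t page_size)
{
        return addr >= guard && addr < guard + page_size;
}

/* Hypothetical helper: bytes in use on a downward-growing stack whose
 * highest address is top, given the current stack pointer sp. */
static uintptr_t
stack_bytes_used (uintptr_t top, uintptr_t sp)
{
        return top - sp;
}
```

With stack base 0x7f8a800b28f0 and size 16384, the top is 0x7f8a800b68f0. A
push at rsp = 0x7f8a800b4000 writes to 0x7f8a800b3ff8, for which
in_guard_page (0x7f8a800b3ff8, 0x7f8a800b3000, 4096) is true, and
stack_bytes_used (0x7f8a800b68f0, 0x7f8a800b4000) gives 10480 bytes in use at
the time of the fault. The guard page plus the 1808 bytes below it account for
the remainder of the 16k allocation.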

This confirms that the stack space is not sufficient and is overflowing.
I am not sure why we don't hit this in the presence of io threads, though. It
may just be that with io threads in the graph there is some allocated but
unused memory adjacent to the stack, which the stack freely grows into.
In that case the overflow would be a silent, undetected reuse of that memory.
