[Bugs] [Bug 1401921] glusterfsd crashed while taking snapshot using scheduler

Tue Dec 6 12:15:00 UTC 2016

https://bugzilla.redhat.com/show_bug.cgi?id=1401921

--- Comment #1 from Atin Mukherjee <amukherj at redhat.com> ---
Description of problem:

While taking a snapshot using the scheduler, one of the brick processes crashed.

Version-Release number of selected component (if applicable):
mainline

How reproducible:

1/1

Steps to Reproduce:
1. Create a 2*2 distributed-replicate volume
2. Enable the snapshot scheduler
3. Schedule a snapshot every minute

Actual results:

One of the brick processes crashed.

Additional info:

bt
=======================

#0  0x00007f19a2a12394 in glusterfs_handle_barrier (req=0x7f19a30cffcc) at glusterfsd-mgmt.c:1348
        ret = <optimized out>
        brick_req = {name = 0x7f198c0008e0 "repvol", op = 10, input = {input_len = 1783, input_val = 0x7f198c000900 ""}}
        brick_rsp = {op_ret = 0, op_errno = 0, output = {output_len = 0, output_val = 0x0}, op_errstr = 0x0}
        ctx = 0x7f19a3085010
        active = 0x0
        any = 0x0
        xlator = 0x0
        old_THIS = 0x0
        dict = 0x0
        name = '\000' <repeats 1023 times>
        barrier = _gf_true
        barrier_err = _gf_false
        __FUNCTION__ = "glusterfs_handle_barrier"
#1  0x00007f19a2550a92 in synctask_wrap (old_task=<optimized out>) at syncop.c:375
        task = 0x7f1990002510
#2  0x00007f19a0c0fcf0 in ?? () from /lib64/libc.so.6
No symbol table info available.
#3  0x0000000000000000 in ?? ()
No symbol table info available.

RCA: 
The core was generated in glusterfs_handle_barrier ().
From the core, it looks like glusterfsd_ctx (the global context) in the brick
process did not yet have ctx->active initialized, which happens during graph
initialization. We also saw that the brick process had only just come up when
GlusterD sent the barrier brick op. Our hypothesis is as follows:

T1. The brick process was still in its init and had not yet finished graph
generation.
T2. GlusterD sent a barrier brick op (triggered by the snapshot that the
snapshot scheduler initiated) because it considered the brick connected (it had
received the rpc connect notify from the brick process).

The time gap between T1 and T2 is very small, and GlusterD currently has no way
of knowing whether the brick process has finished all of its initialization,
including graph generation.

One mitigation for this crash is to avoid the null pointer dereference, which a
simple patch can address; if we then hit this race, the barrier op would fail
instead of crashing the brick. Fixing the race entirely needs a more concrete
solution, which may not be feasible in the 3.2.0 timeline.
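
As a rough illustration of the mitigation described above (not the actual
patch), the sketch below shows a handler that checks whether the active graph
exists before touching it. All types and names here (sketch_ctx, sketch_graph,
sketch_barrier_rsp, sketch_handle_barrier) are hypothetical stand-ins, not the
real glusterfsd structures; only the "active is NULL until graph init
finishes" behaviour is taken from the RCA.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-ins for the real glusterfs types; only the fields
 * relevant to this crash are modelled. */
struct sketch_graph {
    const char *top_name;        /* name of the top xlator in the graph */
};

struct sketch_ctx {
    struct sketch_graph *active; /* stays NULL until graph init completes */
};

struct sketch_barrier_rsp {
    int op_ret;                  /* 0 on success, -1 on failure */
    const char *op_errstr;       /* NULL on success */
};

/* Handle a barrier request. Returns 0 on success, or -1 when the brick
 * has not finished building its graph yet (the T1/T2 race). */
int sketch_handle_barrier(struct sketch_ctx *ctx,
                          struct sketch_barrier_rsp *rsp)
{
    assert(rsp != NULL);

    /* Guard: fail the op instead of dereferencing a NULL graph. */
    if (ctx == NULL || ctx->active == NULL) {
        rsp->op_ret = -1;
        rsp->op_errstr = "brick graph is not initialized yet";
        return -1;
    }

    /* Past this point ctx->active is safe to use. */
    printf("barrier applied on graph topped by %s\n", ctx->active->top_name);
    rsp->op_ret = 0;
    rsp->op_errstr = NULL;
    return 0;
}
```

With a guard like this, a barrier op that arrives before graph generation
finishes returns an error to the caller, so the snapshot attempt fails but the
brick process stays up. It does not close the race itself.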
