[Bugs] [Bug 1441476] New: Glusterd crashes when restarted with many volumes

bugzilla at redhat.com bugzilla at redhat.com
Wed Apr 12 04:01:37 UTC 2017


https://bugzilla.redhat.com/show_bug.cgi?id=1441476

            Bug ID: 1441476
           Summary: Glusterd crashes when restarted with many volumes
           Product: GlusterFS
           Version: 3.10
         Component: glusterd
          Assignee: bugs at gluster.org
          Reporter: jeff at pl.atyp.us
                CC: amukherj at redhat.com, bugs at gluster.org,
                    jdarcy at redhat.com
        Depends On: 1432542
            Blocks: 1438052



+++ This bug was initially created as a clone of Bug #1432542 +++

This was actually found in a test for bug 1430860, which was about *glusterfsd*
crashing under the same load.  That test never reproduced the original bug, but
it could reliably reproduce this one.  Examples include:

  https://build.gluster.org/job/centos6-regression/3607/consoleFull
  two crashes in attach_brick/send_attach_req  

  https://build.gluster.org/job/centos6-regression/3608/consoleFull
  first crash in attach_brick/send_attach_req
  second crash in __synclock_unlock/list_del_init

  https://build.gluster.org/job/centos6-regression/3609/consoleFull
  one crash in __synclock_unlock/list_del_init

Investigation showed that the problem was with the lock manipulation done
around retries in attach_brick.  Turns out that's not safe after all,
especially when it allows a new operation to overlap with one already in
progress - as the test does.  A fix, which moves the retries into a separate
thread and tracks connection states more carefully, is forthcoming.
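
For illustration, here is a minimal sketch - plain pthreads, not the actual
glusterd code - of why dropping and re-acquiring a big lock around retries is
unsafe.  attach_brick_retry() and delete_volume() below are hypothetical
stand-ins for the real operations:

  /*
   * Minimal sketch (not glusterd code): dropping the big lock around a
   * retry opens a window during which another operation can modify the
   * structures the retrying code was relying on.
   */
  #include <pthread.h>
  #include <stdio.h>
  #include <unistd.h>

  static pthread_mutex_t big_lock = PTHREAD_MUTEX_INITIALIZER;
  static int brick_list_valid = 1;      /* stands in for the volume/brick tree */

  static int
  attach_brick_retry(void)
  {
      for (int tries = 0; tries < 3; tries++) {
          /* Dropping the lock here opens a window ... */
          pthread_mutex_unlock(&big_lock);
          usleep(1000);                 /* wait before retrying the attach */
          pthread_mutex_lock(&big_lock);

          /* ... during which another operation may have modified or freed
           * what we were iterating over. */
          if (!brick_list_valid) {
              fprintf(stderr, "brick list changed underneath us!\n");
              return -1;
          }
      }
      return 0;
  }

  static void *
  delete_volume(void *arg)
  {
      (void)arg;
      pthread_mutex_lock(&big_lock);    /* sneaks in while the retry sleeps */
      brick_list_valid = 0;             /* mutates the tree being walked */
      pthread_mutex_unlock(&big_lock);
      return NULL;
  }

  int
  main(void)
  {
      pthread_t t;

      pthread_mutex_lock(&big_lock);
      pthread_create(&t, NULL, delete_volume, NULL);
      attach_brick_retry();             /* likely observes the change */
      pthread_mutex_unlock(&big_lock);
      pthread_join(t, NULL);
      return 0;
  }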

--- Additional comment from Worker Ant on 2017-03-15 11:43:13 EDT ---

REVIEW: https://review.gluster.org/16905 (glusterd: move attach-request retries
to a separate thread) posted (#1) for review on master by Jeff Darcy
(jdarcy at redhat.com)

--- Additional comment from Worker Ant on 2017-03-15 13:08:31 EDT ---

REVIEW: https://review.gluster.org/16905 (glusterd: move attach-request retries
to a separate thread) posted (#2) for review on master by Jeff Darcy
(jdarcy at redhat.com)

--- Additional comment from Worker Ant on 2017-03-15 13:52:18 EDT ---

REVIEW: https://review.gluster.org/16905 (glusterd: move attach-request retries
to a separate thread) posted (#3) for review on master by Jeff Darcy
(jdarcy at redhat.com)

--- Additional comment from Worker Ant on 2017-03-15 14:35:48 EDT ---

REVIEW: https://review.gluster.org/16905 (glusterd: move attach-request retries
to a separate thread) posted (#4) for review on master by Jeff Darcy
(jdarcy at redhat.com)

--- Additional comment from Worker Ant on 2017-03-15 15:06:05 EDT ---

REVIEW: https://review.gluster.org/16905 (glusterd: move attach-request retries
to a separate thread) posted (#5) for review on master by Jeff Darcy
(jdarcy at redhat.com)

--- Additional comment from Worker Ant on 2017-03-15 15:29:01 EDT ---

REVIEW: https://review.gluster.org/16905 (glusterd: move attach-request retries
to a separate thread) posted (#6) for review on master by Jeff Darcy
(jdarcy at redhat.com)

--- Additional comment from Worker Ant on 2017-03-15 15:53:35 EDT ---

REVIEW: https://review.gluster.org/16905 (glusterd: move attach-request retries
to a separate thread) posted (#7) for review on master by Jeff Darcy
(jdarcy at redhat.com)

--- Additional comment from Worker Ant on 2017-03-15 16:21:23 EDT ---

REVIEW: https://review.gluster.org/16905 (glusterd: move attach-request retries
to a separate thread) posted (#8) for review on master by Jeff Darcy
(jdarcy at redhat.com)

--- Additional comment from Worker Ant on 2017-03-15 16:43:02 EDT ---

REVIEW: https://review.gluster.org/16905 (glusterd: move attach-request retries
to a separate thread) posted (#9) for review on master by Jeff Darcy
(jdarcy at redhat.com)

--- Additional comment from Worker Ant on 2017-03-15 17:12:46 EDT ---

REVIEW: https://review.gluster.org/16905 (glusterd: move attach-request retries
to a separate thread) posted (#10) for review on master by Jeff Darcy
(jdarcy at redhat.com)

--- Additional comment from Jeff Darcy on 2017-03-15 18:49:06 EDT ---

Down the rabbit hole we go.

Moving the retries to a separate thread made the attach_brick/send_attach_req
crashes go away, but the __synclock_unlock/list_del_init crashes were still
there.  It's hard to tell from a small sample, but if anything they seemed to
be worse.  After looking at several cores and then running some experiments, I
now strongly suspect that there's a bug in our synclock_lock/synclock_unlock
code.  I'm sure that will come as a shock to just about nobody.

One of the hallmarks of these crashes is that a synclock's waitq - specifically
the one for glusterd's big_lock - has one task on it, but that task has clearly
been freed.  It has 0xdeadcode all over it, indicating that not only has it
been freed but parts of it have since been reallocated.  Somehow the problem seems to be
related to the fact that __synclock_unlock will wake up both a waiting thread
and a waiting synctask if both exist.  If I change it to wake up only one,
giving preference to synctasks, then the crashes go away but I hit some sort of
deadlock (hard to diagnose precisely when none of this has ever been
reproducible other than on our regression machines).  If I change it to wake up
only one, preferring the actual thread, then things seem much better.  The
locking etc. in the synclock_lock/synclock_unlock code *looks* reasonable, but
there must be either some gap or some piece of code that's going around it.

For now, the threads-get-priority change as implemented in patchset 10 seems to
be avoiding the problem, but if we don't finish tracking it down then there's a
high probability we'll just hit it again in some other context some day.
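
For reference, a simplified sketch of the two wake policies being compared;
this is NOT the real libglusterfs synclock code, synctask_wake_sketch() is a
stub, and the bookkeeping needed to know whether any plain thread is actually
waiting is glossed over:

  #include <pthread.h>
  #include <stdio.h>
  #include <sys/queue.h>

  struct sketch_task {
      TAILQ_ENTRY(sketch_task) waitq;
  };

  struct sketch_lock {
      pthread_mutex_t guard;
      pthread_cond_t  cond;                /* plain threads block here */
      TAILQ_HEAD(, sketch_task) waitq;     /* synctasks queue here */
      int             held;
  };

  /* stub: the real code would re-queue the task so a syncenv thread runs it */
  static void
  synctask_wake_sketch(struct sketch_task *task)
  {
      (void)task;
      printf("would wake a synctask\n");
  }

  static void
  sketch_unlock(struct sketch_lock *lk, int wake_both)
  {
      pthread_mutex_lock(&lk->guard);
      lk->held = 0;

      if (wake_both) {
          /* Old behavior: signal a waiting thread AND wake a waiting
           * synctask, letting them race for the lock. */
          pthread_cond_signal(&lk->cond);
          if (!TAILQ_EMPTY(&lk->waitq))
              synctask_wake_sketch(TAILQ_FIRST(&lk->waitq));
      } else {
          /* Patchset-10 behavior: wake only one waiter, preferring a plain
           * thread; falling back to a synctask when no thread is waiting
           * is omitted here. */
          pthread_cond_signal(&lk->cond);
      }

      pthread_mutex_unlock(&lk->guard);
  }

  int
  main(void)
  {
      struct sketch_lock lk;

      pthread_mutex_init(&lk.guard, NULL);
      pthread_cond_init(&lk.cond, NULL);
      TAILQ_INIT(&lk.waitq);
      lk.held = 1;

      sketch_unlock(&lk, 1);    /* "wake both" variant */
      sketch_unlock(&lk, 0);    /* "wake one, prefer the thread" variant */
      return 0;
  }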

--- Additional comment from Worker Ant on 2017-03-15 20:39:53 EDT ---

REVIEW: https://review.gluster.org/16905 (glusterd: move attach-request retries
to a separate thread) posted (#11) for review on master by Jeff Darcy
(jdarcy at redhat.com)

--- Additional comment from Worker Ant on 2017-03-15 22:50:02 EDT ---

REVIEW: https://review.gluster.org/16905 (glusterd: move attach-request retries
to a separate thread) posted (#12) for review on master by Jeff Darcy
(jdarcy at redhat.com)

--- Additional comment from Jeff Darcy on 2017-03-15 23:08:14 EDT ---

OK, latest theory is that someone is calling synctask_set to set the task
pointer in thread-local storage (TLS), then the task exits but the task pointer
in TLS is still set to the now-freed task structure.  Subsequently, someone
else calls into synclock_lock, and mistakenly believes that we're in that
now-deleted task because that's what synctask_get told us.  So we put the bogus
task on the waitq, and everything goes downhill from there.  I'm not *exactly*
sure how we get to task exit before the next call to synctask_set, but there
are lots of paths through this code and the fact that it's jumping between
tasks and threads makes it even harder to follow.  What I do know is that
better hygiene around calling synctask_set - including synctask_set(NULL) -
whenever we call swapcontext seems to have helped.
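
To make the hygiene point concrete, here is a small self-contained example
(not the actual syncop.c code; the names are made up) showing the per-thread
task pointer being set before swapcontext() and cleared to NULL afterwards:

  #include <pthread.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <ucontext.h>

  static pthread_key_t task_key;         /* holds this thread's current task */

  struct task_sketch {
      ucontext_t ctx;                    /* the task's own context */
      ucontext_t proc;                   /* the runner thread's context */
      char       stack[65536];
  };

  static void
  synctask_set_sketch(struct task_sketch *t)
  {
      pthread_setspecific(task_key, t);
  }

  static struct task_sketch *
  synctask_get_sketch(void)
  {
      return pthread_getspecific(task_key);
  }

  static void
  task_fn(void)
  {
      printf("inside task, TLS task = %p\n", (void *)synctask_get_sketch());
      /* returning ends the context; uc_link switches back to the runner */
  }

  int
  main(void)
  {
      struct task_sketch *t = calloc(1, sizeof(*t));

      pthread_key_create(&task_key, NULL);

      getcontext(&t->ctx);
      t->ctx.uc_stack.ss_sp   = t->stack;
      t->ctx.uc_stack.ss_size = sizeof(t->stack);
      t->ctx.uc_link          = &t->proc;
      makecontext(&t->ctx, task_fn, 0);

      synctask_set_sketch(t);            /* record the task before switching in */
      swapcontext(&t->proc, &t->ctx);    /* run the task to completion */
      synctask_set_sketch(NULL);         /* the hygiene step: clear the pointer
                                          * so nobody sees a stale, freed task */
      free(t);
      printf("after task, TLS task = %p\n", (void *)synctask_get_sketch());
      return 0;
  }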

--- Additional comment from Worker Ant on 2017-03-16 08:56:50 EDT ---

REVIEW: https://review.gluster.org/16905 (glusterd: move attach-request retries
to a separate thread) posted (#13) for review on master by Jeff Darcy
(jdarcy at redhat.com)

--- Additional comment from Worker Ant on 2017-03-16 11:14:53 EDT ---

REVIEW: https://review.gluster.org/16905 (glusterd: move attach-request retries
to a separate thread) posted (#14) for review on master by Jeff Darcy
(jdarcy at redhat.com)

--- Additional comment from Jeff Darcy on 2017-03-18 23:58:47 EDT ---

Further investigation shows that all symptoms - including cases where I was
able to observe multiple threads/synctasks executing concurrently even though
they were all supposed to be under glusterd's big_lock - could be explained by
memory corruption.  What's happening is that the lock/unlock behavior in
attach_brick is allowing other operations to overlap.  In particular, if we're
calling attach_brick from glusterd_restart_bricks (as we do when starting up),
an overlapping volume delete will modify the very list we're in the middle of
walking, so we'll either be working on a pointer to a freed brickinfo_t or
we'll follow its list pointers off into space.  Either way, *all sorts* of
crazy and horrible things can happen.

Moving the attach-RPC retries into a separate thread/task really doesn't help,
and introduces its own problems.  What has worked fairly well is forcing volume
deletes to wait if glusterd_restart_bricks is still in progress, but the
approach I'm currently using is neither as complete nor as airtight as it
really needs to be.  I'm still trying to think of a better solution, but at
least now it's clear what the problem is.
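
A minimal sketch of the hold-off idea, using just a flag and a condition
variable; this is not the code actually posted for review, and the names here
are hypothetical:

  #include <pthread.h>
  #include <stdio.h>

  static pthread_mutex_t guard = PTHREAD_MUTEX_INITIALIZER;
  static pthread_cond_t  restart_done = PTHREAD_COND_INITIALIZER;
  static int             restarting_bricks = 0;

  static void
  restart_bricks_sketch(void)
  {
      pthread_mutex_lock(&guard);
      restarting_bricks = 1;
      pthread_mutex_unlock(&guard);

      /* ... walk the volume/brick tree, attach bricks, retry as needed ... */

      pthread_mutex_lock(&guard);
      restarting_bricks = 0;
      pthread_cond_broadcast(&restart_done);   /* let blocked deletes proceed */
      pthread_mutex_unlock(&guard);
  }

  static void
  delete_volume_sketch(void)
  {
      pthread_mutex_lock(&guard);
      /* Wait until the restart walk is finished, so we never modify the
       * list it is iterating over. */
      while (restarting_bricks)
          pthread_cond_wait(&restart_done, &guard);
      pthread_mutex_unlock(&guard);

      /* ... now it is safe to remove the volume's bricks from the tree ... */
      printf("volume delete proceeds after restart has finished\n");
  }

  int
  main(void)
  {
      restart_bricks_sketch();
      delete_volume_sketch();
      return 0;
  }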

BTW, debugging and testing is being severely hampered by constant
disconnections between glusterd and glusterfsd.  This might be related to the
sharp increase in client/server disconnect/reconnect cycles that others have
reported dating back approximately to RHGS 3.1.2, and we really need to figure
out why our RPC got flakier around that time.

--- Additional comment from Worker Ant on 2017-03-20 12:45:19 EDT ---

REVIEW: https://review.gluster.org/16927 (glusterd: hold off volume deletes
while still restarting bricks) posted (#1) for review on master by Jeff Darcy
(jdarcy at redhat.com)

--- Additional comment from Worker Ant on 2017-03-20 15:06:14 EDT ---

REVIEW: https://review.gluster.org/16927 (glusterd: hold off volume deletes
while still restarting bricks) posted (#2) for review on master by Jeff Darcy
(jeff at pl.atyp.us)

--- Additional comment from Worker Ant on 2017-03-20 17:19:35 EDT ---

REVIEW: https://review.gluster.org/16927 (glusterd: hold off volume deletes
while still restarting bricks) posted (#3) for review on master by Jeff Darcy
(jeff at pl.atyp.us)

--- Additional comment from Worker Ant on 2017-03-23 08:51:28 EDT ---

REVIEW: https://review.gluster.org/16927 (glusterd: hold off volume deletes
while still restarting bricks) posted (#4) for review on master by Jeff Darcy
(jeff at pl.atyp.us)

--- Additional comment from Worker Ant on 2017-03-23 14:30:41 EDT ---

REVIEW: https://review.gluster.org/16927 (glusterd: hold off volume deletes
while still restarting bricks) posted (#5) for review on master by Jeff Darcy
(jeff at pl.atyp.us)

--- Additional comment from Worker Ant on 2017-03-30 23:33:35 EDT ---

COMMIT: https://review.gluster.org/16927 committed in master by Atin Mukherjee
(amukherj at redhat.com) 
------
commit a7ce0548b7969050644891cd90c0bf134fa1594c
Author: Jeff Darcy <jdarcy at redhat.com>
Date:   Mon Mar 20 12:32:33 2017 -0400

    glusterd: hold off volume deletes while still restarting bricks

    We need to do this because modifying the volume/brick tree while
    glusterd_restart_bricks is still walking it can lead to segfaults.
    Without waiting we could accidentally "slip in" while attach_brick has
    released big_lock between retries and make such a modification.

    Change-Id: I30ccc4efa8d286aae847250f5d4fb28956a74b03
    BUG: 1432542
    Signed-off-by: Jeff Darcy <jeff at pl.atyp.us>
    Reviewed-on: https://review.gluster.org/16927
    Smoke: Gluster Build System <jenkins at build.gluster.org>
    NetBSD-regression: NetBSD Build System <jenkins at build.gluster.org>
    CentOS-regression: Gluster Build System <jenkins at build.gluster.org>
    Reviewed-by: Atin Mukherjee <amukherj at redhat.com>


Referenced Bugs:

https://bugzilla.redhat.com/show_bug.cgi?id=1432542
[Bug 1432542] Glusterd crashes when restarted with many volumes
https://bugzilla.redhat.com/show_bug.cgi?id=1438052
[Bug 1438052] Glusterd crashes when restarted with many volumes