[Bugs] [Bug 1441474] New: synclocks don't work correctly under contention

bugzilla at redhat.com bugzilla at redhat.com
Wed Apr 12 03:57:10 UTC 2017


https://bugzilla.redhat.com/show_bug.cgi?id=1441474

            Bug ID: 1441474
           Summary: synclocks don't work correctly under contention
           Product: GlusterFS
           Version: 3.10
         Component: core
          Keywords: Triaged
          Severity: high
          Assignee: bugs at gluster.org
          Reporter: jeff at pl.atyp.us
                CC: bugs at gluster.org, jdarcy at redhat.com
        Depends On: 1434062



+++ This bug was initially created as a clone of Bug #1434062 +++

The visible symptom of this is usually a crash in list_del_init called from
__synclock_unlock.  Further investigation will show that the waitq for the lock
argument points to a task structure that looks like freed memory (lots of
0xdeadcode all over the place).  That's how you recognize the problem, which is
probably sufficient for most people.  Here's some explanation of how we get
into such a state.

The key aspect of how synctasks work is that they have "slept" and "woken"
flags in addition to their main state (that's a mistake right there IMO). 
They're started quiescent, with slept=1 and woken=0, and don't actually run
until someone wakes them up.  Normally this happens right away in
synctask_create, but there are some other cases.  In particular, GlusterD often
calls synctask_new directly, thus not calling synctask_wake until later.  What
matters is that the "woken" flag exists to support this usage, but it's not
being handled correctly.

(1) Synctask X is holding big_lock.

(2) Synctask Y tries to take the lock, but instead puts itself on the lock's
waitq and yields back to the scheduler.

(3) Process Z (not a synctask) also tries to take the lock.  Because it's a
regular process, it calls pthread_cond_wait instead.

(4) X releases the lock.  One idiosyncrasy of synclocks is that this will cause
*both* Y and Z to be woken up.

(5) Z wins the race, so Y tries to go back to sleep the same way as before. 
However, this time when it yields it gets back to synctask_switchto with its
"woken" flag already set, so synctask_switchto puts it on the global run queue
*even though it's still on the lock's waitq*.

(6) Z releases the lock, waking up Y.  Calling list_del_init on Y's task
structure has *no effect* on the lock's waitq (because the structure only
points to the run queue at this point), so the waitq still points to it.

(7) Y runs, completes, and frees its task structure.

(8) W takes the lock.  Because the lock is now free, it doesn't care about the
waitq.

(9) W releases the lock, following the waitq pointer to Y's long-ago-freed task
structure.

The crux of the problem is that, once set, "woken" is never cleared. 
Therefore, any time we have to go around the loop in __synclock_lock more than
once, we trip at step 5 above.  This is exactly what happens when we have
contention between synctasks and regular threads, because of the double wakeup
at step 4.

After all that, the solution is simple: clear "woken" again each time we put
ourselves to sleep in __synctask_lock.  This gets us the right behavior in
synctask_switchto, and everything is good again.

BTW yes, this was hell to debug.

--- Additional comment from Worker Ant on 2017-03-20 12:45:17 EDT ---

REVIEW: https://review.gluster.org/16926 (core: fix synclocks' handling of
"woken" flag) posted (#1) for review on master by Jeff Darcy
(jdarcy at redhat.com)

--- Additional comment from Worker Ant on 2017-03-20 17:19:32 EDT ---

REVIEW: https://review.gluster.org/16926 (core: fix synclocks' handling of
"woken" flag) posted (#2) for review on master by Jeff Darcy (jeff at pl.atyp.us)

--- Additional comment from Worker Ant on 2017-03-23 08:45:27 EDT ---

COMMIT: https://review.gluster.org/16926 committed in master by Jeff Darcy
(jeff at pl.atyp.us) 
------
commit 31377765dbbb8d49292c4362837a695adcbc6cb4
Author: Jeff Darcy <jdarcy at redhat.com>
Date:   Mon Mar 20 12:31:33 2017 -0400

    core: fix synclocks' handling of "woken" flag

    The "woken" flag wasn't being reset when it should have been, leading
    (eventually) to a SEGV when someone tried to folow a synclock's waitq
    to a task structure that had been freed while still on the queue.  See
    the bug report for (far) more detail.

    Change-Id: I5cd9ae1bcb831555274108b292181ec2a29b6d95
    BUG: 1434062
    Signed-off-by: Jeff Darcy <jeff at pl.atyp.us>
    Reviewed-on: https://review.gluster.org/16926
    NetBSD-regression: NetBSD Build System <jenkins at build.gluster.org>
    CentOS-regression: Gluster Build System <jenkins at build.gluster.org>
    Smoke: Gluster Build System <jenkins at build.gluster.org>
    Reviewed-by: Amar Tumballi <amarts at redhat.com>


Referenced Bugs:

https://bugzilla.redhat.com/show_bug.cgi?id=1434062
[Bug 1434062] synclocks don't work correctly under contention
-- 
You are receiving this mail because:
You are on the CC list for the bug.
You are the assignee for the bug.


More information about the Bugs mailing list