[Bugs] [Bug 1214220] New: Crashes in logging code

Wed Apr 22 09:30:53 UTC 2015

https://bugzilla.redhat.com/show_bug.cgi?id=1214220

            Bug ID: 1214220
           Summary: Crashes in logging code
           Product: GlusterFS
           Version: 3.7.0
         Component: core
          Keywords: Triaged
          Severity: urgent
          Assignee: bugs at gluster.org
          Reporter: vbellur at redhat.com
                CC: amukherj at redhat.com, bugs at gluster.org,
                    gluster-bugs at redhat.com, jclift at redhat.com,
                    jdarcy at redhat.com, vbellur at redhat.com
        Depends On: 1212660
            Blocks: 1199352 (glusterfs-3.7.0)

+++ This bug was initially created as a clone of Bug #1212660 +++

I looked at seven core dumps from five recently failed regression tests. 
Here's a summary.

   http://build.gluster.org/job/rackspace-regression-2GB-triggered/7052/console
   generated by: tests/geo-rep/georep-rsync-hybrid.t
   crash details: in python (gsyncd)

   http://build.gluster.org/job/rackspace-regression-2GB-triggered/7038/console
   generated by: tests/basic/cdc.t
   crash details: in glusterfsd
      pthread_spin_lock
      __gf_free
      log_buf_destroy
      _gf_msg_internal
      _gf_msg "accepted client from %s (version: %s)"
      server_setvolume

   http://build.gluster.org/job/rackspace-regression-2GB-triggered/7035/console
   generated by: tests/basic/mgmt_v3-locks.t
   crash details: in glusterfs
      log_buf_destroy
      gf_log_flush_list
      gf_log_flush_extra_msgs
      gf_log_set_log_buf_size
      gf_log_disable_suppression_before_exit
      cleanup_and_exit
      glusterfs_process_volfp

   http://build.gluster.org/job/rackspace-regression-2GB-triggered/7030/console
   generated by: tests/basic/cdc.t
   crash details: in glusterfsd
      same as the previous server_setvolume crash (job 7038)

   http://build.gluster.org/job/rackspace-regression-2GB-triggered/7029/console
   generated by: tests/basic/volume-snapshot-clone.t (three core files)
   crash details: in glusterfs
      all three same as previous glusterfs_process_volfp crash

That's six out of seven going through log_buf_destroy - different tests,
different daemons, different code paths, but all converging there.
Could it be a coincidence that this is the same logging infrastructure
we've recently started using more heavily?  That seems unlikely.  It's
entirely possible that log_buf_destroy is the victim (of heap
corruption) rather than a culprit, but chances are that the bug is
somewhere in related code.

--- Additional comment from Justin Clift on 2015-04-17 06:54:00 EDT ---

Cool, keep going.  Let's nail this sucker! :)

--- Additional comment from Jeff Darcy on 2015-04-21 11:26:43 EDT ---

This turns out to be a relative of both bug 1211749 and bug 1211473 - a memory
object allocated in a translator has persisted past the lifetime of that
translator.  The translator pointer in that memory object's header is therefore
no longer valid, and when the memory tracking code tries to dereference through
that pointer . . . BOOM.

In those other cases, the problem had to do with a temporary graph created for
option validation.  In this case it has to do with the list we use to detect
and coalesce duplicate log messages.  While the log_buf objects themselves are
allocated from a pool, various elements are copied via gf_strdup, using THIS
from the current context as the owning translator.  The solution is going to be
rather similar to that for 1211749:

    http://review.gluster.org/#/c/10238/

It's hacky, but it gets us past having our daemons blow up effectively at
random.
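
To illustrate the failure mode described above, here is a minimal sketch in C.
It is not the GlusterFS code: owner_t, mem_header_t, tracked_strdup,
tracked_free and global_owner are hypothetical names standing in for the
translator, the memory-accounting header, gf_strdup, __gf_free and
global_xlator respectively. It assumes only what the comment says: each
allocation carries a pointer to its owning translator, and the free path
dereferences that pointer.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a translator: just a name and a counter
 * that the memory tracking code updates. */
typedef struct owner {
    const char *name;
    size_t      live_allocs;
} owner_t;

/* Hypothetical header stored in front of every tracked allocation. */
typedef struct mem_header {
    owner_t *owner;
} mem_header_t;

/* A process-lifetime owner, analogous in spirit to global_xlator.
 * It is never freed, so headers pointing at it never dangle. */
static owner_t global_owner = { "global", 0 };

/* Duplicate a string and record which owner is charged for it. */
static char *tracked_strdup(owner_t *owner, const char *src)
{
    size_t        len = strlen(src) + 1;
    mem_header_t *hdr = malloc(sizeof(*hdr) + len);
    char         *payload;

    if (hdr == NULL)
        return NULL;
    hdr->owner = owner;
    owner->live_allocs++;
    payload = (char *)(hdr + 1);
    memcpy(payload, src, len);
    return payload;
}

/* Free a tracked string. The accounting update dereferences the owner
 * stored in the header; if that owner has already been destroyed, this
 * is the dereference that blows up. */
static void tracked_free(char *ptr)
{
    mem_header_t *hdr;

    if (ptr == NULL)
        return;
    hdr = (mem_header_t *)ptr - 1;
    hdr->owner->live_allocs--;
    free(hdr);
}
```

If a string cached for duplicate detection were charged to a short-lived
owner, tracked_free would touch freed memory once that owner went away.
Charging it to global_owner instead keeps the header valid for as long as the
process runs, which is the idea behind the fix referenced above.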

--- Additional comment from Anand Avati on 2015-04-21 11:50:53 EDT ---

REVIEW: http://review.gluster.org/10319 (core: avoid crashes in gf_msg
dup-detection code) posted (#1) for review on master by Jeff Darcy
(jdarcy at redhat.com)

--- Additional comment from Justin Clift on 2015-04-21 12:00:15 EDT ---

Awesome. :)

--- Additional comment from Anand Avati on 2015-04-22 02:15:43 EDT ---

COMMIT: http://review.gluster.org/10319 committed in master by Vijay Bellur
(vbellur at redhat.com) 
------
commit 765849ee00f6661c9059122ff2346b03b224745f
Author: Jeff Darcy <jdarcy at redhat.com>
Date:   Tue Apr 21 11:48:15 2015 -0400

    core: avoid crashes in gf_msg dup-detection code

    Use global_xlator for allocations so that we don't try to free objects
    belonging to an already-deleted translator (which will crash).

    Change-Id: Ie72a546e7770cf5cb8a8370e22448c8d09e3ab37
    BUG: 1212660
    Signed-off-by: Jeff Darcy <jdarcy at redhat.com>
    Reviewed-on: http://review.gluster.org/10319
    Reviewed-by: Krishnan Parthasarathi <kparthas at redhat.com>
    Tested-by: NetBSD Build System
    Tested-by: Gluster Build System <jenkins at build.gluster.com>
    Reviewed-by: Krutika Dhananjay <kdhananj at redhat.com>
    Reviewed-by: Atin Mukherjee <amukherj at redhat.com>
    Reviewed-by: Vijay Bellur <vbellur at redhat.com>

Referenced Bugs:

https://bugzilla.redhat.com/show_bug.cgi?id=1199352
[Bug 1199352] GlusterFS 3.7.0 tracker
https://bugzilla.redhat.com/show_bug.cgi?id=1212660
[Bug 1212660] Crashes in logging code