[Bugs] [Bug 1396778] New: 4 of 8 bricks (2 dht subvols) crashed on systemic setup

bugzilla at redhat.com bugzilla at redhat.com
Sun Nov 20 05:34:35 UTC 2016


https://bugzilla.redhat.com/show_bug.cgi?id=1396778

            Bug ID: 1396778
           Summary: 4 of 8 bricks (2 dht subvols) crashed  on systemic
                    setup
           Product: GlusterFS
           Version: 3.9
         Component: posix
          Keywords: Triaged
          Severity: urgent
          Assignee: bugs at gluster.org
          Reporter: pkarampu at redhat.com
                CC: bugs at gluster.org, mzywusko at redhat.com,
                    nchilaka at redhat.com, pkarampu at redhat.com,
                    rgowdapp at redhat.com, rhs-bugs at redhat.com,
                    storage-qa-internal at redhat.com, vbellur at redhat.com
        Depends On: 1385606, 1386097



+++ This bug was initially created as a clone of Bug #1386097 +++

+++ This bug was initially created as a clone of Bug #1385606 +++

Description of problem:
=========================

I saw 4 of the 8 bricks in a distrepvol crash simultaneously.
The four bricks make up 2 dht subvols (all of dht subvols 1 and 2).


(IO: for details of the exact IO, refer to the work-sheet "IOs" in the
google doc shared.)
On the server side the following actions were running (in screen sessions):
1) heal info --xml ====> running for the last 4 days; I am still waiting for
its o/p (refer BZ#1382686)
2) healing has been going on for about 3 hours, since the bricks were brought
back online after about a week
3) a snapshot is scheduled every 1 hour

Client side IO:
=============
from 4 clients: lookups using ls -lRt
from the same 4 clients: symlinks were being created for the same target directories
from another 4 clients: 2 clients are creating the same directory structure, while
the other 2 are renaming directories
from another 2 clients: appends to the same file




Backtrace using gdb:

(gdb) bt
#0  0x00007fe1e15cbab4 in vfprintf () from /lib64/libc.so.6
#1  0x00007fe1e168ee25 in __vsnprintf_chk () from /lib64/libc.so.6
#2  0x00007fe1e2ef9598 in vsnprintf (__ap=0x7fe152cf8a70, __fmt=<optimized out>, __n=0, __s=0x0)
    at /usr/include/bits/stdio2.h:77
#3  gf_vasprintf (string_ptr=string_ptr@entry=0x7fe152cf8b78,
    format=format@entry=0x7fe1d52c01b0 "op=%s;path=%s;error=%s;brick=%s:%s",
    arg=arg@entry=0x7fe152cf8b90) at mem-pool.c:219
#4  0x00007fe1e2f482da in gf_event (event=event@entry=EVENT_POSIX_HEALTH_CHECK_FAILED,
    fmt=fmt@entry=0x7fe1d52c01b0 "op=%s;path=%s;error=%s;brick=%s:%s") at events.c:84
#5  0x00007fe1d52b7660 in posix_fs_health_check (this=this@entry=0x7fe1d0006dd0)
    at posix-helpers.c:1795
#6  0x00007fe1d52b77e4 in posix_health_check_thread_proc (data=0x7fe1d0006dd0)
    at posix-helpers.c:1833
#7  0x00007fe1e1d34dc5 in start_thread () from /lib64/libpthread.so.0
#8  0x00007fe1e1679ced in clone () from /lib64/libc.so.6



Probable root cause (based on initial findings by Raghavendra Gowdappa)
===================================================================
op_errno is an integer, but it is passed to gf_event() as the argument for a
%s conversion, so vfprintf() treats the integer as a string pointer and
dereferences it


1794                            "%s() on %s returned", op, file_path);
1795                    gf_event (EVENT_POSIX_HEALTH_CHECK_FAILED,
1796                              "op=%s;path=%s;error=%s;brick=%s:%s", op, file_path,
1797                              op_errno, priv->hostname, priv->base_path);
1798            }
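
The format string on line 1796 uses %s for the error field, while op_errno is
an int. Below is a minimal sketch of one way to pass a matching argument,
assuming the errno is converted to text with strerror(). format_event() and
health_check_payload() are hypothetical stand-ins written for this
illustration; they are not GlusterFS functions, and the committed patch may
have fixed the call differently.

```c
#include <errno.h>
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for gf_event(): it only formats the payload. */
static int
format_event (char *buf, size_t len, const char *fmt, ...)
{
        va_list ap;
        int     ret;

        va_start (ap, fmt);
        ret = vsnprintf (buf, len, fmt, ap);
        va_end (ap);
        return ret;
}

/* Hypothetical helper: builds the health-check payload. */
static int
health_check_payload (char *buf, size_t len, const char *op,
                      const char *path, int op_errno,
                      const char *host, const char *base_path)
{
        /* Passing op_errno itself for "error=%s" would make vfprintf()
         * read a small integer as a char pointer. Converting it with
         * strerror() gives %s the string it expects. */
        return format_event (buf, len, "op=%s;path=%s;error=%s;brick=%s:%s",
                             op, path, strerror (op_errno), host, base_path);
}
```

An alternative with the same effect would be to keep passing the int and
change the conversion to error=%d; which of the two is preferable depends on
what the event consumers expect.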


Brick logs:
==============



tat on parent /rhs/brick1/distrepvol/rootdir1/symlink failed [Input/output
error]
[2016-10-17 11:26:48.124026] W [MSGID: 113018]
[posix-helpers.c:667:posix_pstat] 0-distrepvol-posix: lstat failed on
/rhs/brick1/distrepvol/rootdir1/symlink [Input/output error]
[2016-10-17 11:26:48.124037] E [MSGID: 113018] [posix.c:237:posix_lookup]
0-distrepvol-posix: post-operation lstat on parent
/rhs/brick1/distrepvol/rootdir1/symlink failed [Input/output error]
[2016-10-17 11:26:48.124051] E [MSGID: 115050]
[server-rpc-fops.c:158:server_lookup_cbk] 0-distrepvol-server: 2701980: LOOKUP
/rootdir1/symlink/file.559585
(603542a5-8221-4bde-8869-09f0167ecb80/file.559585) ==> (Input/output error)
[Input/output error]
[2016-10-17 11:26:48.124075] E [MSGID: 115050]
[server-rpc-fops.c:158:server_lookup_cbk] 0-distrepvol-server: 2653096: LOOKUP
/rootdir1/symlink/file.561231
(603542a5-8221-4bde-8869-09f0167ecb80/file.561231) ==> (Input/output error)
[Input/output error]
[2016-10-17 11:26:48.124326] W [MSGID: 113075]
[posix-helpers.c:1794:posix_fs_health_check] 0-distrepvol-posix: open() on
/rhs/brick1/distrepvol/.glusterfs/health_check returned [Input/output error]
[2016-10-17 11:26:48.124523] W [MSGID: 113018]
[posix-helpers.c:667:posix_pstat] 0-distrepvol-posix: lstat failed on
/rhs/brick1/distrepvol/rootdir1/symlink/file.520670 [Input/output error]
[2016-10-17 11:26:48.124549] W [MSGID: 113018]
[posix-helpers.c:667:posix_pstat] 0-distrepvol-posix: lstat failed on
/rhs/brick1/distrepvol/rootdir1/symlink/file.520670 [Input/output error]
[2016-10-17 11:26:48.124550] W [MSGID: 113018] [posix.c:199:posix_lookup]
0-distrepvol-posix: lstat on
/rhs/brick1/distrepvol/rootdir1/symlink/file.520670 failed [Input/output error]
[2016-10-17 11:26:48.124568] W [MSGID: 113018]
[posix-helpers.c:667:posix_pstat] 0-distrepvol-posix: lstat failed on
/rhs/brick1/distrepvol/rootdir1/symlink [Input/output error]
[2016-10-17 11:26:48.124593] E [MSGID: 113018] [posix.c:237:posix_lookup]
0-distrepvol-posix: post-operation lstat on parent
/rhs/brick1/distrepvol/rootdir1/symlink failed [Input/output error]
pending frames:
frame : type(0) op(27)
frame : type(0) op(27)
frame : type(0) op(27)
patchset: git://git.gluster.com/glusterfs.git
[2016-10-17 11:26:48.124625] E [MSGID: 115050]
[server-rpc-fops.c:158:server_lookup_cbk] 0-distrepvol-server: 2757018: LOOKUP
/rootdir1/symlink/file.520670
(603542a5-8221-4bde-8869-09f0167ecb80/file.520670) ==> (Input/output error)
[Input/output error]
[2016-10-17 11:26:48.124978] W [MSGID: 113018]
[posix-helpers.c:667:posix_pstat] 0-distrepvol-posix: lstat failed on
/rhs/brick1/distrepvol/rootdir1/symlink/file.561236 [Input/output error]
[2016-10-17 11:26:48.125012] W [MSGID: 113018]
[posix-helpers.c:667:posix_pstat] 0-distrepvol-posix: lstat failed on
/rhs/brick1/distrepvol/rootdir1/symlink/file.561236 [Input/output error]
[2016-10-17 11:26:48.125025] W [MSGID: 113018] [posix.c:199:posix_lookup]
0-distrepvol-posix: lstat on
/rhs/brick1/distrepvol/rootdir1/symlink/file.561236 failed [Input/output error]
[2016-10-17 11:26:48.125035] W [MSGID: 113018]
[posix-helpers.c:667:posix_pstat] 0-distrepvol-posix: lstat failed on
/rhs/brick1/distrepvol/rootdir1/symlink [Input/output error]
[2016-10-17 11:26:48.125042] E [MSGID: 113018] [posix.c:237:posix_lookup]
0-distrepvol-posix: post-operation lstat on parent
/rhs/brick1/distrepvol/rootdir1/symlink failed [Input/output error]
[2016-10-17 11:26:48.125057] W [MSGID: 113018]
[posix-helpers.c:667:posix_pstat] 0-distrepvol-posix: lstat failed on
/rhs/brick1/distrepvol/rootdir1/symlink/file.559585 [Input/output error]
signal received: 11
time of crash: 
2016-10-17 11:26:48
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.8.4
/lib64/libglusterfs.so.0(_gf_msg_backtrace_nomem+0xc2)[0x7f9e28ae3832]
/lib64/libglusterfs.so.0(gf_print_trace+0x324)[0x7f9e28aed2c4]
/lib64/libc.so.6(+0x35670)[0x7f9e271c8670]
/lib64/libc.so.6(_IO_vfprintf+0x1564)[0x7f9e271dbab4]
/lib64/libc.so.6(__vsnprintf_chk+0x95)[0x7f9e2729ee25]
/lib64/libglusterfs.so.0(gf_vasprintf+0x68)[0x7f9e28b09598]
/lib64/libglusterfs.so.0(gf_event+0x1aa)[0x7f9e28b582da]
/usr/lib64/glusterfs/3.8.4/xlator/storage/posix.so(+0x29660)[0x7f9e1aec7660]
/usr/lib64/glusterfs/3.8.4/xlator/storage/posix.so(+0x297e4)[0x7f9e1aec77e4]
/lib64/libpthread.so.0(+0x7dc5)[0x7f9e27944dc5]
/lib64/libc.so.6(clone+0x6d)[0x7f9e27289ced]
---------

--- Additional comment from Worker Ant on 2016-10-18 06:55:48 EDT ---

REVIEW: http://review.gluster.org/15671 (events: Add FMT_WARN for gf_event)
posted (#1) for review on master by Pranith Kumar Karampuri
(pkarampu at redhat.com)

--- Additional comment from Worker Ant on 2016-10-18 14:26:53 EDT ---

REVIEW: http://review.gluster.org/15671 (events: Add FMT_WARN for gf_event)
posted (#2) for review on master by Pranith Kumar Karampuri
(pkarampu at redhat.com)

--- Additional comment from Worker Ant on 2016-11-09 11:11:41 EST ---

REVIEW: http://review.gluster.org/15671 (events: Add FMT_WARN for gf_event)
posted (#3) for review on master by Pranith Kumar Karampuri
(pkarampu at redhat.com)

--- Additional comment from Worker Ant on 2016-11-18 04:56:43 EST ---

COMMIT: http://review.gluster.org/15671 committed in master by Pranith Kumar
Karampuri (pkarampu at redhat.com) 
------
commit 5310be8838f8db748a698bd3a98f8d00a4114e65
Author: Pranith Kumar K <pkarampu at redhat.com>
Date:   Tue Oct 18 15:16:17 2016 +0530

    events: Add FMT_WARN for gf_event

    Raghavendra G found that posix is trying to print %s
    but passing an int when HEALTH_CHECK fails in posix.
    These are the kind of bugs that should be caught
    at compilation itself.
    Also fixed the problematic gf_event() callers.

    BUG: 1386097
    Change-Id: Id7bd6d9a9690237cec3ca1aefa2aac085e8a1270
    Signed-off-by: Pranith Kumar K <pkarampu at redhat.com>
    Reviewed-on: http://review.gluster.org/15671
    Smoke: Gluster Build System <jenkins at build.gluster.org>
    NetBSD-regression: NetBSD Build System <jenkins at build.gluster.org>
    Reviewed-by: Atin Mukherjee <amukherj at redhat.com>
    CentOS-regression: Gluster Build System <jenkins at build.gluster.org>
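
The compile-time check mentioned in the commit message can be approximated
with the printf format attribute supported by GCC and Clang. This is a sketch
only: FMT_CHECK and emit_event() are hypothetical names chosen for this
example, and the actual macro and gf_event() declaration in the patch may
differ.

```c
#include <stdarg.h>
#include <stdio.h>

/* Hypothetical macro: asks the compiler to check printf-style arguments. */
#if defined(__GNUC__)
#define FMT_CHECK(fmt_idx, first_arg) \
        __attribute__ ((__format__ (__printf__, fmt_idx, first_arg)))
#else
#define FMT_CHECK(fmt_idx, first_arg)
#endif

/* Hypothetical stand-in for gf_event(): argument 4 is the format string,
 * and the variadic arguments start at position 5. */
static int emit_event (char *buf, size_t len, int event, const char *fmt, ...)
        FMT_CHECK (4, 5);

static int
emit_event (char *buf, size_t len, int event, const char *fmt, ...)
{
        va_list ap;
        int     used, ret;

        /* Write the event id first, then append the caller's payload. */
        used = snprintf (buf, len, "event=%d;", event);
        if (used < 0 || (size_t) used >= len)
                return -1;

        va_start (ap, fmt);
        ret = vsnprintf (buf + used, len - used, fmt, ap);
        va_end (ap);
        return (ret < 0) ? -1 : used + ret;
}
```

With the attribute in place and -Wformat enabled (it is part of -Wall), a call
such as emit_event (buf, len, 1, "error=%s", op_errno) with an int op_errno
draws a warning at compile time, and -Werror=format turns it into a build
failure.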


Referenced Bugs:

https://bugzilla.redhat.com/show_bug.cgi?id=1385606
[Bug 1385606] 4 of 8 bricks (2 dht subvols) crashed  on systemic setup
https://bugzilla.redhat.com/show_bug.cgi?id=1386097
[Bug 1386097] 4 of 8 bricks (2 dht subvols) crashed  on systemic setup

