[Bugs] [Bug 1651439] New: gluster-NFS crash while expanding volume

Tue Nov 20 05:56:00 UTC 2018

https://bugzilla.redhat.com/show_bug.cgi?id=1651439

            Bug ID: 1651439
           Summary: gluster-NFS crash while expanding volume
           Product: GlusterFS
           Version: mainline
         Component: nfs
          Severity: medium
          Assignee: jthottan at redhat.com
          Reporter: jthottan at redhat.com
                CC: bugs at gluster.org, dang at redhat.com, ffilz at redhat.com,
                    grajoria at redhat.com, kkeithle at redhat.com,
                    mbenjamin at redhat.com, msaini at redhat.com,
                    rhs-bugs at redhat.com, sankarshan at redhat.com,
                    skoduri at redhat.com, storage-qa-internal at redhat.com,
                    vavuthu at redhat.com

+++ This bug was initially created as a clone of Bug #1633177 +++

Description of problem:

gluster-NFS is crashed while expanding volume

Version-Release number of selected component (if applicable):

glusterfs-3.12.2-18.1.el7rhgs.x86_64

How reproducible: 

Steps to Reproduce:

While running automation runs, gluster-NFS is crashed while expanding volume

1) create distribute volume ( 1 * 4 )
2) write IO from 2 clients
3) Add bricks while IO is in progress
4) start re-balance
5) check for IO 

After step 5), mount point is hung due to gluster-NFS crash.

Actual results:

gluster-NFS crash and IO is hung

Expected results:

IO should be success

Additional info:

volume info:

[root at rhsauto023 glusterfs]# gluster vol info

Volume Name: testvol_distributed
Type: Distribute
Volume ID: a809a120-f582-4358-8a70-5c53f71734ee
Status: Started
Snapshot Count: 0
Number of Bricks: 5
Transport-type: tcp
Bricks:
Brick1:
rhsauto023.lab.eng.blr.redhat.com:/bricks/brick0/testvol_distributed_brick0
Brick2:
rhsauto030.lab.eng.blr.redhat.com:/bricks/brick0/testvol_distributed_brick1
Brick3:
rhsauto031.lab.eng.blr.redhat.com:/bricks/brick0/testvol_distributed_brick2
Brick4:
rhsauto027.lab.eng.blr.redhat.com:/bricks/brick0/testvol_distributed_brick3
Brick5:
rhsauto023.lab.eng.blr.redhat.com:/bricks/brick1/testvol_distributed_brick4
Options Reconfigured:
transport.address-family: inet
nfs.disable: off
[root at rhsauto023 glusterfs]# 

> volume status

[root at rhsauto023 glusterfs]# gluster vol status
Status of volume: testvol_distributed
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick rhsauto023.lab.eng.blr.redhat.com:/br
icks/brick0/testvol_distributed_brick0      49153     0          Y       22557
Brick rhsauto030.lab.eng.blr.redhat.com:/br
icks/brick0/testvol_distributed_brick1      49153     0          Y       21814
Brick rhsauto031.lab.eng.blr.redhat.com:/br
icks/brick0/testvol_distributed_brick2      49153     0          Y       20441
Brick rhsauto027.lab.eng.blr.redhat.com:/br
icks/brick0/testvol_distributed_brick3      49152     0          Y       19886
Brick rhsauto023.lab.eng.blr.redhat.com:/br
icks/brick1/testvol_distributed_brick4      49152     0          Y       23019
NFS Server on localhost                     N/A       N/A        N       N/A  
NFS Server on rhsauto027.lab.eng.blr.redhat
.com                                        2049      0          Y       20008
NFS Server on rhsauto033.lab.eng.blr.redhat
.com                                        2049      0          Y       19752
NFS Server on rhsauto030.lab.eng.blr.redhat
.com                                        2049      0          Y       21936
NFS Server on rhsauto031.lab.eng.blr.redhat
.com                                        2049      0          Y       20557
NFS Server on rhsauto040.lab.eng.blr.redhat
.com                                        2049      0          Y       20047

Task Status of Volume testvol_distributed
------------------------------------------------------------------------------
Task                 : Rebalance           
ID                   : 8e5b404f-5740-4d87-a0d7-3ce94178329f
Status               : completed           

[root at rhsauto023 glusterfs]#

> NFS crash

[2018-09-25 13:58:35.381085] I [dict.c:471:dict_get]
(-->/usr/lib64/glusterfs/3.12.2/xlator/protocol/client.so(+0x22f5d)
[0x7f93543fdf5d] -->/usr/lib64/glusterfs/3.12.2/xlator/cluster/distri
bute.so(+0x202e7) [0x7f93541572e7] -->/lib64/libglusterfs.so.0(dict_get+0x10c)
[0x7f9361aefb3c] ) 0-dict: !this || key=trusted.glusterfs.dht.mds [Invalid
argument]
pending frames:
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
patchset: git://git.gluster.org/glusterfs.git
signal received: 11
time of crash: 
2018-09-25 13:58:36
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.12.2
/lib64/libglusterfs.so.0(_gf_msg_backtrace_nomem+0xa0)[0x7f9361af8cc0]
/lib64/libglusterfs.so.0(gf_print_trace+0x334)[0x7f9361b02c04]
/lib64/libc.so.6(+0x36280)[0x7f9360158280]
/lib64/libglusterfs.so.0(+0x3b6fa)[0x7f9361b086fa]
/lib64/libglusterfs.so.0(inode_parent+0x52)[0x7f9361b09822]
/usr/lib64/glusterfs/3.12.2/xlator/nfs/server.so(+0xc243)[0x7f934f95c243]
/usr/lib64/glusterfs/3.12.2/xlator/nfs/server.so(+0x3e1d8)[0x7f934f98e1d8]
/usr/lib64/glusterfs/3.12.2/xlator/nfs/server.so(+0x3ea2b)[0x7f934f98ea2b]
/usr/lib64/glusterfs/3.12.2/xlator/nfs/server.so(+0x3ead5)[0x7f934f98ead5]
/usr/lib64/glusterfs/3.12.2/xlator/nfs/server.so(+0x3ecf8)[0x7f934f98ecf8]
/usr/lib64/glusterfs/3.12.2/xlator/nfs/server.so(+0x29d7c)[0x7f934f979d7c]
/usr/lib64/glusterfs/3.12.2/xlator/nfs/server.so(+0x2a184)[0x7f934f97a184]
/lib64/libgfrpc.so.0(rpcsvc_handle_rpc_call+0x325)[0x7f93618ba955]
/lib64/libgfrpc.so.0(rpcsvc_notify+0x10b)[0x7f93618bab3b]
/lib64/libgfrpc.so.0(rpc_transport_notify+0x23)[0x7f93618bca73]
/usr/lib64/glusterfs/3.12.2/rpc-transport/socket.so(+0x7566)[0x7f93566e2566]
/usr/lib64/glusterfs/3.12.2/rpc-transport/socket.so(+0x9b0c)[0x7f93566e4b0c]
/lib64/libglusterfs.so.0(+0x894c4)[0x7f9361b564c4]
/lib64/libpthread.so.0(+0x7dd5)[0x7f9360957dd5]
/lib64/libc.so.6(clone+0x6d)[0x7f9360220b3d]
---------

--- Additional comment from Red Hat Bugzilla Rules Engine on 2018-09-26
07:02:14 EDT ---

This bug is automatically being proposed for a Z-stream release of Red Hat
Gluster Storage 3 under active development and open for bug fixes, by setting
the release flag 'rhgs‑3.4.z' to '?'. 

If this bug should be proposed for a different release, please manually change
the proposed release flag.

--- Additional comment from Vijay Avuthu on 2018-09-26 07:03:44 EDT ---

SOS reports:
http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/vavuthu/nfs_crash_on_expanding_volume/

jenkin Job:
http://jenkins-rhs.lab.eng.blr.redhat.com:8080/view/Auto%20RHEL%207.5/job/auto-RHGS_Downstream_BVT_RHEL_7_5_RHGS_3_4_brew/28/consoleFull

Glusto Logs :
http://jenkins-rhs.lab.eng.blr.redhat.com:8080/view/Auto%20RHEL%207.5/job/auto-RHGS_Downstream_BVT_RHEL_7_5_RHGS_3_4_brew/ws/glusto_28.log

--- Additional comment from Jiffin on 2018-09-27 08:07:28 EDT ---

0  0x00007f9361b086fa in __inode_get_xl_index (xlator=0x7f9350018d30,
inode=0x7f933c0133b0) at inode.c:455
455            if ((inode->_ctx[xlator->xl_id].xl_key != NULL) &&
Missing separate debuginfos, use: debuginfo-install glibc-2.17-222.el7.x86_64
keyutils-libs-1.5.8-3.el7.x86_64 krb5-libs-1.15.1-19.el7.x86_64
libacl-2.2.51-14.el7.x86_64 libattr-2.4.46-13.el7.x86_64
libcom_err-1.42.9-12.el7_5.x86_64 libgcc-4.8.5-28.el7_5.1.x86_64
libselinux-2.5-12.el7.x86_64 libuuid-2.23.2-52.el7_5.1.x86_64
openssl-libs-1.0.2k-12.el7.x86_64 pcre-8.32-17.el7.x86_64
zlib-1.2.7-17.el7.x86_64
(gdb) bt
#0  0x00007f9361b086fa in __inode_get_xl_index (xlator=0x7f9350018d30,
inode=0x7f933c0133b0) at inode.c:455
#1  __inode_ref (inode=inode at entry=0x7f933c0133b0) at inode.c:537
#2  0x00007f9361b09822 in inode_parent (inode=inode at entry=0x7f933c01d990,
pargfid=pargfid at entry=0x7f93400aa2e8 "", name=name at entry=0x0) at inode.c:1359
#3  0x00007f934f95c243 in nfs_inode_loc_fill (inode=inode at entry=0x7f933c01d990,
loc=loc at entry=0x7f93400aa2b8, how=how at entry=1) at nfs-common.c:206
#4  0x00007f934f98e1d8 in nfs3_fh_resolve_inode_done
(cs=cs at entry=0x7f93400a9df0, inode=inode at entry=0x7f933c01d990) at
nfs3-helpers.c:3611
#5  0x00007f934f98ea2b in nfs3_fh_resolve_inode (cs=0x7f93400a9df0) at
nfs3-helpers.c:3828
#6  0x00007f934f98ead5 in nfs3_fh_resolve_resume (cs=cs at entry=0x7f93400a9df0)
at nfs3-helpers.c:3860
#7  0x00007f934f98ecf8 in nfs3_fh_resolve_root (cs=cs at entry=0x7f93400a9df0) at
nfs3-helpers.c:3915
#8  0x00007f934f98ef41 in nfs3_fh_resolve_and_resume
(cs=cs at entry=0x7f93400a9df0, fh=fh at entry=0x7f934e195ae0, entry=entry at entry=0x0,
resum_fn=resum_fn at entry=0x7f934f9798b0 <nfs3_access_resume>)
    at nfs3-helpers.c:4011
#9  0x00007f934f979d7c in nfs3_access (req=req at entry=0x7f934022dcd0,
fh=fh at entry=0x7f934e195ae0, accbits=31) at nfs3.c:1783
#10 0x00007f934f97a184 in nfs3svc_access (req=0x7f934022dcd0) at nfs3.c:1819
#11 0x00007f93618ba955 in rpcsvc_handle_rpc_call (svc=0x7f935002c430,
trans=trans at entry=0x7f935007a960, msg=<optimized out>) at rpcsvc.c:695
#12 0x00007f93618bab3b in rpcsvc_notify (trans=0x7f935007a960,
mydata=<optimized out>, event=<optimized out>, data=<optimized out>) at
rpcsvc.c:789
#13 0x00007f93618bca73 in rpc_transport_notify (this=this at entry=0x7f935007a960,
event=event at entry=RPC_TRANSPORT_MSG_RECEIVED, data=data at entry=0x7f9340031290)
at rpc-transport.c:538
#14 0x00007f93566e2566 in socket_event_poll_in (this=this at entry=0x7f935007a960,
notify_handled=<optimized out>) at socket.c:2315
#15 0x00007f93566e4b0c in socket_event_handler (fd=10, idx=7, gen=46,
data=0x7f935007a960, poll_in=1, poll_out=0, poll_err=0) at socket.c:2467
#16 0x00007f9361b564c4 in event_dispatch_epoll_handler (event=0x7f934e195e80,
event_pool=0x55c696306210) at event-epoll.c:583
#17 event_dispatch_epoll_worker (data=0x7f9350043b00) at event-epoll.c:659
#18 0x00007f9360957dd5 in start_thread () from /lib64/libpthread.so.0
#19 0x00007f9360220b3d in clone () from /lib64/libc.so.6

Above as part of nfs_local_filling() it was trying to find the parent inode and
there is valid inode for parent as well, but context for that inode is NULL.
>From code reading  i was not able to find place in which ctx is NULL with valid
inode

p *inode -- parent
$27 = {table = 0x7f935002d000, gfid = "{\033g\270K\202B\202\211\320B\"\373u",
<incomplete sequence \311>, lock = {spinlock = 0, mutex = {__data = {__lock =
0, __count = 0, __owner = 0, __nusers = 0, 
        __kind = -1, __spins = 0, __elision = 0, __list = {__prev = 0x0, __next
= 0x0}}, __size = '\000' <repeats 16 times>, "\377\377\377\377", '\000'
<repeats 19 times>, __align = 0}}, nlookup = 0, 
  fd_count = 0, active_fd_count = 0, ref = 1, ia_type = IA_IFDIR, fd_list =
{next = 0x7f933c013408, prev = 0x7f933c013408}, dentry_list = {next =
0x7f933c013418, prev = 0x7f933c013418}, hash = {
    next = 0x7f933c013428, prev = 0x7f933c013428}, list = {next =
0x7f93503a5408, prev = 0x7f935002d060}, _ctx = 0x0}

I tried to reproduce the issue(twice) but, it was not hitting in my setup.

Requesting Vijay to recheck how frequently it can be reproduced and please try
to run ith debug log level for nfs-server(diagonsis-cient log level)

-- 
You are receiving this mail because:
You are on the CC list for the bug.
Unsubscribe from this bug https://bugzilla.redhat.com/token.cgi?t=NIu533EY8g&a=cc_unsubscribe