[Bugs] [Bug 1663132] New: [Ganesha] Ganesha failed on one node while exporting volumes in loop

bugzilla at redhat.com bugzilla at redhat.com
Thu Jan 3 09:29:28 UTC 2019


https://bugzilla.redhat.com/show_bug.cgi?id=1663132

            Bug ID: 1663132
           Summary: [Ganesha] Ganesha failed on one node while exporting
                    volumes in loop
           Product: GlusterFS
           Version: 4.1
          Hardware: All
                OS: All
            Status: NEW
         Component: libgfapi
          Keywords: ZStream
          Severity: high
          Priority: high
          Assignee: bugs at gluster.org
          Reporter: skoduri at redhat.com
        QA Contact: bugs at gluster.org
                CC: bugs at gluster.org
        Depends On: 1660577
            Blocks: 1658132, 1663131
  Target Milestone: ---
    Classification: Community



+++ This bug was initially created as a clone of Bug #1660577 +++

+++ This bug was initially created as a clone of Bug #1658132 +++

Description of problem:
-----------------------
Ganesha entered a failed state on one node of the four-node cluster while
exporting volumes in a loop. The test tried to export 109 volumes one after
the other in a loop.

===============================================================================
Version-Release number of selected component (if applicable):
-------------------------------------------------------------
nfs-ganesha-2.5.5-10.el7rhgs.x86_64
nfs-ganesha-gluster-2.5.5-10.el7rhgs.x86_64
glusterfs-ganesha-3.12.2-28.el7rhgs.x86_64

===============================================================================
How reproducible:
-----------------
1/1

===============================================================================
Steps to Reproduce:
-------------------
1. Create a 4-node Ganesha cluster.
2. Create and start 100 or more volumes.
3. Verify the status of all volumes.
4. Export the volumes one after the other in a loop.

===============================================================================
Actual results:
---------------
Ganesha entered a failed state on one of the nodes.

===============================================================================
Expected results:
-----------------
No failure should be observed.

==============================================================================
Additional info:
----------------
* All volumes were exported on the other 3 nodes of the 4-node cluster.
* The failure was observed on a different node than the one from which the
export operation was executed.

The setup has been kept in the same state and can be shared if required.

--- Additional comment from Red Hat Bugzilla Rules Engine on 2018-12-11
10:35:37 UTC ---

This bug is automatically being proposed for a Z-stream release of Red Hat
Gluster Storage 3 under active development and open for bug fixes, by setting
the release flag 'rhgs-3.4.z' to '?'.

If this bug should be proposed for a different release, please manually change
the proposed release flag.

--- Additional comment from Jilju Joy on 2018-12-11 10:37:00 UTC ---

Logs and the sosreport will be shared shortly.

--- Additional comment from Jilju Joy on 2018-12-11 11:59:20 UTC ---

Logs and sosreport :
http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/jj/1658132/

--- Additional comment from Soumya Koduri on 2018-12-11 16:26:13 UTC ---

(gdb) bt
#0  __memcmp_sse4_1 () at ../sysdeps/x86_64/multiarch/memcmp-sse4.S:74
#1  0x00007f5d18130664 in glfs_mgmt_getspec_cbk (req=<optimized out>,
iov=<optimized out>, count=<optimized out>, myframe=0x7f5b74002cb0) at
glfs-mgmt.c:625
#2  0x00007f5c8e9e8960 in rpc_clnt_handle_reply
(clnt=clnt@entry=0x7f5a08cc5760, pollin=pollin@entry=0x7f5b7f09acb0) at
rpc-clnt.c:778
#3  0x00007f5c8e9e8d03 in rpc_clnt_notify (trans=<optimized out>,
mydata=0x7f5a08cc5790, event=<optimized out>, data=0x7f5b7f09acb0) at
rpc-clnt.c:971
#4  0x00007f5c8e9e4a73 in rpc_transport_notify (this=this@entry=0x7f5a08cc5930,
event=event@entry=RPC_TRANSPORT_MSG_RECEIVED, data=data@entry=0x7f5b7f09acb0)
at rpc-transport.c:538
#5  0x00007f5c849e5576 in socket_event_poll_in (this=this@entry=0x7f5a08cc5930,
notify_handled=<optimized out>) at socket.c:2322
#6  0x00007f5c849e7b1c in socket_event_handler (fd=565, idx=0, gen=1,
data=0x7f5a08cc5930, poll_in=1, poll_out=0, poll_err=0) at socket.c:2474
#7  0x00007f5c8ec7e824 in event_dispatch_epoll_handler (event=0x7f59e1f44500,
event_pool=0x7f5a08cb74f0) at event-epoll.c:583
#8  event_dispatch_epoll_worker (data=0x7f5b760922a0) at event-epoll.c:659
#9  0x00007f5d20e44dd5 in start_thread (arg=0x7f59e1f45700) at
pthread_create.c:307
#10 0x00007f5d2050fead in clone () at
../sysdeps/unix/sysv/linux/x86_64/clone.S:111
(gdb) f 1
#1  0x00007f5d18130664 in glfs_mgmt_getspec_cbk (req=<optimized out>,
iov=<optimized out>, count=<optimized out>, myframe=0x7f5b74002cb0) at
glfs-mgmt.c:625
625                 (memcmp (fs->oldvolfile, rsp.spec, size) == 0)) {
(gdb) l
620     
621             ret = 0;
622             size = rsp.op_ret;
623     
624             if ((size == fs->oldvollen) &&
625                 (memcmp (fs->oldvolfile, rsp.spec, size) == 0)) {
626                     gf_msg (frame->this->name, GF_LOG_INFO, 0,
627                             API_MSG_VOLFILE_INFO,
628                             "No change in volfile, continuing");
629                     goto out;
(gdb) p fs->olvollen
There is no member named olvollen.
(gdb) p fs->oldvollen
$1 = 1674
(gdb) p size
$2 = 1674
(gdb) p fs->oldvolfile
$3 = 0x7f5b76097cd0 "volume testvol82201-client-0\n    type protocol/client\n  
 option send-gids true\n    option transport.socket.keepalive-count 9\n   
option transport.socket.keepalive-interval 2\n    option transport.sock"...
(gdb) p rsp.spec
$4 = 0x7f5b7f9da9d0 "volume testvol82201-client-0\n    type protocol/client\n  
 option send-gids true\n    option transport.socket.keepalive-count 9\n   
option transport.socket.keepalive-interval 2\n    option transport.sock"...
(gdb) 


The crash happened while doing a memcmp of fs->oldvolfile and the new volfile
received in the response (rsp.spec). The contents of both variables seem
fine in the core.

From code reading I observed that we update fs->oldvollen and fs->oldvolfile
under the fs->mutex lock, but that lock is not taken while reading those
values here in glfs_mgmt_getspec_cbk. That could have resulted in the crash
while accessing an uninitialized or partially initialized variable.
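
A minimal, self-contained sketch of the locking pattern described above, using
hypothetical stand-ins for the gfapi internals (the real fields are fs->mutex,
fs->oldvollen and fs->oldvolfile); this is an illustration only, not the
actual glfs-mgmt.c code or the posted patch:

/*
 * Illustration of the race described above.  The type and function
 * names here are hypothetical; they are not the gfapi implementation.
 */
#include <pthread.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>

struct fake_fs {
        pthread_mutex_t mutex;
        ssize_t         oldvollen;
        char           *oldvolfile;
};

/* Writer path: the cached volfile is replaced under the lock. */
void
update_volfile (struct fake_fs *fs, const char *spec, ssize_t size)
{
        pthread_mutex_lock (&fs->mutex);
        free (fs->oldvolfile);
        fs->oldvolfile = malloc (size);
        if (fs->oldvolfile) {
                memcpy (fs->oldvolfile, spec, size);
                fs->oldvollen = size;
        } else {
                fs->oldvollen = 0;
        }
        pthread_mutex_unlock (&fs->mutex);
}

/*
 * Reader path: if the length check and memcmp() run without the lock,
 * another thread may free or partially rewrite oldvolfile underneath
 * them -- the crash seen in frame #1 above.  Holding the same lock
 * across the comparison closes that window.
 */
int
volfile_unchanged (struct fake_fs *fs, const char *spec, ssize_t size)
{
        int unchanged = 0;

        pthread_mutex_lock (&fs->mutex);
        if ((size == fs->oldvollen) && fs->oldvolfile &&
            (memcmp (fs->oldvolfile, spec, size) == 0))
                unchanged = 1;
        pthread_mutex_unlock (&fs->mutex);

        return unchanged;
}

int
main (void)
{
        struct fake_fs fs = { .mutex = PTHREAD_MUTEX_INITIALIZER };
        const char *spec = "volume testvol-client-0 ...";
        ssize_t len = (ssize_t) strlen (spec);

        update_volfile (&fs, spec, len);
        return volfile_unchanged (&fs, spec, len) ? 0 : 1;
}

The point is that the lock protecting the writer must also be held across both
the length check and the memcmp(), so the reader never sees a freed or
half-updated buffer.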

@Jilju,

Are you able to consistently reproduce this issue?

--- Additional comment from Daniel Gryniewicz on 2018-12-11 16:33:41 UTC ---

Are the buffers smaller than 1674?  It might be going off the end of one of the
buffers.

--- Additional comment from Jilju Joy on 2018-12-12 04:50:00 UTC ---

(In reply to Soumya Koduri from comment #4)
> [...]
> @Jilju,
> 
> Are you able to consistently reproduce this issue?

Hi Soumya,

The first occurrence is reported here. The setup has been kept in the same
state to help with debugging. I can share the setup if required, or I can try
to reproduce the issue.

--- Additional comment from Worker Ant on 2018-12-18 17:05:42 UTC ---

REVIEW: https://review.gluster.org/21882 (gfapi: Access fs->oldvolfile under
mutex lock) posted (#1) for review on master by soumya k

--- Additional comment from Worker Ant on 2018-12-26 02:17:03 UTC ---

REVIEW: https://review.gluster.org/21882 (gfapi: Access fs->oldvolfile under
mutex lock) posted (#2) for review on master by Amar Tumballi

--- Additional comment from Worker Ant on 2018-12-26 10:33:07 UTC ---

REVIEW: https://review.gluster.org/21927 (gfapi: nit cleanup related to
releasing fs->mutex lock) posted (#1) for review on master by soumya k

--- Additional comment from Worker Ant on 2018-12-31 16:10:41 UTC ---

REVIEW: https://review.gluster.org/21927 (gfapi: nit cleanup related to
releasing fs->mutex lock) posted (#2) for review on master by Kaleb KEITHLEY


Referenced Bugs:

https://bugzilla.redhat.com/show_bug.cgi?id=1658132
[Bug 1658132] [Ganesha] Ganesha failed on one node while exporting volumes in
loop
https://bugzilla.redhat.com/show_bug.cgi?id=1660577
[Bug 1660577] [Ganesha] Ganesha failed on one node while exporting volumes in
loop
https://bugzilla.redhat.com/show_bug.cgi?id=1663131
[Bug 1663131] [Ganesha] Ganesha failed on one node while exporting volumes in
loop
-- 
You are receiving this mail because:
You are the QA Contact for the bug.
You are on the CC list for the bug.
You are the assignee for the bug.

