[Bugs] [Bug 1663132] New: [Ganesha] Ganesha failed on one node while exporting volumes in loop

bugzilla at redhat.com bugzilla at redhat.com
Thu Jan 3 09:29:28 UTC 2019


https://bugzilla.redhat.com/show_bug.cgi?id=1663132

            Bug ID: 1663132
           Summary: [Ganesha] Ganesha failed on one node while exporting
                    volumes in loop
           Product: GlusterFS
           Version: 4.1
          Hardware: All
                OS: All
            Status: NEW
         Component: libgfapi
          Keywords: ZStream
          Severity: high
          Priority: high
          Assignee: bugs at gluster.org
          Reporter: skoduri at redhat.com
        QA Contact: bugs at gluster.org
                CC: bugs at gluster.org
        Depends On: 1660577
            Blocks: 1658132, 1663131
  Target Milestone: ---
    Classification: Community



+++ This bug was initially created as a clone of Bug #1660577 +++

+++ This bug was initially created as a clone of Bug #1658132 +++

Description of problem:
-----------------------
Ganesha entered a failed state on one node of the four-node cluster while
exporting volumes in a loop. The test tried to export 109 volumes one after
the other in a loop.

===============================================================================
Version-Release number of selected component (if applicable):
-------------------------------------------------------------
nfs-ganesha-2.5.5-10.el7rhgs.x86_64
nfs-ganesha-gluster-2.5.5-10.el7rhgs.x86_64
glusterfs-ganesha-3.12.2-28.el7rhgs.x86_64

===============================================================================
How reproducible:
-----------------
1/1

===============================================================================
Steps to Reproduce:
-------------------
1. Create a 4-node Ganesha cluster.
2. Create and start 100 or more volumes.
3. Verify the status of all volumes.
4. Export the volumes one after the other in a loop.

===============================================================================
Actual results:
---------------
Ganesha entered a failed state on one of the nodes.

===============================================================================
Expected results:
-----------------
No failure should be observed.

==============================================================================
Additional info:
----------------
* All volumes were exported on the other 3 nodes of the 4-node cluster.
* The failure was observed on a different node than the one from which the
export operation was executed.

The setup has been kept in the same state and can be shared if required.

--- Additional comment from Red Hat Bugzilla Rules Engine on 2018-12-11
10:35:37 UTC ---

This bug is automatically being proposed for a Z-stream release of Red Hat
Gluster Storage 3 under active development and open for bug fixes, by setting
the release flag 'rhgs-3.4.z' to '?'.

If this bug should be proposed for a different release, please manually change
the proposed release flag.

--- Additional comment from Jilju Joy on 2018-12-11 10:37:00 UTC ---

Logs and the sosreport will be shared shortly.

--- Additional comment from Jilju Joy on 2018-12-11 11:59:20 UTC ---

Logs and sosreport :
http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/jj/1658132/

--- Additional comment from Soumya Koduri on 2018-12-11 16:26:13 UTC ---

(gdb) bt
#0  __memcmp_sse4_1 () at ../sysdeps/x86_64/multiarch/memcmp-sse4.S:74
#1  0x00007f5d18130664 in glfs_mgmt_getspec_cbk (req=<optimized out>,
iov=<optimized out>, count=<optimized out>, myframe=0x7f5b74002cb0) at
glfs-mgmt.c:625
#2  0x00007f5c8e9e8960 in rpc_clnt_handle_reply
(clnt=clnt@entry=0x7f5a08cc5760, pollin=pollin@entry=0x7f5b7f09acb0) at
rpc-clnt.c:778
#3  0x00007f5c8e9e8d03 in rpc_clnt_notify (trans=<optimized out>,
mydata=0x7f5a08cc5790, event=<optimized out>, data=0x7f5b7f09acb0) at
rpc-clnt.c:971
#4  0x00007f5c8e9e4a73 in rpc_transport_notify (this=this@entry=0x7f5a08cc5930,
event=event@entry=RPC_TRANSPORT_MSG_RECEIVED, data=data@entry=0x7f5b7f09acb0)
at rpc-transport.c:538
#5  0x00007f5c849e5576 in socket_event_poll_in (this=this@entry=0x7f5a08cc5930,
notify_handled=<optimized out>) at socket.c:2322
#6  0x00007f5c849e7b1c in socket_event_handler (fd=565, idx=0, gen=1,
data=0x7f5a08cc5930, poll_in=1, poll_out=0, poll_err=0) at socket.c:2474
#7  0x00007f5c8ec7e824 in event_dispatch_epoll_handler (event=0x7f59e1f44500,
event_pool=0x7f5a08cb74f0) at event-epoll.c:583
#8  event_dispatch_epoll_worker (data=0x7f5b760922a0) at event-epoll.c:659
#9  0x00007f5d20e44dd5 in start_thread (arg=0x7f59e1f45700) at
pthread_create.c:307
#10 0x00007f5d2050fead in clone () at
../sysdeps/unix/sysv/linux/x86_64/clone.S:111
(gdb) f 1
#1  0x00007f5d18130664 in glfs_mgmt_getspec_cbk (req=<optimized out>,
iov=<optimized out>, count=<optimized out>, myframe=0x7f5b74002cb0) at
glfs-mgmt.c:625
625                 (memcmp (fs->oldvolfile, rsp.spec, size) == 0)) {
(gdb) l
620     
621             ret = 0;
622             size = rsp.op_ret;
623     
624             if ((size == fs->oldvollen) &&
625                 (memcmp (fs->oldvolfile, rsp.spec, size) == 0)) {
626                     gf_msg (frame->this->name, GF_LOG_INFO, 0,
627                             API_MSG_VOLFILE_INFO,
628                             "No change in volfile, continuing");
629                     goto out;
(gdb) p fs->olvollen
There is no member named olvollen.
(gdb) p fs->oldvollen
$1 = 1674
(gdb) p size
$2 = 1674
(gdb) p fs->oldvolfile
$3 = 0x7f5b76097cd0 "volume testvol82201-client-0\n    type protocol/client\n  
 option send-gids true\n    option transport.socket.keepalive-count 9\n   
option transport.socket.keepalive-interval 2\n    option transport.sock"...
(gdb) p rsp.spec
$4 = 0x7f5b7f9da9d0 "volume testvol82201-client-0\n    type protocol/client\n  
 option send-gids true\n    option transport.socket.keepalive-count 9\n   
option transport.socket.keepalive-interval 2\n    option transport.sock"...
(gdb) 


The crash happened while doing a memcmp of fs->oldvolfile and the new volfile
received in the response (rsp.spec). The contents of both variables seem
fine in the core.

From code reading I observed that we update fs->oldvollen and fs->oldvolfile
under the fs->mutex lock, but that lock is not taken while reading those
values here in glfs_mgmt_getspec_cbk. That could have resulted in the crash
while accessing an uninitialized or partially initialized variable.
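
A minimal, self-contained sketch of the locking pattern described above, using
hypothetical stand-ins for the gfapi internals (the real fields are fs->mutex,
fs->oldvollen and fs->oldvolfile); this is an illustration only, not the
actual glfs-mgmt.c code or the posted patch:

/*
 * Illustration of the race described above.  The type and function
 * names here are hypothetical; they are not the gfapi implementation.
 */
#include <pthread.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>

struct fake_fs {
        pthread_mutex_t mutex;
        ssize_t         oldvollen;
        char           *oldvolfile;
};

/* Writer path: the cached volfile is replaced under the lock. */
void
update_volfile (struct fake_fs *fs, const char *spec, ssize_t size)
{
        pthread_mutex_lock (&fs->mutex);
        free (fs->oldvolfile);
        fs->oldvolfile = malloc (size);
        if (fs->oldvolfile) {
                memcpy (fs->oldvolfile, spec, size);
                fs->oldvollen = size;
        } else {
                fs->oldvollen = 0;
        }
        pthread_mutex_unlock (&fs->mutex);
}

/*
 * Reader path: if the length check and memcmp() run without the lock,
 * another thread may free or partially rewrite oldvolfile underneath
 * them -- the crash seen in frame #1 above.  Holding the same lock
 * across the comparison closes that window.
 */
int
volfile_unchanged (struct fake_fs *fs, const char *spec, ssize_t size)
{
        int unchanged = 0;

        pthread_mutex_lock (&fs->mutex);
        if ((size == fs->oldvollen) && fs->oldvolfile &&
            (memcmp (fs->oldvolfile, spec, size) == 0))
                unchanged = 1;
        pthread_mutex_unlock (&fs->mutex);

        return unchanged;
}

int
main (void)
{
        struct fake_fs fs = { .mutex = PTHREAD_MUTEX_INITIALIZER };
        const char *spec = "volume testvol-client-0 ...";
        ssize_t len = (ssize_t) strlen (spec);

        update_volfile (&fs, spec, len);
        return volfile_unchanged (&fs, spec, len) ? 0 : 1;
}

The point is that the lock protecting the writer must also be held across both
the length check and the memcmp(), so the reader never sees a freed or
half-updated buffer.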

@Jilju,

Are you able to consistently reproduce this issue?

--- Additional comment from Daniel Gryniewicz on 2018-12-11 16:33:41 UTC ---

Are the buffers smaller than 1674?  It might be going off the end of one of the
buffers.

--- Additional comment from Jilju Joy on 2018-12-12 04:50:00 UTC ---

(In reply to Soumya Koduri from comment #4)
> [...]
> @Jilju,
> 
> Are you able to consistently reproduce this issue?

Hi Soumya,

The first occurrence is reported here. The setup has been kept in the same
state to help with debugging. I can share the setup if required, or I can try
to reproduce the issue.

--- Additional comment from Worker Ant on 2018-12-18 17:05:42 UTC ---

REVIEW: https://review.gluster.org/21882 (gfapi: Access fs->oldvolfile under
mutex lock) posted (#1) for review on master by soumya k

--- Additional comment from Worker Ant on 2018-12-26 02:17:03 UTC ---

REVIEW: https://review.gluster.org/21882 (gfapi: Access fs->oldvolfile under
mutex lock) posted (#2) for review on master by Amar Tumballi

--- Additional comment from Worker Ant on 2018-12-26 10:33:07 UTC ---

REVIEW: https://review.gluster.org/21927 (gfapi: nit cleanup related to
releasing fs->mutex lock) posted (#1) for review on master by soumya k

--- Additional comment from Worker Ant on 2018-12-31 16:10:41 UTC ---

REVIEW: https://review.gluster.org/21927 (gfapi: nit cleanup related to
releasing fs->mutex lock) posted (#2) for review on master by Kaleb KEITHLEY


Referenced Bugs:

https://bugzilla.redhat.com/show_bug.cgi?id=1658132
[Bug 1658132] [Ganesha] Ganesha failed on one node while exporting volumes in
loop
https://bugzilla.redhat.com/show_bug.cgi?id=1660577
[Bug 1660577] [Ganesha] Ganesha failed on one node while exporting volumes in
loop
https://bugzilla.redhat.com/show_bug.cgi?id=1663131
[Bug 1663131] [Ganesha] Ganesha failed on one node while exporting volumes in
loop
-- 
You are receiving this mail because:
You are the QA Contact for the bug.
You are on the CC list for the bug.
You are the assignee for the bug.

