[Bugs] [Bug 1219967] New: glusterfsd core dumps when cleanup and socket disconnect routines race
bugzilla at redhat.com
Fri May 8 20:29:28 UTC 2015
https://bugzilla.redhat.com/show_bug.cgi?id=1219967
Bug ID: 1219967
Summary: glusterfsd core dumps when cleanup and socket
disconnect routines race
Product: GlusterFS
Version: 3.6.3
Component: rpc
Keywords: Triaged
Assignee: bugs at gluster.org
Reporter: srangana at redhat.com
CC: bugs at gluster.org, gluster-bugs at redhat.com,
jclift at redhat.com, rgowdapp at redhat.com
Depends On: 1195415
Blocks: 1199352 (glusterfs-3.7.0)
+++ This bug was initially created as a clone of Bug #1195415 +++
Description of problem:
Best described in the mail chain which is copied below:
http://www.gluster.org/pipermail/gluster-devel/2015-February/043958.html
On 02/23/2015 01:58 PM, Justin Clift wrote:
> Short version:
>
> 75% of the Jenkins regression tests we run in Rackspace (on
> glusterfs master branch) fail from spurious errors.
>
> This is why we're having capacity problems with our Jenkins
> slave nodes... we need to run our tests 4x for each CR just
> to get a potentially valid result. :/
>
>
> Longer version:
>
> Ran some regression test runs (20) on git master head over the
> weekend, to better understand our spurious failure situation.
>
> 75% of the regression runs failed in various ways. Oops.
>
> The failures:
>
> * 5 x tests/bugs/fuse/bug-1126048.t
> Failed test: 10
>
> * 3 x tests/bugs/quota/bug-1087198.t
> Failed test: 18
>
> * 3 x tests/performance/open-behind.t
> Failed test: 17
>
> * 2 x tests/bugs/geo-replication/bug-877293.t
> Failed test: 11
>
> * 2 x tests/basic/afr/split-brain-heal-info.t
> Failed tests: 20-41
>
> * 1 x tests/bugs/distribute/bug-1117851.t
> Failed test: 15
>
> * 1 x tests/basic/uss.t
> Failed test: 26
>
> * 1 x hung on tests/bugs/posix/bug-1113960.t
>
> No idea which test it was on. Left it running
> several hours, then killed the VM along with the rest.
>
> 4 of the regression runs also created coredumps. Uploaded the
> archived_builds and logs here:
>
> http://mirror.salasaga.org/gluster/
>
> (are those useful?)
Yes, these are useful: each of the cores contains a very similar crash, so we
could be looking at a single problem to fix here. Here is a short update on the
core. At a broad level, cleanup_and_exit is racing with a list deletion, as
shown in the following two threads.
Those interested can download and extract one of the tarballs from the link
provided (for example:
http://mirror.salasaga.org/gluster/bulkregression12/archived_builds/build-install-20150222%3a19%3a58%3a21.tar.bz2
) and then, from the root of the extracted tarball, run the following command
to look at the details from the core dump:
gdb -ex 'set sysroot ./' -ex 'core-file ./build/install/cores/core.28008' ./build/install/sbin/glusterfsd
Core was generated by `/build/install/sbin/glusterfsd -s
bulkregression12.localdomain --volfile-id pat'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x00007fd84a1340de in list_del_init (old=0x7fd834000d50) at
/root/glusterfs/libglusterfs/src/list.h:88
88 /root/glusterfs/libglusterfs/src/list.h: No such file or directory.
1) The list deletion that generates the core:
(gdb) bt
#0 0x00007fd84a1340de in list_del_init (old=0x7fd834000d50) at
/root/glusterfs/libglusterfs/src/list.h:88
#1 0x00007fd84a1352ae in pl_inodelk_client_cleanup (this=0x7fd84400b7e0,
ctx=0x7fd834000b50) at /root/glusterfs/xlators/features/locks/src/inodelk.c:471
#2 0x00007fd84a131805 in pl_client_disconnect_cbk (this=0x7fd84400b7e0,
client=0x7fd83c002fd0) at
/root/glusterfs/xlators/features/locks/src/posix.c:2563
#3 0x00007fd85bd52139 in gf_client_disconnect (client=0x7fd83c002fd0) at
/root/glusterfs/libglusterfs/src/client_t.c:393
#4 0x00007fd849262296 in server_connection_cleanup (this=0x7fd844014350,
client=0x7fd83c002fd0, flags=3) at
/root/glusterfs/xlators/protocol/server/src/server-helpers.c:353
#5 0x00007fd84925dcca in server_rpc_notify (rpc=0x7fd844023b70,
xl=0x7fd844014350, event=RPCSVC_EVENT_DISCONNECT, data=0x7fd83c001440) at
/root/glusterfs/xlators/protocol/server/src/server.c:532
#6 0x00007fd85baaa021 in rpcsvc_handle_disconnect (svc=0x7fd844023b70,
trans=0x7fd83c001440) at /root/glusterfs/rpc/rpc-lib/src/rpcsvc.c:741
#7 0x00007fd85baaa1ba in rpcsvc_notify (trans=0x7fd83c001440,
mydata=0x7fd844023b70, event=RPC_TRANSPORT_DISCONNECT, data=0x7fd83c001440) at
/root/glusterfs/rpc/rpc-lib/src/rpcsvc.c:779
#8 0x00007fd85baaf4a4 in rpc_transport_notify (this=0x7fd83c001440,
event=RPC_TRANSPORT_DISCONNECT, data=0x7fd83c001440) at
/root/glusterfs/rpc/rpc-lib/src/rpc-transport.c:543
#9 0x00007fd850c8fbc0 in socket_event_poll_err (this=0x7fd83c001440) at
/root/glusterfs/rpc/rpc-transport/socket/src/socket.c:1185
#10 0x00007fd850c9457e in socket_event_handler (fd=14, idx=5,
data=0x7fd83c001440, poll_in=1, poll_out=0, poll_err=0) at
/root/glusterfs/rpc/rpc-transport/socket/src/socket.c:2386
#11 0x00007fd85bd55333 in event_dispatch_epoll_handler (event_pool=0x1d835d0,
event=0x7fd84b5a9e70) at /root/glusterfs/libglusterfs/src/event-epoll.c:551
#12 0x00007fd85bd5561d in event_dispatch_epoll_worker (data=0x1db0790) at
/root/glusterfs/libglusterfs/src/event-epoll.c:643
#13 0x00007fd85b24f9d1 in start_thread () from ./lib64/libpthread.so.0
#14 0x00007fd85abb98fd in clone () from ./lib64/libc.so.6
2) Parallel cleanup in progress (see frame #12, cleanup_and_exit):
Thread 12 (LWP 28010):
#0 0x00007f8620a31f48 in _nss_files_parse_servent () from
./lib64/libnss_files.so.2
#1 0x00007f8620a326b0 in _nss_files_getservbyport_r () from
./lib64/libnss_files.so.2
#2 0x00007f862b595c39 in getservbyport_r@@GLIBC_2.2.5 () from
./lib64/libc.so.6
#3 0x00007f862b59c536 in getnameinfo () from ./lib64/libc.so.6
#4 0x00007f862c6beb64 in gf_resolve_ip6 (hostname=0x1702860
"bulkregression16.localdomain", port=24007, family=2, dnscache=0x1715748,
addr_info=0x7f861b662930) at
/root/glusterfs/libglusterfs/src/common-utils.c:240
#5 0x00007f86220594c3 in af_inet_client_get_remote_sockaddr (this=0x17156d0,
sockaddr=0x7f861b662a10, sockaddr_len=0x7f861b662aa8) at
/root/glusterfs/rpc/rpc-transport/socket/src/name.c:238
#6 0x00007f8622059eba in socket_client_get_remote_sockaddr (this=0x17156d0,
sockaddr=0x7f861b662a10, sockaddr_len=0x7f861b662aa8, sa_family=0x7f861b662aa6)
at /root/glusterfs/rpc/rpc-transport/socket/src/name.c:496
#7 0x00007f8622055c1b in socket_connect (this=0x17156d0, port=0) at
/root/glusterfs/rpc/rpc-transport/socket/src/socket.c:2914
#8 0x00007f862c46dfe1 in rpc_transport_connect (this=0x17156d0, port=0) at
/root/glusterfs/rpc/rpc-lib/src/rpc-transport.c:426
#9 0x00007f862c473655 in rpc_clnt_submit (rpc=0x1713c80, prog=0x614620
<clnt_pmap_prog>, procnum=5, cbkfn=0x40f0e9 <mgmt_pmap_signout_cbk>,
proghdr=0x7f861b662cf0, proghdrcount=1, progpayload=0x0, progpayloadcount=0,
iobref=0x7f85fc000f60, frame=0x7f862a513de0, rsphdr=0x0, rsphdr_count=0,
rsp_payload=0x0, rsp_payload_count=0, rsp_iobref=0x0) at
/root/glusterfs/rpc/rpc-lib/src/rpc-clnt.c:1554
#10 0x000000000040d725 in mgmt_submit_request (req=0x7f861b663d60,
frame=0x7f862a513de0, ctx=0x16cb010, prog=0x614620 <clnt_pmap_prog>, procnum=5,
cbkfn=0x40f0e9 <mgmt_pmap_signout_cbk>, xdrproc=0x4048d0
<xdr_pmap_signout_req@plt>)
at /root/glusterfs/glusterfsd/src/glusterfsd-mgmt.c:1445
#11 0x000000000040f38d in glusterfs_mgmt_pmap_signout (ctx=0x16cb010) at
/root/glusterfs/glusterfsd/src/glusterfsd-mgmt.c:2258
#12 0x0000000000407903 in cleanup_and_exit (signum=15) at
/root/glusterfs/glusterfsd/src/glusterfsd.c:1201
#13 0x0000000000408ecf in glusterfs_sigwaiter (arg=0x7fff49a90520) at
/root/glusterfs/glusterfsd/src/glusterfsd.c:1761
#14 0x00007f862bc0e9d1 in start_thread () from ./lib64/libpthread.so.0
#15 0x00007f862b5788fd in clone () from ./lib64/libc.so.6
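To illustrate the class of problem described above, here is a minimal sketch in C of two threads tearing down the same linked list. It is not GlusterFS code and not the fix for this bug: struct client_ctx, struct lock_entry, ctx_cleanup and run_cleanup_race are hypothetical names, and the list helpers are a simplified kernel-style list written for this example. The only point it makes is that an unlink routine like list_del_init dereferences its neighbours, so two teardown paths walking the same list need to be serialized (here with a pthread mutex); without the lock, one thread can unlink or free an entry the other is still touching.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdlib.h>

/* Simplified circular doubly linked list (hypothetical, for illustration). */
struct list_head {
    struct list_head *next;
    struct list_head *prev;
};

struct lock_entry {
    struct list_head link;
    int id;
};

struct client_ctx {
    pthread_mutex_t lock;      /* guards 'entries' and 'removed' */
    struct list_head entries;
    int removed;               /* number of entries unlinked so far */
};

static void list_init(struct list_head *h) { h->next = h; h->prev = h; }

static void list_add_tail(struct list_head *n, struct list_head *head)
{
    n->prev = head->prev;
    n->next = head;
    head->prev->next = n;
    head->prev = n;
}

/* Unlink 'old' and leave it pointing at itself. This dereferences
 * old->prev and old->next, so it is only safe while no other thread
 * is unlinking or freeing the neighbouring entries. */
static void list_del_init(struct list_head *old)
{
    old->prev->next = old->next;
    old->next->prev = old->prev;
    old->next = old;
    old->prev = old;
}

/* Teardown path that may be entered concurrently, for example once
 * from a disconnect handler and once from a process exit path. */
static void *ctx_cleanup(void *arg)
{
    struct client_ctx *ctx = arg;

    pthread_mutex_lock(&ctx->lock);
    while (ctx->entries.next != &ctx->entries) {
        struct list_head *pos = ctx->entries.next;
        struct lock_entry *e = (struct lock_entry *)
            ((char *)pos - offsetof(struct lock_entry, link));
        list_del_init(&e->link);
        free(e);
        ctx->removed++;
    }
    pthread_mutex_unlock(&ctx->lock);
    return NULL;
}

/* Build a list of n entries, run two cleanup threads against it, and
 * return how many entries were removed in total. */
int run_cleanup_race(int n)
{
    struct client_ctx ctx;
    pthread_t a, b;
    int rc;

    pthread_mutex_init(&ctx.lock, NULL);
    list_init(&ctx.entries);
    ctx.removed = 0;

    for (int i = 0; i < n; i++) {
        struct lock_entry *e = malloc(sizeof(*e));
        if (e == NULL)
            abort();
        e->id = i;
        list_add_tail(&e->link, &ctx.entries);
    }

    rc = pthread_create(&a, NULL, ctx_cleanup, &ctx);
    if (rc != 0)
        abort();
    rc = pthread_create(&b, NULL, ctx_cleanup, &ctx);
    if (rc != 0)
        abort();
    pthread_join(a, NULL);
    pthread_join(b, NULL);

    /* Every entry was removed exactly once and the list is empty. */
    assert(ctx.entries.next == &ctx.entries);
    pthread_mutex_destroy(&ctx.lock);
    return ctx.removed;
}
```

With the mutex in place, whichever thread gets the lock second simply finds an empty list and returns. Removing the lock from ctx_cleanup turns this into the unsafe pattern: both threads may call list_del_init and free on the same entry. Whether a lock, reference counting, or ordering the shutdown differently is the right remedy in glusterfsd is outside the scope of this sketch.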
>
> We should probably concentrate on fixing the most common
> spurious failures soon, and look into the less common ones
> later on.
>
> I'll do some runs on release-3.6 soon too, as I suspect that'll
> be useful.
>
> + Justin
These were regression test runs on the master branch as of the date given above.
--- Additional comment from Anand Avati on 2015-04-08 17:21:17 EDT ---
REVIEW: http://review.gluster.org/10167 (tests: remove tests for clear-locks)
posted (#1) for review on master by Jeff Darcy (jdarcy at redhat.com)
--- Additional comment from Anand Avati on 2015-04-09 05:51:17 EDT ---
COMMIT: http://review.gluster.org/10167 committed in master by Vijay Bellur
(vbellur at redhat.com)
------
commit 0086a55bb7de1ef5dc7a24583f5fc2b560e835fd
Author: Jeff Darcy <jdarcy at redhat.com>
Date: Wed Apr 8 17:17:13 2015 -0400
tests: remove tests for clear-locks
These are suspected of causing core dumps during regression tests,
leading to spurious failures. Per email conversation, since this
isn't a supported feature anyway, the tests are being removed to
facilitate testing of features we do support.
Change-Id: I7fd5c76d26dd6c3ffa91f89fc10469ae3a63afdf
BUG: 1195415
Signed-off-by: Jeff Darcy <jdarcy at redhat.com>
Reviewed-on: http://review.gluster.org/10167
Tested-by: Gluster Build System <jenkins at build.gluster.com>
Reviewed-by: Kaleb KEITHLEY <kkeithle at redhat.com>
Reviewed-by: Vijay Bellur <vbellur at redhat.com>
--- Additional comment from Justin Clift on 2015-04-09 07:18:48 EDT ---
Pranith pointed out this may be a duplicate of
https://bugzilla.redhat.com/show_bug.cgi?id=1184417.
Referenced Bugs:
https://bugzilla.redhat.com/show_bug.cgi?id=1195415
[Bug 1195415] glusterfsd core dumps when cleanup and socket disconnect
routines race
https://bugzilla.redhat.com/show_bug.cgi?id=1199352
[Bug 1199352] GlusterFS 3.7.0 tracker