[Gluster-devel] Spurious regression failure analysis from runs over the wkend
Shyam
srangana at redhat.com
Mon Feb 23 19:32:41 UTC 2015
On 02/23/2015 01:58 PM, Justin Clift wrote:
> Short version:
>
> 75% of the Jenkins regression tests we run in Rackspace (on
> glusterfs master branch) fail from spurious errors.
>
> This is why we're having capacity problems with our Jenkins
> slave nodes... we need to run our tests 4x for each CR just
> to get a potentially valid result. :/
>
>
> Longer version:
>
> Ran some regression test runs (20) on git master head over the
> weekend, to better understand our spurious failure situation.
>
> 75% of the regression runs failed in various ways. Oops. ;)
>
> The failures:
>
> * 5 x tests/bugs/fuse/bug-1126048.t
> Failed test: 10
>
> * 3 x tests/bugs/quota/bug-1087198.t
> Failed test: 18
>
> * 3 x tests/performance/open-behind.t
> Failed test: 17
>
> * 2 x tests/bugs/geo-replication/bug-877293.t
> Failed test: 11
>
> * 2 x tests/basic/afr/split-brain-heal-info.t
> Failed tests: 20-41
>
> * 1 x tests/bugs/distribute/bug-1117851.t
> Failed test: 15
>
> * 1 x tests/basic/uss.t
> Failed test: 26
>
> * 1 x hung on tests/bugs/posix/bug-1113960.t
>
> No idea which test it was stuck on. Left it running for
> several hours, then killed the VM along with the rest.
>
> 4 of the regression runs also created coredumps. Uploaded the
> archived_builds and logs here:
>
> http://mirror.salasaga.org/gluster/
>
> (are those useful?)
Yes, these are useful: each of the cores contains a very similar crash,
so we could be looking at a single problem to fix here. Here is a short
update on the core. At a broad level, cleanup_and_exit is racing with a
list deletion, as the following two threads show.
Those interested can download and extract one of the tarballs from the
link above, for example:
http://mirror.salasaga.org/gluster/bulkregression12/archived_builds/build-install-20150222%3a19%3a58%3a21.tar.bz2
Then, from the root of the extracted tarball, run

  gdb -ex 'set sysroot ./' -ex 'core-file ./build/install/cores/core.28008' ./build/install/sbin/glusterfsd

to look at the details in the core dump.
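
If you would rather capture every thread's backtrace non-interactively
(handy when comparing several cores), something along these lines should
work. The -O name and the output file name are just examples, and the
core file name differs per tarball, so list ./build/install/cores/ first:

  wget -O build.tar.bz2 'http://mirror.salasaga.org/gluster/bulkregression12/archived_builds/build-install-20150222%3a19%3a58%3a21.tar.bz2'
  tar -xjf build.tar.bz2
  # from the root of the extracted tree (where build/ lives):
  gdb -batch \
      -ex 'set sysroot ./' \
      -ex 'core-file ./build/install/cores/core.28008' \
      -ex 'thread apply all bt' \
      ./build/install/sbin/glusterfsd > backtraces.txt

The interactive gdb session from the command above opens with output
like this: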
Core was generated by `/build/install/sbin/glusterfsd -s
bulkregression12.localdomain --volfile-id pat'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x00007fd84a1340de in list_del_init (old=0x7fd834000d50) at
/root/glusterfs/libglusterfs/src/list.h:88
88 /root/glusterfs/libglusterfs/src/list.h: No such file or directory.
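(The "No such file or directory" line just means gdb cannot find the
sources at their build-time location, /root/glusterfs; the backtraces
are unaffected. To get source listings, point gdb at a local checkout of
the matching commit, for example:

  (gdb) set substitute-path /root/glusterfs /path/to/glusterfs

where /path/to/glusterfs is a placeholder for wherever you have the tree
cloned.)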
1) list deletion generates the core:
(gdb) bt
#0 0x00007fd84a1340de in list_del_init (old=0x7fd834000d50) at
/root/glusterfs/libglusterfs/src/list.h:88
#1 0x00007fd84a1352ae in pl_inodelk_client_cleanup
(this=0x7fd84400b7e0, ctx=0x7fd834000b50) at
/root/glusterfs/xlators/features/locks/src/inodelk.c:471
#2 0x00007fd84a131805 in pl_client_disconnect_cbk (this=0x7fd84400b7e0,
client=0x7fd83c002fd0) at
/root/glusterfs/xlators/features/locks/src/posix.c:2563
#3 0x00007fd85bd52139 in gf_client_disconnect (client=0x7fd83c002fd0)
at /root/glusterfs/libglusterfs/src/client_t.c:393
#4 0x00007fd849262296 in server_connection_cleanup
(this=0x7fd844014350, client=0x7fd83c002fd0, flags=3) at
/root/glusterfs/xlators/protocol/server/src/server-helpers.c:353
#5 0x00007fd84925dcca in server_rpc_notify (rpc=0x7fd844023b70,
xl=0x7fd844014350, event=RPCSVC_EVENT_DISCONNECT, data=0x7fd83c001440)
at /root/glusterfs/xlators/protocol/server/src/server.c:532
#6 0x00007fd85baaa021 in rpcsvc_handle_disconnect (svc=0x7fd844023b70,
trans=0x7fd83c001440) at /root/glusterfs/rpc/rpc-lib/src/rpcsvc.c:741
#7 0x00007fd85baaa1ba in rpcsvc_notify (trans=0x7fd83c001440,
mydata=0x7fd844023b70, event=RPC_TRANSPORT_DISCONNECT,
data=0x7fd83c001440) at /root/glusterfs/rpc/rpc-lib/src/rpcsvc.c:779
#8 0x00007fd85baaf4a4 in rpc_transport_notify (this=0x7fd83c001440,
event=RPC_TRANSPORT_DISCONNECT, data=0x7fd83c001440) at
/root/glusterfs/rpc/rpc-lib/src/rpc-transport.c:543
#9 0x00007fd850c8fbc0 in socket_event_poll_err (this=0x7fd83c001440) at
/root/glusterfs/rpc/rpc-transport/socket/src/socket.c:1185
#10 0x00007fd850c9457e in socket_event_handler (fd=14, idx=5,
data=0x7fd83c001440, poll_in=1, poll_out=0, poll_err=0) at
/root/glusterfs/rpc/rpc-transport/socket/src/socket.c:2386
#11 0x00007fd85bd55333 in event_dispatch_epoll_handler
(event_pool=0x1d835d0, event=0x7fd84b5a9e70) at
/root/glusterfs/libglusterfs/src/event-epoll.c:551
#12 0x00007fd85bd5561d in event_dispatch_epoll_worker (data=0x1db0790)
at /root/glusterfs/libglusterfs/src/event-epoll.c:643
#13 0x00007fd85b24f9d1 in start_thread () from ./lib64/libpthread.so.0
#14 0x00007fd85abb98fd in clone () from ./lib64/libc.so.6
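For reference, list_del_init() in libglusterfs/src/list.h is the usual
doubly-linked-list unlink, roughly the following (a from-memory sketch,
the exact source may differ slightly):

  struct list_head {
          struct list_head *next;
          struct list_head *prev;
  };

  static inline void
  list_del_init (struct list_head *old)
  {
          old->prev->next = old->next;   /* faults if old, old->prev or     */
          old->next->prev = old->prev;   /* old->next point at freed memory */

          old->next = old;               /* leave the node self-referencing */
          old->prev = old;
  }

So a SIGSEGV here with an apparently sane 'old' pointer is consistent
with the neighbouring lock entries having already been freed or
reinitialized by another thread, which brings us to the second trace.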
2) Parallel cleanup in progress (see frame #12, cleanup_and_exit):
Thread 12 (LWP 28010):
#0 0x00007f8620a31f48 in _nss_files_parse_servent () from
./lib64/libnss_files.so.2
#1 0x00007f8620a326b0 in _nss_files_getservbyport_r () from
./lib64/libnss_files.so.2
#2 0x00007f862b595c39 in getservbyport_r@@GLIBC_2.2.5 () from
./lib64/libc.so.6
#3 0x00007f862b59c536 in getnameinfo () from ./lib64/libc.so.6
#4 0x00007f862c6beb64 in gf_resolve_ip6 (hostname=0x1702860
"bulkregression16.localdomain", port=24007, family=2,
dnscache=0x1715748, addr_info=0x7f861b662930) at
/root/glusterfs/libglusterfs/src/common-utils.c:240
#5 0x00007f86220594c3 in af_inet_client_get_remote_sockaddr
(this=0x17156d0, sockaddr=0x7f861b662a10, sockaddr_len=0x7f861b662aa8)
at /root/glusterfs/rpc/rpc-transport/socket/src/name.c:238
#6 0x00007f8622059eba in socket_client_get_remote_sockaddr
(this=0x17156d0, sockaddr=0x7f861b662a10, sockaddr_len=0x7f861b662aa8,
sa_family=0x7f861b662aa6) at
/root/glusterfs/rpc/rpc-transport/socket/src/name.c:496
#7 0x00007f8622055c1b in socket_connect (this=0x17156d0, port=0) at
/root/glusterfs/rpc/rpc-transport/socket/src/socket.c:2914
#8 0x00007f862c46dfe1 in rpc_transport_connect (this=0x17156d0, port=0)
at /root/glusterfs/rpc/rpc-lib/src/rpc-transport.c:426
#9 0x00007f862c473655 in rpc_clnt_submit (rpc=0x1713c80, prog=0x614620
<clnt_pmap_prog>, procnum=5, cbkfn=0x40f0e9 <mgmt_pmap_signout_cbk>,
proghdr=0x7f861b662cf0, proghdrcount=1, progpayload=0x0,
progpayloadcount=0,
iobref=0x7f85fc000f60, frame=0x7f862a513de0, rsphdr=0x0,
rsphdr_count=0, rsp_payload=0x0, rsp_payload_count=0, rsp_iobref=0x0) at
/root/glusterfs/rpc/rpc-lib/src/rpc-clnt.c:1554
#10 0x000000000040d725 in mgmt_submit_request (req=0x7f861b663d60,
frame=0x7f862a513de0, ctx=0x16cb010, prog=0x614620 <clnt_pmap_prog>,
procnum=5, cbkfn=0x40f0e9 <mgmt_pmap_signout_cbk>, xdrproc=0x4048d0
<xdr_pmap_signout_req at plt>)
at /root/glusterfs/glusterfsd/src/glusterfsd-mgmt.c:1445
#11 0x000000000040f38d in glusterfs_mgmt_pmap_signout (ctx=0x16cb010) at
/root/glusterfs/glusterfsd/src/glusterfsd-mgmt.c:2258
#12 0x0000000000407903 in cleanup_and_exit (signum=15) at
/root/glusterfs/glusterfsd/src/glusterfsd.c:1201
#13 0x0000000000408ecf in glusterfs_sigwaiter (arg=0x7fff49a90520) at
/root/glusterfs/glusterfsd/src/glusterfsd.c:1761
#14 0x00007f862bc0e9d1 in start_thread () from ./lib64/libpthread.so.0
#15 0x00007f862b5788fd in clone () from ./lib64/libc.so.6
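To make the suspected interaction concrete, here is a minimal,
self-contained sketch in plain C. This is NOT GlusterFS code;
pl_inodelk_client_cleanup() and cleanup_and_exit() are the real players
and everything below is a simplified stand-in. Two threads drain the
same doubly linked list, and a single mutex serializes them so neither
unlinks nodes the other has already freed. Whether serializing the
disconnect and teardown paths like this is the right fix for the locks
xlator still needs investigation:

  #include <pthread.h>
  #include <stdio.h>
  #include <stdlib.h>

  struct node {
          struct node *next;
          struct node *prev;
  };

  static pthread_mutex_t list_lock = PTHREAD_MUTEX_INITIALIZER;
  static struct node head = { &head, &head };        /* empty circular list */

  /* Both paths drain the same list under one mutex, so whichever runs
   * second simply finds it empty instead of unlinking half-freed nodes. */
  static void
  drain_list (void)
  {
          pthread_mutex_lock (&list_lock);
          while (head.next != &head) {
                  struct node *n = head.next;
                  n->prev->next = n->next;           /* what list_del_init() does */
                  n->next->prev = n->prev;
                  free (n);
          }
          pthread_mutex_unlock (&list_lock);
  }

  /* Stand-in for the disconnect path (pl_inodelk_client_cleanup and friends). */
  static void *
  disconnect_path (void *arg)
  {
          (void) arg;
          drain_list ();
          return NULL;
  }

  /* Stand-in for process teardown (cleanup_and_exit). */
  static void *
  teardown_path (void *arg)
  {
          (void) arg;
          drain_list ();
          return NULL;
  }

  int
  main (void)
  {
          pthread_t t1, t2;
          int i;

          for (i = 0; i < 4; i++) {                  /* a few fake lock entries */
                  struct node *n = malloc (sizeof (*n));
                  n->next = head.next;
                  n->prev = &head;
                  head.next->prev = n;
                  head.next = n;
          }

          pthread_create (&t1, NULL, disconnect_path, NULL);
          pthread_create (&t2, NULL, teardown_path, NULL);
          pthread_join (t1, NULL);
          pthread_join (t2, NULL);

          printf ("list drained exactly once, no use-after-free\n");
          return 0;
  }

Build with something like "gcc -Wall -pthread -o race-sketch race-sketch.c"
(the file name is just an example).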
>
> We should probably concentrate on fixing the most common
> spurious failures soon, and look into the less common ones
> later on.
>
> I'll do some runs on release-3.6 soon too, as I suspect that'll
> be useful.
>
> + Justin
>
> --
> GlusterFS - http://www.gluster.org
>
> An open source, distributed file system scaling to several
> petabytes, and handling thousands of clients.
>
> My personal twitter: twitter.com/realjustinclift
>