[Gluster-devel] Spurious regression failure analysis from runs over the wkend

Shyam srangana at redhat.com
Mon Feb 23 19:32:41 UTC 2015


On 02/23/2015 01:58 PM, Justin Clift wrote:
> Short version:
>
> 75% of the Jenkins regression tests we run in Rackspace (on
> glusterfs master branch) fail from spurious errors.
>
> This is why we're having capacity problems with our Jenkins
> slave nodes... we need to run our tests 4x for each CR just
> to get a potentially valid result. :/
>
>
> Longer version:
>
> Ran some regression test runs (20) on git master head over the
> weekend, to better understand our spurious failure situation.
>
> 75% of the regression runs failed in various ways.  Oops. ;)
>
> The failures:
>
>    * 5 x tests/bugs/fuse/bug-1126048.t
>          Failed test:  10
>
>    * 3 x tests/bugs/quota/bug-1087198.t
>          Failed test:  18
>
>    * 3 x tests/performance/open-behind.t
>          Failed test:  17
>
>    * 2 x tests/bugs/geo-replication/bug-877293.t
>          Failed test:  11
>
>    * 2 x tests/basic/afr/split-brain-heal-info.t
>          Failed tests:  20-41
>
>    * 1 x tests/bugs/distribute/bug-1117851.t
>          Failed test:  15
>
>    * 1 x tests/basic/uss.t
>          Failed test:  26
>
>    * 1 x hung on tests/bugs/posix/bug-1113960.t
>
>          No idea which test it was on.  Left it running
>          several hours, then killed the VM along with the rest.
>
> 4 of the regression runs also created coredumps.  Uploaded the
> archived_builds and logs here:
>
>      http://mirror.salasaga.org/gluster/
>
> (are those useful?)

Yes, these are useful. They contain a very similar crash in each of the 
cores, so we could be looking at a single problem to fix here. Here is a 
short update on the core: at a broad level, cleanup_and_exit is racing 
with a list deletion across the following 2 threads (a minimal sketch of 
the suspected pattern follows the two backtraces below).

Those interested can download and extract the tarballs from the link 
provided (for example: 
http://mirror.salasaga.org/gluster/bulkregression12/archived_builds/build-install-20150222%3a19%3a58%3a21.tar.bz2) 
and, from the root of the extracted tarball, run:

   gdb -ex 'set sysroot ./' -ex 'core-file ./build/install/cores/core.28008' ./build/install/sbin/glusterfsd

to look at the details from the core dump.

Core was generated by `/build/install/sbin/glusterfsd -s 
bulkregression12.localdomain --volfile-id pat'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007fd84a1340de in list_del_init (old=0x7fd834000d50) at 
/root/glusterfs/libglusterfs/src/list.h:88
88      /root/glusterfs/libglusterfs/src/list.h: No such file or directory.

1) The list deletion that generates the core:

(gdb) bt
#0  0x00007fd84a1340de in list_del_init (old=0x7fd834000d50) at 
/root/glusterfs/libglusterfs/src/list.h:88
#1  0x00007fd84a1352ae in pl_inodelk_client_cleanup 
(this=0x7fd84400b7e0, ctx=0x7fd834000b50) at 
/root/glusterfs/xlators/features/locks/src/inodelk.c:471
#2  0x00007fd84a131805 in pl_client_disconnect_cbk (this=0x7fd84400b7e0, 
client=0x7fd83c002fd0) at 
/root/glusterfs/xlators/features/locks/src/posix.c:2563
#3  0x00007fd85bd52139 in gf_client_disconnect (client=0x7fd83c002fd0) 
at /root/glusterfs/libglusterfs/src/client_t.c:393
#4  0x00007fd849262296 in server_connection_cleanup 
(this=0x7fd844014350, client=0x7fd83c002fd0, flags=3) at 
/root/glusterfs/xlators/protocol/server/src/server-helpers.c:353
#5  0x00007fd84925dcca in server_rpc_notify (rpc=0x7fd844023b70, 
xl=0x7fd844014350, event=RPCSVC_EVENT_DISCONNECT, data=0x7fd83c001440) 
at /root/glusterfs/xlators/protocol/server/src/server.c:532
#6  0x00007fd85baaa021 in rpcsvc_handle_disconnect (svc=0x7fd844023b70, 
trans=0x7fd83c001440) at /root/glusterfs/rpc/rpc-lib/src/rpcsvc.c:741
#7  0x00007fd85baaa1ba in rpcsvc_notify (trans=0x7fd83c001440, 
mydata=0x7fd844023b70, event=RPC_TRANSPORT_DISCONNECT, 
data=0x7fd83c001440) at /root/glusterfs/rpc/rpc-lib/src/rpcsvc.c:779
#8  0x00007fd85baaf4a4 in rpc_transport_notify (this=0x7fd83c001440, 
event=RPC_TRANSPORT_DISCONNECT, data=0x7fd83c001440) at 
/root/glusterfs/rpc/rpc-lib/src/rpc-transport.c:543
#9  0x00007fd850c8fbc0 in socket_event_poll_err (this=0x7fd83c001440) at 
/root/glusterfs/rpc/rpc-transport/socket/src/socket.c:1185
#10 0x00007fd850c9457e in socket_event_handler (fd=14, idx=5, 
data=0x7fd83c001440, poll_in=1, poll_out=0, poll_err=0) at 
/root/glusterfs/rpc/rpc-transport/socket/src/socket.c:2386
#11 0x00007fd85bd55333 in event_dispatch_epoll_handler 
(event_pool=0x1d835d0, event=0x7fd84b5a9e70) at 
/root/glusterfs/libglusterfs/src/event-epoll.c:551
#12 0x00007fd85bd5561d in event_dispatch_epoll_worker (data=0x1db0790) 
at /root/glusterfs/libglusterfs/src/event-epoll.c:643
#13 0x00007fd85b24f9d1 in start_thread () from ./lib64/libpthread.so.0
#14 0x00007fd85abb98fd in clone () from ./lib64/libc.so.6

2) The parallel cleanup in progress (see frame #12, cleanup_and_exit):

Thread 12 (LWP 28010):
#0  0x00007f8620a31f48 in _nss_files_parse_servent () from 
./lib64/libnss_files.so.2
#1  0x00007f8620a326b0 in _nss_files_getservbyport_r () from 
./lib64/libnss_files.so.2
#2  0x00007f862b595c39 in getservbyport_r@@GLIBC_2.2.5 () from 
./lib64/libc.so.6
#3  0x00007f862b59c536 in getnameinfo () from ./lib64/libc.so.6
#4  0x00007f862c6beb64 in gf_resolve_ip6 (hostname=0x1702860 
"bulkregression16.localdomain", port=24007, family=2, 
dnscache=0x1715748, addr_info=0x7f861b662930) at 
/root/glusterfs/libglusterfs/src/common-utils.c:240
#5  0x00007f86220594c3 in af_inet_client_get_remote_sockaddr 
(this=0x17156d0, sockaddr=0x7f861b662a10, sockaddr_len=0x7f861b662aa8) 
at /root/glusterfs/rpc/rpc-transport/socket/src/name.c:238
#6  0x00007f8622059eba in socket_client_get_remote_sockaddr 
(this=0x17156d0, sockaddr=0x7f861b662a10, sockaddr_len=0x7f861b662aa8, 
sa_family=0x7f861b662aa6) at 
/root/glusterfs/rpc/rpc-transport/socket/src/name.c:496
#7  0x00007f8622055c1b in socket_connect (this=0x17156d0, port=0) at 
/root/glusterfs/rpc/rpc-transport/socket/src/socket.c:2914
#8  0x00007f862c46dfe1 in rpc_transport_connect (this=0x17156d0, port=0) 
at /root/glusterfs/rpc/rpc-lib/src/rpc-transport.c:426
#9  0x00007f862c473655 in rpc_clnt_submit (rpc=0x1713c80, prog=0x614620 
<clnt_pmap_prog>, procnum=5, cbkfn=0x40f0e9 <mgmt_pmap_signout_cbk>, 
proghdr=0x7f861b662cf0, proghdrcount=1, progpayload=0x0, 
progpayloadcount=0,
     iobref=0x7f85fc000f60, frame=0x7f862a513de0, rsphdr=0x0, 
rsphdr_count=0, rsp_payload=0x0, rsp_payload_count=0, rsp_iobref=0x0) at 
/root/glusterfs/rpc/rpc-lib/src/rpc-clnt.c:1554
#10 0x000000000040d725 in mgmt_submit_request (req=0x7f861b663d60, 
frame=0x7f862a513de0, ctx=0x16cb010, prog=0x614620 <clnt_pmap_prog>, 
procnum=5, cbkfn=0x40f0e9 <mgmt_pmap_signout_cbk>, xdrproc=0x4048d0 
<xdr_pmap_signout_req at plt>)
     at /root/glusterfs/glusterfsd/src/glusterfsd-mgmt.c:1445
#11 0x000000000040f38d in glusterfs_mgmt_pmap_signout (ctx=0x16cb010) at 
/root/glusterfs/glusterfsd/src/glusterfsd-mgmt.c:2258
#12 0x0000000000407903 in cleanup_and_exit (signum=15) at 
/root/glusterfs/glusterfsd/src/glusterfsd.c:1201
#13 0x0000000000408ecf in glusterfs_sigwaiter (arg=0x7fff49a90520) at 
/root/glusterfs/glusterfsd/src/glusterfsd.c:1761
#14 0x00007f862bc0e9d1 in start_thread () from ./lib64/libpthread.so.0
#15 0x00007f862b5788fd in clone () from ./lib64/libc.so.6
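
To make the suspected interaction concrete, here is a minimal, 
self-contained sketch of the pattern in plain C with pthreads. This is 
not GlusterFS code: the thread names, the lock_entry structure and the 
missing-mutex assumption are illustrative only, standing in for the 
cleanup path tearing down state while pl_inodelk_client_cleanup() 
unlinks an entry on client disconnect. Whether the real paths actually 
skip a common lock still needs to be confirmed against inodelk.c and 
the cleanup_and_exit path.

/*
 * Minimal sketch of the suspected pattern -- NOT GlusterFS code.
 * Assumption (to be confirmed against the source): one thread tears
 * down a lock list during process cleanup while another thread, driven
 * by a client disconnect, calls list_del_init() on an entry of the
 * same list without both paths holding the same lock. list_del_init()
 * then dereferences a stale ->prev/->next pointer and can segfault,
 * matching the crash at list.h:88 above.
 *
 * Build with: cc -pthread race-sketch.c -o race-sketch
 */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

/* Kernel-style intrusive list, same shape as libglusterfs/src/list.h */
struct list_head {
        struct list_head *next;
        struct list_head *prev;
};

static void list_add(struct list_head *new, struct list_head *head)
{
        new->prev = head;
        new->next = head->next;
        head->next->prev = new;
        head->next = new;
}

static void list_del_init(struct list_head *old)
{
        old->prev->next = old->next;    /* crashes here if a neighbour */
        old->next->prev = old->prev;    /* was already freed/unlinked  */
        old->next = old;
        old->prev = old;
}

struct lock_entry {                     /* stands in for a posix lock  */
        struct list_head list;
};

static struct list_head lock_list = { &lock_list, &lock_list };
static struct lock_entry *entry;

/* Thread A: models the cleanup/teardown path freeing the table. */
static void *cleanup_thread(void *arg)
{
        (void)arg;
        /* Tear down the list without taking the lock the disconnect
         * path would need -- the suspected bug. */
        free(entry);                    /* entry memory is now gone    */
        lock_list.next = &lock_list;
        lock_list.prev = &lock_list;
        return NULL;
}

/* Thread B: models pl_inodelk_client_cleanup() on client disconnect. */
static void *disconnect_thread(void *arg)
{
        (void)arg;
        list_del_init(&entry->list);    /* may touch freed memory      */
        return NULL;
}

int main(void)
{
        pthread_t a, b;

        entry = calloc(1, sizeof(*entry));
        list_add(&entry->list, &lock_list);

        pthread_create(&a, NULL, cleanup_thread, NULL);
        pthread_create(&b, NULL, disconnect_thread, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);

        printf("survived this run; the race is timing dependent\n");
        return 0;
}

If this is the actual interaction, the likely direction for a fix is to 
serialize the disconnect-time cleanup and process teardown on the same 
lock (or have teardown wait for in-flight disconnects), but again, that 
needs checking against the source.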

>
> We should probably concentrate on fixing the most common
> spurious failures soon, and look into the less common ones
> later on.
>
> I'll do some runs on release-3.6 soon too, as I suspect that'll
> be useful.
>
> + Justin
>
> --
> GlusterFS - http://www.gluster.org
>
> An open source, distributed file system scaling to several
> petabytes, and handling thousands of clients.
>
> My personal twitter: twitter.com/realjustinclift
>
> _______________________________________________
> Gluster-devel mailing list
> Gluster-devel at gluster.org
> http://www.gluster.org/mailman/listinfo/gluster-devel
>

