[Gluster-infra] [Bug 1379228] smoke test fails with read/write failed (ENOTCONN)

Tue Sep 27 19:24:09 UTC 2016

https://bugzilla.redhat.com/show_bug.cgi?id=1379228

Shyamsundar <srangana at redhat.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |srangana at redhat.com

--- Comment #9 from Shyamsundar <srangana at redhat.com> ---
My notes:

The script:
https://github.com/gluster/glusterfs-patch-acceptance-tests/blob/master/smoke.sh

Success case: https://build.gluster.org/job/smoke/30870/console
=============
10:47:56 + wait %3
  ---> This is printed once the wait on %2 has completed, so dbench was done by
this time and the script is about to wait on %3 (IOW, the printed line is the
one about to be executed)
10:48:51 All tests successful.
10:48:51 Files=191, Tests=1960, 129 wallclock secs ( 1.28 usr  0.36 sys +  9.57
cusr  7.43 csys = 18.64 CPU)
10:48:51 Result: PASS
  ---> %3 (compliance) completed after about 129 seconds (dbench takes about
71-72 seconds including the warmup), so the wait above is over and we proceed
  ---> cleanup starts
10:48:51 + rm -rf clients
10:48:53 + cd -
10:48:53 /home/jenkins/root/workspace/smoke
10:48:53 + finish
10:48:53 + RET=0
  ---> NOTE: RET here takes the exit status of rm -rf clients, not sure if this
is intended
10:48:53 + '[' 0 -ne 0 ']'
10:48:53 + cleanup
  ---> cleanup is invoked by finish, and is traced here presumably because the
script has enabled set -x by this point (the watchdog's cleanup is not traced;
see the failure case)
10:48:53 + killall -15 glusterfs glusterfsd glusterd
  ---> All well!
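
The set -x remark above can be checked with a small model. This is only a
sketch under assumptions: the function names and the 1-second timeout are
illustrative and not taken from smoke.sh. It shows that a background job forked
before set -x is not traced, while the same function called from the main
shell afterwards is.

```shell
# Hypothetical miniature; not the actual smoke.sh.
demo='
cleanup() { echo "cleanup by $1"; }
watchdog() {
    sleep "$1"
    echo "Kicking in watchdog after $1 secs"
    cleanup watchdog
}
watchdog 1 &        # forked before tracing is enabled
set -x
cleanup main        # traced, because xtrace is now on in the main shell
wait                # let the watchdog finish before exiting
'
# Capture only stderr, which is where the shell writes its xtrace output.
trace=$(sh -c "$demo" 2>&1 >/dev/null)
echo "$trace"
```

The captured trace should contain "+ cleanup main" but nothing from the
watchdog, which would match a console log that shows the "Kicking in watchdog"
message without any traced cleanup commands.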

Failure case: https://build.gluster.org/job/smoke/30852/console
=============
00:03:16 All tests successful.
00:03:16 Files=191, Tests=1960, 93 wallclock secs ( 0.89 usr  0.26 sys +  5.46
cusr  3.30 csys =  9.91 CPU)
00:03:16 Result: PASS
00:11:36 Kicking in watchdog after 600 secs
  ---> Where are the watchdog's cleanup calls logged? It appears that the
watchdog is started before set -x, and hence its cleanup is not traced here
  ---> Assuming cleanup was called, it killed all gluster processes, dbench
finally errored out in the read (no connection), and hence %2 completed
00:11:36 + wait %3
  ---> the wait for %3 starts and returns immediately, as compliance finished
running about 8 minutes earlier (00:03:16)
00:11:36 + rm -rf clients
00:11:36 rm: cannot remove `clients': Transport endpoint is not connected
  ---> The removal fails because the watchdog has already killed the gluster
processes, so this rm -rf on the dead mount fails (the cleanup failed; is this
an issue for the next run?)
00:11:36 + finish
00:11:36 + RET=1
  ---> rm -rf failed and RET caught that; is this what is intended?
00:11:36 + '[' 1 -ne 0 ']'
00:11:36 + cat /build/dbench-logs
-----------
00:11:36   10  cleanup 581 sec
  ---> dbench has been attempting cleanup for 580 odd seconds
00:11:36 [643] read failed on handle 10007 (Transport endpoint is not
connected)
  ---> Finally the dbench clients get connection errors, as the watchdog shut
down the processes and hence the volume, and dbench exits
-----------
00:11:36 + cleanup
  ---> Called by finish; killall finds nothing to kill, as the watchdog has
already cleaned up
00:11:36 + killall -15 glusterfs glusterfsd glusterd
00:11:36 glusterfs: no process killed
00:11:36 glusterfsd: no process killed
00:11:36 glusterd: no process killed
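
The RET values in the two logs are consistent with finish being an EXIT trap
in a script running under set -e. That is an assumption here, not something
checked against smoke.sh. A minimal sketch under that assumption, with true
and false standing in for a succeeding and a failing rm -rf clients:

```shell
# Hypothetical miniature of the tail of the script; not the real smoke.sh.
demo='
set -e
finish() {
    RET=$?                 # exit status of the last command before exit
    echo "RET=$RET"
}
trap finish EXIT
"$@"                       # stand-in for: rm -rf clients
true                       # stand-in for: cd -
'
ok=$(sh -c "$demo" demo true) || :
bad=$(sh -c "$demo" demo false) || :
echo "success case: $ok"
echo "failure case: $bad"
```

This prints RET=0 for the success case and RET=1 for the failure case. In the
failing run the stand-in for cd - is never reached, which matches the failure
log going straight from the rm error to finish.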

Root cause:
===========
Looks like dbench got stuck at
https://github.com/sahlberg/dbench/blob/master/fileio.c#L400 (or pread) and
was never able to break out of it. As a result dbench did not complete until
the volume and the mount were taken down, at which point it errored out.
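
The sequence can be modelled in a few lines of shell. This is a hedged sketch,
not the smoke script: sleep 30 stands in for dbench blocked in the read, kill
stands in for the watchdog taking the volume down, and the job is waited on by
PID rather than by the %2 job spec seen in the console log.

```shell
# Hypothetical model of the hang; names and timings are illustrative.
demo='
sleep 30 &                      # stand-in for the stuck dbench job
io_pid=$!
( sleep 1; echo "Kicking in watchdog after 1 secs"; kill "$io_pid" ) &
wait "$io_pid"                  # blocks until the stand-in job dies
echo "wait returned $?"
'
out=$(sh -c "$demo")
echo "$out"
```

The wait only returns after the watchdog line is printed, and it returns a
non-zero status (typically 143, i.e. 128 + SIGTERM), just as the script in the
failure case only moved on to wait %3 once the watchdog had fired.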

Why it got stuck there is the next question, I guess.
