[Gluster-devel] NetBSD hanging regression tests

Sat Mar 7 05:06:50 UTC 2015

Hi 

Recently NetBSD regression tests started hanging quite frequently. Here is
an example:
http://build.gluster.org/job/rackspace-netbsd7-regression-triggered/1679/

The offending test is root-squash-self-heal.t which starts a never-ending
glfsheal process:
  PID LID WCHAN   STAT     LTIME COMMAND
28554   5 parked  Rl     0:00.04 /build/install/sbin/glfsheal patchy 
28554   4 nanoslp Rl     0:01.28 /build/install/sbin/glfsheal patchy 
28554   3 -       Rl     0:00.00 /build/install/sbin/glfsheal patchy 
28554   1 -       Rl   754:21.27 /build/install/sbin/glfsheal patchy 

Thread 1 has consumed a lot of CPU time. It is looping on failed writes:
 28554      1 glfsheal CALL  __gettimeofday50(0xbf7fe650,0)
 28554      1 glfsheal RET   __gettimeofday50 0
 28554      1 glfsheal CALL  write(9,0xbb7c63fb,6)
 28554      1 glfsheal RET   write -1 errno 35 Resource temporarily unavailable
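
For reference, errno 35 on NetBSD is EAGAIN, which write() returns on a pipe when the descriptor is non-blocking and the pipe is full. I have not checked how glusterfs sets up this descriptor, so treat the non-blocking part as an inference from the errno. The sketch below is not glusterfs code: fill_unread_pipe() is a hypothetical helper that only shows the generic behaviour, where a few writes succeed and then every write fails once nobody drains the pipe.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: fill a non-blocking pipe that nobody reads.
 * Stores the number of successful 6-byte writes in *count and returns
 * the errno of the first failed write, or -1 if the setup fails. */
int fill_unread_pipe(long *count)
{
        int fd[2];
        int flags;
        int err;

        *count = 0;
        if (pipe(fd) == -1)
                return -1;

        /* Assumption: the write end is non-blocking, as errno 35 suggests. */
        flags = fcntl(fd[1], F_GETFL);
        if (flags == -1 || fcntl(fd[1], F_SETFL, flags | O_NONBLOCK) == -1) {
                close(fd[0]);
                close(fd[1]);
                return -1;
        }

        /* Same payload as event_dispatch_destroy(): "dummy" plus its NUL. */
        for (;;) {
                if (write(fd[1], "dummy", 6) == -1) {
                        err = errno;
                        break;
                }
                (*count)++;
        }

        close(fd[0]);
        close(fd[1]);
        return err;
}
```

Calling it from a small main() and printing the return value should show EAGAIN after a number of successful writes that depends on the pipe capacity of the platform.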

Running a standalone glfsheal process shows it first writes "dummy" before
it hits the same error. This suggests we are in event_dispatch_destroy():

                /* Write to pipe(fd[1]) and then wait for 1 second or until
                 * a poller thread that is dying, broadcasts.
                 */
                while (event_pool->activethreadcount > 0) {
                        write (fd[1], "dummy", 6);
                        sleep_till.tv_sec = time (NULL) + 1;
                        ret = pthread_cond_timedwait (&event_pool->cond,
                                                      &event_pool->mutex,
                                                      &sleep_till);
                }

Obviously something went wrong. Perhaps there should be a timeout there,
and/or a check that write() does not fail?

diff --git a/libglusterfs/src/event.c b/libglusterfs/src/event.c
index f19d43a..b956d25 100644
--- a/libglusterfs/src/event.c
+++ b/libglusterfs/src/event.c
@@ -235,10 +235,14 @@ event_dispatch_destroy (struct event_pool *event_pool)
         pthread_mutex_lock (&event_pool->mutex);
         {
                 /* Write to pipe(fd[1]) and then wait for 1 second or until
-                 * a poller thread that is dying, broadcasts.
+                 * a poller thread that is dying, broadcasts. Make sure we
+                 * do not loop forever by limiting to 10 retries
                  */
-                while (event_pool->activethreadcount > 0) {
-                        write (fd[1], "dummy", 6);
+                int retry = 0;
+
+                while (event_pool->activethreadcount > 0 && retry++ < 10) {
+                        if (write (fd[1], "dummy", 6) == -1)
+                                break;
                         sleep_till.tv_sec = time (NULL) + 1;
                         ret = pthread_cond_timedwait (&event_pool->cond,
                                                       &event_pool->mutex,
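
The same idea as the patch above, written as a standalone function so it can be tried outside glusterfs. This is only a sketch: wait_for_pollers() and its parameters are hypothetical names, and like the patch it treats any write() failure as a reason to stop waiting, which may or may not be the right policy.

```c
#include <pthread.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical helper, not glusterfs code. The caller must hold *mutex.
 * Returns 0 once *active drops to zero, -1 if write() fails or if
 * max_retries one-second waits elapse first. */
int wait_for_pollers(int wfd, pthread_mutex_t *mutex, pthread_cond_t *cond,
                     const int *active, int max_retries)
{
        struct timespec sleep_till = { 0, 0 };
        int retry = 0;

        while (*active > 0) {
                if (retry++ >= max_retries)
                        return -1;      /* gave up: pollers never exited */
                if (write(wfd, "dummy", 6) == -1)
                        return -1;      /* wake-up could not be delivered */
                sleep_till.tv_sec = time(NULL) + 1;
                /* Returns early on a broadcast, or times out after about 1 s. */
                pthread_cond_timedwait(cond, mutex, &sleep_till);
        }
        return 0;
}
```

With max_retries set to 10, the caller is blocked for roughly ten seconds at most instead of spinning forever, and it can log the -1 case before going on with the teardown.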

-- 
Emmanuel Dreyfus
http://hcpnet.free.fr/pubz
manu at netbsd.org

