[Gluster-devel] debugging ping timeouts

Fri Mar 21 09:25:29 UTC 2014

hi,
    I do not think glusterfs can currently tell why a ping-timeout happened. By the time a user learns of such an event, the client has already disconnected and reconnected, so the issue can no longer be debugged. One reason ping-timeouts may happen is that the epoll thread is busy doing something, most probably waiting on a mutex lock. So I am thinking we should record how long each lock acquisition waits and how long each critical section runs, and report the worst cases at the time of disconnect.

pseudo code:

PTHREAD_MUTEX_LOCK(lock) {
     get the current time to T1;
     pthread_mutex_lock (lock);
     get the current time to T2;
     if T2-T1 is greater than the already recorded wait time, update it //maybe we should also remember the xlator in which it happened.
}

PTHREAD_MUTEX_UNLOCK(lock) {
     get the current time to T3;
     pthread_mutex_unlock (lock);
     if T3-T2 is greater than the already recorded hold time, update it //T2 is the time saved when this lock was acquired
}
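
A minimal C sketch of the pseudo code above, just to make the idea concrete. The names timed_mutex_t, timed_mutex_init, timed_mutex_lock and timed_mutex_unlock are hypothetical and are not existing glusterfs APIs. It assumes the timing data lives next to each mutex and that CLOCK_MONOTONIC is available. Unlike the pseudo code, it updates the hold time just before unlocking, so that the update itself is protected by the mutex.

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical wrapper: a mutex plus the worst-case timings seen so far. */
typedef struct {
    pthread_mutex_t mutex;
    struct timespec acquired_at; /* T2, meaningful only while the lock is held */
    int64_t max_wait_ns;         /* longest T2 - T1 observed */
    int64_t max_hold_ns;         /* longest T3 - T2 observed */
} timed_mutex_t;

/* Nanoseconds elapsed between two monotonic timestamps. */
static int64_t elapsed_ns(const struct timespec *from, const struct timespec *to)
{
    return (int64_t)(to->tv_sec - from->tv_sec) * 1000000000LL
           + (int64_t)(to->tv_nsec - from->tv_nsec);
}

static int timed_mutex_init(timed_mutex_t *tm)
{
    tm->acquired_at.tv_sec = 0;
    tm->acquired_at.tv_nsec = 0;
    tm->max_wait_ns = 0;
    tm->max_hold_ns = 0;
    return pthread_mutex_init(&tm->mutex, NULL);
}

static int timed_mutex_lock(timed_mutex_t *tm)
{
    struct timespec t1, t2;
    int ret;

    clock_gettime(CLOCK_MONOTONIC, &t1);      /* T1: before waiting */
    ret = pthread_mutex_lock(&tm->mutex);
    if (ret != 0)
        return ret;
    clock_gettime(CLOCK_MONOTONIC, &t2);      /* T2: lock acquired */

    /* We hold the mutex here, so these fields can be updated safely. */
    int64_t waited = elapsed_ns(&t1, &t2);
    if (waited > tm->max_wait_ns)
        tm->max_wait_ns = waited;
    tm->acquired_at = t2;
    return 0;
}

static int timed_mutex_unlock(timed_mutex_t *tm)
{
    struct timespec t3;

    clock_gettime(CLOCK_MONOTONIC, &t3);      /* T3: end of critical section */
    int64_t held = elapsed_ns(&tm->acquired_at, &t3);
    if (held > tm->max_hold_ns)
        tm->max_hold_ns = held;               /* still under the mutex */
    return pthread_mutex_unlock(&tm->mutex);
}
```

Keeping the counters per lock avoids contention on a shared global, at the cost of having to walk all instrumented locks when a report is needed. Remembering the xlator would need one more field, which is left out here.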

Something similar should be done for spin_locks as well.

When a disconnect event comes, this information will be logged along with the disconnect messages.
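
As a rough illustration of what could be logged at disconnect time, here is a small C sketch that formats the recorded worst cases into one line. The lock_stats_t type, the format_lock_report function and the message layout are all assumptions made for this example. How the line would be passed to the glusterfs logging code is left open.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical snapshot of the worst-case lock timings, in nanoseconds. */
typedef struct {
    const char *xlator_name; /* xlator in which the worst case was seen */
    int64_t max_wait_ns;     /* longest wait to acquire a lock */
    int64_t max_hold_ns;     /* longest critical section */
} lock_stats_t;

/* Write a one-line report into buf, with timings in microseconds.
 * Returns what snprintf returns: the length the full line needs. */
static int format_lock_report(char *buf, size_t len, const lock_stats_t *s)
{
    return snprintf(buf, len,
                    "lock stats: xlator=%s max_wait_us=%" PRId64
                    " max_hold_us=%" PRId64,
                    s->xlator_name,
                    s->max_wait_ns / 1000,
                    s->max_hold_ns / 1000);
}
```

The disconnect path would take a snapshot of the counters, call format_lock_report, and append the result to the existing disconnect message, so the data is captured before the client reconnects.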

If you can think of anything else, please add it to the thread, and we will make a call after a while on what else can be done to debug such issues further.

Pranith