[Gluster-devel] quota.t hangs on NetBSD machines

Thu Dec 31 11:01:37 UTC 2015

On Thu, Dec 31, 2015 at 03:40:54PM +0530, Raghavendra Talur wrote:

We have threads sleeping, either voluntarily (nanosleep) or not (lwp_park),
and this:

c5223a80 (glusterfs) is in 
sleepq_block/cv_timedwait_sig/sbwait/soreceive/soo_read/do_filereadv/sys_readv
Waiting for data while reading from a socket. Probably FUSE, but it would
be nice to be certain.

c5346540 (glusterfs) is in 
sleepq_block/cv_timedwait_sig/sigtimedwait1/sys_____sigtimedwait50
This is an ordinary sigtimedwait(), but the timeout argument (the third one)
is zero, which can let it sleep forever. Is that expected?
> cv_timedwait_sig(c53466b4,c5004b80,0,c53466a4,3,db727e90,c53466a4,c41eb528,db727eac,7ff0)
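For comparison, here is how the two cases look from userland. This is only a sketch: wait_signal() is a hypothetical helper, not something taken from the GlusterFS tree. A thread that waits for signals without a deadline (the sigwaitinfo() branch) is the kind of caller I would expect to show up with no timeout in the trace above, while the sigtimedwait() branch always comes back after the given delay.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <signal.h>
#include <time.h>

/*
 * Hypothetical helper, not GlusterFS code.
 * Waits for `signo`, which the caller must already have blocked.
 * timeout_ms < 0 means "no deadline": the thread may sleep forever.
 * Returns the signal number, or -1 (errno EAGAIN on timeout).
 */
int wait_signal(int signo, int timeout_ms)
{
    sigset_t set;
    struct timespec ts;

    sigemptyset(&set);
    sigaddset(&set, signo);

    if (timeout_ms < 0)
        return sigwaitinfo(&set, NULL);    /* unbounded wait */

    ts.tv_sec = timeout_ms / 1000;
    ts.tv_nsec = (long)(timeout_ms % 1000) * 1000000L;
    return sigtimedwait(&set, NULL, &ts);  /* bounded wait */
}
```

If the sleeping thread is a dedicated signal-handling thread, an unbounded wait is harmless; it would only be a problem if something else depends on that thread making progress.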

c5418020 (glusterfs) is in
sleepq_block/sel_do_scan/pollcommon/sys_poll
This is ordinary poll(2). The struct timespec for the timeout is at
db721f18, and again this is an infinite timeout:
crash> x db721f18,2
db721f18:       0           0
(NB: 2 words because we run on a 32-bit machine, where struct timespec is a
32-bit time_t and a 32-bit long)
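Seen from the caller's side, the difference between a bounded and an unbounded poll(2) is just the last argument. A minimal sketch, where wait_readable() is a hypothetical name and not a function from glusterfs: a negative timeout (INFTIM on NetBSD) blocks until an event arrives, which matches a thread parked in sel_do_scan with no deadline.

```c
#include <poll.h>

/*
 * Hypothetical helper, for illustration only.
 * Returns 1 if fd is readable, 0 on timeout, and -1 on error or when
 * poll(2) reports only an error/hangup condition.
 * A negative timeout_ms makes poll(2) block indefinitely.
 */
int wait_readable(int fd, int timeout_ms)
{
    struct pollfd pfd;
    int n;

    pfd.fd = fd;
    pfd.events = POLLIN;
    pfd.revents = 0;

    n = poll(&pfd, 1, timeout_ms);
    if (n < 0)
        return -1;              /* poll itself failed */
    if (n == 0)
        return 0;               /* deadline reached, nothing to read */
    return (pfd.revents & POLLIN) ? 1 : -1;
}
```

An event loop calling this with -1 is not wrong in itself; it simply means the thread will never wake up on its own if the expected event is lost.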

c53692c0 (perfused) is in
sleepq_block/cv_timedwait_sig/kevent1/sys___kevent50
Waiting for data (either from the kernel or from glusterfs, I do not know).
Again we have an infinite timeout.

I note that the FUSE filesystem is responding. Since perfused is
not multithreaded, this suggests it is not the stuck process. It may
have missed a request or a reply, though, which would leave the calling
process stuck.

Speaking of the calling process, I believe it is the quota utility?
Indeed, it is waiting for a reply from the filesystem:
UID   PID PPID  CPU PRI NI  VSZ  RSS WCHAN    STAT TTY       TIME COMMAND
  0 15221 1406 1546  85  0 3360 1080 puffsrpl I    pts/0- 0:00.06 tests/basic/quota /mnt/glusterfs/0/test_dir/1.txt 256 48 

Here is its backtrace obtained from gdb:
#0  0xbb69b6f7 in write () from /usr/lib/libc.so.12
#1  0x080489c0 in nwrite (fd=3, buf=0xbb501000, count=262144)
    at tests/basic/quota.c:16
#2  0x08048a8b in file_write (
    filename=0xbf7ffcb2 "/mnt/glusterfs/0/test_dir/1.txt", bs=262144, count=48)
    at tests/basic/quota.c:48
#3  0x08048b64 in main (argc=4, argv=0xbf7feba0) at tests/basic/quota.c:83

It is waiting for a write to complete, but we still do not know which process
got the request but not the reply. Do you see any way to tell?
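For readers without the source at hand: frames #1 and #2 suggest that file_write() calls nwrite() once per block (count=48 blocks of bs=262144 bytes). A write loop of that kind typically looks like the sketch below. This is a guess at the general shape, not the actual tests/basic/quota.c, and write_all() is a hypothetical name. The point is that such a loop has no deadline: if the filesystem never replies, write(2) never returns and the process stays in the wait channel shown above.

```c
#include <errno.h>
#include <unistd.h>

/*
 * Hypothetical sketch of a "write everything" loop; not the real nwrite().
 * Retries on short writes and on EINTR.
 * Returns count on success, -1 on error (errno set by write(2)).
 */
ssize_t write_all(int fd, const void *buf, size_t count)
{
    const char *p = buf;
    size_t left = count;

    while (left > 0) {
        ssize_t n = write(fd, p, left);   /* blocks until the fs answers */
        if (n < 0) {
            if (errno == EINTR)
                continue;                 /* interrupted, try again */
            return -1;
        }
        p += n;
        left -= (size_t)n;
    }
    return (ssize_t)count;
}
```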

-- 
Emmanuel Dreyfus
manu at netbsd.org