[Bugs] [Bug 1330132] New: Disperse volume fails on high load and logs show some assertion failures

bugzilla at redhat.com
Mon Apr 25 12:41:28 UTC 2016


https://bugzilla.redhat.com/show_bug.cgi?id=1330132

            Bug ID: 1330132
           Summary: Disperse volume fails on high load and logs show some
                    assertion failures
           Product: GlusterFS
           Version: 3.7.10
         Component: disperse
          Severity: high
          Assignee: bugs at gluster.org
          Reporter: xhernandez at datalab.es
                CC: bugs at gluster.org



Description of problem:

A distributed iozone test run over multiple NFS mounts on different machines
fails, and several assertion failures appear in the logs:

[2016-04-21 19:29:58.096645] E [ec-inode-read.c:1157:ec_readv_rebuild]
(-->/usr/lib64/glusterfs/3.7.10/xlator/cluster/disperse.so(__ec_manager+0x5b)
[0x7f9e4e8f18bb]
-->/usr/lib64/glusterfs/3.7.10/xlator/cluster/disperse.so(ec_manager_readv+0x107)
[0x7f9e4e908197]
-->/usr/lib64/glusterfs/3.7.10/xlator/cluster/disperse.so(ec_readv_rebuild+0x236)
[0x7f9e4e907f26] ) 0-: Assertion failed: ec_get_inode_size(fop, fop->fd->inode,
&cbk->iatt[0].ia_size)
[2016-04-21 19:29:58.126547] E [ec-common.c:1641:ec_lock_unfreeze]
(-->/usr/lib64/glusterfs/3.7.10/xlator/cluster/disperse.so(ec_manager_inodelk+0x155)
[0x7f9e4e8fc305]
-->/usr/lib64/glusterfs/3.7.10/xlator/cluster/disperse.so(ec_unlocked+0x35)
[0x7f9e4e8f3c25]
-->/usr/lib64/glusterfs/3.7.10/xlator/cluster/disperse.so(ec_lock_unfreeze+0x100)
[0x7f9e4e8f3ab0] ) 0-: Assertion failed: list_empty(&lock->waiting) &&
list_empty(&lock->owners)
[2016-04-21 19:30:05.998568] E [ec-inode-read.c:1612:ec_manager_stat]
(-->/usr/lib64/glusterfs/3.7.10/xlator/cluster/disperse.so(ec_resume+0x88)
[0x7f9e4e8f1a68]
-->/usr/lib64/glusterfs/3.7.10/xlator/cluster/disperse.so(__ec_manager+0x5b)
[0x7f9e4e8f18bb]
-->/usr/lib64/glusterfs/3.7.10/xlator/cluster/disperse.so(ec_manager_stat+0x315)
[0x7f9e4e905ed5] ) 0-: Assertion failed: ec_get_inode_size(fop,
fop->locks[0].lock->loc.inode, &cbk->iatt[0].ia_size)
[2016-04-21 19:30:05.999146] E [MSGID: 114031]
[client-rpc-fops.c:1624:client3_3_inodelk_cbk] 0-test-client-8: remote
operation failed [Invalid argument]
[2016-04-21 19:30:05.999132] E [MSGID: 114031]
[client-rpc-fops.c:1624:client3_3_inodelk_cbk] 0-test-client-10: remote
operation failed [Invalid argument]
[2016-04-21 19:30:05.999237] E [MSGID: 114031]
[client-rpc-fops.c:1624:client3_3_inodelk_cbk] 0-test-client-11: remote
operation failed [Invalid argument]
[2016-04-21 19:30:05.999259] E [MSGID: 114031]
[client-rpc-fops.c:1624:client3_3_inodelk_cbk] 0-test-client-7: remote
operation failed [Invalid argument]
[2016-04-21 19:30:05.999326] E [MSGID: 114031]
[client-rpc-fops.c:1624:client3_3_inodelk_cbk] 0-test-client-9: remote
operation failed [Invalid argument]
[2016-04-21 19:30:06.047496] E [MSGID: 114031]
[client-rpc-fops.c:1624:client3_3_inodelk_cbk] 0-test-client-6: remote
operation failed [Invalid argument]
[2016-04-21 19:30:06.047559] W [MSGID: 122015] [ec-common.c:1675:ec_unlocked]
0-test-disperse-1: entry/inode unlocking failed (FSTAT) [Invalid argument]

Version-Release number of selected component (if applicable): mainline


How reproducible:

It happens randomly after the distributed iozone test has been running for some
time.

Steps to Reproduce:
1. Create a disperse volume and mount it via NFS on several machines.
2. Run a distributed iozone test across those mounts.
3. Wait; after some time the assertion failures appear in the logs and the test
   fails.

Actual results:

Volume access fails and iozone quits with an error.

Expected results:

iozone should complete the test successfully.

Additional info:

This is probably caused by a race when cancelling the lock release timer while
its callback is already executing. In that case the new fop is not placed on
the correct waiting list.
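
The suspected sequence, sketched below with hypothetical names (this is not the
actual ec xlator code), is that the delayed lock-release timer and an incoming
fop race: if the fop cancels the timer only after the callback has already
started tearing the lock down, the fop must be queued as a new lock request
instead of being appended to the waiting list of the dying lock.

/* Minimal, self-contained sketch of the suspected race (hypothetical
 * names).  An idle lock is kept alive by a delayed-release timer.  A
 * new fop tries to cancel that timer and reuse the lock; if the timer
 * callback is already executing, the lock is being released and the
 * fop must not be added to its waiting list.
 * Build with: cc -pthread sketch.c */

#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

typedef struct {
    pthread_mutex_t mutex;
    bool            releasing;   /* timer callback has already started   */
    int             waiting;     /* fops queued on this lock instance     */
} ec_lock_sketch_t;

/* Timer callback: begins releasing the lock.  Once 'releasing' is set,
 * cancelling the timer is too late. */
static void *release_timer_cb(void *arg)
{
    ec_lock_sketch_t *lock = arg;

    pthread_mutex_lock(&lock->mutex);
    lock->releasing = true;          /* point of no return                */
    /* ... unlock on the bricks, drop references, etc. ...                */
    pthread_mutex_unlock(&lock->mutex);
    return NULL;
}

/* Returns true only if the callback had not started yet, i.e. the lock
 * may safely be reused and the fop added to its waiting list. */
static bool try_cancel_release(ec_lock_sketch_t *lock)
{
    bool reusable;

    pthread_mutex_lock(&lock->mutex);
    reusable = !lock->releasing;
    if (reusable)
        lock->waiting++;             /* safe: lock is still live          */
    pthread_mutex_unlock(&lock->mutex);
    return reusable;
}

int main(void)
{
    ec_lock_sketch_t lock = { PTHREAD_MUTEX_INITIALIZER, false, 0 };
    pthread_t timer;

    pthread_create(&timer, NULL, release_timer_cb, &lock);

    if (try_cancel_release(&lock))
        printf("timer cancelled: fop reuses the existing lock\n");
    else
        printf("callback already running: fop must request a new lock\n");

    pthread_join(&timer, NULL);
    return 0;
}

In this sketch the cancel/reuse decision is made under the same mutex the
callback takes, so a fop can never be appended to a lock whose release has
already begun; the bug described above would correspond to making that
decision without such a check.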

-- 
You are receiving this mail because:
You are on the CC list for the bug.
You are the assignee for the bug.

