[Gluster-users] gluster brick hang/High CPU load after 10 hours file transfer test

James Zhu james.zhu at Istuary.com
Thu Nov 17 19:28:50 UTC 2016


Hi,


I encountered a GlusterFS hang after a 10-hour file transfer test.


Versions: GlusterFS 3.7.14, nfs-ganesha 2.3.2


We are running on a 56-core SuperMicro server.


>sudo system-docker stats gluster nfs

CONTAINER           CPU %               MEM USAGE / LIMIT     MEM %               NET I/O             BLOCK I/O
gluster                  2694.74%            2.434 GB / 270.4 GB   0.90%               0 B / 0 B           0 B / 1.073 MB
nfs                          30.07%              146.6 MB / 270.4 GB   0.05%               0 B / 0 B           4.096 kB / 0 B


>top capture:

root     S    2556m   0%   24% /usr/local/sbin/glusterfsd -s denali-bm-qa-45 --volfile-id gluster-volume



Attaching gdb to one of the glusterfsd threads reported:

#0  pthread_spin_lock () at ../sysdeps/x86_64/nptl/pthread_spin_lock.S:32
#1  0x00007f945f379ae5 in pl_inode_get (this=this at entry=0x7f9460010720, inode=inode at entry=0x7f943ffe1edc) at common.c:416
#2  0x00007f945f3883be in pl_common_inodelk (frame=0x7f9467dc2ed8, this=0x7f9460010720, volume=0x7f945b5a9ac0 "gluster-volume-disperse-0", inode=0x7f943ffe1edc, cmd=6, flock=0x7f94678653d8, loc=0x7f94678652d8, fd=0x0,
    xdata=0x7f946a2e9180) at inodelk.c:743
#3  0x00007f945f388e27 in pl_inodelk (frame=<optimized out>, this=<optimized out>, volume=<optimized out>, loc=<optimized out>, cmd=<optimized out>, flock=<optimized out>, xdata=0x7f946a2e9180) at inodelk.c:816
#4  0x00007f946a00b5c6 in default_inodelk (frame=0x7f9467dc2ed8, this=0x7f9460011bf0, volume=0x7f945b5a9ac0 "gluster-volume-disperse-0", loc=0x7f94678652d8, cmd=6, lock=0x7f94678653d8, xdata=0x7f946a2e9180) at defaults.c:2032
#5  0x00007f946a01e324 in default_inodelk_resume (frame=0x7f9467dbabd4, this=0x7f9460013070, volume=0x7f945b5a9ac0 "gluster-volume-disperse-0", loc=0x7f94678652d8, cmd=6, lock=0x7f94678653d8, xdata=0x7f946a2e9180) at defaults.c:1589
#6  0x00007f946a03c1ce in call_resume_wind (stub=<optimized out>) at call-stub.c:2210
#7  0x00007f946a03c5bd in call_resume (stub=0x7f9467865298) at call-stub.c:2576
#8  0x00007f945ef5b2b2 in iot_worker (data=0x7f9460052ec0) at io-threads.c:215
#9  0x00007f946979270a in start_thread (arg=0x7f943cd5e700) at pthread_create.c:333
#10 0x00007f94694c882d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

It shows that many glusterfsd worker threads are busy-waiting in pthread_spin_lock for the lock to be released, which is what drives the CPU load so high.
           |-glusterfsd(772)-+-{glusterfsd}(773)
           |                 |-{glusterfsd}(774)
           |                 |-{glusterfsd}(775)
           |                 |-{glusterfsd}(776)
           |                 |-{glusterfsd}(777)
           |                 |-{glusterfsd}(778)
           |                 |-{glusterfsd}(779)
           |                 |-{glusterfsd}(780)
           |                 |-{glusterfsd}(781)
           |                 |-{glusterfsd}(782)
           |                 |-{glusterfsd}(783)
           |                 |-{glusterfsd}(784)
           |                 |-{glusterfsd}(785)
           |                 |-{glusterfsd}(786)
           |                 |-{glusterfsd}(787)
           |                 |-{glusterfsd}(788)
           |                 `-{glusterfsd}(789)
           |-glusterfsd(791)-+-{glusterfsd}(792)
           |                 |-{glusterfsd}(793)
           |                 |-{glusterfsd}(794)
           |                 |-{glusterfsd}(795)
           |                 |-{glusterfsd}(796)
           |                 |-{glusterfsd}(797)
           |                 |-{glusterfsd}(798)
           |                 |-{glusterfsd}(799)
           |                 |-{glusterfsd}(800)
           |                 |-{glusterfsd}(801)
           |                 |-{glusterfsd}(802)
           |                 |-{glusterfsd}(803)
           |                 |-{glusterfsd}(804)
           |                 |-{glusterfsd}(805)
           |                 |-{glusterfsd}(806)
           |                 |-{glusterfsd}(807)
           |                 `-{glusterfsd}(808)



If we just wait a few hours, the system recovers to normal on its own.


I am wondering how to dig deeper and discover what causes one of the threads to hold the lock for so long. Please give me your professional advice.


Best Regards!


James Zhu
