<html><head><meta http-equiv="Content-Type" content="text/html; charset=us-ascii"></head><body style="word-wrap: break-word; -webkit-nbsp-mode: space; line-break: after-white-space;" class="">Also, when it happened, Gluster CPU usage instantly jumped and all writes started to hang. Below is my server monitoring in Grafana (I have a telegraf agent running on the server to collect metrics). The Gluster CPU usage is the combined usage of all Gluster-related processes (telegraf does a regex search to match all processes containing "gluster").<div class=""><br class=""></div><div class=""><img apple-inline="yes" id="2AB2CA8B-2F2E-4B88-8FFA-93A074BF0EAA" width="434" height="584" src="cid:DFC0E1B5-07C2-45A3-A7ED-C6592BB8BCA5@akunacapital.local" class=""><br class=""><div><br class=""><blockquote type="cite" class=""><div class="">On Aug 6, 2018, at 08:26, Yuhao Zhang <<a href="mailto:zzyzxd@gmail.com" class="">zzyzxd@gmail.com</a>> wrote:</div><br class="Apple-interchange-newline"><div class=""><meta http-equiv="Content-Type" content="text/html; charset=us-ascii" class=""><div style="word-wrap: break-word; -webkit-nbsp-mode: space; line-break: after-white-space;" class=""><div class="">Hello,</div><div class=""><br class=""></div>I just experienced another hang one hour ago, and the server was not even under heavy IO.<div class=""><br class=""></div><div class="">Atin, I attached the process monitoring results and another statedump.</div><div class=""><br class=""></div><div class="">Xavi, ZFS was fine; during the hang I could still write directly to the ZFS volume. 
My ZFS version: ZFS: Loaded module v0.6.5.6-0ubuntu16, ZFS pool version 5000, ZFS filesystem version 5</div><div class=""><div class=""><br class=""></div><div class="">Thank you,</div><div class="">Yuhao</div><div class=""></div></div></div><span id="cid:A196C013-6777-4A12-9F43-2EAFDCA21BDC@akunacapital.local"><syslog.1></span><div style="word-wrap: break-word; -webkit-nbsp-mode: space; line-break: after-white-space;" class=""><meta http-equiv="Content-Type" content="text/html; charset=us-ascii" class=""><div class=""><div class=""></div></div></div><span id="cid:0B074854-5BC5-4E22-959E-DE60BADBF82D@akunacapital.local"><top_proc.txt></span><div style="word-wrap: break-word; -webkit-nbsp-mode: space; line-break: after-white-space;" class=""><meta http-equiv="Content-Type" content="text/html; charset=us-ascii" class=""><div class=""><div class=""></div></div></div><span id="cid:14E04B8E-8C45-4A0A-B75D-53C9AF8EDB4B@akunacapital.local"><top_thr.txt></span><div style="word-wrap: break-word; -webkit-nbsp-mode: space; line-break: after-white-space;" class=""><meta http-equiv="Content-Type" content="text/html; charset=us-ascii" class=""><div class=""><div class=""></div></div></div><span id="cid:BA1CEBAE-17FA-4E34-87AC-554F94BB116A@akunacapital.local"><zfs-vol.6855.dump.1533559692></span><meta http-equiv="Content-Type" content="text/html; charset=us-ascii" class=""><div style="word-wrap: break-word; -webkit-nbsp-mode: space; line-break: after-white-space;" class=""><div class=""><div class=""><br class=""><div class=""><br class=""><blockquote type="cite" class=""><div class="">On Aug 6, 2018, at 02:03, Xavi Hernandez <<a href="mailto:jahernan@redhat.com" class="">jahernan@redhat.com</a>> wrote:</div><br class="Apple-interchange-newline"><div class=""><div dir="ltr" style="caret-color: rgb(0, 0, 0); font-family: Helvetica; font-size: 14px; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; text-align: start; text-indent: 0px; 
text-transform: none; white-space: normal; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration: none;" class="">Hi Yuhao,<div class=""><br class=""></div><div class="">On Mon, Aug 6, 2018 at 6:57 AM Yuhao Zhang <<a href="mailto:zzyzxd@gmail.com" class="">zzyzxd@gmail.com</a>> wrote:<br class=""></div><div class="gmail_quote"><blockquote class="gmail_quote" style="margin: 0px 0px 0px 0.8ex; border-left-width: 1px; border-left-style: solid; border-left-color: rgb(204, 204, 204); padding-left: 1ex;"><div style="word-wrap: break-word; line-break: after-white-space;" class=""><div class="">Atin, that was my typo... I think it was glusterfsd, but not 100% sure. I will keep an eye on it when it happens next time.</div><div class=""><br class=""></div>Thank you all for looking into this! I tried another transfer earlier today, but it didn't reach the point where glusterfsd starts to fail before we needed to start the mission-critical production jobs. I am going to try again next week and report back to the thread.<div class=""><br class=""></div><div class="">Btw, I took a look at syslog and found this event right at the time the clients started to hang:<br class=""><div class=""><div class=""><br class=""></div></div></div><blockquote style="margin: 0px 0px 0px 40px; border: none; padding: 0px;" class=""><div class=""><div class=""><div class=""><div class="">Aug 4 01:54:45 ch1prdtick03 kernel: [30601.142702] INFO: task glusteriotwr3:6895 blocked for more than 120 seconds.</div></div></div></div><div class=""><div class=""><div class="">Aug 4 01:54:45 ch1prdtick03 kernel: [30601.142760] Tainted: P O 4.4.0-116-generic #140-Ubuntu</div></div></div><div class=""><div class=""><div class="">Aug 4 01:54:45 ch1prdtick03 kernel: [30601.142805] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.</div></div></div><div class=""><div class=""><div class="">Aug 4 01:54:45 ch1prdtick03 kernel: [30601.142863] 
glusteriotwr3 D ffff88102c18fb98 0 6895 1 0x00000000</div></div></div><div class=""><div class=""><div class="">Aug 4 01:54:45 ch1prdtick03 kernel: [30601.142872] ffff88102c18fb98 ffff88102c18fb68 ffff88085be0b800 ffff88103f314600</div></div></div><div class=""><div class=""><div class="">Aug 4 01:54:45 ch1prdtick03 kernel: [30601.142877] ffff88102c190000 ffff8805397fe0f4 ffff88103f314600 00000000ffffffff</div></div></div><div class=""><div class=""><div class="">Aug 4 01:54:45 ch1prdtick03 kernel: [30601.142882] ffff8805397fe0f8 ffff88102c18fbb0 ffffffff8184ae45 ffff8805397fe0f0</div></div></div><div class=""><div class=""><div class="">Aug 4 01:54:45 ch1prdtick03 kernel: [30601.142887] Call Trace:</div></div></div><div class=""><div class=""><div class="">Aug 4 01:54:45 ch1prdtick03 kernel: [30601.142899] [<ffffffff8184ae45>] schedule+0x35/0x80</div></div></div><div class=""><div class=""><div class="">Aug 4 01:54:45 ch1prdtick03 kernel: [30601.142904] [<ffffffff8184b0ee>] schedule_preempt_disabled+0xe/0x10</div></div></div><div class=""><div class=""><div class="">Aug 4 01:54:45 ch1prdtick03 kernel: [30601.142910] [<ffffffff8184cd29>] __mutex_lock_slowpath+0xb9/0x130</div></div></div><div class=""><div class=""><div class="">Aug 4 01:54:45 ch1prdtick03 kernel: [30601.142915] [<ffffffff8184cdbf>] mutex_lock+0x1f/0x30</div></div></div><div class=""><div class=""><div class="">Aug 4 01:54:45 ch1prdtick03 kernel: [30601.142922] [<ffffffff8122071d>] walk_component+0x21d/0x310</div></div></div><div class=""><div class=""><div class="">Aug 4 01:54:45 ch1prdtick03 kernel: [30601.142926] [<ffffffff81221ba1>] link_path_walk+0x191/0x5c0</div></div></div><div class=""><div class=""><div class="">Aug 4 01:54:45 ch1prdtick03 kernel: [30601.142931] [<ffffffff8121ff3b>] ? 
path_init+0x1eb/0x3c0</div></div></div><div class=""><div class=""><div class="">Aug 4 01:54:45 ch1prdtick03 kernel: [30601.142935] [<ffffffff812220cc>] path_lookupat+0x7c/0x110</div></div></div><div class=""><div class=""><div class="">Aug 4 01:54:45 ch1prdtick03 kernel: [30601.142940] [<ffffffff8123a828>] ? __vfs_setxattr_noperm+0x128/0x1a0</div></div></div><div class=""><div class=""><div class="">Aug 4 01:54:45 ch1prdtick03 kernel: [30601.142944] [<ffffffff81223d01>] filename_lookup+0xb1/0x180</div></div></div><div class=""><div class=""><div class="">Aug 4 01:54:45 ch1prdtick03 kernel: [30601.142948] [<ffffffff8123aadb>] ? setxattr+0x18b/0x200</div></div></div><div class=""><div class=""><div class="">Aug 4 01:54:45 ch1prdtick03 kernel: [30601.142955] [<ffffffff811f2109>] ? kmem_cache_alloc+0x189/0x1f0</div></div></div><div class=""><div class=""><div class="">Aug 4 01:54:45 ch1prdtick03 kernel: [30601.142960] [<ffffffff81223906>] ? getname_flags+0x56/0x1f0</div></div></div><div class=""><div class=""><div class="">Aug 4 01:54:45 ch1prdtick03 kernel: [30601.142964] [<ffffffff81223ea6>] user_path_at_empty+0x36/0x40</div></div></div><div class=""><div class=""><div class="">Aug 4 01:54:45 ch1prdtick03 kernel: [30601.142970] [<ffffffff81218d76>] vfs_fstatat+0x66/0xc0</div></div></div><div class=""><div class=""><div class="">Aug 4 01:54:45 ch1prdtick03 kernel: [30601.142975] [<ffffffff81219331>] SYSC_newlstat+0x31/0x60</div></div></div><div class=""><div class=""><div class="">Aug 4 01:54:45 ch1prdtick03 kernel: [30601.142980] [<ffffffff8121dade>] ? path_put+0x1e/0x30</div></div></div><div class=""><div class=""><div class="">Aug 4 01:54:45 ch1prdtick03 kernel: [30601.142984] [<ffffffff8123ac0f>] ? 
path_setxattr+0xbf/0xe0</div></div></div><div class=""><div class=""><div class="">Aug 4 01:54:45 ch1prdtick03 kernel: [30601.142989] [<ffffffff8121946e>] SyS_newlstat+0xe/0x10</div></div></div><div class=""><div class=""><div class="">Aug 4 01:54:45 ch1prdtick03 kernel: [30601.142996] [<ffffffff8184efc8>] entry_SYSCALL_64_fastpath+0x1c/0xbb</div></div></div></blockquote></div></blockquote><div class=""><br class=""></div><div class="">Which version of ZFS are you using?</div><div class=""><br class=""></div><div class="">It seems like a hang inside ZFS.</div><br class=""><div class="gmail_quote">Xavi</div><div class=""> </div><blockquote class="gmail_quote" style="margin: 0px 0px 0px 0.8ex; border-left-width: 1px; border-left-style: solid; border-left-color: rgb(204, 204, 204); padding-left: 1ex;"><div style="word-wrap: break-word; line-break: after-white-space;" class=""><div class=""><div class=""><br class=""></div><div class=""><br class=""></div><div class=""><div class=""><br class=""><blockquote type="cite" class=""><div class="">On Aug 5, 2018, at 08:39, Atin Mukherjee <<a href="mailto:atin.mukherjee83@gmail.com" target="_blank" class="">atin.mukherjee83@gmail.com</a>> wrote:</div><br class="m_137858721029725509Apple-interchange-newline"><div class=""><div style="font-family: Helvetica; font-size: 14px; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px; text-decoration: none;" class=""><br class="m_137858721029725509Apple-interchange-newline"><br class=""><div class="gmail_quote"><div dir="ltr" class="">On Sun, 5 Aug 2018 at 13:29, Yuhao Zhang <<a href="mailto:zzyzxd@gmail.com" target="_blank" class="">zzyzxd@gmail.com</a>> wrote:<br class=""></div><blockquote class="gmail_quote" style="margin: 0px 0px 0px 0.8ex; border-left-width: 1px; border-left-style: solid; border-left-color: rgb(204, 204, 204); padding-left: 
1ex;"><div style="word-wrap: break-word; line-break: after-white-space;" class="">Sorry, what I meant was, if I start the transfer now and get glusterd into zombie status,<span class="m_137858721029725509Apple-converted-space"> </span></div></blockquote><div dir="auto" class=""><br class=""></div><div dir="auto" class="">glusterd or glusterfsd?</div><div dir="auto" class=""><br class=""></div><blockquote class="gmail_quote" style="margin: 0px 0px 0px 0.8ex; border-left-width: 1px; border-left-style: solid; border-left-color: rgb(204, 204, 204); padding-left: 1ex;"><div style="word-wrap: break-word; line-break: after-white-space;" class="">it's unlikely that I can fully recover the server without a reboot.</div><div style="word-wrap: break-word; line-break: after-white-space;" class=""><br class=""><div class=""><br class=""><blockquote type="cite" class=""><div class="">On Aug 5, 2018, at 02:55, Raghavendra Gowdappa <<a href="mailto:rgowdapp@redhat.com" target="_blank" class="">rgowdapp@redhat.com</a>> wrote:</div><br class="m_137858721029725509m_-5646314163240277706Apple-interchange-newline"><div class=""><div dir="ltr" class=""><br class=""><div class="gmail_extra"><br class=""><div class="gmail_quote">On Sun, Aug 5, 2018 at 1:22 PM, Yuhao Zhang<span class="m_137858721029725509Apple-converted-space"> </span><span dir="ltr" class=""><<a href="mailto:zzyzxd@gmail.com" target="_blank" class="">zzyzxd@gmail.com</a>></span><span class="m_137858721029725509Apple-converted-space"> </span>wrote:<br class=""><blockquote class="gmail_quote" style="margin: 0px 0px 0px 0.8ex; border-left-width: 1px; border-left-style: solid; border-left-color: rgb(204, 204, 204); padding-left: 1ex;"><div style="word-wrap: break-word; line-break: after-white-space;" class="">This is a semi-production server and I can't bring it down right now. Will try to get the monitoring output when I get a chance. 
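<div style=">
The monitoring output mentioned above can be collected on the live server; a minimal sketch using bounded runs of the top commands suggested in this thread (the thread's commands use MON_INTERVAL=10 and run unbounded; shortened and bounded here so the capture finishes quickly):

```shell
# Capture per-process and per-thread CPU snapshots without taking the
# server down; output files match the names used in this thread.
MON_INTERVAL=2   # seconds between samples; the thread suggests 10 for long runs
SAMPLES=2        # -n bounds the run; omit it to monitor until interrupted
top -b -n "$SAMPLES" -d "$MON_INTERVAL" > "/tmp/top_proc.$(hostname).txt"   # CPU by process
top -b -H -n "$SAMPLES" -d "$MON_INTERVAL" > "/tmp/top_thr.$(hostname).txt" # CPU by thread
# Lines for gluster processes, if any are running (the same "gluster"
# substring match the telegraf agent described earlier uses):
grep -i gluster "/tmp/top_proc.$(hostname).txt" || true
```
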
</div></blockquote><div class=""><br class=""></div><div class="">Collecting top output doesn't require bringing down servers.</div><div class=""><br class=""></div><blockquote class="gmail_quote" style="margin: 0px 0px 0px 0.8ex; border-left-width: 1px; border-left-style: solid; border-left-color: rgb(204, 204, 204); padding-left: 1ex;"><div style="word-wrap: break-word; line-break: after-white-space;" class=""><div class=""><br class=""></div><div class="">As I recall, the high-CPU processes were brick daemons (glusterfsd), and htop showed they were in status D. However, I saw zero zpool IO as the clients were all hanging.<div class=""><div class="m_137858721029725509m_-5646314163240277706h5"><br class=""><div class=""><br class=""><blockquote type="cite" class=""><div class="">On Aug 5, 2018, at 02:38, Raghavendra Gowdappa <<a href="mailto:rgowdapp@redhat.com" target="_blank" class="">rgowdapp@redhat.com</a>> wrote:</div><br class="m_137858721029725509m_-5646314163240277706m_-5304337048872840202Apple-interchange-newline"><div class=""><br class="m_137858721029725509m_-5646314163240277706m_-5304337048872840202Apple-interchange-newline"><br style="font-family: Helvetica; font-size: 14px; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px; text-decoration: none;" class=""><div class="gmail_quote" style="font-family: Helvetica; font-size: 14px; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px; text-decoration: none;">On Sun, Aug 5, 2018 at 12:44 PM, Yuhao Zhang<span class="m_137858721029725509m_-5646314163240277706m_-5304337048872840202Apple-converted-space"> </span><span dir="ltr" class=""><<a href="mailto:zzyzxd@gmail.com" target="_blank" class="">zzyzxd@gmail.com</a>></span><span 
class="m_137858721029725509m_-5646314163240277706m_-5304337048872840202Apple-converted-space"> </span>wrote:<br class=""><blockquote class="gmail_quote" style="margin: 0px 0px 0px 0.8ex; border-left-width: 1px; border-left-style: solid; border-left-color: rgb(204, 204, 204); padding-left: 1ex;">Hi,<br class=""><br class="">I am running into a situation where heavy writes cause the Gluster server to go into a zombie state with many high-CPU processes and all clients hang; it is almost 100% reproducible on my machine. Hope someone can help.<br class=""></blockquote><div class=""><br class=""></div><div class="">Can you give us the output of monitoring the processes with high CPU usage, captured while your tests are running?<br class=""></div><div class=""><br class=""></div><div class=""><ul class=""><li class=""><span style="font-family: terminal, monaco, monospace;" class="">MON_INTERVAL=10 # can be increased for very long runs</span></li><li class=""><span style="font-family: terminal, monaco, monospace;" class="">top -bd $MON_INTERVAL > /tmp/top_proc.${HOSTNAME}.txt # CPU utilization by process</span></li><li class=""><span style="font-family: terminal, monaco, monospace;" class="">top -bHd $MON_INTERVAL > /tmp/top_thr.${HOSTNAME}.txt # CPU utilization by thread</span></li></ul><br class=""></div><blockquote class="gmail_quote" style="margin: 0px 0px 0px 0.8ex; border-left-width: 1px; border-left-style: solid; border-left-color: rgb(204, 204, 204); padding-left: 1ex;"><br class="">I started observing this issue when running rsync to copy files from another server, and I thought it might be because Gluster doesn't like rsync's delta transfer with a lot of small writes. However, I was able to reproduce this with "rsync --whole-file --inplace", or even with cp or scp. 
It usually appears a few hours after starting the transfer, but sometimes it can happen within several minutes.<br class=""><br class="">Since this is a single-node Gluster distributed volume, I tried transferring files directly onto the server, bypassing the Gluster clients, but it still caused the same issue.<br class=""><br class="">It is running on top of a ZFS RAIDZ2 dataset. I attached the statedump generated when my clients hung, along with the volume options.<br class=""><br class="">- Ubuntu 16.04 x86_64 / 4.4.0-116-generic<br class="">- GlusterFS 3.12.8<br class=""><br class="">Thank you,<br class="">Yuhao<br class=""><br class=""><br class="">_______________________________________________<br class="">Gluster-users mailing list<br class=""><a href="mailto:Gluster-users@gluster.org" target="_blank" class="">Gluster-users@gluster.org</a><br class=""><a href="https://lists.gluster.org/mailman/listinfo/gluster-users" rel="noreferrer" target="_blank" class="">https://lists.gluster.org/mailman/listinfo/gluster-users</a></blockquote></div></div></blockquote></div><br class=""></div></div></div></div></blockquote></div><br class=""></div></div></div></blockquote></div><br class=""></div></blockquote></div></div><span style="font-family: Helvetica; font-size: 14px; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px; text-decoration: none; float: none; display: inline !important;" class="">--<span 
class="m_137858721029725509Apple-converted-space"> </span></span><br style="font-family: Helvetica; font-size: 14px; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px; text-decoration: none;" class=""><div dir="ltr" class="m_137858721029725509gmail_signature" data-smartmail="gmail_signature" style="font-family: Helvetica; font-size: 14px; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px; text-decoration: none;">--Atin</div></div></blockquote></div><br class=""></div></div></div></blockquote></div></div></div></blockquote></div><br class=""></div></div></div></div></blockquote></div><br class=""></div></body></html>