<div dir="ltr">Hi Yuhao,<div><br></div><div>On Mon, Aug 6, 2018 at 6:57 AM Yuhao Zhang &lt;<a href="mailto:zzyzxd@gmail.com">zzyzxd@gmail.com</a>&gt; wrote:<br></div><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div style="word-wrap:break-word;line-break:after-white-space"><div>Atin, that was my typo... I think it was glusterfsd, but not 100% sure. I will keep an eye when it happens next time.</div><div><br></div>Thank you all for looking into this! I tried another transfer earlier today but it didn&#39;t get the chance to reach the point where glusterfsd starts to fail before we needed to start the production mission critical jobs. I am going to try another time next week and report back to the thread.<div><br></div><div>Btw, I took a look into syslog and find this event happened right at the time when clients started to hang:<br><div><div><br></div></div></div><blockquote style="margin:0 0 0 40px;border:none;padding:0px"><div><div><div><div>Aug  4 01:54:45 ch1prdtick03 kernel: [30601.142702] INFO: task glusteriotwr3:6895 blocked for more than 120 seconds.</div></div></div></div><div><div><div>Aug  4 01:54:45 ch1prdtick03 kernel: [30601.142760]       Tainted: P           O    4.4.0-116-generic #140-Ubuntu</div></div></div><div><div><div>Aug  4 01:54:45 ch1prdtick03 kernel: [30601.142805] &quot;echo 0 &gt; /proc/sys/kernel/hung_task_timeout_secs&quot; disables this message.</div></div></div><div><div><div>Aug  4 01:54:45 ch1prdtick03 kernel: [30601.142863] glusteriotwr3   D ffff88102c18fb98     0  6895      1 0x00000000</div></div></div><div><div><div>Aug  4 01:54:45 ch1prdtick03 kernel: [30601.142872]  ffff88102c18fb98 ffff88102c18fb68 ffff88085be0b800 ffff88103f314600</div></div></div><div><div><div>Aug  4 01:54:45 ch1prdtick03 kernel: [30601.142877]  ffff88102c190000 ffff8805397fe0f4 ffff88103f314600 00000000ffffffff</div></div></div><div><div><div>Aug  4 01:54:45 ch1prdtick03 kernel: [30601.142882]  ffff8805397fe0f8 ffff88102c18fbb0 ffffffff8184ae45 ffff8805397fe0f0</div></div></div><div><div><div>Aug  4 01:54:45 ch1prdtick03 kernel: [30601.142887] Call Trace:</div></div></div><div><div><div>Aug  4 01:54:45 ch1prdtick03 kernel: [30601.142899]  [&lt;ffffffff8184ae45&gt;] schedule+0x35/0x80</div></div></div><div><div><div>Aug  4 01:54:45 ch1prdtick03 kernel: [30601.142904]  [&lt;ffffffff8184b0ee&gt;] schedule_preempt_disabled+0xe/0x10</div></div></div><div><div><div>Aug  4 01:54:45 ch1prdtick03 kernel: [30601.142910]  [&lt;ffffffff8184cd29&gt;] __mutex_lock_slowpath+0xb9/0x130</div></div></div><div><div><div>Aug  4 01:54:45 ch1prdtick03 kernel: [30601.142915]  [&lt;ffffffff8184cdbf&gt;] mutex_lock+0x1f/0x30</div></div></div><div><div><div>Aug  4 01:54:45 ch1prdtick03 kernel: [30601.142922]  [&lt;ffffffff8122071d&gt;] walk_component+0x21d/0x310</div></div></div><div><div><div>Aug  4 01:54:45 ch1prdtick03 kernel: [30601.142926]  [&lt;ffffffff81221ba1&gt;] link_path_walk+0x191/0x5c0</div></div></div><div><div><div>Aug  4 01:54:45 ch1prdtick03 kernel: [30601.142931]  [&lt;ffffffff8121ff3b&gt;] ? path_init+0x1eb/0x3c0</div></div></div><div><div><div>Aug  4 01:54:45 ch1prdtick03 kernel: [30601.142935]  [&lt;ffffffff812220cc&gt;] path_lookupat+0x7c/0x110</div></div></div><div><div><div>Aug  4 01:54:45 ch1prdtick03 kernel: [30601.142940]  [&lt;ffffffff8123a828&gt;] ? __vfs_setxattr_noperm+0x128/0x1a0</div></div></div><div><div><div>Aug  4 01:54:45 ch1prdtick03 kernel: [30601.142944]  [&lt;ffffffff81223d01&gt;] filename_lookup+0xb1/0x180</div></div></div><div><div><div>Aug  4 01:54:45 ch1prdtick03 kernel: [30601.142948]  [&lt;ffffffff8123aadb&gt;] ? setxattr+0x18b/0x200</div></div></div><div><div><div>Aug  4 01:54:45 ch1prdtick03 kernel: [30601.142955]  [&lt;ffffffff811f2109&gt;] ? kmem_cache_alloc+0x189/0x1f0</div></div></div><div><div><div>Aug  4 01:54:45 ch1prdtick03 kernel: [30601.142960]  [&lt;ffffffff81223906&gt;] ? getname_flags+0x56/0x1f0</div></div></div><div><div><div>Aug  4 01:54:45 ch1prdtick03 kernel: [30601.142964]  [&lt;ffffffff81223ea6&gt;] user_path_at_empty+0x36/0x40</div></div></div><div><div><div>Aug  4 01:54:45 ch1prdtick03 kernel: [30601.142970]  [&lt;ffffffff81218d76&gt;] vfs_fstatat+0x66/0xc0</div></div></div><div><div><div>Aug  4 01:54:45 ch1prdtick03 kernel: [30601.142975]  [&lt;ffffffff81219331&gt;] SYSC_newlstat+0x31/0x60</div></div></div><div><div><div>Aug  4 01:54:45 ch1prdtick03 kernel: [30601.142980]  [&lt;ffffffff8121dade&gt;] ? path_put+0x1e/0x30</div></div></div><div><div><div>Aug  4 01:54:45 ch1prdtick03 kernel: [30601.142984]  [&lt;ffffffff8123ac0f&gt;] ? path_setxattr+0xbf/0xe0</div></div></div><div><div><div>Aug  4 01:54:45 ch1prdtick03 kernel: [30601.142989]  [&lt;ffffffff8121946e&gt;] SyS_newlstat+0xe/0x10</div></div></div><div><div><div>Aug  4 01:54:45 ch1prdtick03 kernel: [30601.142996]  [&lt;ffffffff8184efc8&gt;] entry_SYSCALL_64_fastpath+0x1c/0xbb</div></div></div></blockquote></div></blockquote><div><br></div><div>Which version of ZFS are you using ?</div><div><br></div><div>it seems like a hang inside ZFS.</div><br><div class="gmail_quote">Xavi</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div style="word-wrap:break-word;line-break:after-white-space"><div><div><br></div><div><br></div><div><div><br><blockquote type="cite"><div>On Aug 5, 2018, at 08:39, Atin Mukherjee &lt;<a href="mailto:atin.mukherjee83@gmail.com" target="_blank">atin.mukherjee83@gmail.com</a>&gt; wrote:</div><br class="m_137858721029725509Apple-interchange-newline"><div><div style="font-family:Helvetica;font-size:14px;font-style:normal;font-variant-caps:normal;font-weight:normal;letter-spacing:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px;text-decoration:none"><br class="m_137858721029725509Apple-interchange-newline"><br><div class="gmail_quote"><div dir="ltr">On Sun, 5 Aug 2018 at 13:29, Yuhao Zhang &lt;<a href="mailto:zzyzxd@gmail.com" target="_blank">zzyzxd@gmail.com</a>&gt; wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-style:solid;border-left-color:rgb(204,204,204);padding-left:1ex"><div style="word-wrap:break-word;line-break:after-white-space">Sorry, what I meant was, if I start the transfer now and get glusterd into zombie status,<span class="m_137858721029725509Apple-converted-space"> </span></div></blockquote><div dir="auto"><br></div><div dir="auto">glusterd or glusterfsd?</div><div dir="auto"><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-style:solid;border-left-color:rgb(204,204,204);padding-left:1ex"><div style="word-wrap:break-word;line-break:after-white-space">it&#39;s unlikely that I can fully recover the server without a reboot.</div><div style="word-wrap:break-word;line-break:after-white-space"><br><div><br><blockquote type="cite"><div>On Aug 5, 2018, at 02:55, Raghavendra Gowdappa &lt;<a href="mailto:rgowdapp@redhat.com" target="_blank">rgowdapp@redhat.com</a>&gt; wrote:</div><br class="m_137858721029725509m_-5646314163240277706Apple-interchange-newline"><div><div dir="ltr"><br><div class="gmail_extra"><br><div class="gmail_quote">On Sun, Aug 5, 2018 at 1:22 PM, Yuhao Zhang<span class="m_137858721029725509Apple-converted-space"> </span><span dir="ltr">&lt;<a href="mailto:zzyzxd@gmail.com" target="_blank">zzyzxd@gmail.com</a>&gt;</span><span class="m_137858721029725509Apple-converted-space"> </span>wrote:<br><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-style:solid;border-left-color:rgb(204,204,204);padding-left:1ex"><div style="word-wrap:break-word;line-break:after-white-space">This is a semi-production server and I can&#39;t bring it down right now. Will try to get the monitoring output when I get a chance. </div></blockquote><div><br></div><div>Collecting top output doesn&#39;t require to bring down servers.</div><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-style:solid;border-left-color:rgb(204,204,204);padding-left:1ex"><div style="word-wrap:break-word;line-break:after-white-space"><div><br></div><div>As I recall, the high CPU processes are brick daemons (glusterfsd) and htop showed they were in status D. However, I saw zero zpool IO as clients were all hanging.<div><div class="m_137858721029725509m_-5646314163240277706h5"><br><div><br><blockquote type="cite"><div>On Aug 5, 2018, at 02:38, Raghavendra Gowdappa &lt;<a href="mailto:rgowdapp@redhat.com" target="_blank">rgowdapp@redhat.com</a>&gt; wrote:</div><br class="m_137858721029725509m_-5646314163240277706m_-5304337048872840202Apple-interchange-newline"><div><br class="m_137858721029725509m_-5646314163240277706m_-5304337048872840202Apple-interchange-newline"><br style="font-family:Helvetica;font-size:14px;font-style:normal;font-variant-caps:normal;font-weight:normal;letter-spacing:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px;text-decoration:none"><div class="gmail_quote" style="font-family:Helvetica;font-size:14px;font-style:normal;font-variant-caps:normal;font-weight:normal;letter-spacing:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px;text-decoration:none">On Sun, Aug 5, 2018 at 12:44 PM, Yuhao Zhang<span class="m_137858721029725509m_-5646314163240277706m_-5304337048872840202Apple-converted-space"> </span><span dir="ltr">&lt;<a href="mailto:zzyzxd@gmail.com" target="_blank">zzyzxd@gmail.com</a>&gt;</span><span class="m_137858721029725509m_-5646314163240277706m_-5304337048872840202Apple-converted-space"> </span>wrote:<br><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-style:solid;border-left-color:rgb(204,204,204);padding-left:1ex">Hi,<br><br>I am running into a situation that heavy write causes Gluster server went into zombie with many high CPU processes and all clients hangs, it is almost 100% reproducible on my machine. Hope someone can help.<br></blockquote><div><br></div><div>Can you give us the output of monitioring these processes with High cpu usage captured in the duration when your tests are running?<br></div><div><br></div><div><ul><li><span style="font-family:terminal,monaco,monospace">MON_INTERVAL=10 # can be increased for very long runs</span></li><li><span style="font-family:terminal,monaco,monospace">top -bd $MON_INTERVAL &gt; /tmp/top_proc.${HOSTNAME}.txt # CPU utilization by process</span></li><li><span style="font-family:terminal,monaco,monospace">top -bHd $MON_INTERVAL &gt; /tmp/top_thr.${HOSTNAME}.txt # CPU utilization by thread</span></li></ul><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-style:solid;border-left-color:rgb(204,204,204);padding-left:1ex"><br>I started to observe this issue when running rsync to copy files from another server and I thought it might be because Gluster doesn&#39;t like rsync&#39;s delta transfer with a lot of small writes. However, I was able to reproduce this with &quot;rsync --whole-file --inplace&quot;, or even with cp or scp. It usually appears after starting the transfer for a few hours, but sometimes can happen within several minutes.<br><br>Since this is a single node Gluster distributed volume, I tried to transfer files directly onto the server bypassing Gluster clients, but it still caused the same issue.<br><br>It is running on top of a ZFS RAIDZ2 dataset. Options are attached. Also, I attached the statedump generated when my clients hung, and volume options.<br><br>- Ubuntu 16.04 x86_64 / 4.4.0-116-generic<br>- GlusterFS 3.12.8<br><br>Thank you,<br>Yuhao<br><br><br>_______________________________________________<br>Gluster-users mailing list<br><a href="mailto:Gluster-users@gluster.org" target="_blank">Gluster-users@gluster.org</a><br><a href="https://lists.gluster.org/mailman/listinfo/gluster-users" rel="noreferrer" target="_blank">https://lists.gluster.org/mailman/listinfo/gluster-users</a></blockquote></div></div></blockquote></div><br></div></div></div></div></blockquote></div><br></div></div></div></blockquote></div><br></div>_______________________________________________<br>Gluster-users mailing list<br><a href="mailto:Gluster-users@gluster.org" target="_blank">Gluster-users@gluster.org</a><br><a href="https://lists.gluster.org/mailman/listinfo/gluster-users" rel="noreferrer" target="_blank">https://lists.gluster.org/mailman/listinfo/gluster-users</a></blockquote></div></div><span style="font-family:Helvetica;font-size:14px;font-style:normal;font-variant-caps:normal;font-weight:normal;letter-spacing:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px;text-decoration:none;float:none;display:inline!important">--<span class="m_137858721029725509Apple-converted-space"> </span></span><br style="font-family:Helvetica;font-size:14px;font-style:normal;font-variant-caps:normal;font-weight:normal;letter-spacing:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px;text-decoration:none"><div dir="ltr" class="m_137858721029725509gmail_signature" data-smartmail="gmail_signature" style="font-family:Helvetica;font-size:14px;font-style:normal;font-variant-caps:normal;font-weight:normal;letter-spacing:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px;text-decoration:none">--Atin</div></div></blockquote></div><br></div></div></div>_______________________________________________<br>

Gluster-users mailing list<br>

<a href="mailto:Gluster-users@gluster.org" target="_blank">Gluster-users@gluster.org</a><br>

<a href="https://lists.gluster.org/mailman/listinfo/gluster-users" rel="noreferrer" target="_blank">https://lists.gluster.org/mailman/listinfo/gluster-users</a></blockquote></div></div>