<html><head><meta http-equiv="Content-Type" content="text/html; charset=us-ascii"></head><body style="word-wrap: break-word; -webkit-nbsp-mode: space; line-break: after-white-space;" class="">Sorry, what I meant was, if I start the transfer now and glusterd goes into zombie status, it's unlikely that I can fully recover the server without a reboot.<br class=""><div><br class=""><blockquote type="cite" class=""><div class="">On Aug 5, 2018, at 02:55, Raghavendra Gowdappa <<a href="mailto:rgowdapp@redhat.com" class="">rgowdapp@redhat.com</a>> wrote:</div><br class="Apple-interchange-newline"><div class=""><div dir="ltr" class=""><br class=""><div class="gmail_extra"><br class=""><div class="gmail_quote">On Sun, Aug 5, 2018 at 1:22 PM, Yuhao Zhang <span dir="ltr" class=""><<a href="mailto:zzyzxd@gmail.com" target="_blank" class="">zzyzxd@gmail.com</a>></span> wrote:<br class=""><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div style="word-wrap:break-word;line-break:after-white-space" class="">This is a semi-production server and I can't bring it down right now. Will try to get the monitoring output when I get a chance. </div></blockquote><div class=""><br class=""></div><div class="">Collecting top output doesn't require bringing down the servers.</div><div class=""> <br class=""></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div style="word-wrap:break-word;line-break:after-white-space" class=""><div class=""><br class=""></div><div class="">As I recall, the high-CPU processes were brick daemons (glusterfsd), and htop showed they were in status D. 
However, I saw zero zpool IO as clients were all hanging.<div class=""><div class="h5"><br class=""><div class=""><br class=""><blockquote type="cite" class=""><div class="">On Aug 5, 2018, at 02:38, Raghavendra Gowdappa <<a href="mailto:rgowdapp@redhat.com" target="_blank" class="">rgowdapp@redhat.com</a>> wrote:</div><br class="m_-5304337048872840202Apple-interchange-newline"><div class=""><br class="m_-5304337048872840202Apple-interchange-newline"><br style="font-family:Helvetica;font-size:14px;font-style:normal;font-variant-caps:normal;font-weight:normal;letter-spacing:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px;text-decoration:none" class=""><div class="gmail_quote" style="font-family:Helvetica;font-size:14px;font-style:normal;font-variant-caps:normal;font-weight:normal;letter-spacing:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px;text-decoration:none">On Sun, Aug 5, 2018 at 12:44 PM, Yuhao Zhang<span class="m_-5304337048872840202Apple-converted-space"> </span><span dir="ltr" class=""><<a href="mailto:zzyzxd@gmail.com" target="_blank" class="">zzyzxd@gmail.com</a>></span><span class="m_-5304337048872840202Apple-converted-space"> </span>wrote:<br class=""><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-style:solid;border-left-color:rgb(204,204,204);padding-left:1ex">Hi,<br class=""><br class="">I am running into a situation where heavy writes drive the Gluster server into a zombie state with many high-CPU processes and all clients hang; it is almost 100% reproducible on my machine. 
Hope someone can help.<br class=""></blockquote><div class=""><br class=""></div><div class="">Can you give us the output of monitoring the processes with high CPU usage, captured while your tests are running?<br class=""></div><div class=""><br class=""></div><div class=""><ul class=""><li class=""><span style="font-family:terminal,monaco,monospace" class="">MON_INTERVAL=10 # can be increased for very long runs</span></li><li class=""><span style="font-family:terminal,monaco,monospace" class="">top -bd $MON_INTERVAL > /tmp/top_proc.${HOSTNAME}.txt # CPU utilization by process</span></li><li class=""><span style="font-family:terminal,monaco,monospace" class="">top -bHd $MON_INTERVAL > /tmp/top_thr.${HOSTNAME}.txt # CPU utilization by thread</span></li></ul><br class=""></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-style:solid;border-left-color:rgb(204,204,204);padding-left:1ex"><br class="">I started to observe this issue when running rsync to copy files from another server, and I thought it might be because Gluster doesn't like rsync's delta transfer with its many small writes. However, I was able to reproduce this with "rsync --whole-file --inplace", or even with cp or scp. It usually appears a few hours after starting the transfer, but sometimes it happens within several minutes.<br class=""><br class="">Since this is a single-node Gluster distributed volume, I tried transferring files directly onto the server, bypassing the Gluster clients, but that still triggered the same issue.<br class=""><br class="">It is running on top of a ZFS RAIDZ2 dataset. Options are attached. 
Also, I attached the statedump generated when my clients hung, and the volume options.<br class=""><br class="">- Ubuntu 16.04 x86_64 / 4.4.0-116-generic<br class="">- GlusterFS 3.12.8<br class=""><br class="">Thank you,<br class="">Yuhao<br class=""><br class=""><br class="">_______________________________________________<br class="">Gluster-users mailing list<br class=""><a href="mailto:Gluster-users@gluster.org" target="_blank" class="">Gluster-users@gluster.org</a><br class=""><a href="https://lists.gluster.org/mailman/listinfo/gluster-users" rel="noreferrer" target="_blank" class="">https://lists.gluster.org/mailman/listinfo/gluster-users</a></blockquote></div></div></blockquote></div><br class=""></div></div>
</div></blockquote></div><br class=""></body></html>