<div dir="ltr"><pre><font color="#000000"><span style="white-space:pre-wrap">Hi,

Per the attached glusterdump/stackdump, this appears to be a known issue (<a href="https://bugzilla.redhat.com/show_bug.cgi?id=1372211">https://bugzilla.redhat.com/show_bug.cgi?id=1372211</a>) that has already been fixed by the patch (<a href="https://review.gluster.org/#/c/15380/">https://review.gluster.org/#/c/15380/</a>).

The issue occurs in the following case:

Assume a file is opened with fd1 and fd2.

1. Some WRITE ops to fd1 got errors and were added back to the &#39;todo&#39; queue

   because of those errors.

2. fd2 is closed, so a FLUSH op is sent to write-behind.

3. The FLUSH cannot be unwound because it is not a legal waiter for those

   failed WRITEs (as the function __wb_request_waiting_on() says), and those failed

   WRITEs cannot be ended while fd1 is not closed. fd2 is stuck in the close

   syscall.
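
The three steps above can be sketched as a toy model. This is not the write-behind code: the Request class, the blockers() helper and the rule that a failed WRITE is only retired by a FLUSH on its own fd are assumptions made for illustration, based only on the description above.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    op: str               # "WRITE" or "FLUSH"
    fd: str               # file descriptor that issued the request
    failed: bool = False  # True for a WRITE that errored and was re-queued


def blockers(todo, flush):
    """Return the queued requests that keep `flush` from unwinding (toy rule)."""
    stuck = []
    for req in todo:
        if req is flush:
            break  # only requests queued ahead of the FLUSH matter
        # Assumption: a failed WRITE is retired only by a FLUSH on its own fd,
        # so a FLUSH coming from a different fd stays blocked behind it.
        if req.op == "WRITE" and req.failed and req.fd != flush.fd:
            stuck.append(req)
    return stuck


# Step 1: two WRITEs on fd1 failed and went back to the 'todo' queue.
todo = deque([
    Request("WRITE", "fd1", failed=True),
    Request("WRITE", "fd1", failed=True),
])

# Step 2: fd2 is closed, which queues a FLUSH on fd2.
flush_fd2 = Request("FLUSH", "fd2")
todo.append(flush_fd2)

# Step 3: the FLUSH on fd2 is stuck behind the failed WRITEs of fd1.
print(len(blockers(todo, flush_fd2)))  # 2

# A FLUSH on fd1 itself would not be blocked in this model.
flush_fd1 = Request("FLUSH", "fd1")
todo.append(flush_fd1)
print(len(blockers(todo, flush_fd1)))  # 0
```

In this model the close() on fd2 only returns once fd1 is closed as well, which matches the hang described in step 3.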

The statedump also shows that the fd of the FLUSH op is not the same as the fd of the WRITE ops.

Kindly upgrade the package to 3.10.1 and share the result. 

Thanks

Mohit Agrawal </pre><pre style="white-space:pre-wrap;color:rgb(0,0,0)"> </pre><pre style="white-space:pre-wrap;color:rgb(0,0,0)">On Fri, Mar 31, 2017 at 12:29 PM, Amar Tumballi &lt;<a href="http://lists.gluster.org/mailman/listinfo/gluster-users">atumball at redhat.com</a>&gt; wrote:

&gt;<i> Hi Alvin,

</i>&gt;<i>

&gt; Thanks for the dump output. It helped a bit.

</i>&gt;<i>

</i>&gt;<i> For now, I recommend turning off the open-behind and read-ahead performance

</i>&gt;<i> translators to get rid of this situation, as I noticed hung FLUSH

&gt; operations from these translators.

</i>&gt;<i>

</i>

Looks like I gave the wrong advice by looking at the snippet below:

[global.callpool.stack.61]

&gt;<i> stack=0x7f6c6f628f04

</i>&gt;<i> uid=48

</i>&gt;<i> gid=48

</i>&gt;<i> pid=11077

</i>&gt;<i> unique=10048797

</i>&gt;<i> lk-owner=a73ae5bdb5fcd0d2

</i>&gt;<i> op=FLUSH

</i>&gt;<i> type=1

</i>&gt;<i> cnt=5

</i>&gt;<i>

</i>&gt;<i> [global.callpool.stack.61.frame.1]

</i>&gt;<i> frame=0x7f6c6f793d88

</i>&gt;<i> ref_count=0

</i>&gt;<i> translator=edocs-production-write-behind

</i>&gt;<i> complete=0

</i>&gt;<i> parent=edocs-production-read-ahead

</i>&gt;<i> wind_from=ra_flush

</i>&gt;<i> wind_to=FIRST_CHILD (this)-&gt;fops-&gt;flush

</i>&gt;<i> unwind_to=ra_flush_cbk

</i>&gt;<i>

</i>&gt;<i> [global.callpool.stack.61.frame.2]

</i>&gt;<i> frame=0x7f6c6f796c90

</i>&gt;<i> ref_count=1

</i>&gt;<i> translator=edocs-production-read-ahead

</i>&gt;<i> complete=0

</i>&gt;<i> parent=edocs-production-open-behind

</i>&gt;<i> wind_from=default_flush_resume

</i>&gt;<i> wind_to=FIRST_CHILD(this)-&gt;fops-&gt;flush

</i>&gt;<i> unwind_to=default_flush_cbk

</i>&gt;<i>

</i>&gt;<i> [global.callpool.stack.61.frame.3]

</i>&gt;<i> frame=0x7f6c6f79b724

</i>&gt;<i> ref_count=1

</i>&gt;<i> translator=edocs-production-open-behind

</i>&gt;<i> complete=0

</i>&gt;<i> parent=edocs-production

</i>&gt;<i> wind_from=io_stats_flush

</i>&gt;<i> wind_to=FIRST_CHILD(this)-&gt;fops-&gt;flush

</i>&gt;<i> unwind_to=io_stats_flush_cbk

</i>&gt;<i>

</i>&gt;<i> [global.callpool.stack.61.frame.4]

</i>&gt;<i> frame=0x7f6c6f79b474

</i>&gt;<i> ref_count=1

</i>&gt;<i> translator=edocs-production

</i>&gt;<i> complete=0

</i>&gt;<i> parent=fuse

</i>&gt;<i> wind_from=fuse_flush_resume

</i>&gt;<i> wind_to=FIRST_CHILD(this)-&gt;fops-&gt;flush

</i>&gt;<i> unwind_to=fuse_err_cbk

</i>&gt;<i>

</i>&gt;<i> [global.callpool.stack.61.frame.5]

</i>&gt;<i> frame=0x7f6c6f796684

</i>&gt;<i> ref_count=1

</i>&gt;<i> translator=fuse

</i>&gt;<i> complete=0

</i>&gt;<i>

</i>
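
To read a snippet like the one above mechanically, a small parser can report, for each call stack, the op and the translator where it is currently pending. This is a rough sketch, not a Gluster tool: the function names are made up, it expects the raw statedump text without the mail quoting, and it assumes (as in the snippet above) that frame.1 is the most recently wound frame.

```python
import re
from collections import defaultdict

# Matches "[global.callpool.stack.N]" and "[global.callpool.stack.N.frame.M]".
SECTION = re.compile(r"^\[global\.callpool\.stack\.(\d+)(?:\.frame\.(\d+))?\]$")


def parse_callpool(text):
    """Collect key=value pairs per stack and per frame."""
    stacks = defaultdict(lambda: {"fields": {}, "frames": {}})
    target = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("["):
            m = SECTION.match(line)
            if m is None:
                target = None  # some other statedump section, ignore it
            elif m.group(2) is None:
                target = stacks[m.group(1)]["fields"]
            else:
                frames = stacks[m.group(1)]["frames"]
                target = frames.setdefault(int(m.group(2)), {})
            continue
        if target is not None and "=" in line:
            key, _, value = line.partition("=")
            target[key] = value
    return stacks


def innermost_pending(stacks):
    """Map stack id to (op, translator of the lowest-numbered incomplete frame)."""
    result = {}
    for stack_id, stack in stacks.items():
        pending = [n for n, f in stack["frames"].items() if f.get("complete") == "0"]
        if pending:
            frame = stack["frames"][min(pending)]
            result[stack_id] = (stack["fields"].get("op"), frame.get("translator"))
    return result


SAMPLE = """\
[global.callpool.stack.61]
op=FLUSH
[global.callpool.stack.61.frame.1]
translator=edocs-production-write-behind
complete=0
parent=edocs-production-read-ahead
[global.callpool.stack.61.frame.2]
translator=edocs-production-read-ahead
complete=0
"""

print(innermost_pending(parse_callpool(SAMPLE)))
# {'61': ('FLUSH', 'edocs-production-write-behind')}
```

For stack 61 this points at write-behind rather than open-behind or read-ahead, which only appear as parents further up the stack.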

Most probably, the issue is with write-behind&#39;s flush. So please turn off

write-behind and test. If you don&#39;t have any hung httpd processes, please

let us know.
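
For reference, write-behind is switched off with the volume option performance.write-behind. A minimal sketch of scripting it follows. The helper names are hypothetical, and the volume name edocs-production is only inferred from the translator names in the dump, so substitute the real volume name.

```python
import subprocess


def write_behind_off_cmd(volume):
    """Build the argv that disables the write-behind translator on `volume`."""
    return ["gluster", "volume", "set", volume, "performance.write-behind", "off"]


def disable_write_behind(volume, dry_run=True):
    """Print the command by default; run it only when dry_run is False."""
    cmd = write_behind_off_cmd(volume)
    if dry_run:
        print(" ".join(cmd))
        return cmd
    # Needs the gluster CLI on a server node and sufficient privileges.
    subprocess.run(cmd, check=True)
    return cmd


# "edocs-production" is an assumed volume name, see the note above.
disable_write_behind("edocs-production")
```

The dry run prints the command so it can be reviewed before it is run on one of the servers.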

-Amar

&gt;<i> -Amar

</i>&gt;<i>

</i>&gt;<i> On Wed, Mar 29, 2017 at 6:56 AM, Alvin Starr &lt;<a href="http://lists.gluster.org/mailman/listinfo/gluster-users">alvin at netvel.net</a>&gt; wrote:

</i>&gt;<i>

</i>&gt;&gt;<i> We are running gluster 3.8.9-1 on Centos 7.3.1611 for the servers and on

</i>&gt;&gt;<i> the clients 3.7.11-2 on Centos 6.8

</i>&gt;&gt;<i>

&gt;&gt; We are seeing httpd processes hang in fuse_request_send or sync_page.

</i>&gt;&gt;<i>

</i>&gt;&gt;<i> These calls are from PHP 5.3.3-48 scripts

</i>&gt;&gt;<i>

</i>&gt;&gt;<i> I am attaching  a tgz file that contains the process dump from glusterfsd

</i>&gt;&gt;<i> and the hung pids along with the offending pid&#39;s stacks from

</i>&gt;&gt;<i> /proc/{pid}/stack.

</i>&gt;&gt;<i>

</i>&gt;&gt;<i> This has been a low level annoyance for a while but it has become a much

</i>&gt;&gt;<i> bigger issue because the number of hung processes went from a few a week to

&gt;&gt; a few hundred a day.

</i>&gt;&gt;<i>

</i>&gt;&gt;<i>

</i>&gt;&gt;<i> --

</i>&gt;&gt;<i> Alvin Starr                   ||   voice: (905)513-7688

</i>&gt;&gt;<i> Netvel Inc.                   ||   Cell:  (416)806-0133

</i>&gt;&gt;<i> <a href="http://lists.gluster.org/mailman/listinfo/gluster-users">alvin at netvel.net</a>              ||

</i>&gt;&gt;<i>

</i>&gt;&gt;<i>

</i>&gt;&gt;<i> _______________________________________________

</i>&gt;&gt;<i> Gluster-users mailing list

</i>&gt;&gt;<i> <a href="http://lists.gluster.org/mailman/listinfo/gluster-users">Gluster-users at gluster.org</a>

</i>&gt;&gt;<i> <a href="http://lists.gluster.org/mailman/listinfo/gluster-users">http://lists.gluster.org/mailman/listinfo/gluster-users</a>

</i>&gt;&gt;<i>

</i>&gt;<i>

</i>&gt;<i>

</i>&gt;<i>

</i>&gt;<i> --

</i>&gt;<i> Amar Tumballi (amarts)

</i>&gt;<i>

</i>

</pre></div>