[Gluster-users] write request hung in write-behind
Xie Changlong
zgrep at 139.com
Tue Jun 4 02:06:24 UTC 2019
In my case, all 'df' commands on specific (not all) NFS clients hung forever. The temporary workaround is to disable performance.nfs.write-behind and cluster.eager-lock.
I'll try to gather more information if I hit this problem again.
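For reference, a minimal sketch of that workaround using the standard volume-set commands (the volume name cl35vol01 is taken from the statedump below; substitute your own):

    # disable nfs write-behind and eager locking on the volume
    gluster volume set cl35vol01 performance.nfs.write-behind off
    gluster volume set cl35vol01 cluster.eager-lock off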
From: Raghavendra Gowdappa
Date: 2019/06/04 (Tuesday) 09:55
To: Xie Changlong; Ravishankar Narayanankutty; Karampuri, Pranith;
Cc: gluster-users;
Subject: Re: Re: write request hung in write-behind
On Mon, Jun 3, 2019 at 1:11 PM Xie Changlong <zgrep at 139.com> wrote:
First, let me correct myself: the write request is followed by 771 (not 1545) FLUSH requests. I've attached the gnfs dump file; it shows 774 pending call-stacks in total,
771 of them pending in write-behind, and the deepest call-stack is in afr.
+Ravishankar Narayanankutty +Karampuri, Pranith
Are you sure these were not call-stacks of in-progress ops? One way of confirming that would be to take statedumps periodically (say 3 min apart). Hung call stacks will be common to all the statedumps.
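A minimal sketch of that procedure, assuming the gnfs process is the glusterfs process whose PID appears in the dump filename (20106 here) and that statedumps land in /var/run/gluster by default:

    # take three statedumps, 3 minutes apart
    for i in 1 2 3; do
        gluster volume statedump cl35vol01 nfs    # or: kill -USR1 20106
        sleep 180
    done

Stacks whose stack= address shows up in every dump are the truly hung ones.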
[global.callpool.stack.771]
stack=0x7f517f557f60
uid=0
gid=0
pid=0
unique=0
lk-owner=
op=stack
type=0
cnt=3
[global.callpool.stack.771.frame.1]
frame=0x7f517f655880
ref_count=0
translator=cl35vol01-replicate-7
complete=0
parent=cl35vol01-dht
wind_from=dht_writev
wind_to=subvol->fops->writev
unwind_to=dht_writev_cbk
[global.callpool.stack.771.frame.2]
frame=0x7f518ed90340
ref_count=1
translator=cl35vol01-dht
complete=0
parent=cl35vol01-write-behind
wind_from=wb_fulfill_head
wind_to=FIRST_CHILD (frame->this)->fops->writev
unwind_to=wb_fulfill_cbk
[global.callpool.stack.771.frame.3]
frame=0x7f516d3baf10
ref_count=1
translator=cl35vol01-write-behind
complete=0
[global.callpool.stack.772]
stack=0x7f51607a5a20
uid=0
gid=0
pid=0
unique=0
lk-owner=a0715b77517f0000
op=stack
type=0
cnt=1
[global.callpool.stack.772.frame.1]
frame=0x7f516ca2d1b0
ref_count=0
translator=cl35vol01-replicate-7
complete=0
[root at rhel-201 35]# grep -rn "global.callpool.stack.*.frame.1" -A 5 glusterdump.20106.dump.1559038081 |grep translator | wc -l
774
[root at rhel-201 35]# grep -rn "global.callpool.stack.*.frame.1" -A 5 glusterdump.20106.dump.1559038081 |grep complete |wc -l
774
[root at rhel-201 35]# grep -rn "global.callpool.stack.*.frame.1" -A 5 glusterdump.20106.dump.1559038081 |grep -E "complete=0" |wc -l
774
[root at rhel-201 35]# grep -rn "global.callpool.stack.*.frame.1" -A 5 glusterdump.20106.dump.1559038081 |grep translator | grep write-behind |wc -l
771
[root at rhel-201 35]# grep -rn "global.callpool.stack.*.frame.1" -A 5 glusterdump.20106.dump.1559038081 |grep translator | grep replicate-7 | wc -l
2
[root at rhel-201 35]# grep -rn "global.callpool.stack.*.frame.1" -A 5 glusterdump.20106.dump.1559038081 |grep translator | grep glusterfs | wc -l
1
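For what it's worth, the same per-translator breakdown can be collected in one pass (a sketch against the same dump file):

    # count frame-1 entries per translator
    grep -A 5 "frame.1]" glusterdump.20106.dump.1559038081 | grep "translator=" | sort | uniq -c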
From: Raghavendra Gowdappa
Date: 2019/06/03 (Monday) 14:46
To: Xie Changlong;
Cc: gluster-users;
Subject: Re: write request hung in write-behind
On Mon, Jun 3, 2019 at 11:57 AM Xie Changlong <zgrep at 139.com> wrote:
Hi all
While testing gluster 3.8.4-54.15 gnfs, I saw a write request hung in write-behind, followed by 1545 FLUSH requests. I found a similar
bugfix, https://bugzilla.redhat.com/show_bug.cgi?id=1626787, but I'm not sure it's the right one.
[xlator.performance.write-behind.wb_inode]
path=/575/1e/5751e318f21f605f2aac241bf042e7a8.jpg
inode=0x7f51775b71a0
window_conf=1073741824
window_current=293822
transit-size=293822
dontsync=0
[.WRITE]
request-ptr=0x7f516eec2060
refcount=1
wound=yes
generation-number=1
req->op_ret=293822
req->op_errno=0
sync-attempts=1
sync-in-progress=yes
Note that the sync is still in progress. This means write-behind has wound the write request to its children and is yet to receive the response (unless there is a bug in the accounting of sync-in-progress). So, it's likely that there are call-stacks into the children of write-behind which are not complete yet. Are you sure the deepest hung call-stack is in write-behind? Can you check for frames with "complete=0"?
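One quick way to do that check (illustrative; in a statedump each frame block lists translator= a couple of lines before complete=, so a little grep context is enough):

    # show the translator of every incomplete frame
    grep -B 2 "complete=0" glusterdump.<pid>.dump.<timestamp> | grep "translator="

Incomplete frames in translators below write-behind would mean the hang is deeper in the stack.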
size=293822
offset=1048576
lied=-1
append=0
fulfilled=0
go=-1
[.FLUSH]
request-ptr=0x7f517c2badf0
refcount=1
wound=no
generation-number=2
req->op_ret=-1
req->op_errno=116
sync-attempts=0
[.FLUSH]
request-ptr=0x7f5173e9f7b0
refcount=1
wound=no
generation-number=2
req->op_ret=0
req->op_errno=0
sync-attempts=0
[.FLUSH]
request-ptr=0x7f51640b8ca0
refcount=1
wound=no
generation-number=2
req->op_ret=0
req->op_errno=0
sync-attempts=0
[.FLUSH]
request-ptr=0x7f516f3979d0
refcount=1
wound=no
generation-number=2
req->op_ret=0
req->op_errno=0
sync-attempts=0
[.FLUSH]
request-ptr=0x7f516f6ac8d0
refcount=1
wound=no
generation-number=2
req->op_ret=0
req->op_errno=0
sync-attempts=0
Any comments would be appreciated!
Thanks
-Xie