[Gluster-users] Need some help on Mismatching xdata / Failed combine iatt / Too many fd

Xavier Hernandez xhernandez at datalab.es
Fri Apr 22 07:12:47 UTC 2016


Some time ago I saw an issue with Gluster-NFS combined with disperse 
under high write load. I thought that it was already solved, but this 
issue is very similar.

The problem seemed to be related to multithreaded epoll and throttling. 
For some reason NFS was sending a massive number of requests, ignoring 
the throttling threshold. This made the NFS connection unresponsive, 
and since a lock was held at the time of the hang, it was never 
released, blocking other clients.
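
If it is the same thing, the knobs to look at are the epoll thread 
counts and the server-side RPC throttling limit. A rough sketch of how 
one might check and tune them, assuming the volume is the "mainvol" 
from the statedumps quoted below and that the option names match your 
release (please verify with "gluster volume set help"):

# list the relevant tunables and their current defaults
gluster volume set help | grep -A2 -E 'event-threads|outstanding-rpc-limit'

# lower the epoll thread count and tighten the per-client RPC limit
gluster volume set mainvol server.event-threads 2
gluster volume set mainvol client.event-threads 2
gluster volume set mainvol server.outstanding-rpc-limit 64

This is only meant to show where the throttling threshold lives; 
whether changing it actually helps here would need testing.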

Maybe it's not related to this problem, but I thought it could be 
important to consider.

Xavi

On 22/04/16 08:19, Ashish Pandey wrote:
>
> Hi Chen,
>
> I thought I had replied to your previous mail.
> Other users have faced this issue as well; Serkan, for example, if you
> follow his mails on gluster-users.
>
> I still have to dig further into it. We will soon try to reproduce and
> debug it.
> My observation is that we hit this issue while IO is going on and one
> of the servers gets disconnected and reconnects.
> This can happen because of an update or a network issue,
> but either way we should not end up in this situation.
>
> I am adding Pranith and Xavi, who can address any unanswered queries and
> provide further explanation.
>
> -----
> Ashish
>
> ------------------------------------------------------------------------
> *From: *"Chen Chen" <chenchen at smartquerier.com>
> *To: *"Joe Julian" <joe at julianfamily.org>, "Ashish Pandey"
> <aspandey at redhat.com>
> *Cc: *"Gluster Users" <gluster-users at gluster.org>
> *Sent: *Friday, April 22, 2016 8:28:48 AM
> *Subject: *Re: [Gluster-users] Need some help on Mismatching xdata /
> Failed combine iatt / Too many fd
>
> Hi Ashish,
>
> Are you still watching this thread? I got no response after I sent the
> info you requested. Also, could anybody explain what features.lock-heal
> is actually doing?
>
> I got another inode lock yesterday. Only one lock occurred across the
> whole 12 bricks, yet it stopped the cluster from working again. None of
> my peers' OSes froze, and this time "start force" worked.
>
> ------
> [xlator.features.locks.mainvol-locks.inode]
> path=<gfid:2092ae08-81de-4717-a7d5-6ad955e18b58>/NTD/variants_calling/primary_gvcf/A2612/13.g.vcf
> mandatory=0
> inodelk-count=2
> lock-dump.domain.domain=mainvol-disperse-0
> inodelk.inodelk[0](ACTIVE)=type=WRITE, whence=0, start=0, len=0, pid = 1, owner=dc3dbfac887f0000, client=0x7f649835adb0, connection-id=hw10-6664-2016/04/17-14:47:58:6629-mainvol-client-0-0, granted at 2016-04-21 11:45:30
> inodelk.inodelk[1](BLOCKED)=type=WRITE, whence=0, start=0, len=0, pid = 1, owner=d433bfac887f0000, client=0x7f649835adb0, connection-id=hw10-6664-2016/04/17-14:47:58:6629-mainvol-client-0-0, blocked at 2016-04-21 11:45:33
> ------
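>
> For completeness, the commands behind the above were just the standard
> CLI, something like:
>
> # dump the brick lock tables (files land under /var/run/gluster by default)
> gluster volume statedump mainvol
>
> # the "start force" that released the lock this time
> gluster volume start mainvol force
>
> Out of curiosity, could "gluster volume clear-locks" have released only
> the stale granted lock instead of a full force start? Something like the
> line below (the path is a placeholder, since the dump only shows the
> parent gfid):
>
> gluster volume clear-locks mainvol /path/to/13.g.vcf kind granted inode 0,0-0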
>
> I've also filed a bug report on bugzilla.
> https://bugzilla.redhat.com/show_bug.cgi?id=1329466
>
> Best regards,
> Chen
>
> On 4/13/2016 10:31 PM, Joe Julian wrote:
>  >
>  >
>  > On 04/13/2016 03:29 AM, Ashish Pandey wrote:
>  >> Hi Chen,
>  >>
>  >> What do you mean by "one of the nodes instantly got an inode lock and
>  >> tore down the whole cluster"? Do you mean that the whole disperse
>  >> volume became unresponsive?
>  >>
>  >> I don't have much idea about features.lock-heal, so I can't comment on
>  >> how it can help you.
>  >
>  > So who should get added to this email that would have an idea? Let's get
>  > that person looped in.
>  >
>  >>
>  >> Could you please explain the second part of your mail? What exactly
>  >> are you trying to do, and what is the setup?
>  >> Also, volume info, logs, and statedumps might help.
>  >>
>  >> -----
>  >> Ashish
>  >>
>  >>
>  >> ------------------------------------------------------------------------
>  >> *From: *"Chen Chen" <chenchen at smartquerier.com>
>  >> *To: *"Ashish Pandey" <aspandey at redhat.com>
>  >> *Cc: *gluster-users at gluster.org
>  >> *Sent: *Wednesday, April 13, 2016 3:26:53 PM
>  >> *Subject: *Re: [Gluster-users] Need some help on Mismatching xdata /
>  >> Failed combine iatt / Too many fd
>  >>
>  >> Hi Ashish and other Gluster Users,
>  >>
>  >> When I put some heavy IO load onto my cluster (an rsync operation,
>  >> ~600MB/s), one of the nodes instantly got an inode lock and tore down
>  >> the whole cluster. I've already turned on "features.lock-heal" but it
>  >> didn't help.
>  >>
>  >> My clients are using a round-robin tactic to mount the servers, hoping
>  >> to spread the load. Could it be caused by a race between the NFS
>  >> servers on different nodes? Should I instead create a dedicated NFS
>  >> server with lots of memory, no bricks, and multiple Ethernet cables?
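>  >>
>  >> Concretely, the mounts look roughly like this (hostnames and mount
>  >> options here are placeholders, not the exact production ones), with
>  >> different clients pointed at different nodes:
>  >>
>  >> # client A
>  >> mount -t nfs -o vers=3,nolock node1:/mainvol /mnt/mainvol
>  >> # client B
>  >> mount -t nfs -o vers=3,nolock node2:/mainvol /mnt/mainvol
>  >>
>  >> For comparison, a native FUSE mount with fallback servers would look
>  >> roughly like:
>  >>
>  >> mount -t glusterfs -o backup-volfile-servers=node2:node3 node1:/mainvol /mnt/mainvol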
>  >>
>  >> I really appreciate any help from you guys.
>  >>
>  >> Best wishes,
>  >> Chen
>  >>
>  >> PS. I don't know why the native FUSE client is 5 times slower than
>  >> good old NFSv3.
>  >>
>  >> On 4/4/2016 6:11 PM, Ashish Pandey wrote:
>  >> > Hi Chen,
>  >> >
>  >> > As I suspected, there are many blocked calls for inodelk in
>  >> > sm11/mnt-disk1-mainvol.31115.dump.1459760675.
>  >> >
>  >> > =============================================
>  >> > [xlator.features.locks.mainvol-locks.inode]
>  >> > path=/home/analyzer/softs/bin/GenomeAnalysisTK.jar
>  >> > mandatory=0
>  >> > inodelk-count=4
>  >> > lock-dump.domain.domain=mainvol-disperse-0:self-heal
>  >> > lock-dump.domain.domain=mainvol-disperse-0
>  >> > inodelk.inodelk[0](ACTIVE)=type=WRITE, whence=0, start=0, len=0, pid = 1, owner=dc2d3dfcc57f0000, client=0x7ff03435d5f0, connection-id=sm12-8063-2016/04/01-07:51:46:892384-mainvol-client-0-0-0, blocked at 2016-04-01 16:52:58, granted at 2016-04-01 16:52:58
>  >> > inodelk.inodelk[1](BLOCKED)=type=WRITE, whence=0, start=0, len=0, pid = 1, owner=1414371e1a7f0000, client=0x7ff034204490, connection-id=hw10-17315-2016/04/01-07:51:44:421807-mainvol-client-0-0-0, blocked at 2016-04-01 16:58:51
>  >> > inodelk.inodelk[2](BLOCKED)=type=WRITE, whence=0, start=0, len=0, pid = 1, owner=a8eb14cd9b7f0000, client=0x7ff01400dbd0, connection-id=sm14-879-2016/04/01-07:51:56:133106-mainvol-client-0-0-0, blocked at 2016-04-01 17:03:41
>  >> > inodelk.inodelk[3](BLOCKED)=type=WRITE, whence=0, start=0, len=0, pid = 1, owner=b41a0482867f0000, client=0x7ff01800e670, connection-id=sm15-30906-2016/04/01-07:51:45:711474-mainvol-client-0-0-0, blocked at 2016-04-01 17:05:09
>  >> > =============================================
>  >> >
>  >> > This could be the cause of the hang.
>  >> > Possible workaround -
>  >> > If there is no IO going on for this volume, we can restart the
>  >> > volume using "gluster v start <volume-name> force". This will restart
>  >> > the NFS process too, which will release the locks, and
>  >> > we could come out of this issue.
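>  >> >
>  >> > A minimal sketch of the steps, assuming the default statedump
>  >> > location and the volume name from the dump above (adjust the glob to
>  >> > your brick paths):
>  >> >
>  >> > # take fresh statedumps of the bricks (written to /var/run/gluster)
>  >> > gluster volume statedump mainvol
>  >> >
>  >> > # look for lock requests stuck behind a long-held granted lock
>  >> > grep -B5 BLOCKED /var/run/gluster/*mainvol*.dump.*
>  >> >
>  >> > # if no important IO is running, force-start the volume to restart
>  >> > # the NFS process (as described above) and drop the stale locks
>  >> > gluster volume start mainvol force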
>  >> >
>  >> > Ashish
>  >>
>  >> --
>  >> Chen Chen
>  >> Shanghai SmartQuerier Biotechnology Co., Ltd.
>  >> Add: 3F, 1278 Keyuan Road, Shanghai 201203, P. R. China
>  >> Mob: +86 15221885893
>  >> Email: chenchen at smartquerier.com
>  >> Web: www.smartquerier.com
>  >>
>  >>
>  >> _______________________________________________
>  >> Gluster-users mailing list
>  >> Gluster-users at gluster.org
>  >> http://www.gluster.org/mailman/listinfo/gluster-users
>  >>
>  >
>
> --
> Chen Chen
> Shanghai SmartQuerier Biotechnology Co., Ltd.
> Add: 3F, 1278 Keyuan Road, Shanghai 201203, P. R. China
> Mob: +86 15221885893
> Email: chenchen at smartquerier.com
> Web: www.smartquerier.com
>
>
> _______________________________________________
> Gluster-users mailing list
> Gluster-users at gluster.org
> http://www.gluster.org/mailman/listinfo/gluster-users
>

