[Gluster-users] Need some help on Mismatching xdata / Failed combine iatt / Too many fd
Xavier Hernandez
xhernandez at datalab.es
Fri Apr 22 07:12:47 UTC 2016
Some time ago I saw an issue with Gluster-NFS combined with disperse
under high write load. I thought that it was already solved, but this
issue is very similar.
The problem seemed to be related to multithreaded epoll and throttling.
For some reason NFS was sending a massive number of requests, ignoring
the throttling threshold. This caused the NFS connection to become
unresponsive. Combined with a lock held at the time of the hang, that
lock was never released, blocking other clients.
Maybe it's not related to this problem, but I thought it could be
important to consider.
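In case anyone wants to check or adjust those knobs, something like the
following should do it (a sketch only; "mainvol" is the volume name from
the statedumps below, the values are just examples, and the exact option
names and defaults are worth confirming with "gluster volume set help"):

  # epoll worker threads on the brick and client stacks
  gluster volume set mainvol server.event-threads 4
  gluster volume set mainvol client.event-threads 4
  # per-connection RPC throttling for the bricks and for Gluster-NFS
  gluster volume set mainvol server.outstanding-rpc-limit 64
  gluster volume set mainvol nfs.outstanding-rpc-limit 16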
Xavi
On 22/04/16 08:19, Ashish Pandey wrote:
>
> Hi Chen,
>
> I thought I replied to your previous mail.
> This issue has been faced by other users as well. Serkan is one of them,
> if you follow his mail thread on gluster-users.
>
> I still have to dig further into it. Soon we will try to reproduce it
> and debug it.
> My observation is that we face this issue while IO is going on and one
> of the servers gets disconnected and reconnects.
> This can happen because of an update or a network issue.
> But in any case we should not end up in this situation.
>
> I am adding Pranith and Xavi, who can address any unanswered queries and
> provide further explanation.
>
> -----
> Ashish
>
> ------------------------------------------------------------------------
> *From: *"Chen Chen" <chenchen at smartquerier.com>
> *To: *"Joe Julian" <joe at julianfamily.org>, "Ashish Pandey"
> <aspandey at redhat.com>
> *Cc: *"Gluster Users" <gluster-users at gluster.org>
> *Sent: *Friday, April 22, 2016 8:28:48 AM
> *Subject: *Re: [Gluster-users] Need some help on Mismatching xdata /
> Failed combine iatt / Too many fd
>
> Hi Ashish,
>
> Are you still watching this thread? I got no response after I sent the
> info you requested. Also, could anybody explain what features.lock-heal
> is doing?
>
> I got another inode lock yesterday. Only one lock occurred across all
> 12 bricks, yet it stopped the cluster from working again. None of the
> peers' OSes froze, and this time "start force" worked.
>
> ------
> [xlator.features.locks.mainvol-locks.inode]
> path=<gfid:2092ae08-81de-4717-a7d5-6ad955e18b58>/NTD/variants_calling/primary_gvcf/A2612/13.g.vcf
> mandatory=0
> inodelk-count=2
> lock-dump.domain.domain=mainvol-disperse-0
> inodelk.inodelk[0](ACTIVE)=type=WRITE, whence=0, start=0, len=0, pid =
> 1, owner=dc3dbfac887f0000, client=0x7f649835adb0,
> connection-id=hw10-6664-2016/04/17-14:47:58:6629-mainvol-client-0-0,
> granted at 2016-04-21 11:45:30
> inodelk.inodelk[1](BLOCKED)=type=WRITE, whence=0, start=0, len=0, pid =
> 1, owner=d433bfac887f0000, client=0x7f649835adb0,
> connection-id=hw10-6664-2016/04/17-14:47:58:6629-mainvol-client-0-0,
> blocked at 2016-04-21 11:45:33
> ------
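>
> (I also found "gluster volume clear-locks" in the admin guide; would
> something like the following be the right way to drop such a stale lock
> without a full restart? Just a sketch with a placeholder path, untested
> here:
>
> gluster volume clear-locks mainvol <path-to-locked-file> kind granted inode 0,0-0
>
> where the path and range would have to match the entry in the dump above.)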
>
> I've also filed a bug report on bugzilla.
> https://bugzilla.redhat.com/show_bug.cgi?id=1329466
>
> Best regards,
> Chen
>
> On 4/13/2016 10:31 PM, Joe Julian wrote:
> >
> >
> > On 04/13/2016 03:29 AM, Ashish Pandey wrote:
> >> Hi Chen,
> >>
> >> What do you mean by "instantly got an inode lock and tore down
> >> the whole cluster"? Do you mean that the whole disperse volume became
> >> unresponsive?
> >>
> >> I don't know much about features.lock-heal, so I can't comment on how
> >> it could help you.
> >
> > So who should get added to this email that would have an idea? Let's get
> > that person looped in.
> >
> >>
> >> Could you please explain the second part of your mail? What exactly are
> >> you trying to do, and what is the setup?
> >> Also, the volume info, logs, and statedumps might help.
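> >> Roughly the following would collect what we need (just a sketch, with
> >> "mainvol" standing in for your volume name; the dump files land under
> >> /var/run/gluster/ by default, and the brick/nfs logs live under
> >> /var/log/glusterfs/):
> >>
> >> gluster volume info mainvol
> >> gluster volume status mainvol
> >> gluster volume statedump mainvol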
> >>
> >> -----
> >> Ashish
> >>
> >>
> >> ------------------------------------------------------------------------
> >> *From: *"Chen Chen" <chenchen at smartquerier.com>
> >> *To: *"Ashish Pandey" <aspandey at redhat.com>
> >> *Cc: *gluster-users at gluster.org
> >> *Sent: *Wednesday, April 13, 2016 3:26:53 PM
> >> *Subject: *Re: [Gluster-users] Need some help on Mismatching xdata /
> >> Failed combine iatt / Too many fd
> >>
> >> Hi Ashish and other Gluster Users,
> >>
> >> When I put some heavy IO load onto my cluster (an rsync operation,
> >> ~600MB/s), one of the nodes instantly got an inode lock and tore down
> >> the whole cluster. I've already turned on "features.lock-heal" but it
> >> didn't help.
> >>
> >> My clients are using a round-robin scheme to mount the servers, hoping to
> >> spread the load. Could it be caused by a race between the NFS servers
> >> on different nodes? Should I instead set up a dedicated NFS server with
> >> lots of memory, no bricks, and multiple Ethernet links?
> >>
> >> I really appreciate any help from you guys.
> >>
> >> Best wishes,
> >> Chen
> >>
> >> PS. I don't know why the native FUSE client is 5 times slower than
> >> good old NFSv3.
> >>
> >> On 4/4/2016 6:11 PM, Ashish Pandey wrote:
> >> > Hi Chen,
> >> >
> >> > As I suspected, there are many blocked inodelk calls in
> >> > sm11/mnt-disk1-mainvol.31115.dump.1459760675.
> >> >
> >> > =============================================
> >> > [xlator.features.locks.mainvol-locks.inode]
> >> > path=/home/analyzer/softs/bin/GenomeAnalysisTK.jar
> >> > mandatory=0
> >> > inodelk-count=4
> >> > lock-dump.domain.domain=mainvol-disperse-0:self-heal
> >> > lock-dump.domain.domain=mainvol-disperse-0
> >> > inodelk.inodelk[0](ACTIVE)=type=WRITE, whence=0, start=0, len=0, pid = 1, owner=dc2d3dfcc57f0000, client=0x7ff03435d5f0, connection-id=sm12-8063-2016/04/01-07:51:46:892384-mainvol-client-0-0-0, blocked at 2016-04-01 16:52:58, granted at 2016-04-01 16:52:58
> >> > inodelk.inodelk[1](BLOCKED)=type=WRITE, whence=0, start=0, len=0, pid = 1, owner=1414371e1a7f0000, client=0x7ff034204490, connection-id=hw10-17315-2016/04/01-07:51:44:421807-mainvol-client-0-0-0, blocked at 2016-04-01 16:58:51
> >> > inodelk.inodelk[2](BLOCKED)=type=WRITE, whence=0, start=0, len=0, pid = 1, owner=a8eb14cd9b7f0000, client=0x7ff01400dbd0, connection-id=sm14-879-2016/04/01-07:51:56:133106-mainvol-client-0-0-0, blocked at 2016-04-01 17:03:41
> >> > inodelk.inodelk[3](BLOCKED)=type=WRITE, whence=0, start=0, len=0, pid = 1, owner=b41a0482867f0000, client=0x7ff01800e670, connection-id=sm15-30906-2016/04/01-07:51:45:711474-mainvol-client-0-0-0, blocked at 2016-04-01 17:05:09
> >> > =============================================
> >> >
> >> > This could be the cause of the hang.
> >> > Possible workaround -
> >> > If there is no IO going on for this volume, we can restart the
> >> > volume using "gluster v start <volume-name> force". This will also
> >> > restart the NFS process, which will release the locks, and
> >> > we can come out of this issue.
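> >> >
> >> > For example, assuming the volume is called mainvol (a sketch; take a
> >> > statedump first to confirm the BLOCKED inodelk entries are still there):
> >> >
> >> > gluster volume statedump mainvol
> >> > gluster volume start mainvol force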
> >> >
> >> > Ashish
> >>
> >> --
> >> Chen Chen
> >> Shanghai SmartQuerier Biotechnology Co., Ltd.
> >> Add: 3F, 1278 Keyuan Road, Shanghai 201203, P. R. China
> >> Mob: +86 15221885893
> >> Email: chenchen at smartquerier.com
> >> Web: www.smartquerier.com
> >>
> >>
> >
>
> --
> Chen Chen
> Shanghai SmartQuerier Biotechnology Co., Ltd.
> Add: 3F, 1278 Keyuan Road, Shanghai 201203, P. R. China
> Mob: +86 15221885893
> Email: chenchen at smartquerier.com
> Web: www.smartquerier.com
>
>
> _______________________________________________
> Gluster-users mailing list
> Gluster-users at gluster.org
> http://www.gluster.org/mailman/listinfo/gluster-users
>