[Gluster-users] Self healing of 3.3.0 cause our 2 bricks replicated cluster freeze (client read/write timeout)

Gerald Brandt gbr at majentis.com
Thu Nov 29 12:26:46 UTC 2012


How about an option to throttle/limit the self heal speed?  DRBD has a speed limit, which very effectively cuts down on the resources needed.
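For comparison, DRBD's cap is a one-line setting, and Gluster 3.3 does expose a few volume options that can soften (though not strictly rate-limit) self-heal pressure. The values below are illustrative, not tuned recommendations:

```shell
# DRBD's resync rate cap (drbd.conf, 8.3-style syntax) -- the kind of
# knob being asked for here:
#   syncer { rate 40M; }   # limit background resync to ~40 MB/s

# Gluster 3.3 has no direct bandwidth limit, but these volume options
# (values here are illustrative, untuned) reduce self-heal concurrency
# and per-file chunking on the volume from this thread:
gluster volume set staticvol cluster.background-self-heal-count 4
gluster volume set staticvol cluster.self-heal-window-size 2
gluster volume set staticvol cluster.data-self-heal-algorithm diff
```

Whether these settings help depends on where the bottleneck actually is (disk, network, or CPU), which is the point Joe makes below.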

That being said, I have not had a problem with self heal on my VM images.  Just two days ago, I deleted all images from one brick and let the self heal put everything back, rebuilding the entire brick while the VMs were running, during business hours (a disk failure forced me to do it).

Gerald

----- Original Message -----
> From: "Joe Julian" <joe at julianfamily.org>
> To: gluster-users at gluster.org
> Sent: Thursday, November 29, 2012 12:37:37 AM
> Subject: Re: [Gluster-users] Self healing of 3.3.0 cause our 2 bricks replicated cluster freeze (client read/write
> timeout)
> 
> Ok, listen up everybody. What you're experiencing is not that self
> heal is a blocking operation. You're running out of bandwidth,
> processor, or bus... Whatever it is, it's not that.
> 
> That was fixed in commit 1af420c700fbc49b65cf7faceb3270e81cd991ce.
> 
> So please, get it out of your head that this is just that the feature
> was never added. It was. It's been tested successfully by many admins
> on
> many different systems.
> 
> Once it's out of your head that it's a missing feature, PLEASE try to
> figure out why YOUR system is showing the behavior that you're
> experiencing. I can't do it. It's not failing for me. Then file a bug
> report explaining it so these very intelligent guys can figure out a
> solution. I've seen how that works. When Avati sees a problem, he'll
> be
> sitting on the floor in a hallway because it has WiFi and an outlet
> and
> he won't even notice that everyone else has gone to lunch, come back,
> gone to several panels, come back again, and that the expo hall is
> starting to clear out because the place is closing. He's focused and
> dedicated. All these guys are very talented and understand this stuff
> better than I ever can. They will fix the bug if it can be
> identified.
> 
> The first step is finding the actual problem instead of pointing to
> something that you're just guessing isn't there.
> 
> On 11/28/2012 09:24 PM, ZHANG Cheng wrote:
> > I dug out a gluster-users mailing-list thread from June 2011 at
> > http://gluster.org/pipermail/gluster-users/2011-June/008111.html.
> >
> > In this post, Marco Agostini said:
> > ==================================================
> > Craig Carl told me, three days ago:
> > ------------------------------------------------------
> >   that happens because Gluster's self heal is a blocking operation.
> >   We
> > are working on a non-blocking self heal, we are hoping to ship it
> > in
> > early September.
> > ------------------------------------------------------
> > ==================================================
> >
> > Looks like even with the release of 3.3.1, self heal is still a
> > blocking operation. I am wondering why the official Administration
> > Guide doesn't warn the reader about such an important thing for
> > production operation.
> >
> >
> > On Mon, Nov 26, 2012 at 5:46 PM, ZHANG Cheng <czhang.oss at gmail.com>
> > wrote:
> >> Early this morning our 2-brick replicated cluster had an outage.
> >> The disk space on one of the brick servers (brick02) was used up.
> >> By the time we responded to the disk-full alert, the issue had
> >> already lasted a few hours. We reclaimed some disk space and
> >> rebooted the brick02 server, expecting that once it came back it
> >> would start self-healing.
> >>
> >> It did start self-healing, but after just a couple of minutes,
> >> access to the gluster filesystem froze. Tons of "nfs: server brick
> >> not responding, still trying" messages popped up in dmesg. The load
> >> average on the app servers went up to around 200 from the usual
> >> 0.10. We had to shut down the brick02 server, or stop the gluster
> >> server process on it, to get the cluster working again.
> >>
> >> How should we deal with this issue? Thanks in advance.
> >>
> >> Our gluster setup follows the official doc.
> >>
> >> gluster> volume info
> >>
> >> Volume Name: staticvol
> >> Type: Replicate
> >> Volume ID: fdcbf635-5faf-45d6-ab4e-be97c74d7715
> >> Status: Started
> >> Number of Bricks: 1 x 2 = 2
> >> Transport-type: tcp
> >> Bricks:
> >> Brick1: brick01:/exports/static
> >> Brick2: brick02:/exports/static
> >>
> >> The underlying filesystem is xfs (on an LVM volume), mounted as:
> >> /dev/mapper/vg_node-brick on /exports/static type xfs
> >> (rw,noatime,nodiratime,nobarrier,logbufs=8)
> >>
> >> The brick servers don't act as gluster clients.
> >>
> >> Our app servers are the gluster clients, mounting the volume via NFS:
> >> brick:/staticvol on /mnt/gfs-static type nfs
> >> (rw,noatime,nodiratime,vers=3,rsize=8192,wsize=8192,addr=10.10.10.51)
> >>
> >> brick is a DNS round-robin record for brick01 and brick02.
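
For a setup like the one described above, the 3.3 CLI offers a few non-destructive commands that could narrow down where the problem is during a heal storm (volume name taken from the listing above):

```shell
# Read-only triage while self-heal is running (Gluster 3.3 CLI):
gluster volume heal staticvol info      # which files are still pending heal
gluster volume status staticvol         # are the brick and NFS-server processes alive?
gluster volume top staticvol read-perf  # which bricks are being hammered
```

Combined with iostat/sar on each brick, this is the kind of evidence that would make the bug report Joe asks for actionable.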
> > _______________________________________________
> > Gluster-users mailing list
> > Gluster-users at gluster.org
> > http://supercolony.gluster.org/mailman/listinfo/gluster-users
> 


