[Gluster-users] Failed rebalance resulting in major problems
Joe Julian
joe at julianfamily.org
Wed Nov 6 20:15:40 UTC 2013
On 11/06/2013 11:52 AM, Justin Dossey wrote:
> Shawn,
>
> I had a very similar experience with a rebalance on 3.3.1, and it took
> weeks to get everything straightened out. I would be happy to share
> the scripts I wrote to correct the permissions issues if you wish,
> though I'm not sure it would be appropriate to share them directly on
> this list. Perhaps I should just create a project on Github that is
> devoted to collecting scripts people use to fix their GlusterFS
> environments!
>
> After that (awful) experience, I am loath to run further rebalances.
> I've even spent days evaluating alternatives to GlusterFS, as my
> experience with this list over the last six months indicates that
> support for community users is minimal, even in the face of major bugs
> such as the one with rebalancing and the continuing "gfid different on
> subvolume" bugs with 3.3.2.
I'm one of the oldest GlusterFS users around here and one of the biggest
proponents, and even I have been loath to rebalance until 3.4.1.
There are no open bugs for gfid mismatches that I could find. The last
time someone mentioned that error in IRC it was 2am, I was at a
convention, and I told the user how to solve that problem (
http://irclog.perlgeek.de/gluster/2013-06-14#i_7196149 ). It was caused
by split-brain. If you have a bug, it would be more productive to file
it rather than make negative comments about a community of people that
have no requirement to help anybody, but do it anyway just because
they're nice people.
This is going to sound snarky because it's in text, but I mean this
sincerely. If community support is not sufficient, you might consider
purchasing support from a company that provides it professionally.
>
> Let me know what you think of the Github thing and I'll proceed
> appropriately.
Even better, put them up on http://forge.gluster.org
>
>
> On Tue, Nov 5, 2013 at 9:05 PM, Shawn Heisey <gluster at elyograg.org> wrote:
>
> We recently added storage servers to our gluster install, running 3.3.1
> on CentOS 6. It went from 40TB usable (8x2 distribute-replicate) to
> 80TB usable (16x2). There was a little bit over 20TB used space on the
> volume.
>
> The add-brick went through without incident, but the rebalance failed
> after moving 1.5TB of the approximately 10TB that needed to be moved. A
> side issue is that it took four days for that 1.5TB to move. I'm aware
> that gluster has overhead, and that there's only so much speed you can
> get out of gigabit, but a 100Mb/s half-duplex link could have copied the
> data faster if it had been a straight copy.
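The comparison above can be checked with some quick arithmetic. This is
only a rough sketch: it assumes decimal units (1 TB = 10**12 bytes) and
an idealised link that sustains its full nominal 100 Mb/s, which a real
half-duplex link would not quite manage. The function names are made up
for illustration.

```python
# Back-of-the-envelope throughput check (decimal units assumed).
SECONDS_PER_DAY = 24 * 60 * 60


def throughput_mbit_per_s(bytes_moved: float, seconds: float) -> float:
    """Average throughput in megabits per second."""
    return bytes_moved * 8 / seconds / 1e6


def transfer_days(bytes_to_move: float, link_mbit_per_s: float) -> float:
    """Days needed to move the data at a sustained link rate."""
    return bytes_to_move * 8 / (link_mbit_per_s * 1e6) / SECONDS_PER_DAY


moved = 1.5e12  # the 1.5TB that the rebalance moved
observed = throughput_mbit_per_s(moved, 4 * SECONDS_PER_DAY)
ideal_days = transfer_days(moved, 100)

print(f"observed rebalance rate: {observed:.1f} Mb/s")
print(f"1.5 TB at a sustained 100 Mb/s: {ideal_days:.2f} days")
```

Under those assumptions the rebalance averaged roughly 35 Mb/s, and a
straight copy at a sustained 100 Mb/s would have needed about a day and
a half rather than four days.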
>
> After I discovered that the rebalance had failed, I noticed that there
> were other problems. There are a small number of completely lost files
> (91 that I know about so far), a huge number of permission issues (over
> 800,000 files changed to 000), and about 32000 files that are throwing
> read errors via the fuse/nfs mount but seem to be available directly on
> bricks. That last category of problem file has the sticky bit set, with
> almost all of them having ---------T permissions. The good files on
> bricks typically have the same permissions, but are readable by root. I
> haven't worked out the scripting necessary to automate all the fixing
> that needs to happen yet.
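As a starting point for that kind of scripting, here is a minimal,
read-only sketch that inventories two of the categories described above
on a single brick: regular files whose permission bits are 000, and
regular files with only the sticky bit set (---------T). It is not
anyone's actual fix script. The function names and labels are invented
for illustration, and skipping the .glusterfs directory at the brick
root is my assumption about what you would want. It changes nothing on
disk.

```python
import os
import stat
from collections import defaultdict


def classify_mode(mode: int) -> str:
    """Label a file mode as 'mode-000', 'sticky-only' or 'ok'."""
    perms = stat.S_IMODE(mode)  # permission bits incl. setuid/setgid/sticky
    if perms == 0:
        return "mode-000"       # ---------- : every permission bit cleared
    if perms == stat.S_ISVTX:
        return "sticky-only"    # ---------T : only the sticky bit set
    return "ok"


def scan_brick(brick_root: str) -> dict:
    """Walk a brick directory and group suspicious regular files by label."""
    found = defaultdict(list)
    for dirpath, dirnames, filenames in os.walk(brick_root):
        # Assumption: skip gluster's internal .glusterfs tree at the brick root.
        if dirpath == brick_root and ".glusterfs" in dirnames:
            dirnames.remove(".glusterfs")
        for name in filenames:
            path = os.path.join(dirpath, name)
            st = os.lstat(path)  # lstat: do not follow symlinks
            if not stat.S_ISREG(st.st_mode):
                continue
            label = classify_mode(st.st_mode)
            if label != "ok":
                found[label].append(path)
    return dict(found)
```

Running scan_brick() against each brick path (as root, on the storage
server rather than through the fuse/nfs mount) would give per-brick
lists that can be reviewed before deciding what, if anything, to chmod.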
>
> We really need to know what happened. We do plan to upgrade to 3.4.1,
> but there were some reasons that we didn't want to upgrade before adding
> storage.
>
> * Upgrading will result in service interruption to our clients, which
> mount via NFS. It would likely be just a hiccup, with quick failover,
> but it's still a service interruption.
> * We have a pacemaker cluster providing the shared IP address for NFS
> mounting. It's running CentOS 6.3. A "yum upgrade" to upgrade gluster
> will also upgrade to CentOS 6.4. The pacemaker in 6.4 is incompatible
> with the pacemaker in 6.3, which will likely result in
> longer-than-expected downtime for the shared IP address.
> * We didn't want to risk potential problems with running gluster 3.3.1
> on the existing servers and 3.4.1 on the new servers.
> * We needed the new storage added right away, before we could schedule
> maintenance to deal with the upgrade issues.
>
> Something that would be extremely helpful would be obtaining the
> services of an expert-level gluster consultant who can look over
> everything we've done to see if there is anything we've done wrong and
> how we might avoid problems in the future. I don't know how much the
> company can authorize for this, but we obviously want it to be as cheap
> as possible. We are in Salt Lake City, UT, USA. It would be preferable
> to have the consultant be physically present at our location.
>
> I'm working on redacting one bit of identifying info from our rebalance
> log, then I can put it up on dropbox for everyone to examine.
>
> Thanks,
> Shawn
>
> _______________________________________________
> Gluster-users mailing list
> Gluster-users at gluster.org
> http://supercolony.gluster.org/mailman/listinfo/gluster-users
>
>
>
>
> --
> Justin Dossey
> CTO, PodOmatic