[Gluster-users] NFS availability

Joe Julian joe at julianfamily.org
Fri Feb 1 01:32:07 UTC 2013

On 01/31/2013 03:57 PM, Stephan von Krawczynski wrote:
> On Thu, 31 Jan 2013 16:00:38 -0500
> Jeff Darcy <jdarcy at redhat.com> wrote:
>>> Most common network errors are not a matter of design, but of dead
>>> iron.
>> It's usually both - a design that is insufficiently tolerant of
>> component failure, plus a combination of component failures that exceeds
>> that tolerance.  You seem to have a very high standard for filesystems
>> continuing to maintain 100% functionality - and I suppose 100%
>> performance as well - if there's any possibility whatsoever that they
>> could do so.  Why don't you apply that same standard to the part of the
>> system that you're responsible for designing?  Running any distributed
>> system on top of a deficient network infrastructure will lead only to
>> disappointment.
> I am sorry that glusterfs is part of the design and your critics.
This sentence is incomprehensible. GlusterFS is part of the design and 
his/our critics? I haven't seen anything from GlusterFS criticizing anyone.
> Everyone working sufficiently long with networks of all kinds of sizes and
> components can tell you that in the end you want a design for a file service
> that works as long as possible. This means it should survive even if there is
> only one client and server and network path left.
Most users, if they're down to one client and one server, are out of 
business anyway. I know we can't run our whole company on one or two 
computers, and we're not even all that big.
> At least that is what is expected from glusterfs. Unfortunately sometimes you
> get disappointed. We saw just about everything happening when switching off
> all but one reliable network path including network hangs and server hangs
> (the last one) (read the list for examples by others).
After taking a power hit to the building that got through our 
commercial UPS and line conditioners (which also had to be replaced 
afterward), all of our switches had to be replaced. Part of my testing 
was to do just that: I pulled the plugs on all the switches, leaving 
only one data path, then changed that path. All the while I had full 
activity on all my volumes, including InnoDB transactions, VM image 
activity, web, Samba, and workstation home directories. I ran multiple 
dd tests from multiple clients during these tests. Not once did they 
even slow down.
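A minimal sketch of that kind of dd load follows. GLUSTER_MOUNT is a 
placeholder, not a path from the original post; point it at a real 
client mount (e.g. /mnt/gluster). It falls back to a temp dir here so 
the commands can be dry-run without a volume:

```shell
# Sustained write/read load against a GlusterFS client mount.
# GLUSTER_MOUNT is hypothetical; set it to your actual mount point.
MOUNT="${GLUSTER_MOUNT:-$(mktemp -d)}"
OUT="$MOUNT/ddtest.$(hostname).$$"

# 64 MiB sequential write, fsync'd so the data actually reaches the bricks.
dd if=/dev/zero of="$OUT" bs=1M count=64 conv=fsync 2>/dev/null

# Read it back to exercise the read path as well.
dd if="$OUT" of=/dev/null bs=1M 2>/dev/null
```

Running this simultaneously from several clients while pulling network 
paths is one way to reproduce the kind of failure testing described 
above.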
> On the other end of the story clients see servers go offline if you increase
> the non-gluster traffic on the network. Main (but not only) reason is the very
> low default ping time (read the list for examples by others).
42 seconds is low? I've never saturated my systems (and I used to have 
really crappy dual 32-bit Xeon servers) to the point where it took 42 
seconds for a server to respond. If your switches can't handle that 
kind of traffic, perhaps you're using the wrong hardware.
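For reference, the 42 seconds here is GlusterFS's default 
network.ping-timeout. If a network really is congested enough to trip 
it, the timeout can be raised per volume; `myvol` below is a 
placeholder volume name, not one from this thread:

```shell
# Inspect the volume; reconfigured options (including ping-timeout,
# if it has been changed from the 42-second default) are listed here.
gluster volume info myvol

# Raise the timeout to 60 seconds. Note the trade-off: a longer
# timeout also delays failover when a server really is dead.
gluster volume set myvol network.ping-timeout 60
```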
> All these seen effects show clearly that noone ever tested this to an extent I
> would have done writing this kind of software. After all this is a piece of
> software whose merely only purpose is surviving dead servers and networks.
> It is no question of design, because on paper everything looks promising.
> Sometimes your arguments let me believe you want glusterfs working like a ford
> car. A lot of technical gameplay built in but the idea that a car should be a
> good car in the first place got lost on the way somewhere. Quite a lot of the
> features built in lately have the quality of an mp3-player in your ford. Nice
> to have but does not help you a lot driving 200 and a rabbit crossing.
> And this is why I am requesting the equivalent of a BMW.
Using your own analogy, what good is a BMW if you've got no roads to 
drive it on?

You're talking in terms of single entities. Most (if not all) of the 
sysadmins I work with on a daily basis, my peers in the industry, 
members of LOPSA... we work in systems. We know how to build in 
redundancy and plan for and survive failures. There's not a week that 
goes by without something in my system encountering some issue, yet 
our company has not lost any productivity because of it. We have the 
best availability of all our competitors (and I do monitor their 
systems).

I'm not saying you're doing it wrong. Be as assured as you feel is 
appropriate for your system requirements. Just be aware that the 
majority of the industry does not share those requirements. I'm not 
speaking from my "gut instinct" but I am a member of several 
professional organizations. I attend functions on a weekly basis 
attended by members of organizations like Expedia, Ebay, Amazon, Google, 
Boeing, Starbucks, etc., and we talk about this stuff. Fault 
tolerance and recovery are a big part of what we do, probably the 
biggest, and I still advise the way I do, not just from my own 
experiences, but from the experiences of my peers.

Offer advice and back it up with facts, anecdotes, and/or tests, but 
accept that there are as many ways of managing systems as there are 
systems to be managed. Accept that there are professionals in the world 
that have been doing it longer, have more experience, and (and this is 
not to say anything negative about yourself) are smarter.
