[Gluster-users] Meta-discussion

Brian Candler B.Candler at pobox.com
Wed Jan 2 11:49:03 UTC 2013


On Thu, Dec 27, 2012 at 06:53:46PM -0500, John Mark Walker wrote:
> I invite all sorts of disagreeable comments, and I'm all for public
> discussion of things - as can be seen in this list's archives.  But, for
> better or worse, we've chosen the approach that we have.  Anyone who would
> like to challenge that approach is welcome to take up that discussion with
> our developers on gluster-devel.  This list is for those who need help
> using glusterfs.
> 
> I am sorry that you haven't been able to deploy glusterfs in production.
> Discussing how and why glusterfs works - or doesn't work - for particular
> use cases is welcome on this list.  Starting off a discussion about how
> the entire approach is unworkable is kind of counter-productive and not
> exactly helpful to those of us who just want to use the thing.

For me, the biggest problems with glusterfs are not in its design, feature
set or performance; they are around what happens when something goes wrong. 
As I perceive them, the issues are:

1. An almost total lack of error reporting, beyond incomprehensible entries
in log files on a completely different machine, which are very difficult to
find because they are interleaved with equally incomprehensible entries
from normal operation.
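
To make that concrete: the best I have been able to do is trawl
/var/log/glusterfs/ on every machine for error-severity lines.  Something
like this minimal sketch (Python; it assumes the 3.x log layout, where the
severity appears as a single letter such as "E" between the bracketed
timestamp and the source location - a format I worked out by staring at
the logs, not from any documentation):

    import glob

    # Pull error-severity lines out of every glusterfs log on this host.
    # Assumes 3.x log lines look like "[<timestamp>] E [<source>] ...";
    # that format is observed behaviour, not documented.
    for logfile in glob.glob("/var/log/glusterfs/*.log"):
        with open(logfile) as f:
            for line in f:
                if "] E [" in line:
                    print("%s: %s" % (logfile, line.rstrip()))

And then you have to repeat that on every other server and client, because
the entry you actually need is usually on a different machine.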

2. Incomplete documentation. This breaks down further as:

2a. A total lack of architecture and implementation documentation - such as
what the translators are and how they work internally, what a GFID is,
which xattrs are stored where and what they mean, and all the on-disk
states you can expect to see during replication and healing.  Without this
level of documentation, it's impossible to interpret the log messages from
(1) short of reverse-engineering the source code (which is itself sparsely
commented); and hence it's impossible to reason about what has happened
when the system is misbehaving, or what the correct and safe intervention
would be.

glusterfs 2.x actually had fairly comprehensive internals documentation,
but this has all been stripped out in 3.x, turning it into a "black box".
Meanwhile, development on 3.x has diverged far enough from 2.x that the
2.x documentation is no longer usable.
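
As an example of what that forces on you: the only way I have found to see
what replication thinks of a file is to read the trusted.* xattrs straight
off the brick and then guess at their meaning.  A minimal sketch (Python 3,
run as root on a brick; the path is hypothetical, and my reading of
trusted.gfid and trusted.afr.* is reverse-engineered, not documented):

    import os, binascii

    # Hypothetical path to one replica of a file on a brick.
    path = "/export/brick1/data/somefile"

    # Dump every trusted.* xattr glusterfs has left on the file.
    # trusted.gfid appears to be the volume-wide 16-byte file ID, and
    # trusted.afr.* appear to be replication change-counters - but
    # nothing in the 3.x documentation confirms or explains either.
    for name in os.listxattr(path):
        if name.startswith("trusted."):
            print(name, binascii.hexlify(os.getxattr(path, name)))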

2b. An almost total lack of procedural documentation, such as "to replace a
failed server with another one, follow these steps" (which in that case
involves manually copying peer UUIDs from one server to another), or "if
volume rebalance gets stuck, do this".  When you come across any of these
issues you end up asking the list, and to be fair the list generally
responds promptly and helpfully - but that approach doesn't scale, and
doesn't necessarily help if you have a storage problem at 3am.
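
To illustrate the UUID point: as far as I can tell, glusterd identifies a
peer by the UUID stored in /var/lib/glusterd/glusterd.info, and a
replacement server has to be given the dead server's UUID by hand before
the surviving peers will accept it.  Here is a sketch of just the
read-the-UUID step (the path and the UUID= line format are what I have
observed on 3.x - treat them as assumptions, since none of it is
documented):

    # Read glusterd's own UUID from its state file: the value that has
    # to be carried over to a replacement server by hand.  Path and
    # format are as observed on 3.x, not documented anywhere.
    GLUSTERD_INFO = "/var/lib/glusterd/glusterd.info"

    def read_glusterd_uuid(path=GLUSTERD_INFO):
        with open(path) as f:
            for line in f:
                if line.startswith("UUID="):
                    return line.strip().split("=", 1)[1]
        return None

    print(read_glusterd_uuid())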

For these reasons, I am holding back from deploying any of the more
interesting features of glusterfs, such as replicated volumes, or
distributed volumes which might grow and need rebalancing.  And without
those, I may as well go back to standard NFS and rsync.

And yes, I have raised a number of bug reports for specific issues, but
reporting a bug whenever you come across a problem in testing or production
is not the right answer.  It seems to me that all these edge cases, error
cases and recovery procedures should already have been developed and tested
*as a matter of course*, for a service as critical as storage.

I'm not saying there is no error handling in glusterfs, because that's
clearly not true.  What I'm saying is that any complex system is bound to
have states where processes cannot proceed without external assistance;
each of those cases needs to be tested, and needs good error reporting and
good documentation behind it.

I know I'm not the only person to have been affected, because there is a
steady stream of people on this list asking how to cope with replication
and rebalancing failures.

Please don't consider the above as non-constructive. I count myself amongst
"those of us who just want to use the thing".  But right now, I cannot
wholeheartedly recommend it to my colleagues, because I cannot confidently
say that I or they would be able to handle the failure scenarios I have
already experienced, or other ones which may occur in the future.

Regards,

Brian.
