[Gluster-users] Replication not working on server hang
Mark Mielke
mark at mark.mielke.cc
Sun Aug 30 15:10:34 UTC 2009
On 08/30/2009 04:00 AM, Anand Avati wrote:
>> I'm wondering if there's some way for glusterfs to detect the flaws of the
>> underlying operating system. I believe there's no bug-free file systems in
>> the universe, so I believe it is the job of the glusterfs developer to
>> specify which underlying filesystem is tested and supported. It's not good
>> to simply say that glusterfs works on all real-world approximations to an
>> imaginary bug-free posix filesystem.
>>
> I would be genuinely interested to know about another project which is
> geared up to be resilient against kernel hangs so that we can borrow
> some ideas on how to reliably detect kernel soft lockups or syscall
> hangs. As far as I know, even mature projects like Apache have not
> bothered fixing such hangs (or even detecting this kind of underlying
> OS flaw).
>
>
There are projects that require kernel patches to work properly (for
example, the OpenVZ project), and most Linux distributions (e.g. Red Hat)
maintain a set of kernel patches. Vendors may provide workarounds for
known kernel problems - for example, the Dovecot people go through
various means to flush the NFS or FUSE cache (including for GlusterFS)
before doing certain operations, and these are done using non-portable
operations.
In summary, relying on the Linux kernel (or any kernel, for that matter)
to be correct in all situations has limits. Sometimes it is necessary to
track down the problem, correct it, and provide a patch. This can involve
discussions on linux-dev leading to the problem finally being corrected
upstream, so that the patch is no longer needed. I'm not saying it has
to go this far - but unless the problem is understood, it shouldn't be
written off either. If GlusterFS can issue a set of operations that
reproducibly causes ext3 to freeze, this is a concern for both the ext3
developers/maintainers and the GlusterFS developers/maintainers, and it
is a joint problem to solve, since ext3 is so common.
As for detecting lockups or hangs - I'm not aware of this being done in
user space, but it could be argued that this is a somewhat artificial
comparison, because GlusterFS is, at its base, a network file system,
and it *is* common for network file systems (such as NFS) to deal with
problems with the underlying volumes. GlusterFS uses FUSE as a novel
approach to avoiding the problem entirely - but if GlusterFS, from user
space, can cause the backend storage volume to freeze up, even for
processes outside GlusterFS, then it seems like the user space barrier
is insufficient.
For all of the above - I am assuming that GlusterFS is being used to do
something which ends up locking up the entire volume, even for processes
outside GlusterFS. If anybody is experiencing GlusterFS-*only* problems,
where the underlying volume is still accessible from another process,
then this would be a different problem, probably GlusterFS specific.
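One rough way to tell the two cases apart is to probe the backend
directory from a separate process, with a timeout. The following is only
a sketch in Python, not anything GlusterFS provides: the helper name
backend_responds, the probe one-liner, and the 5 second default are all
my own assumptions. It also cannot distinguish a hung volume from a path
that simply does not exist, since both return False.

```python
import subprocess
import sys

# Hypothetical probe: stat and list the directory in a child interpreter.
PROBE = "import os, sys; os.stat(sys.argv[1]); os.listdir(sys.argv[1])"


def backend_responds(path, timeout=5.0):
    """Return True if a child process can stat and list `path` in time.

    `timeout` (seconds) is an illustrative default, not a recommendation.
    """
    child = subprocess.Popen(
        [sys.executable, "-c", PROBE, path],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    try:
        # Exit status 0 means both calls returned without error.
        return child.wait(timeout=timeout) == 0
    except subprocess.TimeoutExpired:
        # A child blocked inside the kernel may not exit until its
        # syscall returns, so send the signal but do not wait for it.
        child.kill()
        return False


if __name__ == "__main__":
    target = sys.argv[1] if len(sys.argv) > 1 else "."
    state = "responsive" if backend_responds(target) else "NOT responsive"
    print(target, state)
```

The probe runs in a child process on purpose: if the stat() call itself
hangs in the kernel, only the child is stuck and the caller still gets
an answer. Run against the backend export directory (not the GlusterFS
mount point), a False result while GlusterFS is hung would suggest the
underlying volume is affected as well.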
Cheers,
mark
--
Mark Mielke<mark at mielke.cc>