[Gluster-users] Replication not working on server hang

Sat Aug 29 17:40:00 UTC 2009

Hi All,

I'm afraid that I have some more fuel to add to the glusterfs hanging
"fire".

Way back when experimenting with 1.3, I began experiencing hangs.

Then we added the read-ahead translator (xlator) to the server and the
hangs miraculously stopped.
That may well be a coincidence, I don't know, but we never hung while
read-ahead was loaded.

Then came version 2.0 and we hit a bug:

http://bugs.gluster.com/cgi-bin/bugzilla3/show_bug.cgi?id=144

So, we had to take out read-ahead and now we have the hangs again! Doh
& Double Doh!

This has forced me to take glusterfs out of production, and management
is now questioning my decision to utilize it at all (a subscription
won't be purchased anytime soon).

Some points to note:

I'm using ext3; the rest of my set-up is detailed in the above
bugzilla report.

My hangs have often been triggered by a grep -R on the GlusterFS
mount (yes, just reading!).

None of my hangs have ever given me a single log entry.

When hung, the affected server cannot be logged into. I just get the
first line, 'Last login:...', and nothing more.

This, and the behaviour of other services I run, seems to indicate that
existing processes which already have open FDs NOT on glusterfs can
continue to execute, but no new FDs can be opened at all, system wide.
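
Since the hangs leave no log entries, a small watchdog started before the
hang might at least timestamp the moment new opens stop working. The
following is only a rough sketch, not something I have run against a hung
box. It assumes a thread blocked in open() does not stall the rest of the
process, and that writes to an already-open stderr keep working, as
observed above. The names try_open and probe, the default path
/etc/hostname and the timings are all made up for illustration.

```python
import os
import sys
import threading
import time


def try_open(path, result):
    """Open and close path, recording the outcome in the shared dict."""
    try:
        fd = os.open(path, os.O_RDONLY)
        os.close(fd)
        result["status"] = "ok"
    except OSError as exc:
        result["status"] = "error: %s" % exc


def probe(path, timeout=5.0):
    """Return "ok", "error: ..." or "hung" if open() has not returned in time."""
    result = {"status": "hung"}
    # Daemon thread: a worker stuck inside open() must not keep us waiting.
    worker = threading.Thread(target=try_open, args=(path, result), daemon=True)
    worker.start()
    worker.join(timeout)
    return result["status"]


if __name__ == "__main__":
    # Paths to test, e.g. one on the local disk and one on the glusterfs mount.
    paths = sys.argv[1:] or ["/etc/hostname"]
    while True:
        for path in paths:
            # stderr is already open, so logging needs no new FD.
            sys.stderr.write("%.0f %s %s\n" % (time.time(), path, probe(path)))
            sys.stderr.flush()
        time.sleep(10)
```

Run it with stderr redirected to a file on local disk before starting the
grep -R. If the theory above holds, the last lines should show "hung" for
every path, local or not, from the moment the box locks up.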

To date there has been a lot of talk about the underlying FS being an
issue in these cases.

I seriously doubt it, & certainly not in the case of ext3.

I agree that the server process shouldn't be able to hang a stable
system, but what about the client?

Could this be the work of a GlusterFS/FUSE/kernel interaction?

Whatever the cause, it is one very large show stopper that we MUST
rectify.

We may well be dealing with several parallel issues.
Finding common factors in our glusterfs instances should help us
narrow down the search.

Regards, Jeff.