[Gluster-devel] Re: [Gluster-users] 2.0.6

Sat Aug 22 17:24:48 UTC 2009

> Ok, please stay serious. As described in my original email from 19th
> effectively _all_ four physical boxes have not-moving (I deny to use "hanging"
> here) gluster processes. The mount points on the clients hang (which made
> bonnies stop), the primary server looks pretty much ok, but does obviously
> serve nothing to the clients, and the secondary has a hanging local fs for
> what causes ever.

Can you please confirm this - that the reiserfs filesystem hosting the
storage/posix volume's export directory, stops responding to df when
you run df on the server machine? This is what your mails have been
making me understand. Please correct me if I'm wrong.

> Now can you please elaborate how you come to the conclusion that this complete
> service lock up derives from one hanging fs on one secondary server of a
> replicate setup (which you declare as the cause and I as the effect of locked
> up gluster processes).

I already explained this in a previous email in this thread. Here it
goes again - the glusterfsd on that particular machine having the
locked up backend fs has two types of threads - the main/event thread,
and IO threads (loaded as translators). All disk access happens via IO
threads. Based on the logs you sent me outside this mail thread (which
I still request you to file as a bug so that they are more convenient to
refer to), it is clear that glusterfsd has a lot of syscall requests
"queued" in the IO thread, and the main/event thread is active and
responding to ping-pongs. The fact that glusterfsd is able to
respond to ping messages within "ping-timeout" seconds (defaults to
10) keeps the status of this server as 'alive'.
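
To make the two-thread picture concrete, here is a toy model in Python. It is only a sketch and not glusterfsd code: the ToyServer class, its method names and the request names are all made up for illustration, and the hung backend fs is simulated with an event that is never set.

```python
import queue
import threading

class ToyServer:
    """Hypothetical stand-in for a server with an event thread and one IO thread."""

    def __init__(self):
        self.io_queue = queue.Queue()
        self.backend_ok = threading.Event()   # left unset: backend fs is "hung"
        self.completed = []
        worker = threading.Thread(target=self._io_loop, daemon=True)
        worker.start()

    def _io_loop(self):
        # IO thread: every disk access goes through here.
        while True:
            request = self.io_queue.get()
            self.backend_ok.wait()            # blocks for as long as the backend hangs
            self.completed.append(request)
            self.io_queue.task_done()

    def ping(self):
        # Event thread: never touches the disk, so it can always answer.
        return "pong"

    def submit(self, request):
        self.io_queue.put(request)

server = ToyServer()
for name in ("statfs", "write", "stat"):
    server.submit(name)

print(server.ping())                          # still answers while the backend hangs
print(server.io_queue.unfinished_tasks)       # requests that have not completed
```

The point of the model is that ping() keeps returning while nothing submitted to the IO thread completes, so from the outside the server looks alive.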

   The replicate module, which is in use here, is a synchronous replicator.
Modification operations are performed on both subvolumes under a
single transaction. Most read operations are served from just a single
subvolume, but some read operations like 'df' wait for the statistics
from all the subvolumes, so that the values of the least-free disk can
be presented as the final result. Now when a syscall is sent to a server
and the server takes a very long time to respond to it (like in this
case), while still responding to ping-pong requests, a different
timeout (frame-timeout) comes into the picture. By default it gives the
server 30 _minutes_ to complete just that operation, and if the reply
fails to arrive in that time, the client bails out just _that_ call.
All this activity is clearly visible in the logs - syscalls failing to
respond, and after 1800 secs syscalls getting bailed out in the client
logs, just for server2. Because glusterfsd on the server is responding
to ping-pongs, the connection itself is not broken and syscalls are
given the full frame-timeout seconds to complete. So syscalls end up
"blocking" on the fuse mountpoint for 30mins, but the connection itself
is not considered to be dead since the ping-pongs are continuing.
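
The interplay of the two timeouts can be written down as a small decision function. This is a sketch under assumptions, not the client's actual code: the function name, its arguments and the exact comparison operators are hypothetical, and only the two default values (10 and 1800 seconds) come from the description above.

```python
PING_TIMEOUT = 10      # seconds, default quoted above
FRAME_TIMEOUT = 1800   # seconds, default quoted above

def client_view(now, last_pong_at, call_sent_at,
                ping_timeout=PING_TIMEOUT, frame_timeout=FRAME_TIMEOUT):
    """Return what the client does about one outstanding call at time `now`."""
    if now - last_pong_at > ping_timeout:
        # No pong in time: the whole connection is considered dead.
        return "disconnect"
    if now - call_sent_at >= frame_timeout:
        # Pongs are fine but this call never returned: bail out just this call.
        return "bail-call"
    # Server looks alive and the call still has time left: keep blocking.
    return "wait"

print(client_view(now=600, last_pong_at=595, call_sent_at=0))    # wait
print(client_view(now=1800, last_pong_at=1795, call_sent_at=0))  # bail-call
print(client_view(now=30, last_pong_at=5, call_sent_at=0))       # disconnect
```

With pongs arriving every few seconds, the first branch is never taken, so the only way out for a stuck call is the second branch after the full frame-timeout.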

  This in turn results in the df call on the glusterfs (fuse)
mountpoint blocking until the call to the second server is bailed
out. It could end up taking more than 30mins if the fuse kernel
module's request queue is already full of such blocking calls (the df
will get queued only after one of them returns).
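
As a rough model of why df blocks for so long: df needs an answer from every subvolume, so its wait is set by the slowest one, capped by frame-timeout. The sketch below is illustrative only. The function name and the reply format are invented, and dropping a bailed-out reply from the least-free computation is an assumption, not something confirmed here about replicate.

```python
FRAME_TIMEOUT = 1800   # seconds, default quoted above

def df_on_mount(replies, frame_timeout=FRAME_TIMEOUT):
    """replies maps a subvolume name to (seconds_to_reply or None, free_blocks).

    Returns (seconds df blocked, least free blocks among the answers received).
    """
    waited = 0
    free = []
    for latency, free_blocks in replies.values():
        if latency is None or latency > frame_timeout:
            # No answer in time: the call to this subvolume is bailed out.
            waited = max(waited, frame_timeout)
        else:
            waited = max(waited, latency)
            free.append(free_blocks)
    least_free = min(free) if free else None
    return waited, least_free

# server2 never answers because its backend fs is hung
print(df_on_mount({"server1": (0.01, 500), "server2": (None, 300)}))
```

One hung subvolume is enough to push the wait to the full 1800 seconds, even though the other subvolume answered almost immediately.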

  If you find the above description too verbose, I don't mind taking a
second shot at explaining it in simpler terms. The hanging of the
backend filesystem cannot be caused by glusterfsd, or any application.
For whatever reason (very likely a bug in the backend fs)
the backend fs is hanging, which is indicated by both the glusterfs logs
(frame-timeouts happening but not ping-timeouts) and your
interactive shell experience (df hanging). This results in the
glusterfs fuse mountpoint "hanging" the system calls, which wait for
the syscall responses from the server for frame-timeout (1800) seconds.

As you rightly summarized,
Your theory: glusterfs is buggy (cause) and results in all fuse
mountpoints hanging, and also results in server2's backend fs hanging
(effect)

My theory: your backend fs is buggy (cause) and hangs, which results in
all fuse mountpoints hanging (effect), for the reasons explained
above

I maintain that my theory is right because glusterfsd just cannot
cause a backend filesystem to hang, and if it indeed did, the bug is
in the backend fs because glusterfsd only performs system calls to
access it.

Avati