[Gluster-users] 2.0.6

Fri Aug 21 22:42:21 UTC 2009

Stephan,
   Please find replies below. I am merging the thread back to the ML.

> > Stephan, we need some more info. I think we are a lot closer to
> > diagnosing this issue now. The hang is being caused by an io-thread
> > getting hung either as a deadlock inside the glusterfsd process code,
> > or blocked on disk access for an excessively long time. The following
> > details will be _extremely_ useful for us.
> > 
> > 1. What is your backend FS, kernel version and distro running on
> > server2? Is the backend FS on a local disk or some kind of SAN or
> > iSCSI?
> 
> The backend FS is reiserfs3, kernel version 2.6.30.5, distro openSuSE 11.1.
> The backend FS resides on a local Areca RAID system. See attached output of
> former email.
> 
> > 2. Was the glusterfsd on server2 taking 100% cpu at the time of the
> > hang?
> 
> I can only try to remember that from the time I took the strace logs. I am not
> a 100% sure, but from typing and looking I would say the load was very low,
> probably next to zero.
> 
> > 3. On server2, now that you have killed it with -11, you should be
> > having a core file in /. Can you get the backtrace from all the
> > threads? Please use the following commands -
> > 
> > sh# gdb /usr/sbin/glusterfsd -c /core.X
> > 
> > and then at the gdb prompt
> > 
> > (gdb) thread apply all bt
> > 
> > This should output the backtraces from all the threads.
> 
> The bad news is this: we were not able to normally shut down the box because
> the local (exported) fs hung completely. So shutdown did not work. We had to
> hard-reset it. When examining the box few minutes ago we had to find out that
> all logs (and likely the core dump) were dumped and lost. I have seen this
> kind of behaviour before, it is originated from reiserfs3 and not really
> unusual. This means: we redo the test and hope we can force the problem again.
> Then we take all possible logs, dmesg, cores away from the server before
> rebooting it. I am very sorry we lost the important part of information...

Stephan,
   This clearly points to the backend as the root cause: the bonnie hangs you have been seeing on every 2.0.x release are caused by the hanging reiserfs export. When the backend FS misbehaves, this is the expected behavior of GlusterFS. You will see it in all versions of GlusterFS, and you will face the same hangs with NFS, or even when running bonnie directly on your backend FS. All the IO calls are getting queued and blocked in the IO thread which is touching the disk, while the main FS thread stays up and responds to ping-pong requests, thus keeping the server "alive".

All of us on this ML could have spent far fewer cycles if the initial description of the problem had mentioned that the backend reiserfs3 on one of the servers was already known to freeze in this environment. When someone reports a hang on the glusterfs mountpoint, the first thing we developers do is try to find code paths for what we call "missing frames" (technically a syscall leak, somewhat like a memory leak), and this is very demanding and time-consuming debugging for us. All the information you can provide will only help us debug the issue faster.
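
For anyone who finds the "server looks alive but every IO hangs" symptom confusing, here is a toy sketch of the situation in Python. It is not GlusterFS code and does not reflect how the io-threads translator is actually implemented; the names (run_demo, backend_recovered, the "write-N" requests) are made up for illustration. One worker thread stands in for the io-thread and blocks on an event that stands in for the frozen backend FS, while the main thread keeps answering pings.

```python
import queue
import threading
import time


def run_demo(stall_seconds=0.5, pings=3, num_requests=5):
    """Toy model: one io-thread stuck on the 'disk' while pings are still answered."""
    requests = queue.Queue()
    backend_recovered = threading.Event()  # unset = backend FS is frozen
    completed = []

    def io_worker():
        while True:
            req = requests.get()
            if req is None:           # sentinel: shut the worker down
                break
            backend_recovered.wait()  # blocks here for as long as the "disk" hangs
            completed.append(req)

    worker = threading.Thread(target=io_worker, daemon=True)
    worker.start()

    # Clients keep submitting IO; everything piles up behind the stuck request.
    for i in range(num_requests):
        requests.put(f"write-{i}")

    # Meanwhile the main thread is not blocked and answers every ping.
    pongs = []
    for i in range(pings):
        pongs.append(f"pong-{i}")
        time.sleep(stall_seconds / pings)

    # State during the stall: (pings answered, IO requests completed).
    during_stall = (len(pongs), len(completed))

    # The backend comes back: queued IO drains and the worker exits.
    backend_recovered.set()
    requests.put(None)
    worker.join()
    return during_stall, len(completed)


if __name__ == "__main__":
    during_stall, finished = run_demo()
    print("during stall (pongs, completed IO):", during_stall)
    print("completed IO after recovery:", finished)
```

During the stall every ping gets an answer and no IO request completes, which is the same picture a client sees when the export is frozen: the connection stays up, but the mountpoint hangs.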

All,
   The reason I merged this thread back into the ML is that we want to ask anybody reporting issues to give as much information as possible upfront. In the interest of all of us, the developers and more importantly the community waiting for quicker releases, good bug reports are the best thing you can offer us. Please describe the FS configuration, environment, application and steps to reproduce the issue, with versions, configs and logs of every relevant component. And if you can report all this directly on our bug tracking site (http://bugs.gluster.com), keeping the MLs for discussions as much as possible, that would be the best you can do for us.
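
As a rough starting point, the checklist above could be gathered with a small script along these lines. This is only a sketch, not an official tool: build_report and tail are hypothetical helper names, and the files you pass in (volfiles, glusterfs logs, saved dmesg output) are whatever applies on your system. It uses only the Python standard library. Details the script cannot know, such as the backend FS type or known problems with it, still have to go into the description by hand.

```python
import platform
from pathlib import Path


def tail(path, max_lines=50):
    """Return the last max_lines lines of a text file, or a note if it is unreadable."""
    try:
        lines = Path(path).read_text(errors="replace").splitlines()
    except OSError as exc:
        return f"(could not read {path}: {exc})"
    return "\n".join(lines[-max_lines:])


def build_report(description, steps, files=()):
    """Assemble a plain-text bug report from a description, steps and file tails."""
    uname = platform.uname()
    parts = [
        "Description: " + description,
        f"Kernel: {uname.system} {uname.release}",
        f"Machine: {uname.machine}",
        "Steps to reproduce:",
    ]
    parts += [f"  {n}. {step}" for n, step in enumerate(steps, 1)]
    for path in files:
        parts.append(f"--- {path} (last lines) ---")
        parts.append(tail(path))
    return "\n".join(parts)


if __name__ == "__main__":
    print(build_report(
        description="bonnie hangs on the glusterfs mountpoint; backend FS is reiserfs3",
        steps=["mount the volume", "run bonnie on the mountpoint"],
        files=[],  # add the paths of your own config and log files here
    ))
```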

Thank you for all the support!

Avati