[Gluster-users] The continuing story ...

Anand Avati avati at gluster.com
Thu Sep 10 16:15:00 UTC 2009


> > Which actually reinforces the point that glusterfs has very little
> to
> > do with this kernel lockup. It is not even performing the special
> fuse
> > protocol communication with the kernel in question. Just plain
> vanilla
> > POSIX system calls on disk filesystem and send/recv on TCP/IP
> sockets.
> 
> this does not reinforce anything special; gluster may be eating
> resources and not releasing them, or calling system functions with
> bad arguments, and the system may run out of them or enter some
> race condition and produce the lock.

Instead of guessing and contemplating and spending your brain cycles on figuring out the cause, have you taken the effort to post the kernel backtraces you have to the linux-kernel mailing list yet? All you need to do is compose an email with the attachment you have already posted here previously and send it out to LKML.

> Just note that a user has pointed out in
> another message a part of the code not testing for NULL pointers, so
> the code could be full of similar things that can produce undesirable
> and/or unknown side effects.

Now you are clearly proving that you have no clue about what you just said, nor have you been reading my previous explanations. You have no clue what a NULL pointer in a userland application can and cannot do, yet you talk about the unknown and undesirable effects of such programming bugs without understanding fundamental operating-system concepts: kernel memory isolation and the system-call boundary between processes and the kernel. A missing NULL check can result in a segfault of glusterfsd. A userspace application has a limit to the damage it can cause, and that limit exists precisely by virtue of it being a userspace application in the first place.

Is glusterfsd eating up and not releasing resources? It could be. It may not be. That being the trigger for the kernel lockup is one among very many possibilities. At first glance the backtrace does not appear to have anything to do with resource leaks or with glusterfs. To find out what the soft lockup is all about, write to LKML and ask. A soft lockup is a kernel bug, whether you personally like it or not. Whether glusterfs is triggering this soft lockup is not clear. Let's say it indeed is. It could be doing something like an extended-attribute call with a 2^n+1 byte buffer which triggered an off-by-one bug in the kernel. Or maybe it sent data in packet sizes which resulted in a certain pattern of fragmentation leading to who knows what. Or maybe it allocated and freed memory regions in a particular order. What kinds of debugging would you like to see added to glusterfs? Would you like glusterfs to check for itself whether it is performing system calls too frequently? Or after an odd number of jiffy intervals? Or whether it allocates a prime number of bytes per memory allocation? It is those kinds of races and equally weird corner cases which result in soft lockups. Do you expect every userland application that has ever triggered a kernel soft lockup to implement such instrumentation within itself?

In the end, whether glusterfs has such instrumentation or not, the path to the answer you are looking for is in that very kernel backtrace. Your approach to debugging this kernel soft lockup is extremely inefficient for both you and us. glusterfs misbehavior (definitely not NULL pointer access!) is one of the possibilities, though a very unlikely one from how your kernel backtrace appears. Work on the evidence you already have. Do you want me to post your backtraces on LKML on your behalf? Those developers will tell you whether it is that unlikely case of an application leaking resources which caused this lockup, or a programming bug within the kernel itself. Without this initial groundwork on the primary evidence already in your hands, please do not expect any further assistance on this list for debugging the soft lockup until you have an indication from the kernel developers that the cause is a misbehaving user app. No other (userspace) project will help you with such lockups either. There have been cases where rsync triggers a soft lockup but scp does not. Do you blame rsync for not having sufficient instrumentation and debugging facilities, or accuse it of eating up resources -- all this even before you post the kernel backtrace to LKML? With this approach you would be making a fool of yourself on other project lists, let alone receiving such patient replies and advice.

This apart, we have no hard feelings and are just as keen on resolving your other glusterfs bug reports.

Avati




