[Gluster-users] The continuing story ...

Thu Sep 10 19:09:02 UTC 2009

On 09/10/2009 09:38 AM, David Saez Padros wrote:
>
>> In particular, if you read about the intent of FUSE - the technology 
>> being used to create a file system, I think you will find that what 
>> Anand is saying is the *exact* purpose for this project.
>
> the lockups are on server side not in client side and fuse is
> not used on the server side

I think there is Stephan's problem and your problem, and I'm losing 
track over which one is being discussed. Sorry. :-)

Server side, pure user space, with hardware locking up, or the kernel 
not being able to use a hardware resource - is a kernel problem. Yes, user 
space can trigger it - for example, by opening so many sockets and other 
such kernel resources, as to fill low memory - but as we found out 
recently, this is where the kernel is supposed to come in and kick the 
user program out with an out of memory killer, or not grant the 
resources in the first place.

As it is - do we have evidence that GlusterFS is using up a large number 
of file descriptors, sockets, processes, virtual memory, or other kernel 
resources? It seems to me that the failure in the case with the logs was 
the kernel detecting that the CPU had not woken up for a long period of time?
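
One way to gather that evidence is to sample the server process through 
/proc while the load is running. What follows is only a minimal sketch, 
assuming a Linux /proc layout; the function names are my own invention, 
and reading another user's /proc/<pid>/fd normally requires root. It 
counts open descriptors and sockets and reports a few memory fields:

```python
import os
import sys


def parse_status(text):
    """Extract selected fields from the text of /proc/<pid>/status."""
    wanted = {"VmSize", "VmRSS", "Threads"}
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key in wanted:
            fields[key] = value.strip()
    return fields


def resource_snapshot(pid):
    """Return descriptor count, socket count, and memory fields for pid."""
    fd_dir = f"/proc/{pid}/fd"
    fds = os.listdir(fd_dir)
    sockets = 0
    for fd in fds:
        try:
            # Socket descriptors are symlinks of the form "socket:[inode]".
            if os.readlink(os.path.join(fd_dir, fd)).startswith("socket:"):
                sockets += 1
        except OSError:
            pass  # descriptor was closed while we were looking
    with open(f"/proc/{pid}/status") as f:
        snapshot = parse_status(f.read())
    snapshot["open_fds"] = len(fds)
    snapshot["sockets"] = sockets
    return snapshot


if __name__ == "__main__":
    # Pass the PID of the server process; defaults to this script's own PID.
    pid = int(sys.argv[1]) if len(sys.argv) > 1 else os.getpid()
    for name, value in sorted(resource_snapshot(pid).items()):
        print(f"{name}: {value}")
```

Run it periodically against the glusterfsd PID and compare the numbers 
over time. If they stay flat right up to the lockup, that points away 
from user space resource exhaustion.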

I'm not saying ignore GlusterFS in your evaluation - but I am saying if 
you truly want a resolution, you really should consider trying the Linux 
developers, and seeing what they think. If they say this is a GlusterFS 
specific problem, I'm sure Anand and gluster.com would take a very 
serious second look at it. Until then - they gave it a shot, and don't 
have the ability to diagnose your problem or fix your problem. You could 
say they are incompetent and uncaring about their users - but a more 
accurate statement would probably be that this is entirely out of their 
domain, and they are unable to help you, and their professional 
recommendation and mine is to contact Red Hat if you have a subscription, 
or if you do not, try the Linux developers.

I have no doubt at all that user space programs can hurt the kernel - 
but in every situation I can think of, the problem is really a *kernel* 
problem. The user space is just discovering the problem - which is 
unfortunate - but honestly, shit happens. We recently dealt with load 
builds failing due to the out of memory issue I referenced above, as a 
32-bit Linux kernel doesn't work very well with 32 Gbytes of RAM. 
Another problem we dealt with was Subversion mod_dav_fs quickly 
consuming all virtual memory in the machine, eventually leading to 
machine failure. For the Subversion issue - mod_dav_fs or something it 
uses should not be continually consuming more memory - so they have a 
bug - but the kernel *also* has a bug, because it should not allow httpd 
to bring the machine to a halt due to exhausted virtual memory. In the 
Subversion case, it's low on our priority list to solve, since we can 
work around it by having Apache recycle the process space more 
frequently and avoid the symptoms - but we should be taking this to both 
the Subversion developers at Collab.net *and* the Linux kernel 
developers. (I know what the Linux kernel developers will say though - 
a 32-bit kernel was not designed for 32 Gbytes of RAM, so upgrade to a 
64-bit kernel - but we have a RHEL subscription, so perhaps we could take 
it that route...)

Cheers,
mark

-- 
Mark Mielke <mark at mielke.cc>