[Gluster-devel] Too many open files

Brent A Nelson brent at phys.ufl.edu
Fri Apr 6 20:51:57 UTC 2007


I think you're right; the makefile glitch must have thrown off the rest of
the compile.  A fresh build seems stable, and a workload that previously
triggered the directory fd bug quickly now runs perfectly.

Looks good!

Thanks,

Brent

On Fri, 6 Apr 2007, Anand Avati wrote:

> Brent,
>  can you please send me your spec files? I am able to 'ls' here
> without any problems, and there is no fd leak observed. I have loaded
> just cluster/afr, and previously had loaded all performance xlators on
> both server and client side together, and in both cases things
> worked perfectly fine.
>
> I'm guessing the encryption makefile issue caused a bad build? (Things
> were changed in libglusterfs.) The makefile is committed now, though
> (along with the -l fix). Please do a make uninstall/clean/install,
> since quite a chunk of changes have gone in over the last few days.
>
> avati
>
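(A minimal sketch of the rebuild sequence avati asks for above, assuming a
standard autotools source tree checked out from tla; the configure/make step
in the middle is implied rather than stated in his mail:)

  # run from the top of the glusterfs source tree
  make uninstall        # remove binaries/libraries left from the bad build
  make clean            # throw away stale object files
  ./configure && make   # rebuild everything from scratch
  make install          # install the freshly built binaries
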
> On Fri, Apr 06, 2007 at 03:33:30PM -0400, Brent A Nelson wrote:
>> glusterfsd dies on both nodes almost immediately (I can ls successfully
>> once before it dies, but as soon as I cd in, they're dead).  The glusterfs processes
>> are still running, but I of course have "Transport endpoint is not
>> connected."
>>
>> Also, glusterfsd and glusterfs no longer seem to know where to log by
>> default and refuse to start unless I give the -l option on each.
>>
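(A possible workaround for the logging problem Brent mentions is to pass the
log file explicitly on every invocation. The -l option is the one discussed in
this thread; the -f spec-file flag, the server spec path, and the mount point
below are assumptions for illustration only:)

  glusterfsd -f /etc/glusterfs/glusterfs-server.vol \
             -l /var/log/glusterfs/glusterfsd.log
  glusterfs  -f /etc/glusterfs/glusterfs-client.vol \
             -l /var/log/glusterfs/glusterfs.log /mnt/glusterfs
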
>> Thanks,
>>
>> Brent
>>
>> On Fri, 6 Apr 2007, Anand Avati wrote:
>>
>>> Brent,
>>> the fix has been committed. can you please check if it works for you?
>>>
>>> regards,
>>> avati
>>>
>>> On Thu, Apr 05, 2007 at 02:09:26AM -0400, Brent A Nelson wrote:
>>>> That's correct.  I had commented out unify when narrowing down the mtime
>>>> bug (which turned out to be writebehind) and then decided I had no reason
>>>> to put it back in for this two-brick filesystem.  It was mounted without
>>>> unify when this issue occurred.
>>>>
>>>> Thanks,
>>>>
>>>> Brent
>>>>
>>>> On Wed, 4 Apr 2007, Anand Avati wrote:
>>>>
>>>>> Can you confirm that you were NOT using unify in the setup?
>>>>>
>>>>> regards,
>>>>> avati
>>>>>
>>>>>
>>>>> On Thu, Apr 05, 2007 at 01:09:16AM -0400, Brent A Nelson wrote:
>>>>>> Awesome!
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Brent
>>>>>>
>>>>>> On Wed, 4 Apr 2007, Anand Avati wrote:
>>>>>>
>>>>>>> Brent,
>>>>>>> thank you so much for taking the trouble to send the output!
>>>>>>> From the log it is clear that the leaked fds are all for directories.
>>>>>>> Indeed, there was an issue with the releasedir() call reaching all the
>>>>>>> nodes. The fix should be committed to tla today.
>>>>>>>
>>>>>>> Thanks!!
>>>>>>>
>>>>>>> avati
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Apr 04, 2007 at 09:18:48PM -0400, Brent A Nelson wrote:
>>>>>>>> I avoided restarting, as this issue would take a while to reproduce.
>>>>>>>>
>>>>>>>> jupiter01 and jupiter02 are mirrors of each other.  All performance
>>>>>>>> translators are in use, except for writebehind (due to the mtime bug).
>>>>>>>>
>>>>>>>> jupiter01:
>>>>>>>> ls -l /proc/26466/fd |wc
>>>>>>>> 65536  655408 7358168
>>>>>>>> See attached for ls -l output.
>>>>>>>>
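(A quick way to confirm how many of those descriptors point at directories,
assuming GNU coreutils on the server; 26466 is the glusterfsd PID from the
listing above:)

  PID=26466
  # -L dereferences the /proc symlinks, so fds open on directories show a
  # leading 'd'; anonymous targets (sockets, eventpoll) just produce errors,
  # which 2>/dev/null discards
  ls -lL /proc/$PID/fd 2>/dev/null | grep -c '^d'
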
>>>>>>>> jupiter02:
>>>>>>>> ls -l /proc/3651/fd |wc
>>>>>>>> ls -l /proc/3651/fd
>>>>>>>> total 11
>>>>>>>> lrwx------ 1 root root 64 2007-04-04 20:43 0 -> /dev/null
>>>>>>>> lrwx------ 1 root root 64 2007-04-04 20:43 1 -> /dev/null
>>>>>>>> lrwx------ 1 root root 64 2007-04-04 20:43 10 -> socket:[2565251]
>>>>>>>> lrwx------ 1 root root 64 2007-04-04 20:43 2 -> /dev/null
>>>>>>>> l-wx------ 1 root root 64 2007-04-04 20:43 3 ->
>>>>>>>> /var/log/glusterfs/glusterfsd.log
>>>>>>>> lrwx------ 1 root root 64 2007-04-04 20:43 4 -> socket:[2255275]
>>>>>>>> lrwx------ 1 root root 64 2007-04-04 20:43 5 -> socket:[2249710]
>>>>>>>> lr-x------ 1 root root 64 2007-04-04 20:43 6 -> eventpoll:[2249711]
>>>>>>>> lrwx------ 1 root root 64 2007-04-04 20:43 7 -> socket:[2255306]
>>>>>>>> lr-x------ 1 root root 64 2007-04-04 20:43 8 ->
>>>>>>>> /etc/glusterfs/glusterfs-client.vol
>>>>>>>> lr-x------ 1 root root 64 2007-04-04 20:43 9 ->
>>>>>>>> /etc/glusterfs/glusterfs-client.vol
>>>>>>>>
>>>>>>>> Note that it looks like all those extra directories listed on
>>>>>>>> jupiter01 were locally rsynced from jupiter01's Lustre filesystems
>>>>>>>> onto the glusterfs client on jupiter01.  A very large rsync from a
>>>>>>>> different machine to jupiter02 didn't go nuts.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>> Brent
>>>>>>>>
>>>>>>>> On Wed, 4 Apr 2007, Anand Avati wrote:
>>>>>>>>
>>>>>>>>> Brent,
>>>>>>>>> I hope the system is still in the same state so we can dig some info
>>>>>>>>> out. To verify that it is a file descriptor leak, can you please run
>>>>>>>>> this test: on the server, run ps ax and get the PID of glusterfsd,
>>>>>>>>> then do an ls -l on /proc/<pid>/fd/ and mail the output. That should
>>>>>>>>> give a precise idea of what is happening.
>>>>>>>>> If the system has already been reset out of that state, please send
>>>>>>>>> us the spec file you are using and the commands you ran (for the
>>>>>>>>> major jobs, like the heavy rsync) so that we can try to reproduce the
>>>>>>>>> error in our setup.
>>>>>>>>>
>>>>>>>>> regards,
>>>>>>>>> avati
>>>>>>>>>
>>>>>>>>>
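(The check avati describes, condensed into shell commands; a sketch only,
assuming a single glusterfsd process per server:)

  PID=$(pgrep -x glusterfsd)      # or: ps ax | grep '[g]lusterfsd'
  ls -l /proc/$PID/fd | wc -l     # a count that keeps growing points to a leak
  ls -l /proc/$PID/fd             # full listing to mail back
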
>>>>>>>>> On Wed, Apr 04, 2007 at 01:12:33PM -0400, Brent A Nelson wrote:
>>>>>>>>>> I put a 2-node GlusterFS mirror into use internally yesterday, as
>>>>>>>>>> GlusterFS was looking pretty solid, and I rsynced a whole bunch of
>>>>>>>>>> stuff to it.  Today, however, an ls on any of the three clients
>>>>>>>>>> gives me:
>>>>>>>>>>
>>>>>>>>>> ls: /backup: Too many open files
>>>>>>>>>>
>>>>>>>>>> It looks like glusterfsd hit a limit.  Is this a bug
>>>>>>>>>> (glusterfs/glusterfsd forgetting to close files; essentially, a file
>>>>>>>>>> descriptor leak), or do I just need to increase the limit somewhere?
>>>>>>>>>>
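(For completeness, this is how one might check the limit glusterfsd appears to
be hitting, assuming a Linux server; as the rest of the thread shows, the real
cause was a descriptor leak, so raising the limit would only postpone the
error:)

  ulimit -n                   # per-process fd limit of the shell that starts glusterfsd
  cat /proc/sys/fs/file-max   # system-wide ceiling on open files
  ulimit -n 65536             # raise the per-process limit before launching the daemon
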
>>>>>>>>>> Thanks,
>>>>>>>>>>
>>>>>>>>>> Brent
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
> -- 
> Shaw's Principle:
>        Build a system that even a fool can use,
>        and only a fool will want to use it.
>




