[Gluster-devel] About file descriptor leak in glusterfsd daemon after network failure

Jaden Liang liangzijie at gmail.com
Mon Aug 25 09:07:33 UTC 2014


Hi Niels,

We have tested the patch for some days. It works well when the gluster
peer status changes to disconnected. However, if we restore the network
just before the peer status changes to disconnected, we find that
glusterfsd still opens a new fd and leaves the old one unreleased, even
after we stop the process holding the file.

Why does glusterfsd open a new fd instead of reusing the original
reopened fd? Does glusterfsd have any mechanism to reclaim such fds?



2014-08-20 21:54 GMT+08:00 Niels de Vos <ndevos at redhat.com>:

> On Wed, Aug 20, 2014 at 07:16:16PM +0800, Jaden Liang wrote:
> > Hi gluster-devel team,
> >
> > We are running a 2-replica volume on 2 servers. One of our service
> > daemons opens a file with flock() on the volume. We can see each
> > glusterfsd daemon open the replica file on its own server (visible
> > in /proc/<pid>/fd). When we pull the cable of one server for about
> > 10 minutes and then re-plug it, we find that glusterfsd opens a NEW
> > file descriptor while still holding the old one, which was opened
> > on the first file access.
> >
> > Then we stopped our service daemon, but glusterfsd (on the
> > re-plugged server) only closed the new fd and left the old fd open,
> > which we think may be an fd leak. When we restarted our service
> > daemon, it flock()ed the same file and got a failure with errno
> > EAGAIN (Resource temporarily unavailable).
> >
> > However, this situation does not reproduce every time, though it
> > comes up often. We are still looking into the glusterfsd source
> > code, but it is not an easy job, so we would like some help here.
> > Our questions are:
> >
> > 1. Has this issue been solved? Or is it a known issue?
> > 2. Does anyone know the file-descriptor maintenance logic in
> > glusterfsd (server side)? When is an fd closed, and when is it held?
>
> I think you are hitting bug 1129787:
> - https://bugzilla.redhat.com/show_bug.cgi?id=1129787
>    file locks are not released within an acceptable time when
>    a fuse-client uncleanly disconnects
>
> There has been a (short) discussion about this earlier, see
> http://supercolony.gluster.org/pipermail/gluster-devel/2014-May/040748.html
>
> Updating the proposed change is on my TODO list. In the end, the
> network.ping-timeout option should be used to define both the timeout
> towards storage servers (as it is now) and the timeout from the
> storage server to the GlusterFS client.
>
> You can try out the patch at http://review.gluster.org/8065 and see if
> the network.tcp-timeout option works for you. Just remember that the
> option will get folded into network.ping-timeout later on. If you are
> interested in sending an updated patch, let me know :)
>
> Cheers,
> Niels
>



-- 
Best regards,
Jaden Liang

