[Gluster-devel] How to fix wrong telldir/seekdir usage

Sun Sep 14 04:50:34 UTC 2014

On 09/14/2014 12:32 AM, Emmanuel Dreyfus wrote:
> In <1lrx1si.n8tms1igmi5pM%manu at netbsd.org> I explained why NetBSD
> currently fails self-heald.t, but since the subjet is burried deep in a
> thread, it might be worth starting a new one to talk about how to fix.
>
> In 3 places within glusterfs code (features/index,
> features/snapview-server and storage/posix), a server component answers
> readdir requests on a directory which may be split in mulitple calls.
>
> To answer one call, we have the following library calls:
> - opendir()
> - seekdir() to resume where the previous request was
> - readdir()
> - telldir() to record where we are for the next request
> - closedir()
>
> This relies on unspecified behavior, as POSIX says: "The value of loc
> should have been returned from an earlier call to telldir() using the
> same directory stream."
> http://pubs.opengroup.org/onlinepubs/9699919799/functions/seekdir.html
>
> Since we do opendir() and closedir() at each time, we do not use the
> same directory stream. It causes an infinite loop on NetBSD because it
> badly resume from previous request, and in the general case it will
> break badly if an entry is added in the directory between two requests.
>
> How can we fix that?
>
> 1) we can keep the directory stream open. The change is intrusive since
> we will need a chained list of open contexts, and we need to clean them
> if they timeout.
>
> 2) in order to keep state between requests, we can use the entry index
> (first encoutered is 1, and so on) instead of values returned by
> telldir(). That works around the unspecified behavior, but it still
> breaks if directory content is changed between two requests
>
> 3) make sure the readdir is done in a single request. That means trying
> with bigger buffers until it works. For instance  in
> xlator/cluster/afr/src/afr-self-heald.c we have:
>     while ((ret = syncop_readdir (subvol, fd, 131072, offset, &entries)))
>
> We would use -1 instead of 131072 to tell that we want everything
> without a size limit, and the server component (here features/index)
> would either return everyting or fail, whithout playing with
> telldir/seekdir.
>
> Opinions? The third solution seems the best to me since it is not very
> intrusive and it makes things simplier. Indeed we allow unbound data
> size to come back from the brick to glustershd, but we trust the brick,
> right?
I saw cases where the number of entries in the index directory was close 
to 100000. If brick process tries to read everything it will be OOM 
killed by the kernel.

Just to make sure I understand the problem, the issue is happening 
because self-heal-daemon uses anonymous fds to perform readdirs? i.e. 
there is no explicit opendir on the directory. Everytime there is a 
readdir it may lead to opendir/seekdir/readdir/closedir. Did I get that 
right? I believe posix xlator doesn't have this problem for 
non-anonymous fds where the DIR* stream is open till the final unref on 
the fd.

Pranith
>