[Gluster-devel] opendir/readdir helper

Fri Feb 1 14:37:29 UTC 2013

On 02/01/2013 08:27 AM, Jeff Darcy wrote:
> As we all know, directory-listing performance (or lack thereof) is a bit
> of a sore spot for many GlusterFS users, because it's one of the few
> places where FUSE really does make a difference.  It will probably
> always be a sore spot even with the readdirp changes that are already
> under way.  The next step would be to add a FUSE enhancement to "inject"
> directory entries before they're requested, but that's a lot of work for
> an uncertain outcome.  The FUSE haters among the kernel leadership would
> probably reject such changes without serious consideration, and even if
> I'm wrong about that it's likely to be a long time before they make it
> into the various distributions (not to mention non-Linux platform issues).
> 

I think the missing link at the moment is directory caching. Given the
existing fuse notify store/retrieve interfaces for file data, I think
it's reasonable to say a path to inclusion exists for this kind of data
injection, provided fuse grows a cache mechanism to actually store the
associated directory data.

The complexity and time to deliver such a feature are another question.
Not having looked into it in detail, I'd agree that it would probably be
a while until 1.) we had it and 2.) it was generally available to distros.

> So, let's think outside the box for a bit.  What about an LD_PRELOAD
> helper?  Believe me, I know all about the problems with LD_PRELOAD, but
> I still can't think of any reasonable use case that requires readdir to
> work across a fork (for example).  The basic idea is that the LD_PRELOAD
> would catch calls to opendir/readdir and match them against paths
> matching GlusterFS volumes.  If a match is found, then it would use
> libgfapi to serve the results, without any FUSE involvement and with
> massive prefetching goodness etc.  Without a match, the helper would
> naturally fall back to the system functions.
> 
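For concreteness, the interception half of that idea might look roughly
like the sketch below. It is only a sketch under stated assumptions: the
GFS_PRELOAD_MOUNT environment variable, path_on_volume() and the
gfs_candidate_opens counter are made-up names for illustration, the path
check is a plain prefix match (no realpath(), so relative paths and
symlinks are not handled), and the libgfapi call is left as a comment
rather than guessed at. Every call therefore still falls back to the
system opendir().

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE            /* needed for RTLD_NEXT on glibc */
#endif
#include <dirent.h>
#include <dlfcn.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical counter: how many opendir() calls matched the volume path. */
static unsigned long gfs_candidate_opens;

/* Return 1 if 'path' is 'mount' itself or lies beneath it, else 0.
 * Assumption: both are absolute paths that are already canonical. */
static int path_on_volume(const char *path, const char *mount)
{
    size_t n;

    if (path == NULL || mount == NULL || mount[0] != '/')
        return 0;
    n = strlen(mount);
    while (n > 1 && mount[n - 1] == '/')   /* ignore trailing slashes */
        n--;
    if (strncmp(path, mount, n) != 0)
        return 0;
    if (n == 1)                            /* mount is "/" */
        return 1;
    /* Require a component boundary so "/mnt/glusterfs" does not
     * match a mount of "/mnt/gluster". */
    return path[n] == '\0' || path[n] == '/';
}

/* Interposed opendir(): looked up ahead of libc's when preloaded. */
DIR *opendir(const char *name)
{
    static DIR *(*real_opendir)(const char *);

    if (real_opendir == NULL)
        real_opendir = (DIR *(*)(const char *))dlsym(RTLD_NEXT, "opendir");

    if (path_on_volume(name, getenv("GFS_PRELOAD_MOUNT"))) {
        /* A real helper would open the directory through libgfapi here
         * and hand back a handle that matching readdir()/closedir()
         * wrappers can recognize. This sketch only counts the match. */
        gfs_candidate_opens++;
    }

    if (real_opendir == NULL) {
        errno = ENOSYS;
        return NULL;
    }
    return real_opendir(name);             /* fall back to the system call path */
}
```

Built as a shared object (for example with cc -shared -fPIC, plus -ldl
on older glibc) and named in LD_PRELOAD, this would sit in front of
libc's opendir(). The harder part, which the sketch leaves out, is that
DIR is opaque, so readdir() and closedir() wrappers would need some kind
of handle table to tell helper-owned directories from system ones.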

One thing I noticed fairly recently is that fuse sends readdir requests
in single-page (4k) increments, regardless of the size of the caller's
buffer (an strace shows an ls using a 32k buffer). I have a hack lying
around that I haven't fully tested yet to use multi-page readdir
buffers. That said, I'm pretty sure I saw a reference to something
similar on the fuse list recently, where the response was that directory
caching in general is preferred to this kind of technique (so this hack
might not be acceptable, but perhaps the directory cache ball is rolling).

Given that, does the prefetching idea apply to a generic translator? It
seems like it could potentially serve a variety of growth paths:

- This wouldn't immediately reduce the cost of fuse, but it seems like
it could at least help in traditional graphs on its own.
- Based on testing the above, perhaps the LD_PRELOAD hack could be built
on top of libgfapi and said translator?
- If directory caching becomes a reality down the road, said translator
could grow support for injection if the associated mount supports it
(i.e., perhaps similar to how we do cache invalidation via md-cache
today) and hopefully make the preload hackery unnecessary.

Brian

> I suspect that this approach would make listings on very large single
> directories many times faster than would ever be possible with FUSE. For
> deeply nested directories we'd need to add some more complexity so that
> we're not going through the whole connection-establishment path
> (including authentication etc.) for each directory separately, but
> that's all pretty well understood pain for pretty obvious gain.
> 
> Any other thoughts?