[Gluster-devel] Readdir d_off encoding

Wed Jan 7 21:16:50 UTC 2015

On Mon, Dec 22, 2014 at 02:04:37PM -0500, J. Bruce Fields wrote:
> It'd also be nice to see any proposals for a completely correct
> solution, even if it's something that will take a while.  All I can
> think of is protocol extensions, but that's just what I know.

I thought a little about this over the holidays: say we could scrap NFS
and start from scratch, what would we do?

- larger NFS readdir cookies: if NFS cookies were 128 bits, then gluster
  could stick the filesystem's offset in the lower 64 bits and its own
  data in the upper 64 bits.

  This stops working as soon as anyone else does the same thing,
  though: if we change to 128 bits here then people may eventually want
  to do the same thing to filesystem and system call interfaces too, and
  then we're back at square one.  If people want to be able to stack
  arbitrary readdir implementations then we can't really choose a fixed
  size limit any more.

- stateful readdir: make clients open the directory, read through it
  from start to finish, then close it.  That's all clients really want
  to do anyway--they don't need to seek back to offsets returned
  arbitrarily long ago.  However, they do need to be able to resend the
  last readdir request in case the reply was lost, and they do need to
  be able to resume reading a directory after a server reboot.

  So I think that would still leave gluster needing to keep a
  (persistent, on-disk) cache mapping the NFS cookies it hands out to
  the offsets in the backend directories.  The difference is just that
  it would only have to cache the small number of entries that are in
  use by current readdirs in progress instead of potentially having to
  keep them all forever.  I don't know, does that help much?
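
To make the two ideas above concrete, here is a rough sketch in C.  It
is not gluster or NFS code: every name in it (cookie128, cookie_table,
dir_id, subvol, MAX_ACTIVE) is made up for illustration.  The first part
shows the 128-bit split, with the backend offset in the low half and the
stacking layer's data in the high half.  The second part shows a small
table holding only the cookies handed out to readdirs still in progress.
The table is in-memory only, so it does not model the persistent,
on-disk part that surviving a server reboot would need.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Idea 1: hypothetical 128-bit cookie.  The backend filesystem's d_off
 * is passed through in the low 64 bits; the stacking layer keeps its
 * own data (assumed here to be something like a subvolume index) in
 * the high 64 bits. */
struct cookie128 {
    uint64_t hi;  /* layer-private data */
    uint64_t lo;  /* backend d_off, unmodified */
};

static struct cookie128 cookie128_pack(uint64_t layer_data, uint64_t backend_off)
{
    struct cookie128 c = { .hi = layer_data, .lo = backend_off };
    return c;
}

/* Idea 2: table of cookies for readdirs currently in progress.
 * MAX_ACTIVE is an arbitrary illustrative limit. */
#define MAX_ACTIVE 64

struct cookie_entry {
    bool     in_use;
    uint64_t dir_id;       /* open directory this cookie belongs to */
    uint64_t nfs_cookie;   /* opaque value handed to the client */
    uint32_t subvol;       /* which backend directory to resume in */
    uint64_t backend_off;  /* d_off within that backend directory */
};

struct cookie_table {
    struct cookie_entry slot[MAX_ACTIVE];
    uint64_t next_cookie;  /* 0 is reserved to mean "start of directory" */
};

static void table_init(struct cookie_table *t)
{
    *t = (struct cookie_table){ .next_cookie = 1 };
}

/* Record a mapping and return the new cookie, or 0 if the table is full. */
static uint64_t table_issue(struct cookie_table *t, uint64_t dir_id,
                            uint32_t subvol, uint64_t backend_off)
{
    for (size_t i = 0; i < MAX_ACTIVE; i++) {
        struct cookie_entry *e = &t->slot[i];
        if (!e->in_use) {
            e->in_use = true;
            e->dir_id = dir_id;
            e->nfs_cookie = t->next_cookie++;
            e->subvol = subvol;
            e->backend_off = backend_off;
            return e->nfs_cookie;
        }
    }
    return 0;
}

/* Translate a cookie back to its backend position.  A resent request
 * finds the same entry again, since entries live until the close. */
static bool table_lookup(const struct cookie_table *t, uint64_t cookie,
                         uint32_t *subvol, uint64_t *backend_off)
{
    for (size_t i = 0; i < MAX_ACTIVE; i++) {
        const struct cookie_entry *e = &t->slot[i];
        if (e->in_use && e->nfs_cookie == cookie) {
            *subvol = e->subvol;
            *backend_off = e->backend_off;
            return true;
        }
    }
    return false;
}

/* On close, forget every cookie issued for that directory. */
static void table_close_dir(struct cookie_table *t, uint64_t dir_id)
{
    for (size_t i = 0; i < MAX_ACTIVE; i++) {
        if (t->slot[i].in_use && t->slot[i].dir_id == dir_id)
            t->slot[i].in_use = false;
    }
}
```

The point of the second half is only that the table is bounded by the
number of readdirs in flight: table_close_dir() is what lets entries be
dropped, which a stateless protocol never tells the server it can do.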

--b.