[Gluster-devel] Readdir d_off encoding

Mon Dec 22 19:04:37 UTC 2014

On Mon, Dec 22, 2014 at 12:04:03PM -0500, Jeff Darcy wrote:
> > > The situation gets even worse if the bit-stealing is done at other
> > > levels than at the bricks, and I haven't seen any such proposals that
> > > deal with issues such as needing to renumber when disks are added or
> > > removed.  At scale, that's going to happen a lot.  The numbers get
> > > worse again if we split bricks ourselves, and I haven't seen any
> > > proposals to do things that we need to do any other way.  Also, the
> > > failure mode with this approach - infinite looping in readdir,
> > > possibly even in our own daemons - is pretty catastrophic.
> > 
> > Any recent Linux client at least should just fail in this case
> 
> Why would it just fail?  It's continuing to receive (what appear to be)
> valid entries.  Is there code in the Linux NFS client to detect loops
> or duplicates?

Yes, exactly.

> > and it
> > shouldn't be hard to similarly fix any such daemons to detect loops and
> > minimize the damage.  (Though there still may be clients you can't fix.)
> 
> We can certainly detect loops in our own daemons, at the cost of adding
> yet another secondary fix for problems introduced by the primary one.  We
> can almost as certainly not fix all clients that our users might deploy.
> That includes older Linux clients, BSD clients, Mac clients, Windows
> clients, and who-knows-what more exotic beasties.

Agreed.  Well, I haven't actually tested any clients, and I'd consider
the failure to handle a loop a (mild) client bug, but I wouldn't be
surprised if it's a common bug.
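
For illustration only, here is a minimal sketch of client-side loop detection. It is not the Linux NFS client's actual code. The `readdir` callable, the `Entry` tuple and `ReaddirLoopError` are hypothetical names, and the sketch assumes that a cookie seen twice within one directory walk means the server is looping:

```python
from typing import Callable, Iterator, List, Tuple

# Hypothetical entry shape: (name, d_off cookie used to resume after it).
Entry = Tuple[str, int]


class ReaddirLoopError(Exception):
    """Raised when the same d_off cookie shows up twice in one walk."""


def read_all(readdir: Callable[[int], List[Entry]]) -> Iterator[str]:
    """Walk a directory, failing instead of spinning on a repeated cookie.

    `readdir(cookie)` stands in for whatever call returns the next batch
    of entries after `cookie`; an empty batch means end of directory.
    """
    seen = set()
    cookie = 0  # assumed "start of directory" cookie
    while True:
        batch = readdir(cookie)
        if not batch:
            return
        for name, d_off in batch:
            if d_off in seen:
                # Resuming from this cookie would replay earlier entries.
                raise ReaddirLoopError(f"cookie {d_off:#x} repeated at {name!r}")
            seen.add(d_off)
            yield name
            cookie = d_off
```

The cost is one remembered cookie per entry for the duration of the walk, which is the kind of secondary fix being objected to above.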

> > > By contrast, the failure mode for the map-caching approach - a simple
> > > failure in readdir - is relatively benign.  Such failures are also
> > > likely to be less common, even if we adopt the *unprecedented*
> > > requirement that the cache be strictly space-limited.  If we relax that
> > > requirement, the problem goes away entirely.
> > 
> > Note NFS clients normally expect to be able to survive server reboots,
> > so a complete solution requires a persistent cache.
> 
> It's not ideal that an NFS server (GlusterFS client) crash would result
> in an NFS client's readdir failing.  On the other hand, one might
> reasonably expect such events to be very rare, and not to recur every
> time somebody tries to access the same directory.  If I were a storage
> administrator, I'd prefer that scenario to one in which clients (or
> daemons) repeatedly spin out of control as long as the directory is
> subject to an unpredictable condition (entries hashing to the same N
> bits).
> 
> > My worry is that the map-caching solution will be more complicated and
> > also have some failures in odd corner cases.
> 
> Yes, it will add complexity.  It might have odd corner cases.  On the
> other hand, the bit-stealing approach also adds complexity and our users
> are already suffering from failures in what can no longer be called
> corner cases.  64 bits just isn't enough for both a sufficiently large
> brick number and a sufficiently collision-resistant hash.  Even if we
> could get d_off to expand to 128 bits, we wouldn't be able to rely on
> that for years.  Therefore, even if we solve issues like brick
> renumbering, we'll be stuck in this infinite loop having this same
> conversation every year or so until we change our approach.  However
> inconvenient or imperfect an alternative might be, it's our only way
> forward.

Maybe.  Could we get a sketch of the design with a good description of
the failure cases?
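
As a rough illustration of the kind of map cache under discussion (an assumption-laden sketch, not the design being asked for), the server could hand out small opaque cookies and remember which (brick, backend d_off) pair each one stands for. `CookieMap`, `StaleCookieError` and the LRU eviction policy are all hypothetical choices here:

```python
import itertools
from collections import OrderedDict
from typing import Tuple


class StaleCookieError(LookupError):
    """The cookie was evicted (or never issued); the readdir must fail."""


class CookieMap:
    """Strictly space-limited map: opaque cookie -> (brick, backend d_off)."""

    def __init__(self, capacity: int) -> None:
        self._capacity = capacity
        self._entries: "OrderedDict[int, Tuple[int, int]]" = OrderedDict()
        self._next = itertools.count(1)  # 0 is left for "start of directory"

    def issue(self, brick: int, backend_off: int) -> int:
        """Return a fresh cookie for this location, evicting the oldest if full."""
        cookie = next(self._next)
        self._entries[cookie] = (brick, backend_off)
        if len(self._entries) > self._capacity:
            self._entries.popitem(last=False)  # drop least recently used
        return cookie

    def resolve(self, cookie: int) -> Tuple[int, int]:
        """Translate a cookie back, or fail cleanly if it is no longer known."""
        try:
            location = self._entries[cookie]
        except KeyError:
            raise StaleCookieError(f"unknown cookie {cookie}") from None
        self._entries.move_to_end(cookie)  # mark as recently used
        return location
```

In this sketch the full 64-bit backend offset survives untouched, and the failure case is a readdir error on an evicted cookie rather than a loop. The map lives only in memory, so it is lost on a server restart, which is the persistence problem noted earlier.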

It'd also be nice to see any proposals for a completely correct
solution, even if it's something that will take a while.  All I can
think of is protocol extensions, but that's just what I know.

I don't love the bit-stealing hack either, but keep in mind that in
practice this all seems to be about ext4.  If you want reliable NFS
readdir with 16M-entry directories and all the rest, you can already
get that with xfs.
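
To make the 64-bit squeeze concrete, here is a small sketch of one possible bit-stealing layout, with the brick number in the top bits and a truncated backend d_off below it. This layout is an assumption chosen for illustration, not necessarily what GlusterFS does, and the function names are hypothetical. The collision estimate is the usual birthday approximation and assumes the surviving hash bits are uniformly distributed:

```python
import math

D_OFF_BITS = 64  # width of the readdir cookie under discussion


def encode_d_off(brick: int, backend_off: int, brick_bits: int) -> int:
    """Pack a brick number and a truncated backend offset into 64 bits."""
    hash_bits = D_OFF_BITS - brick_bits
    if not 0 <= brick < (1 << brick_bits):
        raise ValueError("brick number does not fit in brick_bits")
    if not 0 <= backend_off < (1 << D_OFF_BITS):
        raise ValueError("backend offset is not a 64-bit value")
    # Lossy step: the low brick_bits of the backend offset are discarded.
    truncated = backend_off >> brick_bits
    return (brick << hash_bits) | truncated


def decode_d_off(d_off: int, brick_bits: int) -> tuple:
    """Recover (brick, approximate backend offset); the low bits come back as 0."""
    hash_bits = D_OFF_BITS - brick_bits
    brick = d_off >> hash_bits
    truncated = d_off & ((1 << hash_bits) - 1)
    return brick, truncated << brick_bits


def collision_probability(entries: int, hash_bits: int) -> float:
    """Birthday estimate that at least two of `entries` share a hash value."""
    return -math.expm1(-entries * (entries - 1) / 2.0 ** (hash_bits + 1))
```

Under those assumptions, stealing 8 bits for the brick leaves 56 for the hash, and `collision_probability(2**24, 56)` comes out at roughly 0.2% for a 16M-entry directory. Every additional stolen bit roughly doubles that figure.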

--b.