[Gluster-devel] Readdir d_off encoding

Jeff Darcy jdarcy at redhat.com
Mon Dec 22 17:04:03 UTC 2014


> > The situation gets even worse if the bit-stealing is done at other
> > levels than at the bricks, and I haven't seen any such proposals that
> > deal with issues such as needing to renumber when disks are added or
> > removed.  At scale, that's going to happen a lot.  The numbers get
> > worse again if we split bricks ourselves, and I haven't seen any
> > proposals to do things that we need to do any other way.  Also, the
> > failure mode with this approach - infinite looping in readdir,
> > possibly even in our own daemons - is pretty catastrophic.
> 
> Any recent Linux client at least should just fail in this case

Why would it just fail?  It's continuing to receive (what appear to be)
valid entries.  Is there code in the Linux NFS client to detect loops
or duplicates?

> and it
> shouldn't be hard to similarly fix any such daemons to detect loops and
> minimize the damage.  (Though there still may be clients you can't fix.)

We can certainly detect loops in our own daemons, at the cost of adding
yet another secondary fix for problems introduced by the primary one.  We
almost certainly cannot fix every client that our users might deploy.
That includes older Linux clients, BSD clients, Mac clients, Windows
clients, and who-knows-what more exotic beasties.
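
For our own daemons, the detection could look roughly like the sketch
below.  It is only an illustration, not existing GlusterFS code: the
read_batch callback, the (name, next_offset) pairs and ReaddirLoopError
are all hypothetical names, and it assumes a caller can afford to
remember every offset it has seen for the directory it is walking.

```python
class ReaddirLoopError(Exception):
    """Raised when a directory stream hands back an offset already seen."""


def read_all_entries(read_batch, start_offset=0):
    """Walk a directory, failing instead of spinning if offsets repeat.

    read_batch(offset) is a hypothetical callback that returns a list of
    (name, next_offset) pairs, or an empty list at end of directory.
    """
    seen_offsets = set()
    names = []
    offset = start_offset
    while True:
        batch = read_batch(offset)
        if not batch:
            return names
        for name, next_offset in batch:
            # Two entries sharing one offset is exactly the condition
            # that makes a naive reader resume at the same place forever.
            if next_offset in seen_offsets:
                raise ReaddirLoopError(
                    "offset %#x repeated at entry %r" % (next_offset, name))
            seen_offsets.add(next_offset)
            names.append(name)
            offset = next_offset
```

Note that this turns the loop into a plain readdir failure, and only in
code we control.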

> > By contrast, the failure mode for the map-caching approach - a simple
> > failure in readdir - is relatively benign.  Such failures are also
> > likely to be less common, even if we adopt the *unprecedented*
> > requirement that the cache be strictly space-limited.  If we relax that
> > requirement, the problem goes away entirely.
> 
> Note NFS clients normally expect to be able to survive server reboots,
> so a complete solution requires a persistent cache.

It's not ideal that an NFS server (GlusterFS client) crash would result
in an NFS client's readdir failing.  On the other hand, one might
reasonably expect such events to be very rare, and not to recur every
time somebody tries to access the same directory.  If I were a storage
administrator, I'd prefer that scenario to one in which clients (or
daemons) repeatedly spin out of control as long as the directory is
subject to an unpredictable condition (entries hashing to the same N
bits).
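
To make the map-caching failure mode concrete, here is a minimal
sketch.  Everything in it is an assumption made for illustration:
OffsetMap, StaleCookieError, the oldest-first eviction policy and the
in-memory (non-persistent) storage are not taken from any existing
patch.  The point is only that every cookie handed out is unique, so
the worst case is a failed lookup rather than two entries sharing an
offset.

```python
from collections import OrderedDict


class StaleCookieError(Exception):
    """The cookie is no longer in the map; the readdir fails cleanly."""


class OffsetMap:
    """Space-limited map from opaque cookies to (brick, brick d_off)."""

    def __init__(self, max_entries):
        self.max_entries = max_entries
        self._map = OrderedDict()   # cookie -> (brick_index, brick_d_off)
        self._next_cookie = 1       # cookies are never reused in-process

    def remember(self, brick_index, brick_d_off):
        """Hand out a fresh cookie for a brick-local offset."""
        cookie = self._next_cookie
        self._next_cookie += 1
        self._map[cookie] = (brick_index, brick_d_off)
        if len(self._map) > self.max_entries:
            self._map.popitem(last=False)   # evict the oldest mapping
        return cookie

    def resolve(self, cookie):
        """Translate a cookie back, or fail if it was evicted or lost."""
        try:
            return self._map[cookie]
        except KeyError:
            raise StaleCookieError(cookie)
```

A cookie that was evicted, or that predates a restart of the process
holding the map, raises StaleCookieError, which would surface to the
client as a failed readdir.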

> My worry is that the map-caching solution will be more complicated and
> also have some failures in odd corner cases.

Yes, it will add complexity.  It might have odd corner cases.  On the
other hand, the bit-stealing approach also adds complexity, and our users
are already suffering from failures in what can no longer be called
corner cases.  64 bits just isn't enough for both a sufficiently large
brick number and a sufficiently collision-resistant hash.  Even if we
could get d_off to expand to 128 bits, we wouldn't be able to rely on
that for years.  Therefore, even if we solve issues like brick
renumbering, we'll be stuck in this infinite loop having this same
conversation every year or so until we change our approach.  However
inconvenient or imperfect an alternative might be, it's our only way
forward.
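
The bit-budget problem can be sketched as follows.  This is a
simplified model, not the actual GlusterFS encoding: it assumes all 64
bits of d_off are usable, that the brick number goes in the high bits,
and that the brick's own offsets behave like uniformly random hashes.
The names encode_d_off and collision_probability are made up for the
example, and the probability uses the standard birthday approximation
1 - exp(-n(n-1) / 2^(k+1)) for n entries and k surviving hash bits.

```python
import math

D_OFF_BITS = 64  # assumed total width available for the combined value


def encode_d_off(brick_index, brick_hash, brick_bits):
    """Put the brick number in the top bits, truncating the hash to fit."""
    hash_bits = D_OFF_BITS - brick_bits
    if brick_index >= (1 << brick_bits):
        raise ValueError("brick index does not fit in %d bits" % brick_bits)
    truncated = brick_hash & ((1 << hash_bits) - 1)
    return (brick_index << hash_bits) | truncated


def collision_probability(num_entries, hash_bits):
    """Birthday estimate that any two entries share a truncated hash."""
    exponent = num_entries * (num_entries - 1) / 2.0 ** (hash_bits + 1)
    return -math.expm1(-exponent)


# Two distinct brick-local hashes that differ only in the stolen bits
# become the same d_off once 16 bits are given to the brick number.
a = encode_d_off(3, 0x0123456789ABCDEF, brick_bits=16)
b = encode_d_off(3, 0xFFFF456789ABCDEF, brick_bits=16)
assert a == b
```

Every bit given to the brick number comes out of the hash, so under
this model a larger brick count directly raises the collision estimate
for any given directory size.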

