[Gluster-devel] Readdir d_off encoding
J. Bruce Fields
bfields at fieldses.org
Mon Dec 22 19:04:37 UTC 2014
On Mon, Dec 22, 2014 at 12:04:03PM -0500, Jeff Darcy wrote:
> > > The situation gets even worse if the bit-stealing is done at other
> > > levels than at the bricks, and I haven't seen any such proposals that
> > > deal with issues such as needing to renumber when disks are added or
> > > removed. At scale, that's going to happen a lot. The numbers get
> > > worse again if we split bricks ourselves, and I haven't seen any
> > > proposals to do things that we need to do any other way. Also, the
> > > failure mode with this approach - infinite looping in readdir,
> > > possibly even in our own daemons - is pretty catastrophic.
> >
> > Any recent Linux client at least should just fail in this case
>
> Why would it just fail? It's continuing to receive (what appear to be)
> valid entries. Is there code in the Linux NFS client to detect loops
> or duplicates?
Yes, exactly.
> > and it
> > shouldn't be hard to similarly fix any such daemons to detect loops and
> > minimize the damage. (Though there still may be clients you can't fix.)
>
> We can certainly detect loops in our own daemons, at the cost of adding
> yet another secondary fix for problems introduced by the primary one. We
> can almost as certainly not fix all clients that our users might deploy.
> That includes older Linux clients, BSD clients, Mac clients, Windows
> clients, and who-knows-what more exotic beasties.
Agreed. Well, I haven't actually tested any clients, and I'd consider
the failure to handle a loop a (mild) client bug, but I wouldn't be
surprised if it's a common bug.
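For what it's worth, the loop detection being discussed can be sketched roughly like this. This is a hypothetical readdir consumer, not the actual Linux NFS client code; read_batch and the cookie scheme are made up for illustration:

```python
# Hypothetical sketch (not the Linux NFS client's actual logic): a readdir
# consumer that detects a cookie loop instead of spinning forever.

def read_directory(read_batch):
    """read_batch(cookie) -> (entries, next_cookie), where entries is a
    list of (name, d_off) pairs and next_cookie of None means EOF."""
    seen_cookies = set()
    names = []
    cookie = 0
    while cookie is not None:
        if cookie in seen_cookies:
            # A repeated cookie means the server is looping; fail the
            # readdir rather than returning duplicate entries forever.
            raise IOError("readdir loop detected at cookie %d" % cookie)
        seen_cookies.add(cookie)
        entries, cookie = read_batch(cookie)
        names.extend(name for name, _ in entries)
    return names
```

The cost is a set of seen cookies per open directory, which is why an unfixable client that lacks this check is the worrying case.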
> > > By contrast, the failure mode for the map-caching approach - a simple
> > > failure in readdir - is relatively benign. Such failures are also
> > > likely to be less common, even if we adopt the *unprecedented*
> > > requirement that the cache be strictly space-limited. If we relax that
> > > requirement, the problem goes away entirely.
> >
> > Note NFS clients normally expect to be able to survive server reboots,
> > so a complete solution requires a persistent cache.
>
> It's not ideal that an NFS server (GlusterFS client) crash would result
> in an NFS client's readdir failing. On the other hand, one might
> reasonably expect such events to be very rare, and not to recur every
> time somebody tries to access the same directory. If I were a storage
> administrator, I'd prefer that scenario to one in which clients (or
> daemons) repeatedly spin out of control as long as the directory is
> subject to an unpredictable condition (entries hashing to the same N
> bits).
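A rough sketch of what the map-caching approach might look like, assuming the strictly space-limited cache described above. The class name, sizes, and the wholesale eviction policy are all illustrative assumptions, not GlusterFS code:

```python
# Hypothetical sketch of the map-caching idea: hand clients small
# sequential cookies and remember which real (brick, d_off) pair each one
# stands for.  On a miss, readdir simply fails -- it cannot loop.

class DirOffsetCache:
    def __init__(self, max_entries=1024):
        self.max_entries = max_entries
        self.by_cookie = {}        # small cookie -> (brick, real_d_off)
        self.next_cookie = 1       # 0 reserved for "start of directory"

    def encode(self, brick, real_d_off):
        """Return a compact cookie for a real per-brick offset."""
        if len(self.by_cookie) >= self.max_entries:
            self.by_cookie.clear()  # space-limited: crude full eviction
        cookie = self.next_cookie
        self.next_cookie += 1
        self.by_cookie[cookie] = (brick, real_d_off)
        return cookie

    def decode(self, cookie):
        """Map a cookie back, or fail this readdir outright."""
        try:
            return self.by_cookie[cookie]
        except KeyError:
            raise IOError("stale readdir cookie %d" % cookie)
```

The benign failure mode is visible in decode: an evicted (or post-crash) cookie produces a plain readdir error, not duplicate or looping entries.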
>
> > My worry is that the map-caching solution will be more complicated and
> > also have some failures in odd corner cases.
>
> Yes, it will add complexity. It might have odd corner cases. On the
> other hand, the bit-stealing approach also adds complexity and our users
> are already suffering from failures in what can no longer be called
> corner cases. 64 bits just isn't enough for both a sufficiently large
> brick number and a sufficiently collision-resistant hash. Even if we
> could get d_off to expand to 128 bits, we wouldn't be able to rely on
> that for years. Therefore, even if we solve issues like brick
> renumbering, we'll be stuck in this infinite loop having this same
> conversation every year or so until we change our approach. However
> inconvenient or imperfect an alternative might be, it's our only way
> forward.
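The bit budget being argued over can be made concrete with a sketch like this. BRICK_BITS and the exact layout are illustrative assumptions, not the actual GlusterFS encoding:

```python
# Hypothetical sketch of the bit-stealing scheme: the top BRICK_BITS of a
# 64-bit d_off name the brick, the rest carry a truncated per-brick
# offset/hash.  Constants are illustrative only.
import math

BRICK_BITS = 10                      # up to 1024 bricks
HASH_BITS = 63 - BRICK_BITS          # keep d_off positive for NFS cookies

def encode_d_off(brick, local_off):
    """Pack (brick, truncated local offset) into one 64-bit d_off."""
    return (brick << HASH_BITS) | (local_off & ((1 << HASH_BITS) - 1))

def decode_d_off(d_off):
    return d_off >> HASH_BITS, d_off & ((1 << HASH_BITS) - 1)

def collision_odds(entries, bits=HASH_BITS):
    """Birthday-bound estimate of at least one truncated-offset collision
    within a brick -- the event that makes readdir loop or skip."""
    return 1.0 - math.exp(-entries * (entries - 1) / 2.0 / (1 << bits))
```

With these illustrative numbers, a directory of 16M entries on one brick already has collision odds above one percent, and every brick bit added doubles them, which is the "not enough bits for both" trade-off above.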
Maybe. Could we get a sketch of the design with a good description of
the failure cases?
It'd also be nice to see any proposals for a completely correct
solution, even if it's something that will take a while. All I can
think of is protocol extensions, but that's just what I know.
I don't love the bit-stealing hack either, but keep in mind that in
practice this all seems to be about ext4. If you want reliable nfs
readdir with 16M-entry directories and all the rest, you can already get
that with xfs.
--b.