[Gluster-devel] Readdir d_off encoding

J. Bruce Fields bfields at fieldses.org
Mon Dec 22 16:03:12 UTC 2014


On Mon, Dec 22, 2014 at 09:30:29AM -0500, Jeff Darcy wrote:
> > The birthday paradox says that with a 44-bit hash we're more likely than
> > not to start seeing collisions somewhere around 2^22 directory entries.
> > That 16-million-entry directory would have a lot of collisions.
> 
> This is really the key point.  The risks of the bit-stealing approach
> have been understated, and the costs of the map-caching approach
> overstated.  DFS deployments on the order of 20K disks are no longer
> remarkable, and those numbers are only going to increase.  If each disk
> is a brick, which is the most common approach, we'll need *at least* 16
> bits ourselves.  That leaves 48 bits, and a high probability of
> collision at 2^24 or 16M files.  Is a 16M-file directory a good idea?
> Of course not.  Do they exist in the wild?  Definitely yes.

Note at that point the bug's still latent--we don't see it till a client
sends a READDIR with the problematic cookie.  But that may not make a
big difference under some assumptions.[1]
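
To make the arithmetic above concrete, here's a toy version of such an
encoding (my own sketch, not Gluster's actual code; the names and the
16/48 split are just the numbers from Jeff's example): steal the top 16
bits of the 64-bit d_off for a brick index and leave 48 bits for the
per-brick offset or hash.  The birthday bound then puts likely
collisions at around sqrt(2^48) = 2^24 entries.

#include <stdint.h>

#define BRICK_BITS   16
#define OFFSET_BITS  (64 - BRICK_BITS)                  /* 48 bits left */
#define OFFSET_MASK  ((UINT64_C(1) << OFFSET_BITS) - 1)

/* Pack a brick index and a per-brick offset into one 64-bit d_off.
 * Any native offset wider than 48 bits has to be hashed or truncated,
 * which is where the collision risk comes from. */
static uint64_t doff_encode(unsigned brick, uint64_t native_off)
{
    return ((uint64_t)brick << OFFSET_BITS) | (native_off & OFFSET_MASK);
}

static unsigned doff_brick(uint64_t d_off)
{
    return (unsigned)(d_off >> OFFSET_BITS);
}

static uint64_t doff_offset(uint64_t d_off)
{
    return d_off & OFFSET_MASK;
}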

> The situation gets even worse if the bit-stealing is done at other
> levels than at the bricks, and I haven't seen any such proposals that
> deal with issues such as needing to renumber when disks are added or
> removed.  At scale, that's going to happen a lot.  The numbers get
> worse again if we split bricks ourselves, and I haven't seen any
> proposals to do things that we need to do any other way.  Also, the
> failure mode with this approach - infinite looping in readdir,
> possibly even in our own daemons - is pretty catastrophic.

At least any recent Linux client should just fail in this case, and it
shouldn't be hard to fix any such daemons to detect loops and minimize
the damage in the same way.  (Though there may still be clients you
can't fix.)
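
For what it's worth, the loop detection I have in mind is cheap.  A
minimal sketch (a standalone readdir consumer, nothing to do with
Gluster's actual daemons; all names made up): remember every d_off
we've been handed and bail out the moment one repeats, so a duplicate
cookie turns into a clean error instead of an endless readdir.

/* Build on Linux with: cc -Wall loopcheck.c -o loopcheck */
#define _DEFAULT_SOURCE             /* for d_off in struct dirent (glibc) */
#include <dirent.h>
#include <stdio.h>
#include <stdlib.h>

#define TABLE_BITS 20                       /* 1M slots; a sketch only */
#define TABLE_SIZE (1u << TABLE_BITS)

static unsigned long long seen[TABLE_SIZE]; /* 0 means "empty slot" */
static size_t filled;

/* Return 1 if off was already recorded, 0 if it was just inserted.
 * (A d_off of 0 is treated as "empty" and never detected; fine for a
 * sketch.  A real daemon would grow the table instead of giving up.) */
static int seen_before(unsigned long long off)
{
    size_t i = (size_t)((off * 0x9E3779B97F4A7C15ULL) >> (64 - TABLE_BITS));

    if (filled >= TABLE_SIZE - 1) {
        fprintf(stderr, "offset table full, giving up\n");
        exit(1);
    }
    while (seen[i] != 0) {
        if (seen[i] == off)
            return 1;                       /* duplicate cookie: a loop */
        i = (i + 1) & (TABLE_SIZE - 1);     /* linear probing */
    }
    seen[i] = off;
    filled++;
    return 0;
}

int main(int argc, char **argv)
{
    DIR *dir = opendir(argc > 1 ? argv[1] : ".");
    struct dirent *de;
    unsigned long n = 0;

    if (!dir) {
        perror("opendir");
        return 1;
    }
    while ((de = readdir(dir)) != NULL) {
        if (seen_before((unsigned long long)de->d_off)) {
            fprintf(stderr, "duplicate d_off %lld after %lu entries\n",
                    (long long)de->d_off, n);
            closedir(dir);
            return 1;
        }
        n++;
    }
    printf("%lu entries, no duplicate offsets\n", n);
    closedir(dir);
    return 0;
}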

> By contrast, the failure mode for the map-caching approach - a simple
> failure in readdir - is relatively benign.  Such failures are also
> likely to be less common, even if we adopt the *unprecedented*
> requirement that the cache be strictly space-limited.  If we relax that
> requirement, the problem goes away entirely.

Note NFS clients normally expect to be able to survive server reboots,
so a complete solution requires a persistent cache.
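
For concreteness, here's roughly how I picture the map-caching side; I
may well be misunderstanding the proposal, and all the names below are
mine.  The server hands out small sequential cookies and keeps a
bounded per-directory table mapping each cookie back to (brick, native
d_off).  A cookie that has been evicted (or lost across a reboot, if
the table isn't made persistent) makes the resume fail cleanly instead
of looping.

#include <stdint.h>
#include <string.h>

#define COOKIE_CACHE_SIZE 4096     /* bounded, per open directory */

struct cookie_entry {
    uint64_t cookie;               /* value the client sees in d_off */
    int      brick;                /* which brick the entry came from */
    uint64_t native_off;           /* that brick's real d_off */
    int      in_use;
};

struct cookie_map {
    struct cookie_entry slot[COOKIE_CACHE_SIZE];
    uint64_t            next_cookie;
};

static void cookie_map_init(struct cookie_map *m)
{
    /* A persistent version would also write this table out somewhere
     * it survives a reboot; the sketch skips that entirely. */
    memset(m, 0, sizeof(*m));
}

/* Remember a (brick, native offset) pair and return the compact cookie
 * handed to the client.  Sequential cookies mean the slot reuse below
 * evicts the entry issued COOKIE_CACHE_SIZE cookies ago. */
static uint64_t cookie_map_add(struct cookie_map *m, int brick,
                               uint64_t native_off)
{
    uint64_t cookie = ++m->next_cookie;     /* 0 stays "start of dir" */
    struct cookie_entry *e = &m->slot[cookie % COOKIE_CACHE_SIZE];

    e->cookie = cookie;
    e->brick = brick;
    e->native_off = native_off;
    e->in_use = 1;
    return cookie;
}

/* Resolve a cookie sent back in a readdir resume.  Returns 0 on
 * success, -1 if the mapping has been evicted (fail the readdir). */
static int cookie_map_lookup(const struct cookie_map *m, uint64_t cookie,
                             int *brick, uint64_t *native_off)
{
    const struct cookie_entry *e = &m->slot[cookie % COOKIE_CACHE_SIZE];

    if (!e->in_use || e->cookie != cookie)
        return -1;
    *brick = e->brick;
    *native_off = e->native_off;
    return 0;
}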

> The number of concurrent
> readdirs is orders of magnitude less than the number of files per
> directory.  We should take advantage of that.  Also, we don't have
> problems with renumbering etc.
> 
> The bit-stealing approach seemed clever until the first round of
> failures.  After that first round it seemed less clever.  After the
> second it seems unwise.  After a third it will seem irresponsible.  That
> wording might seem harsh, but anyone who has actually had to stand in
> front of users and explain why this was ever a problem is likely to have
> heard worse.  Some users are reporting these problems *right now*.  Do
> we have any volunteers to ask them whether they'd like us to keep
> pursuing an approach that rests on shaky assumptions and has already
> failed twice?

My worry is that the map-caching solution will be more complicated and
also have some failures in odd corner cases.

But I don't understand the proposal well enough to trust my judgement
here.

--b.

[1] The chance of actually hitting a duplicate cookie depends on client
and application behavior; I think it could do anything from:

	- eliminating the bug entirely (if the directory is only used for
	  lookups, not readdirs; perhaps not far-fetched if reading a
	  16M-entry directory is painfully slow), to

	- decreasing the probability by a couple orders of magnitude (if
	  we assume clients all read the directory in identically-sized
	  chunks of a few hundred entries, so that only the cookies at
	  chunk boundaries are ever sent back to the server; rough
	  numbers after this list), to

	- having no significant effect (if we assume that more variation
	  in client behavior and/or directory modifications ensure that,
	  given enough directory reads, somebody will eventually hit the
	  cookie).
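
Some back-of-the-envelope numbers for those cases, under my own rough
model (n entries hashed into a space of d values, clients resuming only
on chunk-boundary cookies, chunks of k entries):

#include <stdio.h>

int main(void)
{
    double n = (double)(1ULL << 24);   /* 16M directory entries */
    double d = (double)(1ULL << 48);   /* 48-bit cookie space */
    double k = 500;                    /* entries per readdir chunk */

    /* Expected number of colliding d_off pairs in the directory. */
    printf("colliding pairs:  %.2f\n", n * (n - 1) / (2 * d));

    /* If only ~n/k cookies are ever sent back, the expected number of
     * resumes that land on a colliding cookie drops by a factor of k. */
    printf("bad resumes:      %.4f\n", (n / k) * n / d);
    return 0;
}

So under that model the identically-sized-chunk assumption really is
worth a couple orders of magnitude; the third case is what's left once
you stop assuming it.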

