[Gluster-devel] Readdir d_off encoding

Mon Dec 22 14:30:29 UTC 2014

> The birthday paradox says that with a 44-bit hash we're more likely than
> not to start seeing collisions somewhere around 2^22 directory entries.
> That 16-million-entry-directory would have a lot of collisions.

This is really the key point.  The risks of the bit-stealing approach
have been understated, and the costs of the map-caching approach
overstated.  DFS deployments on the order of 20K disks are no longer
remarkable, and those numbers are only going to increase.  If each disk
is a brick, which is the most common approach, we'll need *at least* 16
bits ourselves.  That leaves 48 bits, and by the birthday bound roughly
a 40% chance of at least one collision at 2^24 or 16M files.  Is a
16M-file directory a good idea?
Of course not.  Do they exist in the wild?  Definitely yes.  The
situation gets even worse if the bit-stealing is done at other levels
than at the bricks, and I haven't seen any such proposals that deal with
issues such as needing to renumber when disks are added or removed.  At
scale, that's going to happen a lot.  The numbers get worse again if we
split bricks ourselves, and I haven't seen any proposal that does what
we need without such splitting.  Also, the failure mode with this
approach - infinite looping in readdir, possibly even in our own daemons
- is pretty catastrophic.
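
For anyone who wants to check the arithmetic, here is a rough Python
sketch of the birthday estimate.  It assumes the hash values are uniform
and independent, and that 16 of the 64 d_off bits are reserved for the
brick index as described above.  The function names are made up for
illustration and are not part of any Gluster code.

```python
import math

D_OFF_BITS = 64      # width of d_off
BRICK_BITS = 16      # assumption: bits reserved to identify the brick


def collision_probability(n_entries: int, hash_bits: int) -> float:
    """Approximate chance of at least one collision among n_entries
    uniformly random hash values of hash_bits bits each, using the
    standard birthday approximation 1 - exp(-n(n-1) / 2N)."""
    space = 2.0 ** hash_bits
    return -math.expm1(-n_entries * (n_entries - 1) / (2.0 * space))


def entries_for_probability(p: float, hash_bits: int) -> float:
    """Approximate directory size at which the collision chance reaches p
    (inverse of the approximation above, treating n(n-1) as n^2)."""
    space = 2.0 ** hash_bits
    return math.sqrt(2.0 * space * math.log(1.0 / (1.0 - p)))


if __name__ == "__main__":
    hash_bits = D_OFF_BITS - BRICK_BITS          # 48 bits left for the hash
    for exp in (20, 22, 24, 26):
        p = collision_probability(2 ** exp, hash_bits)
        print(f"2^{exp} entries, {hash_bits}-bit hash: P(collision) ~ {p:.4f}")
    n_half = entries_for_probability(0.5, hash_bits)
    print(f"even odds at about {n_half:,.0f} entries")
```

Every additional bit taken from the hash halves the hash space, which
lowers the directory size at which collisions become likely by a factor
of about 1.4 (the square root of 2).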

By contrast, the failure mode for the map-caching approach - a simple
failure in readdir - is relatively benign.  Such failures are also
likely to be less common, even if we adopt the *unprecedented*
requirement that the cache be strictly space-limited.  If we relax that
requirement, the problem goes away entirely.  The number of concurrent
readdirs is orders of magnitude less than the number of files per
directory.  We should take advantage of that.  Also, we don't have
problems with renumbering etc.
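
To make the comparison concrete, here is a minimal Python sketch of what
a strictly space-limited offset map could look like.  This is only one
reading of the map-caching idea, not an actual or proposed Gluster
implementation; the class, its methods, and the LRU eviction policy are
all assumptions made for illustration.

```python
from collections import OrderedDict
from itertools import count


class OffsetMap:
    """Hypothetical bounded map from an opaque token handed to the client
    to the real (brick index, brick-local d_off) pair."""

    def __init__(self, max_entries: int):
        self._max = max_entries
        self._entries = OrderedDict()   # token -> (brick, brick_off)
        self._next_token = count(1)     # tokens are never reused

    def remember(self, brick: int, brick_off: int) -> int:
        """Store a position and return the token to give to the client."""
        token = next(self._next_token)
        self._entries[token] = (brick, brick_off)
        if len(self._entries) > self._max:
            # Strict space limit: evict the least recently used mapping.
            self._entries.popitem(last=False)
        return token

    def resolve(self, token: int):
        """Translate a token back; fail cleanly if it was evicted."""
        try:
            position = self._entries[token]
        except KeyError:
            raise LookupError("stale readdir offset") from None
        self._entries.move_to_end(token)   # mark as recently used
        return position
```

In this sketch an evicted token produces a plain error on the next
readdir rather than a wrong position, and the map needs to grow with the
number of outstanding readdir positions, not with the number of files in
the directory.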

The bit-stealing approach seemed clever until the first round of
failures.  After that first round it seemed less clever.  After the
second it seems unwise.  After a third it will seem irresponsible.  That
wording might seem harsh, but anyone who has actually had to stand in
front of users and explain why this was ever a problem is likely to have
heard worse.  Some users are reporting these problems *right now*.  Do
we have any volunteers to ask them whether they'd like us to keep
pursuing an approach that rests on shaky assumptions and has already
failed twice?