[Gluster-devel] regressions due to 64-bit ext4 directory cookies

Wed Feb 13 15:36:54 UTC 2013

On Wed, Feb 13, 2013 at 10:19:53AM -0500, J. Bruce Fields wrote:
> > > (In more detail: they're spreading a single directory across multiple
> > > nodes, and encoding a node ID into the cookie they return, so they can
> > > tell which node the cookie came from when they get it back.)
> > > 
> > > That works if you assume the cookie is an "offset" bounded above by some
> > > measure of the directory size, hence unlikely to ever use the high
> > > bits....
> > 
> > Right, but why wouldn't an nfs export option solve the problem for
> > gluster?
> 
> No, gluster is running on ext4 directly.

OK, so let me see if I can get this straight.  Each local gluster node
is running a userspace NFS server, right?  Because if it were running
a kernel-side NFS server, it would be sufficient to use an nfs export
option.

A client which mounts a "gluster file system" is also doing this via
NFSv3, right?  Or are they using their own protocol?  If they are
using their own protocol, why can't they encode the node ID somewhere
else?

So is this a correct picture of what is going on:

                                                  /------ GFS Storage
                                                 /        Server #1
  GFS Cluster     NFS V3      GFS Cluster      -- NFS v3
  Client        <--------->   Frontend Server  ---------- GFS Storage
                                               --         Server #2
                                                 \
                                                  \------ GFS Storage
                                                          Server #3

And the reason it needs to use the high bits is that the Frontend
Server has to coalesce the results from each GFS Storage Server
before handing them to the GFS Cluster client?
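
To make sure we are talking about the same scheme, here is a rough
sketch of what I understand the cookie packing to be.  This is not
Gluster's actual code: the 8-bit node field, the constant names, and
the encode_cookie/decode_cookie helpers are all made up for
illustration.  The only point is that the packing is lossless only
while the backend cookie leaves the top bits free.

```python
# Hypothetical layout: top NODE_BITS of the 64-bit cookie carry the node ID,
# the remaining low bits carry the cookie returned by that node's file system.
COOKIE_BITS = 64
NODE_BITS = 8                          # assumed width, for illustration only
OFFSET_BITS = COOKIE_BITS - NODE_BITS
OFFSET_MASK = (1 << OFFSET_BITS) - 1


def encode_cookie(node_id, backend_cookie):
    """Pack a node ID and a backend readdir cookie into one 64-bit value."""
    if not 0 <= node_id < (1 << NODE_BITS):
        raise ValueError("node ID does not fit in the node field")
    # If the backend already uses the high bits, there is no room left
    # for the node ID and the packing cannot be done without losing data.
    if backend_cookie < 0 or backend_cookie >> OFFSET_BITS:
        raise OverflowError("backend cookie uses the high bits")
    return (node_id << OFFSET_BITS) | backend_cookie


def decode_cookie(cookie):
    """Split a packed cookie back into (node_id, backend_cookie)."""
    return cookie >> OFFSET_BITS, cookie & OFFSET_MASK


# A small, offset-like cookie round-trips without trouble.
small = encode_cookie(2, 12345)
print(decode_cookie(small))

# A cookie that spans all 64 bits leaves no room for the node ID.
try:
    encode_cookie(2, 0xFEDCBA9876543210)
except OverflowError as err:
    print("cannot encode:", err)
```

With small, offset-like cookies the round trip works; once the backend
hands out cookies that span all 64 bits, the frontend has nowhere left
to put the node ID.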

The other thing that I'd note is that the readdir cookie has been
64-bit since NFSv3, which was released in June ***1995***.  And the
explicit, stated purpose of making it be a 64-bit value (as stated in
RFC 1813) was to reduce interoperability problems.  If that were the
case, are you telling me that Sun (who has traditionally been pretty
good about worrying about interoperability concerns, and in fact
employed the editors of RFC 1813) didn't get this right?  This seems
quite.... surprising to me.

I thought this was the whole point of the various NFS interoperability
testing done at Connectathon, for which Sun was a major sponsor?!?  No
one noticed?!?

	     		      	    	- Ted