[Gluster-devel] regressions due to 64-bit ext4 directory cookies

J. Bruce Fields bfields at fieldses.org
Thu Feb 14 21:46:38 UTC 2013

On Wed, Feb 13, 2013 at 06:44:30PM -0500, Theodore Ts'o wrote:
> On Wed, Feb 13, 2013 at 06:05:11PM -0500, J. Bruce Fields wrote:
> > 
> > Would it be possible to make something work like, for example, a 31-bit
> > hash plus an offset into a hash bucket?
> > 
> > I have trouble thinking about this, partly because I can't remember
> > where to find the requirements for readdir on concurrently modified
> > directories....
> The requirements are that for a directory entry which has not been
> modified since the last opendir() or rewinddir(), readdir() must return
> that directory entry exactly once.
> For a directory entry which has been added or removed since the last
> opendir() or rewinddir() call, it is undefined whether the directory
> entry is returned once or not at all.  And a rename is defined as an
> add/remove, so it's OK for the old filename and the new file name to
> appear in the readdir() stream; it would also be OK if neither
> appeared in the readdir() stream.

That's what I couldn't remember, thanks!
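
On the hash-plus-offset question quoted above, here is a minimal sketch
of what such a cookie could look like.  The thread only says "a 31-bit
hash plus an offset into a hash bucket"; the 32-bit offset width, the
bit placement, and the names cookie_pack(), cookie_hash() and
cookie_offset() are assumptions made for illustration, not anything
ext4 or Gluster actually does.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical layout (an assumption, not ext4's real format):
 *   bit  63      unused, always 0
 *   bits 62..32  31-bit name hash
 *   bits 31..0   position of the entry within its hash bucket
 */
#define COOKIE_HASH_BITS   31
#define COOKIE_OFFSET_BITS 32
#define COOKIE_HASH_MASK   ((UINT64_C(1) << COOKIE_HASH_BITS) - 1)
#define COOKIE_OFFSET_MASK ((UINT64_C(1) << COOKIE_OFFSET_BITS) - 1)

/* Combine a 31-bit hash and a bucket offset into one cookie. */
static uint64_t cookie_pack(uint32_t hash31, uint32_t bucket_offset)
{
    assert(hash31 <= COOKIE_HASH_MASK);
    return ((uint64_t)hash31 << COOKIE_OFFSET_BITS) | bucket_offset;
}

/* Recover the 31-bit hash from a cookie. */
static uint32_t cookie_hash(uint64_t cookie)
{
    return (uint32_t)((cookie >> COOKIE_OFFSET_BITS) & COOKIE_HASH_MASK);
}

/* Recover the offset within the hash bucket from a cookie. */
static uint32_t cookie_offset(uint64_t cookie)
{
    return (uint32_t)(cookie & COOKIE_OFFSET_MASK);
}
```

With this layout two entries whose names collide on the hash still get
distinct cookies, and cookies sort by hash first and bucket offset
second.  It says nothing about what happens to the offsets when entries
are added to or removed from a bucket between calls.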


> The SUSv3 definition of readdir() can be found here:
>    http://pubs.opengroup.org/onlinepubs/009695399/functions/readdir.html
> Note also that if you look at the SUSv3 definition of seekdir(), it
> explicitly states that the value returned by telldir() is not
> guaranteed to be valid after a rewinddir() or across another opendir():
>    If the value of loc was not obtained from an earlier call to
>    telldir(), or if a call to rewinddir() occurred between the call to
>    telldir() and the call to seekdir(), the results of subsequent
>    calls to readdir() are unspecified.
> Hence, it would be legal, and arguably more correct, if we created an
> internal array of pointers into the directory structure, where the
> first call to telldir() returned 1, and the second call to telldir()
> returned 2, and the third call to telldir() returned 3, regardless of
> the position in the directory, and this number was used by seekdir()
> to index into the array of pointers to return the exact location in
> the b-tree.  This would completely eliminate the possibility of hash
> collisions, and guarantee that readdir() would never drop or return a
> directory entry multiple times after seekdir().
> This implementation approach would have denial-of-service potential,
> since each call to telldir() could allocate kernel memory, but as
> long as we make sure the OOM killer kills the nasty process which is
> calling telldir() a lot, this would probably be OK.
> It would also be legal to throw away this array after a call to
> rewinddir() or closedir(), since telldir() cookies are not guaranteed
> to be valid indefinitely.  See:
>    http://pubs.opengroup.org/onlinepubs/009695399/functions/seekdir.html
> I suspect this would seriously screw over Gluster, though, and this
> wouldn't be a solution for NFSv3, since NFS needs long-lived directory
> cookies, and not the short-lived cookies which are all POSIX/SUSv3 guarantees.
> Regards,
> 					- Ted
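
The telldir() array described above can be sketched in userspace C as
follows.  This is only an illustration under stated assumptions:
struct dir_pos is a made-up stand-in for "a pointer into the directory
structure", and table_telldir(), table_seekdir() and table_reset() are
hypothetical names, not kernel or libc interfaces.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for an exact location in the b-tree. */
struct dir_pos {
    uint64_t block;
    uint32_t slot;
};

/* Per-open-directory table of positions handed out by telldir(). */
struct cookie_table {
    struct dir_pos *slots;
    size_t used;
    size_t cap;
};

/* Record pos and return cookie 1, 2, 3, ...; returns 0 if out of memory. */
static long table_telldir(struct cookie_table *t, struct dir_pos pos)
{
    assert(t != NULL);
    if (t->used == t->cap) {
        size_t ncap = t->cap ? t->cap * 2 : 8;
        struct dir_pos *n = realloc(t->slots, ncap * sizeof(*n));
        if (n == NULL)
            return 0;
        t->slots = n;
        t->cap = ncap;
    }
    t->slots[t->used++] = pos;
    return (long)t->used;
}

/* Look up a cookie; returns 0 and fills *out, or -1 if it was never issued. */
static int table_seekdir(const struct cookie_table *t, long cookie,
                         struct dir_pos *out)
{
    assert(t != NULL && out != NULL);
    if (cookie < 1 || (size_t)cookie > t->used)
        return -1;
    *out = t->slots[cookie - 1];
    return 0;
}

/* Discard every cookie, as would be allowed on rewinddir() or closedir(). */
static void table_reset(struct cookie_table *t)
{
    assert(t != NULL);
    free(t->slots);
    t->slots = NULL;
    t->used = 0;
    t->cap = 0;
}
```

Each table_telldir() call grows the table, which is the memory-usage
concern raised above, and the cookies are meaningless after
table_reset(), which is why this scheme cannot supply the long-lived
cookies NFS needs.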
