[Gluster-devel] Readdir d_off encoding

Anand Avati avati at gluster.org
Tue Dec 23 18:34:41 UTC 2014


Using GFID does not work for d_off. The GFID represents an inode, and a
d_off represents a directory entry. Therefore using GFID as an alternative
to d_off breaks down when you have hardlinks to the same inode in a single
directory.
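
As a minimal illustration (names and layout hypothetical, not gluster
code):

#include <stdint.h>

/* Two directory entries that are hardlinks to the same inode: they
 * share a single GFID, yet readdir must hand out a distinct d_off for
 * each name, so the GFID alone cannot serve as the cursor. */
struct demo_dirent {
    const char *name;   /* name within the directory           */
    const char *gfid;   /* inode identity: same for both links */
    uint64_t    d_off;  /* readdir cursor: must be unique      */
};

static const struct demo_dirent entries[] = {
    { "a",      "4444", 1 },  /* first link to the inode           */
    { "a.link", "4444", 2 },  /* second link, same gfid, new d_off */
};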

On Tue Dec 23 2014 at 2:20:34 AM Xavier Hernandez <xhernandez at datalab.es>
wrote:

> On 12/22/2014 06:41 PM, Jeff Darcy wrote:
> >> An alternative would be to convert directories into regular files from
> >> the brick point of view.
> >>
> >> The benefits of this would be:
> >>
> >> * d_off would be controlled by gluster, so all bricks would have the
> >> same d_off and order. No need to use any d_off mapping or
> transformation.
> >
> > I don't think a full-out change from real directories to virtual ones is
> > in the cards, but a variant of this idea might be worth exploring
> further.
> > If we had a *server side* component to map between on-disk d_off values
> > and those we present to clients, then it might be able to do a better job
> > than the local FS of ensuring uniqueness within the bits (e.g. 48 of
> them)
> > that are left over after we subtract some for a brick ID.  This could be
> > enough to make the bit-stealing approach (on the client) viable.  There
> > are probably some issues with failing over between replicas, which should
> > have the same files but might not have assigned the same internal d_off
> > values, but those issues might be avoidable if the d_off values are
> > deterministic with respect to GFIDs.
>
> Having a server-side xlator seems a better approach, however I see
> some problems that need to be solved:
>
> The mapper should work on the fly (i.e. it should do the mapping between
> the local d_off and the client d_off without having full knowledge of the
> directory contents). This is a good approach for really big directories
> because it doesn't require wasting large amounts of memory, but it will
> be hard to find a way to avoid duplicates, especially if we are limited
> to ~48 bits.
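>
> As a point of reference, a bit-stealing transform of this kind could
> look roughly like the sketch below (the constants are illustrative,
> not what any existing xlator uses):
>
> #include <stdint.h>
>
> /* Hypothetical split of the 64-bit d_off: the top bits carry a brick
>  * id, the low bits carry the local d_off. Uniqueness breaks as soon
>  * as the local FS hands out offsets wider than LOCAL_BITS. */
> #define BRICK_BITS 16
> #define LOCAL_BITS (64 - BRICK_BITS)
> #define LOCAL_MASK ((1ULL << LOCAL_BITS) - 1)
>
> static uint64_t doff_encode(uint64_t local_off, uint64_t brick_id)
> {
>     return (brick_id << LOCAL_BITS) | (local_off & LOCAL_MASK);
> }
>
> static void doff_decode(uint64_t d_off, uint64_t *local_off,
>                         uint64_t *brick_id)
> {
>     *brick_id  = d_off >> LOCAL_BITS;
>     *local_off = d_off & LOCAL_MASK;
> }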
>
> Making it based on the GFID would be a good way to have a common d_off
> between bricks, however maintaining order will be harder. It will also
> be hard to guarantee uniqueness if the mapping is deterministic and the
> directory is very big. Otherwise it would need to read the full
> directory contents before returning mapped d_off's.
>
> To minimize the collision problem, we need to solve the ordering
> problem. If we can guarantee that all bricks return directory entries
> in the same order and with the same d_off, we don't need to reserve
> any bits in d_off.
>
> I think the virtual directories solution should be the one to consider
> for 4.0. For earlier versions we can try to find an intermediate solution.
>
> Following your idea of a server-side component, could this be useful?
>
> * Keep each directory and its entries in a doubly linked list stored
> in the xattrs of each inode.
>
> * Use this linked list to build the readdir answer.
>
> * Use the first 64 (or 63) bits of the gfid as the d_off (see the
> sketch after this list).
>
> * There will be two special offsets: 0 for '.' and 1 for '..'
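>
> A rough sketch of the gfid-to-d_off conversion under this scheme
> (assuming the canonical 16-byte gfid; the function name is made up):
>
> #include <stdint.h>
>
> #define DOFF_DOT    0ULL   /* reserved for '.'  */
> #define DOFF_DOTDOT 1ULL   /* reserved for '..' */
>
> /* Take the first 8 bytes of the gfid as the d_off, big-endian so the
>  * byte order matches the .glusterfs/xx/yy path layout. A gfid prefix
>  * colliding with the two reserved offsets would need remapping. */
> static uint64_t gfid_to_doff(const unsigned char gfid[16])
> {
>     uint64_t d_off = 0;
>     for (int i = 0; i < 8; i++)
>         d_off = (d_off << 8) | gfid[i];
>     return d_off;
> }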
>
> Example (using shorter gfid's for simplicity):
>
> Directory root with gfid 0001
> Directory 'test1' inside root with gfid 1111
> Directory 'test2' inside root with gfid 2222
> Entry 'entry1' inside 'test1' with gfid 3333
> Entry 'entry2' inside 'test1' with gfid 4444
> Entry 'entry3' inside 'test2' with gfid 4444
> Entry 'entry4' inside 'test2' with gfid 5555
> Entry 'entry5' inside 'test2' with gfid 6666
>
> / (0001)
>    test1/ (1111)
>      entry1 (3333)
>      entry2 (4444)
>    test2/ (2222)
>      entry3 (4444)
>      entry4 (5555)
>      entry5 (6666)
>
> Note that entry2 and entry3 are hardlinks.
>
> xattrs of root (0001):
>     trusted.dirmap.0001.next = 1111
>     trusted.dirmap.0001.prev = 2222
>
> xattrs of 'test1' (1111):
>     trusted.dirmap.0001.next = 2222
>     trusted.dirmap.0001.prev = 0001
>     trusted.dirmap.1111.next = 3333
>     trusted.dirmap.1111.prev = 4444
>
> xattrs of 'test2' (2222):
>     trusted.dirmap.0001.next = 0001
>     trusted.dirmap.0001.prev = 1111
>     trusted.dirmap.2222.next = 4444
>     trusted.dirmap.2222.prev = 6666
>
> xattrs of 'entry1' (3333):
>     trusted.dirmap.1111.next = 4444
>     trusted.dirmap.1111.prev = 1111
>
> xattrs of 'entry2'/'entry3' (4444):
>     trusted.dirmap.1111.next = 1111
>     trusted.dirmap.1111.prev = 3333
>     trusted.dirmap.2222.next = 5555
>     trusted.dirmap.2222.prev = 2222
>
> xattrs of 'entry4' (5555):
>     trusted.dirmap.2222.next = 6666
>     trusted.dirmap.2222.prev = 4444
>
> xattrs of 'entry5' (6666):
>     trusted.dirmap.2222.next = 2222
>     trusted.dirmap.2222.prev = 5555
>
> It's easy to enumerate all entries from the beginning of a directory.
> Also, since we already return extra information for each inode in a
> directory, accessing these new xattrs doesn't represent a big impact.
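>
> In rough C, the enumeration loop could look like this (the two
> helpers are stand-ins, not existing APIs):
>
> #include <stdio.h>
> #include <string.h>
>
> #define GFID_STRLEN 37   /* canonical uuid string + NUL */
>
> /* Stand-ins: resolve a gfid through .glusterfs/ and read the named
>  * xattr from it; emit one directory entry into the reply. */
> extern int  getxattr_on_gfid(const char *gfid, const char *key,
>                              char *value, size_t len);
> extern void emit_dirent(const char *gfid);
>
> /* Follow trusted.dirmap.<dir-gfid>.next from entry to entry until
>  * the list wraps back to the directory's own gfid. */
> static void readdir_walk(const char *dir_gfid)
> {
>     char key[64];
>     char cur[GFID_STRLEN];
>
>     snprintf(key, sizeof(key), "trusted.dirmap.%s.next", dir_gfid);
>     getxattr_on_gfid(dir_gfid, key, cur, sizeof(cur));
>
>     while (strcmp(cur, dir_gfid) != 0) {
>         emit_dirent(cur);
>         getxattr_on_gfid(cur, key, cur, sizeof(cur));
>     }
> }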
>
> Given a random d_off, it's relatively easy to find the gfid that starts
> with that d_off and belongs to the directory (using .glusterfs/xx/yy/...).
> It would also be easy to implement a "nearest d_off" lookup.
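>
> Locating the gfid for a given d_off could then be sketched like this
> (the dash-skipping comparison helper is made up):
>
> #include <dirent.h>
> #include <inttypes.h>
> #include <limits.h>
> #include <stdint.h>
> #include <stdio.h>
>
> /* Stand-in: compare a uuid string ("xxxxxxxx-xxxx-...") against 16
>  * hex digits, skipping the dashes inside the uuid. */
> extern int uuid_has_prefix(const char *uuid_str, const char *hex);
>
> /* The d_off is the first 8 bytes of the gfid: its first two bytes
>  * select the .glusterfs/xx/yy/ bucket, which we then scan for a gfid
>  * matching the whole 64-bit prefix. */
> static int doff_to_gfid(const char *brick, uint64_t d_off,
>                         char *gfid_str, size_t len)
> {
>     char bucket[PATH_MAX], prefix[17];
>     DIR *dp;
>     struct dirent *de;
>     int ret = -1;
>
>     snprintf(prefix, sizeof(prefix), "%016" PRIx64, d_off);
>     snprintf(bucket, sizeof(bucket), "%s/.glusterfs/%.2s/%.2s",
>              brick, prefix, prefix + 2);
>
>     dp = opendir(bucket);
>     if (dp == NULL)
>         return -1;
>
>     while ((de = readdir(dp)) != NULL) {
>         if (uuid_has_prefix(de->d_name, prefix)) {
>             snprintf(gfid_str, len, "%s", de->d_name);
>             ret = 0;
>             break;
>         }
>     }
>     closedir(dp);
>     return ret;
> }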
>
> This mechanism guarantees the same order and d_off on all bricks
> (assuming that directory modification operations are done with some
> locks held), and all management is done locally on each brick,
> transparently to the clients. It also has a very low probability of
> collisions (we can use 63 or 64 bits for d_off instead of 48) and does
> not require any transformation in dht/afr/ec.
>
> However, a special operation would need to be implemented so that
> healing procedures can insert a directory entry at the correct place,
> maintaining the same order between bricks when things are not ok.
>
> It adds some complexity to ensure integrity, but since this job is done
> locally on each brick, it seems easier to maintain.
>
> Xavi
>