[Gluster-devel] Readdir d_off encoding

Tue Dec 23 10:20:31 UTC 2014

On 12/22/2014 06:41 PM, Jeff Darcy wrote:
>> An alternative would be to convert directories into regular files from
>> the brick point of view.
>>
>> The benefits of this would be:
>>
>> * d_off would be controlled by gluster, so all bricks would have the
>> same d_off and order. No need to use any d_off mapping or transformation.
>
> I don't think a full-out change from real directories to virtual ones is
> in the cards, but a variant of this idea might be worth exploring further.
> If we had a *server side* component to map between on-disk d_off values
> and those we present to clients, then it might be able to do a better job
> than the local FS of ensuring uniqueness within the bits (e.g. 48 of them)
> that are left over after we subtract some for a brick ID.  This could be
> enough to make the bit-stealing approach (on the client) viable.  There
> are probably some issues with failing over between replicas, which should
> have the same files but might not have assigned the same internal d_off
> values, but those issues might be avoidable if the d_off values are
> deterministic with respect to GFIDs.

Having a server-side xlator seems a better approximation, however I see 
some problems that need to be solved:

The mapper should work on the fly (i.e. it should do the mapping between 
the local d_off to the client d_off without having full knowledge of the 
directory contents). This is a good approach for really big directories 
because doesn't require to waste large amounts of memory, but it will be 
hard to find a way to avoid duplicates, specially if are limited to ~48 
bits.

Making it based on the GFID would be a good way to have common d_off 
between bricks, however maintaining order will be harder. It will also 
be hard to guarantee uniqueness if mapping is deterministic and 
directory is very big. Otherwise it would need to read full directory 
contents before returning mapped d_off's.

To minimize the collision problem, we need to solve the ordering 
problem. If we can guarantee that all bricks return directory entries in 
the same order and d_off, we don't need to reserve some bits in d_off.

I think the virtual directories solution should be the one to consider 
for 4.0. For earlier versions we can try to find an intermediate solution.

Following your idea of a server side component, could this be useful ?

* Keep all directories and its entries in a double linked list stored in 
xattr of each inode.

* Use this linked list to build the readdir answer.

* Use the first 64 (or 63) bits of gfid as the d_off.

* There will be two special offsets: 0 for '.' and 1 for '..'

Example (using shorter gfid's for simplicity):

Directory root with gfid 0001
Directory 'test1' inside root with gfid 1111
Directory 'test2' inside root with gfid 2222
Entry 'entry1' inside 'test1' with gfid 3333
Entry 'entry2' inside 'test1' with gfid 4444
Entry 'entry3' inside 'test2' with gfid 4444
Entry 'entry4' inside 'test2' with gfid 5555
Entry 'entry5' inside 'test2' with gfid 6666

/ (0001)
   test1/ (1111)
     entry1 (3333)
     entry2 (4444)
   test2/ (2222)
     entry3 (4444)
     entry4 (5555)
     entry5 (6666)

Note that entry2 and entry3 are hardlinks.

xattrs of root (0001):
    trusted.dirmap.0001.next = 1111
    trusted.dirmap.0001.prev = 2222

xattrs of 'test1' (1111):
    trusted.dirmap.0001.next = 2222
    trusted.dirmap.0001.prev = 0001
    trusted.dirmap.1111.next = 3333
    trusted.dirmap.1111.prev = 4444

xattrs of 'test2' (2222):
    trusted.dirmap.0001.next = 0001
    trusted.dirmap.0001.prev = 1111
    trusted.dirmap.2222.next = 4444
    trusted.dirmap.2222.prev = 6666

xattrs of 'entry1' (3333):
    trusted.dirmap.1111.next = 4444
    trusted.dirmap.1111.prev = 1111

xattrs of 'entry2'/'entry3' (4444):
    trusted.dirmap.1111.next = 1111
    trusted.dirmap.1111.prev = 3333
    trusted.dirmap.2222.next = 5555
    trusted.dirmap.2222.prev = 2222

xattrs of 'entry4' (5555):
    trusted.dirmap.2222.next = 6666
    trusted.dirmap.2222.prev = 4444

xattrs of 'entry5' (6666):
    trusted.dirmap.2222.next = 2222
    trusted.dirmap.2222.prev = 5555

It's easy to enumerate all entries from the beginning of a directory. 
Also, since we return extra information from each inode in a directory, 
accessing these new xattrs doesn't represent a big impact.

Given a random d_off, it's relatively easy to find a gfid that starts 
with d_off and belongs to the directory (using .glusterfs/xx/yy/...). It 
could also be easy to implement the "nearest" of d_off.

This mechanism guarantees the same order and d_off on all bricks 
(assuming that directory modification operations are done with some 
locks held), and all management is made locally to each brick, being 
transparent to the clients. It also has a very low probability of 
collisions (we can use 63 or 64 bits for d_off instead of 48) and does 
not require any transformation in dht/afr/ec.

However some special operation would need to be implemented to allow 
healing procedures to correctly add a directory entry in the correct 
place to maintain order between bricks when things are not ok.

It adds some complexity to ensure integrity, but since this job is done 
locally on each brick, it seems easier to maintain.

Xavi