[Gluster-devel] Readdir d_off encoding
Xavier Hernandez
xhernandez at datalab.es
Tue Dec 23 10:20:31 UTC 2014
On 12/22/2014 06:41 PM, Jeff Darcy wrote:
>> An alternative would be to convert directories into regular files from
>> the brick point of view.
>>
>> The benefits of this would be:
>>
>> * d_off would be controlled by gluster, so all bricks would have the
>> same d_off and order. No need to use any d_off mapping or transformation.
>
> I don't think a full-out change from real directories to virtual ones is
> in the cards, but a variant of this idea might be worth exploring further.
> If we had a *server side* component to map between on-disk d_off values
> and those we present to clients, then it might be able to do a better job
> than the local FS of ensuring uniqueness within the bits (e.g. 48 of them)
> that are left over after we subtract some for a brick ID. This could be
> enough to make the bit-stealing approach (on the client) viable. There
> are probably some issues with failing over between replicas, which should
> have the same files but might not have assigned the same internal d_off
> values, but those issues might be avoidable if the d_off values are
> deterministic with respect to GFIDs.
Having a server-side xlator seems a better approximation, however I see
some problems that need to be solved:
The mapper should work on the fly (i.e. it should do the mapping between
the local d_off to the client d_off without having full knowledge of the
directory contents). This is a good approach for really big directories
because doesn't require to waste large amounts of memory, but it will be
hard to find a way to avoid duplicates, specially if are limited to ~48
bits.
Making it based on the GFID would be a good way to have common d_off
between bricks, however maintaining order will be harder. It will also
be hard to guarantee uniqueness if mapping is deterministic and
directory is very big. Otherwise it would need to read full directory
contents before returning mapped d_off's.
To minimize the collision problem, we need to solve the ordering
problem. If we can guarantee that all bricks return directory entries in
the same order and d_off, we don't need to reserve some bits in d_off.
I think the virtual directories solution should be the one to consider
for 4.0. For earlier versions we can try to find an intermediate solution.
Following your idea of a server side component, could this be useful ?
* Keep all directories and its entries in a double linked list stored in
xattr of each inode.
* Use this linked list to build the readdir answer.
* Use the first 64 (or 63) bits of gfid as the d_off.
* There will be two special offsets: 0 for '.' and 1 for '..'
Example (using shorter gfid's for simplicity):
Directory root with gfid 0001
Directory 'test1' inside root with gfid 1111
Directory 'test2' inside root with gfid 2222
Entry 'entry1' inside 'test1' with gfid 3333
Entry 'entry2' inside 'test1' with gfid 4444
Entry 'entry3' inside 'test2' with gfid 4444
Entry 'entry4' inside 'test2' with gfid 5555
Entry 'entry5' inside 'test2' with gfid 6666
/ (0001)
test1/ (1111)
entry1 (3333)
entry2 (4444)
test2/ (2222)
entry3 (4444)
entry4 (5555)
entry5 (6666)
Note that entry2 and entry3 are hardlinks.
xattrs of root (0001):
trusted.dirmap.0001.next = 1111
trusted.dirmap.0001.prev = 2222
xattrs of 'test1' (1111):
trusted.dirmap.0001.next = 2222
trusted.dirmap.0001.prev = 0001
trusted.dirmap.1111.next = 3333
trusted.dirmap.1111.prev = 4444
xattrs of 'test2' (2222):
trusted.dirmap.0001.next = 0001
trusted.dirmap.0001.prev = 1111
trusted.dirmap.2222.next = 4444
trusted.dirmap.2222.prev = 6666
xattrs of 'entry1' (3333):
trusted.dirmap.1111.next = 4444
trusted.dirmap.1111.prev = 1111
xattrs of 'entry2'/'entry3' (4444):
trusted.dirmap.1111.next = 1111
trusted.dirmap.1111.prev = 3333
trusted.dirmap.2222.next = 5555
trusted.dirmap.2222.prev = 2222
xattrs of 'entry4' (5555):
trusted.dirmap.2222.next = 6666
trusted.dirmap.2222.prev = 4444
xattrs of 'entry5' (6666):
trusted.dirmap.2222.next = 2222
trusted.dirmap.2222.prev = 5555
It's easy to enumerate all entries from the beginning of a directory.
Also, since we return extra information from each inode in a directory,
accessing these new xattrs doesn't represent a big impact.
Given a random d_off, it's relatively easy to find a gfid that starts
with d_off and belongs to the directory (using .glusterfs/xx/yy/...). It
could also be easy to implement the "nearest" of d_off.
This mechanism guarantees the same order and d_off on all bricks
(assuming that directory modification operations are done with some
locks held), and all management is made locally to each brick, being
transparent to the clients. It also has a very low probability of
collisions (we can use 63 or 64 bits for d_off instead of 48) and does
not require any transformation in dht/afr/ec.
However some special operation would need to be implemented to allow
healing procedures to correctly add a directory entry in the correct
place to maintain order between bricks when things are not ok.
It adds some complexity to ensure integrity, but since this job is done
locally on each brick, it seems easier to maintain.
Xavi
More information about the Gluster-devel
mailing list