[Gluster-devel] Readdir d_off encoding

Anand Avati avati at gluster.org
Tue Dec 23 18:36:11 UTC 2014


Please review http://review.gluster.org/9332/, as it undoes the
introduction of itransform on d_off in AFR. This does not solve
DHT-over-DHT or other future use cases, but at least fixes the regression
in 3.6.x.

Thanks

On Tue Dec 23 2014 at 10:34:41 AM Anand Avati <avati at gluster.org> wrote:

> Using GFID does not work for d_off. A GFID represents an inode, while a
> d_off represents a directory entry. Using the GFID as an alternative to
> d_off therefore breaks down when a single directory contains several
> hardlinks to the same inode.
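A minimal Python sketch of that collision (the GFIDs and the take-the-top-64-bits mapping here are hypothetical, just to illustrate the problem):

```python
import uuid

# Hypothetical directory listing as (name, gfid) pairs. 'a' and 'b'
# are hardlinks to the same inode, so they share a GFID.
shared_gfid = uuid.uuid4()
entries = [("a", shared_gfid), ("b", shared_gfid), ("c", uuid.uuid4())]

# Deriving d_off from the GFID alone collides for the hardlinks: a
# readdir resume token equal to that d_off cannot tell 'a' and 'b' apart.
d_offs = [gfid.int >> 64 for _, gfid in entries]
assert d_offs[0] == d_offs[1]           # hardlinks collide
assert len(set(d_offs)) < len(entries)  # d_off no longer unique per entry
```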
>
> On Tue Dec 23 2014 at 2:20:34 AM Xavier Hernandez <xhernandez at datalab.es>
> wrote:
>
>> On 12/22/2014 06:41 PM, Jeff Darcy wrote:
>> >> An alternative would be to convert directories into regular files from
>> >> the brick point of view.
>> >>
>> >> The benefits of this would be:
>> >>
>> >> * d_off would be controlled by gluster, so all bricks would have the
>> >> same d_off and order. No need to use any d_off mapping or
>> transformation.
>> >
>> > I don't think a full-out change from real directories to virtual ones is
>> > in the cards, but a variant of this idea might be worth exploring
>> further.
>> > If we had a *server side* component to map between on-disk d_off values
>> > and those we present to clients, then it might be able to do a better
>> job
>> > than the local FS of ensuring uniqueness within the bits (e.g. 48 of
>> them)
>> > that are left over after we subtract some for a brick ID.  This could be
>> > enough to make the bit-stealing approach (on the client) viable.  There
>> > are probably some issues with failing over between replicas, which
>> should
>> > have the same files but might not have assigned the same internal d_off
>> > values, but those issues might be avoidable if the d_off values are
>> > deterministic with respect to GFIDs.
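The bit-stealing approach can be sketched as follows; the 16/48-bit split and the function names are assumptions for illustration, not the actual DHT itransform code:

```python
BRICK_BITS = 16               # assumed: high 16 bits carry the brick id,
DOFF_BITS = 64 - BRICK_BITS   # leaving ~48 bits for the local d_off
DOFF_MASK = (1 << DOFF_BITS) - 1

def encode_doff(brick_id: int, local_doff: int) -> int:
    """Steal the high bits of the 64-bit d_off for a brick id. Only
    works if the server keeps its local d_offs within DOFF_BITS."""
    if local_doff > DOFF_MASK:
        raise ValueError("local d_off does not fit in %d bits" % DOFF_BITS)
    return (brick_id << DOFF_BITS) | local_doff

def decode_doff(global_doff: int) -> tuple:
    """Recover (brick_id, local_doff) from a client-visible d_off."""
    return global_doff >> DOFF_BITS, global_doff & DOFF_MASK
```

A server-side mapper as described above would be what guarantees the `local_doff > DOFF_MASK` case never actually happens.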
>>
>> Having a server-side xlator seems a better approximation, however I see
>> some problems that need to be solved:
>>
>> The mapper should work on the fly (i.e. it should map the local d_off
>> to the client d_off without having full knowledge of the directory
>> contents). This is a good approach for really big directories because
>> it doesn't require wasting large amounts of memory, but it will be
>> hard to avoid duplicates, especially if we are limited to ~48 bits.
>>
>> Basing it on the GFID would be a good way to get a common d_off across
>> bricks, but maintaining order will be harder. It will also be hard to
>> guarantee uniqueness if the mapping is deterministic and the directory
>> is very big; otherwise it would need to read the full directory
>> contents before returning mapped d_offs.
>>
>> To minimize the collision problem, we need to solve the ordering
>> problem. If we can guarantee that all bricks return directory entries
>> in the same order and with the same d_off, we don't need to reserve
>> bits in the d_off.
>>
>> I think the virtual directories solution should be the one to consider
>> for 4.0. For earlier versions we can try to find an intermediate solution.
>>
>> Following your idea of a server side component, could this be useful ?
>>
>> * Keep each directory and its entries in a doubly linked list, stored
>> in xattrs on each inode.
>>
>> * Use this linked list to build the readdir answer.
>>
>> * Use the first 64 (or 63) bits of gfid as the d_off.
>>
>> * There will be two special offsets: 0 for '.' and 1 for '..'
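The gfid-to-d_off mapping above could look roughly like this (a sketch; remapping the astronomically unlikely GFIDs whose top bits collide with the two reserved offsets is left out):

```python
import uuid

DOT, DOTDOT = 0, 1  # reserved d_off values for '.' and '..'

def gfid_to_doff(gfid: uuid.UUID) -> int:
    """Take the top 64 bits of the 128-bit GFID as the entry's d_off.
    A real implementation would still have to handle GFIDs whose top
    bits equal DOT or DOTDOT -- omitted here."""
    return gfid.int >> 64
```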
>>
>> Example (using shorter gfid's for simplicity):
>>
>> Directory root with gfid 0001
>> Directory 'test1' inside root with gfid 1111
>> Directory 'test2' inside root with gfid 2222
>> Entry 'entry1' inside 'test1' with gfid 3333
>> Entry 'entry2' inside 'test1' with gfid 4444
>> Entry 'entry3' inside 'test2' with gfid 4444
>> Entry 'entry4' inside 'test2' with gfid 5555
>> Entry 'entry5' inside 'test2' with gfid 6666
>>
>> / (0001)
>>    test1/ (1111)
>>      entry1 (3333)
>>      entry2 (4444)
>>    test2/ (2222)
>>      entry3 (4444)
>>      entry4 (5555)
>>      entry5 (6666)
>>
>> Note that entry2 and entry3 are hardlinks.
>>
>> xattrs of root (0001):
>>     trusted.dirmap.0001.next = 1111
>>     trusted.dirmap.0001.prev = 2222
>>
>> xattrs of 'test1' (1111):
>>     trusted.dirmap.0001.next = 2222
>>     trusted.dirmap.0001.prev = 0001
>>     trusted.dirmap.1111.next = 3333
>>     trusted.dirmap.1111.prev = 4444
>>
>> xattrs of 'test2' (2222):
>>     trusted.dirmap.0001.next = 0001
>>     trusted.dirmap.0001.prev = 1111
>>     trusted.dirmap.2222.next = 4444
>>     trusted.dirmap.2222.prev = 6666
>>
>> xattrs of 'entry1' (3333):
>>     trusted.dirmap.1111.next = 4444
>>     trusted.dirmap.1111.prev = 1111
>>
>> xattrs of 'entry2'/'entry3' (4444):
>>     trusted.dirmap.1111.next = 1111
>>     trusted.dirmap.1111.prev = 3333
>>     trusted.dirmap.2222.next = 5555
>>     trusted.dirmap.2222.prev = 2222
>>
>> xattrs of 'entry4' (5555):
>>     trusted.dirmap.2222.next = 6666
>>     trusted.dirmap.2222.prev = 4444
>>
>> xattrs of 'entry5' (6666):
>>     trusted.dirmap.2222.next = 2222
>>     trusted.dirmap.2222.prev = 5555
>>
>> It's easy to enumerate all entries from the beginning of a directory.
>> Also, since we already return extra information for each inode in a
>> directory listing, accessing these new xattrs doesn't add much
>> overhead.
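The enumeration can be sketched against the example above, modelling the per-directory `trusted.dirmap.<dir>.next` xattrs as an in-memory dict (gfid -> {dir_gfid: next_gfid}):

```python
# 'next' links copied from the xattr listing in the example.
nxt = {
    "0001": {"0001": "1111"},
    "1111": {"0001": "2222", "1111": "3333"},
    "2222": {"0001": "0001", "2222": "4444"},
    "3333": {"1111": "4444"},
    "4444": {"1111": "1111", "2222": "5555"},  # hardlink: in both dirs
    "5555": {"2222": "6666"},
    "6666": {"2222": "2222"},
}

def list_dir(dir_gfid: str) -> list:
    """Follow trusted.dirmap.<dir>.next links, starting at the head
    stored on the directory itself, until we loop back to it."""
    out, cur = [], nxt[dir_gfid][dir_gfid]
    while cur != dir_gfid:
        out.append(cur)
        cur = nxt[cur][dir_gfid]
    return out
```

Walking `list_dir("2222")` yields the gfids 4444, 5555, 6666, matching the example's 'test2' contents; the hardlinked inode 4444 appears once per directory that links it, which is exactly what readdir needs.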
>>
>> Given a random d_off, it's relatively easy to find a gfid that starts
>> with that d_off and belongs to the directory (using
>> .glusterfs/xx/yy/...). It would also be easy to implement a "nearest
>> d_off" lookup.
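A sketch of that reverse lookup (the function name is hypothetical; only the top 64 bits of the GFID are recoverable from the d_off, so the result is a name prefix to search for under the two-level .glusterfs hierarchy):

```python
def doff_to_glusterfs_prefix(d_off: int) -> str:
    """Map a 64-bit d_off to the .glusterfs/ location of candidate GFIDs.
    Gluster stores each inode as .glusterfs/<byte0>/<byte1>/<full-gfid>,
    so the first 16 bits of the d_off select the two directory levels and
    the remaining 48 give the leading part of the dashed UUID filename."""
    h = "%016x" % d_off
    name = "%s-%s-%s" % (h[0:8], h[8:12], h[12:16])  # 8-4-4 of the UUID
    return ".glusterfs/%s/%s/%s" % (h[0:2], h[2:4], name)
```

A brick would then scan that small directory for names starting with the returned prefix and keep the ones linked into the directory being read.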
>>
>> This mechanism guarantees the same order and d_off on all bricks
>> (assuming that directory modification operations are done with some
>> locks held), and all management is done locally on each brick,
>> transparently to the clients. It also has a very low probability of
>> collisions (we can use 63 or 64 bits for d_off instead of 48) and does
>> not require any transformation in dht/afr/ec.
>>
>> However, a special operation would need to be implemented so that
>> healing procedures can insert a directory entry at the correct
>> position, maintaining order between bricks when things are not ok.
>>
>> It adds some complexity to ensure integrity, but since this job is done
>> locally on each brick, it seems easier to maintain.
>>
>> Xavi
>>
>