[Gluster-devel] Readdir d_off encoding

Xavier Hernandez xhernandez at datalab.es
Thu Dec 18 16:39:20 UTC 2014


On 12/18/2014 05:11 PM, Shyam wrote:
> On 12/17/2014 05:04 AM, Xavier Hernandez wrote:
>> Just to consider all possibilities...
>>
>> Current architecture needs to create all directory structure on all
>> bricks, and has the big problem that each directory in each brick will
>> store the files in different order and with different d_off values.
>
> I gather that this is when EC or AFR is in place, as for DHT a file is
> on one brick only.

Files are only on one brick, but directories are on all bricks. This 
is independent of having ec or afr in place.

This makes directory access quite complex in some cases. For example, 
if a readdir is made on one brick and that brick dies, the next 
readdir cannot be continued on another brick, at least not without 
some complex handling. This is the consequence of having a directory 
on each brick as if it were replicated, even though these directories 
are not exactly equal.
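
To illustrate why a d_off from one brick is useless on another one, 
here is a minimal example (plain readdir() on a local filesystem, 
nothing gluster-specific) that prints the d_off of each entry. The 
values are opaque cookies generated by the backend filesystem (ext4, 
XFS, ...), so two bricks holding exactly the same entries can return 
completely different d_off sequences:

#include <dirent.h>
#include <stdio.h>

/* Linux-specific: glibc's struct dirent exposes the d_off field. */
int main(int argc, char *argv[])
{
    DIR *dir = opendir(argc > 1 ? argv[1] : ".");
    struct dirent *entry;

    if (!dir)
        return 1;

    while ((entry = readdir(dir)) != NULL) {
        /* d_off is only a resume cookie for this directory on
         * this filesystem; it is not an index or a sort key. */
        printf("%20lld  %s\n", (long long)entry->d_off,
               entry->d_name);
    }

    closedir(dir);
    return 0;
}

Running this over the same directory on an ext4 brick and an XFS 
brick will typically show completely unrelated numbers, which is 
exactly why a listing cannot be resumed elsewhere.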

This architecture also forces ec to have directories replicated, 
which adds complexity.

>
>>
>> This is a serious scalability issue and has many inconveniences when
>> trying to heal or detect inconsistencies between bricks (basically we
>> would need to read full directory contents of each brick to compare
>> them).
>
> I am not quite familiar with EC so pardon the ignorance.
> Why/How does d_off play a role in this healing/crawling?

This problem is also present in afr. There are two problems that are 
easy to see:

* If multiple readdir requests are needed to get the full contents of 
a directory and the brick to which the requests are being sent dies, 
the next readdir request cannot be sent to any other brick because 
the d_off field won't make sense on the other brick. This doesn't 
have an easy solution, so an error is returned instead of completing 
the directory listing. This is odd because in theory we have the 
directory replicated, so this shouldn't happen (the same scenario 
while reading from a file is handled transparently to the client).

* If you need to detect the differences between the directory 
contents on different bricks (for example when you want to heal a 
directory), you need to read the full contents of the directory from 
each brick into memory, sort each list, and only then start the 
comparison. If the directory contains, for example, one million 
entries, that needs a huge amount of memory for an operation that 
seems much simpler. If all bricks returned directory entries in the 
same order and with the same d_off values, this procedure would need 
far less memory and would be more efficient (see the sketch below).
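
To make that second point concrete: with a common ordering, the 
comparison becomes a simple merge walk needing O(1) extra memory 
instead of reading and sorting a million entries per brick. A 
minimal, self-contained sketch (the two arrays just stand in for the 
hypothetical identically ordered readdir streams of two bricks):

#include <stdio.h>
#include <string.h>

int main(void)
{
    /* Stand-ins for the identically ordered directory listings
     * of two bricks. */
    const char *brick_a[] = { "file1", "file2", "file4", NULL };
    const char *brick_b[] = { "file1", "file3", "file4", NULL };
    int i = 0, j = 0;

    while (brick_a[i] && brick_b[j]) {
        int cmp = strcmp(brick_a[i], brick_b[j]);
        if (cmp < 0) {
            printf("only on brick A (heal candidate): %s\n",
                   brick_a[i++]);
        } else if (cmp > 0) {
            printf("only on brick B (heal candidate): %s\n",
                   brick_b[j++]);
        } else {
            i++;
            j++;
        }
    }
    for (; brick_a[i]; i++)
        printf("only on brick A (heal candidate): %s\n", brick_a[i]);
    for (; brick_b[j]; j++)
        printf("only on brick B (heal candidate): %s\n", brick_b[j]);

    return 0;
}

Only one entry per brick needs to be held in memory at any time, 
however big the directory is.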

>
>>
>> An alternative would be to convert directories into regular files from
>> the brick's point of view.
>>
>> The benefits of this would be:
>>
>> * d_off would be controlled by gluster, so all bricks would have the
>> same d_off and order. No need to use any d_off mapping or transformation.
>>
>> * Directories could take advantage of replication and disperse self-heal
>> procedures. They could be treated as files and be healed more easily. A
>> corrupted brick would not produce invalid directory contents, and file
>> duplication in directory listing would be avoided.
>>
>> * Many of the complexities in DHT, AFR and EC to manage directories
>> would be removed.
>>
>> The main issue could be the need for an upper-level xlator that would
>> transform directory requests into file modifications and would be
>> responsible for managing all d_off assignment and directory manipulation
>> (renames, links, unlinks, ...).
>
> This is tending towards some thoughts for Gluster 4.0 and specifically
> DHT in 4.0. I am going to wait for the same/similar comments as we
> discuss those specifics (hopefully published before Christmas (2014)).
>
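
To make the idea above a bit more concrete, this is a very rough 
sketch of what one record of such a directory file could look like. 
The dir_record_t layout is purely illustrative, not an actual or 
proposed on-disk format; the only point is that d_off would be 
assigned by gluster itself and therefore be identical on all bricks:

#include <stdint.h>

typedef struct {
    uint64_t d_off;     /* assigned by gluster, monotonically
                         * increasing, identical on every brick */
    uint8_t  gfid[16];  /* gfid of the entry */
    uint32_t type;      /* entry type: file, directory, ... */
    uint16_t name_len;  /* length of the name that follows */
    /* followed by name_len bytes with the entry name */
} dir_record_t;

Since gluster would control the whole file, a readdir could be 
resumed on any brick with the same d_off, and self-heal could treat 
the directory like any other file.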

