[Gluster-devel] Roadmap for afr, ec

Fri Sep 18 05:25:30 UTC 2015


On 09/16/2015 03:42 PM, fanghuang.data at yahoo.com wrote:
> Hi Pranith,
>
> For the EC encoding/decoding algorithm, could we design a plug-in mechanism so that users can choose their own
> algorithm or use a third-party library, as in Ceph? I am also curious why the IDA algorithm was originally
> chosen instead of the commonly used Reed-Solomon algorithm.
Pluggability of algorithms is also planned. I never checked which
algorithm is used; I was under the impression that we are using
non-systematic Reed-Solomon erasure codes, as Dan (CCed) told me.

Pranith
>   
> Best Regards,
> Fang Huang
>
>
>> On Monday, 14 September 2015, 16:30, Pranith Kumar Karampuri <pkarampu at redhat.com> wrote:
>>> hi,
>> Here is a list of common improvements for both ec and afr planned over
>> the next few months:
>>
>> 1) Granular entry self-heals.
>>        Both afr and ec at the moment do a lot of readdirs and lookups to
>> figure out the differences between the directories to perform heals.
>> Kritika, Ravi, Anuradha and I are discussing how to prevent this.
>> The base algo is to store only the names that need heal in
>> .glusterfs/indices/entry-changes/<parent-dir-gfid>/ as links to a base
>> file in .glusterfs/indices/entry-changes of the bricks. So only the
>> names that need to be healed will go through name heals.
>> We want to complete this for 3.8 definitely.
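
The index layout described in item 1 can be sketched roughly as follows. This is only an illustration of the idea, not Gluster code: the function names and the base file name "base" are hypothetical, and a temporary directory stands in for a brick.

```python
import os
import tempfile

INDEX_SUBDIR = os.path.join(".glusterfs", "indices", "entry-changes")
BASE_FILE = "base"  # hypothetical name; the mail does not name the base file


def mark_entry_for_heal(brick, parent_gfid, name):
    """Record that `name` under directory `parent_gfid` needs a name heal."""
    index_dir = os.path.join(brick, INDEX_SUBDIR)
    parent_dir = os.path.join(index_dir, parent_gfid)
    os.makedirs(parent_dir, exist_ok=True)
    base = os.path.join(index_dir, BASE_FILE)
    if not os.path.exists(base):
        open(base, "w").close()  # empty base file that all entries link to
    link = os.path.join(parent_dir, name)
    if not os.path.lexists(link):
        os.link(base, link)  # hard link: costs a directory entry, no new inode


def names_needing_heal(brick, parent_gfid):
    """Return only the names recorded for this parent; no full readdir of the real directory."""
    parent_dir = os.path.join(brick, INDEX_SUBDIR, parent_gfid)
    return sorted(os.listdir(parent_dir)) if os.path.isdir(parent_dir) else []


def clear_entry(brick, parent_gfid, name):
    """Drop the index link once the name has been healed."""
    os.unlink(os.path.join(brick, INDEX_SUBDIR, parent_gfid, name))


with tempfile.TemporaryDirectory() as brick:
    mark_entry_for_heal(brick, "parent-gfid-1", "b.txt")
    mark_entry_for_heal(brick, "parent-gfid-1", "a.txt")
    print(names_needing_heal(brick, "parent-gfid-1"))
```

With such an index, the heal only walks the recorded names instead of comparing the complete directory listings of all bricks.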
>>
>> 2) Granular data self-heals.
>>        At the moment, even if a single byte changes in a file, afr and ec
>> read the entire file to fix the problem. We are thinking of preventing
>> this by remembering in extended attributes where the changes happened
>> in the file. There will be a new extended attribute on the file which
>> represents a bitmap of the changes, where each bit represents a range
>> that needs healing. This extended attribute will have a maximum size it
>> can represent; the extra chunks will be represented like shards in
>> .glusterfs/indices/data-changes/<gfid-<block-num>>, and the extended
>> attribute on each such block will store the ranges that need heals.
>>
>> For example: if the maximum size of the extended attribute value is 4KB
>> and each bit represents 128KB (i.e. the first bit represents changes done
>> from offset 0-128KB, the 2nd bit 128KB+1-256KB etc.), a single extended
>> attribute can store changes happening to a file of up to 4GB (4KB is
>> 32768 bits, and 32768 x 128KB = 4GB). (We are thinking of dynamically
>> increasing the size represented by each bit from say 4k to 128k, but
>> this is still in design.) Changes happening from offset 4GB+1 to 8GB
>> will be stored in the extended attribute of
>> .glusterfs/indices/data-changes/<gfid-of-file-1>. Changes happening
>> from offset 8GB+1 to 12GB will be stored in the extended attribute of
>> .glusterfs/indices/data-changes/<gfid-of-file-2>, etc. (Please note that
>> these files are empty; they will just contain extended attributes.)
>> We want to complete this for 3.8 (stretch goal)
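
The arithmetic in the example for item 2 can be checked with a small sketch. It assumes the sizes from the example (4KB xattr, 128KB per bit); the helper names, the zero-based offsets and the bit order inside each byte are my own assumptions, since the design is not final.

```python
XATTR_BYTES = 4 * 1024           # maximum xattr value size from the example
RANGE_PER_BIT = 128 * 1024       # bytes of file data covered by one bit
BITS_PER_XATTR = XATTR_BYTES * 8
SPAN_PER_XATTR = BITS_PER_XATTR * RANGE_PER_BIT  # bytes covered by one xattr


def locate(offset):
    """Map a zero-based file offset to (chunk, bit).

    Chunk 0 is the xattr on the file itself; chunk n >= 1 corresponds to
    the empty index file <gfid-of-file-n> from the example.
    """
    chunk, rest = divmod(offset, SPAN_PER_XATTR)
    return chunk, rest // RANGE_PER_BIT


def mark_dirty(bitmaps, offset, length):
    """Set the bits for every 128KB range touched by a write."""
    first = offset // RANGE_PER_BIT
    last = (offset + length - 1) // RANGE_PER_BIT
    for block in range(first, last + 1):
        chunk, bit = divmod(block, BITS_PER_XATTR)
        bitmap = bitmaps.setdefault(chunk, bytearray(XATTR_BYTES))
        bitmap[bit // 8] |= 1 << (bit % 8)  # bit order is an assumption


bitmaps = {}
mark_dirty(bitmaps, SPAN_PER_XATTR - 1, 2)  # write straddling the 4GB boundary
print(SPAN_PER_XATTR // 1024**3, sorted(bitmaps))
```

A heal would then read only the ranges whose bits are set instead of the whole file.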
>>
>> 3) Performance & throttling improvements for self-heal:
>>        We are also looking into the multi-threaded self-heal daemon patch
>> by Richard for inclusion in 3.8. We are waiting for the discussions by
>> Raghavendra G on QoS to be over before coming to any decisions on
>> throttling.
>>
>> After we have compound fops:
>> The goal here is to come up with compound fops and prevent unnecessary
>> round trips:
>> 4) Transaction latency improvements:
>>        On afr:
>>         In the unoptimized version of the transaction we have: 1) Lock,
>> 2) Pre-op 3) op 4) Post-op 5) unlock
>>         We will have: 1) Lock, 2) Pre-op + op 3) post-op + unlock
>>          This reduces round trips from 5 to 3 in the un-optimized version
>> of afr-transaction.
>>        On EC:
>>         In the unoptimized version (worst case of unaligned write) of
>> the transaction we have: 1) Lock, 2) get version, size xattrs 3) reads of
>> pre, post unaligned chunks 4) op 5) update version, size 6) unlock
>>         We will have: 1) Lock + get version, size xattrs + reads of pre,
>> post unaligned chunks, 2) op 3) update version, size + unlock
>>          This reduces round trips from 6 to 3 in the un-optimized version
>> of ec-transaction.
>>
>> 5) Entry self-heal per name latency improvements:
>>       Before: 1) Lock, 2) lookup to determine if the file needs to be
>> deleted/created 3) create/delete 4) Unlock
>>       After: 1) Lock + lookup 2) delete/create + unlock
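
The round-trip counts in items 4 and 5 can be modelled with a tiny sketch, where every inner list is one network round trip and a compound fop is simply a list with more than one step. The plan names and step labels are only illustrative and are not real fop names.

```python
# Each inner list is sent to the bricks as one request (one round trip).
PLANS = {
    "afr": (
        [["lock"], ["pre-op"], ["op"], ["post-op"], ["unlock"]],
        [["lock"], ["pre-op", "op"], ["post-op", "unlock"]],
    ),
    "ec": (
        [["lock"], ["get version/size"], ["read unaligned chunks"],
         ["op"], ["update version/size"], ["unlock"]],
        [["lock", "get version/size", "read unaligned chunks"],
         ["op"], ["update version/size", "unlock"]],
    ),
    "entry-heal": (
        [["lock"], ["lookup"], ["create/delete"], ["unlock"]],
        [["lock", "lookup"], ["create/delete", "unlock"]],
    ),
}


def round_trips(plan):
    """Number of requests needed to run the plan."""
    return len(plan)


def steps(plan):
    """The individual operations, in execution order."""
    return [step for request in plan for step in request]


for name, (before, after) in PLANS.items():
    # Compounding must not drop or reorder any step.
    assert steps(before) == steps(after)
    print(name, round_trips(before), "->", round_trips(after))
```

The same steps run in the same order in both versions; only the number of requests on the wire changes.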
>>
>> Roadmap that applies only to EC, for 3.8:
>> - Use SSE2/AVX/NEON extensions when available to speed up Galois Field
>> calculations
>> - Use a systematic matrix to improve encoding performance (it will also
>> improve decoding performance when all bricks are healthy)
>> - Implement a new algorithm able to detect and repair chunks of data on
>> the fly.
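
To illustrate why a systematic matrix helps, here is a minimal sketch that encodes one byte per fragment over GF(2^8). The matrices, the 4+2 layout and the field polynomial 0x11d are assumptions chosen for the example (RAID-6 style parity rows), not the ones ec actually uses.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8), reducing by the polynomial 0x11d."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11D
        b >>= 1
    return result


def gf_pow(x, n):
    result = 1
    for _ in range(n):
        result = gf_mul(result, x)
    return result


K = 4  # data fragments; two more rows give 4+2 redundancy

# Non-systematic: every output fragment is a mix of all data bytes.
NON_SYSTEMATIC = [[gf_pow(x, j) for j in range(K)] for x in range(1, 7)]

# Systematic: identity rows on top, so the first K fragments are the data.
SYSTEMATIC = [[int(i == j) for j in range(K)] for i in range(K)] + [
    [1] * K,                          # XOR parity
    [gf_pow(2, j) for j in range(K)],  # second parity row
]


def encode(matrix, data):
    """Return (fragments, number of real field multiplications)."""
    fragments, mults = [], 0
    for row in matrix:
        acc = 0
        for coef, byte in zip(row, data):
            if coef == 0:
                continue
            if coef == 1:
                acc ^= byte  # multiplying by 1 is just a copy
            else:
                acc ^= gf_mul(coef, byte)
                mults += 1
        fragments.append(acc)
    return fragments, mults


data = [0x12, 0x34, 0x56, 0x78]
sys_frags, sys_mults = encode(SYSTEMATIC, data)
non_frags, non_mults = encode(NON_SYSTEMATIC, data)
print(sys_frags[:K] == data, sys_mults, non_mults)
```

With the systematic matrix only the parity rows need field multiplications, and when all bricks are healthy a read can return the first K fragments as they are, without any decoding.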
>>
>> Roadmap that applies only for AFR:
>> 1) Once granular entry/data heals and throttling are in, we can look at
>> generalizing Richard's lazy replication patch to be used for near
>> synchronous replication between data centers, and possibly just between
>> the bricks. I haven't looked into the patch myself.
>>
>> We will be sending out more mails as soon as the design is complete for
>> each of these items. We are eagerly waiting for Xavi to come back to get
>> his comments as well on how EC will be impacted by the common changes.
>> Feedback on this plan is very welcome!
>>
>> Pranith