[Gluster-devel] Roadmap for afr, ec

Wed Sep 16 10:12:19 UTC 2015

Hi Pranith,

For the EC encoding/decoding algorithm, could we design a plug-in mechanism so that users can choose their own 
algorithm or use a third-party library, as Ceph does? I am also curious why the IDA algorithm was originally 
chosen instead of the more commonly used Reed-Solomon algorithm.

Best Regards,
Fang Huang

> On Monday, 14 September 2015, 16:30, Pranith Kumar Karampuri <pkarampu at redhat.com> wrote:
> Hi,
> 
> Here is a list of common improvements for both ec and afr planned over 
> the next few months:
> 
> 1) Granular entry self-heals.
>       Both afr and ec at the moment do a lot of readdirs and lookups to 
> figure out the differences between the directories to perform heals. 
> Kritika, Ravi, Anuradha and I are discussing how to prevent this. 
> The base algo is to store only the names that need heal in 
> .glusterfs/indices/entry-changes/<parent-dir-gfid>/ as links to the base 
> file in .glusterfs/indices/entry-changes of the bricks. So only the 
> names that need to be healed will go through name heals.
> We want to complete this for 3.8 definitely.
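
A minimal sketch of the index layout described above, in Python for illustration only. The base file name `base`, the helper names, and the use of plain hard links in a temporary directory are assumptions; the design is still under discussion and nothing here is taken from the actual afr/ec code.

```python
import os
import tempfile

INDEX_SUBDIR = os.path.join(".glusterfs", "indices", "entry-changes")
BASE_NAME = "base"  # hypothetical name for the base file


def mark_entry_dirty(brick, parent_gfid, name):
    """Record that `name` under directory `parent_gfid` needs an entry heal."""
    index_root = os.path.join(brick, INDEX_SUBDIR)
    parent_dir = os.path.join(index_root, parent_gfid)
    os.makedirs(parent_dir, exist_ok=True)
    base = os.path.join(index_root, BASE_NAME)
    if not os.path.exists(base):
        open(base, "w").close()  # empty file; it only exists to be linked to
    link = os.path.join(parent_dir, name)
    if not os.path.lexists(link):
        os.link(base, link)  # hard link: no extra inode per dirty name


def names_needing_heal(brick, parent_gfid):
    """List only the names that need healing, without a full readdir."""
    parent_dir = os.path.join(brick, INDEX_SUBDIR, parent_gfid)
    try:
        return sorted(os.listdir(parent_dir))
    except FileNotFoundError:
        return []


def mark_entry_healed(brick, parent_gfid, name):
    """Drop the index link once the name has been healed."""
    os.unlink(os.path.join(brick, INDEX_SUBDIR, parent_gfid, name))


def demo():
    """Mark two names dirty, heal one, and return the names before and after."""
    with tempfile.TemporaryDirectory() as brick:
        mark_entry_dirty(brick, "parent-gfid-1", "a.txt")
        mark_entry_dirty(brick, "parent-gfid-1", "b.txt")
        before = names_needing_heal(brick, "parent-gfid-1")
        mark_entry_healed(brick, "parent-gfid-1", "a.txt")
        after = names_needing_heal(brick, "parent-gfid-1")
        return before, after
```

With this layout the heal only has to walk the per-parent index directory, so its cost scales with the number of changed names rather than with the size of the directory being healed.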
> 
> 2) Granular data self-heals.
>       At the moment, even if a single byte changes in the file, afr and ec 
> read the entire file to fix the problems. We are thinking of preventing 
> this by remembering where the changes happened in the file in extended 
> attributes. There will be a new extended attribute on the file which 
> represents a bitmap of the changes, and each bit represents a range that 
> needs healing. This extended attribute will have a maximum size it can 
> represent; the extra chunks will be represented like shards in 
> .glusterfs/indices/data-changes/<gfid-<block-num>>, and the extended 
> attribute on this block will store the ranges that need heals.
> 
> For example: if the maximum extended attribute value size is 4KB and 
> each bit represents 128KB (i.e. the first bit represents changes made from 
> offset 0-128KB, the 2nd bit 128KB+1-256KB, etc.), a single extended 
> attribute can store changes happening to a file up to 4GB, since 4KB is 
> 32768 bits and 32768 x 128KB = 4GB. (We are thinking of dynamically 
> increasing the size represented by each bit from say 4k to 128k, but 
> this is still in design.) Changes happening from offset 4GB+1 to 8GB 
> will be stored in the extended attribute of 
> .glusterfs/indices/data-changes/<gfid-of-file-1>. Changes happening 
> from offset 8GB+1 to 12GB will be stored in the extended attribute of 
> .glusterfs/indices/data-changes/<gfid-of-file-2>, etc. (Please note that 
> these files are empty; they will just contain extended attributes.)
> We want to complete this for 3.8 (stretch goal)
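
The arithmetic in the example above can be sketched as follows. This is a Python illustration only, assuming the 4KB attribute size and the fixed 128KB-per-bit granularity from the example; the function names and the in-memory dict standing in for the extended attributes are hypothetical, not the planned implementation.

```python
XATTR_BYTES = 4 * 1024                           # assumed max xattr value size (4KB)
BITS_PER_XATTR = XATTR_BYTES * 8                 # 32768 bits per attribute
RANGE_PER_BIT = 128 * 1024                       # each bit covers 128KB
SPAN_PER_XATTR = BITS_PER_XATTR * RANGE_PER_BIT  # 4GB covered by one attribute


def locate(offset):
    """Map a byte offset to (block_num, bit_index).

    Block 0 stands for the attribute on the file itself; block n >= 1
    stands for the attribute on the empty index file <gfid-of-file-n>.
    """
    block_num, within = divmod(offset, SPAN_PER_XATTR)
    return block_num, within // RANGE_PER_BIT


def mark_dirty(bitmaps, offset, length):
    """Set the bit for every 128KB range touched by [offset, offset + length)."""
    if length <= 0:
        return
    first = offset // RANGE_PER_BIT
    last = (offset + length - 1) // RANGE_PER_BIT
    for global_bit in range(first, last + 1):
        block_num, bit = divmod(global_bit, BITS_PER_XATTR)
        bitmap = bitmaps.setdefault(block_num, bytearray(XATTR_BYTES))
        bitmap[bit // 8] |= 1 << (bit % 8)


def dirty_ranges(bitmaps):
    """Return the (start, end) byte ranges a heal would need to read."""
    ranges = []
    for block_num in sorted(bitmaps):
        bitmap = bitmaps[block_num]
        for bit in range(BITS_PER_XATTR):
            if bitmap[bit // 8] & (1 << (bit % 8)):
                start = block_num * SPAN_PER_XATTR + bit * RANGE_PER_BIT
                ranges.append((start, start + RANGE_PER_BIT))
    return ranges
```

Under these assumptions a one-byte write at offset 5GB sets a single bit in block 1, so the heal reads one 128KB range instead of the whole file.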
> 
> 3) Performance & throttling improvements for self-heal:
>       We are also looking into the multi-threaded self-heal daemon patch 
> by Richard for inclusion in 3.8. We are waiting for the discussions by 
> Raghavendra G on QoS to be over before coming to any decisions on 
> throttling.
> 
> After we have compound fops:
> The goal here is to come up with compound fops and prevent unnecessary 
> round trips:
> 4) Transaction latency improvements:
>       On afr:
>        In the unoptimized version of transaction we have: 1) Lock, 2) 
> Pre-op, 3) op, 4) Post-op, 5) unlock
>        We will have: 1) Lock, 2) Pre-op + op, 3) post-op + unlock
>        This reduces round trips from 5 to 3 in the un-optimized version 
> of afr-transaction.
>       On EC:
>        In the unoptimized version (worst case of unaligned write) of 
> transaction we have: 1) Lock, 2) get version, size xattrs, 3) reads of 
> pre, post unaligned chunks, 4) op, 5) update version, size, 6) unlock
>        We will have: 1) Lock + get version, size xattrs + reads of pre, 
> post unaligned chunks, 2) op, 3) update version, size + unlock
>        This reduces round trips from 6 to 3 in the un-optimized version 
> of ec-transaction.
> 
> 5) Entry self-heal per name latency improvements:
>      Before: 1) Lock, 2) lookup to determine if the file needs to be 
> deleted/created 3) create/delete 4) Unlock
>      After: 1) Lock + lookup 2) delete/create + unlock
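
The round-trip counts in items 4 and 5 can be checked with a small model in which each compound fop is one list of steps and costs one round trip. This is only a counting sketch in Python; the step names are copied from the lists above, and the data structure and function names are hypothetical.

```python
# Each inner list is one (possibly compound) fop, i.e. one network round trip.
PLANS = {
    "afr transaction": (
        [["lock"], ["pre-op"], ["op"], ["post-op"], ["unlock"]],
        [["lock"], ["pre-op", "op"], ["post-op", "unlock"]],
    ),
    "ec unaligned write": (
        [["lock"], ["get version, size xattrs"],
         ["read pre, post unaligned chunks"], ["op"],
         ["update version, size"], ["unlock"]],
        [["lock", "get version, size xattrs",
          "read pre, post unaligned chunks"],
         ["op"],
         ["update version, size", "unlock"]],
    ),
    "entry self-heal per name": (
        [["lock"], ["lookup"], ["create/delete"], ["unlock"]],
        [["lock", "lookup"], ["create/delete", "unlock"]],
    ),
}


def round_trips(plan):
    """One round trip per fop, compound or not."""
    return len(plan)


def steps(plan):
    """Flatten a plan into its ordered list of steps."""
    return [step for fop in plan for step in fop]


def summarize(plans):
    """Return {name: (round trips before, round trips after)}."""
    summary = {}
    for name, (before, after) in plans.items():
        # Compounding must only regroup steps, never drop or reorder them.
        if steps(before) != steps(after):
            raise ValueError("plans differ in steps: " + name)
        summary[name] = (round_trips(before), round_trips(after))
    return summary
```

The model confirms that compounding only regroups the same ordered steps; the savings come entirely from sending adjacent steps in one request.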
> 
> Roadmap that applies only to EC, for 3.8:
> - Use SSE2/AVX/NEON extensions when available to speed up Galois Field 
> calculations
> - Use a systematic matrix to improve encoding performance (it will also 
> improve decoding performance when all bricks are healthy)
> - Implement a new algorithm able to detect and repair chunks of data on 
> the fly.
> 
> Roadmap that applies only to AFR:
> 1) Once granular entry/data heals and throttling are in, we can look at 
> generalizing Richard's lazy replication patch to be used for near-synchronous 
> replication between data centers and possibly just the 
> bricks. I haven't looked into the patch myself.
> 
> We will be sending out more mails as soon as the design is complete for each 
> of these items. We are eagerly waiting for Xavi to come back to get his 
> comments on how EC will be impacted by the common changes. 
> Feedback on this plan is very welcome!
> 
> Pranith