[Gluster-devel] Roadmap for afr, ec

Pranith Kumar Karampuri pkarampu at redhat.com
Mon Sep 14 08:30:35 UTC 2015


hi,

Here is a list of common improvements for both ec and afr planned over 
the next few months:

1) Granular entry self-heals.
      At the moment both afr and ec do a lot of readdirs and lookups to 
figure out the differences between directories in order to perform 
heals. Kritika, Ravi, Anuradha and I are discussing how to prevent 
this. The basic algorithm is to store, in 
.glusterfs/indices/entry-changes/<parent-dir-gfid>/ on the bricks, only 
the names that need heal, as hard links to a base file under 
.glusterfs/indices/entry-changes. That way only the names that actually 
need healing go through name heal.
We definitely want to complete this for 3.8.
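
To illustrate, the on-brick layout we have in mind would look roughly 
like this (the design is still being discussed, so treat the exact 
paths as provisional):

.glusterfs/indices/entry-changes/
    <parent-dir-gfid>/     <- one directory per parent needing entry heal
        <name-1>           <- hard link to the base file, one link per
        <name-2>              name that needs create/delete heal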

2) Granular data self-heals.
      At the moment, even if a single byte of a file changes, afr and 
ec read the entire file to fix the problem. We are thinking of 
preventing this by remembering, in extended attributes, where on the 
file the changes happened. A new extended attribute on the file will 
hold a bitmap of the changes, with each bit representing a range that 
needs healing. Since an extended attribute has a maximum size it can 
represent, the extra chunks will be tracked, shard-like, by empty 
files at .glusterfs/indices/data-changes/<gfid>-<block-num>; the 
extended attribute on each such block file stores the ranges of that 
block that need heal.

For example: if the maximum extended attribute value size is 4KB and 
each bit represents 128KB (i.e. the first bit represents changes in the 
offset range 0-128KB, the second bit 128KB-256KB, etc.), then a single 
extended attribute can track changes to the first 4GB of the file. (We 
are thinking of dynamically increasing the size represented by each 
bit, from say 4KB up to 128KB, but this is still in design.) Changes in 
the offset range 4GB-8GB will be stored in the extended attribute of 
.glusterfs/indices/data-changes/<gfid>-1, changes in the range 8GB-12GB 
in the extended attribute of .glusterfs/indices/data-changes/<gfid>-2, 
and so on (please note that these files are empty; they only carry 
extended attributes).
We want to complete this for 3.8 (stretch goal).
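
As a rough sketch of the arithmetic above (plain C, using the constants 
from the example; not the actual implementation):

#include <stdint.h>
#include <stdio.h>

#define BIT_SIZE       (128ULL * 1024)     /* bytes covered by one bit */
#define XATTR_BYTES    4096ULL             /* max xattr value size     */
#define BITS_PER_XATTR (XATTR_BYTES * 8)   /* 32768 bits per xattr     */

/* Map a file offset to (chunk, bit): chunk 0 is the xattr on the file
 * itself, chunk N >= 1 is the xattr on
 * .glusterfs/indices/data-changes/<gfid>-N. */
static void locate_dirty_bit(uint64_t offset, uint64_t *chunk,
                             uint64_t *bit)
{
    uint64_t abs_bit = offset / BIT_SIZE;  /* bit index over whole file */

    *chunk = abs_bit / BITS_PER_XATTR;     /* which 4GB region / file   */
    *bit   = abs_bit % BITS_PER_XATTR;     /* bit within that xattr     */
}

int main(void)
{
    uint64_t chunk, bit;

    /* A write at offset 5GB lands in <gfid>-1 (the 4GB-8GB region). */
    locate_dirty_bit(5ULL << 30, &chunk, &bit);
    printf("chunk %llu, bit %llu\n",
           (unsigned long long)chunk, (unsigned long long)bit);
    return 0;
}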

3) Performance & throttling improvements for self-heal:
      We are also looking into Richard's multi-threaded self-heal 
daemon patch for inclusion in 3.8. We are waiting for Raghavendra G's 
discussions on QoS to conclude before making any decisions on 
throttling.

After we have compound fops:
The goal here is to use compound fops to eliminate unnecessary round 
trips (a sketch of the idea follows item 5 below):
4) Transaction latency improvements:
      On afr:
       The unoptimized version of the transaction has: 1) lock, 2) 
pre-op, 3) op, 4) post-op, 5) unlock.
       With compound fops we will have: 1) lock, 2) pre-op + op, 3) 
post-op + unlock.
       This reduces round trips from 5 to 3 in the unoptimized version 
of the afr transaction.
      On ec:
       The unoptimized version of the transaction (the worst case, an 
unaligned write) has: 1) lock, 2) get version and size xattrs, 3) reads 
of the pre and post unaligned chunks, 4) op, 5) update version and 
size, 6) unlock.
       With compound fops we will have: 1) lock + get version and size 
xattrs + reads of the pre and post unaligned chunks, 2) op, 3) update 
version and size + unlock.
       This reduces round trips from 6 to 3 in the unoptimized version 
of the ec transaction.

5) Entry self-heal per-name latency improvements:
     Before: 1) lock, 2) lookup to determine whether the file needs to 
be deleted/created, 3) create/delete, 4) unlock.
     After: 1) lock + lookup, 2) delete/create + unlock.
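
To make the compounding idea concrete, here is a minimal sketch in C 
with hypothetical types (compound_req, fop_kind); the real compound-fop 
interface is still being designed, so this illustrates the batching, 
not the actual API:

#include <stddef.h>

/* Several fops packed into one request: the brick executes them in
 * order and the client pays a single network round trip. */
enum fop_kind { FOP_LOCK, FOP_PRE_OP, FOP_WRITE, FOP_POST_OP,
                FOP_UNLOCK };

struct compound_req {
    size_t        nfops;
    enum fop_kind fops[4];   /* executed in order on the brick side */
};

/* The afr write transaction from item 4: 3 round trips instead of 5. */
static const struct compound_req afr_txn[] = {
    { 1, { FOP_LOCK } },                /* trip 1: lock              */
    { 2, { FOP_PRE_OP, FOP_WRITE } },   /* trip 2: pre-op + op       */
    { 2, { FOP_POST_OP, FOP_UNLOCK } }, /* trip 3: post-op + unlock  */
};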

Roadmap that applies only to EC (for 3.8):
- Use SSE2/AVX/NEON extensions when available to speed up Galois Field 
calculations (a scalar sketch of these calculations follows this list)
- Use a systematic matrix to improve encoding performance (it will also 
improve decoding performance when all bricks are healthy)
- Implement a new algorithm able to detect and repair chunks of data on 
the fly.
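
For reference, the scalar GF(2^8) multiplication that those extensions 
would vectorize looks roughly like this (log/exp table method, assuming 
the 0x11d polynomial commonly used in erasure coding; an illustration, 
not the ec translator code):

#include <stddef.h>
#include <stdint.h>

static uint8_t gf_exp[512];
static uint8_t gf_log[256];

/* Build log/exp tables for GF(2^8); call once before gf_mul(). */
static void gf_init(void)
{
    unsigned x = 1;
    for (int i = 0; i < 255; i++) {
        gf_exp[i] = (uint8_t)x;
        gf_log[x] = (uint8_t)i;
        x <<= 1;
        if (x & 0x100)
            x ^= 0x11d;             /* reduce mod x^8+x^4+x^3+x^2+1 */
    }
    for (int i = 255; i < 512; i++) /* duplicate so gf_mul skips %255 */
        gf_exp[i] = gf_exp[i - 255];
}

static uint8_t gf_mul(uint8_t a, uint8_t b)
{
    if (a == 0 || b == 0)
        return 0;
    return gf_exp[gf_log[a] + gf_log[b]];
}

/* One row of an encode: dst ^= coef * src over the whole chunk. This
 * per-byte loop is what SSE2/AVX/NEON replace with wide table lookups. */
static void gf_mul_region(uint8_t *dst, const uint8_t *src,
                          uint8_t coef, size_t len)
{
    for (size_t i = 0; i < len; i++)
        dst[i] ^= gf_mul(coef, src[i]);
}

With a systematic matrix, the first k rows of the encoding matrix form 
the identity, so data fragments are stored verbatim and gf_mul_region() 
runs only for the parity rows; that is where the encoding (and 
healthy-bricks decoding) speedup comes from.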

Roadmap that applies only to AFR:
1) Once granular entry/data heals and throttling are in, we can look at 
generalizing Richard's lazy replication patch to be used for near 
synchronous replication between data centers, and possibly just between 
the bricks; I haven't looked into the patch myself.

We will be sending out more mails as the design for each of these items 
completes. We are also eagerly waiting for Xavi to come back so we can 
get his comments on how the common changes will impact EC. 
Feedback on this plan is very welcome!

Pranith

