[Gluster-devel] Handling Georeplication rsync/tar+ssh failures more accurately.

Mon Jan 19 08:52:34 UTC 2015

Handling Geo-replication rsync/tar+ssh failures more accurately.
================================================================

Existing:
---------
1. Multiple Changelogs are processed together, and their contents are 
segregated into ENTRY, META and DATA.
2. All Entry and Meta operations are sent to Slave gsyncd via RPC. Entry 
and Meta Ops are not parallel; the Slave executes them sequentially.
3. For Data operations, GFIDs are queued and multiple rsync jobs (default 
is 3) sync data in parallel. (This is possible since all entries are 
already created by the previous step.) These rsync jobs have no idea 
which Changelog a GFID belongs to.
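
A rough sketch of step 1, to make the later discussion concrete. It 
assumes each parsed changelog record starts with a one-letter type (E, M 
or D) followed by the GFID, as in the example record further down; the 
names segregate and Batch are made up for illustration and are not the 
gsyncd code.

```python
from collections import namedtuple

# Hypothetical container for one processed batch of changelogs.
Batch = namedtuple("Batch", ["entry", "meta", "data"])


def segregate(changelog_lines):
    """Split parsed changelog records into ENTRY, META and DATA buckets."""
    batch = Batch(entry=[], meta=[], data=[])
    for line in changelog_lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # ignore blank or truncated records
        kind, gfid = fields[0], fields[1]
        if kind == "E":
            batch.entry.append(fields[1:])  # replayed sequentially on the Slave
        elif kind == "M":
            batch.meta.append(fields[1:])
        elif kind == "D":
            # Only the GFID is queued for rsync, so the link back to the
            # originating changelog is lost at this point.
            batch.data.append(gfid)
    return batch
```

Because only bare GFIDs reach the DATA queue, a later rsync failure 
cannot be attributed to a specific changelog.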

If rsync/tar+ssh fails to sync, it retries all the steps mentioned 
above. After MAX retries it adds the whole changelog(s) to the Skipped 
List (since they are only partially processed).

Problems:
---------
- Entry Ops failures are not reported in the log or in Status output.
- For rsync failures, all the steps are repeated even when that is not 
necessary.
- The entire changelog is skipped even if only one GFID has an issue.
- There is no way to get an accurate list of failed GFIDs/files.

Planned enhancements:
---------------------
- Log Entry Ops failures in the following format and show them in Status 
output. Use a separate log file, say 
/var/log/glusterfs/geo-replication/<MASTER>_<SLAVE_HOST>_<SLAVE>/failures 
(Log format: GFID|Changelog|Reason|Details)

    For example, 0d5fd80f-e5b5-4a9a-9023-879d730c9b82|1421648492|File 
exists with different GFID|E 57bad16c-222c-4c5e-80a8-87f77ffc9284 CREATE 
33188 0 0 00000000-0000-0000-0000-000000000001/f1

- Remove a GFID from the DATA list if the Changelog also records an 
Unlink for it. This avoids rsync failures for these GFIDs (rsync will 
fail because the source file is unlinked).

- Create a unique list of GFIDs for rsync. (rsync benefits, since this 
avoids sending duplicate entries to it.)

- When rsync fails, do not repeat steps 1 and 2; retry only step 3.
     1. FIRST RETRY: Stat each GFID in the Master mount and rsync only 
the GFIDs with a valid stat.
     2. SECOND RETRY: Retry the first GFID separately and the rest of 
them all at once. If the first GFID fails, add it to the skip list.
        If the rest of the batch fails again, do the same thing again 
(rsync the first GFID in the batch separately, then rsync the rest of 
the batch).
        Repeat this step till all GFIDs are processed (each ends up 
either in the skipped list or as a success).
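
The two DATA list clean-ups proposed above (dropping unlinked GFIDs and 
de-duplicating) could be done in one pass. This is only a sketch: 
prepare_data_list is a made-up name, and it assumes the caller passes in 
only those GFIDs whose file is really gone on Master (an Unlink of one 
of several hardlinks would not qualify).

```python
def prepare_data_list(data_gfids, unlinked_gfids):
    """Return the GFIDs to hand to rsync: unique, order-preserving,
    and without GFIDs that were unlinked in the same batch."""
    unlinked = set(unlinked_gfids)
    seen = set()
    result = []
    for gfid in data_gfids:
        if gfid in unlinked or gfid in seen:
            continue  # source is gone, or GFID is already queued
        seen.add(gfid)
        result.append(gfid)
    return result
```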

The SECOND RETRY approach may affect geo-rep performance, but only when 
there is an rsync problem.
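
To show where the cost comes from, here is a minimal sketch of the two 
retry stages. Both callables are placeholders, not gsyncd APIs: 
rsync(batch) is assumed to return True when the whole batch synced, and 
exists(gfid) stands in for the stat on the Master mount. GFIDs that no 
longer stat are simply dropped here, which is an assumption.

```python
def sync_with_retries(gfids, rsync, exists):
    """Return (synced, skipped) for one batch of GFIDs."""
    if not gfids or rsync(gfids):
        return list(gfids), []

    # FIRST RETRY: keep only GFIDs that still stat on the Master mount.
    pending = [g for g in gfids if exists(g)]
    if not pending or rsync(pending):
        return pending, []

    # SECOND RETRY: sync the first GFID alone, then the rest at once;
    # if the rest fails, repeat on the rest.
    synced, skipped = [], []
    while pending:
        head, rest = pending[0], pending[1:]
        if rsync([head]):
            synced.append(head)
        else:
            skipped.append(head)  # this GFID goes to the skip list
        if not rest or rsync(rest):
            synced.extend(rest)
            break
        pending = rest
    return synced, skipped
```

In this sketch the SECOND RETRY stage needs at most about two rsync 
invocations per remaining GFID, and it ends as soon as the rest of the 
batch syncs in one go.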

If there are any failures, log them in the Failures log and show the 
count in the Status output. 
(Ex: b340deb7-8dd2-4d10-ab26-80acd3ff4954|1421648492|I/O Error|rsync)
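
A small sketch of how the proposed GFID|Changelog|Reason|Details lines 
could be written and counted for Status. The helper names are 
hypothetical, and it assumes the first three fields never contain a "|" 
(Details may).

```python
from collections import Counter, namedtuple

Failure = namedtuple("Failure", ["gfid", "changelog", "reason", "details"])


def format_failure(gfid, changelog, reason, details):
    """Build one line for the failures log."""
    return "|".join([gfid, str(changelog), reason, details])


def parse_failure(line):
    """Parse one failures-log line; Details keeps any further '|'."""
    return Failure(*line.rstrip("\n").split("|", 3))


def count_failures(lines):
    """Count failures per Reason, e.g. for the Status output."""
    return Counter(parse_failure(l).reason for l in lines if l.strip())
```

Status could then show sum(count_failures(lines).values()) as the total 
number of failures.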

Let me know your thoughts. Thanks

--
regards
Aravinda