[Gluster-devel] Handling Georeplication rsync/tar+ssh failures more accurately.
Aravinda
avishwan at redhat.com
Mon Jan 19 08:52:34 UTC 2015
Handling Geo-replication rsync/tar+ssh failures more accurately.
================================================================
Existing:
---------
1. Multiple Changelogs are processed together; their contents are
segregated into ENTRY, META and DATA.
2. All Entry and Meta operations are sent to the Slave gsyncd via RPC.
Entry and Meta Ops are not parallel; the Slave executes them sequentially.
3. For Data operations, GFIDs are queued and multiple rsync jobs (default
is 3) sync data in parallel. (This is possible since all entries are
available from the previous step.) These rsync jobs have no knowledge of
which Changelog a GFID belongs to.
If rsync/tar+ssh fails to sync, it retries all the steps mentioned
above. After MAX retries it adds the whole changelog(s) to the Skipped
List (since they are only partially processed).
Problems:
---------
- Entry Ops failures are not reported in the log or in Status output.
- For rsync failures, all the steps are repeated even when not necessary.
- The entire changelog is skipped even if only one GFID has an issue.
- There is no way to get an accurate list of failed GFIDs/files.
Planned enhancements:
---------------------
- Log Entry Ops failures in the following format and show them in Status
output. (Separate log file, say
/var/log/glusterfs/geo-replication/<MASTER>_<SLAVE_HOST>_<SLAVE>/failures)
(Log format: GFID|Changelog|Reason|Details)
For example, 0d5fd80f-e5b5-4a9a-9023-879d730c9b82|1421648492|File
exists with different GFID|E 57bad16c-222c-4c5e-80a8-87f77ffc9284 CREATE
33188 0 0 00000000-0000-0000-0000-000000000001/f1
- Remove a GFID from the DATA list if an Unlink for it is captured in the
Changelog in which DATA is also recorded. This avoids rsync failures for
these GFIDs (rsync will fail because the source file has been unlinked).
- Create a unique list of GFIDs for rsync. (rsync benefits, since
duplicate entries are not sent to it.)
- When rsync fails, do not repeat steps 1 and 2; retry only step 3.
1. FIRST RETRY: Stat each GFID in the Master mount and rsync only the
GFIDs with a valid stat.
2. SECOND RETRY: Retry the first GFID separately and the rest of them all
at once. If the first GFID fails, add it to the skip list.
If the rest of the batch fails again, do the same thing again
(rsync the first in the batch separately, then rsync the rest of the batch).
Repeat this step until all GFIDs are processed (each ends up either in
the skipped list or as a success).
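
The DATA-list clean-up and the two retry stages above can be sketched
roughly as follows. This is only an illustration, not gsyncd code: the
names prepare_data_list and sync_with_retries are hypothetical, sync()
stands in for one rsync/tar+ssh run over a batch, and exists() stands in
for a stat on the Master mount. It also assumes that GFIDs which fail the
stat are simply dropped rather than added to the skip list.

```python
from typing import Callable, Iterable, List, Set, Tuple


def prepare_data_list(data_gfids: Iterable[str], unlinked: Set[str]) -> List[str]:
    """Drop GFIDs that were unlinked, and de-duplicate (first-seen order)."""
    seen: Set[str] = set()
    result: List[str] = []
    for gfid in data_gfids:
        if gfid in unlinked or gfid in seen:
            continue
        seen.add(gfid)
        result.append(gfid)
    return result


def sync_with_retries(
    gfids: Iterable[str],
    sync: Callable[[List[str]], bool],   # hypothetical: one rsync run, True on success
    exists: Callable[[str], bool],       # hypothetical: stat on the Master mount
) -> Tuple[List[str], List[str]]:
    """Return (synced, skipped) without re-running the Entry/Meta steps."""
    pending = list(gfids)
    if not pending or sync(pending):
        return pending, []

    # FIRST RETRY: keep only GFIDs that still stat on the Master mount.
    # Assumption: GFIDs that no longer exist are dropped, not skipped.
    pending = [g for g in pending if exists(g)]
    if not pending or sync(pending):
        return pending, []

    # SECOND RETRY: sync the first GFID alone, then the rest as one batch.
    # Each pass removes one GFID from the batch, so the loop terminates.
    synced: List[str] = []
    skipped: List[str] = []
    while pending:
        first, rest = pending[0], pending[1:]
        if sync([first]):
            synced.append(first)
        else:
            skipped.append(first)
        if not rest or sync(rest):
            synced.extend(rest)
            break
        pending = rest
    return synced, skipped
```

In the worst case (the bad GFID is last in the batch) this costs about two
sync runs per GFID, which is the performance trade-off of the SECOND RETRY.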
The SECOND RETRY approach may affect geo-rep performance, but only when
there is an rsync problem.
If there are any failures, log them in the Failures log and show the
count in the Status output.
(Ex: b340deb7-8dd2-4d10-ab26-80acd3ff4954|1421648492|I/O Error|rsync)
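
As a rough sketch of how a record in the proposed
GFID|Changelog|Reason|Details format could be written and read back
(the names FailureRecord, format_failure, parse_failure and
count_failures are hypothetical, not part of gsyncd):

```python
from typing import Iterable, NamedTuple


class FailureRecord(NamedTuple):
    gfid: str
    changelog: str
    reason: str
    details: str


def format_failure(rec: FailureRecord) -> str:
    """Render one record as GFID|Changelog|Reason|Details."""
    return "|".join(rec)


def parse_failure(line: str) -> FailureRecord:
    """Parse one log line; split at most 3 times so that the Details
    field may itself contain '|' characters."""
    gfid, changelog, reason, details = line.rstrip("\n").split("|", 3)
    return FailureRecord(gfid, changelog, reason, details)


def count_failures(lines: Iterable[str]) -> int:
    """Number of non-empty records, e.g. for display in Status output."""
    return sum(1 for line in lines if line.strip())


rec = parse_failure(
    "b340deb7-8dd2-4d10-ab26-80acd3ff4954|1421648492|I/O Error|rsync"
)
print(rec.reason)  # I/O Error
```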
Let me know your thoughts. Thanks
--
regards
Aravinda