[Gluster-devel] Geo-replication: Improving the performance during History Crawl
Vijay Bellur
vbellur at redhat.com
Mon Aug 8 20:08:49 UTC 2016
On 08/05/2016 04:45 AM, Aravinda wrote:
> Hi,
>
> Geo-replication has three types of change detection (to identify the list
> of files changed and to sync only those files):
>
> 1. XTime based brick backend crawl for the initial sync
> 2. Historical Changelogs to sync the backlog (files created/modified/deleted
> between the worker going down and restarting)
> 3. Live Changelogs - as and when a changelog is rolled over, process it
> and sync the changes
>
> If initial data is available in the Master Volume before the Geo-replication
> session is created, it does the XTime based crawl (Hybrid Crawl) and
> then switches to Live Changelog mode.
> After the initial sync, the XTime crawl is not used again. On worker restart
> it uses Historical Changelogs and then switches to Live Changelogs.
>
> Geo-replication is very slow during the History Crawl if the backlog of
> changelogs grows large (i.e. if the Geo-rep session was down for a long time).
>
Do we need an upper bound on the duration the changelog backlog is allowed
to grow? If the backlog grows beyond a certain threshold, should we resort
to the xtime based crawl as in the initial sync?
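For illustration, a minimal sketch of such a threshold check; the constant,
function name and return values are hypothetical, not existing gsyncd options:

# Hypothetical fallback check -- not actual gsyncd code.  If the backlog
# window exceeds an upper bound, fall back to the xtime/hybrid crawl
# instead of replaying every historical changelog.

BACKLOG_THRESHOLD_SECS = 7 * 24 * 3600   # assumed upper bound, e.g. one week

def choose_crawl(last_synced_time, current_time):
    backlog = current_time - last_synced_time
    if backlog > BACKLOG_THRESHOLD_SECS:
        return "xtime"     # treat it like an initial sync
    return "history"       # replay historical changelogs as today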
> - If the same file is created, deleted and created again, Geo-rep replays
> the changelogs in the same order on the Slave side.
> - Data sync happens GFID to GFID, so every sync except the final GFID's will
> fail, since that file no longer exists on the Master (a file may exist at the
> path, but with a different GFID).
> Due to these failed data syncs and retries, Geo-rep performance suffers.
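To make the failure mode concrete, here is a toy example; the record format
is simplified and is not the actual changelog format:

# Each record is (op, gfid).  A GFID that is later UNLINKed no longer
# exists on the Master, so a DATA sync replayed for it is a guaranteed
# failure and retry.

records = [
    ("CREATE", "gfid-1"), ("DATA", "gfid-1"),
    ("UNLINK", "gfid-1"),                       # file deleted ...
    ("CREATE", "gfid-2"), ("DATA", "gfid-2"),   # ... and recreated (new GFID)
]

dead = {gfid for op, gfid in records if op == "UNLINK"}
worth_syncing = [gfid for op, gfid in records
                 if op == "DATA" and gfid not in dead]
print(worth_syncing)   # ['gfid-2']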
>
> Kotresh and I discussed this and came up with the following changes to
> Geo-replication.
>
> While processing History,
>
> - Collect all the entry, data and meta operations in a temporary database
Depending on the number of changelogs and operations, creating this
database might itself take a non-trivial amount of time. If there is an
archival/WORM workload without any deletions, would this step be
counterproductive from a performance perspective?
> - Delete all Data and Meta GFIDs that are already unlinked as per the
> Changelogs
We need to delete only those GFIDs whose link count happens to be zero
after the unlink. Would this need an additional stat()?
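As a rough sketch of what that check could look like, assuming the usual
.glusterfs/<g1>/<g2>/<gfid> handle path on the brick; the helper names are
made up, and whether an extra stat() per GFID is acceptable is exactly the
open question above:

import os

def gfid_backend_path(brick_root, gfid):
    # Hypothetical helper: map a GFID to its handle under .glusterfs
    return os.path.join(brick_root, ".glusterfs", gfid[0:2], gfid[2:4], gfid)

def needs_data_sync(brick_root, gfid):
    """Skip GFIDs whose handle is gone or has no remaining namespace link."""
    try:
        st = os.stat(gfid_backend_path(brick_root, gfid))
    except OSError:
        return False           # handle gone: nothing left on the Master to sync
    return st.st_nlink > 1     # only the handle left => already unlinked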
> - Process all Entry operations in batches
> - Process Data and Meta operations in batches
> - Once the sync is complete, update the last Changelog's time as the
> last_synced time, as usual.
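A minimal sketch of how the temporary database and batching from the quoted
proposal could fit together, assuming SQLite and a simplified
(seq, kind, op, gfid) record format; none of this is the actual design:

import sqlite3

def build_db(records):
    # records: iterable of (seq, kind, op, gfid), kind in ENTRY/DATA/META
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE ops (seq INTEGER, kind TEXT, op TEXT, gfid TEXT)")
    db.executemany("INSERT INTO ops VALUES (?, ?, ?, ?)", records)
    # Drop Data/Meta work for GFIDs the changelogs themselves unlink later
    db.execute("""DELETE FROM ops
                  WHERE kind IN ('DATA', 'META')
                    AND gfid IN (SELECT gfid FROM ops WHERE op = 'UNLINK')""")
    return db

def gfid_batches(db, kind, size=1000):
    # Hand the surviving GFIDs of one kind to the sync engine in batches
    cur = db.execute("SELECT gfid FROM ops WHERE kind = ? ORDER BY seq", (kind,))
    while True:
        rows = cur.fetchmany(size)
        if not rows:
            break
        yield [gfid for (gfid,) in rows]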
>
> Challenges:
> - If the worker crashes in between the above steps, the same changelogs
> will be reprocessed on restart. (In the existing implementation the crawl
> is done in small batches, so on failure only the last partially completed
> batch is reprocessed.)
> Some of the retries can be avoided if we start maintaining
> entry_last_synced (entry_stime) and data_last_synced (stime) separately.
>
Right, this can be a significant challenge if we keep crashing at the
same point due to an external factor or a bug in the code. Having a more
granular tracker can help reduce the cost of a retry.
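A toy sketch of such granular tracking, reusing the entry_stime/stime names
from the quoted proposal; the batch, slave and checkpoint objects are
hypothetical. Recording the two markers separately means a crash after the
entry phase only forces the data/meta phase of that batch to be redone:

def process_batch(batch, slave, checkpoint):
    if checkpoint.entry_stime < batch.end_time:
        slave.apply_entry_ops(batch.entry_ops)
        checkpoint.save(entry_stime=batch.end_time)   # entry phase durable

    slave.sync_data(batch.data_gfids)
    slave.sync_meta(batch.meta_gfids)
    checkpoint.save(stime=batch.end_time)             # whole batch durable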
-Vijay