[Gluster-devel] Performance improvements to Gluster Geo-replication
Shyam
srangana at redhat.com
Mon Aug 31 13:47:55 UTC 2015
On 08/31/2015 03:17 AM, Aravinda wrote:
> Following Changes/ideas identified to improve the Geo-replication
> Performance. Please add your ideas/issues to the list
>
> 1. Entry stime and Data/Meta stime
> ----------------------------------
> Now we use only one xattr to maintain the state of sync, called
> stime. When a Geo-replication worker restarts, it starts from that
> stime and syncs files.
>
> get_changes from <STIME> to <CURRENT TIME>
> perform <ENTRY> operations
> perform <META> operations
> perform <DATA> operations
>
> If a data operation fails, the worker crashes, restarts and
> reprocesses the changelogs again. Entry, Meta and Data operations will
> be retried. If we maintain entry_stime separately, then we can avoid
> reprocessing entry operations which were completed previously.
This seems like a good thing to do.
Here is something more that could be done (I am not well aware of
geo-rep internals, so maybe this cannot be done):
- Why not maintain a 'mark' up to which ENTRY/META operations have been
performed, so that even when failures occur in the ENTRY/META operation
queue, we only need to restart from the mark and not all the way from
the beginning STIME?
I am not sure where such a 'mark' can be maintained, unless the
processed get_changes are ordered and written to disk, or ordered
idempotently in memory each time.
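To make the 'mark' idea concrete, here is a rough Python sketch of how a
separately persisted entry_stime could gate the replay on worker
restart. The variable names and helpers below are my assumptions, not
the actual geo-rep code:

    # Hypothetical sketch: persist an entry_stime "mark" so completed
    # ENTRY/META batches are not replayed after a worker restart.
    entry_stime = 0   # would really be an xattr on the brick root (assumed)
    stime = 0         # the existing overall stime

    def sync_batch(changes, batch_end_time):
        global entry_stime, stime
        if batch_end_time > entry_stime:
            # ENTRY/META ops for this batch never completed; replay them.
            for change in changes:
                print("entry/meta op:", change)
            entry_stime = batch_end_time      # persist the mark here
        # DATA sync always runs; stime advances only after it succeeds.
        for change in changes:
            print("data sync:", change)
        stime = batch_end_time

    # On restart, batches with end time <= entry_stime skip ENTRY/META.
    sync_batch(["CREATE f1", "DATA f1"], batch_end_time=100)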
>
>
> 2. In case of Rsync/Tar failure, do not repeat Entry Operations
> ---------------------------------------------------------------
> In case of Rsync/Tar failures, Changelogs are reprocessed
> again. Instead, re-trigger only the Rsync/Tar job for the list of
> files which failed.
(this is more for my understanding)
I assume that this retry is within the same STIME -> NOW1 period. IOW,
if the re-trigger of the tar/rsync is going to occur in the next sync
interval, then I would assume that ENTRY/META for NOW1 -> NOW would be
repeated, correct? The same is true for the above as well, i.e. all
ENTRY/META operations that are completed between STIME and NOW1 are not
repeated, but events between NOW1 and NOW are, correct?
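For my own clarity, here is a rough sketch of what "retry only the
failed files" could look like, built on plain rsync with --files-from.
The mount path and retry policy are placeholders, not the actual
geo-rep implementation:

    # Sketch only: retry just the pending file list instead of
    # reprocessing the whole changelog batch when rsync fails.
    import subprocess

    def rsync_files(file_list, slave_url, master_mount="/mnt/master",
                    max_retries=3):
        pending = list(file_list)
        for _ in range(max_retries):
            if not pending:
                return []
            proc = subprocess.run(
                ["rsync", "-aR", "--files-from=-", ".", slave_url],
                input="\n".join(pending).encode(),
                cwd=master_mount,
            )
            if proc.returncode == 0:
                return []        # everything synced; changelogs untouched
            # Real geo-rep would narrow 'pending' to the exact failures;
            # here the whole pending list is retried conservatively.
        return pending           # caller re-queues only these files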
>
>
> 3. Better Rsync Queue
> ---------------------
> Currently Geo-rep has a Rsync/Tar queue called PostBox. Sync
> jobs (configurable, default is 3) empty the PostBox and feed it
> to the Rsync/Tar process. The second sync job may not find any items
> to sync, while only the first job may be overloaded. To avoid this,
> introduce a batch size to the PostBox so that each sync job gets an
> equal number of files to sync.
Do you want to consider round-robin of entries to the sync jobs,
something that we did in rebalance, instead of a batch size?
A batch size can again be consumed by a single sync process, and the
next batch by the next one, and so on. Maybe a round-robin distribution
of files to sync from the post-box to each sync thread may help; a
rough sketch is below.
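Something like the following round-robin hand-out is what I had in mind
(names are illustrative only, not the actual PostBox code):

    # Distribute PostBox entries round-robin across N sync jobs instead
    # of handing fixed-size batches to whichever job asks first.
    from itertools import cycle

    def distribute_round_robin(postbox_files, num_sync_jobs=3):
        buckets = [[] for _ in range(num_sync_jobs)]
        for filename, bucket in zip(postbox_files, cycle(buckets)):
            bucket.append(filename)
        return buckets

    # Each sync job now gets roughly an equal share of the files.
    for job_id, files in enumerate(distribute_round_robin(
            ["file%d" % i for i in range(10)])):
        print("sync job", job_id, "->", files)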
>
>
> 4. Handling the Tracebacks
> --------------------------
> Collect the list of Tracebacks which are not yet handled, and look for
> the possibility of handling them at run time. With this, worker
> crashes will be minimized so that we can avoid re-initialization and
> changelog reprocessing efforts.
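On this one: some of the known failure modes could perhaps be handled
in place rather than crashing the worker. A rough sketch of what I mean
(the exception types here are just examples, not an audit of the actual
tracebacks):

    # Sketch: treat known, recoverable errors as no-ops instead of
    # letting them crash the worker and force a changelog reprocess.
    import errno
    import logging

    def safe_entry_op(op):
        try:
            op()
        except OSError as e:
            if e.errno in (errno.ENOENT, errno.EEXIST):
                # e.g. the file was already created/removed by an earlier
                # (partially replayed) run; safe to skip.
                logging.warning("ignoring recoverable error: %s", e)
            else:
                raise   # unknown errors still surface and get collected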
>
>
> 5. SSH failure handling
> -----------------------
> If a Slave node goes down, the Master worker connected to it will go
> Faulty and restart. If we can handle SSH failures intelligently, we
> can re-establish the SSH connection instead of restarting the Geo-rep
> worker. With this change, the Active/Passive switch for network
> failures can be avoided.
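If it helps, re-establishing the connection could be as simple as a
bounded retry loop before declaring the worker Faulty. A sketch, with
placeholder host/command handling rather than the actual geo-rep SSH
code:

    # Sketch: probe the slave over SSH with backoff before falling back
    # to the Faulty/restart path.
    import subprocess
    import time

    def ssh_alive(slave_host, timeout=10):
        return subprocess.call(
            ["ssh", "-o", "ConnectTimeout=%d" % timeout, slave_host, "true"]
        ) == 0

    def reconnect_ssh(slave_host, attempts=5, delay=10):
        for attempt in range(attempts):
            if ssh_alive(slave_host):
                return True              # keep the worker, no restart
            time.sleep(delay * (attempt + 1))
        return False                     # only now go Faulty and restart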
>
>
> 6. On Worker restart, Utilizing Changelogs which are in .processing
> directory
> --------------------------------------------------------------------
> On Worker restart, the start time for Geo-rep is the previously
> updated stime. Geo-rep re-parses the Changelogs from the Brick backend
> into the Working directory, even though those changelogs were parsed
> previously but the stime was not updated due to failures in sync.
>
> 1. On Geo-rep restart, delete all files in .processing/cache and
>    move all the changelogs available in the .processing directory to
>    .processing/cache
> 2. In the Changelog API, look for the Changelog file name in the cache
>    before parsing it.
> 3. If available in the cache, move it back to .processing
> 4. Else, parse it and generate the parsed changelog in .processing
>
I did not understand the above fully, but that's probably just me as I
am not fully aware of the changelog process yet :)
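Trying to restate it for myself: on restart, an already-parsed
changelog is picked up from .processing/cache instead of being
re-parsed from the brick. A rough sketch of that lookup (directory
names follow the description above; the parsing step is stubbed):

    # Sketch: reuse parsed changelogs cached from the previous run.
    import os
    import shutil

    def get_parsed_changelog(name, workdir):
        processing = os.path.join(workdir, ".processing")
        cache = os.path.join(processing, "cache")
        os.makedirs(cache, exist_ok=True)
        target = os.path.join(processing, name)
        cached = os.path.join(cache, name)
        if os.path.exists(cached):
            # Steps 2-3: already parsed before the restart; move it back.
            shutil.move(cached, target)
        else:
            # Step 4: not in cache; parse from the brick backend (stubbed).
            with open(target, "w") as f:
                f.write("parsed changelog placeholder\n")
        return target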