[Gluster-devel] Performance improvements to Gluster Geo-replication

Tue Sep 1 08:55:15 UTC 2015

Thanks Shyam for your inputs.

regards
Aravinda

On 08/31/2015 07:17 PM, Shyam wrote:
> On 08/31/2015 03:17 AM, Aravinda wrote:
>> Following Changes/ideas identified to improve the Geo-replication
>> Performance. Please add your ideas/issues to the list
>>
>> 1. Entry stime and Data/Meta stime
>> ----------------------------------
>> Now we use only one xattr to maintain the state of sync, called
>> stime. When a Geo-replication worker restarts, it starts from that
>> stime and sync files.
>>
>>      get_changes from <STIME> to <CURRENT TIME>
>>          perform <ENTRY> operations
>>          perform <META> operations
>>          perform <DATA> operations
>>
>> If data operation is failed worker crashes and restarts and reprocess
>> the changelogs again. Entry, Meta and Data operations will be
>> retried. If we maintain entry_stime seperately then we can avoid
>> reprocessing of entry operations which are completed previously.
>
> This seems like a good thing to do.
>
> Here is something more that could be done (I am not well aware of 
> geo-rep internals so maybe this cannot be done),
> - Why not maintain a 'mark', till which even ENTRY/META operations are 
> performed, so that even when failures occur in ENTRY/META operation 
> queue, we need to restart from the mark and not all the way from the 
> beginning STIME.
Changelogs has to be processed from STIME because it will have both 
ENTRY and META, but execution of ENTRY will be ignored if entry_stime is 
ahead of STIME.

>
> I am not sure where such a 'mark' can be maintained, unless the 
> processed get_changes are ordered and written to disk, or ordered 
> idempotently in memory each time.
STIME is maintained as xattr in Master brick root, we can maintain one 
more xattr entry_stime.
>
>>
>>
>> 2. In case of Rsync/Tar failure, do not repeat Entry Operations
>> ---------------------------------------------------------------
>> In case of Rsync/Tar failures, Changelogs are reprocessed
>> again. Instead re trigger only Rsync/Tar job for those list of files
>> which are failed.
>
> (this is more for my understanding)
> I assume that this retry is within the same STIME -> NOW1 period. IOW, 
> if the re-trigger of the tar/rsync is going to occur in the next sync 
> interval, then I would assume that ENTRY/META for NOW1 -> NOW would be 
> repeated, correct? The same is true for the above as well, i.e all 
> ENTRY/META operations that are completed between STIME and NOW1 is not 
> repeated, but events between NOW1 to NOW is, correct?
Syncing files is two step operation. Entry creation with same GFID using 
RPC and Sync Data using Rsync. There is a issue with existing code, 
Entry operations also gets repeated when only data(rsync) failed. (STIME 
-> NOW1)
>
>>
>>
>> 3. Better Rsync Queue
>> ---------------------
>> Now Geo-rep has a Rsync/Tar queue called PostBox. Sync
>> jobs(configurable, default is 3) will empty the Post Box and feeds it
>> to Rsync/Tar process. Second sync job may not find any items to sync,
>> only first job may overloaded. To avoid this, introduce a batch size
>> to PostBox so that each sync jobs gets equal number of files to sync.
>
> Do you want to consider round-robin of entries to the sync jobs, 
> something that we did in rebalance, instead of a batch size?
>
> A batch size can again be consumed by a single sync process, and the 
> next batch by the next one so on. Maybe a round-robin distribution of 
> files to sync from the post-box to each sync thread may help.
Looks like good idea. We need to maintain N number of queues for N sync 
jobs, while adding entry to post box distribute to N queues. Is that right?
>
>>
>>
>> 4. Handling the Tracebacks
>> --------------------------
>> Collect the list of Tracebacks which are not yet handled, and look for
>> posibility of handling it in run time. With this, workers crash will
>> be minimized so that we can avoid initializing and changelogs
>> reprocess efforts.
>>
>>
>> 5. SSH failure handling
>> -----------------------
>> If Slave node goes down, the Master worker connected to it will go to
>> Faulty and restarts. If we can handle SSH failures intelligently, we
>> can reestablish the SSH connection instead of restarting Geo-rep
>> worker. With this change, Active/Passive switch for Network failures
>> can be avoided.
>>
>>
>> 6. On Worker restart, Utilizing Changelogs which are in .processing
>> directory
>> --------------------------------------------------------------------
>> On Worker restart, Start time for Geo-rep is previously updated
>> stime. Geo-rep re-parses the Changelogs from Brick backend to Working
>> directory even though those changelogs parsed previously but stime is
>> not updated due to failures in sync.
>>
>>      1. On Geo-rep restart, Delete all files in .processing/cache and
>>      move all the changelogs available in .processing directory to
>>      .processing/cache
>>      2. In Changelog API, look for Changelog file name in cache before
>>      parsing it.
>>      3. If available in cache, move it to .processing
>>      4. else parse it and generate parsed changelog in .processing
>>
>
> I did not understand the above, but that's probably just me as I am 
> not fully aware of change log process yet :)
To consume the backend Changelogs, Geo-rep registers to Changelog API by 
specifying a working directory. Changelog API will parse the backend 
changelog to specific format understood by Geo-rep and copies to Working 
directory. Geo-rep will consume Changelogs from working directory. In 
each iteration "BACKEND CHANGELOGS -> PARSE TO WORKING DIR -> CONSUME"

During the parse process, Changelog API maintains three directory in 
working directory ".processing", ".processed" and ".current".
.current -> Changelogs before parse
.processing -> Changelogs parsed but not yet consumed by Geo-rep
.processed -> Changelogs consumed and Synced by Geo-rep

If Geo-rep worker restarts, we cleanup .processing directory to prevent 
picking up unexpected changelogs by Geo-rep. So "BACKEND CHANGELOGS -> 
PARSE TO WORKING DIR" is repeated even though parsed data is available 
from previous run.

While replying to this, got another idea to simplify the Changelogs 
processing.

- Do not parse/maintain Changelogs in Working directory, instead just 
maintain the list of Changelog files.
- Expose new Changelog API to parse the changelog. 
libgfchangelog.parse(FILE_PATH, CALLBACK)
- Modify Geo-rep to use this new API when it needs to parse a CHANGELOG 
file.

With this approach, on worker restart only the list of changelog files 
is lost which can be easily regenerated compared to re-parsing Changelogs.