[Gluster-devel] Improving Geo-replication Status and Checkpoints
Aravinda
avishwan at redhat.com
Mon Feb 2 11:21:06 UTC 2015
Thanks Sahina, replied inline.
--
regards
Aravinda
On 02/02/2015 12:55 PM, Sahina Bose wrote:
>
> On 01/28/2015 04:07 PM, Aravinda wrote:
>> Background
>> ----------
>> We have `status` and `status detail` commands for GlusterFS
>> geo-replication. This mail proposes fixes for the existing issues in
>> these command outputs. Let us know if we need any other columns that
>> would help users get a meaningful status.
>>
>> Existing output
>> ---------------
>> Status command output
>> MASTER NODE - Master node hostname/IP
>> MASTER VOL - Master volume name
>> MASTER BRICK - Master brick path
>> SLAVE - Slave host and Volume name(HOST::VOL format)
>> STATUS - Stable/Faulty/Active/Passive/Stopped/Not Started
>> CHECKPOINT STATUS - Details about Checkpoint completion
>> CRAWL STATUS - Hybrid/History/Changelog
>>
>> Status detail command output
>> MASTER NODE - Master node hostname/IP
>> MASTER VOL - Master volume name
>> MASTER BRICK - Master brick path
>> SLAVE - Slave host and Volume name(HOST::VOL format)
>> STATUS - Stable/Faulty/Active/Passive/Stopped/Not Started
>> CHECKPOINT STATUS - Details about Checkpoint completion
>> CRAWL STATUS - Hybrid/History/Changelog
>> FILES SYNCD - Number of Files Synced
>> FILES PENDING - Number of Files Pending
>> BYTES PENDING - Bytes pending
>> DELETES PENDING - Number of Deletes Pending
>> FILES SKIPPED - Number of Files skipped
>>
>>
>> Issues with existing status and status detail:
>> ----------------------------------------------
>>
>> 1. Active/Passive and Stable/Faulty statuses are mixed up - the same
>> column is used to show the Active/Passive state as well as the
>> Stable/Faulty state. If an Active node goes faulty, it is difficult
>> to tell from the status whether the Active node or the Passive one
>> is faulty.
>> 2. No info about the last synced time - unless a checkpoint is set,
>> it is difficult to know up to what time data has been synced to the
>> slave. For example, if an admin wants to know whether all files
>> created 15 minutes ago have been synced, that is not possible
>> without setting a checkpoint.
>> 3. Wrong values in the metrics columns.
>> 4. When multiple bricks are present on the same node, status shows
>> Faulty even if only one of the workers on that node is faulty.
>>
>> Changes:
>> --------
>> 1. Active nodes will be prefixed with * to identify them as active
>> nodes. (In XML output an "active" tag will be introduced with
>> values 0 or 1.)
>> 2. A new column will show the last synced time, which reduces the
>> need for the checkpoint feature. Checkpoint status will be shown
>> only in status detail.
>> 3. The Checkpoint Status column is removed; a separate checkpoint
>> command will be added to the gluster CLI. (With this change we can
>> introduce a multiple-checkpoints feature.)
>> 4. Status values will be "Not
>> Started/Initializing/Started/Faulty/Stopped". "Stable" is renamed
>> to "Started".
>> 5. A Slave User column will be introduced to show the user to which
>> the geo-rep session is established. (Useful for non-root geo-rep.)
>> 6. The Bytes Pending column will be removed. It is not possible to
>> determine the delta without simulating a sync. For example, we use
>> rsync to sync data from master to slave; to know how much data is
>> to be transferred, we would have to run rsync with the --dry-run
>> flag before running the actual command. With tar-ssh we would have
>> to stat all the files identified for syncing to calculate the
>> total bytes to be synced. Both are costly operations which degrade
>> geo-rep performance (see the sketch after this list). In future we
>> can include these columns.
>> 7. Files Pending, Files Synced and Deletes Pending are only session
>> information of the worker; these numbers will not match the number
>> of files present in the filesystem. If a worker restarts, the
>> counters reset to zero. Before resetting them, the worker logs the
>> previous session's stats.
>> 8. Files Skipped is a persistent status across sessions and shows
>> the exact count of files skipped. (The list of skipped GFIDs can be
>> obtained from the log file.)
>> 9. Can the "Deletes Pending" column be removed?
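>>
>> To illustrate the cost in change 6, a rough sketch of what computing
>> "bytes pending" would require (the brick path, slave host/path and
>> the pending-files.lst file list are illustrative, not part of
>> geo-rep):
>>
>>     # rsync: a full dry run is needed just to learn the transfer
>>     # size before the real sync can start.
>>     rsync -a --dry-run --stats /gfs/b1/ root@fedoravm3:/slave/brick/ \
>>         | grep 'Total transferred file size'
>>
>>     # tar-ssh: every pending file must be stat()ed to sum up sizes.
>>     xargs stat --format='%s' < pending-files.lst \
>>         | awk '{total += $1} END {print total " bytes pending"}'
>>
>> Either way every pending file is touched once more, which is why the
>> column is dropped for now.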
>
> Is there any way to know if there are errors syncing any of the files?
> Which column would that reflect in?
"Skipped" Column shows number of files failed to sync to Slave.
> Is the last synced time - the least of the synced time across the nodes?
Status output will have one entry for each brick, so we are planning to
display the last synced time of that brick.
>
>
>>
>> Example output
>>
>> MASTER NODE    MASTER VOL    MASTER BRICK    SLAVE USER    SLAVE             STATUS     LAST SYNCED            CRAWL
>> ---------------------------------------------------------------------------------------------------------------------
>> * fedoravm1    gvm           /gfs/b1         root          fedoravm3::gvs    Started    2014-05-10 03:07 pm    Changelog
>>   fedoravm2    gvm           /gfs/b2         root          fedoravm4::gvs    Started    2014-05-10 03:07 pm    Changelog
>>
>> New Status columns
>>
>> ACTIVE_PASSIVE - * if Active else none.
>> MASTER NODE - Master node hostname/IP
>> MASTER VOL - Master volume name
>> MASTER BRICK - Master brick path
>> SLAVE USER - Slave user to which geo-rep is established.
>> SLAVE - Slave host and Volume name(HOST::VOL format)
>> STATUS - Not Started/Initializing/Started/Faulty/Stopped
>> LAST SYNCED - Last synced time(Based on stime xattr)
>> CHECKPOINT STATUS - Details about Checkpoint completion
>> CRAWL STATUS - Hybrid/History/Changelog
>> FILES SYNCD - Number of Files Synced
>> FILES PENDING - Number of Files Pending
>> DELETES PENDING - Number of Deletes Pending
>> FILES SKIPPED - Number of Files skipped
>>
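>> Since LAST SYNCED is derived from the stime xattr, it can also be
>> inspected directly on the brick root. The xattr key pattern is
>> internal and shown here only as an assumption:
>>
>>     # dump any stime xattrs stored on the brick root (illustrative)
>>     getfattr -d -m '.*stime' -e hex /gfs/b1
>>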
>>
>> XML output
>> active
>> master_node
>> master_node_uuid
>> master_brick
>> slave_user
>> slave
>> status
>> last_synced
>> crawl_status
>> files_syncd
>> files_pending
>> deletes_pending
>> files_skipped
>>
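>> For illustration, a filled-in entry for one brick might look like
>> the following (the <brick> wrapper element and the counter values
>> are hypothetical; the field names and remaining values are taken
>> from the examples above):
>>
>>     <brick>
>>       <active>1</active>
>>       <master_node>fedoravm1</master_node>
>>       <master_node_uuid>...</master_node_uuid>
>>       <master_brick>/gfs/b1</master_brick>
>>       <slave_user>root</slave_user>
>>       <slave>fedoravm3::gvs</slave>
>>       <status>Started</status>
>>       <last_synced>2014-05-10 03:07 pm</last_synced>
>>       <crawl_status>Changelog</crawl_status>
>>       <files_syncd>128</files_syncd>
>>       <files_pending>0</files_pending>
>>       <deletes_pending>0</deletes_pending>
>>       <files_skipped>2</files_skipped>
>>     </brick>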
>>
>> Checkpoints
>> ===========
>> New set of Gluster CLI commands will be introduced for Checkpoints.
>>
>> gluster volume geo-replication <VOLNAME> <SLAVEHOST>::<SLAVEVOL> checkpoint create <NAME> <DATE>
>> gluster volume geo-replication <VOLNAME> <SLAVEHOST>::<SLAVEVOL> checkpoint delete <NAME>
>> gluster volume geo-replication <VOLNAME> <SLAVEHOST>::<SLAVEVOL> checkpoint delete all
>> gluster volume geo-replication <VOLNAME> <SLAVEHOST>::<SLAVEVOL> checkpoint status [<NAME>]
>> gluster volume geo-replication <VOLNAME> checkpoint status   # For all geo-rep sessions of that volume
>> gluster volume geo-replication checkpoint status             # For all geo-rep sessions of all volumes
>>
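>> As a usage sketch, assuming <DATE> accepts a "YYYY-MM-DD HH:MM:SS"
>> timestamp (the volume, slave and checkpoint names are taken from the
>> status example below):
>>
>>     # Create a checkpoint for the gvm -> fedoravm3::gvs session
>>     gluster volume geo-replication gvm fedoravm3::gvs \
>>         checkpoint create Chk1 "2014-11-30 23:30:00"
>>
>>     # Check whether it has completed, then remove it
>>     gluster volume geo-replication gvm fedoravm3::gvs \
>>         checkpoint status Chk1
>>     gluster volume geo-replication gvm fedoravm3::gvs \
>>         checkpoint delete Chk1
>>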
>>
>> Checkpoint Status:
>>
>> SESSION                      NAME    COMPLETED    CHECKPOINT TIME        COMPLETION TIME
>> -----------------------------------------------------------------------------------------
>> gvm->root@fedoravm3::gvs     Chk1    Yes          2014-11-30 11:30 pm    2014-12-01 02:30 pm
>> gvm->root@fedoravm3::gvs     Chk2    No           2014-12-01 10:00 pm    N/A
>
> Can the time information have the timezone information as well? Or is
> this UTC time?
> (Same comment for last synced time)
Sure. Will have UTC time in Status output.
>
>>
>> XML output:
>> session
>> master_uuid
>> name
>> completed
>> checkpoint_time
>> completion_time
>>
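>> A filled-in entry might look like this (the <checkpoint> wrapper and
>> the 0/1 encoding of "completed" are assumptions; the values are from
>> the table above):
>>
>>     <checkpoint>
>>       <session>gvm->root@fedoravm3::gvs</session>
>>       <master_uuid>...</master_uuid>
>>       <name>Chk1</name>
>>       <completed>1</completed>
>>       <checkpoint_time>2014-11-30 11:30 pm</checkpoint_time>
>>       <completion_time>2014-12-01 02:30 pm</completion_time>
>>     </checkpoint>
>>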
>>
>> --
>> regards
>> Aravinda
>> _______________________________________________
>> Gluster-devel mailing list
>> Gluster-devel at gluster.org
>> http://www.gluster.org/mailman/listinfo/gluster-devel
>