[Gluster-devel] Improving Geo-replication Status and Checkpoints

Sahina Bose sabose at redhat.com
Mon Feb 2 07:25:13 UTC 2015


On 01/28/2015 04:07 PM, Aravinda wrote:
> Background
> ----------
> We have `status` and `status detail` commands for GlusterFS 
> geo-replication. This mail proposes fixes for the existing issues in 
> these command outputs. Let us know if any other columns would help 
> users get a meaningful status.
>
> Existing output
> ---------------
> Status command output
>     MASTER NODE - Master node hostname/IP
>     MASTER VOL - Master volume name
>     MASTER BRICK - Master brick path
>     SLAVE - Slave host and Volume name(HOST::VOL format)
>     STATUS - Stable/Faulty/Active/Passive/Stopped/Not Started
>     CHECKPOINT STATUS - Details about Checkpoint completion
>     CRAWL STATUS - Hybrid/History/Changelog
>
> Status detail -
>     MASTER NODE - Master node hostname/IP
>     MASTER VOL - Master volume name
>     MASTER BRICK - Master brick path
>     SLAVE - Slave host and Volume name(HOST::VOL format)
>     STATUS - Stable/Faulty/Active/Passive/Stopped/Not Started
>     CHECKPOINT STATUS - Details about Checkpoint completion
>     CRAWL STATUS - Hybrid/History/Changelog
>     FILES SYNCD - Number of Files Synced
>     FILES PENDING - Number of Files Pending
>     BYTES PENDING - Bytes pending
>     DELETES PENDING - Number of Deletes Pending
>     FILES SKIPPED - Number of Files skipped
>
>
> Issues with existing status and status detail:
> ----------------------------------------------
>
> 1. Active/Passive and Stable/Faulty statuses are mixed up - the same 
> column shows both the Active/Passive role and the Stable/Faulty 
> state. If an Active node goes faulty, it is difficult to tell from 
> the status output whether the faulty node is the Active one or a 
> Passive one.
> 2. No info about last synced time - unless we set a checkpoint it is 
> difficult to know up to what time data has been synced to the slave. 
> For example, if an admin wants to know whether all files created 15 
> minutes ago have been synced, it is not possible without setting a 
> checkpoint.
> 3. Wrong values in metrics.
> 4. When multiple bricks are present in the same node, Status shows 
> Faulty if any one of the workers in that node is faulty.
>
> Changes:
> --------
> 1. Active nodes will be prefixed with * to identify them as active. 
> (In XML output, an active tag will be introduced with values 0 or 1.)
> 2. A new column will show the last synced time, which reduces the 
> need for the checkpoint feature. Checkpoint status will be shown only 
> in status detail.
> 3. Checkpoint Status is removed; a separate checkpoint command will 
> be added to the gluster CLI. (With this change we can introduce a 
> multiple-checkpoints feature.)
> 4. Status values will be "Not 
> Started/Initializing/Started/Faulty/Stopped". "Stable" is renamed to 
> "Started".
> 5. A Slave User column will be introduced to show the user to which 
> the geo-rep session is established. (Useful in non-root geo-rep.)
> 6. The Bytes Pending column will be removed. It is not possible to 
> identify the delta without simulating a sync. For example, we use 
> rsync to sync data from master to slave; to know how much data is to 
> be transferred, we would have to run the rsync command with the 
> --dry-run flag before running the actual command. With tar-ssh we 
> would have to stat all the files identified for syncing to calculate 
> the total bytes to be synced. Both are costly operations that degrade 
> geo-rep performance. (In future we can reintroduce these columns.)
> 7. Files Pending, Files Synced, and Deletes Pending are only session 
> information of the worker; these numbers will not match the number of 
> files present in the filesystem. If a worker restarts, the counters 
> reset to zero. Before resetting, the worker logs the previous 
> session's stats.
> 8. Files Skipped is persistent across sessions and shows the exact 
> count of files skipped. (The list of skipped GFIDs can be obtained 
> from the log file.)
> 9. "Deletes Pending" column can be removed?
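The cost argument in point 6 can be illustrated with a small sketch: even the cheaper tar-ssh route needs one stat() per pending file just to estimate the transfer size. The helper and file list below are hypothetical, not GlusterFS code.

```python
import os

def estimate_pending_bytes(pending_paths):
    """Sum file sizes for a list of files queued for sync.

    One stat() syscall per file: for a large changelog batch this
    extra pass is what would make a BYTES PENDING column expensive.
    """
    total = 0
    for path in pending_paths:
        try:
            total += os.stat(path).st_size
        except OSError:
            # File may have been deleted after the changelog recorded it.
            pass
    return total
```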

Is there any way to know if there were errors syncing any of the files? 
Which column would that be reflected in?
Is the last synced time the least of the synced times across the nodes?
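If a volume-level LAST SYNCED were reported, the conservative aggregate would indeed be the minimum (oldest) of the per-worker synced times, since data is only guaranteed synced up to the slowest active worker. A hypothetical sketch of that aggregation:

```python
from datetime import datetime

def volume_last_synced(worker_times):
    """Oldest per-worker last-synced time = volume-wide guarantee.

    worker_times: list of datetime values (or None), one per active
    worker. Returns None when no worker has synced anything yet.
    """
    times = [t for t in worker_times if t is not None]
    return min(times) if times else None
```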


>
> Example output
>
>     MASTER NODE  MASTER VOL  MASTER BRICK  SLAVE USER  SLAVE           STATUS   LAST SYNCED          CRAWL
>     -------------------------------------------------------------------------------------------------------
>     * fedoravm1  gvm         /gfs/b1       root        fedoravm3::gvs  Started  2014-05-10 03:07 pm  Changelog
>       fedoravm2  gvm         /gfs/b2       root        fedoravm4::gvs  Started  2014-05-10 03:07 pm  Changelog
>
> New Status columns
>
>     ACTIVE_PASSIVE - * if Active else none.
>     MASTER NODE - Master node hostname/IP
>     MASTER VOL - Master volume name
>     MASTER BRICK - Master brick path
>     SLAVE USER - Slave user to which geo-rep is established.
>     SLAVE - Slave host and Volume name(HOST::VOL format)
>     STATUS - Not Started/Initializing/Started/Faulty/Stopped
>     LAST SYNCED - Last synced time(Based on stime xattr)
>     CHECKPOINT STATUS - Details about Checkpoint completion
>     CRAWL STATUS - Hybrid/History/Changelog
>     FILES SYNCD - Number of Files Synced
>     FILES PENDING - Number of Files Pending
>     DELETES PENDING- Number of Deletes Pending
>     FILES SKIPPED - Number of Files skipped
>
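The LAST SYNCED column above is derived from the stime extended attribute on the brick. Assuming the xattr value is a pair of 32-bit big-endian unsigned integers (seconds since the epoch, then nanoseconds) -- an assumption about the on-disk layout, not confirmed here -- decoding it could look like this sketch:

```python
import struct
from datetime import datetime, timezone

def decode_stime(xattr_value):
    """Decode a geo-rep stime xattr value into an aware UTC datetime.

    Assumed layout (not verified here): two 32-bit big-endian
    unsigned ints -- seconds since the epoch, then nanoseconds.
    """
    sec, nsec = struct.unpack("!II", xattr_value)
    return datetime.fromtimestamp(sec + nsec / 1e9, tz=timezone.utc)
```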
>
> XML output
>     active
>     master_node
>     master_node_uuid
>     master_brick
>     slave_user
>     slave
>     status
>     last_synced
>     crawl_status
>     files_syncd
>     files_pending
>     deletes_pending
>     files_skipped
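A consumer of the proposed XML output could extract these fields with the standard library. The `<pair>` wrapper element name in the sample below is an assumption; the field names follow the list above:

```python
import xml.etree.ElementTree as ET

# Hypothetical per-brick fragment of the proposed --xml status output.
SAMPLE = """
<pair>
  <active>1</active>
  <master_node>fedoravm1</master_node>
  <master_brick>/gfs/b1</master_brick>
  <slave_user>root</slave_user>
  <slave>fedoravm3::gvs</slave>
  <status>Started</status>
  <last_synced>2014-05-10 03:07 pm</last_synced>
  <crawl_status>Changelog</crawl_status>
</pair>
"""

def parse_pair(xml_text):
    """Return a dict of per-brick geo-rep status fields."""
    root = ET.fromstring(xml_text)
    fields = {child.tag: child.text for child in root}
    # The proposal encodes active as 0 or 1; expose it as a bool.
    fields["active"] = fields.get("active") == "1"
    return fields
```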
>
>
> Checkpoints
> ===========
> New set of Gluster CLI commands will be introduced for Checkpoints.
>
>     gluster volume geo-replication <VOLNAME> <SLAVEHOST>::<SLAVEVOL> checkpoint create <NAME> <DATE>
>     gluster volume geo-replication <VOLNAME> <SLAVEHOST>::<SLAVEVOL> checkpoint delete <NAME>
>     gluster volume geo-replication <VOLNAME> <SLAVEHOST>::<SLAVEVOL> checkpoint delete all
>     gluster volume geo-replication <VOLNAME> <SLAVEHOST>::<SLAVEVOL> checkpoint status [<NAME>]
>     gluster volume geo-replication <VOLNAME> checkpoint status    # For all geo-rep sessions for that volume
>     gluster volume geo-replication checkpoint status              # For all geo-rep sessions for all volumes
>
>
> Checkpoint Status:
>
>     SESSION                    NAME  COMPLETED  CHECKPOINT TIME      COMPLETION TIME
>     ---------------------------------------------------------------------------------
>     gvm->root at fedoravm3::gvs   Chk1  Yes        2014-11-30 11:30 pm  2014-12-01 02:30 pm
>     gvm->root at fedoravm3::gvs   Chk2  No         2014-12-01 10:00 pm  N/A
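A checkpoint's Completed value reduces to a time comparison: the checkpoint is complete once the last-synced time has reached the checkpoint time. A minimal sketch, with datetimes standing in for the stime values:

```python
from datetime import datetime

def checkpoint_completed(checkpoint_time, last_synced):
    """A checkpoint is complete when everything written up to
    checkpoint_time has reached the slave, i.e. the last-synced
    time is at or past the checkpoint time."""
    return last_synced is not None and last_synced >= checkpoint_time
```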

Can the time information include the timezone as well? Or is this UTC 
time?
(Same comment for the last synced time.)

>
> XML output:
>     session
>     master_uuid
>     name
>     completed
>     checkpoint_time
>     completion_time
>
>
> -- 
> regards
> Aravinda
> _______________________________________________
> Gluster-devel mailing list
> Gluster-devel at gluster.org
> http://www.gluster.org/mailman/listinfo/gluster-devel
