[Gluster-devel] Improving Geo-replication Status and Checkpoints

Aravinda avishwan at redhat.com
Wed Jan 28 10:37:21 UTC 2015


Background
----------
We have `status` and `status detail` commands for GlusterFS 
geo-replication. This mail proposes fixes for the existing issues in 
these command outputs. Let us know if any other columns would help 
users get a meaningful status.

Existing output
---------------
Status command output
     MASTER NODE - Master node hostname/IP
     MASTER VOL - Master volume name
     MASTER BRICK - Master brick path
     SLAVE - Slave host and Volume name (HOST::VOL format)
     STATUS - Stable/Faulty/Active/Passive/Stopped/Not Started
     CHECKPOINT STATUS - Details about Checkpoint completion
     CRAWL STATUS - Hybrid/History/Changelog

Status detail -
     MASTER NODE - Master node hostname/IP
     MASTER VOL - Master volume name
     MASTER BRICK - Master brick path
     SLAVE - Slave host and Volume name (HOST::VOL format)
     STATUS - Stable/Faulty/Active/Passive/Stopped/Not Started
     CHECKPOINT STATUS - Details about Checkpoint completion
     CRAWL STATUS - Hybrid/History/Changelog
     FILES SYNCD - Number of Files Synced
     FILES PENDING - Number of Files Pending
     BYTES PENDING - Bytes pending
     DELETES PENDING - Number of Deletes Pending
     FILES SKIPPED - Number of Files skipped


Issues with existing status and status detail:
----------------------------------------------

1. Active/Passive and Stable/Faulty status are mixed up - the same column 
is used to show both the Active/Passive state and the Stable/Faulty 
state. If a node goes Faulty, it is difficult to tell from the status 
whether the Active node or the Passive one is the faulty one.
2. No info about the last synced time - unless a checkpoint is set, it is 
difficult to know up to what time data has been synced to the slave. For 
example, if an admin wants to know whether all files created 15 minutes 
ago have been synced, that is not possible without setting a checkpoint.
3. Wrong values in the metrics columns.
4. When multiple bricks are present in the same node, the status shows 
Faulty if any one of the workers in that node is faulty.

Changes:
--------
1. Active nodes will be prefixed with * to identify them as active. (In 
the XML output an active tag will be introduced with values 0 or 1.)
2. A new column will show the last synced time, which minimizes the need 
to use the checkpoint feature. Checkpoint status will be shown only in 
status detail.
3. The Checkpoint Status column is removed; separate checkpoint commands 
will be added to the gluster CLI. (With this change we can introduce a 
multiple-checkpoint feature.)
4. Status values will be "Not Started/Initializing/Started/Faulty/Stopped". 
"Stable" is changed to "Started".
5. A Slave User column will be introduced to show the user as which the 
geo-rep session is established. (Useful for non-root geo-rep.)
6. The Bytes Pending column will be removed. It is not possible to 
identify the delta without simulating the sync. For example, we use rsync 
to sync data from master to slave; if we need to know how much data is to 
be transferred, we have to run rsync with the --dry-run flag before 
running the actual command (see the sketch after this list). With tar+ssh 
we have to stat all the files identified for syncing to calculate the 
total bytes to be synced. Both are costly operations which degrade 
geo-rep performance. (In the future we can include these columns.)
7. Files Pending, Files Synced and Deletes Pending are per-session 
information of the worker; these numbers will not match the number of 
files present in the filesystem. If the worker restarts, the counters 
reset to zero; the previous session's stats are logged before the reset.
8. Files Skipped is persistent across sessions and shows the exact count 
of files skipped. (The list of skipped GFIDs can be obtained from the 
log file.)
9. Can the "Deletes Pending" column be removed?

Example output

     MASTER NODE  MASTER VOL  MASTER BRICK  SLAVE USER  SLAVE           STATUS   LAST SYNCED          CRAWL
     -------------------------------------------------------------------------------------------------------
     * fedoravm1  gvm         /gfs/b1       root        fedoravm3::gvs  Started  2014-05-10 03:07 pm  Changelog
       fedoravm2  gvm         /gfs/b2       root        fedoravm4::gvs  Started  2014-05-10 03:07 pm  Changelog

New Status columns

     ACTIVE_PASSIVE - * if Active else none.
     MASTER NODE - Master node hostname/IP
     MASTER VOL - Master volume name
     MASTER BRICK - Master brick path
     SLAVE USER - Slave user as which the geo-rep session is established
     SLAVE - Slave host and Volume name (HOST::VOL format)
     STATUS - Not Started/Initializing/Started/Faulty/Stopped
     LAST SYNCED - Last synced time (based on the stime xattr; see the note after this list)
     CHECKPOINT STATUS - Details about Checkpoint completion
     CRAWL STATUS - Hybrid/History/Changelog
     FILES SYNCD - Number of Files Synced
     FILES PENDING - Number of Files Pending
     DELETES PENDING- Number of Deletes Pending
     FILES SKIPPED - Number of Files skipped
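
Note on LAST SYNCED: the stime value it is derived from can also be 
inspected directly on a master brick root as an extended attribute. A 
rough sketch is below; the actual attribute name encodes the master and 
slave volume UUIDs, so the grep is only illustrative.

     # Run as root on a master brick. The stime xattr records the time up
     # to which changes from this brick have been synced to the slave.
     getfattr -d -m . -e hex /gfs/b1 | grep stime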


XML output
     active
     master_node
     master_node_uuid
     master_brick
     slave_user
     slave
     status
     last_synced
     crawl_status
     files_syncd
     files_pending
     deletes_pending
     files_skipped


Checkpoints
===========
A new set of Gluster CLI commands will be introduced for checkpoints.

     gluster volume geo-replication <VOLNAME> <SLAVEHOST>::<SLAVEVOL> checkpoint create <NAME> <DATE>
     gluster volume geo-replication <VOLNAME> <SLAVEHOST>::<SLAVEVOL> checkpoint delete <NAME>
     gluster volume geo-replication <VOLNAME> <SLAVEHOST>::<SLAVEVOL> checkpoint delete all
     gluster volume geo-replication <VOLNAME> <SLAVEHOST>::<SLAVEVOL> checkpoint status [<NAME>]
     gluster volume geo-replication <VOLNAME> checkpoint status    # For all geo-rep sessions of that volume
     gluster volume geo-replication checkpoint status              # For all geo-rep sessions of all volumes
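
Assuming the commands are implemented as proposed, a typical workflow for 
the session from the status example might look like the following. The 
<DATE> format is not specified in this mail; the value below simply 
mirrors the times shown in the checkpoint status example.

     # Create a named checkpoint for the gvm -> fedoravm3::gvs session
     gluster volume geo-replication gvm fedoravm3::gvs checkpoint create Chk1 "2014-11-30 11:30 pm"

     # Check one checkpoint, then all checkpoints of the session
     gluster volume geo-replication gvm fedoravm3::gvs checkpoint status Chk1
     gluster volume geo-replication gvm fedoravm3::gvs checkpoint status

     # Remove the checkpoint once it has served its purpose
     gluster volume geo-replication gvm fedoravm3::gvs checkpoint delete Chk1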


Checkpoint Status:

     SESSION                    NAME    Completed    Checkpoint Time       Completion Time
     ---------------------------------------------------------------------------------------
     gvm->root@fedoravm3::gvs   Chk1    Yes          2014-11-30 11:30 pm   2014-12-01 02:30 pm
     gvm->root@fedoravm3::gvs   Chk2    No           2014-12-01 10:00 pm   N/A

XML output:
     session
     master_uuid
     name
     completed
     checkpoint_time
     completion_time


--
regards
Aravinda

