[Gluster-devel] Improving Geo-replication Status and Checkpoints
Aravinda
avishwan at redhat.com
Mon Feb 2 11:21:06 UTC 2015
Thanks Sahina, replied inline.
--
regards
Aravinda
On 02/02/2015 12:55 PM, Sahina Bose wrote:
>
> On 01/28/2015 04:07 PM, Aravinda wrote:
>> Background
>> ----------
>> We have `status` and `status detail` commands for GlusterFS
>> geo-replication. This mail proposes fixes for the existing issues in
>> these command outputs. Let us know if we need any other columns that
>> would help users get a meaningful status.
>>
>> Existing output
>> ---------------
>> Status command output
>> MASTER NODE - Master node hostname/IP
>> MASTER VOL - Master volume name
>> MASTER BRICK - Master brick path
>> SLAVE - Slave host and Volume name(HOST::VOL format)
>> STATUS - Stable/Faulty/Active/Passive/Stopped/Not Started
>> CHECKPOINT STATUS - Details about Checkpoint completion
>> CRAWL STATUS - Hybrid/History/Changelog
>>
>> Status detail command output
>> MASTER NODE - Master node hostname/IP
>> MASTER VOL - Master volume name
>> MASTER BRICK - Master brick path
>> SLAVE - Slave host and Volume name(HOST::VOL format)
>> STATUS - Stable/Faulty/Active/Passive/Stopped/Not Started
>> CHECKPOINT STATUS - Details about Checkpoint completion
>> CRAWL STATUS - Hybrid/History/Changelog
>> FILES SYNCD - Number of Files Synced
>> FILES PENDING - Number of Files Pending
>> BYTES PENDING - Bytes pending
>> DELETES PENDING - Number of Deletes Pending
>> FILES SKIPPED - Number of Files skipped
>>
>>
>> Issues with existing status and status detail:
>> ----------------------------------------------
>>
>> 1. Active/Passive and Stable/Faulty statuses are mixed up - the same
>> column is used to show the Active/Passive state as well as the
>> Stable/Faulty state. If an Active node goes faulty, it is difficult
>> to tell from the status whether the Active node or the Passive one
>> is faulty.
>> 2. No info about the last synced time - unless a checkpoint is set,
>> it is difficult to know up to what time data has been synced to the
>> slave. For example, if an admin wants to know whether all files
>> created 15 minutes ago have been synced, that is not possible
>> without setting a checkpoint.
>> 3. Wrong values in the metrics columns.
>> 4. When multiple bricks are present on the same node, status shows
>> Faulty even if only one of the workers on that node is faulty.
>>
>> Changes:
>> --------
>> 1. Active nodes will be prefixed with * to identify them as active
>> nodes. (In XML output an "active" tag will be introduced with
>> values 0 or 1.)
>> 2. A new column will show the last synced time, which reduces the
>> need for the checkpoint feature. Checkpoint status will be shown
>> only in status detail.
>> 3. The Checkpoint Status column is removed; a separate checkpoint
>> command will be added to the gluster CLI. (With this change we can
>> introduce a multiple-checkpoints feature.)
>> 4. Status values will be "Not
>> Started/Initializing/Started/Faulty/Stopped". "Stable" is renamed
>> to "Started".
>> 5. A Slave User column will be introduced to show the user to which
>> the geo-rep session is established. (Useful for non-root geo-rep.)
>> 6. The Bytes Pending column will be removed. It is not possible to
>> determine the delta without simulating a sync. For example, we use
>> rsync to sync data from master to slave; to know how much data is
>> to be transferred, we would have to run rsync with the --dry-run
>> flag before running the actual command. With tar-ssh we would have
>> to stat all the files identified for syncing to calculate the
>> total bytes to be synced. Both are costly operations which degrade
>> geo-rep performance (see the sketch after this list). In future we
>> can include these columns.
>> 7. Files Pending, Files Synced and Deletes Pending are only session
>> information of the worker; these numbers will not match the number
>> of files present in the filesystem. If a worker restarts, the
>> counters reset to zero. Before resetting them, the worker logs the
>> previous session's stats.
>> 8. Files Skipped is a persistent status across sessions and shows
>> the exact count of files skipped. (The list of skipped GFIDs can be
>> obtained from the log file.)
>> 9. Can the "Deletes Pending" column be removed?
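>>
>> To illustrate the cost in change 6, a rough sketch of what computing
>> "bytes pending" would require (the brick path, slave host/path and
>> the pending-files.lst file list are illustrative, not part of
>> geo-rep):
>>
>>     # rsync: a full dry run is needed just to learn the transfer
>>     # size before the real sync can start.
>>     rsync -a --dry-run --stats /gfs/b1/ root@fedoravm3:/slave/brick/ \
>>         | grep 'Total transferred file size'
>>
>>     # tar-ssh: every pending file must be stat()ed to sum up sizes.
>>     xargs stat --format='%s' < pending-files.lst \
>>         | awk '{total += $1} END {print total " bytes pending"}'
>>
>> Either way every pending file is touched once more, which is why the
>> column is dropped for now.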
>
> Is there any way to know if there are errors syncing any of the files?
> Which column would that reflect in?
"Skipped" Column shows number of files failed to sync to Slave.
> Is the last synced time - the least of the synced time across the nodes?
Status output will have one entry for each brick, so we are planning to
display the last synced time of that brick.
>
>
>>
>> Example output
>>
>> MASTER NODE    MASTER VOL    MASTER BRICK    SLAVE USER    SLAVE             STATUS     LAST SYNCED            CRAWL
>> ---------------------------------------------------------------------------------------------------------------------
>> * fedoravm1    gvm           /gfs/b1         root          fedoravm3::gvs    Started    2014-05-10 03:07 pm    Changelog
>>   fedoravm2    gvm           /gfs/b2         root          fedoravm4::gvs    Started    2014-05-10 03:07 pm    Changelog
>>
>> New Status columns
>>
>> ACTIVE_PASSIVE - * if Active else none.
>> MASTER NODE - Master node hostname/IP
>> MASTER VOL - Master volume name
>> MASTER BRICK - Master brick path
>> SLAVE USER - Slave user to which geo-rep is established.
>> SLAVE - Slave host and Volume name(HOST::VOL format)
>> STATUS - Not Started/Initializing/Started/Faulty/Stopped
>> LAST SYNCED - Last synced time(Based on stime xattr)
>> CHECKPOINT STATUS - Details about Checkpoint completion
>> CRAWL STATUS - Hybrid/History/Changelog
>> FILES SYNCD - Number of Files Synced
>> FILES PENDING - Number of Files Pending
>> DELETES PENDING - Number of Deletes Pending
>> FILES SKIPPED - Number of Files skipped
>>
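>> Since LAST SYNCED is derived from the stime xattr, it can also be
>> inspected directly on the brick root. The xattr key pattern is
>> internal and shown here only as an assumption:
>>
>>     # dump any stime xattrs stored on the brick root (illustrative)
>>     getfattr -d -m '.*stime' -e hex /gfs/b1
>>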
>>
>> XML output
>> active
>> master_node
>> master_node_uuid
>> master_brick
>> slave_user
>> slave
>> status
>> last_synced
>> crawl_status
>> files_syncd
>> files_pending
>> deletes_pending
>> files_skipped
>>
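>> For illustration, a filled-in entry for one brick might look like
>> the following (the <brick> wrapper element and the counter values
>> are hypothetical; the field names and remaining values are taken
>> from the examples above):
>>
>>     <brick>
>>       <active>1</active>
>>       <master_node>fedoravm1</master_node>
>>       <master_node_uuid>...</master_node_uuid>
>>       <master_brick>/gfs/b1</master_brick>
>>       <slave_user>root</slave_user>
>>       <slave>fedoravm3::gvs</slave>
>>       <status>Started</status>
>>       <last_synced>2014-05-10 03:07 pm</last_synced>
>>       <crawl_status>Changelog</crawl_status>
>>       <files_syncd>128</files_syncd>
>>       <files_pending>0</files_pending>
>>       <deletes_pending>0</deletes_pending>
>>       <files_skipped>2</files_skipped>
>>     </brick>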
>>
>> Checkpoints
>> ===========
>> New set of Gluster CLI commands will be introduced for Checkpoints.
>>
>> gluster volume geo-replication <VOLNAME> <SLAVEHOST>::<SLAVEVOL> checkpoint create <NAME> <DATE>
>> gluster volume geo-replication <VOLNAME> <SLAVEHOST>::<SLAVEVOL> checkpoint delete <NAME>
>> gluster volume geo-replication <VOLNAME> <SLAVEHOST>::<SLAVEVOL> checkpoint delete all
>> gluster volume geo-replication <VOLNAME> <SLAVEHOST>::<SLAVEVOL> checkpoint status [<NAME>]
>> gluster volume geo-replication <VOLNAME> checkpoint status   # For all geo-rep sessions of that volume
>> gluster volume geo-replication checkpoint status             # For all geo-rep sessions of all volumes
>>
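>> As a usage sketch, assuming <DATE> accepts a "YYYY-MM-DD HH:MM:SS"
>> timestamp (the volume, slave and checkpoint names are taken from the
>> status example below):
>>
>>     # Create a checkpoint for the gvm -> fedoravm3::gvs session
>>     gluster volume geo-replication gvm fedoravm3::gvs \
>>         checkpoint create Chk1 "2014-11-30 23:30:00"
>>
>>     # Check whether it has completed, then remove it
>>     gluster volume geo-replication gvm fedoravm3::gvs \
>>         checkpoint status Chk1
>>     gluster volume geo-replication gvm fedoravm3::gvs \
>>         checkpoint delete Chk1
>>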
>>
>> Checkpoint Status:
>>
>> SESSION                      NAME    COMPLETED    CHECKPOINT TIME        COMPLETION TIME
>> -----------------------------------------------------------------------------------------
>> gvm->root@fedoravm3::gvs     Chk1    Yes          2014-11-30 11:30 pm    2014-12-01 02:30 pm
>> gvm->root@fedoravm3::gvs     Chk2    No           2014-12-01 10:00 pm    N/A
>
> Can the time information have the timezone information as well? Or is
> this UTC time?
> (Same comment for last synced time)
Sure. Will have UTC time in Status output.
>
>>
>> XML output:
>> session
>> master_uuid
>> name
>> completed
>> checkpoint_time
>> completion_time
>>
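>> A filled-in entry might look like this (the <checkpoint> wrapper and
>> the 0/1 encoding of "completed" are assumptions; the values are from
>> the table above):
>>
>>     <checkpoint>
>>       <session>gvm->root@fedoravm3::gvs</session>
>>       <master_uuid>...</master_uuid>
>>       <name>Chk1</name>
>>       <completed>1</completed>
>>       <checkpoint_time>2014-11-30 11:30 pm</checkpoint_time>
>>       <completion_time>2014-12-01 02:30 pm</completion_time>
>>     </checkpoint>
>>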
>>
>> --
>> regards
>> Aravinda
>> _______________________________________________
>> Gluster-devel mailing list
>> Gluster-devel at gluster.org
>> http://www.gluster.org/mailman/listinfo/gluster-devel
>