[Bugs] [Bug 1500853] [geo-rep]: Incorrect last sync "0" during history crawl after upgrade/stop-start

bugzilla at redhat.com bugzilla at redhat.com
Wed Oct 11 15:18:06 UTC 2017


https://bugzilla.redhat.com/show_bug.cgi?id=1500853

Kotresh HR <khiremat at redhat.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |ASSIGNED
           Assignee|bugs at gluster.org            |khiremat at redhat.com



--- Comment #1 from Kotresh HR <khiremat at redhat.com> ---
Description of problem:
=======================

Observed a scenario where the last sync became zero during history crawl after
an upgrade/reboot. Before the upgrade started, the crawl status was "Changelog
Crawl" with a last sync time of "2017-07-21 12:51:55". However, after the
upgrade and restarting geo-rep, the last sync for a few workers was shown as
"0". The corresponding brick status files also show "0".

[root@dhcp42-79 ~]# gluster volume geo-replication master 10.70.41.209::slave status

MASTER NODE     MASTER VOL    MASTER BRICK       SLAVE USER    SLAVE                  SLAVE NODE      STATUS     CRAWL STATUS     LAST_SYNCED
----------------------------------------------------------------------------------------------------------------------------------------------------
10.70.42.79     master        /rhs/brick1/b1     root          10.70.41.209::slave    10.70.41.209    Active     History Crawl    2017-07-21 12:51:55
10.70.42.79     master        /rhs/brick2/b5     root          10.70.41.209::slave    10.70.41.209    Active     History Crawl    2017-07-21 12:51:55
10.70.42.79     master        /rhs/brick3/b9     root          10.70.41.209::slave    10.70.41.209    Active     History Crawl    2017-07-21 12:51:55
10.70.42.74     master        /rhs/brick1/b3     root          10.70.41.209::slave    10.70.41.202    Active     History Crawl    N/A
10.70.42.74     master        /rhs/brick2/b7     root          10.70.41.209::slave    10.70.41.202    Active     History Crawl    N/A
10.70.42.74     master        /rhs/brick3/b11    root          10.70.41.209::slave    10.70.41.202    Active     History Crawl    N/A
10.70.41.217    master        /rhs/brick1/b4     root          10.70.41.209::slave    10.70.42.177    Passive    N/A              N/A
10.70.41.217    master        /rhs/brick2/b8     root          10.70.41.209::slave    10.70.42.177    Passive    N/A              N/A
10.70.41.217    master        /rhs/brick3/b12    root          10.70.41.209::slave    10.70.42.177    Passive    N/A              N/A
10.70.43.210    master        /rhs/brick1/b2     root          10.70.41.209::slave    10.70.41.194    Passive    N/A              N/A
10.70.43.210    master        /rhs/brick2/b6     root          10.70.41.209::slave    10.70.41.194    Passive    N/A              N/A
10.70.43.210    master        /rhs/brick3/b10    root          10.70.41.209::slave    10.70.41.194    Passive    N/A              N/A
[root@dhcp42-79 ~]#
[root@dhcp42-79 ~]# date
Sun Jul 23 11:04:25 IST 2017
[root@dhcp42-79 ~]#


[root@dhcp42-74 ~]# cd /var/lib/glusterd/geo-replication/master_10.70.41.209_slave/
[root@dhcp42-74 master_10.70.41.209_slave]# ls
brick_%2Frhs%2Fbrick1%2Fb3.status  brick_%2Frhs%2Fbrick2%2Fb7.status  brick_%2Frhs%2Fbrick3%2Fb11.status  gsyncd.conf  monitor.pid  monitor.status
[root@dhcp42-74 master_10.70.41.209_slave]# cat brick_%2Frhs%2Fbrick1%2Fb3.status
{"checkpoint_time": 0, "last_synced": 0, "checkpoint_completed": "N/A", "meta": 0, "failures": 0, "entry": 583, "slave_node": "10.70.41.202", "data": 2083, "worker_status": "Active", "crawl_status": "History Crawl", "checkpoint_completion_time": 0}
[root@dhcp42-74 master_10.70.41.209_slave]# cat brick_%2Frhs%2Fbrick2%2Fb7.status
{"checkpoint_time": 0, "last_synced": 0, "checkpoint_completed": "N/A", "meta": 0, "failures": 0, "entry": 584, "slave_node": "10.70.41.202", "data": 2059, "worker_status": "Active", "crawl_status": "History Crawl", "checkpoint_completion_time": 0}
[root@dhcp42-74 master_10.70.41.209_slave]# cat brick_%2Frhs%2Fbrick3%2Fb11.status
{"checkpoint_time": 0, "last_synced": 0, "checkpoint_completed": "N/A", "meta": 0, "failures": 0, "entry": 586, "slave_node": "10.70.41.202", "data": 2101, "worker_status": "Active", "crawl_status": "History Crawl", "checkpoint_completion_time": 0}
[root@dhcp42-74 master_10.70.41.209_slave]# cat monitor.status
Started
[root@dhcp42-74 master_10.70.41.209_slave]#
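The brick status files above are flat JSON documents, and "last_synced" is a
Unix epoch timestamp that the status command formats for display. A minimal
sketch of that rendering (hypothetical helper name, not the actual gsyncd
code; it only assumes the on-disk format shown above), which illustrates why a
never-written value of 0 surfaces as "N/A"/"0" in the output:

```python
import json
from datetime import datetime, timezone

def render_last_synced(status_path):
    """Read a brick status file (flat JSON, as shown above) and
    format the last_synced epoch for display. A value of 0 means
    the field was never persisted, so render it as "N/A"."""
    with open(status_path) as f:
        status = json.load(f)
    ts = status.get("last_synced", 0)
    if not ts:
        return "N/A"
    # Format the epoch as a human-readable UTC timestamp.
    return datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%Y-%m-%d %H:%M:%S")
```

With the files above, every call returns "N/A", matching the LAST_SYNCED
column for the affected bricks.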


The status remained unchanged for more than 10 minutes, until the first batch of changelogs was synced.



MASTER NODE     MASTER VOL    MASTER BRICK       SLAVE USER    SLAVE                  SLAVE NODE      STATUS     CRAWL STATUS     LAST_SYNCED
----------------------------------------------------------------------------------------------------------------------------------------------------
10.70.42.79     master        /rhs/brick1/b1     root          10.70.41.209::slave    10.70.41.209    Active     History Crawl    2017-07-21 12:51:55
10.70.42.79     master        /rhs/brick2/b5     root          10.70.41.209::slave    10.70.41.209    Active     History Crawl    2017-07-21 12:51:55
10.70.42.79     master        /rhs/brick3/b9     root          10.70.41.209::slave    10.70.41.209    Active     History Crawl    2017-07-21 12:51:55
10.70.41.217    master        /rhs/brick1/b4     root          10.70.41.209::slave    10.70.42.177    Passive    N/A              N/A
10.70.41.217    master        /rhs/brick2/b8     root          10.70.41.209::slave    10.70.42.177    Passive    N/A              N/A
10.70.41.217    master        /rhs/brick3/b12    root          10.70.41.209::slave    10.70.42.177    Passive    N/A              N/A
10.70.42.74     master        /rhs/brick1/b3     root          10.70.41.209::slave    10.70.41.202    Active     History Crawl    N/A
10.70.42.74     master        /rhs/brick2/b7     root          10.70.41.209::slave    10.70.41.202    Active     History Crawl    N/A
10.70.42.74     master        /rhs/brick3/b11    root          10.70.41.209::slave    10.70.41.202    Active     History Crawl    N/A
10.70.43.210    master        /rhs/brick1/b2     root          10.70.41.209::slave    10.70.41.194    Passive    N/A              N/A
10.70.43.210    master        /rhs/brick2/b6     root          10.70.41.209::slave    10.70.41.194    Passive    N/A              N/A
10.70.43.210    master        /rhs/brick3/b10    root          10.70.41.209::slave    10.70.41.194    Passive    N/A              N/A
Sun Jul 23 11:14:50 IST 2017


Version-Release number of selected component (if applicable):
=============================================================
mainline


How reproducible:
=================

I remember seeing this only once before, upon a stop/start. The upgrade has
been tried twice, and the issue was seen once.

Steps to Reproduce:
===================

No specific steps; the systems were upgraded, and as part of the upgrade,
geo-replication was stopped and started.

Actual results:
===============

Last sync is shown as "0".


Expected results:
=================

Last sync should be what it was before geo-rep was stopped. It looks like the
brick status file was overwritten with "0" as the last synced time.

--- Additional comment from Worker Ant on 2017-10-10 08:34:12 EDT ---

REVIEW: https://review.gluster.org/18468 (geo-rep: Fix passive brick's last
sync time) posted (#1) for review on master by Kotresh HR (khiremat at redhat.com)

--- Additional comment from Worker Ant on 2017-10-11 11:16:39 EDT ---

COMMIT: https://review.gluster.org/18468 committed in master by Kotresh HR
(khiremat at redhat.com) 
------
commit f18a47ee7e6e06c9a9a8893aef7957f23a18de53
Author: Kotresh HR <khiremat at redhat.com>
Date:   Tue Oct 10 08:25:19 2017 -0400

    geo-rep: Fix passive brick's last sync time

    Passive brick's stime was not updated to the
    status file immediately after updating the brick
    root. As a result the last sync time was showing
    '0' until it finishes first crawl if passive
    worker becomes active after restart. Fix is to
    update the status file immediately after upgrading
    the brick root.

    Change-Id: I248339497303bad20b7f5a1d42ab44a1fe6bca99
    BUG: 1500346
    Signed-off-by: Kotresh HR <khiremat at redhat.com>
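
    The fix described above amounts to persisting the status file as soon
    as the brick-root stime is updated, instead of deferring the write
    until the first crawl batch completes. A minimal sketch of that
    ordering (hypothetical function and field names, not the actual
    gsyncd code), using a write-to-temp-then-rename update so the status
    file is never left half-written:

```python
import json
import os
import tempfile

def set_last_synced(status_path, stime):
    """Persist last_synced: write a temp file in the same directory,
    then rename it over the status file, so readers always see either
    the old or the new complete JSON document."""
    try:
        with open(status_path) as f:
            status = json.load(f)
    except (OSError, ValueError):
        status = {}  # missing or unreadable status file: start fresh
    status["last_synced"] = stime
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(status_path) or ".")
    with os.fdopen(fd, "w") as f:
        json.dump(status, f)
    os.rename(tmp, status_path)  # atomic replace on POSIX

def become_active(status_path, brick_root_stime):
    # Before the fix: a passive worker turning active updated the brick
    # root stime but did not write it to the status file until the first
    # changelog batch synced, leaving last_synced = 0 in the meantime.
    # After the fix: the status file is updated immediately.
    set_last_synced(status_path, brick_root_stime)
```

    With this ordering, a restart between becoming active and the first
    synced batch still shows the pre-stop last sync time rather than "0".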

-- 
You are receiving this mail because:
You are on the CC list for the bug.
You are the assignee for the bug.


More information about the Bugs mailing list