[Bugs] [Bug 1500346] [geo-rep]: Incorrect last sync "0" during history crawl after upgrade/stop-start

bugzilla at redhat.com
Tue Oct 10 12:33:25 UTC 2017


https://bugzilla.redhat.com/show_bug.cgi?id=1500346

Kotresh HR <khiremat at redhat.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |ASSIGNED
           Assignee|bugs at gluster.org            |khiremat at redhat.com



--- Comment #1 from Kotresh HR <khiremat at redhat.com> ---

Description of problem:
=======================

Observed a scenario where the last sync became zero after an upgrade/reboot
during history crawl. Before the upgrade started, the crawl status was
"Changelog Crawl" with a last synced time of "2017-07-21 12:51:55". However,
after the upgrade and starting geo-rep, the last sync for a few workers was
shown as "N/A", and the corresponding brick status files show "0":

[root@dhcp42-79 ~]# gluster volume geo-replication master 10.70.41.209::slave status

MASTER NODE     MASTER VOL    MASTER BRICK       SLAVE USER    SLAVE                  SLAVE NODE      STATUS     CRAWL STATUS     LAST_SYNCED
----------------------------------------------------------------------------------------------------------------------------------------------------
10.70.42.79     master        /rhs/brick1/b1     root          10.70.41.209::slave    10.70.41.209    Active     History Crawl    2017-07-21 12:51:55
10.70.42.79     master        /rhs/brick2/b5     root          10.70.41.209::slave    10.70.41.209    Active     History Crawl    2017-07-21 12:51:55
10.70.42.79     master        /rhs/brick3/b9     root          10.70.41.209::slave    10.70.41.209    Active     History Crawl    2017-07-21 12:51:55
10.70.42.74     master        /rhs/brick1/b3     root          10.70.41.209::slave    10.70.41.202    Active     History Crawl    N/A
10.70.42.74     master        /rhs/brick2/b7     root          10.70.41.209::slave    10.70.41.202    Active     History Crawl    N/A
10.70.42.74     master        /rhs/brick3/b11    root          10.70.41.209::slave    10.70.41.202    Active     History Crawl    N/A
10.70.41.217    master        /rhs/brick1/b4     root          10.70.41.209::slave    10.70.42.177    Passive    N/A              N/A
10.70.41.217    master        /rhs/brick2/b8     root          10.70.41.209::slave    10.70.42.177    Passive    N/A              N/A
10.70.41.217    master        /rhs/brick3/b12    root          10.70.41.209::slave    10.70.42.177    Passive    N/A              N/A
10.70.43.210    master        /rhs/brick1/b2     root          10.70.41.209::slave    10.70.41.194    Passive    N/A              N/A
10.70.43.210    master        /rhs/brick2/b6     root          10.70.41.209::slave    10.70.41.194    Passive    N/A              N/A
10.70.43.210    master        /rhs/brick3/b10    root          10.70.41.209::slave    10.70.41.194    Passive    N/A              N/A
[root@dhcp42-79 ~]#
[root@dhcp42-79 ~]# date
Sun Jul 23 11:04:25 IST 2017
[root@dhcp42-79 ~]#


[root@dhcp42-74 ~]# cd /var/lib/glusterd/geo-replication/master_10.70.41.209_slave/
[root@dhcp42-74 master_10.70.41.209_slave]# ls
brick_%2Frhs%2Fbrick1%2Fb3.status  brick_%2Frhs%2Fbrick2%2Fb7.status
brick_%2Frhs%2Fbrick3%2Fb11.status  gsyncd.conf  monitor.pid  monitor.status
[root@dhcp42-74 master_10.70.41.209_slave]# cat brick_%2Frhs%2Fbrick1%2Fb3.status
{"checkpoint_time": 0, "last_synced": 0, "checkpoint_completed": "N/A", "meta": 0, "failures": 0, "entry": 583, "slave_node": "10.70.41.202", "data": 2083, "worker_status": "Active", "crawl_status": "History Crawl", "checkpoint_completion_time": 0}
[root@dhcp42-74 master_10.70.41.209_slave]# cat brick_%2Frhs%2Fbrick2%2Fb7.status
{"checkpoint_time": 0, "last_synced": 0, "checkpoint_completed": "N/A", "meta": 0, "failures": 0, "entry": 584, "slave_node": "10.70.41.202", "data": 2059, "worker_status": "Active", "crawl_status": "History Crawl", "checkpoint_completion_time": 0}
[root@dhcp42-74 master_10.70.41.209_slave]# cat brick_%2Frhs%2Fbrick3%2Fb11.status
{"checkpoint_time": 0, "last_synced": 0, "checkpoint_completed": "N/A", "meta": 0, "failures": 0, "entry": 586, "slave_node": "10.70.41.202", "data": 2101, "worker_status": "Active", "crawl_status": "History Crawl", "checkpoint_completion_time": 0}
[root@dhcp42-74 master_10.70.41.209_slave]# cat monitor.status
Started
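
For reference, the timestamp fields in the brick status file (last_synced,
checkpoint_time, checkpoint_completion_time) are epoch seconds, and a value
of 0 is the never-synced default, which the status command renders as "N/A".
A minimal Python sketch of that mapping, using the brick status file path
from above (illustrative only, not the actual gsyncd code):

# read_last_synced.py -- illustrative sketch, not the actual gsyncd code.
# Reads a brick status file and formats last_synced the way the geo-rep
# status CLI displays it: epoch 0 -> "N/A", otherwise local time.
import json
import time

STATUS_FILE = ("/var/lib/glusterd/geo-replication/"
               "master_10.70.41.209_slave/brick_%2Frhs%2Fbrick1%2Fb3.status")

with open(STATUS_FILE) as f:
    status = json.load(f)

last_synced = status.get("last_synced", 0)
if last_synced == 0:
    # 0 is the uninitialized default, so nothing has (apparently) synced
    print("LAST_SYNCED: N/A")
else:
    print("LAST_SYNCED: " +
          time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(last_synced)))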


The status remained the same for more than 10 minutes, until at least one
batch had synced:



MASTER NODE     MASTER VOL    MASTER BRICK       SLAVE USER    SLAVE                  SLAVE NODE      STATUS     CRAWL STATUS     LAST_SYNCED
----------------------------------------------------------------------------------------------------------------------------------------------------
10.70.42.79     master        /rhs/brick1/b1     root          10.70.41.209::slave    10.70.41.209    Active     History Crawl    2017-07-21 12:51:55
10.70.42.79     master        /rhs/brick2/b5     root          10.70.41.209::slave    10.70.41.209    Active     History Crawl    2017-07-21 12:51:55
10.70.42.79     master        /rhs/brick3/b9     root          10.70.41.209::slave    10.70.41.209    Active     History Crawl    2017-07-21 12:51:55
10.70.41.217    master        /rhs/brick1/b4     root          10.70.41.209::slave    10.70.42.177    Passive    N/A              N/A
10.70.41.217    master        /rhs/brick2/b8     root          10.70.41.209::slave    10.70.42.177    Passive    N/A              N/A
10.70.41.217    master        /rhs/brick3/b12    root          10.70.41.209::slave    10.70.42.177    Passive    N/A              N/A
10.70.42.74     master        /rhs/brick1/b3     root          10.70.41.209::slave    10.70.41.202    Active     History Crawl    N/A
10.70.42.74     master        /rhs/brick2/b7     root          10.70.41.209::slave    10.70.41.202    Active     History Crawl    N/A
10.70.42.74     master        /rhs/brick3/b11    root          10.70.41.209::slave    10.70.41.202    Active     History Crawl    N/A
10.70.43.210    master        /rhs/brick1/b2     root          10.70.41.209::slave    10.70.41.194    Passive    N/A              N/A
10.70.43.210    master        /rhs/brick2/b6     root          10.70.41.209::slave    10.70.41.194    Passive    N/A              N/A
10.70.43.210    master        /rhs/brick3/b10    root          10.70.41.209::slave    10.70.41.194    Passive    N/A              N/A
Sun Jul 23 11:14:50 IST 2017


Version-Release number of selected component (if applicable):
=============================================================
mainline


How reproducible:
=================

I remember seeing this only once before, on a stop/start. I have tried the
upgrade twice and have seen this once.

Steps to Reproduce:
===================

No specific steps; the systems were upgraded, and as part of the upgrade
geo-replication was stopped and started (sketched below).
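
A rough outline of the cycle involved, using the session from this report
(the exact upgrade steps are not captured here, and reproduction is not
deterministic):

# gluster volume geo-replication master 10.70.41.209::slave stop
# (upgrade/reboot the nodes)
# gluster volume geo-replication master 10.70.41.209::slave start
# gluster volume geo-replication master 10.70.41.209::slave status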

Actual results:
===============

Last sync is "0" in the brick status file, and is displayed as "N/A" in the
status command output.


Expected results:
=================

Last sync should be what it was before geo-rep was stopped. It looks like the
brick status file was overwritten with "0" as the last synced time.
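
If that is the root cause, the expected behavior would be for the worker to
merge defaults into the existing status file on (re)start instead of
overwriting it, so a non-zero last_synced survives. A minimal Python sketch
of that idea (illustrative only; field names are taken from the status file
above, but this is not the actual gsyncd implementation):

# Illustrative sketch of a restart-safe status update: load the existing
# brick status file, apply defaults only for missing keys, and never
# regress a non-zero last_synced back to the 0 default.
import json
import os

DEFAULTS = {"last_synced": 0, "checkpoint_time": 0,
            "checkpoint_completion_time": 0, "checkpoint_completed": "N/A",
            "meta": 0, "data": 0, "entry": 0, "failures": 0}

def init_status(path):
    status = dict(DEFAULTS)
    if os.path.exists(path):
        with open(path) as f:
            on_disk = json.load(f)
        # Existing on-disk values win over defaults, so a previously
        # recorded last_synced is preserved across stop/start.
        status.update(on_disk)
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(status, f)
    os.rename(tmp, path)  # atomic replace avoids a half-written file
    return status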
