[Bugs] [Bug 1374632] New: [geo-replication]: geo-rep Status is not showing bricks from one of the nodes

bugzilla at redhat.com bugzilla at redhat.com
Fri Sep 9 09:13:06 UTC 2016


https://bugzilla.redhat.com/show_bug.cgi?id=1374632

            Bug ID: 1374632
           Summary: [geo-replication]: geo-rep Status is not showing
                    bricks from one of the nodes
           Product: GlusterFS
           Version: 3.8.3
         Component: geo-replication
          Severity: high
          Assignee: bugs at gluster.org
          Reporter: avishwan at redhat.com
                CC: bugs at gluster.org, csaba at redhat.com,
                    rhinduja at redhat.com, rhs-bugs at redhat.com,
                    storage-qa-internal at redhat.com
        Depends On: 1369384, 1373741
            Blocks: 1374630



+++ This bug was initially created as a clone of Bug #1373741 +++

+++ This bug was initially created as a clone of Bug #1369384 +++

Description of problem:
=======================

After upgrading the nodes from RHEL 7.2 to RHEL 7.3, a reboot was required
because of a kernel update. After the reboot, the bricks from one of the nodes
were not listed in the geo-replication status command output, even though the
peer was in the Connected state and all of its bricks were online. On checking
the geo-replication directory on that node, monitor.pid was missing; creating
it with touch resolved the issue.

1. It is not clear what caused monitor.pid to be removed. From the user's
perspective, the only operation performed was a reboot of the whole cluster at
once.
2. Even if the file is removed, gsyncd should handle the ENOENT case
gracefully (see the sketch after this list).
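
For illustration, here is a minimal sketch of the defensive handling suggested
in point 2: treat a missing monitor.pid as a monitor that is not running,
instead of letting the IOError escape. The helper name and the "Stopped"
fallback are assumptions made for this sketch, not the actual patch (the real
fix is the commit referenced at the end of this report):

    import errno

    # Hypothetical helper around the open() call that fails in the traceback
    # shown later in this report; not the actual gsyncdstatus.py change.
    def read_monitor_pid(monitor_pid_file):
        """Return the monitor PID, or None if monitor.pid does not exist."""
        try:
            with open(monitor_pid_file, "r+") as f:
                data = f.read().strip()
                return int(data) if data else None
        except (IOError, OSError) as e:
            if e.errno == errno.ENOENT:
                return None   # missing file -> treat monitor as not running
            raise             # any other error is still fatal

    # get_status() could then fall back to a default row (for example,
    # STATUS "Stopped") when this returns None, so the brick still shows
    # up in the geo-rep status output instead of disappearing.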

Initial:
========

[root at dhcp37-81 ~]# gluster volume geo-replication master 10.70.37.80::slave status

MASTER NODE     MASTER VOL    MASTER BRICK      SLAVE USER    SLAVE                 SLAVE NODE      STATUS     CRAWL STATUS       LAST_SYNCED
------------------------------------------------------------------------------------------------------------------------------------------------
10.70.37.81     master        /rhs/brick1/b1    root          10.70.37.80::slave    10.70.37.80     Active     Changelog Crawl    2016-08-22 16:11:07
10.70.37.81     master        /rhs/brick2/b4    root          10.70.37.80::slave    10.70.37.208    Active     Changelog Crawl    2016-08-22 16:11:11
10.70.37.200    master        /rhs/brick1/b3    root          10.70.37.80::slave    10.70.37.80     Passive    N/A                N/A
10.70.37.200    master        /rhs/brick2/b6    root          10.70.37.80::slave    10.70.37.208    Active     Changelog Crawl    2016-08-22 16:10:59
10.70.37.100    master        /rhs/brick1/b2    root          10.70.37.80::slave    10.70.37.208    Passive    N/A                N/A
10.70.37.100    master        /rhs/brick2/b5    root          10.70.37.80::slave    10.70.37.80     Passive    N/A                N/A
[root at dhcp37-81 ~]#

After the RHEL platform is updated and rebooted:
================================================

[root at dhcp37-81 ~]# gluster volume geo-replication master 10.70.37.80::slave status

MASTER NODE     MASTER VOL    MASTER BRICK      SLAVE USER    SLAVE                 SLAVE NODE      STATUS     CRAWL STATUS       LAST_SYNCED
------------------------------------------------------------------------------------------------------------------------------------------------
10.70.37.81     master        /rhs/brick1/b1    root          10.70.37.80::slave    10.70.37.80     Passive    N/A                N/A
10.70.37.81     master        /rhs/brick2/b4    root          10.70.37.80::slave    10.70.37.208    Active     Changelog Crawl    2016-08-22 16:11:11
10.70.37.100    master        /rhs/brick1/b2    root          10.70.37.80::slave    10.70.37.208    Active     Changelog Crawl    2016-08-22 16:11:04
10.70.37.100    master        /rhs/brick2/b5    root          10.70.37.80::slave    10.70.37.80     Active     Changelog Crawl    2016-08-22 16:10:59
[root at dhcp37-81 ~]#

The peer and the bricks of node 200 are all online:
====================================================

[root at dhcp37-81 ~]# gluster peer status
Number of Peers: 2

Hostname: 10.70.37.100
Uuid: 951c7434-89c2-4a66-a224-f3c2e5c7b06a
State: Peer in Cluster (Connected)

Hostname: 10.70.37.200
Uuid: db8ede6b-99b2-4369-8e65-8dd4d2fa54dc
State: Peer in Cluster (Connected)
[root at dhcp37-81 ~]# 

[root at dhcp37-81 ~]# gluster volume status master
Status of volume: master
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.70.37.81:/rhs/brick1/b1            49152     0          Y       1639 
Brick 10.70.37.100:/rhs/brick1/b2           49152     0          Y       1672 
Brick 10.70.37.200:/rhs/brick1/b3           49152     0          Y       1683 
Brick 10.70.37.81:/rhs/brick2/b4            49153     0          Y       1662 
Brick 10.70.37.100:/rhs/brick2/b5           49153     0          Y       1673 
Brick 10.70.37.200:/rhs/brick2/b6           49153     0          Y       1678 
Snapshot Daemon on localhost                49155     0          Y       1776 
NFS Server on localhost                     2049      0          Y       1701 
Self-heal Daemon on localhost               N/A       N/A        Y       1709 
Quota Daemon on localhost                   N/A       N/A        Y       1718 
Snapshot Daemon on 10.70.37.100             49155     0          Y       1798 
NFS Server on 10.70.37.100                  2049      0          Y       1729 
Self-heal Daemon on 10.70.37.100            N/A       N/A        Y       1737 
Quota Daemon on 10.70.37.100                N/A       N/A        Y       1745 
Snapshot Daemon on 10.70.37.200             49155     0          Y       1817 
NFS Server on 10.70.37.200                  2049      0          Y       1644 
Self-heal Daemon on 10.70.37.200            N/A       N/A        Y       1649 
Quota Daemon on 10.70.37.200                N/A       N/A        Y       1664 

Task Status of Volume master
------------------------------------------------------------------------------
There are no active volume tasks

[root at dhcp37-81 ~]# 

Problematic node 200:
=====================

[root at dhcp37-200 ~]# python /usr/libexec/glusterfs/python/syncdaemon/gsyncd.py -c /var/lib/glusterd/geo-replication/master_10.70.37.80_slave/gsyncd.conf --status-get :master 10.70.37.80::slave --path /rhs/brick1/b3/
[2016-08-23 08:56:10.248389] E [syncdutils:276:log_raise_exception] <top>: FAIL:
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 201, in main
    main_i()
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 681, in main_i
    brick_status.print_status(checkpoint_time=checkpoint_time)
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncdstatus.py", line 343, in print_status
    for key, value in self.get_status(checkpoint_time).items():
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncdstatus.py", line 262, in get_status
    with open(self.monitor_pid_file, "r+") as f:
IOError: [Errno 2] No such file or directory: '/var/lib/glusterd/geo-replication/master_10.70.37.80_slave/monitor.pid'
failed with IOError.
[root at dhcp37-200 ~]# ls /var/lib/glusterd/geo-replication/
gsyncd_template.conf      master_10.70.37.80_slave/  secret.pem
secret.pem.pub            tar_ssh.pem                tar_ssh.pem.pub
[root at dhcp37-200 ~]# ls /var/lib/glusterd/geo-replication/master_10.70.37.80_slave/
brick_%2Frhs%2Fbrick1%2Fb3%2F.status  brick_%2Frhs%2Fbrick1%2Fb3.status  brick_%2Frhs%2Fbrick2%2Fb6.status  gsyncd.conf  monitor.status
[root at dhcp37-200 ~]# touch /var/lib/glusterd/geo-replication/master_10.70.37.80_slave/monitor.pid
[root at dhcp37-200 ~]# ls /var/lib/glusterd/geo-replication/master_10.70.37.80_slave/
brick_%2Frhs%2Fbrick1%2Fb3%2F.status  brick_%2Frhs%2Fbrick1%2Fb3.status  brick_%2Frhs%2Fbrick2%2Fb6.status  gsyncd.conf  monitor.pid  monitor.status
[root at dhcp37-200 ~]#
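
The manual workaround above can also be scripted. This is only a rough
convenience sketch, assuming the usual session layout under
/var/lib/glusterd/geo-replication (one directory per session containing
gsyncd.conf); the helper name is made up for illustration:

    import glob
    import os

    GEOREP_DIR = "/var/lib/glusterd/geo-replication"

    def recreate_missing_monitor_pid():
        """Recreate an empty monitor.pid (same effect as `touch`) in every
        geo-rep session directory that is missing one."""
        for conf in glob.glob(os.path.join(GEOREP_DIR, "*", "gsyncd.conf")):
            pid_file = os.path.join(os.path.dirname(conf), "monitor.pid")
            if not os.path.exists(pid_file):
                open(pid_file, "a").close()
                print("created %s" % pid_file)

    if __name__ == "__main__":
        recreate_missing_monitor_pid()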


After the touch, status shows the bricks of node 200, but as Stopped:
======================================================================

[root at dhcp37-81 ~]# gluster volume geo-replication master 10.70.37.80::slave status

MASTER NODE     MASTER VOL    MASTER BRICK      SLAVE USER    SLAVE                 SLAVE NODE      STATUS     CRAWL STATUS       LAST_SYNCED
------------------------------------------------------------------------------------------------------------------------------------------------
10.70.37.81     master        /rhs/brick1/b1    root          10.70.37.80::slave    10.70.37.80     Passive    N/A                N/A
10.70.37.81     master        /rhs/brick2/b4    root          10.70.37.80::slave    10.70.37.208    Active     Changelog Crawl    2016-08-22 16:11:11
10.70.37.100    master        /rhs/brick1/b2    root          10.70.37.80::slave    10.70.37.208    Active     Changelog Crawl    2016-08-22 16:11:04
10.70.37.100    master        /rhs/brick2/b5    root          10.70.37.80::slave    10.70.37.80     Active     Changelog Crawl    2016-08-22 16:10:59
10.70.37.200    master        /rhs/brick1/b3    root          10.70.37.80::slave    N/A             Stopped    N/A                N/A
10.70.37.200    master        /rhs/brick2/b6    root          10.70.37.80::slave    N/A             Stopped    N/A                N/A
[root at dhcp37-81 ~]#


Version-Release number of selected component (if applicable):
=============================================================
glusterfs-geo-replication-3.7.9-10.el7rhgs.x86_64
glusterfs-3.7.9-10.el7rhgs.x86_64


How reproducible:
=================
This case was different in the sense that all the nodes in the cluster were
brought offline at the same time while geo-replication was in the Started
state; a kind of negative testing.

--- Additional comment from Worker Ant on 2016-09-07 02:15:58 EDT ---

REVIEW: http://review.gluster.org/15416 (geo-rep: Fix Geo-rep status if
monitor.pid file not exists) posted (#1) for review on master by Aravinda VK
(avishwan at redhat.com)

--- Additional comment from Worker Ant on 2016-09-08 12:15:19 EDT ---

COMMIT: http://review.gluster.org/15416 committed in master by Aravinda VK
(avishwan at redhat.com) 
------
commit c7118a92f52a2fa33ab69f3e3ef1bdabfee847cf
Author: Aravinda VK <avishwan at redhat.com>
Date:   Wed Sep 7 11:39:39 2016 +0530

    geo-rep: Fix Geo-rep status if monitor.pid file not exists

    If the monitor.pid file does not exist, gsyncd fails with the
    following traceback

    Traceback (most recent call last):
      File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py",
      line 201, in main
        main_i()
      File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py",
      line 681, in main_i
        brick_status.print_status(checkpoint_time=checkpoint_time)
      File "/usr/libexec/glusterfs/python/syncdaemon/gsyncdstatus.py",
      line 343, in print_status
        for key, value in self.get_status(checkpoint_time).items():
      File "/usr/libexec/glusterfs/python/syncdaemon/gsyncdstatus.py",
      line 262, in get_status
        with open(self.monitor_pid_file, "r+") as f:
    IOError: [Errno 2] No such file or directory: '/var/lib/glusterd/
      geo-replication/master_node_slave/monitor.pid'

    When the Geo-rep status command is run, this worker's status is not
    displayed, since gsyncd does not return the expected status output.

    BUG: 1373741
    Change-Id: I600a2f5d9617f993d635b9bc6e393108500db5f9
    Signed-off-by: Aravinda VK <avishwan at redhat.com>
    Reviewed-on: http://review.gluster.org/15416
    Smoke: Gluster Build System <jenkins at build.gluster.org>
    NetBSD-regression: NetBSD Build System <jenkins at build.gluster.org>
    CentOS-regression: Gluster Build System <jenkins at build.gluster.org>
    Reviewed-by: Kotresh HR <khiremat at redhat.com>


Referenced Bugs:

https://bugzilla.redhat.com/show_bug.cgi?id=1369384
[Bug 1369384] [geo-replication]: geo-rep Status is not showing bricks from
one of the nodes
https://bugzilla.redhat.com/show_bug.cgi?id=1373741
[Bug 1373741] [geo-replication]: geo-rep Status is not showing bricks from
one of the nodes
https://bugzilla.redhat.com/show_bug.cgi?id=1374630
[Bug 1374630] [geo-replication]: geo-rep Status is not showing bricks from
one of the nodes
-- 
You are receiving this mail because:
You are on the CC list for the bug.
You are the assignee for the bug.

