[Bugs] [Bug 1374630] New: [geo-replication]: geo-rep Status is not showing bricks from one of the nodes
bugzilla at redhat.com
Fri Sep 9 09:11:29 UTC 2016
https://bugzilla.redhat.com/show_bug.cgi?id=1374630
Bug ID: 1374630
Summary: [geo-replication]: geo-rep Status is not showing bricks from one of the nodes
Product: GlusterFS
Version: 3.9
Component: geo-replication
Severity: high
Assignee: bugs at gluster.org
Reporter: avishwan at redhat.com
CC: bugs at gluster.org, csaba at redhat.com, rhinduja at redhat.com,
rhs-bugs at redhat.com, storage-qa-internal at redhat.com
Depends On: 1369384, 1373741
+++ This bug was initially created as a clone of Bug #1373741 +++
+++ This bug was initially created as a clone of Bug #1369384 +++
Description of problem:
=======================
After upgrading the nodes from RHEL 7.2 to RHEL 7.3, a reboot was required due to
a kernel update. After the reboot, the bricks from one of the nodes were not listed
in the geo-replication status command output. However, the peer was in the
Connected state and all of its bricks were online. On checking the geo-replication
directory, that node was missing monitor.pid. Touching the file resolved the issue.
1. It is not clear what caused monitor.pid to be removed. From the user's
perspective, the only operation was a reboot of the whole cluster at once.
2. Even if it is removed, we should handle the ENOENT case (see the sketch after
this list).
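As an illustration only (not the actual fix from the patch below), one way to
handle the ENOENT case is to catch the error when reading monitor.pid and treat
the monitor as not running instead of failing the whole status command. The path,
function name and fallback behaviour in this sketch are assumptions made for the
example, not code taken from gsyncd:

    import errno

    # Hypothetical path used only for this sketch; gsyncd derives it from its config.
    MONITOR_PID_FILE = ("/var/lib/glusterd/geo-replication/"
                        "master_10.70.37.80_slave/monitor.pid")

    def read_monitor_pid(path=MONITOR_PID_FILE):
        """Return the monitor PID, or None if the pid file is missing or empty."""
        try:
            with open(path) as f:
                data = f.read().strip()
            return int(data) if data else None
        except IOError as e:
            if e.errno == errno.ENOENT:
                # monitor.pid was removed (e.g. after the reboot): report
                # "monitor not running" instead of raising and breaking status.
                return None
            raise

    pid = read_monitor_pid()
    print("monitor pid: %s" % (pid if pid is not None else "not running"))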
Initial:
========
[root at dhcp37-81 ~]# gluster volume geo-replication master 10.70.37.80::slave status

MASTER NODE     MASTER VOL   MASTER BRICK      SLAVE USER   SLAVE                 SLAVE NODE      STATUS    CRAWL STATUS      LAST_SYNCED
-------------------------------------------------------------------------------------------------------------------------------------------------
10.70.37.81     master       /rhs/brick1/b1    root         10.70.37.80::slave    10.70.37.80     Active    Changelog Crawl   2016-08-22 16:11:07
10.70.37.81     master       /rhs/brick2/b4    root         10.70.37.80::slave    10.70.37.208    Active    Changelog Crawl   2016-08-22 16:11:11
10.70.37.200    master       /rhs/brick1/b3    root         10.70.37.80::slave    10.70.37.80     Passive   N/A               N/A
10.70.37.200    master       /rhs/brick2/b6    root         10.70.37.80::slave    10.70.37.208    Active    Changelog Crawl   2016-08-22 16:10:59
10.70.37.100    master       /rhs/brick1/b2    root         10.70.37.80::slave    10.70.37.208    Passive   N/A               N/A
10.70.37.100    master       /rhs/brick2/b5    root         10.70.37.80::slave    10.70.37.80     Passive   N/A               N/A
[root at dhcp37-81 ~]#
After the RHEL platform is updated and rebooted:
============================================
[root at dhcp37-81 ~]# gluster volume geo-replication master 10.70.37.80::slave status

MASTER NODE     MASTER VOL   MASTER BRICK      SLAVE USER   SLAVE                 SLAVE NODE      STATUS    CRAWL STATUS      LAST_SYNCED
-------------------------------------------------------------------------------------------------------------------------------------------------
10.70.37.81     master       /rhs/brick1/b1    root         10.70.37.80::slave    10.70.37.80     Passive   N/A               N/A
10.70.37.81     master       /rhs/brick2/b4    root         10.70.37.80::slave    10.70.37.208    Active    Changelog Crawl   2016-08-22 16:11:11
10.70.37.100    master       /rhs/brick1/b2    root         10.70.37.80::slave    10.70.37.208    Active    Changelog Crawl   2016-08-22 16:11:04
10.70.37.100    master       /rhs/brick2/b5    root         10.70.37.80::slave    10.70.37.80     Active    Changelog Crawl   2016-08-22 16:10:59
[root at dhcp37-81 ~]#
Peer and bricks of node 200 are all online:
=============================================
[root at dhcp37-81 ~]# gluster peer status
Number of Peers: 2
Hostname: 10.70.37.100
Uuid: 951c7434-89c2-4a66-a224-f3c2e5c7b06a
State: Peer in Cluster (Connected)
Hostname: 10.70.37.200
Uuid: db8ede6b-99b2-4369-8e65-8dd4d2fa54dc
State: Peer in Cluster (Connected)
[root at dhcp37-81 ~]#
[root at dhcp37-81 ~]# gluster volume status master
Status of volume: master
Gluster process TCP Port RDMA Port Online Pid
------------------------------------------------------------------------------
Brick 10.70.37.81:/rhs/brick1/b1 49152 0 Y 1639
Brick 10.70.37.100:/rhs/brick1/b2 49152 0 Y 1672
Brick 10.70.37.200:/rhs/brick1/b3 49152 0 Y 1683
Brick 10.70.37.81:/rhs/brick2/b4 49153 0 Y 1662
Brick 10.70.37.100:/rhs/brick2/b5 49153 0 Y 1673
Brick 10.70.37.200:/rhs/brick2/b6 49153 0 Y 1678
Snapshot Daemon on localhost 49155 0 Y 1776
NFS Server on localhost 2049 0 Y 1701
Self-heal Daemon on localhost N/A N/A Y 1709
Quota Daemon on localhost N/A N/A Y 1718
Snapshot Daemon on 10.70.37.100 49155 0 Y 1798
NFS Server on 10.70.37.100 2049 0 Y 1729
Self-heal Daemon on 10.70.37.100 N/A N/A Y 1737
Quota Daemon on 10.70.37.100 N/A N/A Y 1745
Snapshot Daemon on 10.70.37.200 49155 0 Y 1817
NFS Server on 10.70.37.200 2049 0 Y 1644
Self-heal Daemon on 10.70.37.200 N/A N/A Y 1649
Quota Daemon on 10.70.37.200 N/A N/A Y 1664
Task Status of Volume master
------------------------------------------------------------------------------
There are no active volume tasks
[root at dhcp37-81 ~]#
Problematic node 200:
=====================
[root at dhcp37-200 ~]# python /usr/libexec/glusterfs/python/syncdaemon/gsyncd.py -c /var/lib/glusterd/geo-replication/master_10.70.37.80_slave/gsyncd.conf --status-get :master 10.70.37.80::slave --path /rhs/brick1/b3/
[2016-08-23 08:56:10.248389] E [syncdutils:276:log_raise_exception] <top>: FAIL:
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 201, in main
    main_i()
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 681, in main_i
    brick_status.print_status(checkpoint_time=checkpoint_time)
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncdstatus.py", line 343, in print_status
    for key, value in self.get_status(checkpoint_time).items():
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncdstatus.py", line 262, in get_status
    with open(self.monitor_pid_file, "r+") as f:
IOError: [Errno 2] No such file or directory: '/var/lib/glusterd/geo-replication/master_10.70.37.80_slave/monitor.pid'
failed with IOError.
[root at dhcp37-200 ~]# ls /var/lib/glusterd/geo-replication/
gsyncd_template.conf  master_10.70.37.80_slave/  secret.pem  secret.pem.pub  tar_ssh.pem  tar_ssh.pem.pub
[root at dhcp37-200 ~]# ls /var/lib/glusterd/geo-replication/master_10.70.37.80_slave/
brick_%2Frhs%2Fbrick1%2Fb3%2F.status  brick_%2Frhs%2Fbrick1%2Fb3.status  brick_%2Frhs%2Fbrick2%2Fb6.status  gsyncd.conf  monitor.status
[root at dhcp37-200 ~]# touch /var/lib/glusterd/geo-replication/master_10.70.37.80_slave/monitor.pid
[root at dhcp37-200 ~]# ls /var/lib/glusterd/geo-replication/master_10.70.37.80_slave/
brick_%2Frhs%2Fbrick1%2Fb3%2F.status  brick_%2Frhs%2Fbrick1%2Fb3.status  brick_%2Frhs%2Fbrick2%2Fb6.status  gsyncd.conf  monitor.pid  monitor.status
[root at dhcp37-200 ~]#
After the touch, the status command shows the node's bricks, but in the Stopped state:
==================================================
[root at dhcp37-81 ~]# gluster volume geo-replication master 10.70.37.80::slave status

MASTER NODE     MASTER VOL   MASTER BRICK      SLAVE USER   SLAVE                 SLAVE NODE      STATUS    CRAWL STATUS      LAST_SYNCED
-------------------------------------------------------------------------------------------------------------------------------------------------
10.70.37.81     master       /rhs/brick1/b1    root         10.70.37.80::slave    10.70.37.80     Passive   N/A               N/A
10.70.37.81     master       /rhs/brick2/b4    root         10.70.37.80::slave    10.70.37.208    Active    Changelog Crawl   2016-08-22 16:11:11
10.70.37.100    master       /rhs/brick1/b2    root         10.70.37.80::slave    10.70.37.208    Active    Changelog Crawl   2016-08-22 16:11:04
10.70.37.100    master       /rhs/brick2/b5    root         10.70.37.80::slave    10.70.37.80     Active    Changelog Crawl   2016-08-22 16:10:59
10.70.37.200    master       /rhs/brick1/b3    root         10.70.37.80::slave    N/A             Stopped   N/A               N/A
10.70.37.200    master       /rhs/brick2/b6    root         10.70.37.80::slave    N/A             Stopped   N/A               N/A
[root at dhcp37-81 ~]#
Version-Release number of selected component (if applicable):
=============================================================
glusterfs-geo-replication-3.7.9-10.el7rhgs.x86_64
glusterfs-3.7.9-10.el7rhgs.x86_64
How reproducible:
=================
This case was different in that all the nodes in the cluster were brought
offline at the same time while geo-replication was in the Started state;
a kind of negative test.
--- Additional comment from Worker Ant on 2016-09-07 02:15:58 EDT ---
REVIEW: http://review.gluster.org/15416 (geo-rep: Fix Geo-rep status if
monitor.pid file not exists) posted (#1) for review on master by Aravinda VK
(avishwan at redhat.com)
--- Additional comment from Worker Ant on 2016-09-08 12:15:19 EDT ---
COMMIT: http://review.gluster.org/15416 committed in master by Aravinda VK
(avishwan at redhat.com)
------
commit c7118a92f52a2fa33ab69f3e3ef1bdabfee847cf
Author: Aravinda VK <avishwan at redhat.com>
Date: Wed Sep 7 11:39:39 2016 +0530
geo-rep: Fix Geo-rep status if monitor.pid file not exists
If the monitor.pid file does not exist, gsyncd fails with the following traceback:
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 201, in main
    main_i()
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 681, in main_i
    brick_status.print_status(checkpoint_time=checkpoint_time)
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncdstatus.py", line 343, in print_status
    for key, value in self.get_status(checkpoint_time).items():
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncdstatus.py", line 262, in get_status
    with open(self.monitor_pid_file, "r+") as f:
IOError: [Errno 2] No such file or directory: '/var/lib/glusterd/geo-replication/master_node_slave/monitor.pid'
When the Geo-rep status command is run, this worker's status is not displayed,
since the worker does not return the expected status output.
BUG: 1373741
Change-Id: I600a2f5d9617f993d635b9bc6e393108500db5f9
Signed-off-by: Aravinda VK <avishwan at redhat.com>
Reviewed-on: http://review.gluster.org/15416
Smoke: Gluster Build System <jenkins at build.gluster.org>
NetBSD-regression: NetBSD Build System <jenkins at build.gluster.org>
CentOS-regression: Gluster Build System <jenkins at build.gluster.org>
Reviewed-by: Kotresh HR <khiremat at redhat.com>
Referenced Bugs:
https://bugzilla.redhat.com/show_bug.cgi?id=1369384
[Bug 1369384] [geo-replication]: geo-rep Status is not showing bricks from
one of the nodes
https://bugzilla.redhat.com/show_bug.cgi?id=1373741
[Bug 1373741] [geo-replication]: geo-rep Status is not showing bricks from
one of the nodes
--
You are receiving this mail because:
You are on the CC list for the bug.
You are the assignee for the bug.