[Bugs] [Bug 1668118] New: Failure to start geo-replication for tiered volume.

bugzilla@redhat.com
Mon Jan 21 23:21:09 UTC 2019


https://bugzilla.redhat.com/show_bug.cgi?id=1668118

            Bug ID: 1668118
           Summary: Failure to start geo-replication for tiered volume.
           Product: GlusterFS
           Version: 5
          Hardware: x86_64
                OS: Linux
            Status: NEW
         Component: geo-replication
          Severity: high
          Assignee: bugs@gluster.org
          Reporter: vnosov@stonefly.com
                CC: bugs@gluster.org
  Target Milestone: ---
    Classification: Community



Description of problem: The status of the geo-replication workers on the master
nodes is "inconsistent" if the master volume is tiered.


Version-Release number of selected component (if applicable):

GlusterFS 5.2 installation from source code TAR file


How reproducible:  100%


Steps to Reproduce:

1. Set up two nodes. One will host the geo-replication master volume, which must
be tiered; the other node will host the geo-replication slave volume.

[root@SC-10-10-63-182 log]# glusterfsd --version
glusterfs 5.2

[root@SC-10-10-63-183 log]# glusterfsd --version
glusterfs 5.2


2. On the master node, create the tiered volume:

[root@SC-10-10-63-182 log]# gluster volume info master-volume-1

Volume Name: master-volume-1
Type: Tier
Volume ID: aa95df34-f181-456c-aa26-9756b68ed679
Status: Started
Snapshot Count: 0
Number of Bricks: 2
Transport-type: tcp
Hot Tier :
Hot Tier Type : Distribute
Number of Bricks: 1
Brick1: 10.10.60.182:/exports/master-hot-tier/master-volume-1
Cold Tier:
Cold Tier Type : Distribute
Number of Bricks: 1
Brick2: 10.10.60.182:/exports/master-segment-1/master-volume-1
Options Reconfigured:
features.ctr-sql-db-wal-autocheckpoint: 25000
features.ctr-sql-db-cachesize: 12500
cluster.tier-mode: cache
features.ctr-enabled: on
server.allow-insecure: on
performance.quick-read: off
performance.stat-prefetch: off
nfs.addr-namelookup: off
transport.address-family: inet
nfs.disable: on
cluster.enable-shared-storage: disable
snap-activate-on-create: enable

[root@SC-10-10-63-182 log]# gluster volume status master-volume-1
Status of volume: master-volume-1
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Hot Bricks:
Brick 10.10.60.182:/exports/master-hot-tier
/master-volume-1                            62001     0          Y       15690
Cold Bricks:
Brick 10.10.60.182:/exports/master-segment-
1/master-volume-1                           62000     0          Y       9762
Tier Daemon on localhost                    N/A       N/A        Y       15713

Task Status of Volume master-volume-1
------------------------------------------------------------------------------
There are no active volume tasks

[root@SC-10-10-63-182 log]# gluster volume tier master-volume-1 status
Node                 Promoted files       Demoted files        Status           run time in h:m:s
---------            ---------            ---------            ---------        ---------
localhost            0                    0                    in progress      0:3:40
Tiering Migration Functionality: master-volume-1: success
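For reference, a tiered master volume like the one shown above would be created roughly as follows. This is a sketch only (the report does not include the original commands): it uses the attach-tier syntax of the GlusterFS 3.7–5.x CLI, with brick paths taken from the volume info above, and must be run on a live gluster cluster.

```shell
# Sketch: create the cold (base) volume, start it, then attach the hot
# tier. Brick paths match the "gluster volume info" output above.
gluster volume create master-volume-1 \
    10.10.60.182:/exports/master-segment-1/master-volume-1 force
gluster volume start master-volume-1
gluster volume tier master-volume-1 attach \
    10.10.60.182:/exports/master-hot-tier/master-volume-1 force
```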



3. On the slave node, create the slave volume:

[root@SC-10-10-63-183 log]# gluster volume info slave-volume-1

Volume Name: slave-volume-1
Type: Distribute
Volume ID: 569a340b-35f8-4109-8816-720982b11806
Status: Started
Snapshot Count: 0
Number of Bricks: 1
Transport-type: tcp
Bricks:
Brick1: 10.10.60.183:/exports/slave-segment-1/slave-volume-1
Options Reconfigured:
server.allow-insecure: on
performance.quick-read: off
performance.stat-prefetch: off
nfs.addr-namelookup: off
transport.address-family: inet
nfs.disable: on
cluster.enable-shared-storage: disable
snap-activate-on-create: enable

[root@SC-10-10-63-183 log]# gluster volume status slave-volume-1
Status of volume: slave-volume-1
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.10.60.183:/exports/slave-segment-1
/slave-volume-1                             62000     0          Y       2532

Task Status of Volume slave-volume-1
------------------------------------------------------------------------------
There are no active volume tasks

4. Set up SSH access to the slave node:

SSH from 182 to 183:

20660 01/21/2019 13:58:54.930122501 1548107934 command: /usr/bin/ssh nasgorep@10.10.60.183 /bin/pwd
20660 01/21/2019 13:58:55.021906148 1548107935 status=0 /usr/bin/ssh nasgorep@10.10.60.183 /bin/pwd
20694 01/21/2019 13:58:56.169890800 1548107936 command: /usr/bin/ssh -q -oConnectTimeout=5 nasgorep@10.10.60.183 /bin/pwd 2>&1
20694 01/21/2019 13:58:56.256032202 1548107936 status=0 /usr/bin/ssh -q -oConnectTimeout=5 nasgorep@10.10.60.183 /bin/pwd 2>&1


5. Initialize geo-replication from the master volume to the slave volume:

[root@SC-10-10-63-182 log]# vi /var/log/glusterfs/cmd_history.log

[2019-01-21 21:59:08.942567]  : system:: execute gsec_create : SUCCESS
[2019-01-21 21:59:42.722194]  : volume geo-replication master-volume-1 nasgorep@10.10.60.183::slave-volume-1 create push-pem : SUCCESS
[2019-01-21 21:59:49.527353]  : volume geo-replication master-volume-1 nasgorep@10.10.60.183::slave-volume-1 start : SUCCESS
[2019-01-21 21:59:55.636198]  : volume geo-replication master-volume-1 nasgorep@10.10.60.183::slave-volume-1 status detail : SUCCESS

6. Check the status of the geo-replication:

Actual results:

[root@SC-10-10-63-183 log]# /usr/sbin/gluster-mountbroker status
+-----------+-------------+---------------------------+--------------+--------------------------+
|    NODE   | NODE STATUS |         MOUNT ROOT        |    GROUP     |          USERS           |
+-----------+-------------+---------------------------+--------------+--------------------------+
| localhost |          UP | /var/mountbroker-root(OK) | nasgorep(OK) | nasgorep(slave-volume-1) |
+-----------+-------------+---------------------------+--------------+--------------------------+

[root@SC-10-10-63-182 log]# gluster volume geo-replication master-volume-1 nasgorep@10.10.60.183::slave-volume-1 status

MASTER NODE     MASTER VOL         MASTER BRICK                                 SLAVE USER    SLAVE                                    SLAVE NODE    STATUS     CRAWL STATUS    LAST_SYNCED
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
10.10.60.182    master-volume-1    /exports/master-hot-tier/master-volume-1     nasgorep      nasgorep@10.10.60.183::slave-volume-1    N/A           Stopped    N/A             N/A
10.10.60.182    master-volume-1    /exports/master-segment-1/master-volume-1    nasgorep      nasgorep@10.10.60.183::slave-volume-1    N/A           Stopped    N/A             N/A


Expected results:

The status of the geo-replication workers on the master node should be "Active".


Additional info:

The contents of
/var/log/glusterfs/geo-replication/master-volume-1_10.10.60.183_slave-volume-1/gsyncd.log
on the master node explain what is wrong:

[root@SC-10-10-63-182 log]# vi
/var/log/glusterfs/geo-replication/master-volume-1_10.10.60.183_slave-volume-1/gsyncd.log

[2019-01-21 21:59:39.347943] W [gsyncd(config-get):304:main] <top>: Session
config file not exists, using the default config   
path=/var/lib/glusterd/geo-replication/master-volume-1_10.10.60.183_slave-volume-1/gsyncd.conf
[2019-01-21 21:59:42.438145] I [gsyncd(monitor-status):308:main] <top>: Using
session config file  
path=/var/lib/glusterd/geo-replication/master-volume-1_10.10.60.183_slave-volume-1/gsyncd.conf
[2019-01-21 21:59:42.454929] I
[subcmds(monitor-status):29:subcmd_monitor_status] <top>: Monitor Status Change
 status=Created
[2019-01-21 21:59:48.756702] I [gsyncd(config-get):308:main] <top>: Using
session config file  
path=/var/lib/glusterd/geo-replication/master-volume-1_10.10.60.183_slave-volume-1/gsyncd.conf
[2019-01-21 21:59:49.4720] I [gsyncd(config-get):308:main] <top>: Using session
config file
path=/var/lib/glusterd/geo-replication/master-volume-1_10.10.60.183_slave-volume-1/gsyncd.conf
[2019-01-21 21:59:49.239733] I [gsyncd(config-get):308:main] <top>: Using
session config file  
path=/var/lib/glusterd/geo-replication/master-volume-1_10.10.60.183_slave-volume-1/gsyncd.conf
[2019-01-21 21:59:49.475193] I [gsyncd(monitor):308:main] <top>: Using session
config file 
path=/var/lib/glusterd/geo-replication/master-volume-1_10.10.60.183_slave-volume-1/gsyncd.conf
[2019-01-21 21:59:49.868150] I [gsyncdstatus(monitor):248:set_worker_status]
GeorepStatus: Worker Status Change status=Initializing...
[2019-01-21 21:59:49.868396] I [monitor(monitor):157:monitor] Monitor: starting
gsyncd worker   slave_node=10.10.60.183
brick=/exports/master-segment-1/master-volume-1
[2019-01-21 21:59:49.871593] I [gsyncdstatus(monitor):248:set_worker_status]
GeorepStatus: Worker Status Change status=Initializing...
[2019-01-21 21:59:49.871963] I [monitor(monitor):157:monitor] Monitor: starting
gsyncd worker   slave_node=10.10.60.183
brick=/exports/master-hot-tier/master-volume-1
[2019-01-21 21:59:50.4395] I [monitor(monitor):268:monitor] Monitor: worker
died before establishing connection
brick=/exports/master-segment-1/master-volume-1
[2019-01-21 21:59:50.7447] I [monitor(monitor):268:monitor] Monitor: worker
died before establishing connection
brick=/exports/master-hot-tier/master-volume-1
[2019-01-21 21:59:50.8415] I [gsyncd(agent
/exports/master-segment-1/master-volume-1):308:main] <top>: Using session
config file   
path=/var/lib/glusterd/geo-replication/master-volume-1_10.10.60.183_slave-volume-1/gsyncd.conf
[2019-01-21 21:59:50.10383] I [gsyncd(agent
/exports/master-hot-tier/master-volume-1):308:main] <top>: Using session config
file   
path=/var/lib/glusterd/geo-replication/master-volume-1_10.10.60.183_slave-volume-1/gsyncd.conf
[2019-01-21 21:59:50.14039] I [repce(agent
/exports/master-segment-1/master-volume-1):97:service_loop] RepceServer:
terminating on reaching EOF.
[2019-01-21 21:59:50.15556] I [changelogagent(agent
/exports/master-hot-tier/master-volume-1):72:__init__] ChangelogAgent: Agent
listining...
[2019-01-21 21:59:50.15964] I [repce(agent
/exports/master-hot-tier/master-volume-1):97:service_loop] RepceServer:
terminating on reaching EOF.
[2019-01-21 21:59:55.141768] I [gsyncd(config-get):308:main] <top>: Using
session config file  
path=/var/lib/glusterd/geo-replication/master-volume-1_10.10.60.183_slave-volume-1/gsyncd.conf
[2019-01-21 21:59:55.380496] I [gsyncd(status):308:main] <top>: Using session
config file  
path=/var/lib/glusterd/geo-replication/master-volume-1_10.10.60.183_slave-volume-1/gsyncd.conf
[2019-01-21 21:59:55.625045] I [gsyncd(status):308:main] <top>: Using session
config file  
path=/var/lib/glusterd/geo-replication/master-volume-1_10.10.60.183_slave-volume-1/gsyncd.conf
[2019-01-21 22:00:00.66032] I [gsyncdstatus(monitor):248:set_worker_status]
GeorepStatus: Worker Status Change  status=inconsistent
[2019-01-21 22:00:00.66289] E [syncdutils(monitor):338:log_raise_exception]
<top>: FAIL:
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 368, in
twrap
    tf(*aargs)
  File "/usr/libexec/glusterfs/python/syncdaemon/monitor.py", line 339, in wmon
    slave_host, master, suuid, slavenodes)
TypeError: 'int' object is not iterable
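The traceback shows the monitor's wmon() thread failing while handling its per-brick arguments. As a minimal, hypothetical Python reduction of that failure mode (the function and argument names below are illustrative, not the actual gsyncd monitor.py source): the monitor expects an iterable of (slave_host, master, suuid, slavenodes) argument sets, but for a tiered master volume some code path apparently hands it a bare int, and iterating it raises exactly this TypeError.

```python
# Hypothetical reduction of the crash; monitor_workers() and its argument
# layout are illustrative, not the real gsyncd monitor.py code.

def monitor_workers(worker_args):
    """Expects an iterable of (slave_host, master, suuid, slavenodes)
    tuples, one per master brick."""
    started = []
    for slave_host, master, suuid, slavenodes in worker_args:
        started.append((slave_host, master))
    return started

# Normal case: a list of per-brick argument tuples works.
print(monitor_workers(
    [("10.10.60.183", "master-volume-1", "uuid-1", ["10.10.60.183"])]))

# Failure mode seen in the log: an int arrives where the iterable was
# expected, and the for-loop raises the same TypeError.
try:
    monitor_workers(0)
except TypeError as err:
    print(err)  # 'int' object is not iterable
```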


The same test on GlusterFS 3.12.14 does not show this failure.
