[Bugs] [Bug 1159195] New: dist-geo-rep: geo-rep status in one of rebooted node remains at "Stable(paused)" after session is resumed.

bugzilla at redhat.com bugzilla at redhat.com
Fri Oct 31 08:07:02 UTC 2014


https://bugzilla.redhat.com/show_bug.cgi?id=1159195

            Bug ID: 1159195
           Summary: dist-geo-rep: geo-rep status in one of rebooted node
                    remains at "Stable(paused)" after session is resumed.
           Product: GlusterFS
           Version: 3.6.0
         Component: geo-replication
          Keywords: ZStream
          Severity: medium
          Priority: medium
          Assignee: bugs at gluster.org
          Reporter: khiremat at redhat.com
                CC: aavati at redhat.com, avishwan at redhat.com,
                    bugs at gluster.org, csaba at redhat.com,
                    gluster-bugs at redhat.com, nlevinki at redhat.com,
                    rhs-bugs at redhat.com, smanjara at redhat.com,
                    ssaha at redhat.com, ssamanta at redhat.com,
                    storage-qa-internal at redhat.com, vbhat at redhat.com
        Depends On: 1142960, 1149982



+++ This bug was initially created as a clone of Bug #1149982 +++

+++ This bug was initially created as a clone of Bug #1142960 +++

Description of problem:
When you pause the geo-rep session, reboot one of the Passive nodes, and
resume the session after the node comes back online, the status of that
node stays stuck at "Stable(paused)" even after a long time. All other machines
have moved on to the Active/Passive state except for the node which got rebooted.

Version-Release number of selected component (if applicable):
mainline

How reproducible:
Hit only once. Haven't retried, but it seems like an easily reproducible issue.

Steps to Reproduce:
1. Create and start a geo-rep session between a 2*2 dist-rep master and a 2*2
dist-rep slave.
2. Now create some data and let it sync to slave.
3. Pause the session using the geo-rep pause command and check the status. It
should be "Stable(paused)".
4. Now reboot one of the Passive nodes and wait for the node to come back
online.
5. Check the status. All but the rebooted node should have "Stable(paused)"
state. And rebooted node should have "faulty(paused)" state.
6. Now resume the session using the geo-rep resume command.
7. And check the status
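
The pause/resume steps above can be sketched as a small shell helper. This is
only a sketch: the volume name "master" and the slave "acdc::slave" are taken
from the commands later in this report, while the function names and the
GLUSTER override variable are made up for illustration. The reboot in step 4
still has to be done by hand between the two calls.

```shell
#!/bin/sh
# Hypothetical helper for steps 3, 5, 6 and 7; not part of GlusterFS.
# GLUSTER can be overridden (e.g. GLUSTER="echo gluster") to preview commands.
GLUSTER=${GLUSTER:-gluster}
MASTER_VOL=${MASTER_VOL:-master}
SLAVE=${SLAVE:-acdc::slave}

georep() {
    # Word splitting on $GLUSTER is intentional so it may hold "echo gluster".
    $GLUSTER volume geo-replication "$MASTER_VOL" "$SLAVE" "$@"
}

# Step 3 (and step 5 after the reboot): pause, then look at the status.
pause_and_check() {
    georep pause && georep status
}

# Steps 6 and 7: resume, then look at the status again.
resume_and_check() {
    georep resume && georep status
}
```

Typical use would be to run pause_and_check, reboot one Passive node and wait
for it to come back, run "georep status" once more, and then run
resume_and_check.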

Actual results:
Now the status of the rebooted node gets stuck at "Stable(paused)" while the
other nodes' states go back to "Active/Passive".

MASTER NODE                 MASTER VOL    MASTER BRICK      SLAVE               STATUS            CHECKPOINT STATUS    CRAWL STATUS
---------------------------------------------------------------------------------------------------------------------------------------
ccr.blr.redhat.com          master        /bricks/brick0    nirvana::slave      Active            N/A                  Changelog Crawl
metallica.blr.redhat.com    master        /bricks/brick1    acdc::slave         Passive           N/A                  N/A
beatles.blr.redhat.com      master        /bricks/brick3    rammstein::slave    Stable(Paused)    N/A                  N/A
pinkfloyd.blr.redhat.com    master        /bricks/brick2    led::slave          Active            N/A                  Changelog Crawl




Expected results:
All nodes should have the proper status updated after resume.
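
One way to check this expectation after a resume is to filter the status
output for rows that still report a paused state. The sketch below assumes the
status command prints one row per line with the STATUS value in the fifth
whitespace-separated column, as in the table in this report; the function name
paused_nodes is made up for illustration.

```shell
#!/bin/sh
# Hypothetical filter: reads geo-rep status output on stdin and prints the
# master node of every row whose STATUS column still contains "(Paused)".
paused_nodes() {
    awk '$1 != "MASTER" && $1 !~ /^-+$/ && NF >= 5 &&
         tolower($5) ~ /\(paused\)/ { print $1 }'
}
```

For example, piping the output of "gluster volume geo-replication master
acdc::slave status" into paused_nodes after a resume should print nothing if
every node has left the paused state.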


Additional info:
All the data got synced because the Active node was not paused (status was
showing Active). Not sure what happens when an Active node gets rebooted.


I hit the same issue of one node in paused state and other node in
Active/Passive state even without a reboot. It happens intermittently. Does not
happen every time.

Resume fails with the following error, which says that geo-rep is not paused on
the following machines.

[root@pinkfloyd ~]# gluster v geo master acdc::slave resume
Staging failed on 10.70.43.127. Error: Geo-replication session between master
and acdc::slave is not Paused.
Staging failed on beatles. Error: Geo-replication session between master and
acdc::slave is not Paused.
Staging failed on metallica. Error: Geo-replication session between master and
acdc::slave is not Paused.
geo-replication command failed


Then pause fails with the following message.

[root@pinkfloyd ~]# gluster v geo master acdc::slave pause
Geo-replication session between master and acdc::slave already Paused.
geo-replication command failed


This does not happen every time without a reboot. The only workaround I found
was to stop and then restart geo-replication.
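
The workaround can be sketched as follows, assuming the same session as in the
commands above (volume "master", slave "acdc::slave"). The function name
restart_georep and the GLUSTER override variable are made up for illustration;
only the stop and start subcommands are issued.

```shell
#!/bin/sh
# Hypothetical wrapper for the workaround: stop the session, then start it.
# GLUSTER can be overridden (e.g. GLUSTER="echo gluster") to preview commands.
GLUSTER=${GLUSTER:-gluster}

restart_georep() {
    master_vol=$1
    slave=$2
    # Only start the session again if the stop succeeded.
    $GLUSTER volume geo-replication "$master_vol" "$slave" stop &&
        $GLUSTER volume geo-replication "$master_vol" "$slave" start
}
```

It would be called as "restart_georep master acdc::slave", followed by a
status check to confirm that all nodes are back in the Active/Passive state.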


Referenced Bugs:

https://bugzilla.redhat.com/show_bug.cgi?id=1142960
[Bug 1142960] dist-geo-rep: geo-rep status in one of rebooted node remains
at "Stable(paused)" after session is resumed.
https://bugzilla.redhat.com/show_bug.cgi?id=1149982
[Bug 1149982] dist-geo-rep: geo-rep status in one of rebooted node remains
at "Stable(paused)" after session is resumed.

