[Gluster-devel] A new approach to solve Geo-replication ACTIVE/PASSIVE switching in distributed replicate setup!
Kotresh Hiremath Ravishankar
khiremat at redhat.com
Thu Feb 26 12:39:36 UTC 2015
We chose fcntl over sanlock because self-healing of locks is being implemented.
The patch is up for review.
Thanks and Regards,
Kotresh H R
----- Original Message -----
> From: "Kotresh Hiremath Ravishankar" <khiremat at redhat.com>
> To: gluster-devel at gluster.org
> Sent: Monday, February 23, 2015 10:37:01 AM
> Subject: Re: [Gluster-devel] A new approach to solve Geo-replication ACTIVE/PASSIVE switching in distributed
> replicate setup!
> Hi All,
> The logic discussed in the previous mail thread is not feasible. So, in order to
> solve the Active/Passive switching in geo-replication, the following new idea
> has been thought of.
> 1. Have a shared storage, a glusterfs management volume specific to
>    geo-replication.
> 2. Use an fcntl lock on a file stored on the above shared volume. There will
>    be one file per replica set.
> Each worker tries to lock the file on the shared storage; whoever wins will be
> ACTIVE.
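A minimal sketch of the locking step, assuming a POSIX fcntl lock taken through Python's fcntl.lockf on a file that lives on the shared volume. The function name try_become_active and the lock file path are hypothetical and are not taken from any geo-replication patch.

```python
import errno
import fcntl
import os


def try_become_active(lock_path):
    """Try to take an exclusive, non-blocking fcntl lock on lock_path.

    Returns the open file descriptor if the lock was won (keep it open
    for as long as the worker stays ACTIVE), or None if another process
    already holds the lock (the worker stays PASSIVE).
    """
    fd = os.open(lock_path, os.O_CREAT | os.O_RDWR, 0o600)
    try:
        fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError as err:
        os.close(fd)
        if err.errno in (errno.EACCES, errno.EAGAIN):
            return None  # lock is held elsewhere
        raise
    return fd
```

A worker would call this once per replica set and keep the returned descriptor open while ACTIVE. POSIX record locks belong to the process, so a second attempt from the same process does not conflict, and closing any descriptor for the file in that process releases the lock.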
> With this, we are able to solve the problem, but there is an issue when the
> shared storage goes down (if it is a replica, when all replicas go down). In
> that case, the lock state is lost.
> But if we use sanlock, as oVirt does, I think the above problem of lost lock
> state could be solved?
> If anybody has used sanlock, is it a good option in this respect?
> Please share your thoughts and suggestions on this.
> Thanks and Regards,
> Kotresh H R
> ----- Original Message -----
> > From: "Kotresh Hiremath Ravishankar" <khiremat at redhat.com>
> > To: gluster-devel at gluster.org
> > Sent: Monday, December 22, 2014 10:53:34 AM
> > Subject: [Gluster-devel] A new approach to solve Geo-replication
> > ACTIVE/PASSIVE switching in distributed replicate
> > setup!
> > Hi All,
> > Current Design and its limitations:
> > Geo-replication syncs changes across geography using changelogs captured
> > by the changelog translator. The changelog translator sits on the server
> > side, just above the posix translator. Hence, in a distributed replicated
> > setup, both bricks of a replica pair collect changelogs w.r.t. their bricks.
> > Geo-replication syncs the changes using only one brick of the replica pair
> > at a time, calling it "ACTIVE" and the other, non-syncing brick "PASSIVE".
> > Let's consider the example below of a distributed replicated setup, where
> > NODE-1 has brick b1 and its replicated brick b1r is on NODE-2:
> >     NODE-1    NODE-2
> >     b1        b1r
> > At the beginning, geo-replication chooses to sync changes from NODE-1:b1,
> > and NODE-2:b1r will be "PASSIVE". The logic depends on the virtual getxattr
> > 'trusted.glusterfs.node-uuid', which always returns the first up subvolume,
> > i.e., NODE-1.
> > When NODE-1 goes down, the above xattr returns NODE-2 and that is made
> > 'ACTIVE'.
> > But when NODE-1 comes back again, the above xattr returns NODE-1 and it is
> > made 'ACTIVE' again. So for a brief interval of time, if NODE-2 had not
> > finished processing the changelog, both NODE-2 and NODE-1 will be ACTIVE,
> > causing a rename race as described in the bug below.
> > https://bugzilla.redhat.com/show_bug.cgi?id=1140183
> > SOLUTION:
> > Don't make NODE-2 'PASSIVE' when NODE-1 comes back again, until NODE-2
> > goes down.
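To make the difference concrete, here is a small simulation of the two policies for the NODE-1/NODE-2 example. It is only a sketch: first_up stands in for the 'first up subvolume' behaviour described for the node-uuid xattr, and the function names and the event list are made up for illustration.

```python
def first_up(nodes, up):
    """Pick the first node in order that is currently up."""
    for node in nodes:
        if node in up:
            return node
    return None


def sticky(nodes, up, current):
    """Keep the current ACTIVE node while it is up; otherwise re-elect."""
    if current in up:
        return current
    return first_up(nodes, up)


nodes = ["NODE-1", "NODE-2"]
# Both up, then NODE-1 down, then NODE-1 back again.
events = [{"NODE-1", "NODE-2"}, {"NODE-2"}, {"NODE-1", "NODE-2"}]

first_up_history = [first_up(nodes, up) for up in events]

sticky_history = []
current = None
for up in events:
    current = sticky(nodes, up, current)
    sticky_history.append(current)

print(first_up_history)  # ['NODE-1', 'NODE-2', 'NODE-1']
print(sticky_history)    # ['NODE-1', 'NODE-2', 'NODE-2']
```

With the first policy, ACTIVE flips back to NODE-1 as soon as it returns, which is the window in which both workers can be syncing. With the sticky policy, NODE-2 stays ACTIVE until it goes down itself.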
> > APPROACH TO SOLVE WHICH I CAN THINK OF:
> > Have a distributed store of a file which captures the bricks that are
> > active.
> > When a NODE goes down, the file is updated with its replica bricks, making
> > sure that, at any point in time, the file has all the bricks to be made
> > active.
> > A geo-replication worker process is made 'ACTIVE' only if it is in the file.
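A rough sketch of that bookkeeping, with the file modeled as an in-memory set. The 'node:brickpath' strings, the brick paths, the replica_of map, and both function names are assumptions made for illustration only.

```python
def on_node_down(active, down_node, replica_of):
    """Return the active set after down_node fails.

    Every active brick hosted on down_node is swapped for its replica;
    bricks on other nodes are left alone.
    """
    updated = set()
    for brick in active:
        node = brick.split(":", 1)[0]
        updated.add(replica_of[brick] if node == down_node else brick)
    return updated


def is_active(active, brick):
    """A worker is ACTIVE only if its brick is listed."""
    return brick in active


# Hypothetical two-node replica pair matching the example above.
replica_of = {"NODE-1:/b1": "NODE-2:/b1r", "NODE-2:/b1r": "NODE-1:/b1"}
active = {"NODE-1:/b1"}

# NODE-1 goes down: its replica on NODE-2 takes over.
active = on_node_down(active, "NODE-1", replica_of)
```

Nothing rewrites the set when NODE-1 returns, so NODE-2:/b1r stays ACTIVE until NODE-2 itself goes down.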
> > Implementation can be in two ways:
> > 1. Have a distributed store for the above implementation. This needs to be
> >    thought through, as a distributed store is not in place in glusterd yet.
> > 2. The other solution is to store it in a file similar to the existing
> >    glusterd global configuration file (/var/lib/glusterd/options). When this
> >    file is updated, its version number is incremented. When a node which has
> >    gone down comes up, it gets this file from its peers if its version
> >    number is less than that of the peers.
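The version rule in the second option could look roughly like this. It is a sketch under assumptions: OptionsFile and sync_from_peers are hypothetical names, the file is modeled in memory, and nothing here reflects the real on-disk format or the glusterd handshake code.

```python
from dataclasses import dataclass, field


@dataclass
class OptionsFile:
    """In-memory stand-in for a versioned options file."""
    version: int = 0
    active_bricks: set = field(default_factory=set)

    def update(self, bricks):
        """Replace the brick list and bump the version."""
        self.active_bricks = set(bricks)
        self.version += 1


def sync_from_peers(local, peers):
    """Return a copy of the newest peer file if it is newer than local.

    If no peer has a higher version, local is returned unchanged.
    """
    newest = max(peers, key=lambda p: p.version, default=local)
    if newest.version > local.version:
        return OptionsFile(newest.version, set(newest.active_bricks))
    return local
```

A node that was down while its peers updated the file would call sync_from_peers on startup and continue with the returned copy.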
> > I did a POC with the second approach, storing the list of active bricks as
> > 'NodeUUID:brickpath' in the options file itself. It seems to work fine,
> > except for a bug in glusterd where the daemons get spawned before the node
> > gets the 'options' file from the other node during handshake.
> > CHANGES IN GLUSTERD:
> > When a node goes down, all the other nodes are notified through
> > glusterd_peer_rpc_notify, where it needs to find the replicas of the node
> > which went down and update the global file.
> > PROBLEMS/LIMITATIONS WITH THIS APPROACH:
> > 1. If glusterd is killed and the node is still up, this makes the other
> >    replica 'ACTIVE'. So both replica bricks will be syncing at this point in
> >    time, which is not expected.
> > 2. If a single brick process is killed, its replica brick is not made
> >    'ACTIVE'.
> > Glusterd/AFR folks,
> > 1. Do you see a better approach other than the above to solve this issue?
> > 2. Is this approach feasible? If yes, how can I handle the problems
> >    mentioned above?
> > 3. Is this approach feasible from a scalability point of view, since the
> >    complete list of active brick paths is stored and read by gsyncd?
> > 4. Does this approach fit into three-way replication and erasure coding?
> > Thanks and Regards,
> > Kotresh H R