[Gluster-devel] A new approach to solve Geo-replication ACTIVE/PASSIVE switching in distributed replicate setup!

Thu Feb 26 12:39:36 UTC 2015

Hi All,

We chose fcntl over sanlock, as self-healing of locks is being implemented.
The patch is up for review.

http://review.gluster.org/#/c/9759/
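
For readers following the thread, here is a minimal sketch of the fcntl-based election idea (one lock file per replica set, whoever gets the lock is ACTIVE). It is not taken from the patch above. The function name `try_become_active` and the lock-file path are hypothetical, and it assumes the shared volume is mounted on every node and honours POSIX locks.

```python
import errno
import fcntl
import os


def try_become_active(lock_path):
    """Try to take an exclusive, non-blocking fcntl lock on lock_path.

    Returns the open file descriptor if this process won the lock (ACTIVE),
    or None if another process already holds it (PASSIVE).
    The name and return convention are illustrative assumptions.
    """
    fd = os.open(lock_path, os.O_CREAT | os.O_RDWR, 0o600)
    try:
        # LOCK_NB makes the call fail immediately instead of blocking.
        fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError as e:
        os.close(fd)
        if e.errno in (errno.EACCES, errno.EAGAIN):
            return None  # someone else is ACTIVE for this replica set
        raise
    return fd  # keep this fd open for as long as the worker stays ACTIVE
```

A worker would call this periodically with the lock file of its replica set (for example a hypothetical `<shared-mount>/replica-set-0.lock`). The kernel drops the lock when the holder closes the descriptor or dies, which is what lets a PASSIVE worker take over.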

Thanks and Regards,
Kotresh H R

----- Original Message -----
> From: "Kotresh Hiremath Ravishankar" <khiremat at redhat.com>
> To: gluster-devel at gluster.org
> Sent: Monday, February 23, 2015 10:37:01 AM
> Subject: Re: [Gluster-devel] A new approach to solve Geo-replication ACTIVE/PASSIVE switching in distributed
> replicate setup!
> 
> Hi All,
> 
> The logic discussed in the previous mail is not feasible. So, to solve
> Active/Passive switching in geo-replication, the following new idea has been
> thought of.
> 
> 1. Have a shared storage, a glusterfs management volume specific to
>    geo-replication.
> 
> 2. Use an fcntl lock on a file stored on the above shared volume. There will
>    be one file per replica set.
> 
> Each worker tries to lock the file on the shared storage; whoever wins becomes
> ACTIVE. This solves the problem, but there is an issue when the shared storage
> goes down (if it is replicated, when all replicas go down). In that case, the
> lock state is lost.
> 
> But if we use sanlock, as oVirt does, I think the above problem of lock state
> being lost could be solved?
> https://fedorahosted.org/sanlock/
> 
> If anybody has used sanlock, is it a good option in this respect?
> Please share your thoughts and suggestions on this.
> 
> 
> Thanks and Regards,
> Kotresh H R
> 
> ----- Original Message -----
> > From: "Kotresh Hiremath Ravishankar" <khiremat at redhat.com>
> > To: gluster-devel at gluster.org
> > Sent: Monday, December 22, 2014 10:53:34 AM
> > Subject: [Gluster-devel] A new approach to solve Geo-replication
> > ACTIVE/PASSIVE switching in distributed replicate
> > setup!
> > 
> > Hi All,
> > 
> > Current Design and its limitations:
> > 
> >   Geo-replication syncs changes across geographies using changelogs captured
> >   by the changelog translator. The changelog translator sits on the server
> >   side, just above the posix translator. Hence, in a distributed replicated
> >   setup, both bricks of a replica pair collect changelogs for themselves.
> >   Geo-replication syncs the changes using only one brick of the replica
> >   pair at a time, calling it "ACTIVE" and the other, non-syncing brick
> >   "PASSIVE".
> >    
> >   Let's consider the example below of a distributed replicated setup where
> >   NODE-1 has brick b1 and its replica brick b1r is on NODE-2:
> > 
> >                         NODE-1                         NODE-2
> >                           b1                            b1r
> > 
> >   At the beginning, geo-replication chooses to sync changes from NODE-1:b1,
> >   and NODE-2:b1r will be "PASSIVE". The logic depends on the virtual getxattr
> >   'trusted.glusterfs.node-uuid', which always returns the first up subvolume,
> >   i.e., NODE-1. When NODE-1 goes down, the above xattr returns NODE-2 and
> >   that is made 'ACTIVE'. But when NODE-1 comes back, the above xattr returns
> >   NODE-1 and it is made 'ACTIVE' again. So for a brief interval, if NODE-2
> >   has not finished processing the changelog, both NODE-2 and NODE-1 will be
> >   ACTIVE, causing a rename race as described in the bug below.
> >    
> >    https://bugzilla.redhat.com/show_bug.cgi?id=1140183
> > 
> > 
> > SOLUTION:
> >    Don't make NODE-2 'PASSIVE' when NODE-1 comes back again; keep it ACTIVE
> >    until NODE-2 itself goes down.
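
To make the difference concrete, here is a toy model (not gluster code) of the two selection rules: "first up replica wins", as the node-uuid xattr behaviour is described above, versus a "sticky" rule that keeps the current ACTIVE node while it stays up. The function names and the event list are illustrative assumptions.

```python
def first_up(replicas, up):
    """Current behaviour as described: pick the first replica that is up."""
    for node in replicas:
        if node in up:
            return node
    return None


def sticky(replicas, up, current):
    """Proposed behaviour: keep the current ACTIVE node while it stays up."""
    if current in up:
        return current
    return first_up(replicas, up)


def replay(replicas, events):
    """Return the ACTIVE node chosen by each rule after every up/down event."""
    by_first_up, by_sticky, current = [], [], None
    for up in events:
        by_first_up.append(first_up(replicas, up))
        current = sticky(replicas, up, current)
        by_sticky.append(current)
    return by_first_up, by_sticky


replicas = ["NODE-1", "NODE-2"]
# Both up, then NODE-1 goes down, then NODE-1 comes back.
events = [{"NODE-1", "NODE-2"}, {"NODE-2"}, {"NODE-1", "NODE-2"}]
flip_flop, stable = replay(replicas, events)
```

With the first rule the ACTIVE role moves back to NODE-1 as soon as it returns, which is the window where both workers may be syncing. With the sticky rule NODE-2 stays ACTIVE after NODE-1 returns.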
> > 
> > 
> > APPROACH TO SOLVE WHICH I CAN THINK OF:
> > 
> > Have a distributed store holding a file that records which bricks are
> > active. When a NODE goes down, the file is updated with its replica bricks,
> > making sure that, at any point in time, the file lists all the bricks to be
> > made active. A geo-replication worker process is made 'ACTIVE' only if its
> > brick is in the file.
> > 
> >  Implementation can be done in two ways:
> > 
> >   1. Have a distributed store for the above implementation. This needs to be
> >      thought through, as a distributed store is not in place in glusterd yet.
> > 
> >   2. The other solution is to store the list in a file similar to the
> >      existing glusterd global configuration file
> >      (/var/lib/glusterd/options). When this file is updated, its version
> >      number is incremented. When a node that has gone down comes back up,
> >      it gets this file from its peers if its version number is less than
> >      that of the peers.
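
The version-number rule in option 2 can be sketched as below. This is only a model of the rule as stated (bump the version on every update, adopt a peer's copy when the local version is older). It is not glusterd's handshake code, and `OptionsFile` and `sync_from_peers` are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class OptionsFile:
    """In-memory stand-in for a versioned global file (hypothetical)."""
    version: int = 0
    active_bricks: set = field(default_factory=set)

    def update(self, active_bricks):
        # Every update replaces the list and bumps the version.
        self.active_bricks = set(active_bricks)
        self.version += 1


def sync_from_peers(local, peers):
    """Return the copy a rejoining node should use after the handshake."""
    newest = max(peers, key=lambda p: p.version, default=None)
    if newest is not None and newest.version > local.version:
        # Local copy is stale: adopt the peer's version and contents.
        return OptionsFile(newest.version, set(newest.active_bricks))
    return local
```

For example, a node that went down at version 1 and rejoins while a peer is at version 2 ends up with the peer's list of 'NodeUUID:brickpath' entries. A node whose version is already equal or newer keeps its own copy.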
> > 
> > 
> > I did a POC with the second approach, storing the list of active bricks as
> > 'NodeUUID:brickpath' in the options file itself. It seems to work fine,
> > except for a bug in glusterd where the daemons are spawned before the node
> > gets the 'options' file from the other node during handshake.
> > 
> > CHANGES IN GLUSTERD:
> >   When a node goes down, all the other nodes are notified through
> >   glusterd_peer_rpc_notify, where glusterd needs to find the replicas of the
> >   node that went down and update the global file.
> > 
> > PROBLEMS/LIMITATIONS WITH THIS APPROACH:
> >     1. If glusterd is killed while the node is still up, this makes the
> >        other replica 'ACTIVE'. So both replica bricks will be syncing at
> >        this point in time, which is not expected.
> > 
> >     2. If a single brick process is killed, its replica brick is not made
> >        'ACTIVE'.
> > 
> > 
> > Glusterd/AFR folks,
> > 
> >     1. Do you see a better approach than the above to solve this issue?
> >     2. Is this approach feasible? If yes, how can I handle the problems
> >        mentioned above?
> >     3. Is this approach feasible from a scalability point of view, since
> >        the complete list of active brick paths is stored and read by gsyncd?
> >     4. Does this approach fit with three-way replication and erasure
> >        coding?
> > 
> > 
> > 
> > Thanks and Regards,
> > Kotresh H R
> > _______________________________________________
> > Gluster-devel mailing list
> > Gluster-devel at gluster.org
> > http://www.gluster.org/mailman/listinfo/gluster-devel
> > 
>