[Gluster-devel] GlusterFS Snapshot internals

Paul Cuzner pcuzner at redhat.com
Fri Apr 11 02:38:15 UTC 2014


BZs raised for snapshots:

1086493 - RFE requesting a default name for snapshots ( https://bugzilla.redhat.com/show_bug.cgi?id=1086493 )
1086497 - RFE requesting a snapshot-after-snap-restore function ( https://bugzilla.redhat.com/show_bug.cgi?id=1086497 )

PC 

----- Original Message -----

> From: "Paul Cuzner" <pcuzner at redhat.com>
> To: "Rajesh Joseph" <rjoseph at redhat.com>
> Cc: "gluster-devel" <gluster-devel at nongnu.org>
> Sent: Wednesday, 9 April, 2014 1:50:07 PM
> Subject: Re: [Gluster-devel] GlusterFS Snapshot internals

> Thanks again, Rajesh.

> ----- Original Message -----

> > From: "Rajesh Joseph" <rjoseph at redhat.com>
> 
> > To: "Paul Cuzner" <pcuzner at redhat.com>
> 
> > Cc: "gluster-devel" <gluster-devel at nongnu.org>
> 
> > Sent: Wednesday, 9 April, 2014 12:04:35 AM
> 
> > Subject: Re: [Gluster-devel] GlusterFS Snapshot internals
> 

> > Hi Paul,
> 

> > Whenever a brick comes online it performs a handshake with glusterd. The
> > brick will not send a notification to clients until the handshake is done.
> > We are planning to provide an extension to this and recreate those missing
> > snaps.
> 
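Thinking it through in code helps me - here's a tiny sketch of that ordering as I read it. The class and method names below are made up for illustration; they are not the real glusterd/brick APIs.

    # Illustrative sketch of the described ordering; all names are hypothetical.
    class BrickServer:
        def __init__(self, brick_path, missed_snaps):
            self.brick_path = brick_path
            self.missed_snaps = missed_snaps   # bookkeeping recorded while down

        def glusterd_handshake(self):
            print("handshake with glusterd for", self.brick_path)

        def recreate_snapshot(self, snap):
            print("recreating missed snapshot", snap)

        def notify_clients(self):
            print("notifying clients that", self.brick_path, "is online")

        def start(self):
            self.glusterd_handshake()          # 1. handshake first
            for snap in self.missed_snaps:     # 2. proposed extension: recreate
                self.recreate_snapshot(snap)   #    snapshots missed while offline
            self.notify_clients()              # 3. only then tell clients

    BrickServer("/bricks/b2", ["S2", "S3", "S4"]).start()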

> > Best Regards,
> 
> > Rajesh
> 

> > ----- Original Message -----
> 
> > From: "Paul Cuzner" <pcuzner at redhat.com>
> 
> > To: "Rajesh Joseph" <rjoseph at redhat.com>
> 
> > Cc: "gluster-devel" <gluster-devel at nongnu.org>
> 
> > Sent: Tuesday, April 8, 2014 12:49:13 PM
> 
> > Subject: Re: [Gluster-devel] GlusterFS Snapshot internals
> 

> > Rajesh,
> 

> > Perfect explanation - the 'penny has dropped'. I was missing the healing
> > process of the snap being based on the snap from the replica.
> 

> > One final question - I assume the scenario you mention about the brick
> > coming back online before the snapshots are taken is theoretical, and there
> > are safeguards in place to prevent this from happening?
> 

> > BTW, I'll get the BZ RFEs in by the end of my week, and will post the BZs
> > back to the list for info.
> 

> > Thanks!
> 

> > PC
> 

> > ----- Original Message -----
> 

> > > From: "Rajesh Joseph" <rjoseph at redhat.com>
> 
> > > To: "Paul Cuzner" <pcuzner at redhat.com>
> 
> > > Cc: "gluster-devel" <gluster-devel at nongnu.org>
> 
> > > Sent: Tuesday, 8 April, 2014 5:09:10 PM
> 
> > > Subject: Re: [Gluster-devel] GlusterFS Snapshot internals
> 

> > > Hi Paul,
> 

> > > It would be great if you could raise RFEs for both snap after restore and
> > > snapshot naming.
> 

> > > Let's say your volume "Vol" has bricks b1, b2, b3 and b4.
> 

> > > @0800 - S1 (snapshot volume) -> s1_b1, s1_b2, s1_b3, s1_b4 (these are the
> > > respective snap bricks, which are on independent thin LVs)
> 

> > > @0830 - b2 went down

> > > @1000 - S2 (snapshot volume) -> s2_b1, x, s2_b3, s2_b4. Here we mark the
> > > brick as having a pending snapshot.
> > > Note that s2_b1 will have all the changes missed by b2 till 1000 hours.
> > > AFR will mark the pending changes on s2_b1.
> 

> > > @1200 - S3 (snapshot volume) -> s3_b1, x, s3_b3, s3_b4. This missed
> > > snapshot is also recorded.

> > > @1400 - S4 (snapshot volume) -> s4_b1, x, s4_b3, s4_b4. This missed
> > > snapshot is also recorded.
> 

> > > @1530 - b2 comes back. Before making it online we take snapshots s2_b2,
> > > s3_b2 and s4_b2. Since all three of these snapshots are taken at nearly
> > > the same time, content-wise all of them will be in the same state. These
> > > bricks are then added to their respective snapshot volumes. Note that no
> > > healing has been done yet. After the addition the snapshot volumes will
> > > look like this:
> > > S2 -> s2_b1, s2_b2, s2_b3, s2_b4.
> > > S3 -> s3_b1, s3_b2, s3_b3, s3_b4.
> > > S4 -> s4_b1, s4_b2, s4_b3, s4_b4.
> > > After this b2 comes online, i.e. clients can access the brick. Now S2,
> > > S3 and S4 are healed: s2_b2 is healed from s2_b1, s3_b2 from s3_b1, and
> > > so on. This healing brings s2_b2 to the state at the point when the
> > > snapshot was taken.
> 

> > > If the brick comes online before these snapshots are taken, self-heal
> > > will try to bring the brick (b2) to a point closer to the current time
> > > (@1530). It would therefore not be consistent with the other bricks in
> > > its replica set.
> 
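To check I've followed the sequence correctly, here's a rough Python sketch of the bookkeeping as I read it - all of the classes and helpers below are made up for illustration, they are not glusterd internals:

    # Rough model of the missed-snapshot handling described above.
    class Brick:
        def __init__(self, name):
            self.name = name
            self.online = True

    class SnapVolume:
        def __init__(self, name):
            self.name = name
            self.snap_bricks = {}      # brick name -> snap brick (thin LV) or None
            self.pending_heal = set()  # snap bricks to heal from a replica snap

    def take_snapshot(snap_name, bricks):
        """Snapshot every online brick; record a pending snap for offline ones."""
        snap = SnapVolume(snap_name)
        for b in bricks:
            if b.online:
                snap.snap_bricks[b.name] = "%s_%s" % (snap_name, b.name)
            else:
                snap.snap_bricks[b.name] = None   # missed - bookkeeping only
        return snap

    def brick_returns(brick, snaps, replica_of):
        """Before the brick goes online, take all missed snaps, then queue heals."""
        for snap in snaps:
            if snap.snap_bricks[brick.name] is None:
                # all catch-up snaps are taken now, so content-wise they start identical
                snap.snap_bricks[brick.name] = "%s_%s" % (snap.name, brick.name)
                # heal this snap brick from the replica's snap brick (e.g. s2_b2 from s2_b1)
                snap.pending_heal.add((snap.snap_bricks[brick.name],
                                       snap.snap_bricks[replica_of[brick.name]]))
        brick.online = True   # clients may access the brick only after this point

    # --- replay of the timeline above ---
    b1, b2, b3, b4 = (Brick(n) for n in ("b1", "b2", "b3", "b4"))
    replica_of = {"b1": "b2", "b2": "b1", "b3": "b4", "b4": "b3"}

    s1 = take_snapshot("s1", [b1, b2, b3, b4])       # 0800: all bricks up
    b2.online = False                                # 0830: b2 goes down
    s2 = take_snapshot("s2", [b1, b2, b3, b4])       # 1000
    s3 = take_snapshot("s3", [b1, b2, b3, b4])       # 1200
    s4 = take_snapshot("s4", [b1, b2, b3, b4])       # 1400
    brick_returns(b2, [s1, s2, s3, s4], replica_of)  # 1530: catch-up snaps, then heal

    for s in (s2, s3, s4):
        print(s.name, s.snap_bricks, "heals:", s.pending_heal)

If I've modelled that right, the three catch-up snaps of b2 only start to differ once each one is healed from its own replica snap brick (s2_b1, s3_b1, s4_b1) - which is the part I had been missing.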

> > > Please let me know if you have more questions or clarifications.
> 

> > > Best Regards,
> 
> > > Rajesh
> 

> > > ----- Original Message -----
> 
> > > From: "Paul Cuzner" <pcuzner at redhat.com>
> 
> > > To: "Rajesh Joseph" <rjoseph at redhat.com>
> 
> > > Cc: "gluster-devel" <gluster-devel at nongnu.org>
> 
> > > Sent: Tuesday, April 8, 2014 8:01:57 AM
> 
> > > Subject: Re: [Gluster-devel] GlusterFS Snapshot internals
> 

> > > Thanks Rajesh.
> 

> > > Let me know if I should raise any RFEs - snap after restore, snapshot
> > > naming, etc.
> 

> > > I'm still being thick about the snapshot process with missing bricks.
> > > What I'm missing is the heal process between snaps - my assumption is
> > > that the snap of a brick needs to be consistent with the other brick
> > > snaps within the same replica set. Let's use a home-drive use case as an
> > > example - typically, I'd expect to see home directories getting snapped
> > > at 0800, 1000, 1200, 1400, 1600, 1800 and 2200 each day. So in that
> > > context, say we have a dist-repl volume with 4 bricks, b1<->b2, b3<->b4;
> 

> > > @ 0800 all bricks are available; snap (S1) succeeds, with a snap volume
> > > created from all bricks
> > > --- files continue to be changed and added
> > > @ 0830 b2 is unavailable (D0). Gluster tracks the pending updates on b1
> > > that need to be applied to b2
> > > --- files continue to be changed and added
> > > @ 1000 snap requested - 3 of 4 bricks available, snap taken (S2) on b1,
> > > b3 and b4 - snap volume activated
> > > --- files continue to change
> > > @ 1200 a further snap performed - S3
> > > --- files continue to change
> > > @ 1400 snapshot S4 taken
> > > --- files change
> > > @ 1530 missing brick b2 comes back online (D1)
> 

> > > Now, between the disruption (D0) and recovery (D1) there have been
> > > several snaps. My understanding is that each snap should provide a view
> > > of the filesystem consistent at the time of the snapshot - correct?
> 

> > > You mention:
> > > + brick2 comes up. At this moment we take a snapshot before we allow new
> > > I/O or heal of the brick. If multiple snaps are missed then all the snaps
> > > are taken at this time. We don't wait till the brick is brought to the
> > > same state as the other bricks.
> > > + brick2_s1 (snap of brick2) will be added to the s1 volume (snapshot
> > > volume). Self heal will take care of bringing brick2's state in line with
> > > the rest of its replica set.
> 

> > > According to this description, if you snapshot b2 as soon as it's back
> > > online - that generates S1, S2 and S3 as at 08:30 - and lets self heal
> > > bring b2 up to the current time D1. However, doesn't this mean that S1,
> > > S2 and S3 on brick2 are not equal to S2, S3, S4 on brick1?

> > > If that is right, then if b1 is unavailable the corresponding snapshots
> > > on b2 wouldn't support the recovery points of 1000, 1200 and 1400 - which
> > > we know are ok on b1.
> 

> > > I guess I'd envisaged snapshots working hand-in-glove with self heal to
> > > maintain snapshot consistency - and may just be stuck on that thought.
> 

> > > Maybe this is something I'll only get on a whiteboard - wouldn't be the
> > > first time :(
> 

> > > I appreciate your patience in explaining this recovery process!
> 

> > > ----- Original Message -----
> 

> > > > From: "Rajesh Joseph" <rjoseph at redhat.com>
> 
> > > > To: "Paul Cuzner" <pcuzner at redhat.com>
> 
> > > > Cc: "gluster-devel" <gluster-devel at nongnu.org>
> 
> > > > Sent: Monday, 7 April, 2014 10:12:53 PM
> 
> > > > Subject: Re: [Gluster-devel] GlusterFS Snapshot internals
> 

> > > > Thanks Paul for your valuable comments. Please find my comments inline
> > > > below.

> > > > Please let us know if you have more questions or clarifications. I will
> > > > try to update the doc wherever more clarity is needed.
> 

> > > > Thanks & Regards,
> 
> > > > Rajesh
> 

> > > > ----- Original Message -----
> 
> > > > From: "Paul Cuzner" <pcuzner at redhat.com>
> 
> > > > To: "Rajesh Joseph" <rjoseph at redhat.com>
> 
> > > > Cc: "gluster-devel" <gluster-devel at nongnu.org>
> 
> > > > Sent: Monday, April 7, 2014 1:59:10 AM
> 
> > > > Subject: Re: [Gluster-devel] GlusterFS Snapshot internals
> 

> > > > Hi Rajesh,
> 

> > > > Thanks for updating the design doc. It reads well.
> 

> > > > I have a number of questions that would help my understanding;
> 

> > > > Logging : The doc doesn't mention how the snapshot process is logged -
> > > > - will snapshot use an existing log or a new log?
> > > > [RJ]: As of now snapshot makes use of the existing logging framework.
> > > > - Will the log be specific to a volume, or will all snapshot activity
> > > > be logged in a single file?
> > > > [RJ]: The snapshot module is embedded in the gluster core framework, so
> > > > the logs will be part of the glusterd logs.
> > > > - will the log be visible on all nodes, or just the originating node?
> > > > [RJ]: Similar to glusterd, the snapshot logs related to each node will
> > > > be visible on that node.
> > > > - will the high-level snapshot action be visible when looking from the
> > > > other nodes, either in the logs or at the cli?
> > > > [RJ]: As of now the high-level snapshot action will be visible only in
> > > > the logs of the originator node, though the cli can be used to see the
> > > > list and info of snapshots from any other node.
> 

> > > > Restore : You mention that after a restore operation, the snapshot will
> > > > be automatically deleted.
> > > > - I don't believe this is a prudent thing to do. Here's an example I've
> > > > seen a lot: an application has a programmatic error, leading to data
> > > > 'corruption' - devs work on the program, storage guys roll the volume
> > > > back. So far so good... devs provide the updated program, and away you
> > > > go... BUT the issue is not resolved, so you need to roll back again to
> > > > the same point in time. If you delete the snap automatically, you lose
> > > > the restore point. Yes, the admin could take another snap after the
> > > > restore - but why add more work into a recovery process where people
> > > > are already stressed out :) I'd recommend leaving the snapshot if
> > > > possible, and letting it age out naturally.
> > > > [RJ]: Snapshot restore is a simple operation wherein the volume bricks
> > > > will simply point to the brick snapshots instead of the original bricks.
> > > > Therefore once the restore is done we cannot use the same snapshot again.
> > > > We are planning to implement a configurable option which will
> > > > automatically take a snapshot of the snapshot to fulfill the above
> > > > requirement, but with the given timeline and resources we will not be
> > > > able to target it in the coming release.
> 
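Just to make sure I follow the restore mechanics, here's a toy model of restore-as-repointing, with the proposed "snapshot of the snapshot" option bolted on. The names and data structures are mine, purely illustrative - this is not the real implementation.

    # Toy model: restore repoints the volume's bricks at the snap LVs and
    # consumes the snapshot, unless a copy is taken first.
    import copy

    volume = {"name": "vol0", "bricks": ["/lv/b1", "/lv/b2"]}
    snapshots = {"s1": {"bricks": ["/lv/s1_b1", "/lv/s1_b2"]}}

    def restore(volume, snapshots, snap_name, preserve=False):
        snap = snapshots.pop(snap_name)            # restore consumes the snapshot...
        if preserve:
            # ...unless we first take a "snapshot of the snapshot", which would
            # keep the restore point around (the configurable option mentioned)
            snapshots[snap_name + "_copy"] = copy.deepcopy(snap)
        volume["bricks"] = snap["bricks"]          # bricks now point at the snap LVs
        return volume

    restore(volume, snapshots, "s1", preserve=True)
    print(volume)
    print(snapshots)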

> > > > Auto-delete : Is this a post phase of the snapshot create, so the
> > > > successful creation of a new snapshot will trigger the pruning of old
> > > > versions?
> > > > [RJ]: Yes, if we reach the snapshot limit for a volume then the snapshot
> > > > create operation will trigger pruning of older snapshots.
> 
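So the pruning is effectively a bounded queue keyed on the per-volume limit - a minimal sketch, assuming a hypothetical SNAP_LIMIT setting:

    # Sketch of pruning hooked onto snapshot create (illustrative only).
    from collections import deque

    SNAP_LIMIT = 3                      # hypothetical per-volume snapshot limit
    snaps = deque()                     # oldest snapshot at the left

    def create_snapshot(name):
        if len(snaps) >= SNAP_LIMIT:    # reaching the limit triggers pruning...
            pruned = snaps.popleft()    # ...of the oldest snapshot
            print("auto-deleted", pruned)
        snaps.append(name)

    for n in ("s1", "s2", "s3", "s4"):
        create_snapshot(n)
    print(list(snaps))                  # -> ['s2', 's3', 's4']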

> > > > Snapshot Naming : The doc states the name is mandatory.
> > > > - why not offer a default - volume_name_timestamp - instead of making
> > > > the caller decide on a name? Having this as a default will also make the
> > > > list under .snap more usable by default.
> > > > - providing a sensible default will make it easier for end users to do
> > > > self-service restore. More sensible defaults = more happy admins :)
> > > > [RJ]: This is a good-to-have feature; we will try to incorporate it in
> > > > the next release.
> 
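For the record, the sort of default I had in mind is nothing more than this (illustrative only, not a proposal for the actual implementation):

    # Default snapshot name of the form volume_name_timestamp.
    from datetime import datetime

    def default_snap_name(volume_name, now=None):
        now = now or datetime.now()
        return "%s_%s" % (volume_name, now.strftime("%Y%m%d%H%M%S"))

    print(default_snap_name("homes"))   # e.g. homes_20140411020000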

> > > > Quorum and snaprestore : the doc mentions that when a returning brick
> > > > comes back, it will be snap'd before pending changes are applied. If I
> > > > understand the use of quorum correctly, can you comment on the following
> > > > scenario;
> > > > - With a brick offline, we'll be tracking changes. Say after 1hr a snap
> > > > is invoked because quorum is met
> > > > - changes continue on the volume for another 15 minutes beyond the snap,
> > > > when the offline brick comes back online.
> > > > - at this point there are two points in time to bring the brick back to
> > > > - the brick needs the changes up to the point of the snap, then a snap
> > > > of the brick followed by the 'replay' of the additional changes to get
> > > > back to the same point in time as the other replicas in the replica set.
> > > > - of course, the brick could be offline for 24 or 48 hours due to a
> > > > hardware fault - during which time multiple snapshots could have been
> > > > made
> > > > - it wasn't clear to me how this scenario is dealt with in the doc?
> 
> > > > [RJ]: The following action is taken in case we miss a snapshot on a
> > > > brick.
> > > > + Let's say brick2 is down while taking snapshot s1.
> > > > + Snapshot s1 will be taken for all the bricks except brick2. We will
> > > > update the bookkeeping about the missed activity.
> > > > + I/O can continue to happen on the origin volume.
> > > > + brick2 comes up. At this moment we take a snapshot before we allow new
> > > > I/O or heal of the brick. If multiple snaps are missed then all the
> > > > snaps are taken at this time. We don't wait till the brick is brought to
> > > > the same state as the other bricks.
> > > > + brick2_s1 (snap of brick2) will be added to the s1 volume (snapshot
> > > > volume). Self heal will take care of bringing brick2's state in line
> > > > with the rest of its replica set.
> 

> > > > barrier : two things are mentioned here - a buffer size and a timeout
> > > > value.
> > > > - from an admin's perspective, being able to specify the timeout (secs)
> > > > is likely to be more workable - and will allow them to align this
> > > > setting with any potential timeout setting within the application
> > > > running against the gluster volume. I don't think most admins will know,
> > > > or want to know, how to size the buffer properly.
> > > > [RJ]: In the current release we are only providing the timeout value as
> > > > a configurable option. Making the buffer size configurable is being
> > > > considered for a future release, or we may determine the optimal value
> > > > ourselves based on the user's system configuration.
> 
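A rough sketch of why the timeout is the friendlier knob - writes are simply queued while the barrier is up and released when it drops or the timeout expires. This is illustrative Python only, not the gluster barrier translator.

    # Toy write barrier with a timeout so a stuck snapshot cannot block
    # application I/O indefinitely.
    import time

    class Barrier:
        def __init__(self, timeout_secs):
            self.timeout = timeout_secs
            self.held = []               # writes queued while the barrier is up
            self.enabled_at = None

        def enable(self):
            self.enabled_at = time.time()

        def submit(self, write_op):
            # queue the write while the barrier is active and within the timeout,
            # otherwise let it straight through
            if self.enabled_at and time.time() - self.enabled_at < self.timeout:
                self.held.append(write_op)
            else:
                self.enabled_at = None
                write_op()

        def release(self):
            self.enabled_at = None
            for op in self.held:
                op()
            self.held = []

    b = Barrier(timeout_secs=120)
    b.enable()
    b.submit(lambda: print("write 1"))   # queued while the snapshot is in progress
    b.release()                          # snapshot done: queued writes are flushed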

> > > > Hopefully the above makes sense.
> 

> > > > Cheers,
> 

> > > > Paul C
> 

> > > > ----- Original Message -----
> 

> > > > > From: "Rajesh Joseph" <rjoseph at redhat.com>
> 
> > > > > To: "gluster-devel" <gluster-devel at nongnu.org>
> 
> > > > > Sent: Wednesday, 2 April, 2014 3:55:28 AM
> 
> > > > > Subject: [Gluster-devel] GlusterFS Snapshot internals
> 

> > > > > Hi all,
> 

> > > > > I have updated the GlusterFS snapshot forge wiki.
> 

> > > > > https://forge.gluster.org/snapshot/pages/Home
> 

> > > > > Please go through it and let me know if you have any questions or
> > > > > queries.
> 

> > > > > Best Regards,
> 
> > > > > Rajesh
> 

> > > > > [PS]: Please ignore previous mail. Accidentally hit send before
> > > > > completing :)
> 

> 

> _______________________________________________
> Gluster-devel mailing list
> Gluster-devel at nongnu.org
> https://lists.nongnu.org/mailman/listinfo/gluster-devel

