[Gluster-devel] GlusterFS Snapshot internals

Paul Cuzner pcuzner at redhat.com
Wed Apr 9 01:50:07 UTC 2014


Thanks again, Rajesh. 

----- Original Message -----

> From: "Rajesh Joseph" <rjoseph at redhat.com>
> To: "Paul Cuzner" <pcuzner at redhat.com>
> Cc: "gluster-devel" <gluster-devel at nongnu.org>
> Sent: Wednesday, 9 April, 2014 12:04:35 AM
> Subject: Re: [Gluster-devel] GlusterFS Snapshot internals

> Hi Paul,

> Whenever a brick comes online it performs a handshake with glusterd. The
> brick will not send a notification to clients until the handshake is done.
> We are planning to extend this handshake so that any missed snapshots are
> recreated at this point.
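
> For illustration only, the intended flow could be sketched roughly with the
> Python model below. The names are invented for the example - the actual
> implementation lives in glusterd and is written in C - so treat this as a
> sketch of the ordering, not real code.

> class Brick:
>     """Toy model of a brick that tracks which snapshots it holds."""
>     def __init__(self, name):
>         self.name = name
>         self.snapshots = set()
>         self.online = False
>
> def on_brick_connect(brick, expected_snapshots, notify_clients):
>     # Handshake with glusterd: compare the brick's snapshots against what
>     # the volume expects, and recreate anything missed while it was down.
>     missed = expected_snapshots - brick.snapshots
>     for snap in sorted(missed):
>         brick.snapshots.add(snap)   # stands in for taking the snapshot
>     # Only after the handshake (and the planned snap recreation) is the
>     # brick marked online and advertised to clients.
>     brick.online = True
>     notify_clients(brick)
>
> b2 = Brick("b2")
> on_brick_connect(b2, {"S2", "S3", "S4"},
>                  lambda b: print(b.name, "online with", sorted(b.snapshots)))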

> Best Regards,
> Rajesh

> ----- Original Message -----
> From: "Paul Cuzner" <pcuzner at redhat.com>
> To: "Rajesh Joseph" <rjoseph at redhat.com>
> Cc: "gluster-devel" <gluster-devel at nongnu.org>
> Sent: Tuesday, April 8, 2014 12:49:13 PM
> Subject: Re: [Gluster-devel] GlusterFS Snapshot internals

> Rajesh,

> Perfect explanation - the 'penny has dropped'. I was missing that the
> healing of the snap is based on the snap from the replica.

> One final question - I assume the scenario you mention about the brick coming
> back online before the snapshots are taken is theoretical, and that there are
> safeguards in place to prevent this from happening?

> BTW, I'll get the BZ RFE's in by the end of my week, and will post the BZ's
> back to the list for info.

> Thanks!

> PC

> ----- Original Message -----

> > From: "Rajesh Joseph" <rjoseph at redhat.com>
> > To: "Paul Cuzner" <pcuzner at redhat.com>
> > Cc: "gluster-devel" <gluster-devel at nongnu.org>
> > Sent: Tuesday, 8 April, 2014 5:09:10 PM
> > Subject: Re: [Gluster-devel] GlusterFS Snapshot internals

> > Hi Paul,

> > It would be great if you could raise RFEs for both snap-after-restore and
> > snapshot naming.

> > Let's say your volume "Vol" has bricks b1, b2, b3 and b4.

> > @0800 - S1 (snapshot volume) -> s1_b1, s1_b2, s1_b3, s1_b4 (These are
> > respective snap bricks which are on independent thin LVs)

> > @0830 - b2 went down

> > @1000 - S2 (snapshot volume) -> s2_b1, x, s2_b3, s2_b4. Here we mark that
> > the brick has a pending snapshot. Note that s2_b1 will have all the changes
> > missed by b2 up to 1000 hours. AFR will mark the pending changes on s2_b1.

> > @1200 - S3 (Snapshot volume) -> s3_b1, x, s3_b3, s3_b4. This missed
> > snapshot
> > is also recorded.

> > @1400 - S4 (Snapshot volume) -> s4_b1, x, s4_b3, s4_b4. This missed
> > snapshot
> > is also recorded.

> > @1530 - b2 comes back. Before bringing it online we take snapshots s2_b2,
> > s3_b2 and s4_b2. Since all three of these snapshots are taken at nearly the
> > same time, content-wise they will all be in the same state. These bricks
> > are then added to their respective snapshot volumes. Note that up to this
> > point no healing has been done. After the addition the snapshot volumes
> > will look like this:
> > S2 -> s2_b1, s2_b2, s2_b3, s2_b4.
> > S3 -> s3_b1, s3_b2, s3_b3, s3_b4.
> > S4 -> s4_b1, s4_b2, s4_b3, s4_b4.
> > After this b2 will come online, i.e. clients can access this brick. Now S2,
> > S3 and S4 are healed: s2_b2 is healed from s2_b1, s3_b2 from s3_b1, and so
> > on. This healing brings s2_b2 to the state of the volume at the time
> > snapshot S2 was taken.

> > If the brick comes online before these snapshots are taken, self heal will
> > try to bring the brick (b2) to a point closer to the current time (@1530).
> > Therefore it would not be consistent with the other bricks in its replica
> > set.
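
> > If it helps, the ordering can be sketched roughly in Python as below. This
> > is purely an illustration of the bookkeeping and heal order described
> > above; the data structures and names are invented for the example and are
> > not the real implementation.

> > pending = []        # names of snapshots missed by b2 while it was down
> > snaps = {}          # snapshot name -> {brick: snap brick}
> >
> > def take_snapshot(name, bricks_up):
> >     snaps[name] = {b: f"{name.lower()}_{b}" for b in bricks_up}
> >     if "b2" not in bricks_up:
> >         pending.append(name)             # bookkeeping: b2 missed this snap
> >
> > take_snapshot("S2", ["b1", "b3", "b4"])  # @1000, b2 down
> > take_snapshot("S3", ["b1", "b3", "b4"])  # @1200
> > take_snapshot("S4", ["b1", "b3", "b4"])  # @1400
> >
> > # @1530: b2 returns. All missed snaps are taken *before* b2 goes online or
> > # heals, so s2_b2, s3_b2 and s4_b2 all capture b2 as it was at 0830.
> > for name in pending:
> >     snaps[name]["b2"] = f"{name.lower()}_b2"
> >
> > # Only then does b2 come online; AFR heals each snap brick from its
> > # replica: s2_b2 from s2_b1 (state @1000), s3_b2 from s3_b1 (@1200), etc.
> > for name in pending:
> >     print(f"heal {name.lower()}_b2 from {name.lower()}_b1")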

> > Please let me know if you have more questions or clarifications.

> > Best Regards,
> > Rajesh

> > ----- Original Message -----
> > From: "Paul Cuzner" <pcuzner at redhat.com>
> > To: "Rajesh Joseph" <rjoseph at redhat.com>
> > Cc: "gluster-devel" <gluster-devel at nongnu.org>
> > Sent: Tuesday, April 8, 2014 8:01:57 AM
> > Subject: Re: [Gluster-devel] GlusterFS Snapshot internals

> > Thanks Rajesh.

> > Let me know if I should raise any RFE's - snap after restore, snapshot
> > naming, etc.

> > I'm still being thick about the snapshot process with missing bricks. What
> > I'm missing is the heal process between snaps - my assumption is that the
> > snap of a brick needs to be consistent with the other brick snaps within the
> > same replica set. Let's use a home-drive use case as an example - typically,
> > I'd expect to see home directories getting snapped at 0800, 1000, 1200,
> > 1400, 1600, 1800 and 2200 each day. So in that context, say we have a
> > dist-repl volume with 4 bricks, b1<->b2, b3<->b4;

> > @ 0800 all bricks are available, snap (S1) succeeds with a snap volume
> > being
> > created from all bricks
> > --- files continue to be changed and added
> > @ 0830 b2 is unavailable (D0). Gluster tracks the pending updates on b1
> > that need to be applied to b2
> > --- files continue to be changed and added.
> > @ 1000 snap requested - 3 of 4 bricks available, snap taken (S2) on b1, b3
> > and b4 - snapvolume activated
> > --- files continue to change
> > @ 1200 a further snap performed - S3
> > --- files continue to change
> > @ 1400 snapshot S4 taken
> > --- files change
> > @ 1530 missing brick 2 comes back online (D1)

> > Now between disruption of D0 and D1 there have been several snaps. My
> > understanding is that each snap should provide a view of the filesystem
> > consistent at the time of the snapshot - correct?

> > You mention
> > + brick2 comes up. At this moment we take a snapshot before we allow new
> > I/O or heal of the brick. If multiple snaps were missed then all the snaps
> > are taken at this time. We don't wait till the brick is brought to the same
> > state as the other bricks.
> > + brick2_s1 (snap of brick2) will be added to the s1 volume (snapshot
> > volume). Self heal will take care of bringing brick2's state in line with
> > the rest of its replica set.

> > According to this description, if you snapshot b2 as soon as it's back
> > online - which generates b2's copies of S2, S3 and S4 as they stood at
> > 08:30 - and then let self heal bring b2 up to the current time D1, doesn't
> > this mean that S2, S3 and S4 on brick2 are not equal to S2, S3 and S4 on
> > brick1?

> > If that is right, then if b1 is unavailable the corresponding snapshots on
> > b2 wouldn't support the recovery points of 1000, 1200 and 1400 - which we
> > know are ok on b1.

> > I guess I'd envisaged snapshots working hand-in-glove with self heal to
> > maintain the snapshot consistency - and may just be stuck on that thought.

> > Maybe this is something I'll only get on a whiteboard - wouldn't be the
> > first time :(

> > I appreciate your patience in explaining this recovery process!

> > ----- Original Message -----

> > > From: "Rajesh Joseph" <rjoseph at redhat.com>
> > > To: "Paul Cuzner" <pcuzner at redhat.com>
> > > Cc: "gluster-devel" <gluster-devel at nongnu.org>
> > > Sent: Monday, 7 April, 2014 10:12:53 PM
> > > Subject: Re: [Gluster-devel] GlusterFS Snapshot internals

> > > Thanks Paul for your valuable comments. Please find my comments in-lined
> > > below.

> > > Please let us know if you have more questions or clarifications. I will
> > > try to update the doc wherever more clarity is needed.

> > > Thanks & Regards,
> > > Rajesh

> > > ----- Original Message -----
> > > From: "Paul Cuzner" <pcuzner at redhat.com>
> > > To: "Rajesh Joseph" <rjoseph at redhat.com>
> > > Cc: "gluster-devel" <gluster-devel at nongnu.org>
> > > Sent: Monday, April 7, 2014 1:59:10 AM
> > > Subject: Re: [Gluster-devel] GlusterFS Snapshot internals

> > > Hi Rajesh,

> > > Thanks for updating the design doc. It reads well.

> > > I have a number of questions that would help my understanding;

> > > Logging : The doc doesn't mention how the snapshot process is logged -
> > > - will snapshot use an existing log or a new log?
> > > [RJ]: As of now snapshot makes use of the existing logging framework.
> > > - Will the log be specific to a volume, or will all snapshot activity be
> > > logged in a single file?
> > > [RJ]: The snapshot module is embedded in the gluster core framework.
> > > Therefore the logs will also be part of the glusterd logs.
> > > - will the log be visible on all nodes, or just the originating node?
> > > [RJ]: As with glusterd, the snapshot logs related to each node will be
> > > visible on that node.
> > > - will the high-level snapshot action be visible when looking from the
> > > other nodes, either in the logs or at the CLI?
> > > [RJ]: As of now the high-level snapshot action will be visible only in the
> > > logs of the originator node, though the CLI can be used to see the list
> > > and info of snapshots from any other node.

> > > Restore : You mention that after a restore operation, the snapshot will
> > > be automatically deleted.
> > > - I don't believe this is a prudent thing to do. Here's an example I've
> > > seen a lot: an application has a programmatic error, leading to data
> > > 'corruption' - devs work on the program, storage guys roll the volume
> > > back. So far so good... devs provide the updated program, and away you
> > > go... BUT the issue is not resolved, so you need to roll back again to the
> > > same point in time. If you delete the snap automatically, you lose the
> > > restore point. Yes, the admin could take another snap after the restore -
> > > but why add more work to a recovery process where people are already
> > > stressed out :) I'd recommend leaving the snapshot if possible, and
> > > letting it age out naturally.
> > > [RJ]: Snapshot restore is a simple operation wherein the volume bricks
> > > will simply point to the brick snapshots instead of the original bricks.
> > > Therefore once the restore is done we cannot use the same snapshot again.
> > > We are planning to implement a configurable option which will
> > > automatically take a snapshot of the snapshot to fulfill the
> > > above-mentioned requirement, but with the given timeline and resources we
> > > will not be able to target it in the coming release.
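
> > > For what it's worth, the reason the snapshot cannot be reused can be
> > > sketched in a few lines. This only illustrates the re-pointing with
> > > invented structures; it is not the actual restore code.

> > > # Restore re-points the volume's bricks at the snapshot's thin-LV bricks,
> > > # so after a restore there is no separate snapshot left to restore from.
> > > volume = {"name": "Vol", "bricks": ["/lv/b1", "/lv/b2"]}
> > > snap_s1 = {"name": "S1", "bricks": ["/lv/s1_b1", "/lv/s1_b2"]}
> > >
> > > def restore(vol, snap):
> > >     vol["bricks"] = snap["bricks"]   # snap bricks become the volume bricks
> > >     snap["bricks"] = None            # the snapshot is consumed by the restore
> > >
> > > restore(volume, snap_s1)
> > > print(volume)   # the volume now runs directly on the former snap bricks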

> > > Auto-delete : Is this a post phase of the snapshot create, so that the
> > > successful creation of a new snapshot will trigger the pruning of old
> > > versions?
> > > [RJ] Yes, if we reach the snapshot limit for a volume then the snapshot
> > > create operation will trigger pruning of older snapshots.
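
> > > As a rough illustration of that trigger (the limit and names here are
> > > examples only, not actual defaults):

> > > from collections import deque
> > >
> > > SNAP_LIMIT = 4                               # example per-volume limit
> > > snapshots = deque(["S1", "S2", "S3", "S4"])  # oldest first
> > >
> > > def snapshot_create(name):
> > >     # Creating a new snapshot prunes the oldest ones once the limit is hit.
> > >     while len(snapshots) >= SNAP_LIMIT:
> > >         print("auto-deleting", snapshots.popleft())
> > >     snapshots.append(name)
> > >
> > > snapshot_create("S5")
> > > print(list(snapshots))                       # ['S2', 'S3', 'S4', 'S5']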

> > > Snapshot Naming : The doc states the name is mandatory.
> > > - why not offer a default - volume_name_timestamp - instead of making the
> > > caller decide on a name. Having this as a default will also make the list
> > > under .snap more usable by default.
> > > - providing a sensible default will make it easier for end users to do
> > > self-service restores. More sensible defaults = more happy admins :)
> > > [RJ]: This is a good-to-have feature; we will try to incorporate it in the
> > > next release.
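
> > > For example, the suggested default could be generated along these lines
> > > (the exact format is only an illustration, nothing here is decided):

> > > from datetime import datetime
> > >
> > > def default_snap_name(volume, now=None):
> > >     # volume_name_timestamp, e.g. homedirs_20140408_100000
> > >     now = now or datetime.now()
> > >     return f"{volume}_{now.strftime('%Y%m%d_%H%M%S')}"
> > >
> > > print(default_snap_name("homedirs"))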

> > > Quorum and snaprestore : the doc mentions that when a returning brick
> > > comes
> > > back, it will be snap'd before pending changes are applied. If I
> > > understand
> > > the use of quorum correctly, can you comment on the following scenario;
> > > - With a brick offline, we'll be tracking changes. Say after 1hr a snap
> > > is
> > > invoked because quorum is met
> > > - changes continue on the volume for another 15 minutes beyond the snap,
> > > when
> > > the offline brick comes back online.
> > > - at this point there are two points in time to bring the brick back to -
> > > the brick needs the changes up to the point of the snap, then a snap of
> > > the brick, followed by the 'replay' of the additional changes to get back
> > > to the same point in time as the other replicas in the replica set.
> > > - of course, the brick could be offline for 24 or 48 hours due to a
> > > hardware
> > > fault - during which time multiple snapshots could have been made
> > > - it wasn't clear to me how this scenario is dealt with from the doc?
> > > [RJ]: The following action is taken in case we miss a snapshot on a brick.
> > > + Let's say brick2 is down while taking snapshot s1.
> > > + Snapshot s1 will be taken for all the bricks except brick2. We will
> > > update the bookkeeping to record the missed snapshot.
> > > + I/O can continue to happen on the origin volume.
> > > + brick2 comes up. At this moment we take a snapshot before we allow new
> > > I/O or heal of the brick. If multiple snaps were missed then all the snaps
> > > are taken at this time. We don't wait till the brick is brought to the
> > > same state as the other bricks.
> > > + brick2_s1 (snap of brick2) will be added to the s1 volume (snapshot
> > > volume). Self heal will take care of bringing brick2's state in line with
> > > the rest of its replica set.

> > > barrier : two things are mentioned here - a buffer size and a timeout
> > > value.
> > > - from an admin's perspective, being able to specify the timeout (secs) is
> > > likely to be more workable - and will allow them to align this setting
> > > with any potential timeout setting within the application running against
> > > the gluster volume. I don't think most admins will know, or want to know,
> > > how to size the buffer properly.
> > > [RJ]: In the current release we are only providing the timeout value as a
> > > configurable option. The buffer size is being considered as a configurable
> > > option for a future release, or we may determine the optimal value
> > > ourselves based on the user's system configuration.
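
> > > Conceptually the barrier with a timeout behaves something like the sketch
> > > below. This is illustrative only - the real barrier is a translator
> > > written in C, and the names and values here are invented.

> > > import queue
> > > import threading
> > >
> > > class Barrier:
> > >     """Queue incoming write FOPs while a snapshot is in progress, but
> > >     never hold them longer than the configured timeout."""
> > >     def __init__(self, timeout_secs):
> > >         self.timeout = timeout_secs
> > >         self.held = queue.Queue()
> > >         self.enabled = False
> > >
> > >     def enable(self):
> > >         self.enabled = True
> > >         # Safety valve: release automatically when the timeout expires.
> > >         threading.Timer(self.timeout, self.release).start()
> > >
> > >     def submit(self, fop):
> > >         if self.enabled:
> > >             self.held.put(fop)       # buffer the write during the snap
> > >         else:
> > >             fop()                    # pass straight through
> > >
> > >     def release(self):
> > >         self.enabled = False
> > >         while not self.held.empty():
> > >             self.held.get()()        # replay the buffered writes
> > >
> > > b = Barrier(timeout_secs=2)
> > > b.enable()
> > > b.submit(lambda: print("write held, released on snap end or timeout"))
> > > b.release()                          # snapshot finished before the timeout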

> > > Hopefully the above makes sense.

> > > Cheers,

> > > Paul C

> > > ----- Original Message -----

> > > > From: "Rajesh Joseph" <rjoseph at redhat.com>
> > > > To: "gluster-devel" <gluster-devel at nongnu.org>
> > > > Sent: Wednesday, 2 April, 2014 3:55:28 AM
> > > > Subject: [Gluster-devel] GlusterFS Snapshot internals

> > > > Hi all,

> > > > I have updated the GlusterFS snapshot forge wiki.

> > > > https://forge.gluster.org/snapshot/pages/Home

> > > > Please go through it and let me know if you have any questions or
> > > > queries.

> > > > Best Regards,
> > > > Rajesh

> > > > [PS]: Please ignore previous mail. Accidentally hit send before
> > > > completing
> > > > :)

> > > > _______________________________________________
> > > > Gluster-devel mailing list
> > > > Gluster-devel at nongnu.org
> > > > https://lists.nongnu.org/mailman/listinfo/gluster-devel