[Gluster-devel] Snapshot design for glusterfs volumes

Shishir Gowda sgowda at redhat.com
Wed Aug 7 05:07:14 UTC 2013


----- Original Message -----

> From: "Brian Foster" <bfoster at redhat.com>
> To: "Shishir Gowda" <sgowda at redhat.com>
> Cc: gluster-devel at nongnu.org
> Sent: Wednesday, August 7, 2013 3:11:30 AM
> Subject: Re: [Gluster-devel] Snapshot design for glusterfs volumes

> On 08/06/2013 12:16 AM, Shishir Gowda wrote:
> > Hi Brian,
> >
> > - A barrier is similar to a throttling mechanism. All it does is queue up
> > the callbacks at the server xlator.
> > Once barrier'ing is done, it just starts unwinding, so that clients can now
> > get the response.
> > The idea is that if an application does not get an acknowledgement back for
> > the fops, it will block for some time,
> > hence effectively throttling itself.
> >

> Ok, but why the need to stop unwinds? Isn't it just as effective to pend
> the next wind from said process?

Barrier'ing the next winds faces a few problems: 
1. In anonymous-fd based ops, we wouldn't be able to identify the right fd. We want to barrier only the fsync call, or fds opened with O_DIRECT | O_SYNC. 
2. If writev calls are barrier'ed, then their buffers also have to be saved, consuming additional space. 
3. By barrier'ing the fsync response, we in effect do not allow clients to continue writing. If we ack'ed the fsync, parallel writev calls might come through. (A rough sketch of the unwind-side barrier follows below.) 
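
To make the unwind-side barrier concrete, here is a minimal C sketch of the idea. All of the names here (barrier_conf_t, queued_cbk_t, the helpers) are hypothetical and are not the real server xlator code; the only point is that the fsync acknowledgement is parked on a list while the barrier is on, and unwound once the snapshot step completes.

/* Hypothetical sketch: queue fsync acknowledgements (unwinds) while the
 * barrier is enabled, and flush them once the snapshot step completes.
 * None of these names are from the real server xlator. */
#include <pthread.h>
#include <stdlib.h>

typedef struct queued_cbk {
    void (*unwind)(void *frame);   /* deferred acknowledgement to the client */
    void *frame;                   /* pending call frame */
    struct queued_cbk *next;
} queued_cbk_t;

typedef struct barrier_conf {
    pthread_mutex_t lock;
    int enabled;                   /* set during stage-1 prepare, cleared after the snap */
    queued_cbk_t *head, *tail;     /* queued fsync/O_SYNC acknowledgements */
} barrier_conf_t;

/* Called from the fsync callback path: either unwind now or park the ack. */
void barrier_fsync_cbk(barrier_conf_t *conf, void (*unwind)(void *), void *frame)
{
    pthread_mutex_lock(&conf->lock);
    if (conf->enabled) {
        queued_cbk_t *q = calloc(1, sizeof(*q));
        q->unwind = unwind;
        q->frame = frame;
        if (conf->tail)
            conf->tail->next = q;
        else
            conf->head = q;
        conf->tail = q;
        pthread_mutex_unlock(&conf->lock);
        return;                    /* client blocks waiting for the ack -> throttles itself */
    }
    pthread_mutex_unlock(&conf->lock);
    unwind(frame);                 /* barrier off: acknowledge immediately */
}

/* Called once the snapshot has been taken: release everything queued. */
void barrier_disable_and_flush(barrier_conf_t *conf)
{
    pthread_mutex_lock(&conf->lock);
    conf->enabled = 0;
    queued_cbk_t *q = conf->head;
    conf->head = conf->tail = NULL;
    pthread_mutex_unlock(&conf->lock);
    while (q) {
        queued_cbk_t *next = q->next;
        q->unwind(q->frame);       /* late acknowledgement back to the client */
        free(q);
        q = next;
    }
}

Because the application does not see the ack until the flush, it naturally stops issuing further synchronous writes, which is the self-throttling effect described above.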

> Maybe I missed this from the doc, but it's actually a mechanism _in_ the
> server translator, or an independent translator? I ask because it sounds
> like the latter could potentially provide a general throttle mechanism
> that could be more broadly useful (over time, of course ;) ).

We plan to introduce this change in the server xlator, as it is the first xlator able to identify the fop. 
The same mechanism could be used for throttling in the future, as the lists of barrier'ed fops would be configurable, and fops could be added or removed (see the sketch below). 
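
As a rough illustration of how the configurable fop list could look (again hypothetical names, not the actual volume-option handling), the set of barrier'ed fops could be parsed from a comma-separated option so entries can be added or removed later for throttling use cases:

/* Hypothetical sketch of a configurable "which fops get barrier'ed" list,
 * parsed from a comma-separated option such as "barrier.fops=fsync,flush".
 * Not the actual option-handling code. */
#include <stdio.h>
#include <string.h>

enum fop { FOP_WRITEV, FOP_FSYNC, FOP_FLUSH, FOP_MAX };

static const char *fop_names[FOP_MAX] = { "writev", "fsync", "flush" };

/* One flag per fop: 1 means its acknowledgement is held while the barrier is on. */
static int barrier_fops[FOP_MAX];

void barrier_parse_fop_list(const char *option)
{
    char buf[256], *tok;
    memset(barrier_fops, 0, sizeof(barrier_fops));
    snprintf(buf, sizeof(buf), "%s", option);
    for (tok = strtok(buf, ","); tok; tok = strtok(NULL, ","))
        for (int i = 0; i < FOP_MAX; i++)
            if (strcmp(tok, fop_names[i]) == 0)
                barrier_fops[i] = 1;
}

int barrier_should_hold(enum fop f)
{
    return barrier_fops[f];
}

int main(void)
{
    barrier_parse_fop_list("fsync,flush");
    printf("hold fsync ack:  %d\n", barrier_should_hold(FOP_FSYNC));  /* 1 */
    printf("hold writev ack: %d\n", barrier_should_hold(FOP_WRITEV)); /* 0 */
    return 0;
}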

> > - Snapshot here guarantees a snap of whatever has been committed onto the
> > disk.
> > So, in effect, every internal operation (afr/dht...) should/will have to be
> > able to heal itself once
> > the volume restore takes place.
> >

> Given the explanation above to consider the barrier translator as a
> "throttle," then I suspect its primary purpose is for performance
> reasons as opposed to purely functional reasons (i.e., make sure the
> snap operation occurs in a timely fashion)? My inclination when reading
> the document was to consider the barrier mechanism as effectively the
> quiesce portion of the typical snapshot process.

A quiesce xlator (features/quiesce) already exists, but we do not want all fops to be quiesced. That would lead to applications 
seeing time-outs, and maintaining the queue would turn out to be a nightmare in I/O-intensive workloads. 

> From a black box perspective, it seems a little strange to me that a
> built-in snapshot mechanism wouldn't be coherent with internal
> operations (though from a complexity standpoint I can understand why
> that's put off). Have there been any considerations to try and solve that
> problem?
> That aside and assuming the current model, 1.) is there any assessment
> for the likelihood of that kind of situation assuming a user follows the
> expected process? and 2.) has the effect of that been measured on the
> snapshot mechanism?

Completely concur with you here. We have considered a couple of approaches: 
1. Any xlator on the server side will be made aware of a pending snap, giving it time to do its housekeeping (a sketch of such a notification hook follows the list). 
2. Doing the above for client-side xlators is complex, and is not targeted for now. But if, in the future, we have more control over the number/location of the clients, we should be able to handle that too. 
3. Currently most of the xlator ops are recoverable (except the dht rename of a directory), so at any given point in time a snap should heal itself. 
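
A purely illustrative sketch of approach 1 (none of these names correspond to the real xlator notify interface): each server-side xlator registers a pre-snapshot hook, and snapshot management invokes all the hooks during stage-1 prepare so each xlator can finish its housekeeping before the brick is snapped.

/* Hypothetical sketch of approach 1: a pre-snapshot notification that each
 * server-side xlator can subscribe to in order to do its housekeeping
 * (flush caches, complete in-flight transactions) before the snap is taken.
 * These names are illustrative only. */
#include <stdio.h>

#define MAX_HOOKS 16

typedef int (*snap_prepare_hook_t)(void *xlator_private);

static struct {
    const char *name;
    snap_prepare_hook_t hook;
    void *priv;
} hooks[MAX_HOOKS];
static int nhooks;

/* Each xlator registers its housekeeping callback at init time. */
int snap_register_prepare_hook(const char *name, snap_prepare_hook_t hook, void *priv)
{
    if (nhooks == MAX_HOOKS)
        return -1;
    hooks[nhooks].name = name;
    hooks[nhooks].hook = hook;
    hooks[nhooks].priv = priv;
    nhooks++;
    return 0;
}

/* Snapshot management calls this during stage-1 prepare, before the barrier
 * is released; a non-zero return from any hook would abort the snap. */
int snap_notify_prepare(void)
{
    for (int i = 0; i < nhooks; i++) {
        int ret = hooks[i].hook(hooks[i].priv);
        if (ret != 0) {
            fprintf(stderr, "xlator %s not ready for snapshot\n", hooks[i].name);
            return ret;
        }
    }
    return 0;
}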

> It's been a while since I've played with lvm snapshots and I suppose the
> latest technology does away with the old per-snap exception store. Even
> still, it seems like self-heals running across sets of large,
> inconsistent vm image files (and potentially copying them from one side
> to another) could eat a ton of resource (space and/or cpu), no?

This would kick in only when a snapped volume is restored and started; in that case we have to heal. 
Alternatively, when the snap is mounted read-only, we could skip healing and just make it read from the right copy. 
One way to ease the problem here is to make sure a snap is taken only if all the bricks are up. That way self-heal might (though it is not guaranteed to) have already completed. A trivial sketch of that precondition is below. 
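
The "only snap when all bricks are up" precondition could be expressed as simply as the following (brick_status_t and the status values are made up for illustration, not real glusterd structures):

/* Hypothetical sketch of the "take a snap only if every brick is up"
 * precondition mentioned above. */
#include <stdio.h>

typedef enum { BRICK_DOWN = 0, BRICK_UP = 1 } brick_status_t;

/* Returns 1 if the snapshot may proceed, 0 otherwise. */
int snap_precheck_all_bricks_up(const brick_status_t *bricks, int nbricks)
{
    for (int i = 0; i < nbricks; i++) {
        if (bricks[i] != BRICK_UP) {
            fprintf(stderr, "brick %d is down; refusing to snapshot\n", i);
            return 0;
        }
    }
    return 1;
}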

> Brian

> > With regards,
> > Shishir
> >
> > ----- Original Message -----

> > From: "Brian Foster" <bfoster at redhat.com>
> > To: "Shishir Gowda" <sgowda at redhat.com>
> > Cc: gluster-devel at nongnu.org
> > Sent: Monday, August 5, 2013 6:11:47 PM
> > Subject: Re: [Gluster-devel] Snapshot design for glusterfs volumes
> >
> > On 08/02/2013 02:26 AM, Shishir Gowda wrote:
> >> Hi All,
> >>
> >> We propose to implement snapshot support for glusterfs volumes in
> >> release-3.6.
> >>
> >> Attaching the design document in the mail thread.
> >>
> >> Please feel free to comment/critique.
> >>
> >
> > Hi Shishir,
> >
> > Thanks for posting this. A couple questions:
> >
> > - The stage-1 prepare section suggests that operations are blocked
> > (barrier) in the callback, but later on in the doc it indicates incoming
> > operations would be held up. Does barrier block winds and unwinds, or
> > just winds? Could you elaborate on the logic there?
> >
> > - This is kind of called out in the open issues section with regard to
> > write-behind, but don't we require some kind of operational coherency
> > with regard to cluster translator operations? Is it expected that a
> > snapshot across a cluster of bricks might not be coherent with regard to
> > active afr transactions (and thus potentially require a heal in the
> > snap), for example?
> >
> > Brian
> >
> >> With regards,
> >> Shishir
> >>
> >>
> >>
> >> _______________________________________________
> >> Gluster-devel mailing list
> >> Gluster-devel at nongnu.org
> >> https://lists.nongnu.org/mailman/listinfo/gluster-devel
> >>
> >
> >