[Gluster-devel] Snapshot design for glusterfs volumes
Brian Foster
bfoster at redhat.com
Wed Aug 7 22:09:45 UTC 2013
On 08/07/2013 01:07 AM, Shishir Gowda wrote:
> ----- Original Message -----
>
>> From: "Brian Foster" <bfoster at redhat.com>
>> To: "Shishir Gowda" <sgowda at redhat.com>
>> Cc: gluster-devel at nongnu.org
>> Sent: Wednesday, August 7, 2013 3:11:30 AM
>> Subject: Re: [Gluster-devel] Snapshot design for glusterfs volumes
>
>> On 08/06/2013 12:16 AM, Shishir Gowda wrote:
>>> Hi Brian,
>>>
>>> - A barrier is similar to a throttling mechanism. All it does is queue up
>>> the call_backs at the server xlator.
>>> Once barrier'ing is done, it just starts unwinding, so that clients can now
>>> get the response.
>>> The idea is that if an application does not get an acknowledgement back for
>>> the fops, it will block for some time,
>>> hence effectively throttling itself.
>>>
>
>> Ok, but why the need to stop unwinds? Isn't it just as effective to pend
>> the next wind from said process?
>
> Barrier'ing the next winds faces a few problems:
> 1. In anonymous fd-based ops, we wouldn't be able to identify the right fd. We want to barrier only the fsync call, or fds opened with O_DIRECT | O_SYNC.
> 2. If writevs are barrier'ed, then the buffers also have to be saved, consuming additional space.
> 3. By barrier'ing the fsync response, we in effect do not allow the clients to continue writing. If we ack'ed the fsync, then parallel writevs might come through.
>
If you don't have the information required to make the blocking decision
until the unwind (#1), then I suppose that makes sense. #3 seems like it
would hold just the same if you blocked on the wind (why would you fsync
and then write? or if you did, why wouldn't you expect another fsync?).
Perhaps I'm still missing something, but the larger point is we probably
shouldn't rely too much on expectations of application behavior. ;)
>> Maybe I missed this from the doc, but is it actually a mechanism _in_ the
>> server translator, or an independent translator? I ask because it sounds
>> like the latter could potentially provide a general throttle mechanism
>> that could be more broadly useful (over time, of course ;) ).
>
> We plan to introduce this change in the server xlator, as this would be the first xlator to be able to identify the fop.
> The same could be used in the future for throttling too, as the list would be configurable, and fops could be added/removed.
>
Any reason we couldn't put a throttle translator right after
protocol/server, even if just for modularity?
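E.g., something like this hypothetical brick volfile fragment (a
features/throttle translator doesn't exist today, and the volume names
are made up; the rest of the brick graph would hang below it as usual):

volume test-throttle
    type features/throttle
    subvolumes test-io-threads
end-volume

volume test-server
    type protocol/server
    option transport-type tcp
    subvolumes test-throttle
end-volume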
>>> - Snapshot here guarantees a snap of whatever has been committed onto the
>>> disk.
>>> So, in effect every internal operation (afr/dht...) should/will have to be
>>> able to heal itself once
>>> the volume restore takes place.
>>>
>
>> Given the explanation above to consider the barrier translator as a
>> "throttle," then I suspect its primary purpose is for performance
>> reasons as opposed to purely functional reasons (i.e., make sure the
>> snap operation occurs in a timely fashion)? My inclination when reading
>> the document was to consider the barrier mechanism as effectively the
>> quiesce portion of the typical snapshot process.
>
> Quiesce xlator (features/quiesce) already exists. We do not want all fops to be quiesced. That would lead to applications
> seeing time-outs, and maintaining the queue would turn out to be a nightmare in I/O-intensive workloads.
>
Oh, interesting. I wasn't aware of that. I'm not following the concern
over quiesce though. That's precisely what the local filesystem is
going to do when you freeze (though it's not so much a queue as just
blocking all writing threads).
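For reference, the freeze I have in mind is just the FIFREEZE/FITHAW
ioctl pair on the brick filesystem (what fsfreeze(8) does), roughly:

#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/fs.h>

int main(int argc, char **argv)
{
        int fd;

        if (argc < 2) {
                fprintf(stderr, "usage: %s <mountpoint>\n", argv[0]);
                return 1;
        }

        /* needs root; fd is just a handle on the mounted fs */
        fd = open(argv[1], O_RDONLY);
        if (fd < 0) {
                perror("open");
                return 1;
        }

        if (ioctl(fd, FIFREEZE, 0) < 0) {   /* writers now block in the kernel */
                perror("FIFREEZE");
                close(fd);
                return 1;
        }

        /* ... take the lvm snapshot of the brick here ... */

        if (ioctl(fd, FITHAW, 0) < 0)       /* unblock the writers */
                perror("FITHAW");

        close(fd);
        return 0;
}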
Isn't there also the assumption that for the snap to be useful, the user
would have to pause or stop things like active VMs anyway?
>> From a black box perspective, it seems a little strange to me that a
>> built-in snapshot mechanism wouldn't be coherent with internal
>> operations (though from a complexity standpoint I can understand why
>> that's put off). Has there been any consideration of trying to solve that
>> problem?
>> That aside and assuming the current model, 1.) is there any assessment
>> for the likelihood of that kind of situation assuming a user follows the
>> expected process? and 2.) has the effect of that been measured on the
>> snapshot mechanism?
>
> Completely concur with you here. We have considered a couple of approaches:
> 1. Any xlator on the server side will be made aware of a pending snap, thus giving it time to do its housekeeping.
> 2. Doing the above for client-side xlators is complex, and not targeted for now. But in the future, if we have more control over the number/location of the clients, we should be able to handle that too.
> 3. Currently most of the xlator ops are recoverable (except dht rename of a directory). So at any given point in time, a snap should heal itself.
>
Understood. Is there anything we could do to provide the same kind of
"best effort" approach to clients (e.g., a reconfigure as done for the
server)? Even if there is still no guaranteed synchronization, I think
anything that helps maximize the chances of coherency improves the
experience.
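Even something as simple as a registration hook that interested
translators (client- or server-side) could use to do best-effort
housekeeping before the snap. Standalone sketch, all names made up (this
is not the glusterfs notify API):

/* Illustration only: translators register a hook that gets called just
 * before the snap is taken, so each one can flush/park what it can. */
#include <stdio.h>

#define MAX_HOOKS 16

typedef void (*pre_snap_hook_t)(void *data);

static pre_snap_hook_t hooks[MAX_HOOKS];
static void *hook_data[MAX_HOOKS];
static int nr_hooks;

static int register_pre_snap_hook(pre_snap_hook_t fn, void *data)
{
        if (nr_hooks >= MAX_HOOKS)
                return -1;
        hooks[nr_hooks] = fn;
        hook_data[nr_hooks] = data;
        nr_hooks++;
        return 0;
}

/* Called by the snap orchestration just before barriering. */
static void notify_pre_snapshot(void)
{
        int i;

        for (i = 0; i < nr_hooks; i++)
                hooks[i](hook_data[i]);
}

/* Example hook: an afr-like translator could flush in-flight transactions. */
static void flush_pending(void *data)
{
        printf("%s: flushing in-flight transactions before snap\n",
               (const char *)data);
}

int main(void)
{
        register_pre_snap_hook(flush_pending, "afr-0");
        notify_pre_snapshot();
        return 0;
}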
>> It's been a while since I've played with lvm snapshots and I suppose the
>> latest technology does away with the old per-snap exception store. Even
>> still, it seems like self-heals running across sets of large,
>> inconsistent vm image files (and potentially copying them from one side
>> to another) could eat a ton of resource (space and/or cpu), no?
>
> This would kick in only when a snapped volume is restored and started. In that case we have to heal.
> Or maybe the alternative is: when mounted as RO, skip healing, and just make it read from the right copy.
> One way to ease the problem is to make sure a snap is taken only if all the bricks are up. That way self-heal might (not guaranteed) have already completed.
>
Right, it's only an issue when somebody wants to actually use that
snap. ;) The RO situation makes sense; maybe even consider never running
a heal on a snap unless it's explicitly requested via the cli, to try and
avoid the issue?
That might alleviate the problem in the basic case. E.g., we probably
want to avoid kicking off heals just because somebody is trolling through
the most recent 10-15 snapshots or so looking for an old version of
something.
Brian
>> Brian
>
>>> With regards,
>>> Shishir
>>>
>>> ----- Original Message -----
>
>>> From: "Brian Foster" <bfoster at redhat.com>
>>> To: "Shishir Gowda" <sgowda at redhat.com>
>>> Cc: gluster-devel at nongnu.org
>>> Sent: Monday, August 5, 2013 6:11:47 PM
>>> Subject: Re: [Gluster-devel] Snapshot design for glusterfs volumes
>>>
>>> On 08/02/2013 02:26 AM, Shishir Gowda wrote:
>>>> Hi All,
>>>>
>>>> We propose to implement snapshot support for glusterfs volumes in
>>>> release-3.6.
>>>>
>>>> Attaching the design document in the mail thread.
>>>>
>>>> Please feel free to comment/critique.
>>>>
>>>
>>> Hi Shishir,
>>>
>>> Thanks for posting this. A couple questions:
>>>
>>> - The stage-1 prepare section suggests that operations are blocked
>>> (barrier) in the callback, but later on in the doc it indicates incoming
>>> operations would be held up. Does barrier block winds and unwinds, or
>>> just winds? Could you elaborate on the logic there?
>>>
>>> - This is kind of called out in the open issues section with regard to
>>> write-behind, but don't we require some kind of operational coherency
>>> with regard to cluster translator operations? Is it expected that a
>>> snapshot across a cluster of bricks might not be coherent with regard to
>>> active afr transactions (and thus potentially require a heal in the
>>> snap), for example?
>>>
>>> Brian
>>>
>>>> With regards,
>>>> Shishir
>>>>
>>>
>>>
>