[Gluster-devel] Some questions about requisites of translators

Xavier Hernandez xhernandez at datalab.es
Mon May 7 08:07:52 UTC 2012


On 05/05/2012 08:02 AM, Anand Avati wrote:
>
>
> On Wed, May 2, 2012 at 3:55 AM, Xavier Hernandez 
> <xhernandez at datalab.es> wrote:
>
>     Hello,
>
>     I'm wondering if there are any requisites that translators must
>     satisfy to work correctly inside glusterfs.
>
>     In particular I need to know two things:
>
>     1. Are translators required to respect the order in which they
>     receive the requests ?
>
>     This is especially important in translators such as
>     performance/io-threads or caching ones. It seems that these
>     translators can reorder requests. If this is the case, is there
>     any way to force some order between requests ? can inodelk/entrylk
>     be used to force the order ?
>
>
> Translators are not expected to maintain ordering of requests. The 
> only translator which takes care of ordering calls is write-behind. 
> After acknowledging back write requests it has to make sure future 
> requests see the true "effect" as though the previous write actually 
> completed. To that end, it queues future "dependent" requests till the 
> write acknowledgement is received from the server.
>
> inodelk/entrylk calls help achieve synchronization among clients (by 
> getting into a critical section) - just like a mutex. It is an 
> arbitrator. It does not help for ordering of two calls. If one call 
> must strictly complete after another call from your translator's point 
> of view (i.e, if it has such a requirement), then the latter call's 
> STACK_WIND must happen in the callback of the former's STACK_UNWIND 
> path. There are no guarantees maintained by the system to ensure that 
> a second STACK_WIND issued right after a first STACK_WIND will 
> complete and callback in the same order. Write-behind does all its 
> ordering gimmicks only because it STACK_UNWINDs a write call 
> prematurely and therefore must maintain the causal effects by means of 
> queueing new requests behind the downcall towards the server.
Good to know
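
To make the chaining rule above concrete, here is a minimal sketch in plain C. It does not use the GlusterFS API: toy_wind(), toy_child_run(), toy_cbk() and the call_t structure are hypothetical stand-ins for a wind/callback pair, and the toy child deliberately completes the newest request first to model a lower layer that is free to reorder. It only illustrates that issuing the second call from the first call's callback fixes the order, while issuing both back to back does not.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy stand-ins for illustration only; none of this is the GlusterFS API. */
typedef struct call call_t;
typedef void (*cbk_fn)(call_t *call, int op_ret, int op_errno);

struct call {
    const char *name;
    cbk_fn      cbk;
    call_t     *next;   /* call to wind once this one has completed, or NULL */
};

#define MAX_CALLS 8
static call_t     *pending[MAX_CALLS];
static size_t      n_pending;
static const char *completed[MAX_CALLS];
static size_t      n_completed;

/* "Wind" a call: hand it to the toy child, which completes it later. */
static void toy_wind(call_t *call)
{
    assert(n_pending < MAX_CALLS);
    pending[n_pending++] = call;
}

/* The toy child completes the newest pending call first, to model a
 * lower layer that is free to reorder requests. */
static void toy_child_run(void)
{
    while (n_pending > 0) {
        call_t *call = pending[--n_pending];
        call->cbk(call, 0, 0);
    }
}

/* Record completion and, on success, wind the dependent call (if any). */
static void toy_cbk(call_t *call, int op_ret, int op_errno)
{
    (void)op_errno;
    assert(n_completed < MAX_CALLS);
    completed[n_completed++] = call->name;
    if (op_ret == 0 && call->next != NULL)
        toy_wind(call->next);
}

/* 1 if exactly "first" then "second" completed, 0 otherwise. */
static int first_then_second(void)
{
    return n_completed == 2 &&
           strcmp(completed[0], "first") == 0 &&
           strcmp(completed[1], "second") == 0;
}

/* Both calls wound back to back: completion order is up to the child. */
int back_to_back_order_holds(void)
{
    call_t first  = { "first",  toy_cbk, NULL };
    call_t second = { "second", toy_cbk, NULL };

    n_pending = n_completed = 0;
    toy_wind(&first);
    toy_wind(&second);
    toy_child_run();
    return first_then_second();
}

/* Second call wound from the first call's callback: order is fixed. */
int chained_order_holds(void)
{
    call_t second = { "second", toy_cbk, NULL };
    call_t first  = { "first",  toy_cbk, &second };

    n_pending = n_completed = 0;
    toy_wind(&first);
    toy_child_run();
    return first_then_second();
}
```

With this toy child, back_to_back_order_holds() reports that the order was not preserved, while chained_order_holds() reports that it was. In a real translator the equivalent would be to issue the second STACK_WIND from the callback of the first call.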

>     2. Are translators required to propagate callback arguments even
>     if the result of the operation is an error ? and if an internal
>     translator error occurs ?
>
>
> Usually no. If op_ret is -1, only op_errno is expected to be a usable 
> value. Rest of the callback parameters are junk.
>
>     When a translator has multiple subvolumes, I've seen that some
>     arguments, such as xdata, are replaced with NULL. This can be
>     understood, but are regular translators (those that only have one
>     subvolume) allowed to do that or must they preserve the value of
>     xdata, even in the case of an internal error ?
>
>
> It is best to preserve the arguments unless you know specifically what 
> you are doing. In case of error, all the non-op_{ret,errno} arguments 
> are typically junk, including xdata.
>
>     If this is not a requisite, xdata loses its function of
>     delivering back extra information.
>
>
> Can you explain? Are you seeing a use case for having a valid xdata in 
> the callback even with op_ret == -1?
>
As part of a translator I'm developing that works with multiple 
subvolumes, I need to implement some healing support to maintain data 
coherency (similar to AFR). After some thought, I decided that it could 
be advantageous to use a dedicated healing translator located near the 
bottom of the translator stack on the servers. This translator won't 
work by itself; it only adds support to be used by a higher-level 
translator, which has to manage the healing logic and decide 
when a node needs to be healed.

To do this, I sometimes need to return an error because an operation 
cannot be completed due to some condition related to healing itself 
(not to the underlying storage). However, I also need to send some 
specific healing information so that the upper translator knows how to 
handle the detected condition.

I cannot send a success answer because intermediate translators could 
take the fake data as valid and begin to operate incorrectly or even 
create inconsistencies. The other alternative is to use op_errno to 
encode the extra data, but this would also be difficult, even 
impossible in some cases, because of the amount of data and the 
complexity of combining it with an error code without misleading 
intermediate translators with strange or invalid error codes.
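
To illustrate the use case, here is a rough sketch in plain C of a failed operation that still carries extra data back up the stack. Everything in it is hypothetical: toy_xdata_t and toy_reply_t stand in for xdata and the callback arguments, the "heal.pending" key and its value are invented for the example, and EIO is just an arbitrary error code. It is not GlusterFS code, only a picture of what I would like the intermediate translators to allow.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for an xdata dictionary: a single key/value pair. */
typedef struct {
    const char *key;
    const char *value;
} toy_xdata_t;

/* Hypothetical stand-in for the arguments of a callback. */
typedef struct {
    int                op_ret;    /* 0 on success, -1 on failure */
    int                op_errno;  /* meaningful only when op_ret == -1 */
    const toy_xdata_t *xdata;     /* extra information, may be NULL */
} toy_reply_t;

/* Bottom "healing" translator: fails the operation when healing is
 * pending, but attaches a hint describing the condition. */
toy_reply_t heal_xl_write(int needs_heal)
{
    static const toy_xdata_t hint = { "heal.pending", "version-mismatch" };
    toy_reply_t reply = { 0, 0, NULL };

    if (needs_heal) {
        reply.op_ret   = -1;
        reply.op_errno = EIO;   /* arbitrary choice for this sketch */
        reply.xdata    = &hint;
    }
    return reply;
}

/* Intermediate translator with one subvolume: forwards the reply
 * unchanged, including xdata when the operation failed. */
toy_reply_t passthrough_xl_write(int needs_heal)
{
    return heal_xl_write(needs_heal);
}

/* Upper translator: on failure, looks for the healing hint.
 * Returns the hint value, or NULL if there is none. */
const char *upper_xl_heal_hint(int needs_heal)
{
    toy_reply_t reply = passthrough_xl_write(needs_heal);

    assert(reply.op_ret == 0 || reply.op_ret == -1);
    if (reply.op_ret == -1 && reply.xdata != NULL &&
        strcmp(reply.xdata->key, "heal.pending") == 0)
        return reply.xdata->value;
    return NULL;
}
```

Here upper_xl_heal_hint(1) only finds the hint because the intermediate translator forwarded xdata unchanged on the error path. If it had replaced xdata with NULL, the upper translator would see only the error code.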

I talked with John Mark about this translator and he suggested that I 
discuss it on the list. I'll therefore start another thread to explain 
in more detail how it works, and I would very much appreciate your 
opinion, and that of the other developers, about it, especially on 
whether it can really be faster/safer than other solutions, and on any 
problems you find or suggestions you have to improve it. I think it 
could also be used by AFR and any future translator that may need some 
healing capabilities.

Thank you very much,

Xavi