[Gluster-devel] compound fop design first cut
Pranith Kumar Karampuri
pkarampu at redhat.com
Wed Jan 6 16:46:51 UTC 2016
On 01/06/2016 07:50 PM, Jeff Darcy wrote:
>> 1) fops will be compounded per inode, meaning 2 fops on different
>> inodes can't be compounded (Not because of the design, Just reducing
>> scope of the problem).
>>
>> 2) Each xlator that wants a compound fop packs the arguments by
>> itself.
> Packed how? Are we talking about XDR here, or something else? How is
> dict_t handled? Will there be generic packing/unpacking code somewhere,
> or is each translator expected to do this manually?
Packed as described in step 4 below. Common helper functions will be
provided, one per fop, each of which fills an array cell with the
arguments passed to it. In addition, there will be filling functions
for each of the compound fops listed at:
https://public.pad.fsfe.org/p/glusterfs-compound-fops. The XDR should be
similar to what Soumya suggested in earlier mails, just like in NFS.
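A rough sketch of what an array cell and its fill helpers could look
like is given here. Every name in it (compound_req_t, cfop_type_t,
compound_fill_writev and so on) is hypothetical, the argument lists are
heavily trimmed, and the "writev then fsync" pair is only an
illustrative compound, not necessarily one from the list; real
structures would carry the full fop arguments.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical fop tags; real code would presumably reuse existing fop enums. */
typedef enum { CFOP_WRITEV, CFOP_FSTAT, CFOP_FSYNC } cfop_type_t;

/* Trimmed, illustrative per-fop argument structures. */
typedef struct { const void *buf; size_t len; off_t offset; } cfop_writev_args_t;
typedef struct { int datasync; } cfop_fsync_args_t;

/* One array cell: a tag plus a union of per-fop argument structures. */
typedef struct {
    cfop_type_t type;
    union {
        cfop_writev_args_t writev;
        cfop_fsync_args_t  fsync;
        /* fstat needs no extra arguments in this sketch */
    } u;
} compound_req_t;

/* Common helper: fill one cell with writev arguments. */
void compound_fill_writev(compound_req_t *cell, const void *buf,
                          size_t len, off_t offset)
{
    assert(cell != NULL);
    cell->type = CFOP_WRITEV;
    cell->u.writev.buf = buf;
    cell->u.writev.len = len;
    cell->u.writev.offset = offset;
}

/* Common helper: fill one cell with fsync arguments. */
void compound_fill_fsync(compound_req_t *cell, int datasync)
{
    assert(cell != NULL);
    cell->type = CFOP_FSYNC;
    cell->u.fsync.datasync = datasync;
}

/* Filler for a hypothetical "writev then fsync" compound.
 * Returns the number of cells used, or 0 if the array is too small. */
size_t compound_fill_write_fsync(compound_req_t *cells, size_t ncells,
                                 const void *buf, size_t len, off_t offset)
{
    assert(cells != NULL);
    if (ncells < 2)
        return 0;
    compound_fill_writev(&cells[0], buf, len, offset);
    compound_fill_fsync(&cells[1], 0);
    return 2;
}
```

A caller would declare a small compound_req_t array, call the filler
for the compound it wants, and hand the filled cells down the stack.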
>
>> 3) On the server side a de-compounder placed below server xlator
>> unpacks the arguments and does the necessary operations.
>>
>> 4) Arguments for compound fops will be passed as array of union of
>> structures where each structure is associated with a fop.
>>
>> 5) Each xlator will have <xlator>_compound_fop () which receives the
>> fop and does additional processing that is required for itself.
> What happens when (not if) some translator fails to provide this? Is
> there a default function? Is there something at the end of the chain
> that will log an error if the fop gets that far without being handled
> (as with GF_FOP_IPC)?
Yes, there will be a default_fop provided, just as for other fops, which
is just a pass-through. Posix will log and unwind with -1, ENOTSUPP.
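The pass-through and the error at the bottom of the stack can be
modelled roughly as follows. This is a toy model, not GlusterFS code:
xl_t, default_compound and posix_compound_unsupported are hypothetical
names, winding is reduced to a direct function call, and the sketch
uses the standard ENOTSUP from <errno.h> in place of ENOTSUPP.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdio.h>

typedef struct xl xl_t;
typedef int (*compound_fn_t)(xl_t *this_xl, int *op_errno);

/* Toy translator: a name, the next translator down, and its handler. */
struct xl {
    const char   *name;
    xl_t         *child;    /* NULL at the bottom of the graph */
    compound_fn_t compound; /* handler for the compound fop */
};

/* Default handler: no processing of its own, just pass to the child. */
int default_compound(xl_t *this_xl, int *op_errno)
{
    assert(this_xl != NULL && this_xl->child != NULL);
    return this_xl->child->compound(this_xl->child, op_errno);
}

/* Bottom of the stack: nobody handled the fop, so log and fail.
 * ENOTSUP stands in for the ENOTSUPP mentioned in the mail. */
int posix_compound_unsupported(xl_t *this_xl, int *op_errno)
{
    assert(this_xl != NULL && op_errno != NULL);
    fprintf(stderr, "%s: compound fop not supported\n", this_xl->name);
    *op_errno = ENOTSUP;
    return -1;
}
```

With a chain of translators that all use default_compound, a call at
the top reaches the bottom handler and comes back as -1 with op_errno
set.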
>
>> 6) Response will also be an array of union of response structures
>> where each structure is associated with a fop's response.
> What are the error semantics? Does processing of a series always stop
> at the first error, or are there some errors that allow retry/continue?
> If/when processing stops, who's responsible for cleaning up state
> changed by those parts that succeeded? What happens if the connection
> dies in the middle?
Yes, at the moment we are implementing stop-at-first-error semantics, as
that seems to satisfy all the compound fops we listed at
https://public.pad.fsfe.org/p/glusterfs-compound-fops. Each translator
that wants to handle the compound fop should handle failures just as it
does for a normal fop at the moment.
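Stop-at-first-error over an array of fops, with one response cell per
fop, can be sketched as below. The names (compound_op_t,
compound_rsp_t, compound_execute) are hypothetical, each fop is reduced
to a callback, and cfop_ok / cfop_fail_eio are stand-ins for real fops.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Trimmed response cell; a real one would hold a union of per-fop replies. */
typedef struct {
    int op_ret;
    int op_errno;
    int executed; /* 0 if the fop was never attempted */
} compound_rsp_t;

typedef int (*cfop_exec_t)(void *arg, int *op_errno);

typedef struct {
    cfop_exec_t exec;
    void       *arg;
} compound_op_t;

/* Run the fops in order and stop at the first failure.
 * Returns how many fops were attempted, including the failed one. */
size_t compound_execute(const compound_op_t *ops, compound_rsp_t *rsps,
                        size_t n)
{
    size_t i;

    assert(ops != NULL && rsps != NULL);
    for (i = 0; i < n; i++) {
        rsps[i].op_ret = 0;
        rsps[i].op_errno = 0;
        rsps[i].executed = 0;
    }
    for (i = 0; i < n; i++) {
        rsps[i].executed = 1;
        rsps[i].op_ret = ops[i].exec(ops[i].arg, &rsps[i].op_errno);
        if (rsps[i].op_ret < 0)
            return i + 1; /* later fops are left untouched */
    }
    return n;
}

/* Stand-in fops for trying the sketch out. */
int cfop_ok(void *arg, int *op_errno)
{
    (void)arg;
    *op_errno = 0;
    return 0;
}

int cfop_fail_eio(void *arg, int *op_errno)
{
    (void)arg;
    *op_errno = EIO;
    return -1;
}
```

Cleaning up after the fops that did succeed is not modelled here; as
stated above, that stays with each translator.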
>
> How are values returned from one operation in a series propagated as
> arguments for the next?
They are not. In the first cut, the only dependency between two fops
is whether the previous one succeeded. This much seems to work fine
for the fops we are targeting for now:
https://public.pad.fsfe.org/p/glusterfs-compound-fops. We may have to
enhance it in the future based on what comes up.
>
> What are the implications for buffer and message sizes? What are the
> limits on how large these can get, and/or how many operations can be
> compounded?
It depends on the limits imposed by the rpc layer. If it can't send the
request, the fop will fail. If it can send the request but the response
is too big to send back, I think the fop will fail with a frame timeout
while waiting for the response. Either way it will be a failure. At the
moment, for the fops listed at:
https://public.pad.fsfe.org/p/glusterfs-compound-fops this doesn't seem
to be a problem.
>
> How is synchronization handled? Is the inode locked for the duration of
> the compound operation, to prevent other operations from changing the
> context in which later parts of the compound operation execute? Are
> there possibilities for deadlock here? Alternatively, if no locking is
> done, are we going to document the fact that compound operations are not
> atomic/linearizable?
Since we are limiting the scope to single-inode fops, locking should
suffice. EC doesn't have any problem, as it has just one lock covering
data/entry and metadata. In afr we need to come up with a locking order
for the metadata and data domains, something similar to what we do in
rename, where we need to take multiple locks.
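The idea of a fixed locking order can be sketched as follows. The
order used here (metadata before data) is an arbitrary choice for
illustration, not a decision, and lk_domain_t and lock_order are
hypothetical names. The point is only that every compound fop takes
the domains it needs in the same global order, which is what avoids
deadlock between two compound fops on the same inode.

```c
#include <assert.h>
#include <stddef.h>

/* Lock domains, listed in the (assumed) global acquisition order. */
typedef enum {
    LK_DOMAIN_METADATA = 0,
    LK_DOMAIN_DATA     = 1,
    LK_DOMAIN_MAX
} lk_domain_t;

/* 'needed' is a bitmask: bit d set means domain d must be locked.
 * Writes the needed domains into 'order' in the fixed global order and
 * returns how many were written (at most 'cap'). */
size_t lock_order(unsigned needed, lk_domain_t *order, size_t cap)
{
    size_t n = 0;
    int d;

    assert(order != NULL);
    for (d = 0; d < LK_DOMAIN_MAX; d++) {
        if ((needed & (1u << d)) && n < cap)
            order[n++] = (lk_domain_t)d;
    }
    return n;
}
```

Locks would then be taken by walking the returned array front to back
and released back to front.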
Pranith