[Gluster-devel] compound fop design first cut
Pranith Kumar Karampuri
pkarampu at redhat.com
Tue Dec 8 07:18:55 UTC 2015
On 12/08/2015 09:02 AM, Pranith Kumar Karampuri wrote:
>
>
> On 12/08/2015 02:53 AM, Shyam wrote:
>> Hi,
>>
>> Why not think along the lines of new FOPs like fop_compound(_cbk)
>> where, the inargs to this FOP is a list of FOPs to execute (either in
>> order or any order)?
> That is the intent. The question is how we specify the fops that we
> want to perform and their arguments. In this approach, for example,
> xl_fxattrop_writev() is a new FOP. The list of fops that need to be
> done is fxattrop, writev in that order, and the arguments are a union
> of the arguments needed to perform fxattrop and writev. The reason
> this fop is not implemented throughout the graph is to avoid changing
> most of the stack on the brick side in the first cut of the
> implementation, i.e. quota/barrier/geo-rep/io-threads
> priorities/bit-rot may have to implement these new compound fops. We
> still get the benefit of avoiding the network round trips.
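>
> As a rough sketch (argument names are illustrative), the prototype of
> such a fop would simply be the union of the two argument lists:
>
> int32_t
> xl_fxattrop_writev (call_frame_t *frame, xlator_t *this, fd_t *fd,
>                     gf_xattrop_flags_t xattrop_flags,
>                     dict_t *fxattrop_dict, dict_t *fxattrop_xdata,
>                     struct iovec *vector, int32_t count, off_t off,
>                     uint32_t writev_flags, struct iobref *iobref,
>                     dict_t *writev_xdata);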
>>
>> With a scheme like the above we could,
>> - compound any set of FOPs (of course, we need to take care here,
>> but still the feasibility exists)
> That feasibility still exists, but the fop space gets blown up for
> each combination.
>> - Each xlator can inspect the compound relation and chose to
>> uncompound them. So if an xlator cannot perform FOPA+B as a single
>> compound FOP, it can choose to send FOPA and then FOPB and chain up
>> the responses back to the compound request sent to it. Also, the
>> intention here would be to leverage existing FOP code in any xlator,
>> to appropriately modify the inargs
>> - The RPC payload is constructed based on existing FOP RPC
>> definitions, but compounded based on the compound FOP RPC definition
> This will be done in phase-3, after learning a bit more about how best
> to implement it so that we avoid stuffing arguments into xdata in the
> future as much as possible. After that we can choose to retire the
> compound-fop-sender and receiver xlators.
>>
>> Possibly on the brick graph as well, pass these down as compounded
>> FOPs, till someone decides to break it open and do it in phases
>> (ultimately POSIX xlator).
> This will be done in phase-2. At the moment we are not giving any
> choice to the xlators on the brick side.
>>
>> The intention would be to break a compound FOP in case an xlator in
>> between cannot support it or, even expand a compound FOP request, say
>> the fxattropAndWrite is an AFR compounding decision, but a compound
>> request to AFR maybe WriteandClose, hence AFR needs to extend this
>> compound request.
> Yes. There was a discussion with Krutika about this: if shard wants to
> do a write followed by an xattrop as a single fop, then we need dht to
> implement dht_writev_fxattrop(), which should look somewhat similar to
> dht_writev(), and afr will need to implement afr_writev_fxattrop() as
> a full-blown transaction, where it takes locks in the data+metadata
> domains, does the data+metadata pre-op, winds to
> compound_fop_sender_writev_fxattrop(), and then does the data+metadata
> post-op and the unlocks.
>
> If we were to do writev and fxattrop separately, the fops would be (in
> the unoptimized case):
> 1) finodelk for write
> 2) fxattrop for pre-op of write
> 3) write
> 4) fxattrop for post-op of write
> 5) unlock for write
> 6) finodelk for fxattrop
> 7) fxattrop for pre-op of shard-fxattrop
> 8) shard-fxattrop
> 9) fxattrop for post-op of shard-fxattrop
> 10) unlock for fxattrop
>
> If AFR instead implements writev_fxattrop as a data+metadata
> transaction:
> 1) finodelk in the data and metadata domains simultaneously (just like
> we take multiple locks in rename)
> 2) pre-op for the data and metadata parts as part of the compound fop
> 3) writev+fxattrop
> 4) post-op for the data and metadata parts as part of the compound fop
> 5) unlocks simultaneously
>
> So it is still a 2x reduction in the number of network fops, except
> perhaps for the locking.
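>
> In skeleton form (the function name and structure here are
> illustrative, not the actual implementation), the compound transaction
> would map onto the usual AFR transaction structure like this:
>
> afr_writev_fxattrop (call_frame_t *frame, xlator_t *this, ...)
> {
>         /* 1) take finodelk in the data and metadata domains
>          *    simultaneously */
>         /* 2) fxattrop pre-op covering both the data and metadata
>          *    parts */
>         /* 3) STACK_WIND the compound writev+fxattrop to the children
>          *    (compound_fop_sender_writev_fxattrop) */
>         /* 4) fxattrop post-op for the data and metadata parts */
>         /* 5) unlock both domains simultaneously */
> }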
>>
>> The above is just an off-the-cuff thought on the same.
> We need to arrive at a consensus about how to specify the list of fops
> and their arguments. The reason I went against list_of_fops is to make
> it easier to discover the possible optimizations per compound fop
> (inspired by ec's implementation of multiplication by all possible
> elements in the Galois field, where multiplication by each different
> number has a different optimization). Could you elaborate more on the
> idea you have about list_of_fops and its arguments? Maybe we can come
> up with combinations of fops where we can employ this technique of
> just list_of_fops and wind. I think the rest of the solutions you
> mentioned are where this will converge over time. The intention is to
> avoid network round trips without waiting for the whole stack to
> change, as much as possible.
Maybe I am overthinking it. Not a lot of combinations could be
transactions. In any case, do let me know what you have in mind.
>
> Pranith
>>
>> The scheme below seems too specific to my eyes, and looks like we
>> would be defining specific compound FOPs rather than having the
>> ability to define generic ones.
>>
>> On 12/07/2015 04:08 AM, Pranith Kumar Karampuri wrote:
>>> hi,
>>>
>>> Draft of the design doc:
>>>
>>> The main motivation for the design of this feature is to reduce
>>> network round trips by sending more than one fop in a single network
>>> operation, preferably without introducing new rpcs.
>>>
>>> There are two new xlators: compound-fop-sender and
>>> compound-fop-receiver. compound-fop-sender is loaded on top of each
>>> client xlator on the mount/client, and compound-fop-receiver is
>>> loaded below the server xlator on the bricks. On the mount/client
>>> side, the xlators from the caller xlator down to the
>>> compound-fop-sender xlator can choose to implement this extra
>>> compound fop handling. Once the fop reaches compound-fop-sender, it
>>> will try to choose a base fop, encode the other fop in the base
>>> fop's xdata, and wind the base fop to the client xlator. The client
>>> xlator sends the base fop with the encoded xdata to the server
>>> xlator on the brick using the base fop's rpc. Once the server xlator
>>> does resolve_and_resume(), it winds the base fop to the
>>> compound-fop-receiver xlator. This xlator decodes the extra fop from
>>> the base fop's xdata. Based on the order encoded in the xdata, it
>>> executes the separate fops one after the other and stores the cbk
>>> response arguments of both operations. It then encodes the response
>>> of the extra fop into the base fop's response xdata and unwinds the
>>> fop to the server xlator, which sends the response using the base
>>> rpc's response structure. The client xlator unwinds the base fop to
>>> compound-fop-sender, which decodes the response into the compound
>>> fop's response arguments and unwinds to the parent xlators.
>>>
>>> I will take the fxattrop+write operation that we want to implement
>>> in afr as an example to explain how things may look.
>>>
>>> compound_fop_sender_fxattrop_write (call_frame_t *frame,
>>>                                     xlator_t *this, fd_t *fd,
>>>                                     gf_xattrop_flags_t xattrop_flags,
>>>                                     dict_t *fxattrop_dict,
>>>                                     dict_t *fxattrop_xdata,
>>>                                     struct iovec *vector,
>>>                                     int32_t count, off_t off,
>>>                                     uint32_t writev_flags,
>>>                                     struct iobref *iobref,
>>>                                     dict_t *writev_xdata)
>>> {
>>>         0) Remember the compound fop; take writev() as the base fop.
>>>            In writev_xdata add the following key,value pairs:
>>>         1) "xattrop-flags" -> xattrop_flags
>>>         2) for each fxattrop_dict key ->
>>>            "fxattrop-dict-<actual-key>", value
>>>         3) for each fxattrop_xdata key ->
>>>            "fxattrop-xdata-<actual-key>", value
>>>         4) "order" -> "fxattrop, writev"
>>>         5) "compound-fops" -> "fxattrop"
>>>         6) Wind writev()
>>> }
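>>>
>>> As a minimal sketch of how steps 1)-3) could be done with the
>>> existing dict APIs (the helper name and prefix handling are
>>> illustrative):
>>>
>>> static int
>>> prefix_and_copy (dict_t *src, char *key, data_t *value, void *data)
>>> {
>>>         dict_t *dst = data; /* the base fop's writev_xdata */
>>>         char    newkey[256] = {0};
>>>
>>>         /* copy the entry under a "fxattrop-dict-" prefixed key */
>>>         snprintf (newkey, sizeof (newkey), "fxattrop-dict-%s", key);
>>>         return dict_set (dst, newkey, value);
>>> }
>>>
>>> /* inside compound_fop_sender_fxattrop_write() */
>>> dict_set_int32 (writev_xdata, "xattrop-flags", xattrop_flags);
>>> dict_foreach (fxattrop_dict, prefix_and_copy, writev_xdata);
>>> dict_set_str (writev_xdata, "order", "fxattrop, writev");
>>> dict_set_str (writev_xdata, "compound-fops", "fxattrop");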
>>>
>>> compound_fop_sender_fxattrop_write_cbk (...)
>>> {
>>>         /* decode the response args and call
>>>            parent_fxattrop_write_cbk */
>>> }
>>>
>>> <compound_fop_sender_parent>_fxattrop_write_cbk (call_frame_t *frame,
>>>                                       void *cookie, xlator_t *this,
>>>                                       int32_t fxattrop_op_ret,
>>>                                       int32_t fxattrop_op_errno,
>>>                                       dict_t *fxattrop_dict,
>>>                                       dict_t *fxattrop_xdata,
>>>                                       int32_t writev_op_ret,
>>>                                       int32_t writev_op_errno,
>>>                                       struct iatt *writev_prebuf,
>>>                                       struct iatt *writev_postbuf,
>>>                                       dict_t *writev_xdata)
>>> {
>>>         /**/
>>> }
>>>
>>> compound_fop_receiver_writev (call_frame_t *frame, xlator_t *this,
>>>                               fd_t *fd, struct iovec *vector,
>>>                               int32_t count, off_t off,
>>>                               uint32_t flags, struct iobref *iobref,
>>>                               dict_t *writev_xdata)
>>> {
>>>         0) Check if writev_xdata has "compound-fops", else
>>>            default_writev()
>>>         1) Decode writev_xdata from the above encoding ->
>>>            xattrop_flags, fxattrop_dict, fxattrop_xdata
>>>         2) Get "order"
>>>         3) Store all of the above in 'local'
>>>         4) Wind fxattrop() with
>>>            compound_receiver_fxattrop_cbk_writev_wind() as the cbk
>>> }
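>>>
>>> The check in step 0) could be as simple as the following sketch
>>> (assuming the sender encodes the "compound-fops" key as above):
>>>
>>> if (!writev_xdata || !dict_get (writev_xdata, "compound-fops"))
>>>         return default_writev (frame, this, fd, vector, count, off,
>>>                                flags, iobref, writev_xdata);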
>>>
>>> compound_receiver_fxattrop_cbk_writev_wind (call_frame_t *frame,
>>>                                       void *cookie, xlator_t *this,
>>>                                       int32_t op_ret,
>>>                                       int32_t op_errno,
>>>                                       dict_t *dict, dict_t *xdata)
>>> {
>>>         0) Store the fxattrop cbk args
>>>         1) Perform writev() with the stored writev params, with
>>>            compound_receiver_writev_cbk() as the cbk
>>> }
>>>
>>> compound_receiver_writev_cbk (call_frame_t *frame, void *cookie,
>>>                               xlator_t *this, int32_t op_ret,
>>>                               int32_t op_errno, struct iatt *prebuf,
>>>                               struct iatt *postbuf, dict_t *xdata)
>>> {
>>>         0) Store the writev cbk args
>>>         1) Encode the fxattrop response into writev_xdata with an
>>>            encoding similar to that in
>>>            compound_fop_sender_fxattrop_write()
>>>         2) Unwind writev()
>>> }
>>>
>>> This example is just to show how things may look; the actual
>>> implementation may just have all base fops call a common function
>>> that performs the operations in the order given in the receiver
>>> xlator. I have yet to think about that. It is probably better to
>>> encode the fop number from glusterfs_fop_t rather than the fop
>>> string in the dictionary.
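>>>
>>> For example, the "compound-fops" entry above could then be set as
>>> (a sketch, reusing the same key name):
>>>
>>> dict_set_int32 (writev_xdata, "compound-fops", GF_FOP_FXATTROP);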
>>>
>>> This is phase-1 of the change, because we don't want to change RPCs.
>>> In phase-2 we can implement the compound fops that are commonly used
>>> by a lot of translators throughout the stack, so that
>>> quota/bitrot/geo-rep/barrier etc. handle them.
>>> In phase-3, maybe just in time for 4.0, we can convert them to
>>> on-the-wire RPCs.
>>>
>>> Thanks to Raghavendra G, Krutika, Ravi and Anuradha for the
>>> discussions.
>>>
>>> Pranith
>>>
>>>
>
> _______________________________________________
> Gluster-devel mailing list
> Gluster-devel at gluster.org
> http://www.gluster.org/mailman/listinfo/gluster-devel