[Gluster-devel] compound fop design first cut

Pranith Kumar Karampuri pkarampu at redhat.com
Tue Dec 8 07:18:55 UTC 2015

On 12/08/2015 09:02 AM, Pranith Kumar Karampuri wrote:
> On 12/08/2015 02:53 AM, Shyam wrote:
>> Hi,
>> Why not think along the lines of new FOPs like fop_compound(_cbk) 
>> where, the inargs to this FOP is a list of FOPs to execute (either in 
>> order or any order)?
> That is the intent. The question is how we specify the fops that we
> want to do and the arguments to each fop. In this approach, for example,
> xl_fxattrop_writev() is a new FOP. The fops that need to be done are
> fxattrop and writev, in that order, and the arguments are the union of
> the arguments needed to perform fxattrop and writev. The reason this
> fop is not implemented throughout the graph is to avoid changing most
> of the stack on the brick side in the first cut of the implementation;
> otherwise quota/barrier/geo-rep/io-threads priorities/bit-rot may all
> have to implement these new compound fops. We still get the benefit of
> avoiding the network round trips.
>> With a scheme like the above we could,
>>  - compound any set of FOPs (of course, we need to take care here, 
>> but still the feasibility exists)
> The feasibility still exists, but the fop space will blow up, with a
> new fop needed for each combination.
>>  - Each xlator can inspect the compound relation and choose to 
>> uncompound them. So if an xlator cannot perform FOPA+B as a single 
>> compound FOP, it can choose to send FOPA and then FOPB and chain up 
>> the responses back to the compound request sent to it. Also, the 
>> intention here would be to leverage existing FOP code in any xlator, 
>> to appropriately modify the inargs
>>  - The RPC payload is constructed based on existing FOP RPC 
>> definitions, but compounded based on the compound FOP RPC definition
> This will be done in phase-3 after learning a bit more about how best 
> to implement it to prevent stuffing arguments in xdata in future as 
> much as possible. After which we can choose to retire 
> compound-fop-sender and receiver xlators.
>> Possibly on the brick graph as well, pass these down as compounded 
>> FOPs, till someone decides to break it open and do it in phases 
>> (ultimately POSIX xlator).
> This will be done in phase-2. At the moment we are not giving any 
> choice to the xlators on the brick side.
>> The intention would be to break a compound FOP in case an xlator in 
>> between cannot support it or, even expand a compound FOP request, say 
>> the fxattropAndWrite is an AFR compounding decision, but a compound 
>> request to AFR maybe WriteandClose, hence AFR needs to extend this 
>> compound request.
> Yes. There was a discussion with krutika: if shard wants to do a
> write and then an xattrop in a single fop, we need dht to implement
> dht_writev_fxattrop(), which should look somewhat similar to
> dht_writev(), and afr will need to implement afr_writev_fxattrop() as
> a full-blown transaction, where it takes data+metadata domain locks,
> does the data+metadata pre-op, winds to
> compound_fop_sender_writev_fxattrop(), does the data+metadata post-op
> and then unlocks.
> If we were to do writev and fxattrop separately, the fops will be (in
> the unoptimized case):
> 1) finodelk for write
> 2) fxattrop for pre-op of write
> 3) write
> 4) fxattrop for post-op of write
> 5) unlock for write
> 6) finodelk for fxattrop
> 7) fxattrop for pre-op of shard-fxattrop
> 8) shard-fxattrop
> 9) fxattrop for post-op of shard-fxattrop
> 10) unlock for fxattrop
> If AFR chooses to implement writev_fxattrop, that means a data+metadata
> transaction:
> 1) finodelk in data, metadata domain simultaneously (just like we take 
> multiple locks in rename)
> 2) pre-op for data, metadata parts as part of the compound fop
> 3) writev+fxattrop
> 4) post-op for data, metadata parts as part of the compound fop
> 5) unlocks simultaneously.
> So it is still a 2x reduction in the number of network fops, except
> maybe for locking.
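To make the 2x claim concrete, here is a small tally of the two sequences above. It is only a sketch: the strings are labels copied from the two lists, every function name is made up for this mail, and it counts each lock/unlock step as one network fop, which may not hold for the simultaneous locks.

```c
#include <assert.h>
#include <stddef.h>

/* Unoptimized case: writev and fxattrop run as two separate transactions. */
static const char *const separate_steps[] = {
    "finodelk for write", "fxattrop for pre-op of write", "write",
    "fxattrop for post-op of write", "unlock for write",
    "finodelk for fxattrop", "fxattrop for pre-op of shard-fxattrop",
    "shard-fxattrop", "fxattrop for post-op of shard-fxattrop",
    "unlock for fxattrop",
};

/* Compound case: one data+metadata transaction. */
static const char *const compound_steps[] = {
    "finodelk in data, metadata domains", "pre-op for data, metadata",
    "writev+fxattrop", "post-op for data, metadata", "unlocks",
};

#define STEP_COUNT(a) (sizeof(a) / sizeof((a)[0]))

size_t separate_fop_count(void) { return STEP_COUNT(separate_steps); }
size_t compound_fop_count(void) { return STEP_COUNT(compound_steps); }

/* Label of step i in either sequence, or NULL when i is out of range. */
const char *step_label(int compound, size_t i)
{
    if (compound)
        return i < STEP_COUNT(compound_steps) ? compound_steps[i] : NULL;
    return i < STEP_COUNT(separate_steps) ? separate_steps[i] : NULL;
}

/* Reduction factor under the one-step-equals-one-fop assumption. */
size_t reduction_factor(void)
{
    assert(compound_fop_count() > 0);
    return separate_fop_count() / compound_fop_count();
}
```

If the simultaneous lock and unlock steps turn out to need one network fop per domain, the compound count goes up and the factor drops below 2; that is the locking caveat.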
>> The above is just an off-the-cuff thought on the same.
> We need to arrive at a consensus about how to specify the list of fops
> and their arguments. The reason I went against list_of_fops is to make
> it easier to discover the optimizations that are possible per compound
> fop (inspired by ec's implementation of multiplication by all possible
> elements in the Galois field, where multiplication by each number has
> its own optimization). Could you elaborate on the idea you have about
> list_of_fops and its arguments? Maybe we can come up with combinations
> of fops where we can employ this technique of just list_of_fops and
> wind. I think the rest of the solutions you mentioned are where it will
> converge over time. The intention is to avoid network round trips
> without waiting for the whole stack to change, as much as possible.
Maybe I am overthinking it. Not many combinations could be
transactions. In any case, do let me know what you have in mind.
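To have something concrete to discuss, this is one way I could imagine list_of_fops looking: a list of tagged entries, each carrying only the arguments of its own fop. It is purely a guess at the idea, not a proposal for the real structures. All the sk_* names, the enum and the two stub handlers are hypothetical; the real identifiers would come from glusterfs_fop_t and the real arguments from the existing fop signatures.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical fop tags standing in for glusterfs_fop_t values. */
typedef enum { SK_FOP_WRITEV, SK_FOP_FXATTROP } sk_fop_t;

/* One list entry: a tag plus the arguments of that fop only. */
typedef struct {
    sk_fop_t fop;
    union {
        struct { const char *buf; size_t len; long offset; } writev;
        struct { int flags; const char *key; long delta; } fxattrop;
    } args;
    int op_ret; /* filled in as each fop completes */
} sk_fop_entry_t;

/* Stub handlers; a real xlator would wind the existing fop here. */
static int sk_do_writev(sk_fop_entry_t *e) { return (int)e->args.writev.len; }
static int sk_do_fxattrop(sk_fop_entry_t *e) { (void)e; return 0; }

/*
 * Run the entries in list order and stop after the first failure.
 * Returns how many entries were attempted.
 */
size_t sk_run_fop_list(sk_fop_entry_t *list, size_t n)
{
    size_t attempted = 0;

    assert(list != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        switch (list[i].fop) {
        case SK_FOP_WRITEV:
            list[i].op_ret = sk_do_writev(&list[i]);
            break;
        case SK_FOP_FXATTROP:
            list[i].op_ret = sk_do_fxattrop(&list[i]);
            break;
        default:
            list[i].op_ret = -1; /* unknown fop: treat as failure */
            break;
        }
        attempted++;
        if (list[i].op_ret < 0)
            break;
    }
    return attempted;
}
```

With this shape an xlator that cannot handle the list as a whole can still walk it entry by entry, but it sees only a generic list, so any per-combination optimization has to be found by inspecting the tags at run time.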

> Pranith
>> The scheme below seems too specific to my eyes, and looks like we 
>> would be defining specific compound FOPs rather than gaining the
>> ability to have generic ones.
>> On 12/07/2015 04:08 AM, Pranith Kumar Karampuri wrote:
>>> hi,
>>> Draft of the design doc:
>>> Main motivation for the design of this feature is to reduce network
>>> round trips by sending more than one fop in a network operation,
>>> preferably without introducing new rpcs.
>>> There are 2 new xlators: compound-fop-sender and compound-fop-receiver.
>>> compound-fop-sender is going to be loaded on top of each client-xlator
>>> on the mount/client and compound-fop-receiver is going to be loaded
>>> below server-xlator on the bricks. On the mount/client side, from the
>>> caller xlator down to the compound-fop-sender xlator, the xlators can
>>> choose to implement this extra compound fop handling. Once the fop
>>> reaches compound-fop-sender, it will try to choose a base fop, encode
>>> the other fop in the base fop's xdata, and wind the base fop to the
>>> client xlator. The client xlator sends the base fop with the encoded
>>> xdata to the server xlator on the brick using the rpc of the base fop.
>>> Once the server xlator does resolve_and_resume() it will wind the base
>>> fop to the compound-fop-receiver xlator. This xlator will decode the
>>> extra fop from the xdata of the base fop. Based on the order encoded
>>> in the xdata it executes the separate fops one after the other and
>>> stores the cbk response arguments of both operations. It then encodes
>>> the response of the extra fop into the base fop's response xdata and
>>> unwinds the fop to the server xlator, which sends the response using
>>> the base rpc's response structure. The client xlator will unwind the
>>> base fop to compound-fop-sender, which will decode the response into
>>> the compound fop's response arguments and unwind to the parent
>>> xlators.
>>> I will take the fxattrop+write operation that we want to implement in
>>> afr as an example to explain how things may look.
>>> compound_fop_sender_fxattrop_write(call_frame_t *frame, xlator_t *this,
>>>          fd_t *fd,
>>>          gf_xattrop_flags_t xattrop_flags,
>>>          dict_t *fxattrop_dict,
>>>          dict_t *fxattrop_xdata,
>>>          struct iovec *vector,
>>>          int32_t count,
>>>          off_t off,
>>>          uint32_t writev_flags,
>>>          struct iobref *iobref,
>>>          dict_t *writev_xdata)
>>> {
>>>          0) Remember the compound-fop
>>>          take base-fop as write()
>>>          in writev_xdata add the following key,value pairs
>>>          1) "xattrop-flags", xattrop_flags
>>>          2) for-each-fxattrop_dict key -> "fxattrop-dict-<actual-key>", value
>>>          3) for-each-fxattrop_xdata key -> "fxattrop-xdata-<actual-key>", value
>>>          4) "order" -> "fxattrop, writev"
>>>          5) "compound-fops" -> "fxattrop"
>>>          6) Wind writev()
>>> }
>>> compound_fop_sender_fxattrop_write_cbk(...)
>>> {
>>>          /* decode the response args and call parent_fxattrop_write_cbk */
>>> }
>>> <compound_fop_sender_parent>_fxattrop_write_cbk (call_frame_t *frame,
>>>                                          void *cookie,
>>>                                          xlator_t *this,
>>>                                          int32_t fxattrop_op_ret,
>>>                                          int32_t fxattrop_op_errno,
>>>                                          dict_t *fxattrop_dict,
>>>                                          dict_t *fxattrop_xdata,
>>>                                          int32_t writev_op_ret,
>>>                                          int32_t writev_op_errno,
>>>                                          struct iatt *writev_prebuf,
>>>                                          struct iatt *writev_postbuf,
>>>                                          dict_t *writev_xdata)
>>> {
>>> /**/
>>> }
>>> compound_fop_receiver_writev(call_frame_t *frame, xlator_t *this,
>>>          fd_t *fd,
>>>          struct iovec *vector,
>>>          int32_t count,
>>>          off_t off,
>>>          uint32_t flags,
>>>          struct iobref *iobref,
>>>          dict_t *writev_xdata)
>>> {
>>>          0) Check if writev_xdata has "compound-fops" else default_writev()
>>>          1) decode writev_xdata from above encoding -> xattrop flags,
>>>             fxattrop_dict, fxattrop_xdata
>>>          2) get "order"
>>>          3) Store all the above in 'local'
>>>          4) wind fxattrop() with
>>>             compound_receiver_fxattrop_cbk_writev_wind() as cbk
>>> }
>>> compound_receiver_fxattrop_cbk_writev_wind (call_frame_t *frame,
>>>                                              void *cookie,
>>>                                              xlator_t *this,
>>>                                              int32_t op_ret,
>>>                                              int32_t op_errno,
>>>                                              dict_t *dict,
>>>                                              dict_t *xdata)
>>> {
>>>          0) store fxattrop cbk_args
>>>          1) Perform writev() with writev_params with
>>>             compound_receiver_writev_cbk() as the 'cbk'
>>> }
>>> compound_receiver_writev_cbk (call_frame_t *frame, void *cookie,
>>>                       xlator_t *this,
>>>                       int32_t op_ret, int32_t op_errno,
>>>                       struct iatt *prebuf,
>>>                       struct iatt *postbuf, dict_t *xdata)
>>> {
>>>          0) store writev cbk_args
>>>          1) Encode fxattrop response to writev_xdata with similar
>>>             encoding as in compound_fop_sender_fxattrop_write()
>>>          2) unwind writev()
>>> }
>>> This example is just to show how things may look, but the actual
>>> implementation may just have all base-fops calling a common function
>>> to perform the operations in the order given in the receiver xl. Yet
>>> to think about that. It is probably better to encode the fop-number
>>> from glusterfs_fop_t rather than the fop-string in the dictionary.
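A rough sketch of the key encoding in the draft above, to check that the prefix scheme round-trips. It does not use dict_t: sk_dict_t is a toy fixed-size stand-in, every sk_* name is made up for illustration, and only the fxattrop_dict keys are handled (xattrop-flags and fxattrop-xdata would follow the same pattern). The "order" value is kept as a string here even though a fop-number is probably the better choice.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define SK_MAX_PAIRS 16
#define SK_KEY_LEN   64

/* Toy stand-in for dict_t: a fixed-size array of string pairs. */
typedef struct {
    char key[SK_MAX_PAIRS][SK_KEY_LEN];
    const char *val[SK_MAX_PAIRS];
    int count;
} sk_dict_t;

static const char SK_PREFIX[] = "fxattrop-dict-";

int sk_dict_set(sk_dict_t *d, const char *key, const char *val)
{
    assert(d != NULL && key != NULL);
    if (d->count >= SK_MAX_PAIRS || strlen(key) >= SK_KEY_LEN)
        return -1;
    strcpy(d->key[d->count], key);
    d->val[d->count] = val;
    d->count++;
    return 0;
}

const char *sk_dict_get(const sk_dict_t *d, const char *key)
{
    for (int i = 0; i < d->count; i++)
        if (strcmp(d->key[i], key) == 0)
            return d->val[i];
    return NULL;
}

/* Sender side: fold the fxattrop dict into the base fop's xdata. */
int sk_encode(const sk_dict_t *fxattrop_dict, sk_dict_t *writev_xdata)
{
    char key[SK_KEY_LEN];

    for (int i = 0; i < fxattrop_dict->count; i++) {
        int n = snprintf(key, sizeof(key), "%s%s", SK_PREFIX,
                         fxattrop_dict->key[i]);
        if (n < 0 || (size_t)n >= sizeof(key))
            return -1; /* prefixed key would not fit */
        if (sk_dict_set(writev_xdata, key, fxattrop_dict->val[i]) != 0)
            return -1;
    }
    if (sk_dict_set(writev_xdata, "order", "fxattrop, writev") != 0)
        return -1;
    return sk_dict_set(writev_xdata, "compound-fops", "fxattrop");
}

/* Receiver side: recover the fxattrop dict; -1 means a plain writev. */
int sk_decode(const sk_dict_t *writev_xdata, sk_dict_t *fxattrop_dict)
{
    size_t plen = strlen(SK_PREFIX);

    if (sk_dict_get(writev_xdata, "compound-fops") == NULL)
        return -1;
    for (int i = 0; i < writev_xdata->count; i++) {
        if (strncmp(writev_xdata->key[i], SK_PREFIX, plen) == 0) {
            if (sk_dict_set(fxattrop_dict, writev_xdata->key[i] + plen,
                            writev_xdata->val[i]) != 0)
                return -1;
        }
    }
    return 0;
}
```

One thing the sketch makes visible: a prefixed key can exceed whatever key-length limit applies, so the encoder needs a failure path that falls back to sending the fops separately.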
>>> This is phase-1 of the change because we don't want to change RPCs.
>>> In phase-2 we can implement the compound fops that are commonly used
>>> by a lot of translators throughout the stack so that
>>> quota/bitrot/geo-rep/barrier etc. handle them.
>>> In phase-3, maybe just in time for 4.0, we can convert them to
>>> on-the-wire RPCs.
>>> Thanks to Raghavendra G, krutika, Ravi, Anuradha for the discussions
>>> Pranith
>>> _______________________________________________
>>> Gluster-devel mailing list
>>> Gluster-devel at gluster.org
>>> http://www.gluster.org/mailman/listinfo/gluster-devel