[Gluster-devel] compound fop design first cut

Pranith Kumar Karampuri pkarampu at redhat.com
Tue Dec 8 03:32:30 UTC 2015

On 12/08/2015 02:53 AM, Shyam wrote:
> Hi,
> Why not think along the lines of new FOPs like fop_compound(_cbk),
> where the inargs to this FOP are a list of FOPs to execute (either in
> order or in any order)?
That is the intent. The question is how we specify the fops we want to
perform and the arguments to each fop. In this approach, for example,
xl_fxattrop_writev() is a new FOP. The list of fops to perform is
fxattrop, writev, in that order, and the arguments are the union of the
arguments needed by fxattrop and writev. The reason this fop is not
implemented throughout the graph is to avoid changing most of the stack
on the brick side in the first cut of the implementation, i.e.
quota/barrier/geo-rep/io-threads priorities/bit-rot may otherwise have
to implement these new compound fops. We still get the benefit of
avoiding the network round trips.
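
To make the "union of the arguments" idea concrete, here is a rough
sketch of what the argument bundle of a fixed fxattrop+writev compound
fop might look like. It is only an illustration: the names
compound_fxattrop_writev_args, toy_dict, toy_fop and
compound_payload_len are hypothetical, and the GlusterFS types (dict_t,
gf_xattrop_flags_t, struct iobref) are replaced by simple stand-ins so
that the snippet compiles on its own.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>   /* off_t */
#include <sys/uio.h>     /* struct iovec */

/* Stand-in for dict_t; hypothetical, only here to keep the sketch
 * self-contained. */
typedef struct toy_dict { int unused; } toy_dict;

/* Fops in the order they must run on the brick (hypothetical ids). */
enum toy_fop { TOY_FOP_FXATTROP, TOY_FOP_WRITEV };

/* Hypothetical bundle: the arguments of fxattrop plus those of writev. */
struct compound_fxattrop_writev_args {
    enum toy_fop order[2];

    /* fxattrop part */
    int       xattrop_flags;   /* stands in for gf_xattrop_flags_t */
    toy_dict *fxattrop_dict;
    toy_dict *fxattrop_xdata;

    /* writev part */
    struct iovec *vector;
    int32_t   count;
    off_t     off;
    uint32_t  writev_flags;
    toy_dict *writev_xdata;
};

/* Total number of payload bytes carried by the writev part. */
static size_t
compound_payload_len(const struct compound_fxattrop_writev_args *a)
{
    size_t total = 0;

    assert(a != NULL && a->count >= 0);
    for (int32_t i = 0; i < a->count; i++)
        total += a->vector[i].iov_len;
    return total;
}
```

Keeping the two flag fields under distinct names avoids the clash
between the fxattrop flags and the writev flags when both sets of
arguments live in one signature.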
> With a scheme like the above we could,
>  - compound any set of FOPs (of course, we need to take care here, but 
> still the feasibility exists)
The feasibility still exists, but the fop space will blow up, with one
new fop for each combination.
>  - Each xlator can inspect the compound relation and choose to 
> uncompound them. So if an xlator cannot perform FOPA+B as a single 
> compound FOP, it can choose to send FOPA and then FOPB and chain up 
> the responses back to the compound request sent to it. Also, the 
> intention here would be to leverage existing FOP code in any xlator, 
> to appropriately modify the inargs
>  - The RPC payload is constructed based on existing FOP RPC 
> definitions, but compounded based on the compound FOP RPC definition
This will be done in phase-3, after learning a bit more about how best
to implement it, so that in future we avoid stuffing arguments in xdata
as much as possible. After that we can choose to retire the
compound-fop-sender and compound-fop-receiver xlators.
> Possibly on the brick graph as well, pass these down as compounded 
> FOPs, till someone decides to break it open and do it in phases 
> (ultimately POSIX xlator).
This will be done in phase-2. At the moment we are not giving the
xlators on the brick side any choice.
> The intention would be to break a compound FOP in case an xlator in 
> between cannot support it or, even expand a compound FOP request, say 
> the fxattropAndWrite is an AFR compounding decision, but a compound 
> request to AFR may be WriteandClose, hence AFR needs to extend this 
> compound request.
Yes. There was a discussion with krutika: if shard wants to do a write
followed by an xattrop in a single fop, then dht needs to implement
dht_writev_fxattrop(), which should look somewhat similar to
dht_writev(), and afr needs to implement afr_writev_fxattrop() as a
full-blown transaction: take the data+metadata domain locks, do the
data+metadata pre-op, wind to compound_fop_sender_writev_fxattrop(),
do the data+metadata post-op, and then unlock.

If we were to do writev and fxattrop separately, the fops would be (in
the unoptimized case):
1) finodelk for write
2) fxattrop for pre-op of write
3) write
4) fxattrop for post-op of write
5) unlock for write
6) finodelk for fxattrop
7) fxattrop for pre-op of shard-fxattrop
8) shard-fxattrop
9) fxattrop for post-op of shard-fxattrop
10) unlock for fxattrop

If AFR chooses to implement writev_fxattrop as a single data+metadata
transaction:
1) finodelk in data, metadata domains simultaneously (just like we take
multiple locks in rename)
2) pre-op for data, metadata parts as part of the compound fop
3) writev+fxattrop
4) post-op for data, metadata parts as part of the compound fop
5) unlocks simultaneously

So it is still a 2x reduction in the number of network fops, except
maybe for locking.
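
The arithmetic behind the 2x figure can be checked with a small sketch.
It simply tabulates the two step lists above and compares their
lengths. It assumes, as the lists do, that taking the locks in both
domains counts as one step; if each domain needs its own finodelk call
the compound count would be higher. The array and function names are
made up for this illustration.

```c
#include <assert.h>
#include <stddef.h>

#define ARRAY_LEN(a) (sizeof(a) / sizeof((a)[0]))

/* Network fops when writev and the shard fxattrop are sent separately. */
static const char *const separate_fops[] = {
    "finodelk (write)",
    "fxattrop pre-op (write)",
    "writev",
    "fxattrop post-op (write)",
    "unlock (write)",
    "finodelk (fxattrop)",
    "fxattrop pre-op (shard fxattrop)",
    "shard fxattrop",
    "fxattrop post-op (shard fxattrop)",
    "unlock (fxattrop)",
};

/* Network fops when AFR runs writev_fxattrop as one transaction. */
static const char *const compound_fops[] = {
    "finodelk (data+metadata)",
    "pre-op (data+metadata)",
    "writev+fxattrop",
    "post-op (data+metadata)",
    "unlock (data+metadata)",
};

static size_t separate_count(void) { return ARRAY_LEN(separate_fops); }
static size_t compound_count(void) { return ARRAY_LEN(compound_fops); }

/* Ratio of separate fops to compound fops. */
static double reduction_factor(void)
{
    assert(compound_count() > 0);
    return (double)separate_count() / (double)compound_count();
}
```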
> The above is just an off-the-cuff thought on the same.
We need to arrive at a consensus about how to specify the list of fops
and their arguments. The reason I went against list_of_fops is to make
it easier to discover the possible optimizations for each compound fop
(inspired by ec's implementation of multiplication by all possible
elements in the Galois field, where multiplication by each number has
its own optimization). Could you elaborate on the idea you have about
list_of_fops and its arguments? Maybe we can come up with combinations
of fops where we can employ this technique of just list_of_fops and
wind. I think the rest of the solutions you mentioned are where it will
converge over time. The intention is to avoid network round trips
without waiting for the whole stack to change as much as

> The scheme below seems too specific to my eyes, and looks like we 
> would be defining specific compound FOPs rather than gaining the
> ability to have generic ones.
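
For contrast with the per-combination fops, a generic list_of_fops
request could be sketched roughly as below. Nothing here is from an
actual patch: toy_fop, toy_fop_entry, toy_fop_handler, toy_compound_run
and the demo handlers are hypothetical names, and stopping at the first
failing fop is an assumption of the sketch, not something this thread
has settled.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical fop ids; real code would use glusterfs_fop_t. */
enum toy_fop { TOY_FOP_FXATTROP, TOY_FOP_WRITEV, TOY_FOP_MAX };

/* One list entry: which fop to run and an opaque pointer to its args. */
struct toy_fop_entry {
    enum toy_fop fop;
    void        *args;
    int          op_ret;   /* filled in after the fop has run */
};

/* Handler signature assumed here: 0 on success, -1 on failure. */
typedef int (*toy_fop_handler)(void *args);

/*
 * Run the entries in list order and record each op_ret.
 * Stops at the first failure (assumption of this sketch).
 * Returns the number of fops that were executed.
 */
static size_t
toy_compound_run(struct toy_fop_entry *list, size_t n,
                 const toy_fop_handler handlers[TOY_FOP_MAX])
{
    size_t done = 0;

    assert(list != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        list[i].op_ret = handlers[list[i].fop](list[i].args);
        done++;
        if (list[i].op_ret != 0)
            break;
    }
    return done;
}

/* Demo handlers: each appends a tag to a trace so the order is visible. */
struct toy_trace { char buf[8]; size_t len; };

static int toy_trace_add(struct toy_trace *t, char tag)
{
    if (t->len >= sizeof t->buf)
        return -1;
    t->buf[t->len++] = tag;
    return 0;
}

static int toy_fxattrop(void *args) { return toy_trace_add(args, 'x'); }
static int toy_writev(void *args)   { return toy_trace_add(args, 'w'); }
```

An xlator that cannot handle a given combination could walk the same
list and wind the entries one at a time, which is the "uncompound"
behaviour described earlier in the thread.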
> On 12/07/2015 04:08 AM, Pranith Kumar Karampuri wrote:
>> hi,
>> Draft of the design doc:
>> The main motivation for this feature is to reduce network round trips
>> by sending more than one fop in a single network operation, preferably
>> without introducing new rpcs.
>>
>> There are 2 new xlators: compound-fop-sender and compound-fop-receiver.
>> compound-fop-sender is going to be loaded on top of each client xlator
>> on the mount/client, and compound-fop-receiver is going to be loaded
>> below the server xlator on the bricks. On the mount/client side, the
>> xlators from the caller xlator down to the compound-fop-sender xlator
>> can choose to implement this extra compound fop handling. Once the fop
>> reaches compound-fop-sender, it will choose a base fop, encode the
>> other fop in the base fop's xdata, and wind the base fop to the client
>> xlator. The client xlator sends the base fop with the encoded xdata to
>> the server xlator on the brick using the rpc of the base fop.
>>
>> Once the server xlator does resolve_and_resume(), it will wind the base
>> fop to the compound-fop-receiver xlator. This xlator will decode the
>> extra fop from the xdata of the base fop. Based on the order encoded in
>> the xdata, it executes the separate fops one after the other and stores
>> the cbk response arguments of both operations. It then encodes the
>> response of the extra fop into the base fop's response xdata and
>> unwinds the fop to the server xlator, which sends the response using
>> the base rpc's response structure. The client xlator will unwind the
>> base fop to compound-fop-sender, which will decode the response into
>> the compound fop's response arguments and unwind to the parent xlators.
>>
>> I will take the fxattrop+write operation that we want to implement in
>> afr as an example to explain how things may look.
>> compound_fop_sender_fxattrop_write(call_frame_t *frame, xlator_t *this,
>>          fd_t *fd,
>>          gf_xattrop_flags_t xattrop_flags,
>>          dict_t *fxattrop_dict,
>>          dict_t *fxattrop_xdata,
>>          struct iovec *vector,
>>          int32_t count,
>>          off_t off,
>>          uint32_t writev_flags,
>>          struct iobref *iobref,
>>          dict_t *writev_xdata)
>> {
>>          0) Remember the compound-fop.
>>             Take base-fop as writev().
>>             In writev_xdata add the following key,value pairs:
>>          1) "xattrop-flags" -> xattrop_flags
>>          2) for-each-fxattrop_dict key ->
>>             "fxattrop-dict-<actual-key>", value
>>          3) for-each-fxattrop_xdata key ->
>>             "fxattrop-xdata-<actual-key>", value
>>          4) "order" -> "fxattrop, writev"
>>          5) "compound-fops" -> "fxattrop"
>>          6) Wind writev()
>> }
>> compound_fop_sender_fxattrop_write_cbk(...)
>> {
>>          /* decode the response args and call parent_fxattrop_write_cbk */
>> }
>> <compound_fop_sender_parent>_fxattrop_write_cbk(call_frame_t *frame,
>>          void *cookie,
>>          xlator_t *this,
>>          int32_t fxattrop_op_ret,
>>          int32_t fxattrop_op_errno,
>>          dict_t *fxattrop_dict,
>>          dict_t *fxattrop_xdata,
>>          int32_t writev_op_ret,
>>          int32_t writev_op_errno,
>>          struct iatt *writev_prebuf,
>>          struct iatt *writev_postbuf,
>>          dict_t *writev_xdata)
>> {
>> /**/
>> }
>> compound_fop_receiver_writev(call_frame_t *frame, xlator_t *this,
>>          fd_t *fd,
>>          struct iovec *vector,
>>          int32_t count,
>>          off_t off,
>>          uint32_t flags,
>>          struct iobref *iobref,
>>          dict_t *writev_xdata)
>> {
>>          0) Check if writev_xdata has "compound-fops", else
>>             default_writev()
>>          1) decode writev_xdata from above encoding -> xattrop flags,
>>             fxattrop_dict, fxattrop_xdata
>>          2) get "order"
>>          3) Store all the above in 'local'
>>          4) wind fxattrop() with
>>             compound_receiver_fxattrop_cbk_writev_wind() as cbk
>> }
>> compound_receiver_fxattrop_cbk_writev_wind(call_frame_t *frame,
>>          void *cookie,
>>          xlator_t *this,
>>          int32_t op_ret,
>>          int32_t op_errno,
>>          dict_t *dict,
>>          dict_t *xdata)
>> {
>>          0) store fxattrop cbk_args
>>          1) Perform writev() with writev_params with
>>             compound_receiver_writev_cbk() as the 'cbk'
>> }
>> compound_receiver_writev_cbk(call_frame_t *frame, void *cookie,
>>          xlator_t *this,
>>          int32_t op_ret,
>>          int32_t op_errno,
>>          struct iatt *prebuf,
>>          struct iatt *postbuf,
>>          dict_t *xdata)
>> {
>>          0) store writev cbk_args
>>          1) Encode fxattrop response to writev_xdata with similar
>>             encoding as in compound_fop_sender_fxattrop_write()
>>          2) unwind writev()
>> }
>> This example is just to show how things may look; the actual
>> implementation may just have all base fops calling a common function
>> to perform the operations in the given order in the receiver xl. I am
>> yet to think about that. It is probably better to encode the
>> fop-number from glusterfs_fop_t rather than the fop-string in the
>> dictionary.
>>
>> This is phase-1 of the change, because we don't want to change RPCs.
>> In phase-2 we can implement the compound fops that are commonly used
>> by a lot of translators throughout the stack, so that
>> quota/bitrot/geo-rep/barrier etc. handle them.
>> In phase-3, maybe just in time for 4.0, we can convert them to
>> on-the-wire RPCs.
>>
>> Thanks to Raghavendra G, krutika, Ravi, Anuradha for the discussions.
>> Pranith
>> _______________________________________________
>> Gluster-devel mailing list
>> Gluster-devel at gluster.org
>> http://www.gluster.org/mailman/listinfo/gluster-devel
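
The xdata encoding described in the quoted draft can be tried out with
a toy dictionary. The sketch below assumes the key names given in the
draft ("xattrop-flags", "fxattrop-dict-<actual-key>", "order",
"compound-fops"); everything else (toy_dict, toy_set, toy_get,
encode_fxattrop, decode_fxattrop, passing the flags as a string) is
hypothetical and stands in for dict_t and the real xlator code.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define TOY_MAX_PAIRS 16
#define TOY_MAX_STR   64
#define FXATTROP_DICT_PREFIX "fxattrop-dict-"

/* Tiny string-to-string map standing in for dict_t (hypothetical). */
typedef struct {
    char key[TOY_MAX_PAIRS][TOY_MAX_STR];
    char val[TOY_MAX_PAIRS][TOY_MAX_STR];
    int  n;
} toy_dict;

static int toy_set(toy_dict *d, const char *k, const char *v)
{
    if (d->n >= TOY_MAX_PAIRS || strlen(k) >= TOY_MAX_STR ||
        strlen(v) >= TOY_MAX_STR)
        return -1;
    strcpy(d->key[d->n], k);
    strcpy(d->val[d->n], v);
    d->n++;
    return 0;
}

static const char *toy_get(const toy_dict *d, const char *k)
{
    for (int i = 0; i < d->n; i++) {
        if (strcmp(d->key[i], k) == 0)
            return d->val[i];
    }
    return NULL;
}

/* Sender side: fold the fxattrop arguments into the base fop's xdata. */
static int encode_fxattrop(toy_dict *writev_xdata, const char *flags,
                           const toy_dict *fxattrop_dict)
{
    char key[TOY_MAX_STR];

    if (toy_set(writev_xdata, "xattrop-flags", flags) != 0)
        return -1;
    for (int i = 0; i < fxattrop_dict->n; i++) {
        int len = snprintf(key, sizeof key, "%s%s", FXATTROP_DICT_PREFIX,
                           fxattrop_dict->key[i]);
        if (len < 0 || (size_t)len >= sizeof key)
            return -1;
        if (toy_set(writev_xdata, key, fxattrop_dict->val[i]) != 0)
            return -1;
    }
    if (toy_set(writev_xdata, "order", "fxattrop, writev") != 0)
        return -1;
    return toy_set(writev_xdata, "compound-fops", "fxattrop");
}

/*
 * Receiver side: recover fxattrop_dict from the base fop's xdata.
 * Returns 1 if a compound fop was found, 0 for a plain writev (where
 * the receiver would fall back to default_writev()), -1 on error.
 */
static int decode_fxattrop(const toy_dict *writev_xdata,
                           toy_dict *fxattrop_dict)
{
    size_t plen = strlen(FXATTROP_DICT_PREFIX);

    if (toy_get(writev_xdata, "compound-fops") == NULL)
        return 0;
    for (int i = 0; i < writev_xdata->n; i++) {
        if (strncmp(writev_xdata->key[i], FXATTROP_DICT_PREFIX, plen) == 0) {
            if (toy_set(fxattrop_dict, writev_xdata->key[i] + plen,
                        writev_xdata->val[i]) != 0)
                return -1;
        }
    }
    return 1;
}
```

The prefix keeps the fxattrop keys from colliding with keys that the
writev xdata already carries, and lets the receiver strip them out
again before winding the two fops.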
