[Gluster-devel] compound fop design first cut

Tue Dec 8 07:38:21 UTC 2015

> 
> On 12/08/2015 09:02 AM, Pranith Kumar Karampuri wrote:
> >
> >
> > On 12/08/2015 02:53 AM, Shyam wrote:
> >> Hi,
> >>
> >> Why not think along the lines of new FOPs like fop_compound(_cbk)
> >> where, the inargs to this FOP is a list of FOPs to execute (either in
> >> order or any order)?
> > That is the intent. The question is how do we specify the fops that we
> > want to do and the arguments to the fop. In this approach, for example
> > xl_fxattrop_writev() is a new FOP. List of fops that need to be done
> > are fxattrop, writev in that order and the arguments are a union of
> > the arguments needed to perform the fops fxattrop, writev. The reason
> > why this fop is not implemented throughout the graph is to not change
> > most of the stack on the brick side in the first cut of the
> > implementation. i.e. quota/barrier/geo-rep/io-threads
> > priorities/bit-rot may have to implement these new compound fops. We
> > still get the benefit of avoiding the network round trips.
> >>
> >> With a scheme like the above we could,
> >>  - compound any set of FOPs (of course, we need to take care here,
> >> but still the feasibility exists)
> > It still exists but the fop space will blow up with each new
> > combination.
> >>  - Each xlator can inspect the compound relation and choose to
> >> uncompound them. So if an xlator cannot perform FOPA+B as a single
> >> compound FOP, it can choose to send FOPA and then FOPB and chain up
> >> the responses back to the compound request sent to it. Also, the
> >> intention here would be to leverage existing FOP code in any xlator,
> >> to appropriately modify the inargs
> >>  - The RPC payload is constructed based on existing FOP RPC
> >> definitions, but compounded based on the compound FOP RPC definition
> > This will be done in phase-3 after learning a bit more about how best
> > to implement it to prevent stuffing arguments in xdata in future as
> > much as possible. After which we can choose to retire
> > compound-fop-sender and receiver xlators.
> >>
> >> Possibly on the brick graph as well, pass these down as compounded
> >> FOPs, till someone decides to break it open and do it in phases
> >> (ultimately POSIX xlator).
> > This will be done in phase-2. At the moment we are not giving any
> > choice for the xlators on the brick side.
> >>
> >> The intention would be to break a compound FOP in case an xlator in
> >> between cannot support it or, even expand a compound FOP request, say
> >> the fxattropAndWrite is an AFR compounding decision, but a compound
> >> request to AFR maybe WriteandClose, hence AFR needs to extend this
> >> compound request.
> > Yes. There was a discussion with krutika where if shard wants to do
> > write then xattrop in a single fop, then we need dht to implement
> > dht_writev_fxattrop which should look somewhat similar to
> > dht_writev(), and afr will need to implement afr_writev_fxattrop() as
> > a full-blown transaction where it needs to take data+metadata domain
> > locks then do data+metadata pre-op then wind to
> > compound_fop_sender_writev_fxattrop() and then data+metadata post-op
> > then unlocks.
> >
> > If we were to do writev, fxattrop separately, fops will be (In
> > unoptimized case):
> > 1) finodelk for write
> > 2) fxattrop for preop of write.
> > 3) write
> > 4) fxattrop for post op of write
> > 5) unlock for write
> > 6) finodelk for fxattrop
> > 7) fxattrop for preop of shard-fxattrop
> > 8) shard-fxattrop
> > 9) fxattrop for post op of shard fxattrop
> > 10) unlock for fxattrop
> >
> > If AFR chooses to implement writev_fxattrop: means data+metadata
> > transaction.
> > 1) finodelk in data, metadata domain simultaneously (just like we take
> > multiple locks in rename)
> > 2) preop for data, metadata parts as part of the compound fop
> > 3) writev+fxattrop
> > 4) postop for data, metadata parts as part of the compound fop
> > 5) unlocks simultaneously.
> >
> > So it is still a 2x reduction in the number of network fops, except
> > maybe for locking.
> >>
> >> The above is just an off-the-cuff thought on the same.
> > We need to arrive at a consensus about how to specify the list of fops
> > and their arguments. The reason why I went against list_of_fops is to
> > make discovery of possible optimizations we can do easier per
> > compound fop (Inspired by ec's implementation of multiplications by
> > all possible elements in the Galois field, where multiplication with
> > different number has a different optimization). Could you elaborate
> > more about the idea you have about list_of_fops and its arguments?
> > Maybe we can come up with combinations of fops where we can employ this
> > technique of just list_of_fops and wind. I think rest of the solutions
> > you mentioned is where it will converge towards over time. Intention
> > is to avoid network round trips without waiting for the whole stack to
> > change as much as possible.
> Maybe I am overthinking it. Not a lot of combinations could be
> transactions. In any case do let me know what you have in mind.
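For the list_of_fops question above, here is a very rough sketch of one possible shape, not a proposal for the actual interface. Every name in it (sketch_fop, sketch_op, sketch_compound, sketch_uncompound and the two demo handlers) is hypothetical and exists only for illustration. It shows the "uncompound" fallback Shyam describes: an xlator that cannot handle the list as a unit runs the fops one by one through its existing single-fop path, stops at the first failure, and records a per-fop result that can be chained back to the compound request.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical fop identifiers for this sketch, not GlusterFS's real enum. */
enum sketch_fop { SK_FXATTROP, SK_WRITEV, SK_MKDIR, SK_INODELK };

/* One list entry: which fop to run, plus an opaque pointer to that fop's
 * own, unchanged argument struct. */
struct sketch_op {
    enum sketch_fop fop;
    void *args;
    int ret;            /* filled in on execution: 0 or -errno */
};

/* The compound request is just an ordered list of such entries. */
struct sketch_compound {
    struct sketch_op *ops;
    size_t count;
};

/* Stand-in for an xlator's existing single-fop code path. */
typedef int (*sketch_handler)(enum sketch_fop fop, void *args);

/* Uncompound fallback: run the fops in order through the single-fop
 * handler and stop at the first failure. Returns how many fops succeeded;
 * fops after the failed one are marked -ECANCELED (not attempted). */
size_t sketch_uncompound(struct sketch_compound *req, sketch_handler handler)
{
    size_t done = 0;

    assert(req != NULL && handler != NULL);
    while (done < req->count) {
        struct sketch_op *op = &req->ops[done];

        op->ret = handler(op->fop, op->args);
        if (op->ret < 0)
            break;
        done++;
    }
    for (size_t i = done + 1; i < req->count; i++)
        req->ops[i].ret = -ECANCELED;
    return done;
}

/* Demo handlers, only here so the sketch can be exercised. */
int sketch_always_ok(enum sketch_fop fop, void *args)
{
    (void)fop;
    (void)args;
    return 0;
}

int sketch_fail_on_writev(enum sketch_fop fop, void *args)
{
    (void)args;
    return fop == SK_WRITEV ? -EIO : 0;
}
```

With a list of fxattrop, writev, fxattrop and the failing handler, the first fop succeeds, writev returns -EIO, and the trailing fxattrop is never attempted. Whether a real compound fop should stop at the first failure or carry on is a design decision this sketch simply assumes.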

Just to add some concrete data, here are the compound ops I need right now (apart from afr):

1. Atomic compound (mkdir + inodelk):
=====================================

What: create a directory and acquire an inodelk if the directory doesn't exist. Both operations have to be atomic: once directory creation is successful, the inodelk _has_ to be granted.

How: As we discussed the other day, the compound xlator need not handle the atomicity issues. All it has to do is mkdir followed by inodelk. The atomicity part can be handled by another xlator, such as dentry-serialization, loaded as a parent of the compound xlator. Since dentry-serialization serializes lookup and mkdir (or compound mkdir + inodelk), the directory won't be visible outside the brick until the inodelk is acquired, which prevents other clients from acquiring an inodelk on the directory in the window between mkdir and inodelk.
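
To make the invariant concrete, here is a toy single-threaded model of it. This is not GlusterFS code: sketch_dentry, sketch_inodelk and sketch_mkdir_inodelk are made-up names, the "visible" flag merely stands in for what dentry-serialization would enforce, and returning EEXIST when the directory already exists is my assumption about the failure case.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy brick-side state for one directory entry (hypothetical model). */
struct sketch_dentry {
    bool exists;      /* directory has been created on the backend */
    bool visible;     /* lookups from other clients may see it */
    int lock_owner;   /* client id holding the inodelk, 0 if none */
};

/* Plain inodelk, as any client would issue it after a successful lookup. */
int sketch_inodelk(struct sketch_dentry *d, int client)
{
    assert(d != NULL && client != 0);
    if (!d->exists || !d->visible)
        return -ENOENT;   /* not visible outside the brick yet */
    if (d->lock_owner != 0 && d->lock_owner != client)
        return -EAGAIN;   /* another client already holds the lock */
    d->lock_owner = client;
    return 0;
}

/* Compound mkdir + inodelk: the directory becomes visible only after the
 * lock has been granted, so no other client can lock it in between. */
int sketch_mkdir_inodelk(struct sketch_dentry *d, int client)
{
    assert(d != NULL && client != 0);
    if (d->exists)
        return -EEXIST;       /* assumption: behaves like a plain mkdir */
    d->exists = true;         /* step 1: mkdir */
    d->lock_owner = client;   /* step 2: inodelk, uncontended by design */
    d->visible = true;        /* only now may lookups see the directory */
    return 0;
}
```

A second client that tries sketch_inodelk before the compound op has run gets ENOENT, and after it has run gets EAGAIN, so there is no point at which it can win the lock ahead of the creator.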

2. mknod which fails with EEXIST if:
   a. path is present.
   b. gfid handle corresponding to gfid arg passed along mknod is present

Why: This is a requirement from the geo-replication team. With changelogs and the gfid-access translator, gfids are first-class citizens along with paths, and there are scenarios (rename races) where the path need not be present but the gfid handle might be. To solve this, mknod will fail with EEXIST even if only the gfid handle is present (that is, without the path being present).
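
The check itself is small. Below is a minimal in-memory sketch of it, assuming a fixed-size table in place of the real backend; sketch_store, sketch_mknod and sketch_drop_path are hypothetical names, and sketch_drop_path only exists to imitate the rename-race leftover where the path is gone but the gfid handle remains.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define SKETCH_MAX 16

/* One inode in the toy store. An empty gfid marks a free slot; an empty
 * path with a non-empty gfid models "gfid handle present, path absent". */
struct sketch_entry {
    char path[64];
    char gfid[40];
};

struct sketch_store {
    struct sketch_entry e[SKETCH_MAX];
};

static bool sketch_has_path(const struct sketch_store *s, const char *path)
{
    for (int i = 0; i < SKETCH_MAX; i++)
        if (s->e[i].path[0] != '\0' && strcmp(s->e[i].path, path) == 0)
            return true;
    return false;
}

static bool sketch_has_gfid(const struct sketch_store *s, const char *gfid)
{
    for (int i = 0; i < SKETCH_MAX; i++)
        if (s->e[i].gfid[0] != '\0' && strcmp(s->e[i].gfid, gfid) == 0)
            return true;
    return false;
}

/* mknod that fails with EEXIST if either the path or the gfid handle is
 * already present. */
int sketch_mknod(struct sketch_store *s, const char *path, const char *gfid)
{
    assert(s != NULL && path != NULL && gfid != NULL);
    if (strlen(path) >= sizeof(s->e[0].path) ||
        strlen(gfid) >= sizeof(s->e[0].gfid))
        return -ENAMETOOLONG;
    if (sketch_has_path(s, path) || sketch_has_gfid(s, gfid))
        return -EEXIST;
    for (int i = 0; i < SKETCH_MAX; i++) {
        if (s->e[i].gfid[0] == '\0') {
            snprintf(s->e[i].path, sizeof(s->e[i].path), "%s", path);
            snprintf(s->e[i].gfid, sizeof(s->e[i].gfid), "%s", gfid);
            return 0;
        }
    }
    return -ENOSPC;
}

/* Drop only the path and keep the gfid handle (rename-race leftover). */
void sketch_drop_path(struct sketch_store *s, const char *path)
{
    for (int i = 0; i < SKETCH_MAX; i++)
        if (strcmp(s->e[i].path, path) == 0 && path[0] != '\0')
            s->e[i].path[0] = '\0';
}
```

After sketch_drop_path removes a path, a mknod with the same gfid still fails with EEXIST because the handle is still there, while a mknod on that path with a fresh gfid succeeds.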

From what I can see, new compound ops will _evolve_ in future based on requirements unseen as of now.

regards,
Raghavendra.