[Gluster-devel] Erasure code - A few questions
Xavier Hernandez
xhernandez at datalab.es
Tue Jun 10 08:50:00 UTC 2014
Hi Krish,
I've written this quite quickly and I've probably missed some important
details, but I hope it contains the essential points needed to understand how
it works.
I'll try to write a high level overview of the execution flow of all fops and
then I'll detail what happens in each step for readv/writev fops.
Each fop entry point is ec_gf_<fop>(), which immediately calls ec_<fop>().
There, an ec_fop_data_t structure is created. This structure controls the life
cycle of the fop execution. Just after creating this structure, ec_manager()
is called, and the state machine begins execution at state EC_STATE_INIT.
Each fop specifies a function, called ec_manager_<fop>(), that manages what to
do in each state. However, since many actions are the same for all fops, this
function only has to process the special cases for the given fop. Common state
transitions and other shared processing are done in ec_default_manager().
All state transitions are controlled by the completion of suboperations
initiated in a given state. This means that if a fop calls another fop or
STACK_WIND() during the processing of one state (for example to lock an inode
or send the request to the next xlator), the next state won't be processed
until this other operation has finished. This is controlled by ec_wait(), which
checks whether there are pending operations. If there are, it returns -1 and
the state machine "sleeps" until they finish. Otherwise it returns an error
code indicating the result of the operations.
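
As a rough illustration of this "sleep until suboperations finish" behaviour,
here is a minimal Python model. It is only a sketch: the class, the "jobs"
counter and the state names are hypothetical and are not the xlator's real
structures. The only things taken from the description above are that the
next state is not processed while something is pending and that -1 signals
pending work.

```python
PENDING = -1  # stand-in for the "-1 if there are pending operations" result


class Fop:
    """Toy model of a fop whose states only advance when suboperations end."""

    def __init__(self, handlers):
        self.handlers = handlers  # state name -> function(fop) -> next state
        self.state = "INIT"
        self.jobs = 0             # suboperations (plus the manager) in flight
        self.error = 0
        self.trace = []           # states processed so far, for inspection

    def start_subop(self):
        self.jobs += 1

    def subop_done(self, error=0):
        # Completion callback of a suboperation; the last one resumes the fop.
        if error:
            self.error = error
        self.jobs -= 1
        if self.jobs == 0:
            self.run()

    def wait(self):
        # Drop the manager's own reference, then report pending work or result.
        self.jobs -= 1
        return PENDING if self.jobs > 0 else self.error

    def run(self):
        while self.state != "END":
            self.trace.append(self.state)
            self.jobs += 1  # the manager holds a reference while in a state
            self.state = self.handlers[self.state](self)
            if self.wait() == PENDING:
                return      # "sleep"; subop_done() will call run() again


def lock(fop):
    fop.start_subop()       # e.g. a lock request that completes later
    return "DISPATCH"


fop = Fop({"INIT": lambda f: "LOCK", "LOCK": lock, "DISPATCH": lambda f: "END"})
fop.run()
sleeping = (fop.state, list(fop.trace))  # DISPATCH is queued but not processed
fop.subop_done()                         # the lock answer arrives
```

After run() returns the first time, the fop is parked with DISPATCH as its next
state. DISPATCH is only processed once subop_done() is called.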
Following is the meaning of each state and possible transitions between them.
In this description it is assumed that nothing fails. If something fails in
any state, the next state will have the sign changed. For example, if there is
an error during operations done in EC_STATE_PREOP, once they finish, the next
state will be -EC_STATE_DISPATCH instead of EC_STATE_DISPATCH.
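
The sign convention can be sketched in a couple of lines. The numeric value of
the state and the helper names are invented for the example. Only the idea of
negating the next state on failure comes from the description above.

```python
EC_STATE_DISPATCH = 4  # numeric value invented for this example


def resolve_next_state(next_state, error):
    """Return the state to process next, negated if the suboperations failed."""
    return -next_state if error != 0 else next_state


def is_failure_state(state):
    """A negative state means the previous step reported an error."""
    return state < 0


ok = resolve_next_state(EC_STATE_DISPATCH, 0)      # stays EC_STATE_DISPATCH
failed = resolve_next_state(EC_STATE_DISPATCH, 5)  # becomes -EC_STATE_DISPATCH
```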
EC_STATE_INIT:
The fop has a chance to modify its input arguments or prepare anything needed
to process it.
The next state depends on what flags the fop has been given. If EC_FLAG_LOCK
is present, the next state will be EC_STATE_LOCK, otherwise it will be
EC_STATE_DISPATCH.
EC_STATE_LOCK:
In this state the fop locks the entry or inode using entrylk/inodelk fops
(functions ec_entrylk() and ec_inodelk()) before processing the fop. This is
done to synchronize execution between bricks.
Which entries or inodes are locked depends on the flags given to the fop.
If EC_FLAG_PREOP is present, the next state will be EC_STATE_PREOP, otherwise
it will be EC_STATE_DISPATCH.
EC_STATE_PREOP:
In this state, a lookup on the inode is made to get the version and real size
information. This data is needed to identify corrupted bricks and determine
the real size of regular files before doing modifications on them.
The next state is always EC_STATE_DISPATCH.
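
The transitions from EC_STATE_INIT up to EC_STATE_DISPATCH can be summarised
with a small sketch. The flag bit values and the function names are
hypothetical. The transitions themselves are the ones listed above, assuming
nothing fails.

```python
from enum import IntFlag


class Flag(IntFlag):  # bit values are invented for this example
    LOCK = 1
    PREOP = 2


def next_state(state, flags):
    """Success-path transition for the states that precede the dispatch."""
    if state == "EC_STATE_INIT":
        return "EC_STATE_LOCK" if flags & Flag.LOCK else "EC_STATE_DISPATCH"
    if state == "EC_STATE_LOCK":
        return "EC_STATE_PREOP" if flags & Flag.PREOP else "EC_STATE_DISPATCH"
    if state == "EC_STATE_PREOP":
        return "EC_STATE_DISPATCH"
    raise ValueError(state)


def path_to_dispatch(flags):
    """List every state visited from INIT until DISPATCH is reached."""
    state = "EC_STATE_INIT"
    path = [state]
    while state != "EC_STATE_DISPATCH":
        state = next_state(state, flags)
        path.append(state)
    return path


full = path_to_dispatch(Flag.LOCK | Flag.PREOP)
bare = path_to_dispatch(Flag(0))
```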
EC_STATE_DISPATCH:
In this state, the fop determines how many and which bricks will be involved
in the operation and initiates the execution through
ec_dispatch_(all|min|one)(). These functions basically select a subset of
alive bricks without detected errors. Once selected, ec_wind_<fop>() is called
to send the request to the underlying xlators using STACK_WIND().
When a brick answers with STACK_UNWIND through ec_<fop>_cbk(), this answer is
stored in an ec_cbk_data_t structure and ec_combine() is called to try to find
other answers that are compatible with it (this basically means that their
return codes are equal and their basic metadata is identical, like xdata,
iatt, ...). When two answers can be combined, they form a group of answers.
All answers are stored until the fop finishes executing, even if they do not
match any other answer.
Every time an answer is processed, ec_complete() is called to decrement the
number of pending winds. When this counter reaches 0, it means that all
bricks have answered. At this point, if no group with enough answers (the
expected number of answers) has been found, the code checks whether there is
any other group with at least a minimum number of answers, and if that is the
case, that group is taken as the good answer. If no group satisfies the
condition, an EIO error is reported. ec_report() is then called.
When ec_combine() determines that a group has enough answers, it calls
ec_report() to tell the fop's state machine that the processing of the fop can
continue. At this point, the next state is set to EC_STATE_REBUILD.
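
The grouping of answers can be sketched as follows. This is only a model: the
function name, the representation of an answer as a (brick, key) pair and the
choice of the largest group as the fallback are assumptions made for the
example. The message only says that a group with at least the minimum number
of answers is taken, and that EIO is reported otherwise.

```python
import errno
from collections import defaultdict


def pick_answer(answers, expected, minimum):
    """Choose the group of compatible answers to report.

    answers: iterable of (brick, key), where key summarises everything that
    must match for two answers to be combined (return code, metadata, ...).
    """
    groups = defaultdict(list)
    for brick, key in answers:
        groups[key].append(brick)
        if len(groups[key]) >= expected:
            return key, groups[key]  # enough answers: report right away
    # Every brick has answered and no group reached the expected size.
    best = max(groups, key=lambda k: len(groups[k]), default=None)
    if best is not None and len(groups[best]) >= minimum:
        return best, groups[best]
    raise OSError(errno.EIO, "no group of compatible answers is large enough")


# Brick 2 fails with an error; the other three agree on (ret=0, errno=0).
answers = [(0, (0, 0)), (1, (0, 0)), (2, (-1, 5)), (3, (0, 0))]
early = pick_answer(answers, expected=3, minimum=2)
fallback = pick_answer(answers, expected=4, minimum=3)
```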
EC_STATE_REBUILD:
In this state fops can do any additional processing of the received answers or
initiate other tasks needed to complete the answer. Once the data is ready to
be propagated to the calling xlator, the next state is set to EC_STATE_REPORT.
EC_STATE_REPORT:
In this state the callback of the fop is called, passing to it the arguments
corresponding to the best answer received from all bricks. For normal fops
coming from ec_gf_<fop>() this only means a call to STACK_UNWIND(). In other
cases, like subfops initiated by other ec fops or self-heal operations, this
callback can be any other function that will continue the processing of the
parent operation.
Once the fop has been reported to the caller, the state machine waits until
all remaining winds have finished (this can happen in some circumstances while
processing locking fops). When all pending winds have finished, the next state
is EC_STATE_COMPLETED.
EC_STATE_COMPLETED:
In this state all processing for the fop has finished. The only thing left to
decide is whether a postop operation must be executed, which is determined by
the EC_FLAG_POSTOP flag. If this flag is set, the next state is
EC_STATE_POSTOP, otherwise it jumps to state EC_STATE_UNLOCK.
EC_STATE_POSTOP:
In this state a call to ec_update_version() is made to increment the version
number of the file and update the real size if needed. This is done using the
ec_(f)xattrop() fops. Once completed, the next state is EC_STATE_UNLOCK.
EC_STATE_UNLOCK:
In this state, any lock acquired during EC_STATE_LOCK state is released.
This finishes the state machine of the fop and releases all allocated
resources.
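
To show why the version read in EC_STATE_PREOP and incremented in
EC_STATE_POSTOP is useful, here is a small sketch. It assumes that a brick
which missed an update ends up with a lower version than the others. The exact
comparison done by the translator is not described in this message, and the
function names and the dictionary of versions are hypothetical.

```python
def postop(versions, succeeded):
    """Increment the version on every brick where the modification succeeded."""
    for brick in succeeded:
        versions[brick] += 1


def stale_bricks(versions):
    """Bricks whose version lags behind the highest one (assumed criterion)."""
    newest = max(versions.values())
    return sorted(b for b, v in versions.items() if v < newest)


versions = {"brick0": 7, "brick1": 7, "brick2": 7}
before = stale_bricks(versions)         # all bricks agree
postop(versions, ["brick0", "brick1"])  # brick2 missed the write
after = stale_bricks(versions)
```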
The sequence of calls for a readv operation is the following:
ec_gf_readv()
  ec_readv()
    ec_fop_data_t(): create ec_fop_data_t structure
    ec_manager(): initiate state machine
      ec_manager_readv() and ec_default_manager() are called for each state.
In each state, readv does the following (states not specified mean default
processing as described earlier):
EC_STATE_INIT:
Offset and size of the read are aligned to the block size and transformed to
valid values for bricks.
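
The alignment arithmetic could look like the sketch below. It assumes that a
block ("stripe") of user data is the number of data fragments multiplied by
the size of the piece stored on each brick, so that brick offsets and sizes
are the aligned user values divided by the number of data fragments. These
names and this layout are assumptions for the example, not taken from the
code.

```python
def align_read(offset, size, fragments, fragment_size):
    """Translate a user read into an aligned per-brick read.

    Returns the offset and size to request from each brick and the number of
    leading bytes to discard once the stripes have been decoded.
    """
    stripe = fragments * fragment_size            # user bytes encoded together
    head = offset % stripe                        # extra bytes before offset
    start = offset - head                         # round down to a stripe
    end = -(-(offset + size) // stripe) * stripe  # round up to a stripe
    return {
        "brick_offset": start // fragments,
        "brick_size": (end - start) // fragments,
        "head": head,
    }


# 4 data fragments of 128 bytes each give 512-byte stripes.
unaligned = align_read(offset=700, size=1000, fragments=4, fragment_size=128)
aligned = align_read(offset=512, size=512, fragments=4, fragment_size=128)
```

In the first call the user range 700..1700 grows to 512..2048, so each brick is
asked for 384 bytes at offset 128 and the first 188 decoded bytes are dropped.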
EC_STATE_REBUILD:
ec_readv_rebuild() is called to combine the fragments read from all bricks
into a single data block using the erasure code decoding function
ec_method_merge().
The writev operation is a bit more complex and needs to create an additional
state:
ec_gf_writev()
  ec_writev()
    ec_fop_data_t(): create ec_fop_data_t structure
    ec_manager(): initiate state machine
      ec_manager_writev() and ec_default_manager() are called for each state.
In each state, writev does the following:
EC_STATE_INIT:
Call to ec_writev_init() to align offset and size to the block size. It also
creates an aligned contiguous buffer with the contents of the data to write
that will be needed to encode it.
EC_STATE_DISPATCH:
Before starting the dispatch of the write fop to the underlying xlators using
STACK_WIND(), some write operations need to do a read of some fragments of
data before and/or after the specified offset and size of the write operation.
This is needed when a write is not aligned to the block size.
This is done through ec_writev_start(). This function initiates a readv subfop
for the data just before the offset if the offset is not aligned, and another
readv for the data just after offset+size if that end is not aligned.
Once the readv subfops have finished, the state machine moves to
EC_STATE_WRITE_START, which simply does the normal processing of
EC_STATE_DISPATCH.
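
A sketch of how the extra reads could be determined follows. It assumes that
whole blocks are read and that a single read is enough when the start and the
end of the write fall into the same block. Both are assumptions of this
example, as is the function name.

```python
def reads_needed(offset, size, stripe):
    """Return the (offset, length) ranges to read before encoding a write."""
    reads = []
    head = offset % stripe
    if head:
        reads.append((offset - head, stripe))   # block containing the start
    end = offset + size
    tail = end % stripe
    if tail:
        tail_start = end - tail                 # block containing the end
        if not reads or reads[0][0] != tail_start:
            reads.append((tail_start, stripe))
    return reads


both_ends = reads_needed(offset=100, size=1000, stripe=512)
same_block = reads_needed(offset=100, size=100, stripe=512)
aligned = reads_needed(offset=512, size=1024, stripe=512)
```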
EC_STATE_WRITE_START:
In this state, ec_wind_writev() is called for each subvolume. This function
computes the erasure code encoded data that will be sent to the brick using
the ec_method_split() function.
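
To give an idea of what "split" and "merge" mean, here is a deliberately
simplified sketch that uses a single XOR parity fragment. This is a stand-in
technique chosen for brevity. It is not the encoding implemented by
ec_method_split() and ec_method_merge(), whose math is not described in this
message, and the Python function names are hypothetical.

```python
from functools import reduce


def xor_bytes(a, b):
    """Byte-wise XOR of two equally sized byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def split(data, k):
    """Encode data (length divisible by k) as k data fragments plus a parity."""
    n = len(data) // k
    frags = [data[i * n:(i + 1) * n] for i in range(k)]
    return frags + [reduce(xor_bytes, frags)]


def merge(frags, k):
    """Rebuild the data from k+1 fragments where at most one entry is None."""
    frags = list(frags)
    if None in frags[:k]:
        missing = frags.index(None)
        present = [f for f in frags if f is not None]
        frags[missing] = reduce(xor_bytes, present)  # XOR of the rest
    return b"".join(frags[:k])


data = b"abcdefgh"
frags = split(data, k=4)         # one fragment per brick, 5 bricks in total
lost_data = frags[:2] + [None] + frags[3:]
lost_parity = frags[:4] + [None]
```

With this toy scheme any single missing brick can be tolerated: merge()
rebuilds the original data from the four fragments that remain.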
EC_STATE_REBUILD:
In this state, the fop calculates the correct return code from the write
operation and computes the resulting size of the file.
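
One possible way to derive these two values is sketched below. It assumes that
each brick reports the number of bytes written to its fragment, that the
aligned write started "head" bytes before the user offset, and that the file
only grows if the write ends past the previous size. All of this, including
the function and parameter names, is an assumption made for illustration.

```python
def write_result(brick_ret, fragments, head, user_offset, user_size, old_size):
    """Return (bytes reported to the caller, resulting file size)."""
    if brick_ret < 0:
        return brick_ret, old_size            # failure: size is unchanged
    # Bytes covered by the bricks, minus the padding before the user offset,
    # and never more than what the caller asked to write.
    written = min(max(brick_ret * fragments - head, 0), user_size)
    return written, max(old_size, user_offset + written)


# A 1000-byte write at offset 700 with 512-byte blocks over 4 data fragments
# becomes an aligned write of 384 bytes per brick starting 188 bytes earlier.
grows = write_result(384, 4, 188, 700, 1000, old_size=1200)
inside = write_result(384, 4, 188, 700, 1000, old_size=4096)
failed = write_result(-1, 4, 188, 700, 1000, old_size=1200)
```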
Hope this helps...
Xavi
On Monday 09 June 2014 08:15:34 Krishnan Parthasarathi wrote:
> Hi Xavi,
>
> Following the code walk through and discussion surrounding
> erasure coding translator's implementation on #gluster-meeting,
> I wanted to ask a few questions that would make things clearer
> and help speed up the review. I am CC'ing gluster-devel in a hope
> that some of these questions might have popped in others' head
> as well.
>
> While learning a translator I try to identify the different internal stages
> that a FOP goes through while 'inside' a xlator (ie, before a STACK_WIND or
> STACK_UNWIND transfer the control to the child/parent xlator).
>
> Additionally, it helps to understand the points in processing of a FOP,
> the sequence of functions lead it to flow to the child(ren) xlators
> and the sequence of functions that lead it into the xlator (via callbacks).
>
> With that context, it would help if you listed the sequence of functions,
> including the state machine functions which 'guide' the FOP through various
> sub-operations, in the following cases.
>
> - When a inode modification call (say writev) enters cluster/ec.
> - When a readv call enters cluster/ec
>
> This could be done by attaching gdb to the mount process, but what I am
> looking for is your notes/insights that would help us appreciate
> the design/intent better. It would also help us to notice this pattern
> in other FOPs implemented in cluster/ec.
>
> cheers,
> Krish