[Gluster-devel] Erasure code - A few questions

Tue Jun 10 08:50:00 UTC 2014

Hi Krish,

I've written this quite quickly and have probably missed some important
details, but I hope it covers the essential points needed to understand how
it works.

I'll try to give a high-level overview of the execution flow common to all
fops, and then detail what happens in each step for the readv and writev fops.

The entry point of each fop is ec_gf_<fop>(), which immediately calls
ec_<fop>(). There, an ec_fop_data_t structure is created. This structure
controls the life cycle of the fop execution. Right after creating it,
ec_manager() is called, and the state machine begins execution at state
EC_STATE_INIT.

Each fop specifies a function, called ec_manager_<fop>(), that manages what to
do in each state. However, since many actions are the same for all fops, this
function only has to handle the special cases of the given fop. Common state
transitions and other shared processing are done in ec_default_manager().

All state transitions are controlled by the completion of the suboperations
initiated in a given state. This means that if a fop calls another fop or
STACK_WIND() while processing one state (for example, to lock an inode or to
send the request to the next xlator), the next state won't be processed until
that other operation has finished. This is controlled by ec_wait(), which
checks whether there are pending operations. If there are, it returns -1 and
the state machine "sleeps" until they finish; otherwise it returns an error
code indicating the result of the finished operations.
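
As a rough illustration of this "sleep until suboperations finish" idea, here
is a minimal sketch. All names in it (sketch_fop_t, sketch_wait(), and so on)
are hypothetical and the structure is heavily simplified; it is not the real
ec_fop_data_t or ec_wait() code, only the counting scheme described above.

```c
#include <assert.h>

/* Hypothetical, simplified stand-ins; not the real cluster/ec definitions. */
enum { SKETCH_WAIT = -1 };   /* "there are still pending suboperations" */

typedef struct {
    int state;     /* current state of the fop */
    int pending;   /* suboperations started in this state, not yet finished */
    int error;     /* first error reported by a suboperation, 0 if none */
} sketch_fop_t;

/* Called whenever the fop starts a suboperation (lock, wind, subfop, ...). */
static void sketch_start_subop(sketch_fop_t *fop)
{
    fop->pending++;
}

/* Plays the role described for ec_wait(): -1 while work is pending,
 * otherwise the result of the finished suboperations (0 means success). */
static int sketch_wait(const sketch_fop_t *fop)
{
    return (fop->pending > 0) ? SKETCH_WAIT : fop->error;
}

/* Called from the callback of a suboperation. Returns 1 when it was the
 * last pending one, meaning the state machine may resume. */
static int sketch_finish_subop(sketch_fop_t *fop, int op_errno)
{
    assert(fop->pending > 0);
    if (op_errno != 0 && fop->error == 0)
        fop->error = op_errno;   /* remember only the first failure */
    fop->pending--;
    return fop->pending == 0;
}
```

In this sketch the manager would call sketch_wait() after processing a state
and simply return if it gets -1; the callback that makes
sketch_finish_subop() return 1 is the one that resumes the state machine.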

The meaning of each state and the possible transitions between them follow.
This description assumes that nothing fails. If something fails in any state,
the next state will have its sign changed. For example, if there is an error
during the operations done in EC_STATE_PREOP, once they finish, the next
state will be -EC_STATE_DISPATCH instead of EC_STATE_DISPATCH.
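
The sign convention can be sketched as follows. The numeric values of the
states are an assumption made for the example (the only real requirement is
that no state is 0, so that negation is meaningful), and the helper names are
hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical numbering: the actual values are not given here. States
 * start at 1 so that every state has a distinct negative counterpart. */
enum sketch_state {
    SKETCH_STATE_INIT = 1,
    SKETCH_STATE_LOCK,
    SKETCH_STATE_PREOP,
    SKETCH_STATE_DISPATCH
};

/* Pick the state to run next: the normal one, or its negated form if the
 * operations of the current state reported an error. */
static int sketch_next_state(int next, int error)
{
    return (error != 0) ? -next : next;
}

/* A negative state means "this state, reached through the error path". */
static int sketch_state_failed(int state)
{
    return state < 0;
}

/* Recover which state it is, regardless of the path taken. */
static int sketch_state_id(int state)
{
    return abs(state);
}
```

A manager function written this way can handle both EC_STATE_DISPATCH and
-EC_STATE_DISPATCH in the same switch, treating the negative case as the
error path of that state.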

EC_STATE_INIT:

The fop has a chance to modify its input arguments or prepare anything needed 
to process it.

The next state depends on what flags the fop has been given. If EC_FLAG_LOCK 
is present, the next state will be EC_STATE_LOCK, otherwise it will be 
EC_STATE_DISPATCH.

EC_STATE_LOCK:

In this state the fop locks the entry or inode using entrylk/inodelk fops 
(functions ec_entrylk() and ec_inodelk()) before processing the fop. This is 
done to synchronize execution between bricks.

Which entries or inodes are locked depends on the flags given to the fop.

If EC_FLAG_PREOP is present, the next state will be EC_STATE_PREOP, otherwise 
it will be EC_STATE_DISPATCH.

EC_STATE_PREOP:

In this state, a lookup on the inode is made to get its version and real
size. This data is needed to identify corrupted bricks and to determine the
real size of regular files before modifying them.

The next state is always EC_STATE_DISPATCH.
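
The transitions described so far for EC_STATE_INIT, EC_STATE_LOCK and
EC_STATE_PREOP can be summarized in a small function. This is only a sketch
of the no-error path: the SK_* names and their values are hypothetical
stand-ins for the real EC_STATE_* and EC_FLAG_* definitions.

```c
#include <assert.h>

/* Hypothetical flag bits and state numbers, chosen for the example. */
enum { SK_FLAG_LOCK = 1 << 0, SK_FLAG_PREOP = 1 << 1 };
enum { SK_INIT = 1, SK_LOCK, SK_PREOP, SK_DISPATCH };

/* Next state for the early part of the state machine, assuming that the
 * operations of the current state finished without errors. */
static int sk_early_transition(int state, unsigned flags)
{
    switch (state) {
    case SK_INIT:
        /* Lock only if the fop asked for it. */
        return (flags & SK_FLAG_LOCK) ? SK_LOCK : SK_DISPATCH;
    case SK_LOCK:
        /* Version/size lookup only if the fop asked for it. */
        return (flags & SK_FLAG_PREOP) ? SK_PREOP : SK_DISPATCH;
    case SK_PREOP:
        return SK_DISPATCH;
    default:
        return state;   /* later states are outside this sketch */
    }
}
```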

EC_STATE_DISPATCH:

In this state, the fop determines how many and which bricks will be involved
in the operation and initiates the execution through
ec_dispatch_(all|min|one)(). These functions basically select a subset of the
alive bricks that have no detected errors. Once they are selected,
ec_wind_<fop>() is called to send the request to the underlying xlators using
STACK_WIND().

When a brick answers with STACK_UNWIND() through ec_<fop>_cbk(), the answer is
stored in an ec_cbk_data_t structure and ec_combine() is called to try to find
other answers that are compatible with it (this basically means that their
return codes are equal and their basic metadata, like xdata, iatt, ..., is
identical). When two answers can be combined, they form a group of answers.
All answers are stored until the fop completes its execution, even if they do
not match any other answer.
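
To make the grouping idea concrete, here is a minimal sketch. It is not the
real ec_combine(): the structures are hypothetical, and a single "size" field
stands in for all the metadata (iatt, xdata, ...) that would be compared.

```c
#include <assert.h>
#include <stddef.h>

#define SK_MAX_ANSWERS 16

/* Hypothetical, reduced version of what an answer could carry. */
typedef struct {
    int op_ret;
    int op_errno;
    unsigned long long size;  /* stand-in for the compared metadata */
    int group;                /* index of the first answer of its group */
} sk_answer_t;

typedef struct {
    sk_answer_t answers[SK_MAX_ANSWERS];  /* every answer is kept */
    size_t count;
} sk_answers_t;

/* Two answers are compatible if return codes and metadata match. */
static int sk_compatible(const sk_answer_t *a, const sk_answer_t *b)
{
    return a->op_ret == b->op_ret && a->op_errno == b->op_errno &&
           a->size == b->size;
}

/* Store a new answer and return the number of answers in the group it
 * belongs to (1 if it matches nothing received so far). */
static int sk_combine(sk_answers_t *set, int op_ret, int op_errno,
                      unsigned long long size)
{
    assert(set->count < SK_MAX_ANSWERS);

    sk_answer_t *ans = &set->answers[set->count];
    ans->op_ret = op_ret;
    ans->op_errno = op_errno;
    ans->size = size;
    ans->group = (int)set->count;   /* alone in its own group for now */

    int members = 1;
    for (size_t i = 0; i < set->count; i++) {
        if (sk_compatible(&set->answers[i], ans)) {
            if (members == 1)
                ans->group = set->answers[i].group;  /* join that group */
            members++;
        }
    }
    set->count++;
    return members;
}
```

The returned group size is what would be compared against the expected
number of answers to decide whether the fop can already continue.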

Every time an answer is processed, ec_complete() is called to decrement the
number of pending winds. When this counter reaches 0, all bricks have
answered. At this point, if no group with enough answers (the expected number
of answers) has been found, the code checks whether any group has at least a
minimum number of answers, and if so, that group is taken as the good answer.
If no group satisfies the condition, an EIO error is reported. ec_report() is
then called.

When ec_combine() determines that a group has enough answers, it calls
ec_report() to tell the fop's state machine that the processing of the fop can
continue. At this point, the next state is set to EC_STATE_REBUILD.
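
The two ways of reaching the report (early, when a group gets the expected
number of answers, or late, when the last brick answers) can be sketched as
below. The sk_tally_t structure and both functions are hypothetical
simplifications, not the real ec_combine()/ec_complete() code.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical bookkeeping for one dispatched fop. */
typedef struct {
    int pending;     /* winds that have not been answered yet */
    int expected;    /* answers wanted in a group to report immediately */
    int minimum;     /* fewest answers that still make a group acceptable */
    int best_count;  /* size of the largest group seen so far */
    int reported;    /* non-zero once the result has been reported */
    int op_errno;    /* 0 if a good group was found, EIO otherwise */
} sk_tally_t;

/* An answer has just joined a group with `members` answers. Returns 1 if
 * this triggers the report (a group reached the expected size). */
static int sk_answer_grouped(sk_tally_t *t, int members)
{
    if (members > t->best_count)
        t->best_count = members;
    if (!t->reported && t->best_count >= t->expected) {
        t->op_errno = 0;
        t->reported = 1;
        return 1;
    }
    return 0;
}

/* One wind has been fully processed. Returns 1 if this triggers the report
 * (all bricks answered and nothing had been reported before). */
static int sk_complete(sk_tally_t *t)
{
    assert(t->pending > 0);
    if (--t->pending > 0)
        return 0;            /* still waiting for other bricks */
    if (t->reported)
        return 0;            /* already reported by sk_answer_grouped() */
    /* Fall back to the minimum threshold, or fail with EIO. */
    t->op_errno = (t->best_count >= t->minimum) ? 0 : EIO;
    t->reported = 1;
    return 1;
}
```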

EC_STATE_REBUILD:

In this state fops can do any additional processing of the received answers or 
initiate other tasks needed to complete the answer. Once the data is ready to 
be propagated to the calling xlator, the next state is set to EC_STATE_REPORT.

EC_STATE_REPORT:

In this state the callback of the fop is called, passing to it the arguments 
corresponding to the best answer received from all bricks. For normal fops 
coming from ec_gf_<fop>() this only means a call to STACK_UNWIND(). In other 
cases, like subfops initiated by other ec fops or self-heal operations, this 
callback can be any other function that will continue the processing of the 
parent operation.

Once the fop has been reported to the caller, the state machine waits until
all remaining winds have finished (this can happen in some circumstances while
processing locking fops). When all pending winds have finished, the next state
is EC_STATE_COMPLETED.

EC_STATE_COMPLETED:

In this state all processing for the fop has finished. The only remaining
decision is whether a postop operation must be executed, which depends on the
EC_FLAG_POSTOP flag. If this flag is set, the next state is EC_STATE_POSTOP,
otherwise it jumps to state EC_STATE_UNLOCK.

EC_STATE_POSTOP:

In this state a call to ec_update_version() is made to increment the version 
number of the file and update the real size if needed. This is done using the 
ec_(f)xattrop() fops. Once completed, the next state is EC_STATE_UNLOCK.

EC_STATE_UNLOCK:

In this state, any lock acquired during the EC_STATE_LOCK state is released.

This finishes the state machine of the fop and releases all allocated
resources.

The sequence of calls for a readv operation is the following:

ec_gf_readv()
ec_readv()
ec_fop_data_t(): create ec_fop_data_t structure
ec_manager(): initiate state machine

ec_manager_readv() and ec_default_manager() are called for each state.

In each state, readv does the following (states not listed here use the
default processing described earlier):

EC_STATE_INIT:

Offset and size of the read are aligned to the block size and transformed to 
valid values for bricks.
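
The arithmetic behind this alignment can be sketched as follows. The sketch
assumes that the block size is a stripe of fragment_size * data_bricks bytes
and that each brick stores 1/data_bricks of every stripe; these assumptions,
and all the names used, are mine for the example rather than details taken
from the code.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical result of aligning a user read request. */
typedef struct {
    uint64_t offset;        /* aligned offset in the file */
    uint64_t size;          /* aligned size in the file */
    uint64_t head;          /* extra bytes read before the user's offset */
    uint64_t brick_offset;  /* offset to request from each brick */
    uint64_t brick_size;    /* size to request from each brick */
} sk_range_t;

static sk_range_t sk_align_read(uint64_t offset, uint64_t size,
                                uint64_t fragment_size, uint64_t data_bricks)
{
    uint64_t stripe = fragment_size * data_bricks;  /* assumed block size */
    sk_range_t r;

    assert(stripe > 0);

    /* Round the start down to a stripe boundary. */
    r.head = offset % stripe;
    r.offset = offset - r.head;

    /* Round the end up to a stripe boundary. */
    uint64_t end = offset + size;
    uint64_t rem = end % stripe;
    if (rem != 0)
        end += stripe - rem;
    r.size = end - r.offset;

    /* Each brick holds one fragment of every stripe. */
    r.brick_offset = r.offset / data_bricks;
    r.brick_size = r.size / data_bricks;
    return r;
}
```

For example, with 512-byte fragments and 4 data bricks (a 2048-byte stripe),
a read of 3000 bytes at offset 1000 becomes a read of 4096 bytes at offset 0,
that is, 1024 bytes at offset 0 on each brick. The first 1000 bytes of the
decoded data are then discarded before answering.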

EC_STATE_REBUILD:

ec_readv_rebuild() is called to combine the fragments read from all bricks 
into a single data block using the erasure code decoding function 
ec_method_merge().

The writev operation is a bit more complex and needs to create an additional 
state:

ec_gf_writev()
ec_writev()
ec_fop_data_t(): create ec_fop_data_t structure
ec_manager(): initiate state machine

ec_manager_writev() and ec_default_manager() are called for each state.

In each state, writev does the following:

EC_STATE_INIT:

ec_writev_init() is called to align the offset and size to the block size. It
also creates an aligned, contiguous buffer holding the data to write, which is
needed to encode it.

EC_STATE_DISPATCH:

Before starting the dispatch of the write fop to the underlying xlators using 
STACK_WIND(), some write operations need to do a read of some fragments of 
data before and/or after the specified offset and size of the write operation. 
This is needed when a write is not aligned to the block size.

This is done through ec_writev_start(). This function initiates a readv subfop 
just before the offset if it's not aligned, and another readv after 
offset+size if it's not aligned.
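
A minimal sketch of how the ranges of those two reads could be computed is
shown below. It assumes that each read fetches the whole block (stripe)
containing the unaligned boundary; the names are hypothetical and this is not
the actual ec_writev_start() code.

```c
#include <assert.h>
#include <stdint.h>

/* A range to read; size == 0 means that no read is needed. */
typedef struct {
    uint64_t offset;
    uint64_t size;
} sk_span_t;

typedef struct {
    sk_span_t head;  /* block containing the start of the write */
    sk_span_t tail;  /* block containing the end of the write */
} sk_rmw_t;

static sk_rmw_t sk_write_reads(uint64_t offset, uint64_t size,
                               uint64_t stripe)
{
    sk_rmw_t r = { {0, 0}, {0, 0} };
    uint64_t end = offset + size;

    assert(stripe > 0);

    /* Unaligned start: read the block the write begins in. */
    if (offset % stripe != 0) {
        r.head.offset = offset - (offset % stripe);
        r.head.size = stripe;
    }
    /* Unaligned end: read the block the write ends in. For a small write
     * inside a single block this is the same block as the head read. */
    if (end % stripe != 0) {
        r.tail.offset = end - (end % stripe);
        r.tail.size = stripe;
    }
    return r;
}
```

With a 2048-byte stripe, a write of 3000 bytes at offset 1000 needs the block
at offset 0 (for the first 1000 bytes) and the block at offset 2048 (for the
bytes after offset 4000), while a write of 4096 bytes at offset 2048 needs no
reads at all.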

Once the readv subfops have finished, the state machine moves to
EC_STATE_WRITE_START, which simply does the normal processing of
EC_STATE_DISPATCH.

EC_STATE_WRITE_START:

In this state, ec_wind_writev() is called for each subvolume. This function 
computes the erasure code encoded data that will be sent to the brick using 
the ec_method_split() function.
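
To give an intuition of what "split" on the write path and "merge" on the
read path mean, here is a deliberately simplified sketch. It does NOT use the
encoding implemented by ec_method_split() and ec_method_merge(): it swaps in
a plain XOR parity with 2 data fragments and 1 redundancy fragment, which is
the simplest scheme where any 2 of the 3 fragments rebuild the data. All
names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum { SK_BRICKS = 3 };   /* 2 data fragments + 1 XOR parity fragment */

/* Encode `len` bytes (len must be even) into 3 fragments of len/2 bytes,
 * one per brick. frag[2] is the parity of the other two. */
static void sk_split(const unsigned char *in, size_t len,
                     unsigned char *frag[SK_BRICKS])
{
    size_t half = len / 2;

    assert(len % 2 == 0);
    memcpy(frag[0], in, half);
    memcpy(frag[1], in + half, half);
    for (size_t i = 0; i < half; i++)
        frag[2][i] = frag[0][i] ^ frag[1][i];
}

/* Rebuild the original block (2 * half bytes) when the fragment with index
 * `missing` is unavailable; its buffer is never read. */
static void sk_merge(unsigned char *frag[SK_BRICKS], int missing,
                     size_t half, unsigned char *out)
{
    assert(missing >= 0 && missing < SK_BRICKS);
    for (size_t i = 0; i < half; i++) {
        /* A lost data fragment is the XOR of the two surviving ones. */
        unsigned char a = (missing == 0) ? (unsigned char)(frag[1][i] ^ frag[2][i])
                                         : frag[0][i];
        unsigned char b = (missing == 1) ? (unsigned char)(frag[0][i] ^ frag[2][i])
                                         : frag[1][i];
        out[i] = a;
        out[half + i] = b;
    }
}
```

In the real translator the number of data and redundancy fragments is
configurable and the encoding is more elaborate, but the flow is the same:
each brick receives only its own fragment, and a read needs answers from
enough bricks to rebuild the block.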

EC_STATE_REBUILD:

In this state, the fop calculates the correct return code of the write
operation and computes the resulting size of the file.
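
A sketch of what this computation could look like follows. It assumes that
the bricks answer in terms of the aligned, encoded fragments, so the value
returned to the caller has to be expressed again in terms of the original
request, and that the file grows only if the write ends past the previous
size. These assumptions and the names are mine, not taken from the code.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical result handed back to the caller of writev. */
typedef struct {
    int64_t op_ret;      /* bytes written as seen by the caller, or < 0 */
    uint64_t file_size;  /* real size of the file after the write */
} sk_write_result_t;

static sk_write_result_t sk_writev_result(int64_t brick_ret,
                                          uint64_t user_offset,
                                          uint64_t user_size,
                                          uint64_t old_size)
{
    sk_write_result_t r;

    r.op_ret = brick_ret;     /* errors are propagated unchanged */
    r.file_size = old_size;

    if (brick_ret >= 0) {
        /* The caller asked for user_size bytes, whatever the size of the
         * aligned fragments that were sent to the bricks. */
        r.op_ret = (int64_t)user_size;

        /* The file only grows if the write ends beyond the old size. */
        uint64_t end = user_offset + user_size;
        if (end > old_size)
            r.file_size = end;
    }
    return r;
}
```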

Hope this helps...

Xavi

On Monday 09 June 2014 08:15:34 Krishnan Parthasarathi wrote:
> Hi Xavi,
> 
> Following the code walk through and discussion surrounding
> erasure coding translator's implementation on #gluster-meeting,
> I wanted to ask a few questions that would make things clearer
> and help speed up the review. I am CC'ing gluster-devel in the hope
> that some of these questions might have popped into others' heads
> as well.
> 
> While learning a translator I try to identify the different internal stages
> that a FOP goes through while 'inside' a xlator (ie, before a STACK_WIND or
> STACK_UNWIND transfer the control to the child/parent xlator).
> 
> Additionally, it helps to understand the points in the processing of a FOP:
> the sequence of functions that lead it to flow to the child(ren) xlators
> and the sequence of functions that lead it into the xlator (via callbacks).
> 
> With that context, it would help if you listed the sequence of functions,
> including the state machine functions which 'guide' the FOP through various
> sub-operations, in the following cases.
> 
> - When an inode modification call (say writev) enters cluster/ec.
> - When a readv call enters cluster/ec
> 
> This could be done by attaching gdb to the mount process, but what I am
> looking for is your notes/insights that would help us appreciate
> the design/intent better. It would also help us to notice this pattern
> in other FOPs implemented in cluster/ec.
> 
> cheers,
> Krish