[Gluster-devel] Dispersed volumes

Xavier Hernandez xhernandez at datalab.es
Thu Jan 30 12:41:07 UTC 2014


Hi,

I recently uploaded the new version of the disperse translator into 
gluster forge (https://forge.gluster.org/disperse).

It's still missing some important features, like self-heal, and its stability
is poor, but I'll improve it over the following days. We plan to have it
ready to be added to 3.6, at least as a beta feature.

I've written some documentation in asciidoc (attached to this email)
describing the components involved in the dispersed volume.

Any feedback will be appreciated.

Best regards,

Xavi
-------------- next part --------------
Disperse translator implementation
==================================
Xavier Hernandez <xhernandez at datalab.es>

The disperse translator is really a set of translators and libraries working
together to offer a dispersed volume. This new kind of volume is based on
erasure codes and is able to reduce the disk space wasted on redundancy
while maintaining a high level of fault tolerance.


Introduction
------------

One of the main goals in developing all these components has been to create
the modules needed to implement the dispersed volume while, at the same time,
writing functions and features generic enough to be easily usable by other
translators. With this idea in mind, the dispersed volume has been divided
into the following translators:

[horizontal]
*https://forge.gluster.org/disperse/ida[ida]*::
    The core translator of the dispersed volume. It implements all erasure
    coding and health logic.

*https://forge.gluster.org/disperse/heal[heal]*::
    A server side helper translator used by ida to aid in the healing process
    of a file.

*https://forge.gluster.org/disperse/dfc[dfc]*::
    A server side translator that coordinates multiple concurrent requests to
    the same file from multiple clients without needing any locks.


Additionally, some libraries are needed:

[horizontal]
*https://forge.gluster.org/disperse/gfsys[libgfsys]*::
    A wrapper library around some of the libglusterfs functionality that
    simplifies common tasks and also offers some new features, like
    asynchronous calls, fast timers and RCU support.

*https://forge.gluster.org/disperse/dfc[libgfdfc]*::
    A client side library to use the features of the dfc translator.


Libraries
---------

gfsys library
~~~~~~~~~~~~~

The gfsys library is heavily based on macros. They are needed to offer
features that are easy to write and that also compile down to very efficient
code.

Logging
^^^^^^^

The purpose of the logging macros is to offer a very short way to write log
messages. It also defines the infrastructure used to handle errors and other
things (see later).

At compile time it's possible to define the minimum log level to use by
defining the macro LOG_LEVEL. This means that any log message below the
defined level will not get compiled in.

The main logging function is 'sys_log'. It is a wrapper around 'gf_log' and
other variants from 'GlusterFS':

[source,c]
sys_log(_mode, _fmt, ...)

__fmt_ is the format string, like in 'printf()'

__mode_ has the following format:

    <level>[<type>]([<repetitions>[, <arguments>][, <name>]])

Where:

*<level>*:: The importance of the message

[horizontal]
U::: Unconditional
M::: Emergency
A::: Alert
C::: Critical
E::: Error
W::: Warning
N::: Notice
I::: Information
D::: Debug
T::: Trace

*<type>*:: The type of message (optional)

[horizontal]
g::: Generic (default value if none specified)
s::: Include stack trace
m::: Report a memory allocation problem

*<repetitions>*:: How often the message should be logged (optional)

[horizontal]
all:::  Always log the message (default value if none specified)
once::: Only log the message the first time
rl:::   Rate-limited logging
none::: The message is never logged (used to suppress messages in some
       circumstances).

*<arguments>*:: Arguments needed by the '<repetitions>'. Currently only 'rl'
                needs additional arguments: the period and the maximum number
                of logged messages in that period, separated by a comma.

*<name>*:: The source of the message. If it's not specified, 'THIS\->name' is
           used.

.Examples
****
[source,c]
sys_log(E(), "Generic error message logged always");
sys_log(Ws(once), "Warning message including calling stack and logged once");
sys_log(Ig(rl, 10, 1), "Generic informational message logged at most once "
                       "every 10 time units");
sys_log(Em(), 256); // Error message indicating that an allocation of 256
                    // bytes failed.
sys_log(W(all, "log"), "Warning message using 'log' as the source");
****

A special set of macros have been defined to simplify all this even more:

  log<level>[<type>][_<repetitions>]([<arguments>,] _fmt, ...)
  nlog<level>[<type>][_<repetitions>]([<arguments>,] <name>, _fmt, ...)

.Examples
****
[source,c]
logE("Generic error message logged always");
logWs_once("Warning message including calling stack and logged once");
logIg_rl(10, 1, "Generic informational message logged at most once every "
                "10 time units");
logEm(256); // Error message indicating that an allocation of 256 bytes
            // failed.
nlogW("log", "Warning message using 'log' as the source");
****

Error handling
^^^^^^^^^^^^^^

Multiple macros have been defined to detect errors and manage them properly.
To do so, a set of 'actions' is defined. These actions are simple statements
that can be executed when an error condition is detected.

Currently available actions are:

*NULL([<args>])*::
    '<args>' are ignored and nothing is done. This is a NOOP.

*BREAK([<save>])*::
    Executes a 'break' statement. If '<save>' is specified, it should be a
    pointer to an err_t value which will receive the error code detected.

*CONTINUE([<save>])*::
    Executes a 'continue' statement. '<save>' has the same meaning as in
    'BREAK'.

*GOTO(<label>[, <save>])*::
    Executes a 'goto <label>' statement. '<save>' has the same meaning as in
    'BREAK'.

*RETURN([<save>])*::
    Executes a 'return' statement. '<save>' has the same meaning as in 'BREAK'.

*RETVAL(<value>[, <save>])*::
    Executes a 'return <value>' statement. '<save>' has the same meaning as in
    'BREAK'.

*RETERR([<save>])*::
    Executes a 'return <detected error code>' statement.  '<save>' has the same
    meaning as in 'BREAK'.

*SAVE(<save>)*::
    Simply stores the detected error code into '<save>'.

*LOG(<mode>, <msg>, ...)*::
    Logs the '<msg>' (with its optional formatting arguments) using the
    specified '<mode>'.

*ASSERT(<msg>, ...)*::
    Generates an assertion, logging the specified '<msg>' and forcing an
    immediate program termination.

*NO_FAIL()*::
    Special action used for memory allocation. It guarantees that the
    allocation will always succeed (or not return if failing).

A set of additional macros has been defined to allow easy handling of error
codes. Their main purpose is to make it easy to detect errors and log
something without polluting the source code with a lot of log messages. There
are two main uses: first, for functions that can return an error that is
generally not important to handle, but that can still be the consequence of
something going wrong (an example could be the 'close' function). The second
use is to track a cascade of failing functions, to easily locate where an
error originated and how it propagated through the stack.

Using these macros makes it possible to add only high-level logging messages
on function calls without having to take care of every return code. These
macros use the defined actions to do things when an error is detected.

SYS_TEST(<test>, <code>, <mode>, <actions...>)::
    Checks if the '<test>' is true (different than 0). If it is, nothing else
    is done. Otherwise, '<code>' will be the detected error, which will be
    logged using the '<mode>' specified. After that, all '<actions...>' will
    be processed. The action list can be empty. In this case, the macro returns
    the error code detected.

SYS_ASSERT(<test>, <msg...>)::
    The same as 'SYS_TEST', but an assertion is generated and the program
    is terminated if the '<test>' fails.

SYS_CHAIN(<perror>, <statement>)::
    '<perror>' is a pointer to an error code. If the error is != 0 nothing is
    done and *<perror> is returned. Otherwise the '<statement>' is executed and
    its result returned.

SYS_CALL(<function>, <arguments>, <mode>, <actions...>)::
    The '<function>' will be called using the specified '<arguments>'. It is
    assumed that the function will return an error code. If this error is 0,
    nothing else is done and 0 is returned. Otherwise an error message is
    logged using the specified '<mode>' and the action list is executed. If
    the action list is empty or it does not have statements that change the
    flow of execution, the error code is returned.

SYS_PTR(<pptr>, <function>, <arguments>, <code>, <mode>, <actions...>)::
    It is similar to 'SYS_CALL', but it can be used with functions that return
    a pointer instead of an error code. In this case, if the returned pointer
    is NULL, the error code will be set to '<code>' and will be processed
    exactly as in 'SYS_CALL'. Otherwise it is assigned to '<pptr>'. This macro
    returns 0 or the error code specified.

SYS_ERRNO(<presult>, <function>, <arguments>, <mode>, <actions...>)::
    It works like 'SYS_CALL' but it's for functions that return a result or a
    negative number in case of error (setting errno to the appropriate error).
    The result is assigned to '<presult>' and the error code is returned.

SYS_ERRNO0(<function>, <arguments>, <mode>, <actions...>)::
    It works like 'SYS_ERRNO', but it's specific for functions returning 0 or
    a negative number in case of error (i.e. no meaningful result is returned).

SYS_ERRNO_PTR(<pptr>, <function>, <arguments>, <mode>, <actions...>)::
    The same as 'SYS_ERRNO', but for functions that return a pointer.

SYS_CODE(<function>, <arguments>, <code>, <mode>, <actions...>)::
    It's like 'SYS_ERRNO0', but for functions that do not set errno. In this
    case, if an error is detected, the error code will be set to '<code>'.

.Examples
****
[source,c]
SYS_CHAIN(&error, pthread_mutex_lock(&mutex));
SYS_CALL(pthread_mutex_lock, (&mutex), E(), ASSERT("Mutex is not valid"));
SYS_PTR(&buffer, malloc, (size), ENOMEM, E(), BREAK(&error));
SYS_ERRNO(&fd, open, (file, mode), W(), LOG(N(), "Open failed"), RETERR());
SYS_ERRNO0(close, (fd), W());
SYS_ERRNO_PTR(&str, strdup, (name), E(), RETVAL(-1));
SYS_CODE(dict_lookup, (dict, key, &data), ENOENT, W(), GOTO(failed, &error));
SYS_TEST(ptr != NULL, EINVAL, E(), LOG(E(), "Pointer is NULL"), RETURN());
SYS_ASSERT(ptr != NULL, "Pointer is NULL");
****

A cleaner way to write these macros is like this (it takes more space, but
it's much easier to see what each call is doing):

.Examples
****
[source,c]
SYS_CHAIN(
    &error,
    pthread_mutex_lock(&mutex)
);
SYS_CALL(
    pthread_mutex_lock, (&mutex),
    E(),
    ASSERT("Mutex is not valid")
);
SYS_PTR(
    &buffer,
    malloc, (size),
    ENOMEM,
    E(),
    BREAK(&error)
);
SYS_ERRNO(
    &fd,
    open, (file, mode),
    W(),
    LOG(N(), "Open failed"),
    RETERR()
);
SYS_ERRNO0(
    close, (fd),
    W()
);
SYS_ERRNO_PTR(
    &str,
    strdup, (name),
    E(),
    RETVAL(-1)
);
SYS_CODE(
    dict_lookup, (dict, key, &data),
    ENOENT,
    W(),
    GOTO(failed, &error)
);
SYS_TEST(
    ptr != NULL,
    EINVAL,
    E(),
    LOG(E(), "Pointer is NULL"),
    RETURN()
);
SYS_ASSERT(
    ptr != NULL,
    "Pointer is NULL"
);
****

Memory management
^^^^^^^^^^^^^^^^^

Memory management macros have a special feature that makes it possible to
specify that some requests cannot fail under any circumstance. To support
this, there is an emergency memory pool that is used when an allocation from
normal memory fails. If even the emergency pool is exhausted, the process
terminates immediately.

Additionally, the emergency memory manager fires some events when the memory
condition is low. Any component can attach a callback to these events to try
to release some memory (for example a caching translator could use it to flush
some buffers and alleviate the problem).

Basically all macros are specialized versions of 'SYS_PTR'.

SYS_ALLOC(<pptr>, <size>, <type>, <mode>, <actions...>)::
    Allocates a memory area of the specified '<size>' and '<type>' and stores
    its address to '<pptr>'. If the allocation fails, a message is logged
    using the '<mode>' indicated and the '<actions>' are processed. It returns
    0 if the allocation succeeded or ENOMEM if it failed and the actions didn't
    change the flow of execution.

SYS_ALLOC0(<pptr>, <size>, <type>, <mode>, <actions...>)::
    Exactly the same as 'SYS_ALLOC', but the memory area is filled with 0
    before returning.

SYS_RESIZE(<pptr>, <size>, <type>, <mode>, <actions...>)::
    It resizes a previously allocated memory area to the new '<size>'.

SYS_ALLOC_ALIGNED(<pptr>, <size>, <align>, <type>, <mode>, <actions...>)::
    It works exactly as the 'SYS_ALLOC' but the returned address is aligned
    to a multiple of the '<align>' value specified.

SYS_FREE(<pptr>)::
    It releases a previously allocated memory area.

SYS_FREE_ALIGNED(<pptr>)::
    It releases a previously allocated aligned memory area.

SYS_MALLOC(<pobj>, <type>, <mode>, <actions...>)::
    It is the same as 'SYS_ALLOC', but the size is deduced from the type of
    '<pobj>'.

SYS_MALLOC0(<pobj>, <type>, <mode>, <actions...>)::
    Like 'SYS_MALLOC', but clearing the allocated buffer.

SYS_CALLOC(<pobj>, <count>, <type>, <mode>, <actions...>)::
    It allocates an array of '<count>' elements of the type deduced from
    '<pobj>'. The size is automatically computed.

SYS_CALLOC0(<pobj>, <count>, <type>, <mode>, <actions...>)::
    Like 'SYS_CALLOC', but clearing the allocated buffer.

SYS_POOL_CREATE(<ppool>, <type>, <count>, <mode>, <actions...>)::
    Creates a memory pool for objects of the specified '<type>' with a size
    for '<count>' items.

SYS_POOL_DESTROY(<ppool>)::
    Destroys a previously allocated pool.

SYS_POOL_ALLOC(<pptr>, <pool>, <type>, <mode>, <actions...>)::
    Allocates an item from the specified memory '<pool>'. If it succeeds, the
    address is stored into '<pptr>', otherwise a message is logged using the
    specified '<mode>' and the '<actions>' are processed.

SYS_POOL_ALLOC0(<pptr>, <pool>, <type>, <mode>, <actions...>)::
    The same as 'SYS_POOL_ALLOC', but clearing the allocated item.

SYS_POOL_FREE(<ptr>)::
    Releases a memory pool item.

.Examples
****
[source,c]
SYS_ALLOC0(
    &buffer, 1024, gf_common_mt_char,
    E(),
    LOG(E(), "Unable to allocate a buffer"),
    RETERR()
);
SYS_MALLOC( /* Assuming that dict is declared as a dict_t * */
    &dict, gf_common_mt_dict_t,
    E(),
    BREAK(&error)
);
SYS_FREE(dict);
****

Asynchronous calls
^^^^^^^^^^^^^^^^^^

The asynchronous subsystem is based on a dynamically adjustable thread pool
whose threads communicate through a set of SPSC (single-producer,
single-consumer) queues between each pair of threads. The queues are
wait-free (unless there are no items in the queue) and provide a place to
store the arguments of each call, avoiding some small allocations for
short-term data.

This thread pool is designed to execute short, non-blocking tasks. This
condition is essential because the timer queue implementation relies on it:
each thread handles its own timer queue, which allows the creation of many
timers without any locks, so it executes faster.

When a time consuming or blocking task must be executed, a secondary I/O thread
pool is used for that purpose.

Except for pure asynchronous calls, all other features must be used from a
thread inside the primary thread pool.

SYS_ASYNC(<function>, <arguments>)::
    Executes the '<function>' with its '<arguments>' asynchronously. The
    following statement can be executed before, after or concurrently with
    the '<function>'. No assumptions can be made.

SYS_DELAY(<ms>, <function>, <arguments>)::
    Executes the specified '<function>' with its '<arguments>' asynchronously
    after waiting for '<ms>' milliseconds. The following statement is executed
    immediately.

SYS_LOCK(<lock>, <function>, <arguments>)::
    Acquires a lock asynchronously. When the lock is acquired, the '<function>'
    is called. The following statement is executed immediately, but the
    '<function>' can have been executed before, after or concurrently with it.

SYS_UNLOCK(<lock>)::
    Releases an acquired lock.

SYS_IO(<function>, <arguments>, <callback>[, <timer>])::
    Queues a '<function>' to be executed asynchronously from an I/O thread.
    When the function completes, a '<callback>' is executed. Optionally a
    '<timer>' can be specified to be notified if the operation takes too long.
    The timer is a 'SYS_DELAY' that will be cancelled if the I/O operation ends
    in time. The timer does not cancel the pending I/O, which will still have
    to finish at some point; it only lets you specify something to do if it
    takes too long.

SYS_CBK(<function>, <arguments>)::
    Creates a callback usable by 'SYS_IO'. When the operation initiated by
    'SYS_IO' finishes, the '<function>' will be called with the specified
    '<arguments>'.

SYS_RCU(<function>, <arguments>)::
    Queues a '<function>' to be executed when all threads in the primary thread
    pool have passed through a quiescent state. This is useful to implement
    faster data structures, especially for read operations, which can be done
    without locking and with zero overhead.

[NOTE]
===============================================================================
An interesting feature of asynchronous calls is that they can also be called
synchronously, simply by calling the function directly.

[source,c]
----
// Asynchronous execution
SYS_ASYNC(test_function, ("test", 20, &data));

// Synchronous execution
test_function("test", 20, &data);
----
===============================================================================

To use all these macros, it is necessary to declare the target functions in a
special way. First of all, the function arguments declaration needs to be
written in a different way: each argument must be enclosed in parentheses, with
the type and name separated by a comma.

For example, the following declaration:

[source,c]
void test_function(char * str, int number, data_struct_t * ptr);

Should have its arguments defined as:

[source,c]
((char *, str), (int, number), (data_struct_t *, ptr))

For special use cases, additional values can be specified for each argument
so that it is handled in a different way. There are two main types: PTR and
COPY. PTR is used to indicate that we want to pass a parameter by value, even
if it is a pointer. COPY is intended for arguments that need to be duplicated
or whose reference count needs to be incremented/decremented. In both cases
two additional fields must be supplied with the functions that perform the
assignment and release of the argument.

.Example
****
[source,c]
(
    (char *, str, COPY, test_strdup, test_free),
    (int, number),
    (data_struct_t, ptr, PTR, test_data_copy, test_data_release)
)
****

Note that when using PTR, the type is not declared as a pointer.

In this case test_strdup() will receive a pointer to a char *, where it should
put the new value, and a char * containing the original data. test_free()
simply receives the char * to release. test_data_copy() receives a pointer to
the destination structure and a pointer to the source one. test_data_release()
simply receives the pointer to the structure to release.

.Example
****
[source,c]
----
void test_dict_ref(dict_t ** dst, dict_t * src)
{
    if (src != NULL)
    {
        src = dict_ref(src);
    }
    *dst = src;
}

void test_dict_unref(dict_t * src)
{
    if (src != NULL)
    {
        dict_unref(src);
    }
}

void test_iatt_copy(struct iatt * dst, struct iatt * src)
{
    if (src != NULL)
    {
        *dst = *src;
    }
    else
    {
        memset(dst, 0, sizeof(*dst));
    }
}

void test_iatt_release(struct iatt * src)
{
    // Nothing to do
}

SYS_ASYNC_DECLARE(test_function, ((struct iatt, buf, PTR, test_iatt_copy,
                                                          test_iatt_release),
                                  (dict_t *, xdata, COPY, test_dict_ref,
                                                          test_dict_unref)));
----
****

This method can be used to automate the ref/unref of some structures every
time any of these arguments is used in asynchronous function calls.

Using this syntax for the arguments, the following macros can be used to
declare and define asynchronous functions:

SYS_ASYNC_DECLARE(<name>, <arguments>)::
    Declares an asynchronous function named '<name>' that can be called by
    'SYS_ASYNC'.

SYS_ASYNC_DEFINE(<name>, <arguments>)::
    Defines an asynchronous function named '<name>'. It must have been
    previously declared with 'SYS_ASYNC_DECLARE'. The body of the function
    must be written immediately following this macro, enclosed between `{' and
    `}'.

SYS_ASYNC_CREATE(<name>, <arguments>)::
    It's a short hand for 'SYS_ASYNC_DECLARE' followed by a 'SYS_ASYNC_DEFINE'.

SYS_DELAY_DECLARE(<name>, <arguments>)::
    Declares a delayed function named '<name>' callable by 'SYS_DELAY'.

SYS_DELAY_DEFINE(<name>, <arguments>)::
    Defines the delayed function.

SYS_DELAY_CREATE(<name>, <arguments>)::
    It's a short hand for 'SYS_DELAY_DECLARE' followed by a 'SYS_DELAY_DEFINE'.

SYS_LOCK_DECLARE(<name>, <arguments>)::
    Declares a function to be called when a lock is acquired by 'SYS_LOCK'.

SYS_LOCK_DEFINE(<name>, <arguments>)::
    Defines the locked function.

SYS_LOCK_CREATE(<name>, <arguments>)::
    It's a short hand for 'SYS_LOCK_DECLARE' followed by a 'SYS_LOCK_DEFINE'.

SYS_RCU_DECLARE(<name>, <arguments>)::
    Declares a function to be called by 'SYS_RCU'.

SYS_RCU_DEFINE(<name>, <arguments>)::
    Defines the rcu function.

SYS_RCU_CREATE(<name>, <arguments>)::
    It's a short hand for 'SYS_RCU_DECLARE' followed by a 'SYS_RCU_DEFINE'.

I/O management is a bit more complex. When an I/O function is declared, it
also needs to declare the arguments that its execution will return. For
example, if you want to create an asynchronous wrapper around the readv fop,
you will need to declare a function that receives an fd, a size, an offset,
some flags and a dictionary as input arguments, and has an iovec struct, a
count, an iatt struct, an iobref struct and a dictionary as output arguments.
These output arguments will be passed to the callback function as a pointer.

SYS_IO_DECLARE(<name>, <io arguments>, <cbk arguments>)::
    Declares a function to be called by 'SYS_IO'. This function will have
    '<io arguments>' as its arguments and '<cbk arguments>' is the list of
    arguments that the callback will receive.

SYS_IO_DEFINE(<name>, <io data>, <io arguments>, <cbk arguments>)::
    Defines an I/O function named '<name>', which will be responsible for
    executing the actual I/O operation using the '<io arguments>'. '<io data>'
    is a pointer that will be used to manage the current operation. It needs
    to be passed to 'SYS_IO_RESUME' when the I/O finishes.

SYS_IO_CREATE(<name>, <io data>, <io arguments>, <cbk arguments>)::
    It's a short hand for 'SYS_IO_DECLARE' followed by a 'SYS_IO_DEFINE'.

SYS_IO_RESUME(<name>, <io data>, <error>, <arguments...>)::
    This macro is executed when the I/O has finished and the result arguments
    (those defined as '<cbk arguments>' in 'SYS_IO_DECLARE') are available.
    '<io data>' is the pointer to the current operation. '<error>' indicates
    whether something failed and is used as the global result of the I/O
    operation, but it won't be passed to the callback function.
    '<arguments...>' are the actual result values of the I/O. This macro
    basically calls the callback function.

SYS_CBK_DECLARE(<name>, <arguments>)::
    Declares a function that will handle an I/O result. '<arguments>' are
    additional and custom parameters that the function will receive besides the
    I/O result arguments.

SYS_CBK_DEFINE(<name>, <io data>, <arguments>)::
    Defines the callback function. '<arguments>' is the data passed when
    defining the callback with 'SYS_CBK'. '<io data>' represents the I/O
    operation and can be used to access the result arguments. This value can
    be cast to `SYS_IO_CBK_TYPE(<name>) *' to access the data.

SYS_CBK_CREATE(<name>, <io data>, <arguments>)::
    It's a short hand for 'SYS_CBK_DECLARE' followed by a 'SYS_CBK_DEFINE'.

.Example
****
[source,c]
----
void dict_copy_ref(dict_t ** dst, dict_t * src)
{
    if (src != NULL)
    {
        src = dict_ref(src);
    }
    *dst = src;
}

void dict_copy_unref(dict_t * src)
{
    if (src != NULL)
    {
        dict_unref(src);
    }
}

void iatt_ptr_copy(struct iatt * dst, struct iatt * src)
{
    if (dst != NULL)
    {
        *dst = *src;
    }
    else
    {
        memset(dst, 0, sizeof(*dst));
    }
}

void iatt_ptr_free(struct iatt * src)
{
}

void iobref_copy_ref(struct iobref ** dst, struct iobref * src)
{
    if (src != NULL)
    {
        src = iobref_ref(src);
    }
    *dst = src;
}

void iobref_copy_unref(struct iobref * src)
{
    if (src != NULL)
    {
        iobref_unref(src);
    }
}

SYS_IO_CREATE(simple_io, io, ((int, fd), (void *, buf), (size_t, count)),
                             ((ssize_t, length), (int, op_errno)))
{
    ssize_t length;

    length = write(fd, buf, count);

    SYS_IO_RESUME(simple_io, io, (length < 0) ? errno : 0, length, errno);
}

int32_t writev_cbk(call_frame_t * frame, void * cookie, xlator_t * xl,
                   int32_t op_ret, int32_t op_errno, struct iatt * prebuf,
                   struct iatt * postbuf, dict_t * xdata)
{
    SYS_IO_RESUME(complex_io, cookie, (op_ret < 0) ? op_errno : 0,
                  frame, xl, op_ret, op_errno, prebuf, postbuf, xdata);
}

SYS_IO_CREATE(complex_io, io,
    (
        (call_frame_t *,  frame),
        (xlator_t *,      xl),
        (fd_t *,          fd),
        (void *,          buf),
        (size_t,          size),
        (off_t,           offset),
        (uint32_t,        flags),
        (struct iobref *, iobref, COPY, iobref_copy_ref, iobref_copy_unref),
        (dict_t *,        xdata, COPY, dict_copy_ref, dict_copy_unref)
    ),
    (
        /* Not all cbk arguments used, only added for completeness */
        (call_frame_t *, frame),
        (xlator_t *,     xl),
        (int32_t,        op_ret),
        (int32_t,        op_errno),
        (struct iatt,    prebuf, PTR, iatt_ptr_copy, iatt_ptr_free),
        (struct iatt,    postbuf, PTR, iatt_ptr_copy, iatt_ptr_free),
        (dict_t *,       xdata, COPY, dict_copy_ref, dict_copy_unref)
    )
)
{
    struct iovec iov;

    iov.iov_base = buf;
    iov.iov_len = size;

    STACK_WIND_COOKIE(frame, writev_cbk, io, FIRST_CHILD(xl),
                      FIRST_CHILD(xl)->fops->writev, fd, &iov, 1, offset,
                      flags, iobref, xdata);
}

SYS_CBK_CREATE(simple_io_cbk, io, ((int, id)))
{
    SYS_IO_CBK_TYPE(simple_io) * args;

    args = (SYS_IO_CBK_TYPE(simple_io) *)io;
    if (args->length < 0)
    {
        printf("Operation %u: Failed: %d\n", id, args->op_errno);
    }
    else
    {
        printf("Operation %u: Succeeded: %lu bytes written\n",
               id, args->length);
    }
}

SYS_CBK_CREATE(complex_io_cbk, io, ((int, id)))
{
    SYS_IO_CBK_TYPE(complex_io) * args;

    args = (SYS_IO_CBK_TYPE(complex_io) *)io;
    if (args->op_ret < 0)
    {
        printf("Operation %u: Failed: %d\n", id, args->op_errno);
    }
    else
    {
        printf("Operation %u: Succeeded: %lu bytes written\n",
               id, args->op_ret);
    }
}

SYS_DELAY_CREATE(io_timeout, ((int, id)))
{
    printf("Operation %u: Finished with error %d\n",
           id, sys_thread_get_error());
}

void simple_test(int fd, int num, void ** buf, size_t * count)
{
    int i;

    for (i = 0; i < num; i++)
    {
        SYS_IO(
            simple_io, (fd, buf[i], count[i]),
            SYS_CBK(simple_io_cbk, (i)),
            SYS_DELAY(1000, io_timeout, (i))
        );
    }
}

void complex_test(call_frame_t * frame, xlator_t * xl, fd_t * fd,
                  struct iovec * iov, int count, off_t offset, uint32_t flags,
                  struct iobref * iobref, dict_t * xdata)
{
    int i;
    off_t offs;

    offs = offset;
    for (i = 0; i < count; i++)
    {
        SYS_IO(
            complex_io, (frame, xl, fd, iov[i].iov_base, iov[i].iov_len,
                         offs, flags, iobref, xdata),
            SYS_CBK(complex_io_cbk, (i)),
            SYS_DELAY(1000, io_timeout, (i))
        );
        offs += iov[i].iov_len;
    }
}
----
****


Atomic functions
^^^^^^^^^^^^^^^^

Multiple atomic operations are implemented using the resources available on
the target architecture. The implementation is ready to support memory
ordering specification if the compiler supports it.


Thread functions
^^^^^^^^^^^^^^^^

A set of easy to use wrapper functions have been created to simplify thread
creation and manipulation.


gfdfc library
~~~~~~~~~~~~~

This library is used by client-side translators to interact with the dfc
translator on the server side. It basically offers a pair of initialization
functions and some other functions to attach dfc information to each request.

dfc_initialize(<xlator>, <max req>, <req>, <notify>, <dfc ptr>)::
    It should be called from the 'init()' function of the translator to
    prepare the dfc client resources. '<xlator>' should be the current
    translator. '<req>' is the initial number of requests sent to each brick
    to handle request sorting, and '<max req>' is the maximum number that is
    allowed to be in progress. '<notify>' is a callback function that will
    get called when the dfc library determines that a subvolume is ready to
    work. Finally, '<dfc ptr>' receives a pointer to a dfc_t structure needed
    for other calls to the dfc library.

dfc_default_notify(<dfc>, <xlator>, <event>, <data>)::
    This function should be called from the xlator's 'notify()' function to
    handle subvolume up and down events. When a subvolume reports that it's up,
    the dfc library needs to do some additional work before finally letting it
    be up. This function handles this extra work and later calls the '<notify>'
    function passed to 'dfc_initialize', at which point the translator can
    finally report that it itself is up, if needed. '<xlator>', '<event>' and
    '<data>' are the same arguments that a typical 'notify' function receives.

dfc_begin(<dfc>, <txn ptr>)::
    Initiates a new dfc transaction. '<txn ptr>' receives the new transaction
    to be used in subsequent calls.

dfc_attach(<txn>, <xdata dict>)::
    Attaches the transaction information to the specified xdata. This xdata
    should be the one used to send a request to one or more bricks.

dfc_end(<txn>, <count>)::
    Finishes a transaction, specifying how many requests have been sent. If
    there has been some problem and no request has been attached to the
    transaction, '<count>' should be 0.

dfc_complete(<txn>)::
    Completes a request from a transaction. When the last request of a
    transaction is completed, the resources are released.


Translators
-----------

dfc translator
~~~~~~~~~~~~~~

The purpose of this translator is to determine the correct order of execution
of requests coming from different clients and serialize them. The resulting
order of execution will be the same on every brick, which basically allows
modification requests to be executed without needing to lock the inode or
the entry.

For this to work, each client generates a unique id that identifies it to
the bricks. That id is a numeric value that can be compared with other ids.
This ordering is important because a client with a lower id will have priority
in case of conflict.

The basic idea of this translator is that every request coming from a client
is assigned a sequential transaction id. Each client keeps an independent
counter. When a brick receives a request from a client, it looks up which
requests from other clients touching the same inode have arrived before it.
This generates a list of clients and transaction ids that is sent back to the
client.

When the client receives these lists from all bricks, it analyzes them and
generates a list of dependencies for the transaction. This is done by
analyzing the transaction ids received from all other clients, using two
basic rules:

. If a client has a lower id than the current client, the maximum of the
  transaction ids received is used as a dependency.
. If a client has a higher id than the current client, the minimum of the
  transaction ids received is used as a dependency.

.Example
****
Suppose that client 3 (C3) sends a request with transaction id 10 to 4 bricks
(B1, B2, B3 and B4).

Brick B1 notifies that this transaction (C3.10) depends on transaction 4 of
client 1, transaction 8 of client 2 and transaction 16 of client 4.

  B1: C3.10 depends on C1.4, C2.8 and C4.16

Similarly, the other bricks say the following:

  B2: C3.10 depends on C1.5, C2.7 and C4.15
  B3: C3.10 depends on C1.5, C2.8 and C4.17
  B4: C3.10 depends on C1.4, C2.7 and C4.16

Then client C3 computes the final dependency list. Since C1 and C2 have lower
ids, their requests have priority, so the maximum transaction id received from
those clients is chosen. For C4, which has lower priority, the minimum of its
transaction ids is selected:

  Final dependencies: C3.10 depends on C1.5, C2.8 and C4.15
****
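The two merging rules can be sketched in Python (the names below are
hypothetical, chosen for illustration; the real dfc code is written in C as
part of the translator):

```python
# Hypothetical sketch of the dependency-merging rules. Lower client id
# means higher priority.

def merge_dependencies(self_id, per_brick_deps):
    """Combine per-brick dependency reports into a final dependency list.

    per_brick_deps: one dict per brick, mapping client id -> transaction id.
    self_id: the id of the requesting client.
    """
    final = {}
    for deps in per_brick_deps:
        for client, txn in deps.items():
            if client not in final:
                final[client] = txn
            elif client < self_id:
                # Lower id => higher priority: keep the maximum txn id.
                final[client] = max(final[client], txn)
            else:
                # Higher id => lower priority: keep the minimum txn id.
                final[client] = min(final[client], txn)
    return final

# Reproducing the example: client 3, four bricks.
reports = [
    {1: 4, 2: 8, 4: 16},   # B1
    {1: 5, 2: 7, 4: 15},   # B2
    {1: 5, 2: 8, 4: 17},   # B3
    {1: 4, 2: 7, 4: 16},   # B4
]
print(merge_dependencies(3, reports))  # {1: 5, 2: 8, 4: 15}
```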

Once this dependency list is complete, it is sent to the bricks. Each brick
uses it to create a dependency graph, which defines a strict order in which
the transactions will be executed. An implicit rule is that transactions
coming from a single client must be executed in ascending order of
transaction id.

There is a possibility that some transactions create a circular dependency.
In this case, the transaction belonging to the client with the highest
priority is executed first.

.Example
****
If we get the following graph:

    C1.5  depends on C2.8, C3.11 and C4.15
    C2.8  depends on C1.6, C3.10 and C4.15
    C3.10 depends on C1.5, C2.8 and C4.15

We have that C1.5 depends on C2.8, which depends on C3.10, which depends on
C1.5.

In this case, C1 is the client with highest priority, so C1.5 will be executed
first (i.e. the dependency of C2.8 from C1.5 is dropped).
****
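The cycle-breaking rule can be illustrated with a small sketch. This is a
simplification (only the edges inside the graph above are modeled, and all
names are hypothetical); the real brick-side implementation is more involved:

```python
# Transactions are (client_id, txn_id) tuples; deps maps a transaction to
# the set of transactions it waits for. Lower client id = higher priority.

def find_cycle(deps):
    """Return a list of transactions forming a cycle, or None."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {t: WHITE for t in deps}
    stack = []

    def visit(t):
        color[t] = GRAY
        stack.append(t)
        for d in deps.get(t, ()):
            if color.get(d, BLACK) == GRAY:
                return stack[stack.index(d):]    # found a back edge
            if color.get(d, BLACK) == WHITE:
                cycle = visit(d)
                if cycle:
                    return cycle
        stack.pop()
        color[t] = BLACK
        return None

    for t in list(deps):
        if color[t] == WHITE:
            cycle = visit(t)
            if cycle:
                return cycle
    return None

def break_cycles(deps):
    """Drop dependencies so that the highest-priority member of each
    cycle is executed first."""
    while True:
        cycle = find_cycle(deps)
        if cycle is None:
            return deps
        winner = min(cycle)  # lowest client id = highest priority
        # The winner runs first, so it no longer waits for anything
        # inside the cycle.
        deps[winner] -= set(cycle)

# The example above, keeping only the in-graph edges:
deps = {
    (1, 5): {(2, 8)},
    (2, 8): {(3, 10)},
    (3, 10): {(1, 5)},
}
break_cycles(deps)
# (1, 5) now has no dependencies; the other edges are unchanged.
```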

When a client uses dfc's features through the gfdfc library, it sends a batch
of special getxattr requests to each brick that won't be answered immediately.
These requests are used later by the dfc translator to send the dependencies
of each request back to the originating client. Each of these requests can
contain information from multiple transactions, reducing the total number of
packets sent.

All of dfc's operation is based on a set of specific dictionary entries that
can be attached to each request. If a request does not have these entries, it
is passed down to the next translator without modification. When these entries
are present, the dfc translator extracts basically two fields, the client id
and the request id, from which it can generate the information needed to
create the dependency graph.


heal translator
~~~~~~~~~~~~~~~

The heal translator helps clients heal a file while allowing concurrent
modification of the file during the heal process. It does this by
distinguishing writes intended for healing from normal writes.

When a client detects that a file needs to be healed, it requests heal access
to that file. If no other client is healing the file, this translator grants
heal access to the client. Only one client can be healing a given file at any
time, although multiple clients can be healing different files.

Once a client has been granted heal access, it can begin to write to the file
with only one restriction: it must not send a write request before having
received the answer to the previous one.

The heal translator maintains a list of segments that need to be healed.
Initially, this list contains a single segment that covers the full file.
Every time a healing or normal write arrives, the written area is removed from
the list of pending segments, which may remove or modify an existing segment
or split one into two. The difference is that normal writes are always
allowed, while healing writes are only applied where they are fully contained
in the list of pending segments. Any fragment outside this list is silently
ignored.

Since normal writes never arrive concurrently (a previous locking mechanism
will have serialized the requests), the translator only has to deal with at
most one normal write overlapping one heal write at a time.
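The segment bookkeeping described above can be modeled with a short sketch.
This is a toy model with hypothetical names, using half-open byte ranges; the
real translator works on actual file offsets inside the brick:

```python
def subtract(segments, start, end):
    """Remove [start, end) from the pending list, possibly splitting
    a segment into two."""
    result = []
    for s, e in segments:
        if end <= s or start >= e:      # no overlap: keep as-is
            result.append((s, e))
            continue
        if s < start:                   # keep the left remainder
            result.append((s, start))
        if end < e:                     # keep the right remainder
            result.append((end, e))
    return result

def apply_write(segments, start, end, healing):
    """Normal writes always land; heal writes are clipped to the
    pending segments (fragments outside are silently ignored)."""
    if healing:
        written = [(max(s, start), min(e, end))
                   for s, e in segments
                   if max(s, start) < min(e, end)]
    else:
        written = [(start, end)]
    for s, e in written:
        segments = subtract(segments, s, e)
    return segments, written

pending = [(0, 100)]                                      # whole file pending
pending, w = apply_write(pending, 40, 60, healing=False)  # normal write
print(pending)       # [(0, 40), (60, 100)]
pending, w = apply_write(pending, 30, 70, healing=True)   # heal write
print(pending, w)    # [(0, 30), (70, 100)] [(30, 40), (60, 70)]
```

The heal write for bytes 30-70 only touches 30-40 and 60-70, because 40-60
was already overwritten by a newer normal write and must not be clobbered
with stale healed data.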


ida translator
~~~~~~~~~~~~~~

This is the translator that implements the erasure code. Its primary task is
to encode the data received in write requests and send it to its subvolumes,
and to rebuild the original data when a subset of fragments is received.

This translator is based on Rabin's IDA (Information Dispersal Algorithm). It
uses a matrix with elements in the Galois field GF(2^8), which can accommodate
up to 255 bricks per dispersed subvolume.

The size of the matrix is NxM, where M is the number of bricks of the
subvolume and N is the minimum number of bricks that must be alive to recover
the data. This means that M - N bricks can fail without interrupting the
operation of the volume.

The value of each element of the matrix is i^(j - 1), where `i' is the row and
`j' is the column:

  | 1^0 1^1 1^2 ... |
  | 2^0 2^1 2^2 ... |
  | 3^0 3^1 3^2 ... |
  | ... ... ... ... |

This guarantees that the original information can be recovered from any
subset of N surviving bricks, whatever subset of M - N bricks fails.

All multiplications in the Galois field have been optimized using fast xors
on SSE2 registers, which allows an average of 128 8-bit multiplications in
about 20 machine cycles.
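As an illustration, the matrix construction can be sketched in Python. The
reducing polynomial used below (0x11d) and the function names are assumptions
for this sketch; the polynomial actually used by ida may differ, and the real
code uses the SSE2-optimized multiplication mentioned above:

```python
def gf_mul(a, b, poly=0x11d):
    """Multiply two elements of GF(2^8): carry-less multiplication
    reduced modulo the chosen polynomial (0x11d is an assumption)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= poly
        b >>= 1
    return r

def gf_pow(a, n):
    """Raise a GF(2^8) element to an integer power."""
    r = 1
    for _ in range(n):
        r = gf_mul(r, a)
    return r

def ida_matrix(n, m):
    """Build the dispersal matrix: element i^(j-1) with one row per
    brick (i = 1..M) and one column per data fragment (j = 1..N)."""
    return [[gf_pow(i, j - 1) for j in range(1, n + 1)]
            for i in range(1, m + 1)]

# 4 bricks, any 3 of them enough to recover the data (M=4, N=3):
for row in ida_matrix(3, 4):
    print(row)
```

Because the rows form a Vandermonde-style matrix with distinct generators,
any N of them are linearly independent, which is what allows recovery from
any N surviving bricks.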


Installation
------------

Currently all modules of the dispersed volume are compiled outside the main
GlusterFS tree, so there is no integration with the gluster cli to allow the
creation and manipulation of these volumes. Manual vol-file modifications are
needed.

Prerequisites
~~~~~~~~~~~~~

To compile and install the new modules, you need to configure, compile and
install GlusterFS from source.

The current version has only been tested on 64-bit Intel platforms, and it
probably won't even compile on other architectures, including 32-bit Intel.

Procedure
~~~~~~~~~

To get a working environment for the dispersed volume, follow these steps:

----
# mkdir disperse
# cd disperse
# git clone git://forge.gluster.org/disperse/gfsys.git
# git clone git://forge.gluster.org/disperse/dfc.git
# git clone git://forge.gluster.org/disperse/heal.git
# git clone git://forge.gluster.org/disperse/ida.git
# cd gfsys
# ./autogen.sh
# ./configure --with-glusterfs=<path to glusterfs source>
# make install
# cd ../dfc
# ./autogen.sh
# ./configure --with-glusterfs=<path to glusterfs source> \
              --with-gfsys=`pwd`/../gfsys
# make install
# cd ../heal
# ./autogen.sh
# ./configure --with-glusterfs=<path to glusterfs source>
# make install
# cd ../ida
# ./autogen.sh
# ./configure --with-glusterfs=<path to glusterfs source> \
              --with-gfsys=`pwd`/../gfsys/src \
              --with-gfdfc=`pwd`/../dfc/lib
# make install
----

Creating a dispersed volume
~~~~~~~~~~~~~~~~~~~~~~~~~~~

For the time being, to create a volume you will need to modify the volume
files manually. The easiest way to do so is to create a replicated volume
using the command line interface, modify the generated files, and then start
the bricks and mount the volume manually.

First create a replicated volume with the same number of bricks that you want
to use in the dispersed volume:

----
# gluster volume create dispersed replica 3 server1:/bricks/dispersed \
                                            server2:/bricks/dispersed \
                                            server3:/bricks/dispersed
----

Then copy the generated volume file for fuse and do the following changes on
the subvolume corresponding to replicate:

====
----
 volume dispersed-replicate-0
-    type cluster/replicate
+    type cluster/disperse
+    option size 3:1
     subvolumes dispersed-client-0 dispersed-client-1 dispersed-client-2
 end-volume
----
====

Similarly, take one of the brick files and do the following changes:

====
----
 volume dispersed-access-control
     type features/access-control
     subvolumes dispersed-posix
 end-volume

+volume dispersed-heal
+    type features/heal
+    subvolumes dispersed-access-control
+end-volume
+
 volume dispersed-locks
     type features/locks
-    subvolumes dispersed-access-control
+    subvolumes dispersed-heal
 end-volume
----
====

====
----
 volume dispersed-marker
     type features/marker
     option quota off
     option xtime off
     option timestamp-file /var/lib/glusterd/vols/dispersed/marker.tstamp
     option volume-uuid 8c96a803-9471-4236-ad5b-0e24d9c724b3
     subvolumes dispersed-index
 end-volume

+volume dispersed-dfc
+    type features/dfc
+    subvolumes dispersed-marker
+end-volume
+
 volume /bricks/dispersed
     type debug/io-stats
     option count-fop-hits off
     option latency-measurement off
-    subvolumes dispersed-marker
+    subvolumes dispersed-dfc
 end-volume
----
====

With these files modified, you can start the volume manually using the
following command on each brick:

----
# glusterfsd -L INFO -l - -N -f <path to volfile for brick>
----

And finally mount the volume on one server using this command:

----
# glusterfs -L INFO -l - -N -f <path to volfile for fuse> <mount point>
----
