[Gluster-devel] A healing translator

Tue May 8 09:34:35 UTC 2012

Hello developers,

I would like to present some ideas we are working on for a new kind
of translator that should unify and, to some extent, simplify the
healing procedures of complex translators.

Currently, the only translator with complex healing capabilities that we
are aware of is AFR. We are developing another translator that will also
need healing capabilities, so we thought it would be interesting to
create a new translator that handles the common part of the healing
process, simplifying other translators and avoiding duplicated code in
them.

The basic idea of the new translator is to handle healing tasks nearer
the storage translator on the server nodes, instead of controlling
everything from a translator on the client nodes. Of course, the heal
translator cannot handle healing entirely by itself: it needs a client
translator to coordinate all the tasks. The heal translator is
intended to be used by translators that work with multiple subvolumes.

I will try to explain how it works without going into too much detail.

There is an important requirement for all client translators that use
healing: they must have exactly the same list of subvolumes and in the 
same order. Currently, I think this is not a problem.

The heal translator treats each file as an independent entity, and each 
one can be in 3 modes:

1. Normal mode

    This is the normal mode for a copy or fragment of a file when it is
    synchronized and consistent with the same file on other nodes (for
    example, with other replicas). It is the client translator that
    decides whether it is synchronized or not.

2. Healing mode

    This is the mode used when a client detects an inconsistency in the
    copy or fragment of the file stored on this node and initiates the
    healing procedures.

3. Provider mode (I don't like this name very much, though)

    This is the mode used by client translators when an inconsistency is
    detected in this file, but the copy or fragment stored on this node
    is considered good and will be used as a source to repair the
    contents of this file on other nodes.

Initially, when a file is created, it is set in normal mode. Client 
translators that make changes must guarantee that they send the 
modification requests in the same order to all the servers. This should 
be done using inodelk/entrylk.

When a change is sent to a server, the client must include a bitmap mask
of the clients to which the request is being sent. Normally this is a
bitmap containing all the clients. However, when a server fails for some
reason, some bits will be cleared. The heal translator uses this bitmap
to detect failures on other nodes early, from the point of view of each
client. When this condition is detected, the request is aborted with an
error and the client is notified of the remaining list of valid nodes.
If the client considers that the request can be successfully served with
the remaining list of nodes, it can resend the request with the updated
bitmap.
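
A minimal sketch of the bitmap check may help here. It is only one
possible reading of the paragraph above: it assumes that each server
keeps, per file, a mask of the nodes it still considers valid, that any
request whose mask differs from it is aborted, and that the stored mask
is narrowed to the intersection. The names FileState, valid_mask and
submit are hypothetical and do not come from any existing code.

```python
class FileState:
    """Per-file state kept by the (hypothetical) heal translator."""

    def __init__(self, node_count):
        # Assumption: initially every node is considered valid.
        self.valid_mask = (1 << node_count) - 1

    def submit(self, request_mask):
        """Check a change request against the stored mask.

        Returns (accepted, valid_mask). On a mismatch the request is
        aborted and the caller receives the remaining valid nodes, so it
        can decide whether to resend with the updated bitmap.
        """
        if request_mask == self.valid_mask:
            return True, self.valid_mask
        # Some node is missing from one point of view: narrow the mask.
        self.valid_mask &= request_mask
        return False, self.valid_mask


state = FileState(3)
print(state.submit(0b111))  # all three nodes reachable
print(state.submit(0b011))  # a client lost node 2: aborted
print(state.submit(0b011))  # resent with the updated bitmap
```

With this reading, a client that still believes all nodes are alive
gets its next request aborted and learns the reduced node list at once.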

The heal translator also updates two file attributes for each change
request to maintain the "version" of the data and metadata contents of
the file. A similar task is currently done by AFR using xattrop. This
would no longer be needed, which should speed up write requests.

The version of data and metadata is returned to the client for each read 
request, allowing it to detect inconsistent data.
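
As an illustration of how a client could use the returned versions, here
is a small sketch. It assumes each reply carries a (data version,
metadata version) pair, and it only groups nodes by version. Which group
is "good" remains the client translator's decision and is not modelled.
The function names and the reply layout are hypothetical.

```python
from collections import defaultdict


def group_by_version(replies):
    """Group node names by the (data, metadata) version they returned."""
    groups = defaultdict(list)
    for node, version in replies.items():
        groups[version].append(node)
    return dict(groups)


def is_consistent(replies):
    """True when every node reported the same version pair."""
    return len(group_by_version(replies)) <= 1


# Hypothetical replies: node2 missed one data change.
replies = {"node0": (5, 2), "node1": (5, 2), "node2": (4, 2)}
print(is_consistent(replies))
print(group_by_version(replies))
```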

When a client detects an inconsistency, it initiates healing. First of 
all, it must lock the entry and inode (when necessary). Then, from the 
data collected from each node, it must decide which nodes have good data 
and which ones have bad data and hence need to be healed. There are two 
possible cases:

1. File is not a regular file

    In this case the reconstruction is very fast and requires few
    requests, so it is done while the file is locked. In this case, the
    heal translator does nothing relevant.

2. File is a regular file

    For regular files, the first step is to synchronize the metadata to
    the bad nodes, including the version information. Once this is done,
    the file is set in healing mode on bad nodes, and provider mode on
    good nodes. Then the entry and inode are unlocked.

When a file is in provider mode, it works as in normal mode, but refuses
to start another healing. Only one client can heal a file at a time.
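
The three modes and the "only one healer" rule can be sketched as a
small state machine. This is an illustration under assumptions: the
text only says that provider mode refuses a new healing, so refusing in
healing mode as well, and the finish_heal step that returns to normal
mode, are guesses. All names are hypothetical.

```python
from enum import Enum


class Mode(Enum):
    NORMAL = "normal"
    HEALING = "healing"
    PROVIDER = "provider"


class HealFile:
    """Mode of one copy or fragment of a file on one node."""

    def __init__(self):
        self.mode = Mode.NORMAL  # files are created in normal mode

    def start_heal(self, is_bad):
        """Enter healing (bad copy) or provider (good copy) mode.

        Returns False if a healing is already in progress.
        """
        if self.mode is not Mode.NORMAL:
            return False
        self.mode = Mode.HEALING if is_bad else Mode.PROVIDER
        return True

    def finish_heal(self):
        # Assumption: a completed healing returns the file to normal mode.
        self.mode = Mode.NORMAL


good_copy = HealFile()
print(good_copy.start_heal(is_bad=False), good_copy.mode)
print(good_copy.start_heal(is_bad=False))  # second healer is refused
```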

When a file is in healing mode, each normal write request from any
client is handled as if the file were in normal mode, updating the
version information and detecting possible inconsistencies with the
bitmap. Additionally, the heal translator marks the written region of
the file as "good".

Each write request from the healing client intended to repair the file
must be marked with a special flag. In this case, the area to be
written is filtered by the list of "good" ranges (if there is any
intersection with a good range, it is removed from the request). The
resulting set of ranges is propagated to the lower translator and added
to the list of "good" ranges, but the version information is not updated.
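
The filtering of a healing write against the "good" ranges can be
sketched as follows. It assumes the good list is kept as sorted,
non-overlapping, half-open (start, end) byte ranges. This representation
and the function names are hypothetical choices for the example, not
part of the design.

```python
def add_range(good, start, end):
    """Return a new sorted list with [start, end) merged into good."""
    merged = []
    for s, e in sorted(good + [(start, end)]):
        if merged and s <= merged[-1][1]:
            # Overlapping or adjacent: extend the previous range.
            merged[-1] = (merged[-1][0], max(merged[-1][1], e))
        else:
            merged.append((s, e))
    return merged


def filter_heal_write(good, start, end):
    """Return the parts of [start, end) not already marked as good."""
    pieces = []
    pos = start
    for s, e in good:  # good is sorted and non-overlapping
        if e <= pos:
            continue   # range lies entirely before the current position
        if s >= end:
            break      # range lies entirely after the request
        if s > pos:
            pieces.append((pos, s))
        pos = max(pos, e)
    if pos < end:
        pieces.append((pos, end))
    return pieces


# Normal writes already touched [10, 20) and [30, 40).
good = add_range(add_range([], 10, 20), 30, 40)
to_write = filter_heal_write(good, 0, 50)
print(to_write)
# The surviving pieces are written and then become good as well.
for s, e in to_write:
    good = add_range(good, s, e)
print(good)
```

This way a healing write never overwrites data that a normal write has
already placed in the file since healing started.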

Read requests are only served if the requested range is entirely
contained in the list of "good" regions.
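
The read check amounts to a containment test. A minimal sketch, again
assuming the good regions are stored as sorted, merged, half-open
(start, end) ranges, so that a fully good request always falls inside a
single stored range. The function name is hypothetical.

```python
def can_serve_read(good, offset, length):
    """True if [offset, offset + length) lies inside one good range.

    Assumes adjacent good ranges have already been merged.
    """
    end = offset + length
    return any(s <= offset and end <= e for s, e in good)


good = [(0, 100), (200, 300)]
print(can_serve_read(good, 10, 50))   # fully inside [0, 100)
print(can_serve_read(good, 90, 20))   # crosses into a region not yet healed
```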

There are some additional details, but I think this is enough to have a 
general idea of its purpose and how it works.

The main advantages of this translator are:

1. Avoid duplicated code in client translators
2. Simplify and unify healing methods in client translators
3. xattrop is not needed anymore in client translators to keep track of 
changes
4. Full file contents are repaired without locking the file
5. Better detection and prevention of some split brain situations as 
soon as possible

I think it would be very useful. It seems to me that it works correctly
in all situations. However, I don't have as much experience as other
developers with the healing functions of AFR, so I will be happy to
answer any questions and to hear any suggestions for solving problems
it may have or for improving it.

What do you think about it?

Thank you,

Xavi
