[Gluster-devel] Enhancing Quota enforcement during parallel writes

Raghavendra Gowdappa rgowdapp at redhat.com
Fri May 22 05:20:14 UTC 2015


All,

As pointed out in [1], parallel writes can result in incorrect quota enforcement. [2] was an (unsuccessful) attempt to solve the issue. Some points about [2]:

In [2], in_progress_writes is updated _after_ we fetch the size. Due to this, two writes can see the same size and hence the issue is not solved. What we should be doing instead is to update in_progress_writes even before we fetch the size. If we do this, it is guaranteed that at least one of the writes sees the other's size accounted in in_progress_writes. This approach has two issues:

1. Since we have already added the current write's size to in_progress_writes, the current write would also be accounted in the cluster-wide in-progress size of the directory that we fetch. This is a minor issue and can be solved by subtracting the size of the current write from the resultant cluster-wide in-progress size.

2. We might prematurely fail writes even though there is enough space available. Assume there is 5MB of free space. If two 5MB writes are issued in parallel, both might fail, as each might see the other's size already accounted in in_progress_writes even though neither of them has succeeded yet. To solve this issue, I am proposing the following algorithm:

   * We assign an identity that is unique across the cluster to each write - say, a uuid.
   * Among all the in-progress writes we pick one - a "candidate". The policy used can be an arbitrary but consistent criterion, like the smallest of all the uuids. Each brick selects a candidate from among its own in-progress writes _AND_ the incoming candidate (see the pseudocode of get_dir_size below for more clarity), and sends this candidate back along with the size of the directory. The brick also remembers the last candidate it approved. Clustering translators like dht then pick one write from among these replies, using the same logic the bricks used. So, along with the size we also get a candidate chosen from the in-progress writes. However, a new write might arrive on the brick in the time window in which we fetch the size, and that write could become the brick's candidate; hence we also compare the resultant cluster-wide candidate with the per-brick candidate at enforcement time. A minimal sketch of the least-uuid selection follows this list, and the enforcement logic is given in the pseudocode after that:
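For concreteness, here is a minimal, standalone sketch of what the per-brick least-uuid selection (the least_of_uuids/least_uuid helper in the pseudocode further below) could look like. The list type, the field names and the use of libuuid are my own assumptions for illustration; the actual quota translator structures will differ. It simply picks the smallest uuid among a brick's in-progress writes, the last approved candidate (if any) and an incoming candidate.

/*
 * Hypothetical sketch - not the quota translator's actual code.
 * Build (assuming libuuid is available): cc least_uuid_sketch.c -luuid
 */
#include <stdio.h>
#include <uuid/uuid.h>

struct in_progress_write {
        uuid_t                    id;    /* cluster-wide unique id of the write */
        struct in_progress_write *next;
};

/* Pick the smallest uuid among the in-progress writes, the last approved
 * candidate (a zeroed uuid means "none yet") and the incoming candidate. */
static void
least_of_uuids (struct in_progress_write *list, const uuid_t last_winner,
                const uuid_t incoming, uuid_t winner)
{
        struct in_progress_write *w = NULL;

        uuid_copy (winner, incoming);

        if (!uuid_is_null (last_winner) && uuid_compare (last_winner, winner) < 0)
                uuid_copy (winner, last_winner);

        for (w = list; w; w = w->next) {
                if (uuid_compare (w->id, winner) < 0)
                        uuid_copy (winner, w->id);
        }
}

int
main (void)
{
        struct in_progress_write w1, w2;
        uuid_t                   incoming, last_winner, winner;
        char                     str[37];

        uuid_generate (w1.id);
        uuid_generate (w2.id);
        w1.next = &w2;
        w2.next = NULL;

        uuid_generate (incoming);
        uuid_clear (last_winner);        /* no previously approved candidate */

        least_of_uuids (&w1, last_winner, incoming, winner);

        uuid_unparse (winner, str);
        printf ("winning candidate: %s\n", str);
        return 0;
}

The only property that matters here is that every brick and every cluster translator applies exactly the same comparison, so that all of them converge on the same single winner.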


/* Both enforcer and get_dir_size are executed in the brick process. I've left out the aggregation logic of get_dir_size in cluster translators like dht. */
enforcer ()
{
    /* Note that this logic is executed independently for each directory on which quota limit is set. All the in-progress writes, sizes, candidates are valid in the context of 
       that directory 
     */

    my_delta = iov_length (input_iovec, input_count);
    my_id = getuuid();

    add_my_delta_to_in_progress_size ();

    get_dir_size (my_id, directory, &size, &in_progress_size, &cluster_candidate);

    in_progress_size -= my_delta;

    if (((size + my_delta) < quota_limit) && ((size + in_progress_size + my_delta) > quota_limit)) {

          /* we've to choose among in-progress writes */
  
          brick_candidate = least_of_uuids (directory->in_progress_write_list, directory->last_winning_candidate);

          if ((my_id == cluster_candidate) && (my_id == brick_candidate)) {
              /* 1. subtract my_delta from per-brick in-progress writes
                 2. add my_delta to per-brick sizes of all parents
                 3. allow-write

                 fetching brick_candidate above and steps 1 and 2 should be done atomically
              */
          } else {
              /* 1. subtract my_delta from per-brick in-progress writes
                 2. fail_write
               */
          }

    } else if ((size + my_delta) < quota_limit) {
              /* 1. subtract my_delta from per-brick in-progress writes
                 2. add my_delta to per-brick sizes of all parents
                 3. allow-write

                 1 and 2 should be done atomically
              */
    } else {

           /* subtract my_delta from per-brick in-progress writes and fail the write */
           fail_write ();

    }

}       

get_dir_size (IN incoming_candidate_id, IN directory, OUT *size, OUT *in_progress_size, OUT *winning_candidate)
{
     directory->last_winning_candidate = *winning_candidate = least_uuid (directory->in_progress_write_list, incoming_candidate_id);

     ....
}
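For the part I've left out above - the aggregation that cluster translators like dht would do over the per-brick replies of get_dir_size - I imagine something roughly like the sketch below: sizes and in-progress sizes add up across subvolumes, while the cluster-wide candidate is chosen with the same least-uuid rule the bricks use. The reply structure and all names here are invented purely for illustration; they are not dht's actual types.

/*
 * Hypothetical sketch - not dht's actual aggregation code.
 * Build (assuming libuuid is available): cc aggregate_sketch.c -luuid
 */
#include <stdint.h>
#include <stdio.h>
#include <uuid/uuid.h>

struct dir_size_reply {
        uint64_t size;              /* accounted size on this brick           */
        uint64_t in_progress_size;  /* sum of in-progress writes on the brick */
        uuid_t   candidate;         /* this brick's winning candidate         */
};

/* Sum up sizes and in-progress sizes and pick the smallest candidate uuid.
 * Assumes at least one reply. */
static void
aggregate_dir_size (struct dir_size_reply *replies, int count,
                    uint64_t *size, uint64_t *in_progress_size,
                    uuid_t cluster_candidate)
{
        int i = 0;

        *size = 0;
        *in_progress_size = 0;
        uuid_copy (cluster_candidate, replies[0].candidate);

        for (i = 0; i < count; i++) {
                *size += replies[i].size;
                *in_progress_size += replies[i].in_progress_size;

                if (uuid_compare (replies[i].candidate, cluster_candidate) < 0)
                        uuid_copy (cluster_candidate, replies[i].candidate);
        }
}

int
main (void)
{
        struct dir_size_reply replies[2];
        uint64_t              size = 0, in_progress = 0;
        uuid_t                winner;
        char                  str[37];

        replies[0].size = 3 * 1024 * 1024;
        replies[0].in_progress_size = 5 * 1024 * 1024;
        uuid_generate (replies[0].candidate);

        replies[1].size = 1 * 1024 * 1024;
        replies[1].in_progress_size = 5 * 1024 * 1024;
        uuid_generate (replies[1].candidate);

        aggregate_dir_size (replies, 2, &size, &in_progress, winner);

        uuid_unparse (winner, str);
        printf ("size=%llu in-progress=%llu candidate=%s\n",
                (unsigned long long)size, (unsigned long long)in_progress, str);
        return 0;
}

Because the same comparison is used at every level, at most one of the competing writes sees itself as both the cluster-wide and the per-brick candidate, so instead of both 5MB writes in the example above failing, at most one of them is allowed through and the other is failed.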

Comments?

[1] http://www.gluster.org/pipermail/gluster-devel/2015-May/045194.html
[2] http://review.gluster.org/#/c/6220/

regards,
Raghavendra.

