[Gluster-devel] Enhancing Quota enforcement during parallel writes

Raghavendra Gowdappa rgowdapp at redhat.com
Fri May 22 05:24:03 UTC 2015



----- Original Message -----
> From: "Raghavendra Gowdappa" <rgowdapp at redhat.com>
> To: "Gluster Devel" <gluster-devel at gluster.org>
> Cc: "Vijaikumar Mallikarjuna" <vmallika at redhat.com>, "Sachin Pandit" <spandit at redhat.com>
> Sent: Friday, 22 May, 2015 10:50:14 AM
> Subject: Enhancing Quota enforcement during parallel writes
> 
> All,
> 
> As pointed out by [1], parallel writes can result in incorrect quota enforcement.
> [2] was an (unsuccessful) attempt to solve the issue. Some points about [2]:
> 
> in_progress_writes is updated _after_ we fetch the size. Because of this, two
> writes can see the same size, so the issue is not solved. What we should be
> doing is updating in_progress_writes _before_ we fetch the size. If we do
> this, it is guaranteed that at least one write sees the other's size
> accounted in in_progress_writes. This approach has two issues:
> 
> 1. Since we have already added the current write's size to
> in_progress_writes, the current write would already be accounted for in the
> in-progress size of the directory. This is a minor issue and can be solved
> by subtracting the size of the current write from the resultant
> cluster-wide in-progress size of the directory.
> 
> 2. We might prematurely fail writes even though some space is available.
> Assume there is 5MB of free space. If two 5MB writes are issued in
> parallel, both might fail, as each might see the other's size already
> accounted for, even though neither has succeeded.

Of course, if the logic that follows seems too complicated, we can live with this limitation, since we would be erring on the conservative side.
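
To make the limitation concrete, here is a minimal single-process sketch of the conservative variant (announce the write first, then check). The struct and function names (dir_quota, announce_write, conservative_check, settle_write) are hypothetical and not taken from the quota translator. The check uses an inclusive `<=` so that a 5-unit write fits exactly into 5 units of free space; that comparison is an assumption made for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-directory accounting; field names are illustrative. */
struct dir_quota {
    uint64_t limit;        /* quota limit */
    uint64_t size;         /* space already accounted to the directory */
    uint64_t in_progress;  /* sum of deltas of writes that are being checked */
};

/* Step 1: account the write in in_progress _before_ the size is fetched. */
static void announce_write(struct dir_quota *d, uint64_t delta)
{
    d->in_progress += delta;
}

/* Step 2: conservative check. Our own delta is subtracted from the
 * in-progress total first, so that it is not counted twice. */
static bool conservative_check(const struct dir_quota *d, uint64_t my_delta)
{
    assert(d->in_progress >= my_delta);  /* write must have been announced */
    uint64_t others = d->in_progress - my_delta;
    /* Inclusive limit: an assumption made for this sketch. */
    return d->size + others + my_delta <= d->limit;
}

/* Step 3: drop the write from in_progress; account it in size if allowed. */
static void settle_write(struct dir_quota *d, uint64_t delta, bool allowed)
{
    d->in_progress -= delta;
    if (allowed)
        d->size += delta;
}
```

With a limit of 10, a size of 5 and two writes of 5 announced before either is checked, both checks fail, although either write would succeed if issued alone.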

> To solve this issue, I am
> proposing the following algorithm:
> 
>    * We assign each write an identity that is unique across the cluster,
>    say a uuid.
>    * Among all the in-progress writes we pick one write. The policy can be
>    any arbitrary criterion, such as the smallest of all the uuids. Each
>    brick selects a candidate among its own in-progress writes _AND_ the
>    incoming candidate (see the pseudocode of get_dir_size below for more
>    clarity). It sends back this candidate along with the size of the
>    directory. The brick also remembers the last candidate it approved.
>    Clustering translators like dht pick one write among these replies,
>    using the same logic the bricks used. So, along with the size, we also
>    get a candidate chosen from the in-progress writes. However, a new
>    write that could be the candidate might arrive on the brick in the
>    time window in which we fetch the size. We should therefore compare the
>    resultant cluster-wide candidate with the per-brick candidate. So, the
>    enforcement logic will be as below:
> 
> 
> /* Both enforcer and get_dir_size are executed in brick process. I've left
> out logic of get_dir_size in cluster translators like dht */
> enforcer ()
> {
>     /* Note that this logic is executed independently for each directory on
>        which a quota limit is set. All the in-progress writes, sizes and
>        candidates are valid in the context of that directory.
>      */
> 
>     my_delta = iov_length (input_iovec, input_count);
>     my_id = getuuid();
> 
>     add_my_delta_to_in_progress_size ();
> 
>     get_dir_size (my_id, &size, &in_progress_size, &cluster_candidate);
> 
>     in_progress_size -= my_delta;
> 
>     if (((size + my_delta) < quota_limit) && ((size + in_progress_size +
>     my_delta) > quota_limit)) {
> 
>           /* we've to choose among in-progress writes */
>   
>           brick_candidate = least_of_uuids
>           (directory->in_progress_write_list,
>           directory->last_winning_candidate);
> 
>           if ((my_id == cluster_candidate) && (my_id == brick_candidate)) {
>               /* 1. subtract my_delta from per-brick in-progress writes
>                  2. add my_delta to per-brick sizes of all parents
>                  3. allow-write
> 
>                  getting brick_candidate above, 1 and 2 should be done
>                  atomically
>               */
>           } else {
>               /* 1. subtract my_delta from per-brick in-progress writes
>                  2. fail_write
>                */
>           }
>     } else if ((size + my_delta) < quota_limit) {
>               /* 1. subtract my_delta from per-brick in-progress writes
>                  2. add my_delta to per-brick sizes of all parents
>                  3. allow-write
> 
>                  1 and 2 should be done atomically
>               */
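>     } else {
>            /* subtract my_delta from per-brick in-progress writes before
>               failing, as in the other failure branch */
>            fail_write ();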
> 
>     }
> 
> }
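
The three-way branch in enforcer () can be read as a pure classification of the write. Below is a small sketch of just that decision, keeping the strict comparisons from the pseudocode. The enum and the name classify_write are hypothetical, and others_in_progress stands for in_progress_size after my_delta has been subtracted.

```c
#include <assert.h>
#include <stdint.h>

enum quota_verdict {
    QUOTA_ALLOW,      /* fits even if every in-progress write succeeds */
    QUOTA_ARBITRATE,  /* fits alone, but not together with the others */
    QUOTA_FAIL        /* does not fit even alone */
};

/* Classify one write against the quota limit of a single directory.
 * The comparisons mirror those in the enforcer () pseudocode. */
static enum quota_verdict
classify_write(uint64_t size, uint64_t others_in_progress,
               uint64_t my_delta, uint64_t quota_limit)
{
    /* Guard against wrap-around; the pseudocode does not discuss it. */
    assert(size <= UINT64_MAX - my_delta);
    assert(others_in_progress <= UINT64_MAX - size - my_delta);

    if (size + my_delta >= quota_limit)
        return QUOTA_FAIL;
    if (size + others_in_progress + my_delta > quota_limit)
        return QUOTA_ARBITRATE;
    return QUOTA_ALLOW;
}
```

Only the QUOTA_ARBITRATE case needs the candidate comparison; the other two cases can be decided from the sizes alone.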
> 
> get_dir_size (IN incoming_candidate_id, IN directory, OUT *winning_candidate,
> ...)
> {
>      directory->last_winning_candidate = *winning_candidate = least_of_uuids
>      (directory->in_progress_write_list, incoming_candidate_id);
> 
>      ....
> }
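
As an illustration of the "smallest of all the uuids" policy, here is a sketch of candidate selection and of the final win check. It assumes ids are 16 opaque bytes compared with memcmp; struct write_id, least_of_ids and wins_arbitration are hypothetical names, and a real implementation would probably use the uuid helpers already available in the code base.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define WRITE_ID_LEN 16

/* Hypothetical cluster-wide unique id of a write (e.g. a uuid). */
struct write_id {
    unsigned char bytes[WRITE_ID_LEN];
};

/* Return the smallest id (byte-wise) among list[0..n-1] and the incoming
 * candidate. incoming may be NULL when no candidate has been seen yet. */
static const struct write_id *
least_of_ids(const struct write_id *list, size_t n,
             const struct write_id *incoming)
{
    assert(list != NULL || n == 0);
    const struct write_id *best = incoming;
    for (size_t i = 0; i < n; i++) {
        if (best == NULL ||
            memcmp(list[i].bytes, best->bytes, WRITE_ID_LEN) < 0)
            best = &list[i];
    }
    return best;
}

static int same_id(const struct write_id *a, const struct write_id *b)
{
    return a != NULL && b != NULL &&
           memcmp(a->bytes, b->bytes, WRITE_ID_LEN) == 0;
}

/* A write goes through only if it is both the cluster-wide candidate
 * and the candidate of the local brick. */
static int wins_arbitration(const struct write_id *my_id,
                            const struct write_id *cluster_candidate,
                            const struct write_id *brick_candidate)
{
    return same_id(my_id, cluster_candidate) &&
           same_id(my_id, brick_candidate);
}
```

Since every brick and every clustering translator applies the same ordering, they all converge on the same winner for a given set of in-progress writes.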
> 
> Comments?
> 
> [1] http://www.gluster.org/pipermail/gluster-devel/2015-May/045194.html
> [2] http://review.gluster.org/#/c/6220/
> 
> regards,
> Raghavendra.

