[Gluster-devel] Enhancing Quota enforcement during parallel writes
Raghavendra Gowdappa
rgowdapp at redhat.com
Fri May 22 05:24:03 UTC 2015
----- Original Message -----
> From: "Raghavendra Gowdappa" <rgowdapp at redhat.com>
> To: "Gluster Devel" <gluster-devel at gluster.org>
> Cc: "Vijaikumar Mallikarjuna" <vmallika at redhat.com>, "Sachin Pandit" <spandit at redhat.com>
> Sent: Friday, 22 May, 2015 10:50:14 AM
> Subject: Enhancing Quota enforcement during parallel writes
>
> All,
>
> As pointed out in [1], parallel writes can result in incorrect quota enforcement.
> [2] was an (unsuccessful) attempt to solve the issue. Some points about [2]:
>
> in_progress_writes is updated _after_ we fetch the size. Due to this, two
> writes can see the same size and hence the issue is not solved. What we
> should be doing is to update in_progress_writes before we fetch the
> size. If we do this, it is guaranteed that at least one write sees the
> other's size accounted in in_progress_writes. This approach has two issues:
>
> 1. Since we have already added the current write's size to
> in_progress_writes, the current write would already be accounted in the
> in-progress size of the directory. This is a minor issue and can be solved
> by subtracting the size of the current write from the resultant
> cluster-wide in-progress size of the directory.
>
> 2. We might prematurely fail writes even though some space is available.
> Assume there is 5MB of free space. If two 5MB writes are issued in
> parallel, both might fail, as each might see the other's size already
> accounted, though neither has succeeded.
Of course, if the following logic seems too complicated, we can live with this limitation, since we would be erring on the conservative side.
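
The premature failure described in point 2 can be seen in a minimal C sketch. The struct and function names here are hypothetical and not taken from the quota translator, and the check uses `<=` (an assumption, so that a 5MB write into exactly 5MB of free space is allowed; the quoted pseudocode uses a strict `<`).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MB (1024ULL * 1024ULL)

/* Hypothetical per-directory accounting; illustrative names only. */
struct dir_usage {
    uint64_t size;         /* bytes already accounted for the directory */
    uint64_t in_progress;  /* bytes of writes registered but not yet decided */
    uint64_t limit;        /* quota limit in bytes */
};

/* Register the write before the size is fetched. */
static void register_write(struct dir_usage *d, uint64_t delta)
{
    d->in_progress += delta;
}

/*
 * Conservative check: every other in-progress write is counted as if it
 * had already succeeded. The write's own delta is removed from the
 * in-progress total first (point 1 in the quoted mail).
 */
static bool conservative_allows(const struct dir_usage *d, uint64_t my_delta)
{
    assert(d->in_progress >= my_delta);  /* the write must be registered */
    uint64_t others = d->in_progress - my_delta;
    return d->size + others + my_delta <= d->limit;
}
```

With 95MB used out of a 100MB limit, a lone registered 5MB write passes this check. Once a second 5MB write is registered, each write sees the other's 5MB and the check fails for both, although one of them could have been admitted.
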
> To solve this issue, I am
> proposing the following algorithm:
>
> * We assign each write an identity that is unique across the cluster,
>   say a uuid.
> * Among all the in-progress writes we pick one write. The policy used can be
>   an arbitrary criterion, like the smallest of all the uuids. So, each brick
>   selects a candidate among its own in-progress writes _AND_ the incoming
>   candidate (see the pseudocode of get_dir_size below for more clarity). It
>   sends back this candidate along with the size of the directory. The brick
>   also remembers the last candidate it approved. Clustering translators
>   like dht pick one write among these replies, using the same logic the
>   bricks had used. Now, along with the size, we also get a candidate chosen
>   from the in-progress writes. However, there might be a new write on the
>   brick in the time window where we try to fetch the size, and it could be
>   the candidate. We should compare the resultant cluster-wide candidate
>   with the per-brick candidate. So, the enforcement logic will be as below:
>
>
> /* Both enforcer and get_dir_size are executed in the brick process. I've
>    left out the logic of get_dir_size in cluster translators like dht. */
> enforcer ()
> {
>     /* Note that this logic is executed independently for each directory on
>        which a quota limit is set. All the in-progress writes, sizes and
>        candidates are valid in the context of that directory.
>     */
>
>     my_delta = iov_length (input_iovec, input_count);
>     my_id = getuuid ();
>
>     add_my_delta_to_in_progress_size ();
>
>     get_dir_size (my_id, &size, &in_progress_size, &cluster_candidate);
>
>     in_progress_size -= my_delta;
>
>     if (((size + my_delta) < quota_limit) &&
>         ((size + in_progress_size + my_delta) > quota_limit)) {
>
>         /* we've to choose among in-progress writes */
>
>         brick_candidate = least_of_uuids (directory->in_progress_write_list,
>                                           directory->last_winning_candidate);
>
>         if ((my_id == cluster_candidate) && (my_id == brick_candidate)) {
>             /* 1. subtract my_delta from per-brick in-progress writes
>                2. add my_delta to per-brick sizes of all parents
>                3. allow-write
>
>                getting brick_candidate above, 1 and 2 should be done
>                atomically
>             */
>         } else {
>             /* 1. subtract my_delta from per-brick in-progress writes
>                2. fail_write
>             */
>         }
>     } else if ((size + my_delta) < quota_limit) {
>         /* 1. subtract my_delta from per-brick in-progress writes
>            2. add my_delta to per-brick sizes of all parents
>            3. allow-write
>
>            1 and 2 should be done atomically
>         */
>     } else {
>
>         fail_write ();
>
>     }
>
> }
>
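
The three-way branch in enforcer can be condensed into a pure function, which may make the cases easier to check by hand. This is only a sketch: `decide`, `enum verdict` and the parameter names are hypothetical, the comparisons mirror the quoted pseudocode, and the atomic bookkeeping (updating in-progress and per-parent sizes) is deliberately left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum verdict { ALLOW_WRITE, FAIL_WRITE };

/*
 * others_in_progress is the cluster-wide in-progress size with my_delta
 * already subtracted, as done in the quoted enforcer pseudocode.
 */
static enum verdict decide(uint64_t size, uint64_t others_in_progress,
                           uint64_t my_delta, uint64_t quota_limit,
                           bool is_cluster_candidate, bool is_brick_candidate)
{
    /* Sanity checks for this sketch: the sums below must not wrap around. */
    assert(size <= UINT64_MAX - my_delta);
    assert(others_in_progress <= UINT64_MAX - size - my_delta);

    bool fits_alone = (size + my_delta) < quota_limit;
    bool contested  = (size + others_in_progress + my_delta) > quota_limit;

    if (!fits_alone)
        return FAIL_WRITE;   /* over the limit even with no other writes */
    if (!contested)
        return ALLOW_WRITE;  /* fits even if all other writes succeed */

    /* Contested window: only the write chosen at both levels proceeds. */
    return (is_cluster_candidate && is_brick_candidate) ? ALLOW_WRITE
                                                        : FAIL_WRITE;
}
```

For example, with size 94, a limit of 100 and two competing writes of 5 each (units are arbitrary), both writes fit alone but not together, so only the one that is both the cluster candidate and the brick candidate is allowed.
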
> get_dir_size (IN incoming_candidate_id, IN directory, OUT *winning_candidate,
>               ...)
> {
>     directory->last_winning_candidate = *winning_candidate =
>         least_of_uuids (directory->in_progress_write_list,
>                         incoming_candidate_id);
>
>     ....
> }
>
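
A minimal sketch of the "smallest uuid wins" selection, assuming a uuid is stored as 16 raw bytes and compared bytewise. `struct write_id` and this signature of `least_of_uuids` are hypothetical stand-ins for whatever representation the translator would actually use. Because taking a minimum is associative, applying the same rule on each brick and then again over the brick replies yields the cluster-wide smallest id.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical write identity: 16 raw uuid bytes. */
struct write_id {
    unsigned char bytes[16];
};

static int id_cmp(const struct write_id *a, const struct write_id *b)
{
    return memcmp(a->bytes, b->bytes, sizeof a->bytes);
}

/*
 * Return the smallest id among in_progress[0..n-1] and the optional
 * incoming candidate (which may be NULL). Returns NULL only when there
 * is nothing to choose from.
 */
static const struct write_id *
least_of_uuids(const struct write_id *in_progress, size_t n,
               const struct write_id *incoming)
{
    assert(in_progress != NULL || n == 0);

    const struct write_id *best = incoming;
    for (size_t i = 0; i < n; i++) {
        if (best == NULL || id_cmp(&in_progress[i], best) < 0)
            best = &in_progress[i];
    }
    return best;
}
```

A brick would call this with its own in-progress list and the incoming candidate; a cluster translator could call it over the candidates returned by the bricks, passing NULL as the incoming candidate.
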
> Comments?
>
> [1] http://www.gluster.org/pipermail/gluster-devel/2015-May/045194.html
> [2] http://review.gluster.org/#/c/6220/
>
> regards,
> Raghavendra.