[Gluster-devel] Throttling xlator on the bricks

Pranith Kumar Karampuri pkarampu at redhat.com
Wed Mar 9 03:15:35 UTC 2016



On 02/13/2016 09:06 AM, Pranith Kumar Karampuri wrote:
>
>
> On 02/13/2016 12:13 AM, Richard Wareing wrote:
>> Hey Ravi,
>>
>> I'll ping Shreyas about this today.  There's also a patch we'll need
>> for multi-threaded SHD to fix the least-pri queuing: the PID of the
>> process wasn't tagged correctly via the call frame in my original
>> patch.  The patch below fixes this (for 3.6.3); I didn't see
>> multi-threaded self-heal on github/master yet, so let me know which
>> branch you need this patch on and I can come up with a clean patch.
>
> Hi Richard,
>          I reviewed the patch and found that the same needs to be done
> for ec as well. So I am thinking of splitting it into two patches: one
> patch in syncop-utils that builds the parallelization functionality,
> and another that uses it in afr and ec. Do you mind if I give it a go?
> I can complete it by the end of Wednesday.

Richard,
        After starting this work, I found a few things we can improve in
this patch based on the results we got in testing.
1) We read all the indices before we launch self-heals. In some customer
cases I worked on, there were almost 5 million files/directories that
needed heal; with such a big number the self-heal daemon would get
OOM-killed if we went that route. So I modified this to launch heals
based on a queue-length limit.
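
To make the idea concrete, here is a rough, self-contained sketch of
that bounded-queue scheme (plain pthreads; the names and limits are made
up, it is not the actual shd code). The crawler blocks once the queue is
full, so memory stays bounded no matter how many indices are pending:

#include <pthread.h>
#include <stdio.h>

#define MAX_QLEN    8      /* stands in for a wait-qlength limit */
#define NUM_HEALERS 4      /* stands in for shd-max-threads      */

static char            queue[MAX_QLEN][64];   /* pending gfids    */
static int             qlen, qhead, crawl_done;
static pthread_mutex_t lock      = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  not_full  = PTHREAD_COND_INITIALIZER;
static pthread_cond_t  not_empty = PTHREAD_COND_INITIALIZER;

/* Crawler side: block instead of buffering an unbounded entry list. */
static void
enqueue_heal (const char *gfid)
{
        pthread_mutex_lock (&lock);
        while (qlen == MAX_QLEN)
                pthread_cond_wait (&not_full, &lock);
        snprintf (queue[(qhead + qlen) % MAX_QLEN], 64, "%s", gfid);
        qlen++;
        pthread_cond_signal (&not_empty);
        pthread_mutex_unlock (&lock);
}

/* Healer side: drain one entry at a time and "heal" it. */
static void *
healer (void *arg)
{
        char gfid[64];

        (void) arg;
        for (;;) {
                pthread_mutex_lock (&lock);
                while (qlen == 0 && !crawl_done)
                        pthread_cond_wait (&not_empty, &lock);
                if (qlen == 0) {        /* crawl finished, queue drained */
                        pthread_mutex_unlock (&lock);
                        return NULL;
                }
                snprintf (gfid, 64, "%s", queue[qhead]);
                qhead = (qhead + 1) % MAX_QLEN;
                qlen--;
                pthread_cond_signal (&not_full);
                pthread_mutex_unlock (&lock);
                printf ("healing %s\n", gfid);  /* real self-heal here */
        }
}

int
main (void)
{
        pthread_t tid[NUM_HEALERS];
        char      gfid[64];
        int       i;

        for (i = 0; i < NUM_HEALERS; i++)
                pthread_create (&tid[i], NULL, healer, NULL);
        for (i = 0; i < 100; i++) {             /* pretend index crawl */
                snprintf (gfid, 64, "gfid-%04d", i);
                enqueue_heal (gfid);
        }
        pthread_mutex_lock (&lock);
        crawl_done = 1;
        pthread_cond_broadcast (&not_empty);
        pthread_mutex_unlock (&lock);
        for (i = 0; i < NUM_HEALERS; i++)
                pthread_join (tid[i], NULL);
        return 0;
}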

2) We found that for directory hierarchies, the multi-threaded self-heal
patch was not giving better results than single-threaded self-heal
because of ordering problems. We improved the index xlator to also give
the gfid type, to make sure that all directories in the indices are
healed before the files that follow them in that iteration of readdir
output (http://review.gluster.org/13553). In our testing this led to
zero self-heal errors, as we were doing self-heals in parallel only for
files, not directories. I think we can further improve self-heal speed
for directories by doing name heals in parallel, using techniques
similar to the ones your patch showed. I think the best approach there
would be to introduce synccond_t infra (pthread_cond_t-like infra for
syncops), which I am planning to implement for future releases.
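
For illustration, the per-batch ordering is roughly the following; this
is a standalone sketch with placeholder heal calls, not the actual afr
code:

#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

struct index_entry {
        char gfid[40];
        bool is_dir;    /* the gfid type the index xlator now records */
};

/* Stubs standing in for the real heal calls. */
static void
heal_now (const char *gfid)
{
        printf ("dir heal  %s\n", gfid);
}

static void
queue_parallel_heal (const char *gfid)
{
        printf ("file heal %s\n", gfid);
}

/* Process one readdir batch from the xattrop index: directories first,
 * in order and single-threaded, so that the hierarchy is in place
 * before the healer threads start on the plain files. */
static void
process_index_batch (struct index_entry *entries, size_t count)
{
        size_t i;

        for (i = 0; i < count; i++)
                if (entries[i].is_dir)
                        heal_now (entries[i].gfid);

        for (i = 0; i < count; i++)
                if (!entries[i].is_dir)
                        queue_parallel_heal (entries[i].gfid);
}

int
main (void)
{
        struct index_entry batch[] = {
                { "gfid-file-1", false },
                { "gfid-dir-a",  true  },
                { "gfid-file-2", false },
                { "gfid-dir-b",  true  },
        };

        process_index_batch (batch, sizeof (batch) / sizeof (batch[0]));
        return 0;
}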

3) Based on 1), 2), and the fact that afr already retries the indices in
a loop, I removed the extra retries in the threads.

4) After the refactor, the changes required to bring in multi-threaded
self-heal for ec should be just ~10 lines, most of them option
initialization.

Our tests found that we are able to easily saturate the network :-). The
WIP patch implementing this functionality is at
http://review.gluster.org/13569; you will see that quite a few of the
ideas are borrowed from your original patch. I can't thank you enough
for it!! I am planning to complete the work by the end of this week.
After this, we are looking to take a look at your unsplit-brain patch.

Pranith
>
> Pranith
>>
>> Richard
>>
>>
>> =====
>>
>>
>> diff --git a/xlators/cluster/afr/src/afr-self-heald.c 
>> b/xlators/cluster/afr/src/afr-self-heald.c
>> index 028010d..b0f6248 100644
>> --- a/xlators/cluster/afr/src/afr-self-heald.c
>> +++ b/xlators/cluster/afr/src/afr-self-heald.c
>> @@ -532,6 +532,9 @@ afr_mt_process_entries_done (int ret, 
>> call_frame_t *sync_frame,
>>                   pthread_cond_signal (&mt_data->task_done);
>>           }
>>           pthread_mutex_unlock (&mt_data->lock);
>> +
>> +        if (task_ctx->frame)
>> +                AFR_STACK_DESTROY (task_ctx->frame);
>>           GF_FREE (task_ctx);
>>           return 0;
>>   }
>> @@ -787,6 +790,7 @@ _afr_mt_create_process_entries_task (xlator_t *this,
>>           int                                   ret = -1;
>>           afr_mt_process_entries_task_ctx_t     *task_ctx;
>>           afr_mt_data_t                         *mt_data;
>> +        call_frame_t                          *frame = NULL;
>>
>>           mt_data = &healer->mt_data;
>>
>> @@ -799,6 +803,8 @@ _afr_mt_create_process_entries_task (xlator_t *this,
>>           if (!task_ctx)
>>                   goto err;
>>
>> +        task_ctx->frame = afr_frame_create (this);
>> +
>>           INIT_LIST_HEAD (&task_ctx->list);
>>           task_ctx->readdir_xl = this;
>>           task_ctx->healer = healer;
>> @@ -812,7 +818,7 @@ _afr_mt_create_process_entries_task (xlator_t *this,
>>           // This returns immediately, and 
>> afr_mt_process_entries_done will
>>           // be called when the task is completed e.g. our queue is 
>> empty
>>           ret = synctask_new (this->ctx->env, 
>> afr_mt_process_entries_task,
>> -                afr_mt_process_entries_done, NULL,
>> +                afr_mt_process_entries_done, task_ctx->frame,
>>                   (void *)task_ctx);
>>
>>           if (!ret) {
>> diff --git a/xlators/cluster/afr/src/afr-self-heald.h 
>> b/xlators/cluster/afr/src/afr-self-heald.h
>> index 817e712..1588fc8 100644
>> --- a/xlators/cluster/afr/src/afr-self-heald.h
>> +++ b/xlators/cluster/afr/src/afr-self-heald.h
>> @@ -74,6 +74,7 @@ typedef struct afr_mt_process_entries_task_ctx_ {
>>           subvol_healer_t         *healer;
>>           xlator_t                *readdir_xl;
>>           inode_t                 *idx_inode;  /* inode ref for 
>> xattrop dir */
>> +        call_frame_t            *frame;
>>           unsigned int            entries_healed;
>>           unsigned int            entries_processed;
>>           unsigned int            already_healed;
>>
>>
>> Richard
>> ________________________________________
>> From: Ravishankar N [ravishankar at redhat.com]
>> Sent: Sunday, February 07, 2016 11:15 PM
>> To: Shreyas Siravara
>> Cc: Richard Wareing; Vijay Bellur; Gluster Devel
>> Subject: Re: [Gluster-devel] Throttling xlator on the bricks
>>
>> Hello,
>>
>> On 01/29/2016 06:51 AM, Shreyas Siravara wrote:
>>> So the way our throttling works is (intentionally) very simplistic.
>>>
>>> (1) When someone mounts an NFS share, we tag the frame with a 32 bit 
>>> hash of the export name they were authorized to mount.
>>> (2) io-stats keeps track of the "current rate" of fops we're seeing 
>>> for that particular mount, using a sampling of fops and a moving 
>>> average over a short period of time.
>>> (3) Based on whether the share violated its allowed rate (which is 
>>> defined in a config file), we tag the FOP as "least-pri". Of course 
>>> this makes the assumption that all NFS endpoints are receiving 
>>> roughly the same # of FOPs. The rate defined in the config file is a 
>>> *per* NFS endpoint number. So if your cluster has 10 NFS endpoints, 
>>> and you've pre-computed that it can do roughly 1000 FOPs per second, 
>>> the rate in the config file would be 100.
>>> (4) IO-Threads then shoves the FOP into the least-pri queue, rather 
>>> than its default. The value is honored all the way down to the bricks.
>>>
>>> The code is actually complete, and I'll put it up for review after 
>>> we iron out a few minor issues.
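
If I am reading the above correctly, the core of (2) and (3) amounts to
something like this rough, self-contained sketch (made-up names, and a
plain fixed window standing in for the sampling/moving average; it is
not the actual io-stats code):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>

#define WINDOW_SECS 5              /* length of the rate window        */

struct mount_rate {
        uint32_t export_hash;      /* 32-bit hash tagged at mount time */
        uint64_t allowed_fops_sec; /* per-NFS-endpoint rate from config */
        time_t   window_start;     /* start of the current window      */
        uint64_t fops_in_window;   /* fops counted in this window      */
};

/* Returns true if io-threads should put this fop in its least-pri queue. */
static bool
fop_is_least_pri (struct mount_rate *mr)
{
        time_t now = time (NULL);

        if (now - mr->window_start >= WINDOW_SECS) {
                /* Roll the window over; the description above mentions
                 * fop sampling and a moving average, so treat this
                 * fixed window as a simplification. */
                mr->window_start   = now;
                mr->fops_in_window = 0;
        }
        mr->fops_in_window++;

        return mr->fops_in_window > mr->allowed_fops_sec * WINDOW_SECS;
}

int
main (void)
{
        struct mount_rate mr = { 0xabcd1234, 100, time (NULL), 0 };
        int i, demoted = 0;

        for (i = 0; i < 10000; i++)
                if (fop_is_least_pri (&mr))
                        demoted++;
        printf ("%d of 10000 fops would go least-pri\n", demoted);
        return 0;
}
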
>> Did you get a chance to send the patch? Just wanted to run some tests
>> and see if this is all we need at the moment to regulate shd traffic,
>> especially with Richard's multi-threaded heal patch
>> http://review.gluster.org/#/c/13329/
>> being revived and made ready for 3.8.
>>
>> -Ravi
>>
>>>> On Jan 27, 2016, at 9:48 PM, Ravishankar N <ravishankar at redhat.com> 
>>>> wrote:
>>>>
>>>> On 01/26/2016 08:41 AM, Richard Wareing wrote:
>>>>> In any event, it might be worth having Shreyas detail his 
>>>>> throttling feature (that can throttle any directory hierarchy no 
>>>>> less) to illustrate how a simpler design can achieve similar 
>>>>> results to these more complicated (and, it follows, bug-prone)
>>>>> approaches.
>>>>>
>>>>> Richard
>>>> Hi Shreyas,
>>>>
>>>> Wondering if you can share the details of the throttling feature 
>>>> you're working on. Even if there's no code, a description of what 
>>>> it is trying to achieve and how will be great.
>>>>
>>>> Thanks,
>>>> Ravi
>> _______________________________________________
>> Gluster-devel mailing list
>> Gluster-devel at gluster.org
>> http://www.gluster.org/mailman/listinfo/gluster-devel
>


