[Gluster-devel] tiering: emergency demotions

Thu Oct 13 11:53:48 UTC 2016

Dilemma:
*without* my patch, demotions in degraded mode (hi-watermark breached)
happen every 10 seconds by listing *all* files not accessed in the
last 10 seconds and sorting them in ascending order of (write, read)
access time ... so the existing query could take more than a minute to
list the files if there are millions of them

*with* my patch, we currently select a random set of 20 files and
demote them ... even if they are actively used ... so we either wait
more than a minute, in the worst case, for the exact listing of cold
files, or trade accuracy for a quicker turnaround by demoting files
without any selection criterion, hot ones included

The exponential time window scheme for selecting files, discussed over
Google Hangout, has a problem: we know the end time of the window (the
current time), but there is no obvious way to decide its start time.
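
To make the start-time problem concrete, here is a minimal sketch of one
possible reading of such a scheme. The details of what was discussed on
the hangout are not recorded here, so the function name, the halving
step and every parameter below are assumptions for illustration only,
not the tiering daemon's code. The caller has to guess max_age, which is
exactly the undecided start of the window.

```python
def pick_cold_files(files, now, max_age, min_age=10, wanted=20):
    """Return up to `wanted` paths, coldest first.

    files   : dict mapping path -> last access time in seconds (hypothetical)
    max_age : assumed starting age of the window; this is the value
              the scheme has no obvious way to choose
    """
    age = max_age
    while age >= min_age:
        cutoff = now - age
        # Files whose last access is at or before the cutoff count as cold.
        cold = [path for path, atime in files.items() if atime <= cutoff]
        if len(cold) >= wanted or age == min_age:
            return sorted(cold, key=files.get)[:wanted]
        # Not enough candidates: halve the age so the window reaches
        # closer to the current time and admits warmer files.
        age = max(age // 2, min_age)
    return []
```

If max_age is guessed too large, the early passes find nothing and are
wasted; if it is guessed too small, the first pass already returns
nearly every file and nothing is gained over the full listing.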

So, I think it has to be one of the two strategies described above,
each with its own trade-off.

Comments are requested regarding the approach to take for the
implementation.

Rafi has also suggested avoiding file creation on the hot tier while
the hi-watermark is breached, to avoid further stress on storage
capacity and the eventual migration of those files to the cold tier.
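
Rafi's suggestion amounts to a placement check at create time. A rough
sketch, assuming the watermark is expressed as a percentage of hot tier
capacity; the function and parameter names are hypothetical and do not
correspond to existing tiering code:

```python
def tier_for_new_file(hot_used_bytes, hot_total_bytes, hi_watermark_pct):
    """Pick the tier on which a newly created file should land."""
    used_pct = 100.0 * hot_used_bytes / hot_total_bytes
    # While the hi-watermark is breached, send new files straight to the
    # cold tier instead of adding to the pressure on the hot tier.
    return "cold" if used_pct > hi_watermark_pct else "hot"
```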

Should we introduce demotion policies like "strict" and "approximate"
to let the user choose the demotion strategy?
1. strict
    Choosing this strategy could mean we wait for the full, ordered
    query to complete and only then start demoting, coldest file first

2. approximate
    Choosing this strategy could mean we take the first available
    file from the database query and demote it even if it is hot and
    being actively written to
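
The difference between the two policies can be sketched as follows.
This is an illustration only: the record layout (path, last_write,
last_read) and both function names are assumptions, and the batch size
of 20 simply mirrors the number of files the current patch demotes per
pass.

```python
from itertools import islice

def select_strict(records):
    """Consume the whole listing, then return paths coldest first.

    The cost grows with the total number of files, since nothing can be
    demoted until every record has been read and sorted.
    """
    ordered = sorted(records, key=lambda rec: (rec[1], rec[2]))
    return [path for path, _, _ in ordered]

def select_approximate(records, batch=20):
    """Return the first `batch` paths the query yields, hot or not.

    Only `batch` records are read, so turnaround is quick, but there is
    no guarantee that the chosen files are cold.
    """
    return [path for path, _, _ in islice(records, batch)]
```

With records arriving in the order hot.log, cold.dat, warm.db, the
strict policy demotes cold.dat first, while the approximate policy with
a batch of 2 demotes hot.log and cold.dat.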

Milind

On 08/12/2016 08:25 PM, Milind Changire wrote:
> Patch for review: http://review.gluster.org/15158
>
> Milind
>
> On 08/12/2016 07:27 PM, Milind Changire wrote:
>> On 08/10/2016 12:06 PM, Milind Changire wrote:
>>> Emergency demotions will be required whenever writes breach the
>>> hi-watermark. Emergency demotions are required to avoid ENOSPC in case
>>> of continuous writes that originate on the hot tier.
>>>
>>> There are two concerns in this area:
>>>
>>> 1. enforcing max-cycle-time during emergency demotions
>>>    max-cycle-time is the time the tiering daemon spends in promotions or
>>>    demotions
>>>    I tend to think that the tiering daemon should skip this check in
>>>    the emergency situation and continue demotions until usage drops
>>>    below the hi-watermark
>>
>> Update:
>> To keep matters simple and manageable, it has been decided to *enforce*
>> max-cycle-time to yield the worker threads to attend to impending tier
>> management tasks if the need arises.
>>
>>>
>>> 2. file demotion policy
>>>    I tend to think that the largest file with the most recent
>>>    *write* should be chosen for eviction when write-freq-threshold is
>>>    NON-ZERO.
>>>    Choosing the least written file is just going to delay migration
>>>    of an active file, which might consume hot tier disk space and
>>>    result in an ENOSPC in the worst case.
>>>    In cases where write-freq-threshold is ZERO, the most recently
>>>    *written* file can be chosen for eviction.
>>>    In the case of choosing the largest file within the
>>>    write-freq-threshold, a stat() on the files would be required to
>>>    calculate the number of files that need to be demoted to take the
>>>    watermark below the hi-watermark. Finding the number of most recently
>>>    written files to demote could also help make demotions in parallel
>>>    rather than in the sequential manner currently in place.
>>
>> Update:
>> The idea of choosing the files wrt file size has been dropped.
>> Iteratively, the most recently written file will be chosen for eviction
>> from the hot tier in case of a hi-watermark breach and until the
>> watermark drops below hi-watermark.
>> The idea of parallelizing multiple promotions/demotions has been
>> deferred.
>>
>> -----
>>
>> Sustained writes creating large files in the hot tier, which
>> cumulatively breach the hi-watermark, do NOT seem to be a good
>> workload for tiering. The assumption is that, to make the most of
>> the hot tier, the hi-watermark would be set close to 100%.
>> In this case a sustained large file copy might easily breach the
>> hi-watermark and may even consume the entire hot tier space, resulting
>> in an ENOSPC.
>>
>> e.g. a sustained write:
>>
>> # cp file1 /mnt/glustervol/dir
>>
>> Workloads that would seem to make the most of tiering are:
>> 1. Many smaller files, which are created in small bursts of write
>>    activity and then closed
>> 2. Few large files where updates are in-place and the file size
>>    does not grow beyond the hi-watermark, e.g. a database with a
>>    frequent in-line compaction/de-fragmentation policy enabled
>> 3. Frequent reads of few large files, mostly static in size, which
>>    cumulatively don't breach the hi-watermark. Frequently reading
>>    a large number of smaller, mostly static, files would be good
>>    tiering workload candidates as well.
>>
>>
>>>
>>> Comments are requested.
>>>
>> _______________________________________________
>> Gluster-devel mailing list
>> Gluster-devel at gluster.org
>> http://www.gluster.org/mailman/listinfo/gluster-devel