[Gluster-devel] tiering: emergency demotions

Thu Oct 13 12:45:58 UTC 2016

----- Original Message -----
> From: "Milind Changire" <mchangir at redhat.com>
> To: gluster-devel at gluster.org
> Sent: Thursday, October 13, 2016 7:53:48 AM
> Subject: Re: [Gluster-devel] tiering: emergency demotions
> 
> Dilemma:
> *without* my patch, the demotions in degraded (hi-watermark breached)
> mode happen every 10 seconds by listing *all* files colder than the
> last 10 seconds and sorting them in ascending order w.r.t. the
> (write,read) access time ... so the existing query could take more than
> a minute to list files if there are millions of them
> 
> *with* my patch we currently select a random set of 20 files and demote
> them ... even if they are actively used ... so we either wait for more
> than a minute for the exact listing of cold files in the worst case or
> trade off by demoting hot files without imposing a file selection
> criteria for a quicker turnaround time
> 
> The exponential time window schema to select files discussed over Google
> Hangout has an issue with deciding the start time of the time window,
> although we know the end time being the current time
> 
> So, I think it would be either of the strategies discussed above with a
> trade-off in one way or the other.
> 
> Comments are requested regarding the approach to take for the
> implementation.

Reaching a full hot tier is a catastrophic event; the operator can no longer use the volume. If we find ourselves getting close to this situation, we should use every means to get out of it as soon as possible. Performance is a secondary concern in this case.

Right now, the database query is O(n) in the number of files, so it will always take a long time (a minute or more) when there are many files (e.g. >10^6). This is only our current scheme and may change someday, but for now we must live with O(n).

On the other hand, the sample of files we choose to demote may or may not include a file that is being accessed. We could potentially avoid demoting "hot" files by skipping them in the approximate "sample" we take. The criterion for skipping could be an elastic window of time that grows to ensure we eventually demote enough data.
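
To make the elastic window idea concrete, here is a minimal sketch in Python. It is only one possible reading, not what the patch does: I assume the window of eligible access times starts out covering only files idle for a long time and grows toward the present until enough bytes have been picked. The names FileRecord and pick_demotion_candidates, and the max_idle, min_idle and shrink parameters, are hypothetical and not part of the tiering code.

```python
from dataclasses import dataclass


@dataclass
class FileRecord:
    path: str
    last_access: float  # seconds since the epoch
    size: int           # bytes


def pick_demotion_candidates(sample, bytes_needed, now,
                             max_idle=640.0, min_idle=10.0, shrink=2.0):
    """Pick files to demote from an unordered sample.

    A file is eligible when it has been idle for at least `idle` seconds.
    `idle` starts at max_idle and is divided by `shrink` after each pass,
    so the window of eligible access times grows toward `now`. Once the
    threshold would fall below min_idle it drops to 0, which means hot
    files are demoted as a last resort.
    """
    chosen = []
    chosen_paths = set()
    freed = 0
    idle = max_idle
    while True:
        for rec in sample:
            if rec.path in chosen_paths:
                continue  # already selected in an earlier pass
            if now - rec.last_access >= idle:
                chosen.append(rec)
                chosen_paths.add(rec.path)
                freed += rec.size
                if freed >= bytes_needed:
                    return chosen
        if idle <= 0:
            return chosen  # window covers everything; nothing left to add
        next_idle = idle / shrink
        idle = next_idle if next_idle >= min_idle else 0.0
```

Each pass only walks the small sample, never the full database, so the cost stays bounded by the sample size times the number of window steps. Hot files are touched only when the colder ones in the sample cannot free enough space.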

So I think the "approximate" solution is better, because the long query time (order of minutes) is something we cannot afford and must avoid, whereas the active file issue is something we can manage.

Avoiding filling up storage units is very much a classic problem. As we know, DHT only partially solves it at the moment (appending writes can fill up a subvolume). I am asking how Ceph tackles this to see if they have any insights.

> 
> Rafi has also suggested to avoid file creation on the hot tier if the
> hot tier has hi-watermark breached to avoid further stress on storage
> capacity and eventual file migration to the cold tier.
> 
> Do we introduce demotion policies like "strict" and "approximate" to
> let user choose the demotion strategy ?
> 1. strict
>     Choosing this strategy could mean we wait for the full and ordered
>     query to complete and only then start demoting the coldest file first
> 
> 2. approximate
>     Choosing this strategy could mean we choose the first available
>     file from the database query and demote it even if it is hot and
>     actively written to
> 
> 
> Milind
> 
> On 08/12/2016 08:25 PM, Milind Changire wrote:
> > Patch for review: http://review.gluster.org/15158
> >
> > Milind
> >
> > On 08/12/2016 07:27 PM, Milind Changire wrote:
> >> On 08/10/2016 12:06 PM, Milind Changire wrote:
> >>> Emergency demotions will be required whenever writes breach the
> >>> hi-watermark. Emergency demotions are required to avoid ENOSPC in case
> >>> of continuous writes that originate on the hot tier.
> >>>
> >>> There are two concerns in this area:
> >>>
> >>> 1. enforcing max-cycle-time during emergency demotions
> >>>    max-cycle-time is the time the tiering daemon spends in promotions or
> >>>    demotions
> >>>    I tend to think that the tiering daemon should skip this check for the
> >>>    emergency situation and continue demotions until the watermark drops
> >>>    below the hi-watermark
> >>
> >> Update:
> >> To keep matters simple and manageable, it has been decided to *enforce*
> >> max-cycle-time to yield the worker threads to attend to impending tier
> >> management tasks if the need arises.
> >>
> >>>
> >>> 2. file demotion policy
> >>>    I tend to think that evicting the largest file with the most recent
> >>>    *write* should be chosen for eviction when write-freq-threshold is
> >>>    NON-ZERO.
> >>>    Choosing a least written file is just going to delay file migration
> >>>    of an active file which might consume hot tier disk space resulting
> >>>    in an ENOSPC, in the worst case.
> >>>    In cases where write-freq-threshold is ZERO, the most recently
> >>>    *written* file can be chosen for eviction.
> >>>    In the case of choosing the largest file within the
> >>>    write-freq-threshold, a stat() on the files would be required to
> >>>    calculate the number of files that need to be demoted to take the
> >>>    watermark below the hi-watermark. Finding the number of most recently
> >>>    written files to demote could also help make demotions in parallel
> >>>    rather than in the sequential manner currently in place.
> >>
> >> Update:
> >> The idea of choosing the files wrt file size has been dropped.
> >> Iteratively, the most recently written file will be chosen for eviction
> >> from the hot tier in case of a hi-watermark breach and until the
> >> watermark drops below hi-watermark.
> >> The idea of parallelizing multiple promotions/demotions has been
> >> deferred.
> >>
> >> -----
> >>
> >> Sustained writes creating large files in the hot tier which
> >> cumulatively breach the hi-watermark do NOT seem to be a good
> >> workload for making use of tiering. The assumption is that, to make the
> >> most of the hot tier, the hi-watermark would be closer to 100.
> >> In this case a sustained large file copy might easily breach the
> >> hi-watermark and may even consume the entire hot tier space, resulting
> >> in an ENOSPC.
> >>
> >> e.g. a sustained write:
> >>
> >> # cp file1 /mnt/glustervol/dir
> >>
> >> Workloads that would seem to make the most of tiering are:
> >> 1. Many smaller files, which are created in small bursts of write
> >>    activity and then closed
> >> 2. Few large files where updates are in-place and the file size
> >>    does not grow beyond the hi-watermark eg. database, with frequent
> >>    in-line compaction/de-fragmentation policy enabled
> >> 3. Frequent reads of few large files, mostly static in size, which
> >>    cumulatively don't breach the hi-watermark. Frequently reading
> >>    a large number of smaller, mostly static, files would be good
> >>    tiering workload candidates as well.
> >>
> >>
> >>>
> >>> Comments are requested.
> >>>
> >> _______________________________________________
> >> Gluster-devel mailing list
> >> Gluster-devel at gluster.org
> >> http://www.gluster.org/mailman/listinfo/gluster-devel