[Gluster-devel] Data classification proposal
Krishnan Parthasarathi
kparthas at redhat.com
Tue Jun 24 07:46:46 UTC 2014
Jeff,
I have a few questions regarding the rules syntax and how they apply.
I think this is different in spirit from the discussion Dan has started,
so I am keeping it separate. See questions inline.
----- Original Message -----
> One of the things holding up our data classification efforts (which include
> tiering but also other things) has been the extension of the same
> conceptual model from the I/O path to the configuration subsystem and
> ultimately to the user experience. How does an administrator define a
> tiering policy without tearing their hair out? How does s/he define a mixed
> replication/erasure-coding setup without wanting to rip *our* hair out? The
> included Markdown document attempts to remedy this by proposing one out of
> many possible models and user interfaces. It includes examples for some of
> the most common use cases, including the "replica 2.5" case we've been
> discussing recently. Constructive feedback would be greatly appreciated.
>
>
>
> # Data Classification Interface
>
> The data classification feature is extremely flexible, to cover use cases
> from
> SSD/disk tiering to rack-aware placement to security or other policies. With
> this flexibility comes complexity. While this complexity does not affect the
> I/O path much, it does affect both the volume-configuration subsystem and the
> user interface to set placement policies. This document describes one
> possible
> model and user interface.
>
> The model we used is based on two kinds of information: brick descriptions
> and
> aggregation rules. Both are contained in a configuration file (format TBD)
> which can be associated with a volume using a volume option.
>
> ## Brick Descriptions
>
> A brick is described by a series of simple key/value pairs. Predefined keys
> include:
>
> * **media-type**
> The underlying media type for the brick. In its simplest form this might
> just be *ssd* or *disk*. More sophisticated users might use something
> like
> *15krpm* to represent a faster disk, or *perc-raid5* to represent a brick
> backed by a RAID controller.
Am I right in understanding that the value of media-type is not interpreted beyond the
scope of matching rules? That is to say, we don't need (or have) any notion of media-types
that are type-checked internally when forming (sub)volumes using the specified rules.
>
> * **rack** (and/or **row**)
> The physical location of the brick. Some policy rules might be set up to
> spread data across more than one rack.
>
> User-defined keys are also allowed. For example, some users might use a
> *tenant* or *security-level* tag as the basis for their placement policy.
>
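> For instance, reusing the (still TBD) syntax from the examples below, a brick
> carrying both predefined and user-defined keys might be described like this;
> the host name and key values here are purely illustrative:
>
>     brick host4:/brick
>         media-type = 15krpm
>         rack = r3
>         tenant = acme
>         security-level = 2
>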
> ## Aggregation Rules
>
> Aggregation rules are used to define how bricks should be combined into
> subvolumes, and those potentially combined into higher-level subvolumes, and
> so
> on until all of the bricks are accounted for. Each aggregation rule consists
> of the following parts:
>
> * **id**
> The base name of the subvolumes the rule will create. If a rule is
> applied
> multiple times this will yield *id-0*, *id-1*, and so on.
>
> * **selector**
> A "filter" for which bricks or lower-level subvolumes the rule will
> aggregate. This is an expression similar to a *WHERE* clause in SQL,
> using
> brick/subvolume names and properties in lieu of columns. These values are
> then matched against literal values or regular expressions, using the
> usual
> set of boolean operators to arrive at a *yes* or *no* answer to the
> question
> of whether this brick/subvolume is affected by this rule.
>
> * **group-size** (optional)
> The number of original bricks/subvolumes to be combined into each produced
> subvolume. The special default value zero means to collect all original
> bricks or subvolumes into one final subvolume. In this case, *id* is used
> directly instead of having a numeric suffix appended.
Should the number of bricks or lower-level subvolumes that match the rule be an exact
multiple of group-size?
>
> * **type** (optional)
> The type of the generated translator definition(s). Examples might
> include
> "AFR" to do replication, "EC" to do erasure coding, and so on. The more
> general data classification task includes the definition of new
> translators
> to do tiering and other kinds of filtering, but those are beyond the scope
> of this document. If no type is specified, cluster/dht will be used to do
> random placement among its constituents.
>
> * **tag** and **option** (optional, repeatable)
> Additional tags and/or options to be applied to each newly created
> subvolume. See the "replica 2.5" example to see how this can be used.
>
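> Putting these parts together, a single rule that uses every field might look
> like this; the concrete syntax is TBD, and the rule name, tag, and pass-through
> AFR option below are only illustrative:
>
>     rule fast-mirrors
>         select media-type = ssd
>         group-size 2
>         type cluster/afr
>         tag tier=fast
>         # an ordinary volume option, passed through to each generated subvolume
>         option read-subvolume-index 1
>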
> Since each type might have unique requirements, such as ensuring that
> replication is done across machines or racks whenever possible, it is assumed
> that there will be corresponding type-specific scripts or functions to do the
> actual aggregation. This might even be made pluggable some day (TBD). Once
> all rule-based aggregation has been done, volume options are applied
> similarly
> to how they are now.
>
> Astute readers might have noticed that it's possible for a brick to be
> aggregated more than once. This is intentional. If a brick is part of
> multiple aggregates, it will be automatically split into multiple bricks
> internally but this will be invisible to the user.
>
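> For example, if both of the following rules exist (hypothetical names, same TBD
> syntax as above), an SSD brick would be claimed by each of them and transparently
> split into two internal bricks:
>
>     rule hot
>         select media-type = ssd
>
>     rule everything
>         select *
>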
> ## Examples
>
> Let's start with a simple tiering example. Here's what the
> data-classification
> config file might look like.
>
>     brick host1:/brick
>         media-type = ssd
>
>     brick host2:/brick
>         media-type = disk
>
>     brick host3:/brick
>         media-type = disk
>
>     rule tier-1
>         select media-type = ssd
>
>     rule tier-2
>         select media-type = disk
>
>     rule all
>         select tier-1
>         # use repeated "select" to establish order
>         select tier-2
>         type features/tiering
>
> This would create a DHT subvolume named *tier-2* for the bricks on *host2* and
> *host3*. Then it would add a features/tiering translator to treat *tier-1* as
> its upper tier and *tier-2* as its lower.
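> In volfile terms, the generated graph for this config might end up looking
> roughly like the sketch below; the brick subvolume names are hypothetical, and
> the options of the (yet to be written) tiering translator are out of scope here:
>
>     volume tier-1
>         type cluster/dht
>         subvolumes host1-brick
>     end-volume
>
>     volume tier-2
>         type cluster/dht
>         subvolumes host2-brick host3-brick
>     end-volume
>
>     volume all
>         type features/tiering
>         # the order of the repeated selects determines which subvolume is the
>         # upper tier and which is the lower
>         subvolumes tier-1 tier-2
>     end-volume
>
> Here's a more complex example that adds replication and erasure coding to the
> mix.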
>
>     # Assume 20 hosts, four fast and sixteen slow (named appropriately).
>
>     rule tier-1
>         select *fast*
>         group-size 2
>         type cluster/afr
>
>     rule tier-2
>         # special pattern matching otherwise-unused bricks
>         select %{unclaimed}
>         group-size 8
>         type cluster/ec parity=2
>         # i.e. two groups, each six data plus two parity
>
>     rule all
>         select tier-1
>         select tier-2
>         type features/tiering
>
In the above example, rule tier-1 would aggregate the four fast bricks into two
subvolumes of two bricks each. Let's call these subvolumes tier-1-0 and tier-1-1.
Both are AFR-based two-way replicated subvolumes. Are these tier-1-* instances
composed using cluster/dht by the default semantics?
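Expressed as a purely hypothetical volfile fragment (brick names made up), I am
asking whether the generated graph would contain something like this:

    volume tier-1-0
        type cluster/afr
        subvolumes fast1-brick fast2-brick
    end-volume

    volume tier-1-1
        type cluster/afr
        subvolumes fast3-brick fast4-brick
    end-volume

    volume tier-1
        # i.e. is cluster/dht the implicit default for combining the instances?
        type cluster/dht
        subvolumes tier-1-0 tier-1-1
    end-volume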
> Lastly, here's an example of "replica 2.5" to do three-way replication for
> some
> files but two-way replication for the rest.
>
>     rule two-way-parts
>         select *
>         group-size 2
>         type cluster/afr
>
>     rule two-way-pool
>         select two-way-parts*
>         tag special=no
>
>     rule three-way-parts
>         # use overlapping selections to demonstrate splitting
>         select *
>         group-size 3
>         type cluster/afr
>
>     rule three-way-pool
>         select three-way-parts*
>         tag special=yes
>
>     rule sanlock
>         select two-way*
>         select three-way*
>         type features/filter
>         # files named *.lock go in the replica-3 pool
>         option filter-condition-1 name:*.lock
>         option filter-target-1 three-way-pool
>         # everything else goes in the replica-2 pool
>         option default-subvol two-way-pool
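>
> In volfile terms, the *sanlock* rule would boil down to a features/filter
> translator sitting on top of the two pools, roughly as sketched below; since
> features/filter is itself part of the proposed work, the option names are simply
> the ones used in the rule above:
>
>     volume sanlock
>         type features/filter
>         # *.lock files go to the replica-3 pool, everything else to replica-2
>         option filter-condition-1 name:*.lock
>         option filter-target-1 three-way-pool
>         option default-subvol two-way-pool
>         subvolumes two-way-pool three-way-pool
>     end-volume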