[Gluster-devel] Data classification proposal
Krishnan Parthasarathi
kparthas at redhat.com
Tue Jun 24 07:46:46 UTC 2014
Jeff,
I have a few questions regarding the rules syntax and how they apply.
I think this is different in spirit from the discussion Dan has started,
so I am keeping it separate. See questions inline.
----- Original Message -----
> One of the things holding up our data classification efforts (which include
> tiering but also other things) has been the extension of the same
> conceptual model from the I/O path to the configuration subsystem and
> ultimately to the user experience. How does an administrator define a
> tiering policy without tearing their hair out? How does s/he define a mixed
> replication/erasure-coding setup without wanting to rip *our* hair out? The
> included Markdown document attempts to remedy this by proposing one out of
> many possible models and user interfaces. It includes examples for some of
> the most common use cases, including the "replica 2.5" case we've been
> discussing recently. Constructive feedback would be greatly appreciated.
>
>
>
> # Data Classification Interface
>
> The data classification feature is extremely flexible, to cover use cases
> from
> SSD/disk tiering to rack-aware placement to security or other policies. With
> this flexibility comes complexity. While this complexity does not affect the
> I/O path much, it does affect both the volume-configuration subsystem and the
> user interface to set placement policies. This document describes one
> possible
> model and user interface.
>
> The model we used is based on two kinds of information: brick descriptions
> and
> aggregation rules. Both are contained in a configuration file (format TBD)
> which can be associated with a volume using a volume option.
>
> ## Brick Descriptions
>
> A brick is described by a series of simple key/value pairs. Predefined keys
> include:
>
> * **media-type**
> The underlying media type for the brick. In its simplest form this might
> just be *ssd* or *disk*. More sophisticated users might use something
> like
> *15krpm* to represent a faster disk, or *perc-raid5* to represent a brick
> backed by a RAID controller.
Am I right in understanding that the value of media-type is not interpreted beyond the
scope of matching rules? That is to say, we don't need (or have) any notion of media-types
that are type-checked internally when forming (sub)volumes using the specified rules.
>
> * **rack** (and/or **row**)
> The physical location of the brick. Some policy rules might be set up to
> spread data across more than one rack.
>
> User-defined keys are also allowed. For example, some users might use a
> *tenant* or *security-level* tag as the basis for their placement policy.
>
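> For instance, reusing the (still TBD) syntax from the examples below, a brick
> carrying both predefined and user-defined keys might be described like this;
> the host name and key values here are purely illustrative:
>
>     brick host4:/brick
>         media-type = 15krpm
>         rack = r3
>         tenant = acme
>         security-level = 2
>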
> ## Aggregation Rules
>
> Aggregation rules are used to define how bricks should be combined into
> subvolumes, and those potentially combined into higher-level subvolumes, and
> so
> on until all of the bricks are accounted for. Each aggregation rule consists
> of the following parts:
>
> * **id**
> The base name of the subvolumes the rule will create. If a rule is
> applied
> multiple times this will yield *id-0*, *id-1*, and so on.
>
> * **selector**
> A "filter" for which bricks or lower-level subvolumes the rule will
> aggregate. This is an expression similar to a *WHERE* clause in SQL,
> using
> brick/subvolume names and properties in lieu of columns. These values are
> then matched against literal values or regular expressions, using the
> usual
> set of boolean operators to arrive at a *yes* or *no* answer to the
> question
> of whether this brick/subvolume is affected by this rule.
>
> * **group-size** (optional)
> The number of original bricks/subvolumes to be combined into each produced
> subvolume. The special default value zero means to collect all original
> bricks or subvolumes into one final subvolume. In this case, *id* is used
> directly instead of having a numeric suffix appended.
Should the number of bricks or lower-level subvolumes that match the rule be an exact
multiple of group-size?
>
> * **type** (optional)
> The type of the generated translator definition(s). Examples might
> include
> "AFR" to do replication, "EC" to do erasure coding, and so on. The more
> general data classification task includes the definition of new
> translators
> to do tiering and other kinds of filtering, but those are beyond the scope
> of this document. If no type is specified, cluster/dht will be used to do
> random placement among its constituents.
>
> * **tag** and **option** (optional, repeatable)
> Additional tags and/or options to be applied to each newly created
> subvolume. See the "replica 2.5" example to see how this can be used.
>
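> Putting these parts together, a single rule that uses every field might look
> like this; the concrete syntax is TBD, and the rule name, tag, and pass-through
> AFR option below are only illustrative:
>
>     rule fast-mirrors
>         select media-type = ssd
>         group-size 2
>         type cluster/afr
>         tag tier=fast
>         # an ordinary volume option, passed through to each generated subvolume
>         option read-subvolume-index 1
>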
> Since each type might have unique requirements, such as ensuring that
> replication is done across machines or racks whenever possible, it is assumed
> that there will be corresponding type-specific scripts or functions to do the
> actual aggregation. This might even be made pluggable some day (TBD). Once
> all rule-based aggregation has been done, volume options are applied
> similarly
> to how they are now.
>
> Astute readers might have noticed that it's possible for a brick to be
> aggregated more than once. This is intentional. If a brick is part of
> multiple aggregates, it will be automatically split into multiple bricks
> internally but this will be invisible to the user.
>
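> For example, if both of the following rules exist (hypothetical names, same TBD
> syntax as above), an SSD brick would be claimed by each of them and transparently
> split into two internal bricks:
>
>     rule hot
>         select media-type = ssd
>
>     rule everything
>         select *
>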
> ## Examples
>
> Let's start with a simple tiering example. Here's what the
> data-classification
> config file might look like.
>
>     brick host1:/brick
>         media-type = ssd
>
>     brick host2:/brick
>         media-type = disk
>
>     brick host3:/brick
>         media-type = disk
>
>     rule tier-1
>         select media-type = ssd
>
>     rule tier-2
>         select media-type = disk
>
>     rule all
>         select tier-1
>         # use repeated "select" to establish order
>         select tier-2
>         type features/tiering
>
> This would create a DHT subvolume named *tier-2* for the bricks on *host2* and
> *host3*. Then it would add a features/tiering translator to treat *tier-1* as
> its upper tier and *tier-2* as its lower.
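> In volfile terms, the generated graph for this config might end up looking
> roughly like the sketch below; the brick subvolume names are hypothetical, and
> the options of the (yet to be written) tiering translator are out of scope here:
>
>     volume tier-1
>         type cluster/dht
>         subvolumes host1-brick
>     end-volume
>
>     volume tier-2
>         type cluster/dht
>         subvolumes host2-brick host3-brick
>     end-volume
>
>     volume all
>         type features/tiering
>         # the order of the repeated selects determines which subvolume is the
>         # upper tier and which is the lower
>         subvolumes tier-1 tier-2
>     end-volume
>
> Here's a more complex example that adds replication and erasure coding to the
> mix.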
>
>     # Assume 20 hosts, four fast and sixteen slow (named appropriately).
>
>     rule tier-1
>         select *fast*
>         group-size 2
>         type cluster/afr
>
>     rule tier-2
>         # special pattern matching otherwise-unused bricks
>         select %{unclaimed}
>         group-size 8
>         type cluster/ec parity=2
>         # i.e. two groups, each six data plus two parity
>
>     rule all
>         select tier-1
>         select tier-2
>         type features/tiering
>
In the above example, rule tier-1 would aggregate the four fast bricks into two
subvolumes of two bricks each. Let's call these subvolumes tier-1-0 and tier-1-1.
Both are AFR-based two-way replicated subvolumes. Are these tier-1-* instances
composed using cluster/dht by the default semantics?
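Expressed as a purely hypothetical volfile fragment (brick names made up), I am
asking whether the generated graph would contain something like this:

    volume tier-1-0
        type cluster/afr
        subvolumes fast1-brick fast2-brick
    end-volume

    volume tier-1-1
        type cluster/afr
        subvolumes fast3-brick fast4-brick
    end-volume

    volume tier-1
        # i.e. is cluster/dht the implicit default for combining the instances?
        type cluster/dht
        subvolumes tier-1-0 tier-1-1
    end-volume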
> Lastly, here's an example of "replica 2.5" to do three-way replication for
> some
> files but two-way replication for the rest.
>
>     rule two-way-parts
>         select *
>         group-size 2
>         type cluster/afr
>
>     rule two-way-pool
>         select two-way-parts*
>         tag special=no
>
>     rule three-way-parts
>         # use overlapping selections to demonstrate splitting
>         select *
>         group-size 3
>         type cluster/afr
>
>     rule three-way-pool
>         select three-way-parts*
>         tag special=yes
>
>     rule sanlock
>         select two-way*
>         select three-way*
>         type features/filter
>         # files named *.lock go in the replica-3 pool
>         option filter-condition-1 name:*.lock
>         option filter-target-1 three-way-pool
>         # everything else goes in the replica-2 pool
>         option default-subvol two-way-pool
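>
> In volfile terms, the *sanlock* rule would boil down to a features/filter
> translator sitting on top of the two pools, roughly as sketched below; since
> features/filter is itself part of the proposed work, the option names are simply
> the ones used in the rule above:
>
>     volume sanlock
>         type features/filter
>         # *.lock files go to the replica-3 pool, everything else to replica-2
>         option filter-condition-1 name:*.lock
>         option filter-target-1 three-way-pool
>         option default-subvol two-way-pool
>         subvolumes two-way-pool three-way-pool
>     end-volume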