[Gluster-devel] Data classification proposal
Jeff Darcy
jdarcy at redhat.com
Fri May 23 19:30:39 UTC 2014
One of the things holding up our data classification efforts (which include tiering as well as other features) has been the extension of the same conceptual model from the I/O path to the configuration subsystem and ultimately to the user experience. How does an administrator define a tiering policy without tearing their hair out? How does s/he define a mixed replication/erasure-coding setup without wanting to rip *our* hair out? The included Markdown document attempts to remedy this by proposing one of many possible models and user interfaces. It includes examples for some of the most common use cases, including the "replica 2.5" case we've been discussing recently. Constructive feedback would be greatly appreciated.
# Data Classification Interface
The data classification feature is extremely flexible, to cover use cases from
SSD/disk tiering to rack-aware placement to security or other policies. With
this flexibility comes complexity. While this complexity does not affect the
I/O path much, it does affect both the volume-configuration subsystem and the
user interface to set placement policies. This document describes one possible
model and user interface.
The model described here is based on two kinds of information: brick descriptions
and aggregation rules. Both are contained in a configuration file (format TBD)
which can be associated with a volume using a volume option.
## Brick Descriptions
A brick is described by a series of simple key/value pairs. Predefined keys
include:
* **media-type**
The underlying media type for the brick. In its simplest form this might
just be *ssd* or *disk*. More sophisticated users might use something like
*15krpm* to represent a faster disk, or *perc-raid5* to represent a brick
backed by a RAID controller.
* **rack** (and/or **row**)
The physical location of the brick. Some policy rules might be set up to
spread data across more than one rack.
User-defined keys are also allowed. For example, some users might use a
*tenant* or *security-level* tag as the basis for their placement policy.
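For example, a single brick's stanza in the configuration file might look like
the following sketch (the file format is still TBD; *tenant* is a hypothetical
user-defined key, not a predefined one):

    brick host1:/export/brick1
        media-type = 15krpm
        rack = rack-3
        tenant = acme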
## Aggregation Rules
Aggregation rules define how bricks should be combined into subvolumes, how
those subvolumes may in turn be combined into higher-level subvolumes, and so
on until all of the bricks are accounted for. Each aggregation rule consists
of the following parts (a complete rule sketch follows the list):
* **id**
The base name of the subvolumes the rule will create. If a rule is applied
multiple times, this will yield *id-0*, *id-1*, and so on.
* **selector**
A "filter" for which bricks or lower-level subvolumes the rule will
aggregate. This is an expression similar to a *WHERE* clause in SQL, using
brick/subvolume names and properties in lieu of columns. These values are
then matched against literal values or regular expressions, using the usual
set of boolean operators to arrive at a *yes* or *no* answer to the question
of whether this brick/subvolume is affected by this rule.
* **group-size** (optional)
The number of original bricks/subvolumes to be combined into each produced
subvolume. The special default value zero means to collect all original
bricks or subvolumes into one final subvolume. In this case, *id* is used
directly instead of having a numeric suffix appended.
* **type** (optional)
The type of the generated translator definition(s). Examples might include
"AFR" to do replication, "EC" to do erasure coding, and so on. The more
general data classification task includes the definition of new translators
to do tiering and other kinds of filtering, but those are beyond the scope
of this document. If no type is specified, cluster/dht will be used to do
random placement among its constituents.
* **tag** and **option** (optional, repeatable)
Additional tags and/or options to be applied to each newly created
subvolume. See the "replica 2.5" example to see how this can be used.
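Putting these parts together, a complete rule might look like the sketch below.
The exact selector syntax is TBD, so the *and* operator and the comparison form
here are assumptions, and *class=fast* is just a user-defined tag:

    rule fast-pairs
        # combine SSD bricks in rack-3, two at a time, producing
        # AFR subvolumes named fast-pairs-0, fast-pairs-1, ...
        select media-type = ssd and rack = rack-3
        group-size 2
        type cluster/afr
        tag class=fast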
Since each type might have unique requirements, such as ensuring that
replication is done across machines or racks whenever possible, it is assumed
that there will be corresponding type-specific scripts or functions to do the
actual aggregation. This might even be made pluggable some day (TBD). Once
all rule-based aggregation has been done, volume options are applied similarly
to how they are now.
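For instance, the hypothetical *fast-pairs* rule sketched above might lead the
AFR-specific script to emit translator definitions roughly like this in the
generated volfile (subvolume names are illustrative):

    volume fast-pairs-0
        type cluster/afr
        subvolumes host1-brick1 host2-brick1
    end-volume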
Astute readers might have noticed that it's possible for a brick to be
aggregated more than once. This is intentional. If a brick is part of
multiple aggregates, it will automatically be split into multiple bricks
internally, but this will be invisible to the user. The "replica 2.5" example
below shows this splitting in action.
## Examples
Let's start with a simple tiering example. Here's what the data-classification
config file might look like.
    brick host1:/brick
        media-type = ssd

    brick host2:/brick
        media-type = disk

    brick host3:/brick
        media-type = disk

    rule tier-1
        select media-type = ssd

    rule tier-2
        select media-type = disk

    rule all
        select tier-1
        # use repeated "select" to establish order
        select tier-2
        type features/tiering
This would create a DHT subvolume named *tier-2* for the bricks on *host2* and
*host3*. Then it would add a features/tiering translator that treats *tier-1*
as its upper tier and *tier-2* as its lower. Here's a more complex example that
adds replication and erasure coding to the mix.
    # Assume 20 hosts, four fast and sixteen slow (named appropriately).

    rule tier-1
        select *fast*
        group-size 2
        type cluster/afr

    rule tier-2
        # special pattern matching otherwise-unused bricks
        select %{unclaimed}
        group-size 8
        type cluster/ec parity=2
        # i.e. two groups, each six data plus two parity

    rule all
        select tier-1
        select tier-2
        type features/tiering
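With twenty hosts, this should yield two replica-2 pairs (*tier-1-0* and
*tier-1-1*) from the four fast hosts, two 6+2 erasure-coded groups from the
sixteen slow ones, and a tiering translator that places the replicated tier
above the erasure-coded tier.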
Lastly, here's an example of "replica 2.5" to do three-way replication for some
files but two-way replication for the rest.
    rule two-way-parts
        select *
        group-size 2
        type cluster/afr

    rule two-way-pool
        select two-way-parts*
        tag special=no

    rule three-way-parts
        # use overlapping selections to demonstrate splitting
        select *
        group-size 3
        type cluster/afr

    rule three-way-pool
        select three-way-parts*
        tag special=yes

    rule sanlock
        select two-way*
        select three-way*
        type features/filter
        # files named *.lock go in the replica-3 pool
        option filter-condition-1 name:*.lock
        option filter-target-1 three-way-pool
        # everything else goes in the replica-2 pool
        option default-subvol two-way-pool
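Because *two-way-parts* and *three-way-parts* both select every brick, each
brick ends up in both pools and is split internally as described earlier. The
features/filter translator at the top then routes files named *.lock to the
replica-3 pool (tagged *special=yes*) and everything else to the replica-2
pool.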