[Gluster-devel] Data classification proposal

Tue Jun 24 14:18:05 UTC 2014

It's possible to express your example using lists if their entries are allowed to overlap. I see that you wanted a way to express a matrix (overlapping rules) with gluster's tree-like syntax as a backdrop.

A polytree (a DAG whose underlying undirected graph has no cycles) may be a better term than matrix, i.e. when there are overlaps a node in the graph gets multiple in-arcs.

Syntax aside, we seem to part on "where" to solve the problem: in the config file or in the UX. I prefer that the UX have the logic to build the configuration file, given how complex it can be. My preference would be for the config file to be mostly "read only" with extremely simple syntax.

I'll put some more thought into this and believe this discussion has illuminated some good points.

Brick: host1:/SSD1  SSD1
Brick: host1:/SSD2  SSD2
Brick: host2:/SSD3  SSD3
Brick: host2:/SSD4  SSD4
Brick: host1:/DISK1 DISK1

rule rack4: 
  select SSD1, SSD2, DISK1

# some files should go on ssds in rack 4
rule A: 
  option filter-condition *.lock
  select SSD1, SSD2

# some files should go on ssds anywhere
rule B: 
  option filter-condition *.out
  select SSD1, SSD2, SSD3, SSD4

# some files should go anywhere in rack 4
rule C:
  option filter-condition *.c
  select rack4

# some files we just don't care
rule D:
  option filter-condition *.h
  select SSD1, SSD2, SSD3, SSD4, DISK1

volume:
  option filter-condition A,B,C,D
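
As a rough sketch of how a tool might expand the overlapping selections above, the following models each rule as a list of names and resolves it down to bricks. The dictionary layout and the helper names (resolve, in_arcs) are assumptions made for illustration; they are not part of gluster or of the proposal.

```python
# Hypothetical model of the rules above; not a real gluster API.
BRICKS = {"SSD1", "SSD2", "SSD3", "SSD4", "DISK1"}

RULES = {
    "rack4": ["SSD1", "SSD2", "DISK1"],
    "A": ["SSD1", "SSD2"],
    "B": ["SSD1", "SSD2", "SSD3", "SSD4"],
    "C": ["rack4"],
    "D": ["SSD1", "SSD2", "SSD3", "SSD4", "DISK1"],
}


def resolve(name, rules=RULES, bricks=BRICKS, _seen=()):
    """Expand a rule name into the set of bricks it ultimately selects."""
    if name in bricks:
        return {name}
    if name in _seen:
        raise ValueError(f"rule {name!r} refers back to itself")
    if name not in rules:
        raise KeyError(f"unknown brick or rule {name!r}")
    selected = set()
    for target in rules[name]:
        selected |= resolve(target, rules, bricks, _seen + (name,))
    return selected


def in_arcs(rules=RULES):
    """Map each selected name to the rules that point at it."""
    arcs = {}
    for rule, targets in rules.items():
        for target in targets:
            arcs.setdefault(target, []).append(rule)
    return arcs
```

Here resolve("C") follows the reference to rack4 and yields SSD1, SSD2 and DISK1, while in_arcs()["SSD1"] lists every rule that selects SSD1 directly, which is where the overlap shows up as multiple in-arcs.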

----- Original Message -----
From: "Jeff Darcy" <jdarcy at redhat.com>
To: "Dan Lambright" <dlambrig at redhat.com>
Cc: "Gluster Devel" <gluster-devel at gluster.org>
Sent: Monday, June 23, 2014 7:11:44 PM
Subject: Re: [Gluster-devel] Data classification proposal

> Rather than using the keyword "unclaimed", my instinct was to
> explicitly list which bricks have not been "claimed".  Perhaps you
> have something more subtle in mind, it is not apparent to me from your
> response. Can you provide an example of why it is necessary and a list
> could not be provided in its place? If the list is somehow "difficult
> to figure out", due to a particularly complex setup or some such, I'd
> prefer a CLI/GUI build that list rather than having sysadmins
> hand-edit this file.

It's not *difficult* to make sure every brick has been enumerated by
some rule, and that there are no overlaps, but it's certainly tedious
and error prone.  Imagine that a user has bricks in four
machines, using names like serv1-b1, serv1-b2, ..., serv4-b6.
Accordingly, they've set up rules to put serv1* into one set and
serv[234]* into another set (which is already more flexibility than I
think your proposal gave them).  Now when they add serv5 they need an
extra step to add it to the tiering config, which wouldn't have been
necessary if we supported defaults.  What percentage of users would
forget that step at least once?  I don't know for sure, but I'd guess
it's pretty high.
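
To make the serv5 case concrete, here is a minimal sketch that assumes the rules are shell-style globs checked in order. The set names "fast" and "slow", the assign() helper and its default parameter are all invented for illustration; the only real library call is Python's fnmatch.fnmatchcase.

```python
from fnmatch import fnmatchcase


def assign(bricks, patterns, default=None):
    """Place each brick in the first set whose glob pattern matches it.

    Bricks that match no pattern go to `default` if one is given;
    otherwise they are returned separately as unclaimed.
    """
    sets = {}
    unclaimed = []
    for brick in bricks:
        for set_name, pattern in patterns:
            if fnmatchcase(brick, pattern):
                sets.setdefault(set_name, []).append(brick)
                break
        else:
            # No pattern matched this brick.
            if default is None:
                unclaimed.append(brick)
            else:
                sets.setdefault(default, []).append(brick)
    return sets, unclaimed


# Hypothetical set names; the patterns are the ones from the text.
patterns = [("fast", "serv1*"), ("slow", "serv[234]*")]
bricks = ["serv1-b1", "serv2-b1", "serv4-b6", "serv5-b1"]
```

Without a default, assign(bricks, patterns) leaves serv5-b1 in the unclaimed list until someone edits the config. With default="slow" the new brick lands in a set with no extra step.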

Having a CLI or GUI create configs just means that we have to add
support for defaults there instead.  We'd still have to implement the
same logic, they'd still have to specify the same thing.  That just
seems like moving the problem around instead of solving it.

> The key-value piece seems like syntactic sugar - an "alias". If so,
> let the name itself be the alias. No notions of SSD or physical
> location need be inserted. Unless I am missing that it *is* necessary,
> I stand by that value judgement as a philosophy of not putting
> anything into the configuration file that you don't require. Can you
> provide an example of where it is necessary?

OK...
-----

Brick: SSD1
Brick: SSD2
Brick: SSD3
Brick: SSD4
Brick: DISK1

rack4: SSD1, SSD2, DISK1

filter A : SSD1, SSD2

filter B : SSD1, SSD2, SSD3, SSD4

filter C: rack4

filter D: SSD1, SSD2, SSD3, SSD4, DISK1

meta-filter: filter A, filter B, filter C, filter D

  * some files should go on ssds in rack 4

  * some files should go on ssds anywhere

  * some files should go anywhere in rack 4

  * some files we just don't care

Notice how the rules *overlap*.  We can't support that if our syntax
only allows the user to express a list (or list of lists).  If the list
is ordered by type, we can't also support location-based rules.  If the
list is ordered by location, we lose type-based rules instead.   Brick
properties create a matrix, with an unknown number of dimensions (e.g.
security level, tenant ID, and so on as well as type and location).  The
logical way to represent such a space for rule-matching purposes is to
let users define however many dimensions (keys) as they want and as many
values for each dimension as they want.
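
A minimal sketch of that idea, assuming each brick carries a free-form set of key/value properties and a rule is just a set of required values. The keys "type" and "rack", the value "other" for the bricks outside rack 4, and the select() helper are assumptions for illustration, not proposed syntax.

```python
# Hypothetical brick properties; the rack of SSD3/SSD4 is not stated
# in the example, so "other" is a placeholder.
BRICKS = {
    "SSD1": {"type": "ssd", "rack": "4"},
    "SSD2": {"type": "ssd", "rack": "4"},
    "SSD3": {"type": "ssd", "rack": "other"},
    "SSD4": {"type": "ssd", "rack": "other"},
    "DISK1": {"type": "disk", "rack": "4"},
}


def select(bricks, **wanted):
    """Return the names of bricks whose properties match every wanted key."""
    return sorted(
        name
        for name, props in bricks.items()
        if all(props.get(key) == value for key, value in wanted.items())
    )


ssds_in_rack4 = select(BRICKS, type="ssd", rack="4")   # first bullet
ssds_anywhere = select(BRICKS, type="ssd")             # second bullet
anything_in_rack4 = select(BRICKS, rack="4")           # third bullet
dont_care = select(BRICKS)                             # fourth bullet
```

Each of the four overlapping cases becomes a query over a different subset of keys, and adding a new dimension such as a tenant ID would mean adding a key to the bricks rather than reordering a list.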

Whether the exact string "type" or "unclaimed" appears anywhere isn't
the issue.  What matters is that the *semantics* of assigning properties
to a brick have to be more sophisticated than just assigning each a
position in a list, and we need a syntax that supports those semantics.
Otherwise we'll end up solving the same UX problems again and again each
time we add a feature that involves treating bricks or data differently.
Each time we'll probably do it a little differently and confuse users a
little more, if history is any guide.  That's what I'd rather avoid.