[Gluster-devel] Data classification proposal

Jeff Darcy jdarcy at redhat.com
Wed Jun 25 15:42:10 UTC 2014


> If I understand correctly the proposed data-classification
> architecture, each server will have a number of bricks that will be
> dynamically modified as needed: as more data-classifying conditions
> are defined, a new layer of translators will be added (a new DHT or
> AFR, or something else) and some or all existing bricks will be split
> to accommodate the new and, maybe, overlapping condition.

Correct.

> How will space be allocated to each new sub-brick?  Some sort of
> thin provisioning, or will it be distributed evenly on each split?

That's left to the user.  The latest proposal, based on discussion of
the first, is here:

https://docs.google.com/presentation/d/1e8tuh9DKNi9eCMrdt5vetppn1D3BiJSmfR7lDW2wRvA/edit?usp=sharing

That has an example of assigning percentages to the sub-bricks created
by a rule (i.e. a subvolume in a potentially multi-tiered
configuration).  Other possibilities include relative weights used to
determine percentages, or total thin provisioning where sub-bricks
compete freely for available space.  It's certainly a fruitful area for
discussion.
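
For example, here's a small Python sketch (illustrative only, not
Gluster code; the tier names and weights are invented) of how relative
weights on sub-bricks could be normalized into the percentages a rule
would apply:

    def weights_to_percentages(weights):
        """Map {sub_brick: weight} to {sub_brick: percent of parent}."""
        total = sum(weights.values())
        if total == 0:
            raise ValueError("at least one sub-brick needs a non-zero weight")
        return {name: 100.0 * w / total for name, w in weights.items()}

    # A hot tier weighted 1 and a cold tier weighted 3 split the parent
    # brick 25% / 75%.
    print(weights_to_percentages({"tier-hot": 1, "tier-cold": 3}))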

> If using thin provisioning, it will be hard to determine the real
> available space.  If using a fixed amount, we can end up in scenarios
> where a file cannot be written even though there seems to be enough
> free space.  This can already happen today with very big files on
> almost-full bricks.  I think brick splitting can accentuate this.

Is this really common outside of test environments, given the sizes of
modern disks and files?  Even in cases where it might happen, doesn't
striping address it?
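
To make the quoted scenario concrete, here's a toy Python example
(numbers invented) of whole-file placement without striping: the
aggregate free space exceeds the file's size, yet no single brick can
hold it.

    free_per_brick = {"brick-1": 60, "brick-2": 60}  # GB free on each brick
    file_size = 80                                   # GB to be written

    total_free = sum(free_per_brick.values())        # 120 GB reported free
    fits_somewhere = any(f >= file_size for f in free_per_brick.values())

    print(total_free >= file_size)   # True:  looks like there is room
    print(fits_somewhere)            # False: the write still fails

Striping avoids the failure by splitting the file into chunks small
enough to fit on individual bricks.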

We have a whole bunch of problems in this area.  If multiple bricks are
on the same local file system, their capacity will be double-counted.
If a second local file system is mounted over part of a brick, the
additional space won't be counted at all.  We do need a general solution
to this, but I don't think that solution needs to be part of data
classification unless there's a specific real-world scenario that DC
makes worse.
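
For what it's worth, here's a rough Python sketch of the
double-counting half of that problem (not Gluster's actual accounting;
the brick paths are hypothetical): summing statvfs() results per brick
counts a shared local file system once per brick, whereas grouping
bricks by the device they live on counts it only once.

    import os

    def naive_capacity(brick_paths):
        """Sum free bytes per brick; over-counts shared file systems."""
        total = 0
        for path in brick_paths:
            st = os.statvfs(path)
            total += st.f_bavail * st.f_frsize
        return total

    def dedup_capacity(brick_paths):
        """Count each underlying file system (device) only once."""
        seen, total = set(), 0
        for path in brick_paths:
            dev = os.stat(path).st_dev
            if dev in seen:
                continue
            seen.add(dev)
            st = os.statvfs(path)
            total += st.f_bavail * st.f_frsize
        return total

    # Example (paths hypothetical): if /bricks/b1 and /bricks/b2 are
    # directories on the same file system, naive_capacity() reports
    # roughly twice what dedup_capacity() does.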

> Also, the addition of multiple layered DHT translators, as it's
> implemented today, could add a lot more latency, especially on
> directory listings.

With http://review.gluster.org/#/c/7702/ this should be less of a
problem.  Also, lookups across multiple tiers are likely to be rare in
most use cases.  For example, for the name-based filtering (sanlock)
case, a given file should only *ever* be in one tier so only that tier
would need to be searched.  For the activity-based tiering case, the
vast majority of lookups will be for hot files which are (not
accidentally) in the first tier.  The only real problem is with *failed*
lookups, e.g. during create.  We can address that by adding "stubs"
(similar to linkfiles) in the upper tier, but I'd still want to wait
until it's proven necessary.  What I would truly resist is any solution
that involves building tier awareness directly into (one instance of)
DHT.  Besides requiring a much larger development effort in the present,
it would throw away the benefit of modularity and hamper other efforts
in the future.  We need tiering and brick splitting *now*, especially as
a complement to erasure coding, which many won't be able to use
otherwise.  As far as I can tell, stacking translators is the fastest
way to get there.
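
To sketch the lookup pattern described above (purely illustrative
Python, not translator code; the Tier class and file names are
invented): tiers are searched in order, hot first, so hot files
resolve on the first probe and only a failed lookup pays for walking
every tier.

    class Tier:
        def __init__(self, name, files):
            self.name = name
            self.files = set(files)

        def holds(self, path):
            return path in self.files

    def lookup(tiers, path):
        """Return the tier holding 'path', or None after trying them all."""
        for tier in tiers:
            if tier.holds(path):
                return tier  # hot files resolve in the first tier
        # Failed lookups (e.g. during create) probe every tier unless
        # stubs in the upper tier short-circuit the search.
        return None

    tiers = [Tier("hot", {"/vm/active.img"}),
             Tier("cold", {"/vm/archive-2013.img"})]
    assert lookup(tiers, "/vm/active.img").name == "hot"
    assert lookup(tiers, "/vm/no-such-file") is None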

> Another problem I see is that splitting bricks will require a
> rebalance, which is a costly operation.  It doesn't seem right to
> require such an expensive operation every time you add a new
> condition to an already created volume.

Yes, rebalancing is expensive, but that's no different for split bricks
than whole ones.  Any time you change the definition of what should go
where, you'll have to move some data into compliance and that's
expensive.  However, such operations are likely to be very rare.  It's
highly likely that most uses of this feature will consist of a simple
two-tier setup defined when the volume is created and never changed
thereafter, so the only rebalancing would be within a tier - i.e. the
exact same thing we do today in homogeneous volumes (maybe even slightly
better).  The only use case I can think of that would involve *frequent*
tier-config changes is multi-tenancy, but adding a new tenant should
only affect new data and not require migration of old data.

