[Gluster-devel] Data classification proposal

Xavier Hernandez xhernandez at datalab.es
Thu Jun 26 09:55:11 UTC 2014


On Wednesday 25 June 2014 11:42:10 Jeff Darcy wrote:
> > How will space be allocated to each new sub-brick ? Some sort of thin
> > provisioning, or will it be distributed evenly on each split ?
> 
> That's left to the user.  The latest proposal, based on discussion of
> the first, is here:
> 
> https://docs.google.com/presentation/d/1e8tuh9DKNi9eCMrdt5vetppn1D3BiJSmfR7lDW2wRvA/edit?usp=sharing
> 

Thanks. I wasn't aware of that document.

> That has an example of assigning percentages to the sub-bricks created
> by a rule (i.e. a subvolume in a potentially multi-tiered
> configuration).  Other possibilities include relative weights used to
> determine percentages, or total thin provisioning where sub-bricks
> compete freely for available space.  It's certainly a fruitful area for
> discussion.
> 
> > If using thin provisioning, it will be hard to determine the real
> > available space. If using a fixed amount, we can end up in scenarios
> > where a file cannot be written even though there seems to be enough free
> > space. This can already happen today when using very big files on almost
> > full bricks, and I think brick splitting can accentuate it.
> 
> Is this really common outside of test environments, given the sizes of
> modern disks and files?  Even in cases where it might happen, doesn't
> striping address it?

Considering that SSD sizes are still relatively small and that each brick can 
be split many times depending on the data classification rules, I don't think 
it would be a rare case in some scenarios. Striping can solve the problem, but 
at the expense of increasing the fault probability, which requires more SSDs 
to compensate.
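
Just to illustrate the fault-probability point with a quick sketch 
(hypothetical numbers, assuming independent drive failures with probability 
p per drive):

    # A striped file is lost if any of the n drives holding its stripes
    # fails, so exposure grows with the stripe count.
    def stripe_loss_probability(p, n):
        return 1 - (1 - p) ** n

    p = 0.02  # assumed failure probability of a single SSD (hypothetical)
    for n in (1, 4, 8):
        print(n, stripe_loss_probability(p, n))
    # 1 -> 0.02, 4 -> ~0.078, 8 -> ~0.149: striping across more SSDs
    # raises the loss probability, hence the extra SSDs to compensate.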

> 
> We have a whole bunch of problems in this area.  If multiple bricks are
> on the same local file system, their capacity will be double-counted.
> If a second local file system is mounted over part of a brick, the
> additional space won't be counted at all.  We do need a general solution
> to this, but I don't think that solution needs to be part of data
> classification unless there's a specific real-world scenario that DC
> makes worse.
> 

Agreed. This is a problem that should be solved independently of data 
classification.
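
For reference, a minimal sketch of one possible general fix for the 
double-counting part (not an existing Gluster mechanism): key each brick on 
the device id of its backing file system and count every file system only 
once:

    import os

    def total_capacity(brick_paths):
        # Two bricks on the same local file system share st_dev, so keying
        # on it avoids counting that file system's capacity twice.
        seen, total = set(), 0
        for path in brick_paths:
            dev = os.stat(path).st_dev
            if dev in seen:
                continue
            seen.add(dev)
            st = os.statvfs(path)
            total += st.f_frsize * st.f_blocks
        return total

The nested-mount case Jeff mentions would still need something extra, such as 
walking /proc/mounts.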

> > Also, the addition of multiple layered DHT translators, as it's
> > implemented today, could add a lot more latency, especially on
> > directory listings.
> 
> With http://review.gluster.org/#/c/7702/ this should be less of a
> problem.

This solves one of the problems. Directory listing is still one of the worst 
performance problems I've found with Gluster, and I don't think this patch 
solves it.
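
The underlying issue, as I understand it: every DHT layer has to merge 
entries from all of its subvolumes for a listing, so the fan-out multiplies 
when layers are stacked (a sketch with assumed split counts):

    # With stacked DHT, the bricks touched per directory listing is the
    # product of the subvolume counts of each layer (numbers assumed).
    layers = [2, 5]   # e.g. 2 tiers, each rule splitting bricks 5 ways
    bricks = 1
    for fan_out in layers:
        bricks *= fan_out
    print(bricks)     # 10 bricks consulted for a single readdir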

> Also, lookups across multiple tiers are likely to be rare in
> most use cases.  For example, for the name-based filtering (sanlock)
> case, a given file should only *ever* be in one tier so only that tier
> would need to be searched.  For the activity-based tiering case, the
> vast majority of lookups will be for hot files which are (not
> accidentally) in the first tier.

I think this is true as long as the rules are not modified. But if we allow 
the rules to be modified dynamically while the volume is running, we will have 
the same problem as with rebalance: for some time there will be files that do 
not reside in the right tier, and we need to find them nonetheless. This could 
be alleviated by using something similar to the previous patch once the volume 
reaches a steady state again, though.
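
During that transition a lookup can't trust the rules alone; it would need a 
fall-through over the tiers, something like this sketch (the tier objects and 
their methods are hypothetical):

    def lookup(path, tiers):
        # Try the tiers the current rules predict first, but fall through
        # to the rest, since after a rule change a file may still sit in
        # its old tier until it has been migrated.
        predicted = [t for t in tiers if t.rules_match(path)]
        rest = [t for t in tiers if t not in predicted]
        for tier in predicted + rest:
            inode = tier.lookup(path)
            if inode is not None:
                return inode
        return None  # a failed lookup has now touched every tier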

> The only real problem is with *failed*
> lookups, e.g. during create.  We can address that by adding "stubs"
> (similar to linkfiles) in the upper tier, but I'd still want to wait
> until it's proven necessary.  What I would truly resist is any solution
> that involves building tier awareness directly into (one instance of)
> DHT.  Besides requiring a much larger development effort in the present,
> it would throw away the benefit of modularity and hamper other efforts
> in the future.  We need tiering and brick splitting *now*, especially as
> a complement to erasure coding which many won't be able to use
> otherwise.  As far as I can tell, stacking translators is the fastest
> way to get there.
> 

I agree that it's not good to create a specific solution for a problem when 
it's possible to make a more generic one that could be used to add more 
features. However, I'm not so sure that brick splitting is the best solution. 
Basically we need to solve two problems right now: tiering, and growing a 
volume brick by brick. Brick splitting is one way to implement both, but I 
don't think it's the only one.
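
That said, the "stub" idea quoted above does look workable for the failed 
lookup case; a minimal sketch, assuming a hypothetical xattr that records 
where the file really lives:

    STUB_XATTR = "trusted.dc.real-tier"  # assumed name, not an existing xattr

    def lookup_with_stubs(path, tiers):
        # The top tier holds either the file itself or a zero-length stub
        # (similar to a DHT linkfile) whose xattr names the real tier, so
        # a miss in the top tier means the file does not exist anywhere.
        entry = tiers[0].lookup(path)
        if entry is None:
            return None
        real = entry.xattrs.get(STUB_XATTR)
        if real is None:
            return entry                       # the file lives in the top tier
        return tiers[int(real)].lookup(path)   # one extra hop, no full scan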

> > Another problem I see is that splitting bricks will require a
> > rebalance, which is a costly operation. It doesn't seem right to
> > require such an expensive operation every time you add a new condition
> > to an already created volume.
> 
> Yes, rebalancing is expensive, but that's no different for split bricks
> than whole ones.  Any time you change the definition of what should go
> where, you'll have to move some data into compliance and that's
> expensive.  However, such operations are likely to be very rare.  It's
> highly likely that most uses of this feature will consist of a simple
> two-tier setup defined when the volume is created and never changed
> thereafter, so the only rebalancing would be within a tier - i.e. the
> exact same thing we do today in homogeneous volumes (maybe even slightly
> better).  The only use case I can think of that would involve *frequent*
> tier-config changes is multi-tenancy, but adding a new tenant should
> only affect new data and not require migration of old data.

Is brick splitting used whenever a new group is created, or only when it is 
explicitly requested? In the document you referenced there seems to be a 
specific keyword to split bricks, but I don't see how this works with the 
replica-2 and replica-3 subgroups of the sanlock group: replica 3 cannot 
evenly use the 10 split bricks (10 is not a multiple of 3), so it would need 
to divide them even further.
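
To make the mismatch concrete (same numbers as the document's example):

    def replica_groups(sub_bricks, replica):
        # Sub-bricks can only form complete replica sets.
        return divmod(sub_bricks, replica)

    print(replica_groups(10, 2))  # (5, 0): five usable replica-2 pairs
    print(replica_groups(10, 3))  # (3, 1): one sub-brick left over with
                                  # replica-3 unless bricks split further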

How and when is brick splitting used?

Xavi

