[Gluster-users] Glusterfs Rack-Zone Awareness feature...

COCHE Sébastien SCOCHE at sigma.fr
Fri Apr 18 07:24:21 UTC 2014


Hi Jeff,

Thanks for your feedback.
I do not understand why it would be a problem to place a data replica on a different node group.
If a group of nodes becomes unavailable (due to a datacenter failure, for example), the volume should remain online, using the second group.

Regards


-----Original Message-----
From: Jeff Darcy [mailto:jdarcy at redhat.com]
Sent: Tuesday, 15 April 2014 16:37
To: COCHE Sébastien
Cc: gluster-users at gluster.org
Subject: Re: [Gluster-users] Glusterfs Rack-Zone Awareness feature...

> I have a little question.
> I have read the GlusterFS documentation looking for replication
> management. I want to be able to localize replicas on nodes hosted in
> 2 datacenters (dual-building).
> Couchbase provides the feature I’m looking for in GlusterFS:
> “Rack-Zone Awareness”.
> https://blog.couchbase.com/announcing-couchbase-server-25
> “Rack-Zone Awareness - This feature will allow logical groupings of 
> Couchbase Server nodes (where each group is physically located on a 
> rack or an availability zone). Couchbase Server will automatically 
> allocate replica copies of data on servers that belong to a group 
> different from where the active data lives. This significantly 
> increases reliability in case an entire rack becomes unavailable. This 
> is of particular importance for customers running deployments in public clouds.”

> Do you know if GlusterFS provides a similar feature?
> If not, do you plan to develop it in the near future?

There are two parts to the answer. Rack-aware placement in general is part of the "data classification" feature planned for the 3.6 release. 

http://www.gluster.org/community/documentation/index.php/Features/data-classification 

With this feature, files can be placed according to various policies using any of several properties associated with objects or physical locations. Rack-aware placement would use the physical location of a brick. Tiering would use the performance properties of a brick and the access time/frequency of an object. Multi-tenancy would use the tenant identity for both bricks and objects. And so on. It's all essentially the same infrastructure. 
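
To make that concrete, here is a quick sketch in Python (made-up brick properties and a hypothetical pick_bricks() helper; none of this is the actual data-classification API, just the concept) of how a policy could match object requirements against brick properties:

    # Hypothetical illustration of property-based placement -- a sketch
    # of the concept, not the planned GlusterFS implementation.
    BRICKS = [
        {"name": "server1:/b1", "rack": "dc1-rack1", "tier": "ssd"},
        {"name": "server2:/b1", "rack": "dc1-rack2", "tier": "hdd"},
        {"name": "server3:/b1", "rack": "dc2-rack1", "tier": "ssd"},
        {"name": "server4:/b1", "rack": "dc2-rack2", "tier": "hdd"},
    ]

    def pick_bricks(policy, count):
        # Choose up to `count` bricks matching `policy`, never reusing
        # a rack, so replicas land in distinct physical locations.
        chosen, racks_used = [], set()
        for brick in BRICKS:
            if (all(brick.get(k) == v for k, v in policy.items())
                    and brick["rack"] not in racks_used):
                chosen.append(brick["name"])
                racks_used.add(brick["rack"])
            if len(chosen) == count:
                break
        return chosen

    # Tiering and rack-awareness use the same machinery; only the
    # properties consulted differ.
    print(pick_bricks({"tier": "ssd"}, 2))  # two SSD bricks, different racks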

For replication decisions in particular, there needs to be another piece. Right now, the way we use N bricks with a replication factor of R is to define N/R replica sets each containing R members. This is sub-optimal in many ways. We can still compare the "value" or "fitness" of two replica sets for storing a particular object, but our options are limited to the replica sets as defined last time bricks were added or removed. The differences between one choice and another effectively get smoothed out, and the load balancing after a failure is less than ideal. To do this right, we need to use more (overlapping) combinations of bricks. Some of us have discussed ways that we can do this without sacrificing the modularity of having distribution and replication as two separate modules, but there's no defined plan or date for that feature becoming available. 
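
A toy version of that layout (a Python sketch of the N/R partitioning described above, not GlusterFS's actual DHT code) shows the limitation:

    # With N bricks and replica factor R, today's layout yields N/R
    # disjoint replica sets, fixed at the time bricks are added or removed.
    from zlib import crc32

    def replica_sets(bricks, r):
        # Partition bricks into non-overlapping sets of size r.
        return [bricks[i:i + r] for i in range(0, len(bricks), r)]

    def place(obj_name, sets):
        # Each object hashes to exactly one predefined set; there is no
        # way to pick a "fitter" combination of bricks for this object.
        return sets[crc32(obj_name.encode()) % len(sets)]

    bricks = ["b1", "b2", "b3", "b4", "b5", "b6"]
    sets = replica_sets(bricks, 2)   # [['b1','b2'], ['b3','b4'], ['b5','b6']]
    print(place("photo.jpg", sets))  # always the same pair for this name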

BTW, note that using *too many* combinations can also be a problem. Every time an object is replicated across a certain set of storage locations, it creates a coupling between those locations. Before long, all locations are coupled together, so that *any* failure of R-1 locations anywhere in the system will result in data loss or unavailability. Many systems, possibly including Couchbase Server, have made this mistake and become *less* reliable as a result.  Emin Gün Sirer does a better job describing the problem - and solutions - than I do, here:

http://hackingdistributed.com/2014/02/14/chainsets/
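
For a rough sense of the numbers, a back-of-the-envelope calculation (Python, with arbitrary N and R, following the copyset-style counting argument in the post above) compares the two extremes:

    # If replicas are sprayed across nearly every possible R-node
    # combination, almost any simultaneous failure of R nodes wipes out
    # all copies of *some* object. Fixed replica sets couple far fewer
    # combinations together.
    from math import comb

    N, R = 100, 3
    total = comb(N, R)     # 161700 possible 3-node failure patterns

    fixed = N // R         # disjoint sets: only 33 fatal combinations
    sprayed = total        # random spraying: nearly all are fatal

    print(f"P(data loss | {R} nodes fail), fixed sets:   {fixed / total:.6f}")
    print(f"P(data loss | {R} nodes fail), random spray: {sprayed / total:.6f}")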

