[Gluster-devel] Multi-network support proposal

Sun Feb 15 23:13:38 UTC 2015

Hope this message makes as much sense to me on Tuesday as it did at 3 AM in the airport ;-) Inline...

----- Original Message -----
> From: "Jeff Darcy" <jdarcy at redhat.com>
> To: "Ben England" <bengland at redhat.com>
> Cc: "Gluster Devel" <gluster-devel at gluster.org>, "Manoj Pillai" <mpillai at redhat.com>
> Sent: Sunday, February 15, 2015 1:49:17 AM
> Subject: Re: [Gluster-devel] Multi-network support proposal
> 
> > It's really important for glusterfs not to require that the clients mount
> > volumes using same subnet that is used by servers, and clearly your very
> > general-purpose proposal could address that.  For example, in a site where
> > non-glusterfs protocols are used, there are already good reasons for using
> > multiple subnets, and we want glusterfs to be able to coexist with
> > non-glusterfs protocols at a site.
> > 
> > However, is there a simpler way to allow glusterfs clients to connect to
> > servers through more than one subnet?  For example, suppose your Gluster
> > volume subnet is 172.17.50.0/24 and your "public" network used by glusterfs
> > clients is 1.2.3.0/22, but one of the servers also has an interface on
> > subnet 4.5.6.0/24.  So at the time that the volume is either created or
> > bricks are added/removed:
> > 
> > - determine what servers are actually in the volume
> > - ask each server to return the subnet for each of its active network
> > interfaces
> > - determine set of subnets that are directly accessible to ALL the volume's
> > servers
> > - write a glusterfs volfile for each of these subnets and save it
> > 
> > This process is O(N) where N is number of servers, but it only happens for
> > volume creation or addition/removal of bricks, and these events do not
> > happen very often (do they?).  In this example, 1.2.3.0/22 and
> > 172.17.50.0/24 would have glusterfs volfiles, but 4.5.6.0/24 would not.
> > 
> > So now when a client connects, the server knows which subnet the request
> > came through (getsockname), so it can just return the volfile for that
> > subnet.  If there is no volfile for that subnet, the client mount request
> > is rejected.  But what about existing Gluster volumes?  When software is
> > upgraded, we should provide a mechanism for triggering this volfile
> > generation process to open up additional subnets for glusterfs clients.
> > 
> > This proposal requires additional work to be done where volfiles are
> > generated and where glusterfs mount processing is done, but does not
> > require any additional configuration commands or extra user knowledge of
> > Gluster.  glusterfs clients can then use *any* subnet that is accessible
> > to all the servers.
> 
> That does have the advantage of not requiring any special configuration,
> and might work well enough for front-end traffic, but it has the
> drawback of not giving any control over back-end traffic.  How do
> *servers* choose which interfaces to use for NSR normal traffic,
> reconciliation/self-heal, DHT rebalance, and so on?  Which network
> should Ganesha/Samba servers use to communicate with bricks?  Even on
> the front end, what happens when we do get around to adding per-subnet
> access control or options?  For those kinds of use cases we need
> networks to be explicit parts of our model, not implicit or inferred.
> So maybe we need to reconcile the two approaches, and hope that the
> combined result isn't too complicated.  I'm open to suggestions.
> 
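For reference, the subnet discovery steps quoted above could be sketched roughly as follows.  This is only an illustration, not glusterd code: the server names, interface addresses and function names are all made up, and I am assuming the server identifies the incoming subnet from the local address of the accepted socket (what getsockname() would report).

```python
import ipaddress

# Hypothetical inventory: server name -> interface addresses in CIDR form.
# Note that an address such as 1.2.3.11/22 belongs to network 1.2.0.0/22,
# which is how the "1.2.3.0/22" subnet from the example normalizes.
SERVER_INTERFACES = {
    "server1": ["172.17.50.11/24", "1.2.3.11/22"],
    "server2": ["172.17.50.12/24", "1.2.3.12/22", "4.5.6.12/24"],
}


def common_subnets(server_interfaces):
    """Return the set of subnets that every server has an interface on."""
    per_server = [
        {ipaddress.ip_interface(addr).network for addr in addrs}
        for addrs in server_interfaces.values()
    ]
    return set.intersection(*per_server) if per_server else set()


def subnet_for_local_address(local_ip, subnets):
    """Return the subnet containing local_ip, or None if the mount is rejected.

    local_ip is the server-side address of the connection, i.e. the
    interface the mount request arrived on.
    """
    ip = ipaddress.ip_address(local_ip)
    for net in subnets:
        if ip in net:
            return net
    return None


# One volfile would be generated per subnet in this set.
volfile_subnets = common_subnets(SERVER_INTERFACES)
```

Here volfile_subnets contains 172.17.50.0/24 and 1.2.0.0/22 but not 4.5.6.0/24, because only server2 has an interface there.  A request arriving on 4.5.6.12 would get None back and be rejected.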

In defense of your proposal, you are right that it is difficult to manage each node's network configuration independently or by volfile, and it would be useful to a system manager to be able to configure Gluster network behavior across the entire volume.  For example, you can use pdsh to issue commands to any subset of Gluster servers, but what if some of them are down at the time the command is issued?  How do you make these configuration changes persistent?  What happens when you add or remove servers from the volume?  That to me is the real selling point of your proposal - if we have a 60-node or even a 1000-node Gluster volume, we could provide a way to control network behavior in a persistent, highly-available, scalable way with as few sysadmin operations as possible. 

I have two concerns:

1) Do we have to specify each host's address rewriting in your example - why not something like this?

# gluster network add client-net 1.2.3.0/24 

glusterd could then use a discovery process as I described earlier to determine for each server what its IP address is on that subnet and rewrite volfiles accordingly.

The advantage of this subnet-based specification IMHO is that it scales - as you add and remove nodes, you do not have to change the "client-net" entity, you just make sure that Gluster servers provide the appropriate network interface with the appropriate IP address and subnet mask.
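
To make point 1 concrete, here is a rough sketch of the per-server lookup that a subnet-based "client-net" definition would need.  The function names, server names and addresses are hypothetical, and how glusterd would actually collect interface addresses is not shown; this only illustrates that the admin supplies one subnet and the per-host addresses fall out of it.

```python
import ipaddress


def address_on_subnet(interface_addrs, subnet):
    """Return the first of a server's addresses that lies in subnet, else None."""
    net = ipaddress.ip_network(subnet)
    for addr in interface_addrs:
        ip = ipaddress.ip_interface(addr).ip
        if ip in net:
            return str(ip)
    return None


def resolve_network(server_interfaces, subnet):
    """Map each server to the address that volfiles for this subnet should use."""
    resolved = {}
    for server, addrs in server_interfaces.items():
        ip = address_on_subnet(addrs, subnet)
        if ip is None:
            # A server without an interface on the subnet cannot serve it.
            raise ValueError(f"{server} has no interface on {subnet}")
        resolved[server] = ip
    return resolved


# Hypothetical servers; adding a node only means adding an entry here,
# the "client-net" definition itself (1.2.3.0/24) does not change.
servers = {
    "server1": ["172.17.50.11/24", "1.2.3.11/24"],
    "server2": ["172.17.50.12/24", "1.2.3.12/24"],
}
client_net = resolve_network(servers, "1.2.3.0/24")
```

Raising an error for a server that lacks an interface on the subnet is just one possible policy; skipping that server or refusing the whole command are others.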

2) Could we keep the number of roles and the sysadmin interface in general from getting too complicated?  Here's an oversimplified model of Gluster networking - there are at most 2 kinds of subnets on each server in use by Gluster or apps:

- replication subnet - this is the subnet used at volume creation time.  It is used by any background activity involving communication between servers, including NSR replication traffic, self-heal, and DHT rebalance.
- "access" subnet(s?) - used by all access protocols, including glusterfs, NFS, SMB, Swift, client-side libgfapi, geo-replication, etc.

In some sites, we can use a single subnet that serves both "roles", but separating these two types of traffic can reduce network contention and latency and may be desirable for security reasons, etc.
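
As a sketch of how small this two-role model is, the mapping could be written down like this.  The Role enum, the traffic-type strings and the subnets are placeholders chosen for illustration; nothing here corresponds to an existing Gluster option.

```python
from enum import Enum


class Role(Enum):
    REPLICATION = "replication"  # server-to-server background traffic
    ACCESS = "access"            # anything serving clients or applications


# Traffic types taken from the two bullets above (names are illustrative).
TRAFFIC_ROLE = {
    "nsr": Role.REPLICATION,
    "self-heal": Role.REPLICATION,
    "rebalance": Role.REPLICATION,
    "glusterfs": Role.ACCESS,
    "nfs": Role.ACCESS,
    "smb": Role.ACCESS,
    "swift": Role.ACCESS,
    "libgfapi": Role.ACCESS,
    "geo-replication": Role.ACCESS,
}


def subnet_for(traffic, role_subnets):
    """Pick the subnet that a given kind of traffic should use."""
    return role_subnets[TRAFFIC_ROLE[traffic]]


# Site with separate networks for the two roles.
split_site = {Role.REPLICATION: "172.17.50.0/24", Role.ACCESS: "1.2.3.0/24"}
# Site where a single subnet serves both roles.
flat_site = {Role.REPLICATION: "172.17.50.0/24", Role.ACCESS: "172.17.50.0/24"}
```

With split_site, self-heal traffic stays on 172.17.50.0/24 while SMB clients use 1.2.3.0/24; with flat_site both lookups return the same subnet.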

If you want to get more throughput for either of these subnets, you can use Linux bonding to do so.  Ceph benchmarkers have used bonding on the replication subnet to improve write performance, for example.  Recommendations for bonding usage with Gluster are described here:

http://www.gluster.org/community/documentation/index.php/Network_Configuration_Techniques

But with the advent of 40-Gbps networks and RDMA, we have such a huge fire-hose that we may be worrying too much about network optimization and traffic segregation and not worrying enough about Gluster's ability to keep up with it.

Security is another concern, but if you want to address that I still think that a subnet-based approach could be useful for Gluster, or at least we don't want to force people to specify such rules host by host.  Other kinds of security rules can be implemented by a firewall, etc.

-ben