[Gluster-devel] Replicate/AFR Using Broadcast/Multicast?

Gordan Bobic gordan at bobich.net
Wed Oct 13 08:06:56 UTC 2010


Beat Rubischon wrote:
> Hello!
> 
> Quoting <gordan at bobich.net> (13.10.10 09:11):
> 
>>> I would also expect to see network issues as a cluster grows.
>> Performance degrading as the node count increases isn't seen as a bigger
>> issue?
> 
> I have pretty bad experience with multicast. Running several clusters in
> the 500-1000 node range in a single broadcast domain over the last year
> has shown that broadcast or multicast can easily kill your fabrics.

What sort of a cluster are you running with that many nodes? RHCS? 
Heartbeat? Something else entirely? In what arrangement?

> Even the most expensive GigE switch chassis could be killed by 125+ MBytes
> of traffic, which is almost nothing :-)

Sounds like a typical example of cost not being a good measure of 
quality and performance. :)

> The inbound traffic must be routed to
> several outbound ports, which results in congestion. Throttling and packet
> loss are the result, even for the normal unicast traffic. Multicast or
> broadcast is a nice way to cause a denial of service in a LAN.

If you have that many GlusterFS nodes, you have DoS-ed the storage 
network anyway, purely through the write amplification that makes write 
throughput scale inversely with the node count. The idea is that you 
have a (mostly) dedicated VLAN for this.
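
To put a rough number on that, here is a back-of-the-envelope sketch 
(plain Python; the function name and figures are hypothetical, the 
arithmetic is the point):

    # Sketch: write amplification of client-side n-way replication.
    # With AFR the client writes to every replica itself, so the
    # storage network carries n copies of each logical write.
    def wire_traffic_mbps(client_write_mbps, replicas):
        return client_write_mbps * replicas

    # Illustrative: 100 Mb/s of logical writes into a 3-way replica
    print(wire_traffic_mbps(100, 3))  # -> 300 Mb/s on the wire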

> In Infiniband, multicast is typically realized by looping through the
> destinations directly on the source HBA. It is one of the main targets
> of current development, as multicast has a network pattern similar to
> the MPI collectives. So there is no win in using multicast over such a
> fabric.

Sure, but historically in the networking space, non-ethernet 
technologies have always been niche, cost-ineffective in terms of 
price/performance, and have only ever held a temporary performance 
advantage.

At the moment the only way to achieve linear-ish scaling is to connect 
the nodes directly to each other, but that means that each node in an 
n-way cluster has to have at least n NICs (one to each other node plus 
at least one to the clients). That becomes impractical very quickly.
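
Just to illustrate how quickly (another throwaway Python sketch; the 
names and node counts are mine, the arithmetic is the point):

    # Sketch: NIC count for a full point-to-point replication mesh.
    # Each node needs one NIC per peer plus at least one client-facing
    # NIC, i.e. (n - 1) + 1 = n; the mesh needs n(n-1)/2 links in all.
    def mesh_cost(nodes, client_nics=1):
        nics_per_node = (nodes - 1) + client_nics
        total_links = nodes * (nodes - 1) // 2
        return nics_per_node, total_links

    for n in (3, 5, 10, 20):
        print(n, mesh_cost(n))
    # 3 -> (3, 3), 5 -> (5, 10), 10 -> (10, 45), 20 -> (20, 190)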

> Using parallel storage means you have communication from a large number
> of nodes to a large number of servers. Using unicast looks bad at first
> glance, but I'm confident it's the better solution overall.

The problem is that the write bandwidth is fundamentally limited by the 
speed of your interconnect, and this is shared between all the nodes. So 
if you are on 1Gb ethernet, your total volume of writes between all 
the replicas cannot exceed 1Gb/s. If you have 10 nodes, that limits your 
write throughput to 100Mb/s (see the sketch after the two options 
below). The only way to scale is either by:

1) Putting ever more NICs into each storage chassis and turning the 
replication network into what is essentially a point-to-point network.
Complex, inefficient and adding to the cost: 1Gb NICs are cheap, but 
10Gb ones are still expensive. There is also a limit on how many NICs 
you can cram into a COTS storage node.

2) Upgrading to an ever-faster interconnect (10Gb ethernet, Infiniband, 
etc.). Expensive, and it still doesn't scale as you add more storage nodes.
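
In sketch form (Python again; illustrative numbers, assuming every 
write is replicated to all n storage nodes over one shared link):

    # Sketch: usable write throughput on a shared replication network.
    # Every logical write crosses the shared interconnect once per
    # replica, so usable write bandwidth is roughly link / n.
    def usable_write_mbps(link_mbps, replica_nodes):
        return link_mbps / replica_nodes

    print(usable_write_mbps(1000, 10))  # 1Gb link, 10 nodes -> 100.0

Which is exactly the 100Mb/s figure above, and it only gets worse as 
you add nodes.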

Right now more storage nodes means slower storage, and that should 
really be addressed.

Gordan



