[Gluster-devel] Replicate/AFR Using Broadcast/Multicast?
Gordan Bobic
gordan at bobich.net
Wed Oct 13 21:30:21 UTC 2010
On 10/13/2010 01:22 PM, Beat Rubischon wrote:
> Hi Gordan!
>
> Quoting <gordan at bobich.net> (13.10.10 10:06):
>
>> What sort of a cluster are you running with that many nodes? RHCS?
>> Heartbeat? Something else entirely? In what arrangement?
>
> High performance clusters. The main target Gluster was made for :-)
I'm curious about your use case. I'm guessing it is mostly dependent on
throughput and not particularly sensitive to I/O latency.
>>> Even the most expensive GigE switch chassis could be killed by 125+ MBytes
>>> of traffic which is almost nothing :-)
>> Sounds like a typical example of cost not being a good measure of
>> quality and performance. :)
>
> It's simply a technical limit. Think about what broadcast is and how it
> passes a switch.
I'm fully aware of that, but if your switching fabric can't handle the
full rated bandwidth of the switch, that's pretty poor. Then again, I
expect specmanship* everywhere these days and don't believe any figures
until I've tested them myself.
>>> In Infiniband...
>> Sure, but historically in the networking space, non-ethernet
>> technologies have always been niche, cost ineffective in terms of
>> price/performance and only had a temporary performance advantage.
>
> Right. You'll be surprised but the price per port is much lower in the
> Infiniband world compared to the 10GigE world. When using GlusterFS inside a
> datacenter Infiniband could be a good choice.
Maybe this year. Unlikely to be the case next year.
>> Right now more storage nodes means slower storage, and that should
>> really be addressed.
>
> Wrong. Assuming you have a "distribute" concept. 10 clients talk to 5
> servers. Storing a file means the client writes the file to one of the
> servers. Reading the same. So the bandwidth of each server is accumulated.
> With GigE this means you'll have about 600MBytes/s network bandwidth.
> Additional servers will add additional bandwidth - as long as you scale not
> only servers but also clients. One small exception: The lookup of a file
> must be directed to all servers. One of the reasons why GlusterFS is
> "better" for a small number of large files than for a large number of
> small files.
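(As a sanity check on that 600 MBytes/s figure - a quick back-of-the-envelope
in Python, assuming ~125 MBytes/s usable per GigE server link and the five
servers from your example; real numbers will be lower once protocol overhead
and lookups are accounted for:)

GIGE_MB_S = 125                  # assumed usable payload rate per GigE link
servers = 5

# With distribute, each client talks to one server per file, so the server
# links add up as long as there are enough clients to keep them all busy.
aggregate = servers * GIGE_MB_S
print("aggregate distribute bandwidth: ~%d MBytes/s" % aggregate)   # ~625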
Multiple lookups cause latency, and latency is already a serious issue
on Gluster. I'm talking about the straight replicate case: write
throughput drops in inverse proportion to the number of replicas.
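To put a number on it - a minimal sketch, assuming client-side AFR and a
single GigE client link (illustrative figures only):

LINK_MB_S = 125                  # assumed usable client link bandwidth

# With client-side replication the client pushes every byte once per
# replica over its own NIC, so useful write throughput divides by N.
for replicas in (1, 2, 3, 4):
    print("%d replica(s): ~%.0f MBytes/s writes" % (replicas, LINK_MB_S / replicas))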
> Right when you use a "replicate" concept. Your client has to write to both
> members of the replica.
I usually run with server-side replication specifically for that reason
- I can have a dedicated VLAN for storage servers with as much network
bandwidth as I can throw at it. Then I can have the servers sort out the
replication overheads between them, rather than needing a multiple of
bandwidth to the clients as well.
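Purely as an illustration of where the bandwidth cost lands (hypothetical
traffic() helper, nothing from the Gluster codebase) - client-side vs.
server-side replication for a single write:

def traffic(size_mb, replicas, mode):
    # "client-side": the client writes the payload to every replica itself.
    # "server-side": the client writes once; the first server fans the data
    # out to the remaining replicas over the dedicated storage VLAN.
    if mode == "client-side":
        return {"client_uplink_mb": size_mb * replicas, "storage_vlan_mb": 0}
    if mode == "server-side":
        return {"client_uplink_mb": size_mb, "storage_vlan_mb": size_mb * (replicas - 1)}
    raise ValueError(mode)

print(traffic(100, 3, "client-side"))   # client pushes 300 MB over its own NIC
print(traffic(100, 3, "server-side"))   # client pushes 100 MB; servers move 200 MB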
> Additional replicas will consume additional
> bandwith. But hey - who needs more then two replicas? BTW: The servers will
> never talk to each other. It's always the client who transfers the data.
Unless you use server-side replicate, which is much more manageable and
controllable in terms of bandwidth requirements. And trust me, more than
two replicas are useful. I have seen both disks in a RAID1 mirror fail
more than once.
> The perfect solution is probably a "distribute" over a "replicate". Mirror
> the files over two bricks. Use your mirrors to build a large filesystem with
> distribute. Your performance will scale with the number of bricks but you'll
> keep the stability of a fully redundant setup.
Depends on your use case. Sometimes it is more useful to have all the
data locally available for read performance. But in that case write
performance goes through the floor with that many replicas.
Broadcasting the writes only once would solve it in one fell swoop.
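For what it's worth, this is the kind of thing I mean by broadcasting the
writes: put the payload on the wire once and let every replica server pick
it up. A bare-bones UDP multicast sketch in Python - group address, port and
the whole reliability/ordering problem are hand-waved, and this is not how
AFR actually works today:

import socket
import struct

MCAST_GRP = "239.255.42.1"       # hypothetical multicast group on the storage VLAN
MCAST_PORT = 5007

def send_write(payload):
    # One send; every replica server that joined the group receives it.
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    s.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, 1)
    s.sendto(payload, (MCAST_GRP, MCAST_PORT))

def replica_recv():
    # Each storage server joins the group and sees the same datagram.
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    s.bind(("", MCAST_PORT))
    mreq = struct.pack("4sl", socket.inet_aton(MCAST_GRP), socket.INADDR_ANY)
    s.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)
    data, _addr = s.recvfrom(65535)
    return data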
Gordan
*specmanship, n: The art of misrepresenting the capabilities of a device
for marketing purposes, typically by saying it will do X and Y when it
cannot in fact do X and Y at the same time.