[Gluster-devel] Replicate/AFR Using Broadcast/Multicast?
Gordan Bobic
gordan at bobich.net
Wed Oct 13 21:30:21 UTC 2010
On 10/13/2010 01:22 PM, Beat Rubischon wrote:
> Hi Gordan!
>
> Quoting <gordan at bobich.net> (13.10.10 10:06):
>
>> What sort of a cluster are you running with that many nodes? RHCS?
>> Heartbeat? Something else entirely? In what arrangement?
>
> High performance clusters. The main target Gluster was made for :-)
I'm curious about your use case. I'm guessing it is mostly dependent on
throughput and not particularly sensitive to I/O latency.
>>> Even the most expensive GigE switch chassis could be killed by 125+ MBytes
>>> of traffic which is almost nothing :-)
>> Sounds like a typical example of cost not being a good measure of
>> quality and performance. :)
>
> It's simply a technical limit. Think about what broadcast is and how it
> passes a switch.
I'm fully aware of that, but if your switching fabric can't handle the
full rated bandwidth of the switch, that's pretty poor. Then again, I
expect specmanship* everywhere these days and don't believe any figures
until I've tested them myself.
>>> In Infiniband...
>> Sure, but historically in the networking space, non-ethernet
>> technologies have always been niche, cost ineffective in terms of
>> price/performance and only had a temporary performance advantage.
>
> Right. You'll be surprised but the price per port is much lower in the
> Infiniband world compared to the 10GigE world. When using GlusterFS inside a
> datacenter Infiniband could be a good choice.
Maybe this year. Unlikely to be the case next year.
>> Right now more storage nodes means slower storage, and that should
>> really be addressed.
>
> Wrong. Assuming you have a "distribute" concept. 10 clients talk to 5
> servers. Storing a file means the client writes the file to one of the
> servers. Reading the same. So the bandwidth of each server is accumulated.
> With GigE this means you'll have about 600MBytes/s network bandwidth.
> Additional servers will add additional bandwidth - as long as you scale not
> only servers but also clients. One small exception: The lookup of a file
> must be directed to all servers. One of the reasons why GlusterFS is
> "better" for a small number of large files than for a large number of
> small files.
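(As a sanity check on that 600 MBytes/s figure - a quick back-of-the-envelope
in Python, assuming ~125 MBytes/s usable per GigE server link and the five
servers from your example; real numbers will be lower once protocol overhead
and lookups are accounted for:)

GIGE_MB_S = 125                  # assumed usable payload rate per GigE link
servers = 5

# With distribute, each client talks to one server per file, so the server
# links add up as long as there are enough clients to keep them all busy.
aggregate = servers * GIGE_MB_S
print("aggregate distribute bandwidth: ~%d MBytes/s" % aggregate)   # ~625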
Multiple lookups cause latency, and latency is already a serious issue
on Gluster. I'm talking about the straight replicate case: write
throughput drops in inverse proportion to the number of replicas.
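To put a number on it - a minimal sketch, assuming client-side AFR and a
single GigE client link (illustrative figures only):

LINK_MB_S = 125                  # assumed usable client link bandwidth

# With client-side replication the client pushes every byte once per
# replica over its own NIC, so useful write throughput divides by N.
for replicas in (1, 2, 3, 4):
    print("%d replica(s): ~%.0f MBytes/s writes" % (replicas, LINK_MB_S / replicas))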
> Right when you use a "replicate" concept. Your client has to write to both
> members of the replica.
I usually run with server-side replication specifically for that reason
- I can have a dedicated VLAN for storage servers with as much network
bandwidth as I can throw at it. Then I can have the servers sort out the
replication overheads between them, rather than needing a multiple of
bandwidth to the clients as well.
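Purely as an illustration of where the bandwidth cost lands (hypothetical
traffic() helper, nothing from the Gluster codebase) - client-side vs.
server-side replication for a single write:

def traffic(size_mb, replicas, mode):
    # "client-side": the client writes the payload to every replica itself.
    # "server-side": the client writes once; the first server fans the data
    # out to the remaining replicas over the dedicated storage VLAN.
    if mode == "client-side":
        return {"client_uplink_mb": size_mb * replicas, "storage_vlan_mb": 0}
    if mode == "server-side":
        return {"client_uplink_mb": size_mb, "storage_vlan_mb": size_mb * (replicas - 1)}
    raise ValueError(mode)

print(traffic(100, 3, "client-side"))   # client pushes 300 MB over its own NIC
print(traffic(100, 3, "server-side"))   # client pushes 100 MB; servers move 200 MB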
> Additional replicas will consume additional
> bandwith. But hey - who needs more then two replicas? BTW: The servers will
> never talk to each other. It's always the client who transfers the data.
Unless you use server-side replicate, which is much more manageable and
controllable in terms of bandwidth requirements. And trust me, more than
two replicas are useful. I have seen both disks in a RAID1 mirror fail
more than once.
> The perfect solution is probably a "distribute" over a "replicate". Mirror
> the files over two bricks. Use your mirrors to build a large filesystem with
> distribute. Your performance will scale with the number of bricks but you'll
> keep the stability of a fully redundant setup.
Depends on your use case. Sometimes it is more useful to have all the
data locally available for read performance. But in that case write
performance goes through the floor with that many replicas.
Broadcasting the writes only once would solve it in one fell swoop.
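For what it's worth, this is the kind of thing I mean by broadcasting the
writes: put the payload on the wire once and let every replica server pick
it up. A bare-bones UDP multicast sketch in Python - group address, port and
the whole reliability/ordering problem are hand-waved, and this is not how
AFR actually works today:

import socket
import struct

MCAST_GRP = "239.255.42.1"       # hypothetical multicast group on the storage VLAN
MCAST_PORT = 5007

def send_write(payload):
    # One send; every replica server that joined the group receives it.
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    s.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, 1)
    s.sendto(payload, (MCAST_GRP, MCAST_PORT))

def replica_recv():
    # Each storage server joins the group and sees the same datagram.
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    s.bind(("", MCAST_PORT))
    mreq = struct.pack("4sl", socket.inet_aton(MCAST_GRP), socket.INADDR_ANY)
    s.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)
    data, _addr = s.recvfrom(65535)
    return data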
Gordan
*specmanship, n: The art of misrepresenting the capabilities of a device
for marketing purposes, typically by saying it will do X and Y when it
cannot in fact do X and Y at the same time.