[Gluster-devel] Improving real world performance by moving files closer to their target workloads

Gordan Bobic gordan at bobich.net
Fri May 16 18:28:44 UTC 2008


Derek Price wrote:

> The DLM that GFS uses must already take this into account since it 
> appears to work just fine, and the GPL'd code for that DLM was 
> officially added to the Linux kernel with release 2.6.19, according to 
> Wikipedia.  Not sure how portable that would be, but the source is 
> available...

I'm not sure how portable that would be. A nice thing about GlusterFS is 
that the only requirement is FUSE, which means it'll also work on Solaris.

>>> If some HA and fault-tolerant DHT implementation exists that already 
>>> handles atomic hash inserts with recognizable failures for keys that 
>>> already exist, then perhaps that could take the place of DLM's quorum 
>>> model, but I think any algorithm that requires contacting all nodes 
>>> will prove to be a bad idea in the end.
>>
>> Not all nodes - only the nodes that contain a certain file. A single 
>> ping broadcast to find out who has a copy of the file should prove to 
>> be of insignificant bandwidth overhead compared to actual file 
>> transfers, unless you are dealing with a lot of files that are 
>> significantly smaller than a network packet.
> 
> My point was that, as I understood your algorithm, a client would not 
> know which nodes contained a certain file until all nodes had been 
> contacted.  So, while the actual bandwidth, even to consult thousands of
> nodes, might be small relative to file transfer bandwidth, the client 
> can't assume it has a complete answer until it gets all the replies, 
> meaning requests to downed nodes have timed out.

I agree that waiting for all nodes could be an issue in the case of downed 
nodes, and I concur that quorum would be a good work-around.

Broadcasting a single packet (it should easily fit into a single 1500 byte 
ethernet frame) to all nodes isn't _hugely_ expensive.

Multicast is usually UDP, so there are no TCP timeouts/retries to contend 
with. It wouldn't matter if some nodes are down - we can act as soon as 
we have answers from (n/2)+1 nodes, assuming, in the case of requesting a 
file that isn't local, that one of those peers has the file.
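
To make that concrete, something along these lines could work (a rough 
Python sketch; the multicast group, port, message format and reply fields 
are invented purely for illustration):

import json
import socket
import time

MCAST_GRP = "239.1.2.3"    # illustrative group address only
MCAST_PORT = 9876

def locate_file(path, n_nodes, timeout=0.5):
    """Multicast a 'who has <path>?' query and act as soon as a quorum
    of (n/2)+1 nodes has answered, rather than waiting for everyone."""
    quorum = (n_nodes // 2) + 1
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, 1)
    query = json.dumps({"op": "locate", "path": path}).encode()
    sock.sendto(query, (MCAST_GRP, MCAST_PORT))   # fits one ethernet frame

    replies = {}
    deadline = time.time() + timeout
    while len(replies) < quorum and time.time() < deadline:
        sock.settimeout(max(deadline - time.time(), 0.01))
        try:
            data, addr = sock.recvfrom(1500)
        except socket.timeout:
            break
        replies[addr[0]] = json.loads(data)

    # Peers that claim to hold a copy of the file
    return [peer for peer, r in replies.items() if r.get("has_copy")]

The requester never blocks on a downed node - it either collects a quorum 
of answers or gives up at the deadline.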

> Meaning that if you 
> assume that at least one node will always be down, then the minimum time 
> to locate a node with the most recent copy of the file (and thus the 
> minimum time to begin any read) is always at least the timeout attached 
> to waiting for the ping reply.

There are ways around that. Flag a node as being out of the cluster when 
quorum decides it is unresponsive, and fence it.
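
In other words, once a quorum agrees a peer is unresponsive it simply 
stops being counted, so nobody keeps paying its timeout. Roughly (the 
suspicion bookkeeping here is hypothetical, just to illustrate the idea):

def vote_out_unresponsive(members, suspicions):
    # `members` is the current set of node IDs; `suspicions` maps a node
    # to the set of members that consider it unresponsive (hypothetical
    # bookkeeping fed by missed replies/heartbeats).
    quorum = (len(members) // 2) + 1
    dead = {node for node, voters in suspicions.items()
            if len(voters) >= quorum}
    # Nodes voted out get fenced and dropped from the cluster view.
    return members - dead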

> Having the entire quorum aware of which version of each file is the most 
> recent and where to find the file avoids this problem, again, until just 
> less than half the nodes become unreachable.

There should, in theory, be only one version of the file in the entire 
cluster. If there isn't, then the AFR auto-heal should be invoked to see 
to it that there is only one. The important thing is to know which nodes 
have a copy of the file.

>>> I might optimize the expunge algorithm slightly by having nodes with 
>>> low loads volunteer to copy files that otherwise couldn't be expunged 
>>> from a node.  Better yet, perhaps, would be a background process that 
>>> runs on lightly loaded nodes and tries to create additional redundant 
>>> copies at some configurable tolerance beyond the "minimum # of 
>>> copies" threshold.
>>
>> Not just lightly loaded nodes, but more importantly, nodes with most 
>> free space available. :)
> 
> Yes, the algorithm to detect "loading" should probably consider as many 
> resource constraints as appears practical.

Load in terms of performance is a non-critical optimization. Meeting the 
space requirement, on the other hand, is mandatory. :)

>>>> For file delta writes, an AFR type mechanism could be used to send 
>>>> the deltas to all the nodes that have the file. This could all get 
>>>> quite tricky, because it might require a separate multicast group to 
>>>> be set up for up to every node combination subset, in order to keep 
>>>> the network bandwidth down (or you'd just end up broadcasting to all 
>>>> nodes, which means things wouldn't scale the way they should with 
>>>> switches - it'd be more like using hubs).
>>>>
>>>> This would potentially have the problem that there is only 24 bits 
>>>> of IP multicast address space, but that should provide enough groups 
>>>> with sensible redundancy levels to cover all node combinations. This 
>>>> may or may not be way OTT complicated, though. There is probably a 
>>>> simpler and more sane solution.
>>>
>>> I'm not sure what overhead is involved in creating multicast groups, 
>>> but they would only be required for files currently locked for write, 
>>> so perhaps creating and discarding the multicast groups could be done 
>>> in conjunction with creation and release of write locks.
>>
>> Sure, these could be dynamic, but setup and teardown might cause 
>> enough overhead that you might as well be broadcasting all the locks 
>> and writes, and just expect the affected nodes to pick those out of 
>> the air and act on them.
>>
>>> It's also possible that you could reduce the complexity of this 
>>> problem by simply discarding as many copies down to as close to the 
>>> minimum # as other nodes will allow, on write.  However, I think that 
>>> might reduce some of the performance benefits this design otherwise 
>>> gives each node.
>>
>> Also remember that the broadcasts or multicasts would only actually be 
>> useful for locks and file discovery. The actual read file transfer 
>> would be point-to-point and writes would be distributed to only the 
>> subset of nodes that are currently caching the files.
> 
> Read would be point-to-point (perhaps multi-point to point for implicit 
> read striping across all known valid copies?), but it could still be 
> useful to use multi-cast for write, especially if the redundant copies 
> were behind a different switch than the node accepting the write.  So 
> multi-cast setup could happen when a server obtained a write lock, and 
> teardown would be delayed until synchronization of redundant copies had 
> completed.

Possibly, but if the possible node combinations can be enumerated for a 
given number of nodes and minimum required redundancy, setting the groups 
up statically and using a hash lookup would probably be quicker, as it 
wouldn't require constant setup/teardown.

We have 2^24 possible multicast "channels" (addresses). The number of 
possible ways to pick k nodes out of n (files being k-redundant) is 
C(n, k) = n! / (k! * (n-k)!).

Whether these constraints would allow for sufficiently large clusters, I 
don't know.
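
As a rough sanity check (Python sketch; the 239.x.y.z hashing scheme below 
is made up to illustrate the static hash-lookup idea, and a real scheme 
would have to deal with collisions):

import hashlib
from math import comb

MULTICAST_GROUPS = 2 ** 24           # 24 bits of IPv4 multicast group space

def subsets(n, k):
    # Distinct k-node subsets of an n-node cluster: n! / (k! * (n-k)!)
    return comb(n, k)

def group_address(node_ids):
    # Deterministically hash a node subset onto one of the 2^24 groups
    # in 239.0.0.0/8 (purely illustrative; collisions are possible).
    key = ",".join(sorted(node_ids)).encode()
    idx = int.from_bytes(hashlib.sha1(key).digest()[:3], "big")   # 24 bits
    return "239.%d.%d.%d" % (idx >> 16, (idx >> 8) & 0xff, idx & 0xff)

# With 3-way redundancy:
#   subsets(50, 3)  ->     19,600   (fits comfortably)
#   subsets(466, 3) -> 16,757,360   (just under 2^24 = 16,777,216)
#   subsets(467, 3) -> 16,865,705   (no longer fits)

So with k=3 the address space runs out somewhere around 466 nodes, and 
higher redundancy levels hit the limit with far fewer nodes.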

>> There would need to be special handling of a case where a node 
>> accepting a big write is running out of space as a consequence and 
>> something has to be dropped. Obviously, none of the currently open 
>> files can be discarded, so there would need to be some kind of an 
>> auxiliary process that would make a node request a "volunpeer" (pun 
>> intended) to take over a file that it needs to flush out, if 
>> discarding it would bring the redundancy below the required threshold.
> 
> I think this could be worked into the normal expunge algorithm with a 
> property like: "ANY request to expunge a file that reduces the file 
> count below the redundancy threshold will ALWAYS generate a volunpeer IF 
> at least one node exists with the disk space available".

Yes. Failing that, we could try the next LRU file.
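
To put that property in concrete terms, the expunge path could look 
something like this (the node/cluster interfaces here are entirely 
hypothetical placeholders):

def expunge_lru(node, cluster, min_copies):
    # Walk the node's files from least to most recently used, and free
    # space as soon as one of them can safely be dropped.
    for f in node.files_by_lru():
        if f.is_open():
            continue                       # never drop a currently open file
        if cluster.copy_count(f) > min_copies:
            node.drop(f)                   # redundancy is still met elsewhere
            return True
        # Dropping would fall below the redundancy threshold, so ask for
        # a volunpeer (ideally a lightly loaded node with free space).
        volunteer = cluster.request_volunteer(f)
        if volunteer is not None:
            volunteer.fetch(f)
            node.drop(f)
            return True
        # No taker for this file - fall through to the next LRU candidate.
    return False    # nothing could be expunged; caller may return ENOSPC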

> It wouldn't require any special casing - the needed space will always 
> become available upon expunge if space for the migrating file exists 
> anywhere on the network.  If all the files are expunged, or they can't 
> be even with this property of expunge, and the local disk still fills 
> up, then I think it would be reasonable for the FS to return a disk full 
> error.

Agreed.

Gordan
