[Gluster-devel] Improving real world performance by moving files closer to their target workloads
Gordan Bobic
gordan at bobich.net
Fri May 16 18:28:44 UTC 2008
Derek Price wrote:
> The DLM that GFS uses must already take this into account since it
> appears to work just fine, and the GPL'd code for that DLM was
> officially added to the Linux kernel with release 2.6.19, according to
> Wikipedia. Not sure how portable that would be, but the source is
> available...
I'm not sure how portable that would be. A nice thing about GlusterFS is
that the only requirement is FUSE, which means it'll also work on Solaris.
>>> If some HA and fault-tolerant DHT implementation exists that already
>>> handles atomic hash inserts with recognizable failures for keys that
>>> already exist, then perhaps that could take the place of DLM's quorum
>>> model, but I think any algorithm that requires contacting all nodes
>>> will prove to be a bad idea in the end.
>>
>> Not all nodes - only the nodes that contain a certain file. A single
>> ping broadcast to find out who has a copy of the file should prove to
>> be of insignificant bandwidth overhead compared to actual file
>> transfers, unless you are dealing with a lot of files that are
>> significantly smaller than a network packet.
>
> My point was that, as I understood your algorithm, a client would not
> know which nodes contained a certain file until all nodes had been
> contacted. So, while the actual bandwidth, even to consult thousands of
> nodes, might be small relative to file transfer bandwidth, the client
> can't assume it has a complete answer until it gets all the replies,
> meaning requests to downed nodes have timed out.
I agree that waiting for all nodes could be an issue in case of downed
nodes, and I concur that quorum would be a good work-around.
Broadcasting a single packet (it should easily fit into a single 1500-byte
ethernet frame) to all nodes isn't _hugely_ expensive.
Multicast is usually UDP, so there are no TCP timeouts/retries to contend
with. It wouldn't matter if some nodes are down - we can act as soon as
we have answers from (n/2)+1 nodes, assuming, in the case of requesting a
file that isn't local, that one of those peers has the file.
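Something along these lines is roughly what I have in mind for
quorum-bounded discovery. It's a purely illustrative Python sketch; the
group address, port and WHO_HAS/HAVE message wording are all made up for
the example, nothing GlusterFS-specific:

    # Ask the cluster who holds a file; stop as soon as a majority has answered.
    import socket
    import time

    DISCOVERY_GROUP = "239.1.1.1"   # assumed organisation-local multicast group
    DISCOVERY_PORT = 9876           # assumed port

    def discover_file(path, cluster_size, timeout=0.5):
        """Return the replies collected once (n/2)+1 nodes have answered,
        or the overall timeout expires (e.g. because nodes are down)."""
        quorum = cluster_size // 2 + 1
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, 1)
        sock.settimeout(0.05)
        query = ("WHO_HAS " + path).encode()    # fits easily in one 1500-byte frame
        sock.sendto(query, (DISCOVERY_GROUP, DISCOVERY_PORT))

        replies = {}
        deadline = time.time() + timeout
        while len(replies) < quorum and time.time() < deadline:
            try:
                data, addr = sock.recvfrom(1500)
            except socket.timeout:
                continue
            replies[addr[0]] = data.decode()    # e.g. "HAVE <version>" or "NOT_HERE"
        return replies

The point being that the caller never blocks on the slowest (or dead) node;
it acts on the first (n/2)+1 answers.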
> Meaning that if you
> assume that at least one node will always be down, then the minimum time
> to locate a node with the most recent copy of the file (and thus the
> minimum time to begin any read) is always the timeout attached to waiting
> for the ping reply.
There are ways around that. Flag a node as being out of the cluster when
quorum decides it is unresponsive, and fence it.
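As a rough illustration of that decision (the names and shape of the data
are invented for the example, not an existing interface):

    # A node is only treated as out of the cluster once a majority of its
    # peers report that it timed out; a fenced node is then skipped during
    # discovery, so lookups don't wait on its ping reply.
    def should_fence(peer_reports, cluster_size):
        """peer_reports maps peer -> True if that peer saw the suspect node time out."""
        quorum = cluster_size // 2 + 1
        failures = sum(1 for timed_out in peer_reports.values() if timed_out)
        return failures >= quorum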
> Having the entire quorum aware of which version of each file is the most
> recent and where to find the file avoids this problem, again, until just
> less than half the nodes become unreachable.
There should, in theory, be only one version of the file in the entire
cluster. If there isn't, then the AFR auto-heal should be invoked to see
to it that there is only one. The important thing is to know which nodes
have a copy of the file.
>>> I might optimize the expunge algorithm slightly by having nodes with
>>> low loads volunteer to copy files that otherwise couldn't be expunged
>>> from a node. Better yet, perhaps, would be a background process that
>>> runs on lightly loaded nodes and tries to create additional redundant
>>> copies at some configurable tolerance beyond the "minimum # of
>>> copies" threshold.
>>
>> Not just lightly loaded nodes, but more importantly, nodes with most
>> free space available. :)
>
> Yes, the algorithm to detect "loading" should probably consider as many
> resource constraints as appears practical.
Load, in terms of performance, is a non-critical optimization. Meeting the
space requirement, on the other hand, is mandatory. :)
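To put that in concrete terms, volunteer selection could treat free space
as a hard filter and load purely as a tie-breaker. A toy sketch, with the
node attributes assumed for illustration:

    # Meeting the space requirement is mandatory; load only ranks the survivors.
    def pick_volunteer(nodes, file_size):
        """nodes: list of dicts like {"name": ..., "free_bytes": ..., "load": ...}."""
        candidates = [n for n in nodes if n["free_bytes"] >= file_size]
        if not candidates:
            return None                                   # nobody can take the copy
        return min(candidates, key=lambda n: n["load"])   # lightest-loaded wins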
>>>> For file delta writes, an AFR type mechanism could be used to send
>>>> the deltas to all the nodes that have the file. This could all get
>>>> quite tricky, because it might require a separate multicast group to
>>>> be set up for up to every node combination subset, in order to keep
>>>> the network bandwidth down (or you'd just end up broadcasting to all
>>>> nodes, which means things wouldn't scale as switches should, it'd be
>>>> more like using hubs).
>>>>
>>>> This would potentially have the problem that there is only 24 bits
>>>> of IP multicast address space, but that should provide enough groups
>>>> with sensible redundancy levels to cover all node combinations. This
>>>> may or may not be way OTT complicated, though. There is probably a
>>>> simpler and more sane solution.
>>>
>>> I'm not sure what overhead is involved in creating multicast groups,
>>> but they would only be required for files currently locked for write,
>>> so perhaps creating and discarding the multicast groups could be done
>>> in conjunction with creation and release of write locks.
>>
>> Sure, these could be dynamic, but setup and teardown might cause
>> enough overhead that you might as well be broadcasting all the locks
>> and writes, and just expect the affected nodes to pick those out of
>> the air and act on them.
>>
>>> It's also possible that you could reduce the complexity of this
>>> problem by simply discarding copies, on write, down to as close to the
>>> minimum # as other nodes will allow. However, I think that
>>> might reduce some of the performance benefits this design otherwise
>>> gives each node.
>>
>> Also remember that the broadcasts or multicasts would only actually be
>> useful for locks and file discovery. The actual read file transfer
>> would be point-to-point and writes would be distributed to only the
>> subset of nodes that are currently caching the files.
>
> Read would be point-to-point (perhaps multi-point to point for implicit
> read striping across all known valid copies?), but it could still be
> useful to use multi-cast for write, especially if the redundant copies
> were behind a different switch than the node accepting the write. So
> multi-cast setup could happen when a server obtained a write lock, and
> teardown would be delayed until synchronization of redundant copies had
> completed.
Possibly, but if the set of possible node combinations can be enumerated
for a given number of nodes and minimum required redundancy, setting the
multicast groups up statically and using a hash-lookup would probably be
quicker, as it wouldn't require constant setups/teardowns.
We have 2^24 possible multicast "channels" (addresses).
The number of possible ways to pick k nodes out of n (files being
k-redundant) is:
    n! / (k! * (n-k)!)
Whether these constraints would allow for sufficiently large clusters, I
don't know.
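A quick back-of-the-envelope check (illustrative Python, nothing
GlusterFS-specific) of how big a cluster could get before statically
enumerated groups exhaust the 2^24 addresses:

    # Largest n for which every k-node subset can get its own static group.
    from math import comb

    ADDRESS_SPACE = 2 ** 24          # 16,777,216 possible group addresses

    def max_cluster_size(redundancy):
        n = redundancy
        while comb(n + 1, redundancy) <= ADDRESS_SPACE:
            n += 1
        return n

    # With 3-way redundancy, C(466, 3) ~= 16.76M still fits but C(467, 3)
    # doesn't, so the static enumeration tops out around 466 nodes for k=3,
    # and considerably sooner for higher redundancy levels.

So the scheme looks workable for clusters of up to a few hundred nodes at
typical redundancy levels, but it doesn't scale indefinitely.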
>> There would need to be special handling of a case where a node
>> accepting a big write is running out of space as a consequence and
>> something has to be dropped. Obviously, none of the currently open
>> files can be discarded, so there would need to be some kind of an
>> auxiliary process that would make a node request a "volunpeer" (pun
>> intended) to take over a file that it needs to flush out, if
>> discarding it would bring the redundancy below the required threshold.
>
> I think this could be worked into the normal expunge algorithm with a
> property like: "ANY request to expunge a file that reduces the file
> count below the redundancy threshold will ALWAYS generate a volunpeer IF
> at least one node exists with the disk space available".
Yes. Failing that, we could try the next LRU file.
> It wouldn't require any special casing - the needed space will always
> become available upon expunge if space for the migrating file exists
> anywhere on the network. If all the files are expunged, or they can't
> be even with this property of expunge, and the local disk still fills
> up, then I think it would be reasonable for the FS to return a disk full
> error.
Agreed.
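Roughly what I'm picturing for that expunge path, as a throwaway sketch -
all the helper names (find_volunteer, expunge_local and so on) are made up:

    # Walk the local cache in LRU order; a copy is only dropped once either
    # redundancy stays above the threshold or a "volunpeer" has accepted it.
    def free_space(local_files, bytes_needed, min_copies, find_volunteer):
        """local_files: LRU-ordered list of dicts like
        {"path": ..., "size": ..., "copies": ..., "open": ...}."""
        freed = 0
        for f in local_files:                  # least recently used first
            if freed >= bytes_needed:
                break
            if f["open"]:
                continue                       # never drop a currently open file
            if f["copies"] <= min_copies:
                volunteer = find_volunteer(f)  # a node with space volunteers
                if volunteer is None:
                    continue                   # can't migrate it - try next LRU file
                volunteer.copy(f)              # migrate before dropping locally
            expunge_local(f)
            freed += f["size"]
        return freed >= bytes_needed           # False -> caller returns "disk full"

    def expunge_local(f):
        """Placeholder for dropping the local cached copy."""
        pass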
Gordan