[Gluster-devel] Improving real world performance by moving files closer to their target workloads

Fri May 16 17:26:51 UTC 2008

gordan at bobich.net wrote:
>> I'm assuming that versioning and locking can and should be combined. 
>> You've admitted the necessity for keeping copies of files synchronized 
>> and IO is always going to require some sort of lock to accomplish 
>> this.  By having the quorum remain aware of what the most recent 
>> version of a given file is, whether that file is locked, and perhaps 
>> where copies of the file reside, you could reduce the number of nodes 
>> that must be consulted when a lock is needed.
> 
> True enough, but some care would need to be exercised to ensure that a 
> this doesn't lead to edge cases where a node thinks it still has a lock, 
> but all the other nodes have expired it (e.g. temporary network outage).

The DLM that GFS uses must already take this into account since it 
appears to work just fine, and the GPL'd code for that DLM was 
officially added to the Linux kernel with release 2.6.19, according to 
Wikipedia.  Not sure how portable that would be, but the source is 
available...

>> I think you will also speed things up if you don't have to consult all 
>> nodes for every IO operation.  If all available nodes must be 
>> consulted, then you introduce an implicit wait until a specified 
>> timeout for every IO request if any single node is down.  With the 
>> quorum model, even before fencing takes place, almost half the nodes 
>> can go incommunicado and the rest can operate as efficiently as they 
>> did with all nodes in service.
> 
> Indeed quorum of (n/2)+1 nodes should, in theory, suffice for safely 
> granting a lock, but it would probably mean that the locks should be 
> refreshed several times more often than the default lock TTL, just to 
> account for scope of packet loss. Releases of locks should, of course, 
> be explicitly notified to the cluster.

Yes.  Again, perhaps the DLM from RedHat's cluster project already 
solves this?

Just brainstorming quorum theory, but if scalability of the quorum model 
becomes an issue, perhaps the quorum could only be consulted to elect, 
discover, and verify a mini-quorum of nodes (perhaps elected partially 
based on the fact that they reside behind different switches).  Then 
once a node was aware of the identity of the mini-quorum, it would only 
have to cummunicate with a single node for locking and file discovery 
purposes.  This assumes that hosts in the mini-quorum would refuse to 
cooperate with a client if they could not confirm with the quorum that 
they were still part of the mini-quorum or couldn't contact the rest of 
the mini-quorum, of course, and would consult the quorum to elect a new 
mini-quorum when either of these was not the case.

>> If some HA and fault-tolerant DHT implementation exists that already 
>> handles atomic hash inserts with recognizable failures for keys that 
>> already exist, then perhaps that could take the place of DLM's quorum 
>> model, but I think any algorithm that requires contacting all nodes 
>> will prove to be a bad idea in the end.
> 
> Not all nodes - only the nodes that contain a certain file. A single 
> ping broadcast to find out who has a copy of the file should prove to be 
> of insignifficant bandwidth overheat compared to actual file transfers, 
> unless you are dealing with a lot of files that are signifficantly 
> smaller than a network packet.

My point was that, as I understood your algorithm, a client would not 
know which nodes contained a certain file until all nodes had been 
contacted.  So, while the actual bandwidth, even to consult thousands of 
nodes, might be small relative to file transfer bandwidth, the client 
can't assume it has a complete answer until it gets all the replies, 
meaning requests to downed nodes have timed out.  Meaning that if you 
assume that at least one node will always be down, then the minimum time 
to locate a node with the most recent copy of the file (and thus the 
minimum time to begin any read) is always the timeout attached waiting 
for the ping reply.

Having the entire quorum aware of which version of each file is the most 
recent and where to find the file avoids this problem, again, until just 
less than half the nodes become unreachable.

>> I might optimize the expunge algorithm slightly by having nodes with 
>> low loads volunteer to copy files that otherwise couldn't be expunged 
>> from a node.  Better yet, perhaps, would be a background process that 
>> runs on lightly loaded nodes and tries to create additional redundant 
>> copies at some configurable tolerance beyond the "minimum # of copies" 
>> threshold.
> 
> Not just lightly loaded nodes, but more importantly, nodes with most 
> free space available. :)

Yes, the algorithm to detect "loading" should probably consider as many 
resource constraints as appears practical.

>>> For file delta writes, an AFR type mechanism could be used to send 
>>> the deltas to all the nodes that have the file. This could all get 
>>> quite tricky, because it might require a separate multicast group to 
>>> be set up for up to every node combination subset, in order to keep 
>>> the network bandwidth down (or you'd just end up broadcasting to all 
>>> nodes, which means things wouldn't scale as switches should, it'd be 
>>> more like using hubs).
>>>
>>> This would potentially have the problem that there is only 24 bits of 
>>> IP multicast address space, but that should provide enough groups 
>>> with sensible redundancy levels to cover all node combinations. This 
>>> may or may not be way OTT complicated, though. There is probably a 
>>> simpler and more sane solution.
>>
>> I'm not sure what overhead is involved in creating multicast groups, 
>> but they would only be required for files currently locked for write, 
>> so perhaps creating and discarding the multicast groups could be done 
>> in conjunction with creation and release of write locks.
> 
> Sure, these could be dynamic, but setup and teardown might cause enough 
> overhead that you might as well be broadcasting all the locks and 
> writes, and just expect the affected nodes to pick those out of the air 
> and act on them.
> 
>> It's also possible that you could reduce the complexity of this 
>> problem by simply discarding as many copies down to as close to the 
>> minimum # as other nodes will allow, on write.  However, I think that 
>> might reduce some of the performance benefits this design otherwise 
>> gives each node.
> 
> Also remember that the broadcasts or multicasts would only actually be 
> useful for locks and file discovery. The actual read file transfer would 
> be point-to-point and writes would be distributed to only the subset of 
> nodes that are currently caching the files.

Read would be point-to-point (perhaps multi-point to point for implicit 
read striping across all known valid copies?), but it could still be 
useful to use multi-cast for write, especially if the redundant copies 
were behind a different switch than the node accepting the write.  So 
multi-cast setup could happen when a server obtained a write lock, and 
teardown would be delayed until synchronization of redundant copies had 
completed.

> There would need to be special handling of a case where a node accepting 
> a big write is running out of space as a consequence and something has 
> to be dropped. Obviously, none of the currently open files can be 
> discarded, so there would need to be some kind of an auxiliary process 
> that would make a node request a "volunpeer" (pun intended) to take over 
> a file that it needs to flush out, if discarting it would bring the 
> redundancy below the required threshold.

I think this could be worked into the normal expunge algorithm with a 
property like: "ANY request to expunge a file that reduces the file 
count below the redundancy threshold will ALWAYS generate a volunpeer IF 
at least one node exists with the disk space available".

It wouldn't require any special casing - the needed space will always 
become available upon expunge if space for the migrating file exists 
anywhere on the network.  If all the files are expunged, or they can't 
be even with this property of expunge, and the local disk still fills 
up, then I think it would be reasonable for the FS to return a disk full 
error.

Regards,

Derek
-- 
Derek R. Price
Solutions Architect
Ximbiot, LLC <http://ximbiot.com>
Get CVS and Subversion Support from Ximbiot!

v: +1 248.835.1260
f: +1 248.246.1176