[Gluster-devel] Improving real world performance by moving files closer to their target workloads

Jordan Mendler jmendler at ucla.edu
Fri May 16 17:49:35 UTC 2008


I didn't have a chance to read through this entire thread, so hopefully I am
not repeating things that have already been stated.

Your project seems very similar to what we will need to do. We are also
looking to build a large compute/storage cluster for our department (human
genetics, so lots of alignments done with grid-style computing), and I think
HSM/cache would come in handy, as already indicated. A lot of it depends on
your disk usage, but I was thinking of perhaps implementing a system where
(assuming the features get implemented by the time I need them):

1) Compute-nodes use NUFA to write to their local disks, which are set up as
a fast software RAID10 (using 4 drives per 1U node), with AFR
replication to another node that is close by.
2) When the compute-nodes start to fill up, HSM migrates data from the
compute-nodes to a separate gluster pool that uses slower disks (RAID6 --
16x1TB in 3U), but with AFR and/or striping depending on your needs, as well
as a compression filesystem (or at the gluster level if it gets implemented
soon). Deduplication would also come in handy here whenever that gets
implemented.
3) If needed (depending on your type of data), compute-nodes also have a
fast caching layer for files that are read from the backend storage -- or as
stated previously, some kind of probabilistic HSM sent back to the nodes.

It largely depends on your access patterns, but we produce a ton of data
that is write-once archival: 95% of it is never accessed again, and the
other 5% is accessed rarely enough that latency is not a big factor -- so we
would only really need 1 and 2.
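A watermark-style migration policy along the lines of 1) and 2) might look
something like the sketch below. All thresholds, paths and helper names here
are illustrative assumptions, not GlusterFS APIs -- the real HSM layer would
presumably live inside a translator rather than a cron script like this.

```python
import os
import shutil

# Hypothetical thresholds and mount points -- illustrative only.
HIGH_WATERMARK = 0.85   # start migrating when the fast tier is 85% full
LOW_WATERMARK = 0.70    # stop once usage drops back to 70%
FAST_TIER = "/gluster/local"      # NUFA-backed fast RAID10 volume
ARCHIVE_TIER = "/gluster/archive" # slower RAID6 pool

def disk_usage_fraction(path):
    """Fraction of the filesystem holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total

def oldest_files(root):
    """Return files under `root`, least recently accessed first."""
    paths = []
    for dirpath, _, names in os.walk(root):
        for name in names:
            paths.append(os.path.join(dirpath, name))
    return sorted(paths, key=lambda p: os.stat(p).st_atime)

def migrate_cold_data():
    """Move least-recently-accessed files to the archive tier until
    usage falls back below the low watermark."""
    if disk_usage_fraction(FAST_TIER) < HIGH_WATERMARK:
        return
    for path in oldest_files(FAST_TIER):
        dest = os.path.join(ARCHIVE_TIER, os.path.relpath(path, FAST_TIER))
        os.makedirs(os.path.dirname(dest), exist_ok=True)
        shutil.move(path, dest)
        if disk_usage_fraction(FAST_TIER) <= LOW_WATERMARK:
            break
```

For write-once archival data like ours, atime-ordered eviction is probably
good enough; a workload with re-reads would want the caching layer from 3).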

Jordan

On Fri, May 16, 2008 at 9:14 AM, <gordan at bobich.net> wrote:

> On Fri, 16 May 2008, Derek Price wrote:
>
>  I mostly agree with you.  A few additional points are inlined below.
>>
>> gordan at bobich.net wrote:
>>
>>> On Fri, 16 May 2008, Derek Price wrote:
>>>
>>>  gordan at bobich.net wrote:
>>>>
>>>>> Isn't that effectively the same thing? Unless there is quorum, DLM
>>>>> locks out the entire FS (it also does this when a node dies, until it gets
>>>>> definitive confirmation that it has been successfully fenced). For normal
>>>>> file I/O all nodes in the cluster have to acknowledge a lock before it can
>>>>> be granted.
>>>>>
>>>>
>>>> Why?  It requires a meta-data cache, but as long as every node in the
>>>> quorum stores a given file's most recent revision # when any lock is
>>>> granted, even if it doesn't actually sync the file data, then any quorum
>>>> should be able to agree on what the version number of the most up-to-date
>>>> copy of a file is. All nodes are required to report only if you assume that
>>>> any given file has a small number of "owners" and that the querier doesn't
>>>> know who the owner is.
>>>>
>>>
>>> That's to do with file versioning, not locking, though. What am I
>>> missing?
>>>
>>
>> I'm assuming that versioning and locking can and should be combined.
>> You've admitted the necessity for keeping copies of files synchronized and
>> IO is always going to require some sort of lock to accomplish this.  By
>> having the quorum remain aware of what the most recent version of a given
>> file is, whether that file is locked, and perhaps where copies of the file
>> reside, you could reduce the number of nodes that must be consulted when a
>> lock is needed.
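The idea of every quorum member remembering the latest version number on each
lock grant can be reduced to a small toy model. The overlap of any two
majorities guarantees that a majority query always reaches at least one node
that saw the most recent grant. All names below are illustrative, not part of
any real DLM:

```python
# Toy model: every node records the latest version number it has seen
# for a file whenever a write lock is granted, even if the file data
# itself is never synced to it. Any majority of nodes can then agree
# on the most recent version without contacting everyone.

class Node:
    def __init__(self):
        self.versions = {}   # filename -> latest version number seen

    def note_lock_granted(self, filename, version):
        # Called on every member of the granting quorum.
        self.versions[filename] = max(self.versions.get(filename, 0), version)

    def report(self, filename):
        return self.versions.get(filename, 0)

def quorum_latest_version(nodes, filename):
    """Ask a majority of nodes and take the highest version reported.
    Because every past grant reached a majority, and any two majorities
    intersect, some responder must know the true latest version."""
    quorum = nodes[: len(nodes) // 2 + 1]   # any majority would do
    return max(node.report(filename) for node in quorum)
```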
>>
>
> True enough, but some care would need to be exercised to ensure that this
> doesn't lead to edge cases where a node thinks it still has a lock, but all
> the other nodes have expired it (e.g. temporary network outage).
>
>  I think you will also speed things up if you don't have to consult all
>> nodes for every IO operation.  If all available nodes must be consulted,
>> then you introduce an implicit wait until a specified timeout for every IO
>> request if any single node is down.  With the quorum model, even before
>> fencing takes place, almost half the nodes can go incommunicado and the rest
>> can operate as efficiently as they did with all nodes in service.
>>
>
> Indeed, a quorum of (n/2)+1 nodes should, in theory, suffice for safely
> granting a lock, but it would probably mean that the locks should be
> refreshed several times more often than the default lock TTL, just to
> account for the possibility of packet loss. Releases of locks should, of course, be
> explicitly notified to the cluster.
>
>  If some HA and fault-tolerant DHT implementation exists that already
>> handles atomic hash inserts with recognizable failures for keys that already
>> exist, then perhaps that could take the place of DLM's quorum model, but I
>> think any algorithm that requires contacting all nodes will prove to be a
>> bad idea in the end.
>>
>
> Not all nodes - only the nodes that contain a certain file. A single ping
> broadcast to find out who has a copy of the file should prove to be of
> insignificant bandwidth overhead compared to actual file transfers, unless
> you are dealing with a lot of files that are significantly smaller than a
> network packet.
>
>  To remain fault tolerant, this requires that servers make some effort to
>>>> stay up-to-date with the meta-data cache, but maybe this could be dealt with
>>>> efficiently with the DHT someone else brought up?
>>>>
>>>
>>> I'm not sure that so much metadata caching is actually necessary. If a
>>> file open brings the file to the local machine (this cannot be guaranteed
>>> because the local machine may be out of space, and it may be unable to free
>>> space by expunging an old file due to that file not being redundant enough
>>> in the network), then the metadata of that file, being attached to the file,
>>> is implicitly "cached". But this isn't really caching at all - it's
>>> migration.
>>>
>>> The algorithm for opening a file might be as follows:
>>> 1) node broadcasts/multicasts an open request to all peers
>>> 2) peers that have the file available respond with the metadata (size,
>>> version, etc) they have and possibly their current load (to assist with load
>>> balancing by fetching the file from the least loaded peer)
>>> 3.1) if the file is available locally, agree a lock with other nodes, and
>>> use it.
>>> 3.2) if the file is not available locally, but there is enough space,
>>> fetch it and do 3.1)
>>> 3.3) if there isn't enough space locally to fetch the file, see if enough
>>> space can be freed. If this succeeds, do 3.2)
>>> 3.4) if space cannot be freed, use the file remotely from the least
>>> loaded peer.
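Steps 1) through 3.4) above can be rendered as a small self-contained sketch.
Peers and the local store are plain objects, and locking and the actual
network layer are elided; everything here is illustrative, not a GlusterFS
interface:

```python
# Toy rendering of the open algorithm: 1) broadcast, 2) holders reply
# with metadata and load, 3.x) use locally, migrate, or fall back to
# remote access from the least loaded peer.

class Peer:
    def __init__(self, name, load, files):
        self.name = name
        self.load = load
        self.files = files            # filename -> contents

class LocalStore:
    def __init__(self, capacity):
        self.capacity = capacity
        self.files = {}               # filename -> contents

    def free(self):
        return self.capacity - sum(len(v) for v in self.files.values())

def open_file(name, local, peers):
    # 3.1) available locally: agree a lock with other holders, use it
    if name in local.files:
        return ("local", local.files[name])

    # 1)+2) broadcast an open request; holders reply with metadata/load
    holders = [p for p in peers if name in p.files]
    if not holders:
        raise FileNotFoundError(name)
    best = min(holders, key=lambda p: p.load)   # least loaded peer
    size = len(best.files[name])

    # 3.3) try to free space by expunging (redundancy check elided here)
    while local.free() < size and local.files:
        local.files.pop(next(iter(local.files)))

    # 3.2) enough space: fetch the file and use it locally
    if local.free() >= size:
        local.files[name] = best.files[name]
        return ("migrated", local.files[name])

    # 3.4) cannot make room: use the file remotely
    return ("remote:" + best.name, best.files[name])
```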
>>>
>>> Expunging algorithm would be similar - broadcast a file status request
>>> (similar to 1) above). If enough nodes respond with the latest version of
>>> the file (set some threshold depending on how much redundancy is required),
>>> the local file can be removed and the space freed for a file that is more
>>> useful locally. This shouldn't really happen until the local data store
>>> starts to get full.
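The expunge rule reduces to counting confirmations. A minimal sketch, in
which the broadcast status request is collapsed into a scan over peer
replies (names and threshold are assumptions):

```python
# Only drop the local copy once enough peers confirm they hold the
# latest version of the file. Peers are modelled as dicts mapping
# filename -> version held; the redundancy threshold is illustrative.

MIN_REDUNDANCY = 2   # assumed: up-to-date replicas that must survive

def can_expunge(name, version, peers, min_copies=MIN_REDUNDANCY):
    """Broadcast a file status request and count peers reporting the
    latest version; expunging is safe only if at least `min_copies`
    up-to-date replicas would remain elsewhere."""
    confirmations = sum(1 for p in peers if p.get(name) == version)
    return confirmations >= min_copies
```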
>>>
>>
>> I might optimize the expunge algorithm slightly by having nodes with low
>> loads volunteer to copy files that otherwise couldn't be expunged from a
>> node.  Better yet, perhaps, would be a background process that runs on
>> lightly loaded nodes and tries to create additional redundant copies at some
>> configurable tolerance beyond the "minimum # of copies" threshold.
>>
>
> Not just lightly loaded nodes, but more importantly, nodes with most free
> space available. :)
>
>  If copies beyond the minimum are only created on file access, then a
>> heavily loaded node could quickly fill up its own disk with all the
>> "redundant" copies of files and have to start relying on remote access,
>> further bogging down the busy node.
>>
>
> Agreed.
>
>  Locking could be handled somewhat lazily - a lock request gets broadcast
>>> and as long as quorum peers respond, and there are no peers saying "no, I
>>> have that lock!", the lock can be granted. A lock can have TTL (in case a
>>> node dies while holding a lock), and the refresh should be expected if the
>>> node expects to keep the lock. This could be used to speed up locking (each
>>> node would have a list of currently valid locks, without the need to check
>>> explicitly, for example - it would only need to broadcast a lock-request
>>> when it looks like the lock can be granted).
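The lazy lock with a TTL might be sketched as below. Each node keeps its own
table of currently valid locks; a request is granted when the responding
peers raise no objection, and a lock that is never refreshed simply expires.
This is a toy, not a DLM implementation, and the TTL value is an assumption:

```python
import time

LOCK_TTL = 30.0   # seconds; assumed default, refresh well before expiry

class LockTable:
    """One node's local view of currently valid locks."""
    def __init__(self):
        self.locks = {}   # filename -> (holder, expiry time)

    def holds_conflicting(self, name, requester, now=None):
        now = time.time() if now is None else now
        entry = self.locks.get(name)
        if entry is None:
            return False
        holder, expiry = entry
        # Only an unexpired lock held by someone else is an objection.
        return holder != requester and expiry > now

    def record(self, name, holder, now=None, ttl=LOCK_TTL):
        now = time.time() if now is None else now
        self.locks[name] = (holder, now + ttl)   # refresh = re-record

def request_lock(name, requester, peers, now=None):
    """Broadcast a lock request; grant if no responding peer objects.
    (A real implementation would also insist on a quorum of replies;
    here all peers are assumed to have responded.)"""
    now = time.time() if now is None else now
    if any(p.holds_conflicting(name, requester, now) for p in peers):
        return False
    for p in peers:
        p.record(name, requester, now)
    return True
```

Refreshing is just re-requesting before the TTL runs out; a node that dies
while holding a lock stops refreshing, and the lock lapses on its own.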
>>>
>>> For file delta writes, an AFR type mechanism could be used to send the
>>> deltas to all the nodes that have the file. This could all get quite tricky,
>>> because it might require a separate multicast group to be set up for up to
>>> every node combination subset, in order to keep the network bandwidth down
>>> (or you'd just end up broadcasting to all nodes, which means things wouldn't
>>> scale as switches should, it'd be more like using hubs).
>>>
>>> This would potentially have the problem that there is only 24 bits of IP
>>> multicast address space, but that should provide enough groups with sensible
>>> redundancy levels to cover all node combinations. This may or may not be way
>>> OTT complicated, though. There is probably a simpler and more sane solution.
>>>
>>
>> I'm not sure what overhead is involved in creating multicast groups, but
>> they would only be required for files currently locked for write, so perhaps
>> creating and discarding the multicast groups could be done in conjunction
>> with creation and release of write locks.
>>
>
> Sure, these could be dynamic, but setup and teardown might cause enough
> overhead that you might as well be broadcasting all the locks and writes,
> and just expect the affected nodes to pick those out of the air and act on
> them.
>
>  It's also possible that you could reduce the complexity of this problem by
>> simply discarding as many copies down to as close to the minimum # as other
>> nodes will allow, on write.  However, I think that might reduce some of the
>> performance benefits this design otherwise gives each node.
>>
>
> Also remember that the broadcasts or multicasts would only actually be
> useful for locks and file discovery. The actual read file transfer would be
> point-to-point and writes would be distributed to only the subset of nodes
> that are currently caching the files.
>
> There would need to be special handling of a case where a node accepting a
> big write is running out of space as a consequence and something has to be
> dropped. Obviously, none of the currently open files can be discarded, so
> there would need to be some kind of an auxiliary process that would make a
> node request a "volunpeer" (pun intended) to take over a file that it needs
> to flush out, if discarding it would bring the redundancy below the required
> threshold.
>
>  Perhaps there are some useful ideas on how to perform this complex
>> synchronization already in the design of P2P file transfer networks? What
>> would that be, something like implicit striping based on the locations of
>> valid redundant copies/deltas?
>>
>
> Freenet and Entropy do something similar, but with fewer constraints. They
> store files in a DHT, route by hash, expunge LRU, and cache probabilistically.
> However, in that kind of an environment you cannot sensibly enforce a
> minimum redundancy level. Least used files will eventually fall off the
> network as the frequently used files get cached by nodes.
>
> Gordan
>
>
>
> _______________________________________________
> Gluster-devel mailing list
> Gluster-devel at nongnu.org
> http://lists.nongnu.org/mailman/listinfo/gluster-devel
>


