[Gluster-devel] Improving real world performance by moving files closer to their target workloads

Derek Price derek at ximbiot.com
Mon May 19 20:58:22 UTC 2008


I agree with Gordan that centralizing the metadata store is a bad idea.
Why introduce a single point of failure?

Also, I believe the quorum locking idea Gordan and I were discussing is,
in fact, dependent on all nodes attempting to stay up to date with the
complete file and directory structure, including the latest version
number of each file and directory and probably which nodes have current
copies (what I have been calling the "metadata").  If all nodes do not
attempt to keep up, then the quorum's decision cannot be trusted, as
Luke pointed out.
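
Concretely, the per-entry record I have in mind is small; something like
the sketch below (field names and sizes are invented purely to
illustrate, nothing here is actual GlusterFS code):

    #include <limits.h>   /* PATH_MAX */
    #include <stdint.h>

    #define MAX_REPLICAS 8    /* arbitrary, just for the sketch */

    /* Hypothetical per-file/per-directory record each node would try
     * to keep current. */
    struct entry_meta {
            char     path[PATH_MAX];         /* file or directory path      */
            uint64_t version;                /* latest known version number */
            uint32_t replica_count;          /* how many nodes hold a copy  */
            uint32_t replicas[MAX_REPLICAS]; /* node IDs with current copies */
            uint32_t lock_holder;            /* node holding the write lock,
                                                0 if none */
    };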

If all nodes do attempt to stay up to date with this information, then
if the node accepting a write goes incommunicado, the quorum can simply
and effectively roll back the transaction by revoking that node's lock
and rolling back its idea of the current version number of the affected
file or directory.
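
In rough pseudo-C, using the record sketched above (and assuming the
version number was bumped when the lock was granted, which is just my
assumption for the example):

    /* Sketch of the rollback the quorum would apply when the node
     * holding the write lock disappears before its write replicates. */
    void quorum_rollback(struct entry_meta *e, uint32_t dead_node)
    {
            if (e->lock_holder == dead_node) {
                    e->lock_holder = 0;  /* revoke the dead node's lock      */
                    e->version -= 1;     /* drop the uncommitted version bump */
            }
    }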

Derek


Gordan Bobic wrote:
> Luke McGregor wrote:
> 
>> Firstly, I'm starting to think that maybe the best way forward is to
>> somehow centralise the metadata store (be that by distributing it or
>> by having it handled by a dedicated metadata server, I'm not too sure).
> 
> If you _REALLY_ want to do it that way (and personally, I'm not at all 
> convinced), look at Lustre.
> 
>> The reason I'm thinking this may be the best way forward is that I
>> think any quorum-based approach will not be able to guarantee that any
>> write is successful. If a write occurs and the network is queried for
>> a quorum before the node with the latest copy has a chance to
>> replicate its new data, then there is too much room for a quorum being
>> reached without needing the approval of the node with the latest copy.
> 
> Not true. If you have a quorum of the majority of the nodes (50% + 1)
> for every write, then the node with the latest copy would have had
> quorum to begin with.
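> 
> Back of the envelope, the check is just something like this (purely
> illustrative; where the node count comes from is up to the membership
> layer):
> 
>    /* Majority quorum: strictly more than half of all nodes must ack. */
>    static int have_quorum(unsigned int acks, unsigned int total_nodes)
>    {
>            return acks >= (total_nodes / 2) + 1;
>    }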
> 
>> This could cause some serious problems, especially on a heavily
>> accessed file.
> 
> No, you'd just have to have explicit file locking, like with any other
> cluster FS. (Note: GlusterFS doesn't do implicit locking at the moment.
> It will soon do so for O_APPEND access, to guarantee that writes in
> that mode are atomic.)
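> 
> From the application's side, explicit locking would just be the usual
> POSIX advisory lock taken before writing, something along these lines:
> 
>    #include <fcntl.h>
>
>    /* Take an exclusive (write) advisory lock on the whole file,
>       blocking until it is granted. */
>    int lock_whole_file(int fd)
>    {
>            struct flock fl = {
>                    .l_type   = F_WRLCK,
>                    .l_whence = SEEK_SET,
>                    .l_start  = 0,
>                    .l_len    = 0,      /* 0 == lock to end of file */
>            };
>
>            return fcntl(fd, F_SETLKW, &fl);
>    }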
> 
>> The problem would, I believe, be worsened, as the nodes hosting any
>> heavily accessed file are the most likely not to respond quickly to
>> any kind of multicast.
> 
> I think it's a non-issue. If you always need a majority response, then
> without that majority response those heavily accessed nodes wouldn't be
> able to get a lock themselves, either.
> 
>> Furthermore, in the interest of performance you would want to act as
>> soon as a quorum was reached. This would effectively mean that the
>> nodes which made the decision on the lock would be the most lightly
>> accessed nodes, with the more heavily accessed nodes responding in
>> slightly more time. I believe this would mean that in order to be sure
>> of a lock you would need to get a consensus from all nodes. This would
>> be impractical.
> 
> As I said, it's a level playing field - heavily accessed nodes need to
> get a lock by quorum just like any other node.
> 
> Having said that, I think you'd need quorum not only from a majority of
> all nodes (that would work if you have a shared fs, but this is a
> _replicated_ fs) but from a majority of the nodes that have a copy of
> the file.
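> 
> In other words, the same majority test as before, just over the file's
> replica set rather than the whole cluster (again, only a sketch):
> 
>    /* Majority of the nodes that actually hold a copy of the file. */
>    static int have_replica_quorum(unsigned int acks,
>                                   unsigned int replica_count)
>    {
>            return acks >= (replica_count / 2) + 1;
>    }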
> 
>> Also, looking for a file to delete to free space would be a really
>> inefficient process, as a single request for space would potentially
>> mean querying the network for redundancy information on every other
>> file stored. This would not be practical.
> 
> It depends on the size of your files. A one-packet broadcast is a
> pretty decent trade-off for gaining 1 GB of space, and pretty
> inefficient for a 1-byte file. But there isn't really a way around it.
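> 
> (Back of the envelope: a single broadcast frame is on the order of
> 1.5 KB, which is noise next to 1 GB reclaimed but some 1500 times the
> size of a 1-byte file.)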
> 
>> I think a centralised metadata system would eliminate these problems,
>> as it would be authoritative. A write shouldn't succeed without the
>> central metadata being updated, and a lock shouldn't be granted
>> without the central metadata allowing it. This would also mean that
>> old files would be invalidated by the system centrally, with a side
>> effect of allowing an easy rollback mechanism until those files were
>> deleted (if anybody ever wanted that feature). It would also mean that
>> freeing up space would be a relatively simple operation.
> 
> Which has the downside that your metadata store is not as 
> redundant/distributed as your data. IMO, having data and metadata 
> distributed around equally is one of GlusterFS's major advantages over 
> Lustre.
> 
>> Just as a side note, I think your RAID6 set will outperform your
>> RAID10 set. Even though RAID6 is slower than RAID10 and you may be
>> using slower drives, you still have 16 drives in the set, which I
>> believe will actually give you faster performance than any 4-drive
>> configuration.
> 
> Whether RAID6 is "slow" is entirely down to your RAID controller. If it
> can calculate the Reed-Solomon codes faster than the disks can keep up,
> then of course the total speed of the disks will be the only factor
> affecting throughput, and in that case 16 disks will beat 4 disks any
> day.
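> 
> Back of the envelope, assuming something like 75 MB/s of sequential
> throughput per spindle and ideal full-stripe writes (both assumptions,
> not measurements):
> 
>    16-drive RAID6 : 14 data spindles x 75 MB/s ~= 1050 MB/s streaming
>     4-drive RAID10:  2 data spindles x 75 MB/s ~=  150 MB/s streaming
>                      writes (reads can hit all 4 spindles, ~300 MB/s)
> 
> so as long as the controller keeps up with the parity maths, the wider
> set wins comfortably.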
> 
> Gordan
> 
> 
> _______________________________________________
> Gluster-devel mailing list
> Gluster-devel at nongnu.org
> http://lists.nongnu.org/mailman/listinfo/gluster-devel


-- 
Derek R. Price
Solutions Architect
Ximbiot, LLC <http://ximbiot.com>
Get CVS and Subversion Support from Ximbiot!

v: +1 248.835.1260
f: +1 248.246.1176




