[Gluster-devel] GlusterFS Spare Bricks?

Gordan Bobic gordan at bobich.net
Tue Apr 10 18:50:14 UTC 2012


> On 10/04/2012 09:39, 7220022 wrote:
>> Are there plans to add provisioning of spare bricks in a replicated
>> (or
>> distributed-replicated) configuration? E.g., when a brick in a mirror
>> set dies, the system rebuilds it automatically on a spare, similar to
>> how it's done by RAID controllers.
>>
>> Not only would it improve practical reliability, especially of
>> large clusters, but it would also make it possible to build
>> better-performing clusters from less expensive components. For example,
>> instead of having slow RAID5 bricks on expensive RAID controllers, one
>> uses cheap HBA-s and stripes a few disks per brick in RAID0 - that's
>> faster for writes than RAID 5/6 by an order of magnitude (and, by the
>> way, should improve the rebuild times in Gluster that many are
>> complaining about). A failure of one such striped brick is not
>> catastrophic in a mirrored Gluster - but it's better to have spare
>> bricks standing by, strewn across cluster heads.
>>
>> A more advanced setup at the hardware level involves creating "hybrid
>> disks", whereby HDD vdisks are cached by enterprise-class SSD-s. It
>> works beautifully and makes HDD-s amazingly fast for random
>> transactions. The technology has become widely available on many $500
>> COTS controllers. However, it is not widely known that the results with
>> HDD-s in RAID0 under SSD cache are 10 to 20 (!!) times better than
>> with RAID 5 or 6.
>
> On reads the difference should be negligible unless the array is degraded.
> If it's not, your RAID controller is unfit for purpose.
>
> [AS] I refer to random IOPS in the 70K to 200K range on vdisks in RAID 0
> vs. 5 behind a large SSD cache.

But are they read or read-write IOPS? RAID5/6 is going to hammer you on 
random writes because of the RMW overheads, unless your SSD is being 
used for write-behind caching all the writes (which could be deemed 
dangerous).
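To put rough numbers on that RMW overhead: a single random small write on RAID5 costs four disk I/Os (read old data, read old parity, write both back), six on RAID6, and one on RAID0. A back-of-the-envelope sketch, with purely illustrative figures (the 200 IOPS per spindle and 8-disk array are assumptions, not measurements):

```python
# Back-of-the-envelope random-write IOPS for an 8-disk array.
# Per-spindle IOPS and disk count are illustrative assumptions.
SPINDLE_IOPS = 200   # assumed random IOPS per HDD
DISKS = 8

def random_write_iops(disks, penalty):
    """Aggregate random-write IOPS given a per-write I/O penalty."""
    return disks * SPINDLE_IOPS // penalty

raid0 = random_write_iops(DISKS, 1)   # one I/O per write
raid5 = random_write_iops(DISKS, 4)   # read data, read parity, write both
raid6 = random_write_iops(DISKS, 6)   # two parity blocks to update

print(raid0, raid5, raid6)  # 1600 400 266
```

The 4x-6x gap is what a write-behind SSD cache hides - until the destage rate to the parity array becomes the bottleneck.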

> Behavior of such "hybrid vdisks" is different from
> pure SSD or HDD-based ones. Unlike that of the DDR RAM cache, the total R+W
> bandwidth in MB/s of an SSD is limited at the level of its max read-only
> performance. Hence the front-end read performance is degraded by the
> (sequential) write load that fills the cache upstream of the HDD-s. And
> vice versa, the write performance of the hybrid is degraded by the slow
> write speed of a RAID 5/6 array behind the cache - especially at larger
> queue depths.
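The contention being described can be sketched as a simple shared budget: if the SSD's total throughput is capped near its read-only ceiling, every MB/s of cache-fill write traffic comes out of the read budget. A toy model with hypothetical round numbers:

```python
# Toy model of the shared bandwidth budget on a caching SSD.
# The 500 MB/s figure is a hypothetical round number, not a measurement.
SSD_BUDGET_MBPS = 500  # total R+W throughput, roughly the max read-only rate

def effective_read_mbps(write_load_mbps):
    """Front-end read bandwidth left after cache writes take their share."""
    return max(0, SSD_BUDGET_MBPS - write_load_mbps)

print(effective_read_mbps(0))    # 500: no write traffic, full read rate
print(effective_read_mbps(200))  # 300: each MB/s written costs a MB/s read
```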

I'm not sure I grok what you are saying (if you are saying what I think 
you are saying). Surely any sane performance-oriented setup would be 
write-behind caching on the SSD (and then flushing it out to the RAID 
array when there is some idle time).

Have you looked at flashcache? It's not as advanced as ZFS' L2ARC, but 
if for whatever reason ZFS isn't an option for you, it's still an 
improvement.
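The write-behind pattern under discussion is simple in outline: acknowledge writes as soon as they land in the fast tier, and destage to the slow array later. A minimal illustrative sketch of the idea (not flashcache's or L2ARC's actual design; the class and tiers here are hypothetical stand-ins):

```python
# Minimal write-behind cache sketch: writes land in the fast tier
# and are destaged to the slow tier later (e.g. at idle time).
# Dicts stand in for the SSD and the backing RAID array.
class WriteBehindCache:
    def __init__(self, backing):
        self.backing = backing   # slow tier (the RAID array)
        self.dirty = {}          # fast tier: blocks not yet destaged

    def write(self, block, data):
        self.dirty[block] = data          # ack immediately at SSD speed

    def read(self, block):
        # Serve from the fast tier first, fall back to the array.
        return self.dirty.get(block, self.backing.get(block))

    def flush(self):
        self.backing.update(self.dirty)   # destage during idle time
        self.dirty.clear()

array = {}
cache = WriteBehindCache(array)
cache.write(7, "hello")
print(cache.read(7))   # hello (readable before destage)
cache.flush()
print(array[7])        # hello (now on the slow tier)
```

The "dangerous" caveat above is visible in the sketch: anything in `dirty` is lost if the fast tier dies before `flush()` runs, which is why write-behind wants battery or flash-backed cache.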

> These limitations, when superposed with most "real-world" test
> patterns, leave the array only marginally better for both writes and
> reads than an HDD-based RAID10 with the same number of drives. Not quite
> sure why, but it's removing the write-speed limit of the HDD-s by
> changing the RAID level from 5 to 0 that clears the bottleneck.

If that is the case, then clearly your SSD isn't being used for write 
caching which removes most of the benefit you are going to get from it. 
See under RAID controller being unfit for purpose. :)

> The relative difference
> gets much higher for both reads and writes than the write performance gap
> between pure HDD RAID 0 and 5 vdisks.

I can only guess that this is an artifact of something anti-clever that 
the RAID controller is doing. I gave up on hardware RAID controllers over 
a decade ago for similar reasons.

> Having said that, a lot of RAID controllers are pretty useless.
>
> [AS] the newer LSI 2208-based ones seem okay and recent firmware/drivers
> finally stable.

I'm not convinced. I have some LSI cards in several of my boxes and they 
very consistently drop disks that are running SMART tests in the 
background. I have yet to find a firmware or driver version that 
corrects this. There is a RHBZ ticket open for this somewhere but I 
can't seem to find it at the moment.

> But I agree: we always leave out RAID features apart from
> stripe or mirror and do everything by software.  Advanced features
> (FastPath, CacheCade) though are fantastic if you use SSD-s, either
> standalone or as HDD cache.  In fact we use controllers instead of simple
> HBA-s only to take advantage of these features.

Your own experience with the performance you mention above shows that 
they aren't that great in a lot of cases. I've found software RAID to be 
faster than hardware RAID since before the turn of the century, and ZFS 
cuts off a few more corners.

>> There is no way to use RAID0 in commercial storage, the main reason
>> being the absence of hot-spares. If, on the other hand, the spares are
>> handled by Gluster in the form of pre-fabricated (cached hardware-RAID0)
>> bricks, both very good performance and reasonably sufficient redundancy
>> should be easily achieved.
>
> So why not use ZFS instead? The write performance is significantly better
> than traditional RAID equivalents and you get vastly more flexibility than
> with any hardware RAID solution. And it supports caching data onto SSDs.

> [AS] Good point. We have no experience with it yet, but we should try.
> Do you know if it can be made distributed/"parallel" like Gluster, and
> whether it supports RDMA transport for storage traffic between heads?

In a word - no. I was referring to using ZFS as the backing FS for GLFS.

> The main reason we've been
> looking into Gluster is cheap bandwidth: all our servers and nodes are
> connected via a 40Gbit IB fabric, 2 ports per server, 4 on some larger
> ones, non-blocking edge switches, directors at floor level, etc. - 80 to
> 90% idle.
> Can you make global spares in ZFS?

No, ZFS is a single-node FS. It can replace your RAID + local FS stack, 
but you would still need to use GLFS on top to get the multi-node 
distributed features.

Gordan



