[Gluster-devel] GlusterFS Spare Bricks?

Tue Apr 10 22:09:33 UTC 2012

-----Original Message-----
From: gluster-devel-bounces+7220022=gmail.com at nongnu.org
[mailto:gluster-devel-bounces+7220022=gmail.com at nongnu.org] On Behalf Of
Gordan Bobic
Sent: Tuesday, April 10, 2012 10:50 PM
To: gluster-devel at nongnu.org
Subject: Re: [Gluster-devel] GlusterFS Spare Bricks?

> On 10/04/2012 09:39, 7220022 wrote:
>> Are there plans to add provisioning of spare bricks in a replicated 
>> (or
>> distributed-replicated) configuration? E.g., when a brick in a mirror 
>> set dies, the system rebuilds it automatically on a spare, similar to 
>> how it's done by RAID controllers.
>>
>> Not only would it improve the practical reliability, especially of 
>> large clusters, but it'd also make it possible to make 
>> better-performing clusters off less expensive components. For 
>> example, instead of having slow RAID5 bricks on expensive RAID 
>> controllers one uses cheap HBA-s and stripes a few disks per brick in 
>> RAID0 - that's faster for writes than RAID 5/6 by an order of 
>> magnitude (and, by the way, should improve the rebuild times in Gluster 
>> that many are complaining about). A failure of one such striped brick is 
>> not catastrophic in a mirrored Gluster - but it's better to have 
>> spare bricks standing by, strewn across cluster heads.
>>
>> A more advanced setup at a hardware level involves creating "hybrid 
>> disks" where HDD vdisks are cached by enterprise-class SSD-s. It 
>> works beautifully and makes HDD-s amazingly fast for random 
>> transactions. The technology's become widely available on many $500 
>> COTS controllers. However, it is not widely known that the results 
>> with HDD-s in RAID0 under SSD cache are 10 to 20 (!!) times better 
>> than with RAID 5 or 6.
>
> On reads the difference should be negligible unless the array is degraded.
> If it's not, your RAID controller is unfit for purpose.
>
> [AS] I refer to random IOPS in 70K to 200K range  on vdisks in RAID 0 
> vs. 5 behind large SSD cache.

But are they read or read-write IOPS? RAID5/6 is going to hammer you on
random writes because of the RMW overheads, unless your SSD is being used
for write-behind caching all the writes (which could be deemed dangerous).
[AS] 70K is write, 200K read.  We use SLC SSD-s only and none of about 150
units failed or degraded so far (2 years, 3 for some).
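
To put rough numbers on the RMW overhead mentioned above, here is a minimal back-of-envelope sketch. The write penalties are the usual textbook values for small random writes; the disk count and per-disk IOPS are illustrative assumptions, not measurements from this thread. It ignores the controller and SSD cache entirely, so it only shows the uncached gap between RAID levels and will not reproduce the 10 to 20 times figure quoted earlier.

```python
# Back-of-envelope model of uncached random-write IOPS per RAID level.
# Penalties are textbook values; disk count and per-disk IOPS are assumptions.

WRITE_PENALTY = {
    "raid0": 1,   # one disk write per logical write
    "raid10": 2,  # write goes to both halves of a mirror
    "raid5": 4,   # read data + read parity + write data + write parity
    "raid6": 6,   # as RAID5, with a second parity block to read and write
}


def random_write_iops(level: str, disks: int, iops_per_disk: float) -> float:
    """Approximate random-write IOPS of an array with no cache in front."""
    return disks * iops_per_disk / WRITE_PENALTY[level]


if __name__ == "__main__":
    for level in WRITE_PENALTY:
        print(level, random_write_iops(level, disks=8, iops_per_disk=150))
```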

> Behavior of such "hybrid vdisks" is different from pure SSD or 
> HDD-based ones.  Unlike that of the DDR RAM cache, the total R+W 
> bandwidth in MB/s of an SSD is limited at the level of its max. 
> read-only performance.  Hence the front-end read performance is 
> degraded by the value of the (sequential) write load onto the cache 
> upstream from the HDD.  And vice versa, the write performance of the 
> hybrid gets degraded by the slow write speed of a RAID 5/6 array 
> behind cache - especially at larger queue depths.
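
The shared-bandwidth claim above can be written as a one-line model. This is only a sketch of the reasoning as stated (combined read+write throughput of the SSD capped at its read-only maximum); the function name and the example figures are hypothetical.

```python
def front_end_read_mbps(ssd_max_read_mbps: float, cache_write_mbps: float) -> float:
    """Read bandwidth left once cache-fill writes have taken their share.

    Assumes the SSD's total R+W throughput is capped at its read-only
    maximum, so every MB/s written to the cache is a MB/s lost to reads.
    """
    return max(0.0, ssd_max_read_mbps - cache_write_mbps)


if __name__ == "__main__":
    # Hypothetical SSD: 500 MB/s read-only, 200 MB/s of writes landing on it.
    print(front_end_read_mbps(500.0, 200.0))
```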

I'm not sure I grok what you are saying (if you are saying what I think you
are saying). Surely any sane performance oriented setup would be
write-behind caching on the SSD (and then flushing it out to the RAID array
when there is some idle time).
[AS] Right, and RAID5/6 aren't fast enough for even sequential writes -
cache gets saturated, which manifests itself as the hybrid vdisk slowing
down drastically.  If the HDD-s are in RAID0 it's much, much faster (for
both random writes and reads on the cached disk).
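
The saturation effect can be sketched as a simple fill-rate calculation: a write-behind cache fills at the difference between the incoming write rate and the rate at which the backing array can absorb flushes. The function name and all figures are assumptions for illustration only.

```python
import math


def seconds_until_cache_full(cache_free_mb: float,
                             ingest_mbps: float,
                             drain_mbps: float) -> float:
    """Time until a write-behind cache runs out of free space.

    ingest_mbps: rate at which writes arrive at the cache (assumed constant).
    drain_mbps:  rate at which the backing HDD array absorbs flushed data.
    Returns math.inf when the array drains at least as fast as writes arrive.
    """
    net_fill = ingest_mbps - drain_mbps
    if net_fill <= 0:
        return math.inf
    return cache_free_mb / net_fill


if __name__ == "__main__":
    # Hypothetical: 100 GB free, 500 MB/s incoming, slow vs. fast backing array.
    print(seconds_until_cache_full(100_000, 500, 300))
    print(seconds_until_cache_full(100_000, 500, 600))
```

Once the cache is full, front-end writes can go no faster than the backing array, which would match the drastic slowdown described above.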

Have you looked at flashcache? It's not as advanced as ZFS' L2ARC, but if
for whatever reason ZFS isn't an option for you, it's still an improvement.
[AS] Looking at it and bcache right now

> These limitations, when superposed by most "real-world" test patterns 
> leave the array just marginally better for both writes and reads than 
> an HDD-based RAID10 one with the same number of drives.  Not quite 
> sure why, but it's removing the write speed limit of the HDD-s by 
> changing the RAID level from 5 to 0 that clears the bottleneck.

If that is the case, then clearly your SSD isn't being used for write
caching which removes most of the benefit you are going to get from it. 
See under RAID controller being unfit for purpose. :)
[AS] I'd disagree.  Random reads and writes on cached vdisks are up to 100
times that of the same but uncached RAID0 set of HDD-s on the same
controller.

> The relative difference
> gets much higher for both reads and writes than the write performance 
> gap between pure HDD RAID 0 and 5 vdisks.

I can only guess that this is an artifact of something anti-clever that the
RAID controller is doing. I gave up on hardware RAID controllers over a decade
ago for similar reasons.
[AS] Give LSI 9265-8i a try with CacheCade Pro/FastPath keys and an SLC SSD.
It made my hair stand on end.  It's got a dual-core RISC processor @800 MHz with
performance that 5 years ago was the domain of million-dollar HP or IBM
SAN-s. 

> Having said that, a lot of RAID controllers are pretty useless.
>
> [AS] the newer LSI 2208-based ones seem okay and recent 
> firmware/drivers finally stable.

I'm not convinced. I have some LSI cards in several of my boxes and they
very consistently drop disks that are running SMART tests in the background.
I have yet to find a firmware or driver version that corrects this. There is
a RHBZ ticket open for this somewhere but I can't seem to find it at the
moment.
[AS] The newer 2108 and 2208-based LSI-s (and clones) overheat under load,
which could be the root cause of your problem.  We have 5 controllers per 2U
box, so have to swap the fans in our SuperMicro-s (SC216 24-bay) to 11K RPM
monsters and disable the tach.  Sounds like a jet engine and collects pounds
of dust (telehouse, dirtier than anything you've seen), but the boards
stopped failing.  Also, if the disks are older 3Gb/s and the controllers are
SAS2 - try forcing 3G to the expander where you have your disks.
Syncing/re-syncing lanes might cause the disks to be dropped - no controller likes
different speeds on the same PHY.
Totally agree about firmware.  All boards within one system must be identical
and should run the same, preferably the latest, firmware release, or
they'll do nasty tricks.  Different boards (even near-identical clones)
won't always work in one system.

> But I agree: we always leave out RAID features apart from stripe or 
> mirror and do everything by software.  Advanced features (FastPath, 
> CacheCade) though are fantastic if you use SSD-s, either standalone or 
> as HDD cache.  In fact we use controllers instead of simple HBA-s only 
> to take advantage of these features.

Your experience of the performance that you mention above shows that they
aren't that great in a lot of cases. I've found that software RAID has been
faster than hardware raid since before the turn of the century, and ZFS cuts
off a few more corners.

[AS] The performance is dazzling, especially controllers with new 2208-s.
It takes 7 to 8 SLC SSD-s to saturate the controller's CPU, while it takes
one and a half SSD-s to saturate any PCI-E RAID card circa 2008.  You are
fundamentally right though: we use these new controllers for i) sheer speed
and ii) SSD caching capability.  For RAID, apart from small RAID1 or 0 sets,
we also use software, as the controllers clearly cannot RAID the disks and
still run as fast as they do with loose disks.

>> There is no way to use RAID0 in commercial storage, the main reason 
>> being the absence of hot-spares. If on the other hand the spares are 
>> handled by Gluster in the form of (cached hardware-RAID0) 
>> pre-fabricated bricks, both very good performance and reasonably 
>> sufficient redundancy should be easily achieved.
>
> So why not use ZFS instead? The write performance is significantly 
> better than traditional RAID equivalents and you get vastly more 
> flexibility than with any hardware RAID solution. And it supports caching
> data onto SSDs.

> [AS] Good point.  We have no experience with it, but we should try.  Do 
> you know if it can be made distributed and "parallel" like Gluster, and 
> whether it supports RDMA transport for storage traffic between heads?

In a word - no. I was referring to using ZFS as the backing FS for GLFS.
[AS] ah clear now...

> The main reason we've been
> looking into Gluster is cheap bandwidth: all our servers and nodes are 
> connected via 40Gbit IB fabric, 2 ports per server, 4 on some larger 
> ones, non-blocking edge switches, directors at floor level etc - 80 to 90%
> idle.
> Can you make global spares in ZFS?

No, ZFS is a single-node FS. It can replace your RAID + local FS stack, but
you would still need to use GLFS on top to get the multi-node distributed
features.
[AS] It almost defeats the purpose then.  Mdadm on bare Linux does it for
us, and it's awfully fast.

Gone off-topic...  Well, is there a way to set up global spare bricks in GLFS?
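
For illustration only, the kind of spare handling asked about could be approximated outside Gluster by a small watchdog script. The sketch below is not an existing Gluster feature: `pick_spare` and `replace_brick_cmd` are hypothetical helpers, the host and path names are made up, and the script only builds the command line rather than running it. It assumes the `gluster volume replace-brick VOLNAME OLD-BRICK NEW-BRICK commit force` form of the CLI, which should be checked against the installed version.

```python
from typing import List, Optional


def pick_spare(spares: List[str], surviving_replicas: List[str]) -> Optional[str]:
    """Pick the first spare brick whose host holds no surviving replica.

    Bricks are "host:/path" strings, the form the gluster CLI uses.
    Returns None when every spare sits on a host that already has a replica.
    """
    busy_hosts = {brick.split(":", 1)[0] for brick in surviving_replicas}
    for spare in spares:
        if spare.split(":", 1)[0] not in busy_hosts:
            return spare
    return None


def replace_brick_cmd(volume: str, failed: str, spare: str) -> List[str]:
    """Build (but do not run) the command that swaps a dead brick for a spare."""
    return ["gluster", "volume", "replace-brick", volume, failed, spare,
            "commit", "force"]


if __name__ == "__main__":
    # Hypothetical layout: the replica on head1 died, its mirror lives on head2.
    spare = pick_spare(["head2:/bricks/spare1", "head3:/bricks/spare1"],
                       ["head2:/bricks/b1"])
    if spare is not None:
        print(" ".join(replace_brick_cmd("vol0", "head1:/bricks/b1", spare)))
```

Choosing a spare on a different head than the surviving replica keeps the mirror from ending up with both copies on one machine, which is the point of strewing spares across cluster heads.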

Gordan
