[Gluster-devel] GlusterFS Spare Bricks?

Thu Apr 12 23:11:08 UTC 2012

http://www.lighthouse-partners.com/linux/presentations09/HPL09-Session6.pdf

From page 27: a discussion of storage redundancy issues - could be useful too.

-----Original Message-----
From: abperiasamy at gmail.com [mailto:abperiasamy at gmail.com] On Behalf Of
Anand Babu Periasamy
Sent: Thursday, April 12, 2012 7:56 PM
To: 7220022
Cc: gluster-devel at nongnu.org
Subject: Re: [Gluster-devel] GlusterFS Spare Bricks?

> -----Original Message-----
> From: abperiasamy at gmail.com [mailto:abperiasamy at gmail.com] On Behalf 
> Of Anand Babu Periasamy
> Sent: Wednesday, April 11, 2012 10:13 AM
> To: 7220022
> Cc: gluster-devel at nongnu.org
> Subject: Re: [Gluster-devel] GlusterFS Spare Bricks?
>
> On Tue, Apr 10, 2012 at 1:39 AM, 7220022 <7220022 at gmail.com> wrote:
> >
> > Are there plans to add provisioning of spare bricks in a replicated
> > (or distributed-replicated) configuration? E.g., when a brick in a
> > mirror set dies, the system rebuilds it automatically on a spare,
> > similar to how it's done by RAID controllers.
> >
> >
> >
> > Not only would it improve practical reliability, especially of large
> > clusters, but it would also make it possible to build
> > better-performing clusters from less expensive components. For
> > example, instead of having slow RAID5 bricks on expensive RAID
> > controllers, one uses cheap HBA-s and stripes a few disks per brick
> > in RAID0 - that's faster for writes than RAID 5/6 by an order of
> > magnitude (and, by the way, should improve the rebuild times in
> > Gluster that many are complaining about). A failure of one such
> > striped brick is not catastrophic in a mirrored Gluster - but it's
> > better to have spare bricks standing by, strewn across cluster heads.
> >
> >
> >
> > A more advanced setup at the hardware level involves creating "hybrid
> > disks", where HDD vdisks are cached by enterprise-class SSD-s. It
> > works beautifully and makes HDD-s amazingly fast for random
> > transactions. The technology has become widely available on many
> > $500 COTS controllers. However, it is not widely known that the
> > results with HDD-s in RAID0 under SSD cache are 10 to 20 (!!) times
> > better than with RAID 5 or 6.
> >
> >
> >
> > There is no way to use RAID0 in commercial storage, the main reason
> > being the absence of hot-spares. If, on the other hand, the spares
> > are handled by Gluster in the form of pre-fabricated (cached
> > hardware-RAID0) bricks, both very good performance and reasonably
> > sufficient redundancy should be easily achieved.
>
> Why not use the "gluster volume replace-brick ..." command? You can use
> external monitoring/management tools (e.g. freeipmi) to detect node
> failures and trigger replace-brick through a script. GlusterFS has the
> mechanism for hot spare, but the policy should be external.
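
A minimal sketch of the external-policy idea above, in Python. It assumes a hypothetical list of spare bricks in host:/path form and a monitoring tool that has already reported which brick failed. The helper names, volume name and brick paths are made up for illustration. The trailing action words of replace-brick differ between GlusterFS releases ("start" and "commit" in older ones, "commit force" in newer ones), so check "gluster volume help" on the installed version. The sketch only prints the command and does not run it.

```python
import shlex


def pick_spare(spares, failed_brick):
    """Return the first spare that lives on a different host than the failed brick."""
    failed_host = failed_brick.split(":", 1)[0]
    for spare in spares:
        if spare.split(":", 1)[0] != failed_host:
            return spare
    return None  # no usable spare; leave the decision to an operator


def build_replace_brick_cmd(volume, failed_brick, spare_brick, action="commit force"):
    """Build the argv for 'gluster volume replace-brick' (action is version dependent)."""
    return ["gluster", "volume", "replace-brick", volume,
            failed_brick, spare_brick, *action.split()]


if __name__ == "__main__":
    # Hypothetical spare pool and failure report from the monitoring side.
    spares = ["node1:/bricks/spare0", "node3:/bricks/spare0"]
    failed = "node1:/bricks/b0"
    spare = pick_spare(spares, failed)
    if spare is not None:
        print(shlex.join(build_replace_brick_cmd("vol0", failed, spare)))
```

Keeping the spare selection and the command construction separate means the policy (which spare, and when) stays outside GlusterFS, as suggested above, while the mechanism remains the stock command.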
>
> [AS] That should work, but it would still be prone to human error. In
> our experience, if we had not had hot spares (block storage) we would
> surely have experienced catastrophic failures. First off, COTS disks
> (and controllers, if we talk GlusterFS nodes) have a break-in period
> when the bad ones fail under load within a few months. Secondly, a lot
> of our equipment is in remote telco facilities where power,
> cleanliness or air conditioning can be far from ideal - leading to
> increasing failure rates about 2 years after deployment. As a rule, we
> have at least 4 hot spares per two 24-bay enclosures, while our sister
> company with a similar use profile does 4-6 spares per enclosure, as
> they run older and less uniform equipment.
>
> A node may come back online in 5 mins; GlusterFS should not
> automatically make decisions.
> [AS] Good point, e.g. down for maintenance
>
> I am wondering whether it makes sense to add hot-spare as a standard
> feature, because GlusterFS detects failures.
>
> [AS] Given the reason above, it would be best if the feature could be
> turned on and off. Before attempting maintenance - turn it off. Once
> maintenance is complete and the node is up, the "turn hotspare on"
> command is issued, but it is queued until the reconstruction of the
> node begins - and takes it into consideration (it won't attempt to
> sync to spare bricks in case reconstruction to other good bricks has
> already begun).
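
The on/off behaviour described above, together with the "node may come back in 5 mins" concern, could be sketched roughly as follows. This is not an existing GlusterFS feature: the class, its method names and the 300-second grace period are all assumptions made for illustration. Times are passed in explicitly so the logic stays easy to follow.

```python
class HotSparePolicy:
    """Hypothetical hot-spare policy with a grace period and a maintenance switch."""

    def __init__(self, grace_seconds=300.0):
        self.grace_seconds = grace_seconds  # assumed: wait 5 minutes before acting
        self.enabled = True
        self.enable_queued = False
        self.degraded_since = {}  # brick -> time it was first seen degraded
        self.healing = set()      # bricks already being reconstructed in place

    def turn_off(self):
        """Issued before maintenance."""
        self.enabled = False
        self.enable_queued = False

    def turn_on(self):
        """Issued after maintenance; queued until a reconstruction begins."""
        self.enable_queued = True

    def brick_degraded(self, brick, now):
        self.degraded_since.setdefault(brick, now)

    def heal_started(self, brick):
        self.healing.add(brick)
        if self.enable_queued:  # the queued "on" takes effect only now
            self.enabled = True
            self.enable_queued = False

    def heal_finished(self, brick):
        self.healing.discard(brick)
        self.degraded_since.pop(brick, None)

    def bricks_to_replace(self, now):
        """Bricks degraded past the grace period and not already healing."""
        if not self.enabled:
            return []
        return sorted(b for b, t in self.degraded_since.items()
                      if now - t >= self.grace_seconds and b not in self.healing)


if __name__ == "__main__":
    policy = HotSparePolicy()
    policy.turn_off()                                  # maintenance starts
    policy.brick_degraded("node2:/bricks/b0", now=0)   # node taken down on purpose
    policy.brick_degraded("node4:/bricks/b1", now=100) # unrelated real failure
    policy.turn_on()                                   # queued, not yet active
    print(policy.bricks_to_replace(now=1000))
    policy.heal_started("node2:/bricks/b0")            # node is back and resyncing
    print(policy.bricks_to_replace(now=1000))
```

In this sketch the brick that is already resyncing is never handed to a spare, while the brick that failed for real during the maintenance window is picked up as soon as the queued "on" takes effect.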
>
> In half the cases, failed disks and controllers fail randomly and
> temporarily (due to dust, bad power etc.). Most of the time the root
> cause is unknown or is impractical to debug in a live system. Block
> storage SAN-s have more or less standard configuration tools that also
> take that into account. Here's a brief description in their
> terminology, which may help in creating the logic in GlusterFS:
>
> 1. Drives can have the statuses of Online, Unconfigured Good,
> Unconfigured Bad, Spare (LSP, a spare local to the drive group),
> Global Spare (GSP, across the system) and Foreign.
> 2. vDisks can be Optimal, Degraded, or "Degraded, Rebuilding".
> 3. In the presence of spares, if a drive in a redundant vDisk fails,
> the system marks the drive as Unconfigured Bad and the vDisk picks up
> the spare and enters the Rebuilding mode.
> 4. The system won't let you make an Unconfigured Bad drive Online.
> But you can try a "make unconfigured good" command on it. If
> successful, and it passes initialization and doesn't show trouble in
> SMART - include it in a new vDisk, make it a spare, etc. If it's bad
> - replace it.
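
The four points above can be sketched as a small state model. The state names come from the list; the DriveGroup class, its methods and the passes_checks flag (standing in for initialization and SMART results) are hypothetical, and the model is simplified to one vDisk with local spares only. How this would map onto GlusterFS bricks is left open.

```python
from enum import Enum


class DriveState(Enum):
    ONLINE = "Online"
    UNCONFIGURED_GOOD = "Unconfigured Good"
    UNCONFIGURED_BAD = "Unconfigured Bad"
    LOCAL_SPARE = "Spare (LSP)"
    GLOBAL_SPARE = "Global Spare (GSP)"
    FOREIGN = "Foreign"


class VDiskState(Enum):
    OPTIMAL = "Optimal"
    DEGRADED = "Degraded"
    REBUILDING = "Degraded, Rebuilding"


SPARE_STATES = (DriveState.LOCAL_SPARE, DriveState.GLOBAL_SPARE)


class DriveGroup:
    """Hypothetical model of one redundant vDisk plus its spares."""

    def __init__(self, members, spares):
        self.drives = {d: DriveState.ONLINE for d in members}
        self.drives.update({d: DriveState.LOCAL_SPARE for d in spares})
        self.vdisk = VDiskState.OPTIMAL

    def fail(self, drive):
        # Item 3: the failed drive becomes Unconfigured Bad; if a spare
        # exists, the vDisk picks it up and starts rebuilding.
        self.drives[drive] = DriveState.UNCONFIGURED_BAD
        spare = next((d for d, s in self.drives.items() if s in SPARE_STATES), None)
        if spare is None:
            self.vdisk = VDiskState.DEGRADED
        else:
            self.drives[spare] = DriveState.ONLINE
            self.vdisk = VDiskState.REBUILDING
        return spare

    def rebuild_done(self):
        if self.vdisk is VDiskState.REBUILDING:
            self.vdisk = VDiskState.OPTIMAL

    def make_online(self, drive):
        # Item 4: an Unconfigured Bad drive cannot be made Online directly.
        if self.drives[drive] is DriveState.UNCONFIGURED_BAD:
            raise ValueError(f"{drive} is Unconfigured Bad")
        self.drives[drive] = DriveState.ONLINE

    def make_unconfigured_good(self, drive, passes_checks):
        # Item 4: only a drive that passes its checks leaves the Bad state.
        if self.drives[drive] is not DriveState.UNCONFIGURED_BAD:
            raise ValueError(f"{drive} is not Unconfigured Bad")
        if passes_checks:
            self.drives[drive] = DriveState.UNCONFIGURED_GOOD
        return passes_checks

    def make_spare(self, drive):
        if self.drives[drive] is not DriveState.UNCONFIGURED_GOOD:
            raise ValueError(f"{drive} is not Unconfigured Good")
        self.drives[drive] = DriveState.LOCAL_SPARE


if __name__ == "__main__":
    group = DriveGroup(members=["d0", "d1"], spares=["s0"])
    group.fail("d0")
    print(group.vdisk.value)
    group.rebuild_done()
    if group.make_unconfigured_good("d0", passes_checks=True):
        group.make_spare("d0")
    print(group.drives["d0"].value)
```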
>

Very useful points. Took notes.
-ab