[Gluster-devel] GlusterFS Spare Bricks?

7220022 7220022 at gmail.com
Wed Apr 11 18:29:00 UTC 2012



-----Original Message-----
From: abperiasamy at gmail.com [mailto:abperiasamy at gmail.com] On Behalf Of
Anand Babu Periasamy
Sent: Wednesday, April 11, 2012 10:13 AM
To: 7220022
Cc: gluster-devel at nongnu.org
Subject: Re: [Gluster-devel] GlusterFS Spare Bricks?

On Tue, Apr 10, 2012 at 1:39 AM, 7220022 <7220022 at gmail.com> wrote:
>
> Are there plans to add provisioning of spare bricks in a replicated (or
distributed-replicated) configuration? E.g., when a brick in a mirror set
dies, the system rebuilds it automatically on a spare, similar to how it is
done by RAID controllers.
>
>
>
> Not only would it improve practical reliability, especially of large
clusters, but it would also make it possible to build better-performing
clusters from less expensive components. For example, instead of having slow
RAID5 bricks on expensive RAID controllers, one uses cheap HBAs and stripes a
few disks per brick in RAID0 - that is faster for writes than RAID 5/6 by an
order of magnitude (and, by the way, should improve the rebuild times in
Gluster that many are complaining about).  A failure of one such striped
brick is not catastrophic in a mirrored Gluster - but it is better to have
spare bricks standing by, spread across cluster heads.
>
>
>
> A more advanced setup at the hardware level involves creating "hybrid
disks", whereby HDD vdisks are cached by enterprise-class SSDs.  It works
beautifully and makes HDDs amazingly fast for random transactions.  The
technology has become widely available on many $500 COTS controllers.
However, it is not widely known that the results with HDDs in RAID0 under an
SSD cache are 10 to 20 (!!) times better than with RAID 5 or 6.
>
>
>
> There is no way to use RAID0 in commercial storage, the main reason being
the absence of hot spares.  If, on the other hand, the spares are handled by
Gluster in the form of pre-fabricated (cached hardware-RAID0) bricks, both
very good performance and reasonably sufficient redundancy should be easily
achieved.

Why not use the "gluster volume replace-brick ..." command? You can use
external monitoring/management tools (e.g. freeipmi) to detect node failures
and trigger replace-brick through a script. GlusterFS has the mechanism for
hot spares, but the policy should be external.
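
Something along these lines, for example - untested, and note that the volume
name, brick paths, spare pool and health check below are placeholders, and
the exact replace-brick sub-command syntax ("start"/"commit" vs. "commit
force") differs between GlusterFS releases:

#!/usr/bin/env python
# Sketch of an external hot-spare policy script; everything here is a
# placeholder except the gluster CLI invocation itself.
import subprocess

VOLUME = "myvol"                                            # placeholder
SPARES = ["node5:/bricks/spare1", "node6:/bricks/spare1"]   # placeholder spares

def brick_is_dead(brick):
    # Placeholder: wire this up to freeipmi, ping, or your own check.
    # It always returns False here, so the sketch does nothing by default.
    return False

def replace_with_spare(dead_brick):
    if not SPARES:
        raise RuntimeError("no spare bricks left")
    spare = SPARES.pop(0)
    subprocess.check_call(["gluster", "volume", "replace-brick",
                           VOLUME, dead_brick, spare, "commit", "force"])

if __name__ == "__main__":
    for brick in ["node1:/bricks/b1", "node2:/bricks/b1"]:  # placeholder bricks
        if brick_is_dead(brick):
            replace_with_spare(brick)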

[AS] That should work, but it would still be prone to human error.  In our
experience, if we had not had hot spares (block storage) we would surely have
experienced catastrophic failures.  First off, COTS disks (and controllers,
if we are talking GlusterFS nodes) have a break-in period during which the
bad ones fail under load within a few months.  Secondly, a lot of our
equipment is in remote telco facilities where power, cleanliness or air
conditioning can be far from ideal - leading to increasing failure rates
about 2 years after deployment.  As a rule, we keep at least 4 hot spares per
two 24-bay enclosures, while our sister company, with a similar use profile,
keeps 4-6 spares per enclosure, as they run older and less uniform equipment.

A node may come back online in 5 minutes; GlusterFS should not automatically
make decisions.
[AS] Good point, e.g. a node taken down for maintenance.

I am wondering whether it makes sense to add hot spares as a standard
feature, since GlusterFS already detects failures.

[AS] Given the reason above, it would be best if the feature could be turned
on and off.  Before attempting maintenance - turn it off.  Once maintenance
is complete and the node is back up - issue the "turn hot spare on" command,
but queue it until the reconstruction of the node begins, and take that into
consideration (do not attempt to sync to spare bricks if reconstruction to
other good bricks has already begun).
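
For now we would probably implement the on/off switch outside GlusterFS, e.g.
as a flag file that the monitoring script checks before triggering a
replace-brick - a rough sketch, with the path being just an example:

import os

MAINT_FLAG = "/var/run/gluster-hotspare.disabled"   # example path only

def hotspare_enabled():
    # Hot-spare handling is treated as "off" while the flag file exists.
    return not os.path.exists(MAINT_FLAG)

# Before maintenance:    touch /var/run/gluster-hotspare.disabled
# Maintenance complete:  rm /var/run/gluster-hotspare.disabled
# The monitoring script skips replace-brick while hotspare_enabled() is False.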

In about half the cases, disks and controllers fail randomly and temporarily
(due to dust, bad power, etc.).  Most of the time the root cause is unknown
or is impractical to debug in a live system.  Block-storage SANs have more or
less standard configuration tools that take that into account.  Here is a
brief description in their terminology, which may help in creating the logic
in GlusterFS (a rough sketch of the state logic follows the list):

1. Drives can have the statuses Online, Unconfigured Good, Unconfigured Bad,
Spare (LSP, a spare local to the drive group), Global Spare (GSP, available
across the system) and Foreign.
2. vDisks can be Optimal, Degraded, or "Degraded, Rebuilding".
3. In the presence of spares, if a drive in a redundant vDisk fails, the
system marks the drive as Unconfigured Bad and the vDisk picks up the spare
and enters Rebuilding mode.
4. The system will not let you make an Unconfigured Bad drive Online.  But
you can try a "make unconfigured good" command on it.  If that succeeds, and
the drive passes initialization and shows no trouble in SMART, include it in
a new vDisk, make it a spare, etc.  If it is bad, replace it.
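
Roughly, the logic might look like this (an illustrative model only; the
state names follow the list above, and the transition rules describe how our
RAID controllers behave, not existing GlusterFS code):

def on_drive_failure(vdisk, failed_drive, spares):
    # Item 3: the failed drive goes Unconfigured Bad; if a spare is
    # available, the vDisk picks it up and starts rebuilding.
    failed_drive["state"] = "Unconfigured Bad"
    if spares:
        spare = spares.pop(0)
        spare["state"] = "Online"
        vdisk["state"] = "Degraded, Rebuilding"
    else:
        vdisk["state"] = "Degraded"

def try_make_unconfigured_good(drive, passes_init, smart_ok):
    # Item 4: an Unconfigured Bad drive can never be forced Online; it can
    # only be retried as Unconfigured Good if it passes the checks.
    if drive["state"] != "Unconfigured Bad":
        return False
    if passes_init and smart_ok:
        drive["state"] = "Unconfigured Good"   # usable in a new vDisk or as a spare
        return True
    return False                               # still bad: replace the drive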


--
Anand Babu Periasamy
Blog [ http://www.unlocksmith.org ]
Twitter [ http://twitter.com/abperiasamy ]

Imagination is more important than knowledge --Albert Einstein




