[Gluster-users] Design/HW for cost-efficient NL archive >= 0.5PB?

Thu Jan 2 17:06:46 UTC 2014

1) It depends on the number of drives per chassis, your tolerance for risk,
and the speed of rebuilds.  I'd recommend doing a couple of test rebuilds
with different array sizes to see how fast your controller and drives can
complete them, and then comparing the rebuild completion times to your
SLA.  If a rebuild takes two days to complete, is that good enough for you
(especially given the chance of another failure occurring during the
rebuild)?  All other things being equal, the smaller the array, the faster
the rebuild, but the more space is "wasted" on parity.  Also note that many
controllers have tunable rebuild algorithms, so you can divert more
resources to completing rebuilds faster at the cost of foreground
performance.  One data point from me: my last rebuild of a 16-drive RAID-6
array of 2T SATA disks took about 58 hours to complete.
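
To put rough numbers on that trade-off, here is a back-of-the-envelope
sketch in Python.  The function names are made up for illustration.  The
rebuild estimate simply scales the 58-hour data point linearly with drive
size, which is an assumption and no substitute for an actual test rebuild.
Capacities are in decimal TB and ignore filesystem overhead and hot spares.

```python
import math


def raid6_usable_fraction(drives: int) -> float:
    """Fraction of raw capacity left after two drives' worth of parity."""
    if drives < 4:
        raise ValueError("RAID-6 needs at least 4 drives")
    return (drives - 2) / drives


def scaled_rebuild_hours(drive_tb: float,
                         ref_hours: float = 58.0,
                         ref_drive_tb: float = 2.0) -> float:
    """Assumption: rebuild time grows linearly with per-drive capacity."""
    return ref_hours * drive_tb / ref_drive_tb


def arrays_needed(usable_tb: float, drives: int, drive_tb: float) -> int:
    """RAID-6 arrays needed to reach usable_tb in one replica."""
    per_array_tb = (drives - 2) * drive_tb
    return math.ceil(usable_tb / per_array_tb)


if __name__ == "__main__":
    # 500 TB usable per replica, 4 TB drives, a few candidate array sizes.
    for n in (8, 12, 16, 24):
        print(n, "drives:",
              f"{raid6_usable_fraction(n):.1%} usable,",
              arrays_needed(500, n, 4.0), "arrays per replica")
    print("4 TB rebuild estimate (hours):", scaled_rebuild_hours(4.0))
```

Multiply the array count by the number of replicas to get the total, and
treat the rebuild figure as a lower bound until you have measured your own
controller under load.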

2) My understanding is that the way file reads work on GlusterFS, read
requests are sent to all nodes and the data is used from the first node to
respond to the request.  So if one node is busier than others, it is likely
to respond more slowly and thus receive a lower portion of the read
activity, as long as the files being read are larger than a single
response.
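
A toy simulation of that "first responder wins" behavior, in Python.  This
only models the behavior as I described it above; it is not GlusterFS code,
and the replica names, latency figures and exponential latency model are
all made up for illustration.

```python
import random


def simulate_first_responder(mean_latency_ms, reads=10_000, seed=1):
    """Send every read to all replicas and count which one answers first."""
    rng = random.Random(seed)
    wins = {name: 0 for name in mean_latency_ms}
    for _ in range(reads):
        # One hypothetical response time per replica (exponential model).
        latencies = {name: rng.expovariate(1.0 / mean)
                     for name, mean in mean_latency_ms.items()}
        wins[min(latencies, key=latencies.get)] += 1
    return {name: count / reads for name, count in wins.items()}


# Assumed figures: the rebuilding replica is three times slower on average.
shares = simulate_first_responder({"healthy": 1.0, "rebuilding": 3.0})
print(shares)
```

Under this model the slower replica still serves some reads, just a
smaller share, so it is a soft preference rather than a hard failover.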

On Wed, Jan 1, 2014 at 12:21 PM, Fredrik Häll <hall.fredrik at gmail.com> wrote:

> Thanks for all the input!
>
> It sure sounds like RAID-6 for disk failures and Gluster for the spanning
> and high level redundancy parts is a good candidate.
>
> Some final questions:
>
> 1) How big can one comfortably go in terms of RAID-6 array size, given 4TB
> SATA/SAS drives? On the one hand, much points to keeping the number of
> arrays as low as possible, since that maximizes usable disk space. But
> there are complications in terms of rebuild times and the risk of losing
> 2 drives. Hot spares may also be an option. Your reflections?
>
> 2) Is there any intelligence or automation in Gluster that makes smart use
> of dual (or multiple) replicas? Say that I have 2 replicas, and one of them
> is spending some effort on a RAID rebuild, is there functionality for
> manually or automatically preferring the other (healthy) replica?
>
> Best regards,
>
> Fredrik
>
>
> On Tue, Dec 31, 2013 at 10:27 PM, Justin Dossey <jbd at podomatic.com> wrote:
>
>> Yes, RAID-6 is better than RAID-5 in most cases.  I agonized over the
>> decision to deploy 5 for my Gluster cluster, and the reason I went with 5
>> is that the number of drives in the brick was (IMO) acceptably low.  I use
>> 6 for my 16-drive arrays, which means I have to lose 3 disks out of the 16
>> to lose my data.  With 2x8-drive arrays in 5, losing 2 disks in the same
>> array is enough to lose data (and losing any 3 guarantees it), but if I do
>> lose data, I only lose 50% of the data on the server, and all these bricks
>> are distribute-replicate anyway, so I wouldn't actually lose any data at
>> all.  That consideration, paired with the fact
>> that I keep spares on hand and replace failed drives within a day or two,
>> means that I'm okay with running 2x RAID-5 instead of 1x RAID-6.  (2x
>> RAID-6 would put me below my storage target, forcing additional hardware
>> purchases.)
>>
>> I suppose the short answer is "evaluate your storage needs carefully."
>>
>>
>> On Tue, Dec 31, 2013 at 11:19 AM, James <purpleidea at gmail.com> wrote:
>>
>>> On Tue, Dec 31, 2013 at 11:33 AM, Justin Dossey <jbd at podomatic.com>
>>> wrote:
>>> >
>>> > Yes, I'd recommend sticking with RAID in addition to GlusterFS.  The
>>> cluster I'm mid-build on (it's a live migration) is 18x RAID-5 bricks on 9
>>> servers.  Each RAID-5 brick is 8 2T drives, so about 13T usable.  It's
>>> better to deal with a RAID when a disk fails than to have to pull and
>>> replace the brick, and I believe Red Hat's official recommendation is still
>>> to minimize the number of bricks per server (which makes me a rebel for
>>> having two, I suppose).  9 (slow-ish, SATA RAID) servers easily saturate
>>> 1Gbit on a busy day.
>>>
>>>
>>> I think Red Hat also recommends RAID6 instead of RAID5. In any case, I
>>> sure do, at least.
>>>
>>> James
>>>
>>>
>>>
>>> On Mon, Dec 30, 2013 at 5:54 AM, bernhard glomm
>>> <bernhard.glomm at ecologic.eu> wrote:
>>> >
>>> > Some years ago I had a similar task.
>>> > What I did:
>>> > - We had disk arrays with 24 slots, with an optional 4 JBODs (24 slots
>>> > each) stacked on top, dual fiber-optic (LWL) controller 4GB (costs ;-)
>>> > - created RAID-6 arrays with no more than 7 disks each
>>> > - as far as I remember, I had one hot spare for every 4 arrays
>>> > - connected as many of these RAID bricks together with striped
>>> > glusterfs as needed
>>> > - as for replication, I was planning for an offsite duplicate of this
>>> > architecture and,
>>> > because losing data was REALLY not an option, writing it all out at a
>>> > second offsite location onto LTFS tapes.
>>> > As the original version of the LTFS library edition was far too
>>> > expensive for us,
>>> > I found an alternative solution that does the same thing
>>> > but for a much more reasonable price. LTFS is still a big thing in
>>> > digital archiving.
>>> > Drop me a note if you would like more details on that.
>>> >
>>> > - This way I could fsck all the (not too big) arrays in parallel (sped
>>> > things up)
>>> > - proper robustness against disk failure
>>> > - space that could grow without limit (add more and bigger disks)
>>> > and keep up with access speed (add more servers) at a fairly
>>> > predictable price
>>> > - LTFS in the vault provided the finishing touch: data stays accessible
>>> > even if two out of three sites are down,
>>> > at a reasonable price (for instance, no heat problem at the tape
>>> > location)
>>> > Nowadays I would go for the same approach, except with zfs raidz3
>>> > bricks (or at least do a thorough test of them)
>>> > instead of (small) hardware RAID bricks.
>>> > For simplicity and robustness, I wouldn't like to end up with
>>> > several hundred glusterfs bricks, each on one individual disk.
>>> > I would rather leave disk failure protection to either hardware RAID
>>> > or zfs and use gluster to connect these bricks into the
>>> > fs size I need (and for mirroring the whole thing to a second site
>>> > if needed)
>>> > hth
>>> > Bernhard
>>> >
>>> >
>>> >
>>> > Bernhard Glomm
>>> > IT Administration
>>> >
>>> > Phone: +49 (30) 86880 134
>>> > Fax: +49 (30) 86880 100
>>> > Skype: bernhard.glomm.ecologic
>>> > Ecologic Institut gemeinnützige GmbH | Pfalzburger Str. 43/44 | 10717
>>> Berlin | Germany
>>> > GF: R. Andreas Kraemer | AG: Charlottenburg HRB 57947 | USt/VAT-IdNr.:
>>> DE811963464
>>> > Ecologic™ is a Trade Mark (TM) of Ecologic Institut gemeinnützige GmbH
>>> > ________________________________
>>> >
>>> > On Dec 25, 2013, at 8:47 PM, Fredrik Häll <hall.fredrik at gmail.com>
>>> wrote:
>>> >
>>> > I am new to Gluster, but so far it seems very attractive for my needs.
>>> I am trying to assess its suitability for a cost-efficient storage problem
>>> I am tackling. Hopefully someone can help me find how to best solve my
>>> problem.
>>> >
>>> > Capacity:
>>> > Start with around 0.5PB usable
>>> >
>>> > Redundancy:
>>> > 2 replicas with non-RAID is not sufficient. Either 3 replicas with
>>> non-raid or some combination of 2 replicas and RAID?
>>> >
>>> > File types:
>>> > Large files, around 400-1500MB each.
>>> >
>>> > Usage pattern:
>>> > Archive (not sure if this matches nearline or not) with files being
>>> added at around 200-300GB/day (300-400 files/day). Very few reads, on the
>>> order of 10 file accesses per day. Concurrent reads highly unlikely.
>>> >
>>> > The main two factors for me are cost and redundancy. Losing data is
>>> not an option, being an archive solution. Cost/usable TB is the other key
>>> factor, as we see growth estimates of 100-500TB/year.
>>> >
>>> > Looking just at $/TB, a RAID-based approach to me sounds more
>>> efficient. But RAID rebuild times with large arrays of large capacity
>>> drives sound really scary. Not sure if something smart can be done since we
>>> will still have a replica left during the rebuild?
>>> >
>>> > So, any suggestions on what would be possible and cost-efficient
>>> solutions?
>>> >
>>> > - Any experience on dense servers, what is advisable? 24/36/50/60
>>> slots?
>>> > - SAS expanders/storage pods?
>>> > - RAID vs non-RAID?
>>> > - Number of replicas etc?
>>> >
>>> > Best,
>>> >
>>> > Fredrik
>>> > _______________________________________________
>>> > Gluster-users mailing list
>>> > Gluster-users at gluster.org
>>> > http://supercolony.gluster.org/mailman/listinfo/gluster-users
>>> >
>>> >
>>> >
>>>
>>>
>>>
>>>
>>> > --
>>> > Justin Dossey
>>> > CTO, PodOmatic
>>> >
>>> >
>>>
>>
>>
>>
>> --
>> Justin Dossey
>> CTO, PodOmatic
>>
>>
>
>

-- 
Justin Dossey
CTO, PodOmatic