[Gluster-devel] Harddisk economy alternatives
Gordan Bobic
gordan at bobich.net
Wed Nov 9 17:51:59 UTC 2011
On Wed, 09 Nov 2011 17:50:00 +0100, Magnus Näslund
<magnus at arkivdigital.se> wrote:
[...]
> We want the data replicated at least 3 times physically (box-wise),
> so we've ordered 3 test servers with 24x3TB "enterprise" SATA disks
> each with an areca card + bbu. We'll probably be running the tests
> feeding raid volumes to glusterfs, and from what I've seen this seems
> to be a standard.
With that amount of space I hope you are going to be using something
like ZFS rather than plain RAID. Otherwise you are likely to find that
the error rate slowly and silently eats your data.
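To give a concrete idea of what I mean (this is just a sketch; the pool
and device names are placeholders and I'm assuming ZFS-on-Linux with the
disks visible as plain SATA devices), a 24-disk box could be carved into
three 8-disk raidz2 vdevs (or raidz3 for extra margin) in a single pool:

  # one pool made of three 8-disk raidz2 vdevs (device names are examples)
  zpool create tank \
      raidz2 sda sdb sdc sdd sde sdf sdg sdh \
      raidz2 sdi sdj sdk sdl sdm sdn sdo sdp \
      raidz2 sdq sdr sds sdt sdu sdv sdw sdx
  # verify the layout and health
  zpool status tank

ZFS then checksums every block and repairs from redundancy on read or
scrub, which is exactly what plain RAID won't do for you.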
> Possible future:
>
> Since our storage system will be in it for a really long term, we're
> looking at the total economics of the solution vs. the data safety
> concerns.
>
> We've seen suggestions on letting glusterfs manage the disk directly.
What exactly do you mean by that? GlusterFS requires a normal
xattr-capable FS underneath it. Thus I presume you are referring to
using GLFS instead of RAID (i.e. stripe+distribute).
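Either way, whatever sits under each brick has to support extended
attributes. A quick sanity check of the sort I'd run on a test brick
(device, mount point and file names are made up for illustration; I'm
assuming XFS here, but ext4 with user_xattr would do equally well):

  # an xattr-capable backend FS for a brick, e.g. XFS
  mkfs.xfs -i size=512 /dev/sdb1
  mkdir -p /bricks/b1
  mount -o noatime /dev/sdb1 /bricks/b1
  # confirm that extended attributes actually work on it
  touch /bricks/b1/xattr-test
  setfattr -n user.test -v 1 /bricks/b1/xattr-test
  getfattr -n user.test /bricks/b1/xattr-test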
> The way I see it, this would give a win in that
> 1) We would be using all disks, no RAID/spare storage overhead
> 2) No RAID-rebuilds
> 3) ...
> 4) Profit
>
> Also, we know that any long time system we build should be planned
> with replacing disks continuously.
My main concern with such data volumes would be the error rates of
modern disks. If your FS doesn't have automatic checking and block-level
checksums, you will suffer data corruption, silent or otherwise. The
quality of modern disks is pretty appalling these days. One of my
experiences is written up here:
http://www.altechnative.net/?p=120
but it is by no means the only one.
Currently the only FS that meets all of my reliability criteria is ZFS
(and the Linux port works quite well now), and it has saved me from data
corruption, silent and otherwise, a number of times, in cases where
normal RAID wouldn't have helped.
> So in my mind we could buy quality boxes with 24-36 disks run by 3-4
> SATA controller cards (Marvell?),
My experience with Marvell cards is limited. Do they have 8-port cards?
I use 8-port LSI cards without any serious problems. The only issue I
have seen is that they tend to reset the bus when a disk is slow to
respond (specifically while it is running a SMART self-test), which
means you effectively lose the SMART short/long self-test option for
monitoring; this is mitigated by weekly ZFS scrubs, which I trust more
anyway.
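For what it's worth, the scrubs don't need anything clever, just a cron
entry along these lines (the pool name "tank" is a placeholder):

  # /etc/cron.d/zfs-scrub -- weekly scrub, Sunday 02:30
  30 2 * * 0  root  /sbin/zpool scrub tank

and then checking "zpool status -v tank" for errors afterwards, either
by eye or from your monitoring.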
> using cheap and large desktop disks
> (maybe not the "green" variety).
I would suggest you at the very least use disks that have
Write-Read-Verify capability. My recent experience is that only
Seagates include this feature, even though, as it turns out, Samsung
seems to own the patent on it (and my Samsungs definitely don't have
it). If you go down this route, you may also want to look into the WRV
patch for hdparm that I submitted upstream, though there hasn't been a
release including it yet.
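You can check whether a given drive advertises the feature before buying
a pile of them; hdparm's identify output lists it if the drive supports
it (the exact wording may vary between hdparm versions, and /dev/sdX is
obviously a placeholder):

  # look for the Write-Read-Verify feature set in the drive's identify data
  hdparm -I /dev/sdX | grep -i "write-read-verify"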
Another good idea is to use disks of similar spec but from a different
manufacturer in different machines, and to make sure that your glfs
bricks are mirrored so that they have disks of a different make under
them.
> We could have a reporting system on
> top of glusterfs that reports defective disks that would be replaced
> as part of our on-duty maintenance. Since the storage is replicated
> over 3+ boxes, the breakage of a single disk would not compromise the
> data safety as long as the disks are replaced in timely manner.
Bear in mind that your network bandwidth is unlikely to be as good as
your internal disk bandwidth, and restoring a 3TB brick by doing an "ls
-laR" (to trigger self-heal) is likely to take a very long time. So in
terms of single-disk failure recovery time, you may be better off with
RAIDZ2/RAIDZ3, or even just mirrored volumes, in each of the machines,
distributed using glfs.
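As a rough back-of-envelope figure (assuming a single gigabit link,
which is an assumption on my part about your setup): 3TB at ~110MB/s of
usable throughput is about 3,000,000MB / 110MB/s ≈ 27,000 seconds, i.e.
roughly 7.5 hours at wire speed, and a self-heal triggered with "ls
-laR" will realistically be a good deal slower than that because of
per-file overhead.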
Anyway, to summarize:
1) With large volumes of data, you need something other than the disk's
sector checksums to keep your data correct, i.e. a checksumming FS. If
you don't have one, expect to see silent data corruption sooner or
later.
2) Don't use the same make of disk in all the servers - I have seen
multiple disks from the same manufacturer fail minutes apart more than
once.
3) Use WRV features if they are available.
4) Make sure your glfs bricks are mirrored between machines in such a
way that the underlying disks are different (e.g. say you have 24 disks
in each box, divided into 3x 8-disk RAIDZ3 volumes. Use each one of
those 8-disk volumes as a brick, and mirror it to another similar
machine so that the 8 disks on the other server are from a different
manufacturer). A rough sketch of what I mean follows below.
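For illustration (server names, volume name and brick paths are made up,
and the same pattern extends to replica 3 if you want every copy on a
different make of disk): say serverA has its three RAIDZ3 pools built
from Seagates mounted at /bricks/r1..r3, and serverB has the equivalent
built from Hitachis:

  # pair up bricks across the two machines; with replica 2, consecutive
  # bricks form a mirrored pair
  gluster volume create vol0 replica 2 \
      serverA:/bricks/r1 serverB:/bricks/r1 \
      serverA:/bricks/r2 serverB:/bricks/r2 \
      serverA:/bricks/r3 serverB:/bricks/r3
  gluster volume start vol0

That way no replica pair ever has the same make of disk on both sides.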
The glfs part on top is relatively straightforward and will "just work"
provided you use a reasonably sane configuration. It is the layers
underneath that you will need to get right to keep your data healthy.
Gordan