[Gluster-users] Inviting comments on my plans

Sun Nov 18 16:27:41 UTC 2012

On 11/18/2012 5:19 AM, Brian Candler wrote:
> On Sat, Nov 17, 2012 at 11:04:33AM -0700, Shawn Heisey wrote:
>> Dell R720xd servers with two internal OS drives and 12 hot-swap
>> external 3.5 inch bays.  Fedora 18 alpha, to be upgraded to Fedora
>> 18 when it is released.
> I would strongly recommend *against* Fedora in any production environment,
> simply because there are new releases every 6 months, and releases are only
> supported for 18 months from release.  You are therefore locked into a
> complete OS reinstall every 6 months (or at best, three upgrades every 18
> months).
>
> If you want something that's free and RPM-based for production, I suggest
> you use CentOS or Scientific Linux.
>
>> 2TB simple LVM volumes for bricks.
>> A combination of 4TB disks (two bricks per drive) and 2TB disks.
> With no RAID, 100% reliant on gluster replication? You discussed this later
> but I would still advise against this.  If you go this route, you will need
> to be very sure about your procedures for (a) detecting failed drives, and
> (b) replacing failed drives.  It's certainly not a simple pull-out/push-in
> (or rebuild-on-hot-spare) as it would be with RAID.  You'll have to
> introduce a new drive, create the filesystem (or two filesystems on a 4TB
> drive), and reintroduce those filesystems as bricks into gluster: but not
> using replace-brick because the failed brick will have gone.  So you need to
> be confident in the abilities of your operational staff to do this.
>
> If you do it this way, please test and document it for the rest of us.
>
>> Now for the really controversial part of my plans: Left-hand brick
>> filesystems (listed first in each replica set) will be XFS,
>> right-hand bricks will be BTRFS.  The idea here is that we will have
>> one copy of the volume on a fully battle-tested and reliable
>> filesystem, and another copy of the filesystem stored in a way that
>> we can create periodic snapshots for last-ditch "oops" recovery.
>> Because of the distributed nature of the filesystem, using those
>> snapshots will not be straightforward, but it will be POSSIBLE.
> Of course it depends on your HA requirements, but another approach would be
> to have non-replicated volume (XFS) and then geo-replicate to another server
> with BTRFS, and do your snapshotting there. Then your "live" data is not
> dependent on BTRFS issues.
>
> This also has the bonus that your BTRFS server could be network-remote.
>
>> * Performance.
>> RAID 5/6 comes with a severe penalty on performance during sustained
>> writes -- writing more data than will fit in your RAID controller's
>> cache memory.  Also, if you have a failed disk, all performance is
>> greatly impacted during the entire rebuild process, which for a 4TB
>> disk is likely to take a few days.
> Actually, sustained sequential writes are the best case for RAID5/6. It's
> random writes which will kill you.
>
> If random write performance is important I'd use RAID10 - which means for a
> fully populated server you'll get 24TB instead of 48TB.  Linux mdraid "far
> 2" layout will give you the same read performance as RAID0, indeed somewhat
> faster because all the seeks are within the first half of the drive, but
> with data replication.
>
> With georeplication, your BTRFS backup server could be RAID5 or RAID6
> though.
>
> So it's down to the relative importance of various things:
> - sufficient capacity
> - sufficient performance
> - acceptable cost
> - ease of management (when a drive fails)
> - data availability (if an entire server fails)
>
> For me, "ease of management (when a drive fails)" comes very high on the
> list, because drive failures *will* happen, and you need to deal with them
> as a matter-of-course. You might not feel the same way.
>
> I wrote "sufficient capacity/performance" rather than "maximum
> capacity/performance" because it depends what your business requirements
> are.  I mean, having no RAID might give you maximum performance on those 4TB
> drives, but is even that good enough for your needs?  If not, you might want
> to revisit and go with SSDs.  On the other hand, RAID6 might not be the
> *best* write performance, but it might actually be good enough depending on
> what you're doing.

The regular performance of RAID6 isn't a MAJOR problem. It's annoying, 
but would be workable -- our current SAN volumes are RAID6.  The problem 
is performance when you've got a failed drive.  Until you replace the 
drive, performance is impacted somewhat by using parity calculations 
across all drives to reconstruct the missing data.  Once you replace the 
drive, performance is severely impacted until the rebuild is complete.
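
To put a rough number on the rebuild window, here is a back-of-the-envelope sketch. The rebuild rates are assumptions picked for illustration, not measurements from our SAN or from any particular controller; the real rate depends on the controller and on how much production I/O is competing with the rebuild.

```python
def rebuild_hours(disk_tb: float, rate_mb_per_s: float) -> float:
    """Hours needed to rewrite one whole disk at a sustained rebuild rate."""
    disk_bytes = disk_tb * 1e12                    # decimal TB, as drive vendors count
    seconds = disk_bytes / (rate_mb_per_s * 1e6)   # decimal MB/s
    return seconds / 3600


if __name__ == "__main__":
    # Assumed rates (hypothetical): a mostly idle array down to a busy one.
    for rate in (100, 50, 15):
        hours = rebuild_hours(4, rate)
        print(f"{rate:>4} MB/s: {hours:6.1f} h ({hours / 24:.1f} days)")
```

Under these assumptions a 4TB disk only reaches the multi-day range when the rebuild is throttled to something like 15 MB/s, which is plausible on an array that is also serving production load.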

Cost is the primary limiting factor.  SSD?  No way.  A second 
geo-replicated volume with BTRFS for snapshots?  It would have to be 
just as big as the production volume -- which is going to be a minimum 
of 150TB very quickly, with no upper end in sight.  There's no way I 
would get funding for that.  Losing 8TB of capacity per server pair to 
RAID is a similar funding problem.
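
The arithmetic behind those numbers can be sketched as follows. It assumes every one of the 12 bays holds a 4TB disk (the actual plan mixes 2TB and 4TB disks), one array per server, and replica 2, so a server pair provides one server's worth of usable space. The function names are made up for this illustration.

```python
import math

BAYS = 12        # hot-swap bays per R720xd
DISK_TB = 4      # assumption: every bay holds a 4TB disk
TARGET_TB = 150  # minimum size of the production volume


def usable_per_pair_tb(layout: str) -> int:
    """Usable TB from one replica-2 server pair (one server's worth of data)."""
    if layout == "none":      # one or two bricks per disk, no RAID
        return BAYS * DISK_TB
    if layout == "raid6":     # two disks' worth of parity per array
        return (BAYS - 2) * DISK_TB
    if layout == "raid10":    # every block mirrored inside the server
        return BAYS * DISK_TB // 2
    raise ValueError(f"unknown layout: {layout}")


def pairs_needed(layout: str, target_tb: int = TARGET_TB) -> int:
    """Server pairs required to reach the target usable capacity."""
    return math.ceil(target_tb / usable_per_pair_tb(layout))


if __name__ == "__main__":
    for layout in ("none", "raid6", "raid10"):
        print(layout, usable_per_pair_tb(layout), "TB/pair,",
              pairs_needed(layout), "pairs for", TARGET_TB, "TB")
```

With these assumptions RAID6 costs 8TB per pair (40TB usable instead of 48TB), and RAID10 halves the usable space to 24TB per pair, which is where the extra servers and the funding problem come from.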

Having half the drives on BTRFS was my primary reason for Fedora.  
Although BTRFS is probably present in CentOS 6, the kernel is so old 
that I wouldn't trust BTRFS on it.  Because of the cost, we won't have 
traditional backups; we'll be relying on replication.  Replication 
does not protect against accidental deletion, though.  BTRFS snapshots 
would.

I looked into a lot of solutions before we settled on Gluster.  The 
front-runner from a technical perspective was Ceph, because of its 
snapshot support.  We never even tested it, though -- without NFS, we 
cannot support Solaris clients, and for the next year or two, we can't 
eliminate Solaris.

Upgrading every 6 months sounds like a royal pain, so perhaps I won't do 
Fedora/BTRFS.  Do you happen to know if it will be possible to upgrade 
from CentOS 6 to CentOS 7?  The lack of an upgrade path from 5 to 6 has 
been a major headache.

I'm aware of the additional administrative overhead that drive failures 
will require, and I definitely know that we WILL have failures.

Thanks,
Shawn