[Gluster-users] Inviting comments on my plans

Brian Candler B.Candler at pobox.com
Sun Nov 18 12:19:15 UTC 2012


On Sat, Nov 17, 2012 at 11:04:33AM -0700, Shawn Heisey wrote:
> Dell R720xd servers with two internal OS drives and 12 hot-swap
> external 3.5 inch bays.  Fedora 18 alpha, to be upgraded to Fedora
> 18 when it is released.

I would strongly recommend *against* Fedora in any production environment,
simply because there are new releases every 6 months, and releases are only
supported for 18 months from release.  You are therefore locked into a
complete OS reinstall or upgrade every 6 months if you track each release,
or at best one upgrade every 18 months if you run each release to the end
of its support window.

If you want something that's free and RPM-based for production, I suggest
you use CentOS or Scientific Linux.

> 2TB simple LVM volumes for bricks.
> A combination of 4TB disks (two bricks per drive) and 2TB disks.

With no RAID, 100% reliant on gluster replication? You discussed this later
but I would still advise against this.  If you go this route, you will need
to be very sure about your procedures for (a) detecting failed drives, and
(b) replacing failed drives.  It's certainly not a simple pull-out/push-in
(or rebuild-on-hot-spare) as it would be with RAID.  You'll have to
introduce a new drive, create the filesystem (or two filesystems on a 4TB
drive), and reintroduce those filesystems as bricks into gluster: but not
using replace-brick because the failed brick will have gone.  So you need to
be confident in the abilities of your operational staff to do this.

If you do it this way, please test and document it for the rest of us.
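
For what it's worth, here is a rough and untested sketch of what that
procedure might look like for a single failed 4TB drive.  The device name,
volume group, mount points and volume name are all invented, and the final
gluster step in particular varies between releases, so treat this as an
outline only:

    # new drive appears as /dev/sdX (hypothetical device name)
    pvcreate /dev/sdX
    vgcreate vg_sdX /dev/sdX
    lvcreate -L 2T -n brick05 vg_sdX
    lvcreate -L 2T -n brick06 vg_sdX
    mkfs.xfs -i size=512 /dev/vg_sdX/brick05
    mkfs.xfs -i size=512 /dev/vg_sdX/brick06
    mkdir -p /bricks/brick05 /bricks/brick06
    mount /dev/vg_sdX/brick05 /bricks/brick05
    mount /dev/vg_sdX/brick06 /bricks/brick06
    # then put the new bricks back into the volume and let self-heal
    # repopulate them from the surviving replicas, e.g. (version-dependent):
    gluster volume replace-brick myvol server1:/bricks/dead05 \
        server1:/bricks/brick05 commit force
    gluster volume heal myvol full

Whether "replace-brick ... commit force" actually works when the old brick
has gone, or whether you have to re-use the original brick path and fiddle
with extended attributes, is exactly the sort of thing that needs testing
before you rely on it.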

> Now for the really controversial part of my plans: Left-hand brick
> filesystems (listed first in each replica set) will be XFS,
> right-hand bricks will be BTRFS.  The idea here is that we will have
> one copy of the volume on a fully battle-tested and reliable
> filesystem, and another copy of the filesystem stored in a way that
> we can create periodic snapshots for last-ditch "oops" recovery.
> Because of the distributed nature of the filesystem, using those
> snapshots will not be straightforward, but it will be POSSIBLE.

Of course it depends on your HA requirements, but another approach would be
to have a non-replicated volume (XFS) and then geo-replicate to another
server with BTRFS, and do your snapshotting there.  Then your "live" data is
not dependent on BTRFS issues.

This also has the bonus that your BTRFS server could be network-remote.
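
To make that concrete, here is a rough sketch of how the pieces might fit
together.  The volume name, hostname and paths are invented, and the
geo-replication syntax should be checked against the release you actually
deploy:

    # on the backup box: /data is a BTRFS filesystem; make the target a
    # subvolume so it can be snapshotted later
    btrfs subvolume create /data/myvol-copy
    mkdir -p /data/snapshots

    # on the master: mirror the live XFS-backed volume to the backup box
    gluster volume geo-replication myvol backuphost:/data/myvol-copy start
    gluster volume geo-replication myvol backuphost:/data/myvol-copy status

    # on the backup box: periodic read-only snapshots (e.g. from cron)
    # for last-ditch "oops" recovery
    btrfs subvolume snapshot -r /data/myvol-copy \
        /data/snapshots/myvol-copy-$(date +%Y%m%d)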

> * Performance.
> RAID 5/6 comes with a severe penalty on performance during sustained
> writes -- writing more data than will fit in your RAID controller's
> cache memory.  Also, if you have a failed disk, all performance is
> greatly impacted during the entire rebuild process, which for a 4TB
> disk is likely to take a few days.

Actually, sustained sequential writes are the best case for RAID5/6: they
can be gathered into full-stripe writes, so the parity can be computed from
data already in hand.  It's small random writes which will kill you, because
each one turns into a read-modify-write cycle to update the parity block.

If random write performance is important I'd use RAID10 - which means for a
fully populated server you'll get 24TB instead of 48TB.  Linux mdraid "far
2" layout will give you the same read performance as RAID0, indeed somewhat
faster because all the seeks are within the first half of the drive, but
with data replication.
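
A minimal sketch of setting that up, assuming twelve data drives at
/dev/sdb through /dev/sdm (hypothetical device names; chunk size and the
rest are worth tuning against your workload):

    # 12-drive RAID10 with the mdraid "far 2" layout, XFS on top
    mdadm --create /dev/md0 --level=10 --layout=f2 \
        --raid-devices=12 /dev/sd[b-m]
    mkfs.xfs /dev/md0
    mkdir -p /bricks/md0
    mount /dev/md0 /bricks/md0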

With georeplication, your BTRFS backup server could be RAID5 or RAID6
though.

So it's down to the relative importance of various things:
- sufficient capacity
- sufficient performance
- acceptable cost
- ease of management (when a drive fails)
- data availability (if an entire server fails)

For me, "ease of management (when a drive fails)" comes very high on the
list, because drive failures *will* happen, and you need to deal with them
as a matter of course.  You might not feel the same way.

I wrote "sufficient capacity/performance" rather than "maximum
capacity/performance" because it depends what your business requirements
are.  I mean, having no RAID might give you maximum performance on those 4TB
drives, but is even that good enough for your needs?  If not, you might want
to revisit and go with SSDs.  On the other hand, RAID6 might not be the
*best* write performance, but it might actually be good enough depending on
what you're doing.

Regards,

Brian.


