[Gluster-devel] NSR design document

Jeff Darcy jdarcy at redhat.com
Wed Oct 14 21:26:49 UTC 2015


October 14 2015 3:11 PM, "Manoj Pillai" <mpillai at redhat.com> wrote:
> E.g. 3x number of bricks could be a problem if workload has
> operations that don't scale well with brick count.

Fortunately we have DHT2 to address that.

> Plus the brick
> configuration guidelines would not exactly be elegant.

And we have Heketi to address that.

> FWIW, if I look at the performance and perf regressions tests
> that are run at my place of work (as these tests stand today), I'd
> expect AFR to significantly outperform this design on reads.

Reads tend to be absorbed by caches above us, *especially* in read-only
workloads.  See Rosenblum and Ousterhout's 1992 log-structured file
system paper, and about a bazillion others ever since.  We need to be
concerned at least as much about write performance, and NSR's write
performance will *far* exceed AFR's because AFR uses neither networks
nor disks efficiently.  It splits client bandwidth between N replicas,
and it sprays writes all over the disk (data blocks plus inode plus
index).  Most other storage systems designed in the last ten years can
turn that into nice sequential journal writes, which can even be on a
separate SSD or NVMe device (something AFR can't leverage at all).
Before work on NSR ever started, I had already compared AFR many times to
other file systems (e.g. Ceph and MooseFS) that use these same methods and
data flows.  Consistently, I'd see that the difference was quite a bit more
than theoretical.  Despite all of the optimization work
we've done on it, AFR's write behavior is still a huge millstone around
our necks.
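
To put rough numbers on the network half of that (these are made-up
figures, purely to illustrate the shape of the problem, not measurements
of either translator):

    # Illustrative only: effective client write throughput, made-up numbers.
    client_bw = 10.0   # Gbit/s available on the client's link (assumed)
    replicas = 3

    # Client-side replication (AFR style): the client sends every byte to
    # all N replicas itself, so its usable write throughput is divided by N.
    afr_throughput = client_bw / replicas          # ~3.3 Gbit/s

    # Leader-based replication (NSR style): the client sends one copy to
    # the leader, which fans it out to the other replicas over server-side
    # links.
    nsr_throughput = client_bw                     # 10.0 Gbit/s

    print(afr_throughput, nsr_throughput)

The disk half is similar: one mostly-sequential journal stream versus
scattered writes to data blocks, inodes, and index entries.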

OK, let's bring some of these thoughts together.  If you've read
Hennessy and Patterson, you've probably seen this formula before.

    value (of an optimization) =
        benefit_when_applicable * probability -
        penalty_when_inapplicable * (1 - probability)
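
Plugging in some purely hypothetical numbers, just to show how that works
out (none of these are measured values):

    # Hypothetical inputs, only to illustrate how the formula behaves.
    benefit_when_applicable = 0.40     # e.g. 40% faster on write-heavy work
    penalty_when_inapplicable = 0.10   # e.g. 10% slower on read-only work
    probability = 0.80                 # fraction of workloads where it helps

    value = (benefit_when_applicable * probability
             - penalty_when_inapplicable * (1 - probability))
    print(value)                       # 0.30 -> still a clear net win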

If NSR's write performance is significantly better than AFR's, and write
performance is either dominant or at least highly relevant for most real
workloads, what does that mean for performance overall?  As prototyping
showed long ago, it means a significant improvement.  Is it *possible*
to construct a read-dominant workload that shows something different?
Of course it is.  It's even possible that write performance will degrade
in certain (increasingly rare) physical configurations.  No design is
best for every configuration and workload.  Some people tried to focus
on the outliers when NSR was first proposed.  Our competitors will be
glad to do the same, for the same reason - to keep their own pet designs
from looking too bad.  The important question is whether performance
improves for *most* real-world configurations and workloads.  NSR is
quite deliberately somewhat write-optimized, because that's where we were
the furthest behind and because it's the harder problem to solve.
Optimizing for read-only workloads leaves users with any other kind of
workload in a permanent hole.

Also, even for read-heavy workloads where we might see a deficit, we
have not one but two workarounds.  One (brick splitting) we've just
discussed, and it is quite deliberately being paired with other
technologies in 4.0 to make it more effective.  The other (read from
non-leaders) is also perfectly viable.  It's not the default because it
reduces consistency to AFR levels, which I don't think serves our users
very well.  However, if somebody's determined to make AFR comparisons,
then it's only fair to compare at the same consistency level.  Giving
users the ability to decide on such tradeoffs, instead of forcing one
choice on everyone, has been part of NSR's design since day one.
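
For what it's worth, the read side of that tradeoff is just as easy to
sketch (made-up numbers again, and this says nothing about how the option
is actually exposed):

    # Illustrative only: aggregate read bandwidth for one replica set.
    brick_bw = 5.0    # Gbit/s each brick can serve reads at (assumed)
    replicas = 3

    # Leader-only reads: one brick serves every read, full consistency.
    leader_only_reads = brick_bw                   # 5.0 Gbit/s

    # Reads from any replica: N bricks can serve reads, but a non-leader
    # might return data it hasn't caught up on yet (AFR-level consistency).
    any_replica_reads = brick_bw * replicas        # 15.0 Gbit/s

    print(leader_only_reads, any_replica_reads)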

I'm not saying your concern is invalid, but NSR's leader-based approach
is *essential* to improving write performance - and thus performance
overall - for most use cases.  It's also essential to improving
functional behavior, especially with respect to split brain, and I
consider that even more important.  Sure, reads don't benefit as much.
They might even get worse, though that remains to be seen and is only
likely to be true in certain scenarios.  As long as we know how to work
around that, is there any need to dwell on it further?

