[Gluster-devel] NSR design document

Manoj Pillai mpillai at redhat.com
Thu Oct 15 08:53:34 UTC 2015



----- Original Message -----
> October 14 2015 3:11 PM, "Manoj Pillai" <mpillai at redhat.com> wrote:
> > E.g. 3x number of bricks could be a problem if workload has
> > operations that don't scale well with brick count.
> 
> Fortunately we have DHT2 to address that.
> 
> > Plus the brick
> > configuration guidelines would not exactly be elegant.
> 
> And we have Heketi to address that.
> 
> > FWIW, if I look at the performance and perf regressions tests
> > that are run at my place of work (as these tests stand today), I'd
> > expect AFR to significantly outperform this design on reads.
> 
> Reads tend to be absorbed by caches above us, *especially* in read-only
> workloads.  See Rosenblum and Ousterhout's 1992 log-structured file
> system paper, and about a bazillion others ever since.  

Yes, their point was that read absorption means the request 
stream at the secondary storage is dominated by writes, so you 
optimize for that. Plus, the non-overwrite mode of update has 
additional benefits, like easier implementation of snapshots 
or versioning, and better recovery guarantees. I think those 
benefits still hold true today, which is why there is continued 
interest in similar solutions. But a lot of data has flowed over 
the wires since 1992, and with the explosion in data set sizes, 
read performance at the lower storage layers continues to be the 
determinant of overall performance for many use cases, I would 
stress -- particularly among those shopping for a scale-out 
storage solution to fit their large data sets and modern 
workloads. Update-in-place file systems like XFS have endured 
quite well. 

> We need to be
> concerned at least as much about write performance, and NSR's write
> performance will *far* exceed AFR's because AFR uses neither networks
> nor disks efficiently.  It splits client bandwidth between N replicas,
> and it sprays writes all over the disk (data blocks plus inode plus
> index).  Most other storage systems designed in the last ten years can
> turn that into nice sequential journal writes, which can even be on a
> separate SSD or NVMe device (something AFR can't leverage at all).
> Before work on NSR ever started, I had already compared AFR to other
> file systems using these same methods and data flows (e.g. Ceph and
> MooseFS) many times.  Consistently, I'd see that the difference was
> quite a bit more than theoretical.  Despite all of the optimization work
> we've done on it, AFR's write behavior is still a huge millstone around
> our necks.
> 
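[To put rough numbers on the bandwidth-split argument above: with 
client-side fan-out (AFR's model), a client with a fixed uplink 
writing to N replicas gets at most 1/N of that uplink as application 
throughput, whereas a leader-based scheme lets the client send each 
byte once and have the leader forward it server-side. A sketch; the 
10 Gb/s figures and replica counts are illustrative, not measurements 
from this thread:

```python
def client_fanout_throughput(client_bw, replicas):
    # Client sends every byte to each replica itself, so
    # application throughput is the uplink divided N ways.
    return client_bw / replicas

def leader_chain_throughput(client_bw, server_bw):
    # Client sends each byte once, to the leader; the leader
    # forwards to followers over separate server-side links.
    return min(client_bw, server_bw)

CLIENT_BW = 10.0  # Gb/s, illustrative
SERVER_BW = 10.0  # Gb/s, illustrative

for n in (2, 3):
    fanout = client_fanout_throughput(CLIENT_BW, n)
    chain = leader_chain_throughput(CLIENT_BW, SERVER_BW)
    print(f"{n} replicas: fan-out {fanout:.1f} Gb/s, "
          f"leader-based {chain:.1f} Gb/s")
```

At 3 replicas the fan-out client tops out near a third of its uplink, 
which is the gap being described, before even counting the seek cost 
of scattered data/inode/index writes versus sequential journaling.]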
> OK, let's bring some of these thoughts together.  If you've read
> Hennessy and Patterson, you've probably seen this formula before.
> 
>     value (of an optimization) =
>         benefit_when_applicable * probability -
>         penalty_when_inapplicable * (1 - probability)
> 
> If NSR's write performance is significantly better than AFR's, and write
> performance is either dominant or at least highly relevant for most real
> workloads, what does that mean for performance overall?  As prototyping
> showed long ago, it means a significant improvement.  Is it *possible*
> to construct a read-dominant workload that shows something different?
> Of course it is.  It's even possible that write performance will degrade
> in certain (increasingly rare) physical configurations.  No design is
> best for every configuration and workload.  Some people tried to focus
> on the outliers when NSR was first proposed.  Our competitors will be
> glad to do the same, for the same reason - to keep their own pet designs
> from looking too bad.  The important question is whether performance
> improves for *most* real-world configurations and workloads.  NSR is
> quite deliberately somewhat write-optimized, because it's where we were
> the furthest behind and because it's the harder problem to solve.
> Optimizing for read-only workloads leaves users with any other kind of
> workload in a permanent hole.
> 
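[The formula is easy to make concrete. With illustrative numbers, 
not measurements from this thread: a write-path optimization that 
helps by 40% on the 70% of operations that are writes, and costs 
10% on the remaining reads, still comes out well ahead:

```python
def optimization_value(benefit, penalty, probability):
    # Hennessy & Patterson-style expected value of an optimization:
    # the gain when it applies, weighted by how often it applies,
    # minus the cost when it does not.
    return benefit * probability - penalty * (1 - probability)

# Illustrative inputs: 40% benefit on writes, 10% penalty on reads,
# writes being 70% of the request stream at secondary storage.
v = optimization_value(benefit=0.40, penalty=0.10, probability=0.70)
print(f"expected value: {v:+.2f}")  # -> expected value: +0.25
```

The sign only flips once the workload becomes heavily read-dominant, 
which is the outlier case discussed above.]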
> Also, even for read-heavy workloads where we might see a deficit, we
> have not one but two workarounds.  One (brick splitting) we've just
> discussed, and it is quite deliberately being paired with other
> technologies in 4.0 to make it more effective.  The other (read from
> non-leaders) is also perfectly viable.  It's not the default because it
> reduces consistency to AFR levels, which I don't think serves our users
> very well.  However, if somebody's determined to make AFR comparisons,
> then it's only fair to compare at the same consistency level.  Giving
> users the ability to decide on such tradeoffs, instead of forcing one
> choice on everyone, has been part of NSR's design since day one.

And if there are improvements that can make the non-default option 
(reading from non-leaders) more palatable, I think they would be 
well worth having. 

> 
> I'm not saying your concern is invalid, but NSR's leader-based approach
> is *essential* to improving write performance - and thus performance
> overall - for most use cases.  It's also essential to improving
> functional behavior, especially with respect to split brain, and I
> consider that even more important.  Sure, reads don't benefit as much.
> They might even get worse, though that remains to be seen and is only
> likely to be true in certain scenarios.  As long as we know how to work
> around that, is there any need to dwell on it further?
>

Not for me.

-- Manoj 
