[Gluster-devel] Wrong assumptions about disperse

Xavier Hernandez xhernandez at datalab.es
Fri Jun 17 08:59:15 UTC 2016


Hi all,

I've seen in many places the belief that disperse, or erasure coding in 
general, is slow because of the complex or costly math involved. It's 
true that there's an overhead compared to a simple copy like replica 
does, but this overhead is way smaller than many people think.

The math used by disperse, if tested alone outside gluster, is much 
faster than it seems. AFAIK the real problem of EC is the communications 
layer: it adds a lot of latency, and having to communicate with and 
coordinate 6 or more bricks simultaneously has a big impact.
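
Just to give an idea of what that math looks like, here is a minimal, 
unoptimized sketch of the kind of encoding involved: a (K+M) x K matrix 
applied over GF(2^8). This is only an illustration using the common 
0x11d polynomial; the actual ec code uses its own field representation 
and a faster algorithm:

#include <stdint.h>
#include <stddef.h>

/* Multiply two GF(2^8) elements, reduction polynomial 0x11d
 * (illustrative choice, not necessarily the one used by ec). */
static uint8_t gf_mul(uint8_t a, uint8_t b)
{
    uint8_t r = 0;

    while (b) {
        if (b & 1)
            r ^= a;
        b >>= 1;
        a = (uint8_t)((a << 1) ^ ((a & 0x80) ? 0x1d : 0x00));
    }
    return r;
}

/* Non-systematic encode of one stripe: each of the K+M output fragments
 * is the GF(2^8) dot product of one matrix row with the K data
 * fragments, computed byte by byte (addition in GF(2^8) is XOR). */
static void encode(const uint8_t *matrix, int k, int m,
                   const uint8_t *const *data, uint8_t **out, size_t len)
{
    for (int i = 0; i < k + m; i++) {
        for (size_t b = 0; b < len; b++) {
            uint8_t acc = 0;

            for (int j = 0; j < k; j++)
                acc ^= gf_mul(matrix[i * k + j], data[j][b]);
            out[i][b] = acc;
        }
    }
}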

Erasure coding also suffers from partial writes, which require a 
read-modify-write cycle. However, this is completely avoided in many 
situations where the volume is optimally configured and writes are 
aligned and in multiples of 4096 bytes (typical for VMs, databases and 
many other workloads). It could even be avoided in other situations by 
taking advantage of the write-behind xlator (not done yet).
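
As an illustration only (the fragment size here is an assumption for 
the example, not taken from the ec code), deciding whether a write 
needs the read-modify-write cycle boils down to a stripe alignment 
check like this:

#include <stdbool.h>
#include <stdint.h>

/* A write can be applied without first reading the existing stripe
 * contents only if it starts and ends on a stripe boundary. E.g. for
 * a 4+2 volume with a hypothetical 512-byte fragment the stripe is
 * 4 * 512 = 2048 bytes, so aligned 4096-byte writes never need the
 * read-modify-write cycle. */
static bool needs_read_modify_write(uint64_t offset, uint64_t size,
                                    uint32_t data_bricks,
                                    uint32_t fragment_size)
{
    uint64_t stripe = (uint64_t)data_bricks * fragment_size;

    return (offset % stripe != 0) || (size % stripe != 0);
}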

I've used a single core on each of two machines to test the raw math: 
one quite limited (Atom D525, 1.8 GHz) and another more powerful but 
not a top CPU (Xeon E5-2630L, 2.0 GHz).

Common parameters:

* non-systematic Vandermonde matrix (the same used by ec)
* algorithm slightly slower than the one used by ec (I haven't 
implemented some optimizations in the test program, but I think the 
difference should be very small)
* buffer size: 128 KiB
* number of iterations: 16384
* total size processed: 2 GiB
* results in MiB/s for a single core

Config   Atom   Xeon
   2+1     633   1856
   4+1     405   1203
   4+2     324    984
   4+3     275    807
   8+2     227    611
   8+3     202    545
   8+4     182    501
  16+3     116    303
  16+4     111    295

The same tests using Intel SSE2 extensions (not present in EC yet, but 
the patch is in review):

Config   Atom   Xeon
   2+1     821   3047
   4+1     767   2246
   4+2     629   1887
   4+3     535   1632
   8+2     466   1237
   8+3     423   1104
   8+4     388   1044
  16+3     289    675
  16+4     271    637

With AVX2 it should be faster, but my machines don't support it.
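
Most of the SIMD gain simply comes from touching many bytes per 
instruction. As a rough illustration (not the actual patch under 
review, which organizes the data differently), the XOR accumulation 
step, which is the addition in GF(2^8), can process 16 bytes at a time 
with SSE2:

#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>
#include <stddef.h>

/* dst ^= src, 16 bytes per SSE2 instruction, with a scalar tail for
 * the remaining bytes. Rough illustration of the inner accumulate
 * step of the encoding loop. */
static void xor_block_sse2(uint8_t *dst, const uint8_t *src, size_t len)
{
    size_t i = 0;

    for (; i + 16 <= len; i += 16) {
        __m128i a = _mm_loadu_si128((const __m128i *)(dst + i));
        __m128i b = _mm_loadu_si128((const __m128i *)(src + i));

        _mm_storeu_si128((__m128i *)(dst + i), _mm_xor_si128(a, b));
    }
    for (; i < len; i++)
        dst[i] ^= src[i];
}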

Encoding is much faster still when a systematic matrix is used, since 
the data fragments are plain copies of the input and only the 
redundancy fragments need to be computed. For example, a 16+4 
configuration using SSE on a Xeon core can encode at 3865 MiB/s. 
However, this won't make a big difference inside gluster.

Currently EC encoding/decoding for small/medium configurations is not 
the bottleneck of disperse. Maybe for big configurations on slow 
machines it could have some impact (I don't have the resources to test 
those big configurations properly).
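
As a rough back-of-the-envelope comparison (assuming, say, a 10 GbE 
network able to move on the order of 1200 MiB/s): a single Xeon core 
already encodes 4+2 at 984 MiB/s without SIMD and 1887 MiB/s with SSE, 
so for these configurations the network and coordination latency, not 
the math, limit the throughput.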

Regards,

Xavi

