[Gluster-devel] Wrong assumptions about disperse

Mon Jun 20 07:08:57 UTC 2016

Hi Shyam,

On 17/06/16 15:59, Shyam wrote:
> On 06/17/2016 04:59 AM, Xavier Hernandez wrote:
>
> Firstly, thanks for the overall post, was informative and helps clarify
> some aspects of EC.
>
>> AFAIK the real problem of EC is the communications
>> layer. It adds a lot of latency and having to communicate simultaneously
>> and coordinate 6 or more bricks has a big impact.
>
> Can you elaborate this more? Is this coordination cost lesser if EC is
> coordinated from the server side graphs? (like leader/follower models in
> JBR)? I have heard some remarks about a transaction manager in Gluster,
> that you proposed, how does that help/fit in to resolve this issue?

I think one of the big problems is in the communications layer. I ran
some tests some time ago and got unexpected results. On a pure
distributed volume with a single brick, mounted through FUSE on the same
server that contains the brick (so no physical network communication
takes place), I did the following tests:

* Modify protocol/server to immediately return a buffer of 0's for all
readv requests (this effectively bypasses all server-side xlators for
readv requests).

Observed read speed for a dd with bs=128 KB: 349 MB/s
Observed read speed for a dd with bs=32 MB (multiple 128KB readv 
requests in parallel): 744 MB/s

* Modify protocol/client to immediately return a buffer of 0's for all 
readv requests (this avoids all RPC/networking code for readv requests).

Observed read speed for bs=128 KB: 428 MB/s
Observed read speed for bs=32 MB: 1530 MB/s

* An iperf run reported a speed of 4.7 GB/s

The network layer seems to add a high overhead, especially when many
requests are sent in parallel. This is very bad for disperse.
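
To put the dd numbers above in perspective, here is a rough
back-of-the-envelope sketch that converts them into time per request.
It assumes that "MB/s" means 10^6 bytes per second and that each dd read
with bs=128 KB maps to exactly one 128 KiB readv issued serially; neither
assumption is stated in the measurements, so treat the results as
estimates only.

```python
# Assumptions (not from the measurements themselves): "MB/s" is 10**6
# bytes per second, and a dd read with bs=128 KB is a single 128 KiB
# readv request with no other request in flight.
REQUEST_BYTES = 128 * 1024


def usec_per_request(mb_per_s: float) -> float:
    """Average microseconds spent per 128 KiB request at this throughput."""
    return REQUEST_BYTES / (mb_per_s * 1e6) * 1e6


server_stub = usec_per_request(349)  # protocol/server returns zeros
client_stub = usec_per_request(428)  # protocol/client returns zeros

# The difference is the share attributable to the RPC/network path.
rpc_cost = server_stub - client_stub

# How much throughput improves when many readv requests run in parallel.
server_parallel_gain = 744 / 349
client_parallel_gain = 1530 / 428

print(f"server stub: {server_stub:.0f} us per request")
print(f"client stub: {client_stub:.0f} us per request")
print(f"approx. RPC/network cost: {rpc_cost:.0f} us per request")
print(f"parallel gain with RPC:    {server_parallel_gain:.2f}x")
print(f"parallel gain without RPC: {client_parallel_gain:.2f}x")
```

Under these assumptions the RPC path costs roughly 70 us per serial
request, and parallel requests scale noticeably worse when they have to
go through it.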

I think the coordination effort will be similar on the server side with
the current implementation. If we use the leader approach, coordination
will be much easier and faster in theory. However, all communications
will be directed to a single server. That could make the communications
problem worse (I haven't tested any of this, though).

The transaction approach was designed with the idea of moving fop
sorting to the server side, without having to explicitly take locks on
the client. This would reduce the number of network round-trips and
should reduce the latency, improving overall performance.
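
The round-trip argument can be illustrated with a deliberately
simplified model. Everything here is an assumption made for
illustration: one round trip to take the lock, one per serialized write,
one to release the lock, and a hypothetical 70 us round-trip time. The
real lock handling in disperse is more elaborate, and bandwidth limits
are ignored.

```python
# Toy model only. The round-trip counts and RTT_US are hypothetical and
# do not describe the actual disperse implementation.
RTT_US = 70  # hypothetical round-trip time in microseconds


def client_side_round_trips(num_writes: int) -> int:
    """Lock, then writes sent one after another, then unlock."""
    if num_writes == 0:
        return 0
    return 1 + num_writes + 1


def server_side_round_trips(num_writes: int) -> int:
    """All writes sent in parallel; the server orders them itself."""
    return 1 if num_writes > 0 else 0


for n in (1, 8, 32):
    client_us = client_side_round_trips(n) * RTT_US
    server_us = server_side_round_trips(n) * RTT_US
    print(f"{n:2d} writes: client-side {client_us} us, "
          f"server-side {server_us} us")
```

In this model the client-side cost grows linearly with the number of
queued writes, while the server-side cost stays near one round trip
until bandwidth becomes the limit.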

This should have a perceptible impact on write requests, which are
currently serialized on the client side. If we move the coordination to
the server side, the client can send multiple write requests in
parallel, making better use of the network bandwidth. This also gives
the brick the opportunity to combine multiple write requests into a
single disk write. This is especially important for EC, which splits big
blocks into smaller ones for each brick.
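
A minimal sketch of what combining writes on the brick could look like.
The coalesce() helper is hypothetical, not an existing Gluster function,
and it only handles non-overlapping writes. The example assumes a 4+2
configuration, where each 128 KiB client write becomes a 32 KiB fragment
per brick.

```python
from typing import List, Tuple

Write = Tuple[int, bytes]  # (offset, data)


def coalesce(writes: List[Write]) -> List[Write]:
    """Merge writes whose byte ranges are exactly contiguous.

    Hypothetical helper: assumes the pending writes do not overlap, so
    sorting them by offset does not change the final file contents.
    """
    merged: List[Tuple[int, bytearray]] = []
    for offset, data in sorted(writes, key=lambda w: w[0]):
        if merged and merged[-1][0] + len(merged[-1][1]) == offset:
            # Starts exactly where the previous write ends: append to it.
            merged[-1][1].extend(data)
        else:
            merged.append((offset, bytearray(data)))
    return [(off, bytes(buf)) for off, buf in merged]


# Assumed 4+2 volume: a 128 KiB client write is a 32 KiB fragment on
# each brick. Four sequential client writes reach one brick as four
# contiguous 32 KiB fragments.
FRAGMENT = (128 * 1024) // 4
pending = [(i * FRAGMENT, bytes(FRAGMENT)) for i in range(4)]

combined = coalesce(pending)
print(len(pending), "requests ->", len(combined), "disk write of",
      len(combined[0][1]), "bytes")
```

With parallel requests queued on the brick, the four small fragments
become one 128 KiB disk write. With client-side serialization the brick
only ever sees one fragment at a time, so it has nothing to merge.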

>
> I am curious here w.r.t DHT2, where we are breaking this down into DHT2
> client and server pieces, and on the MDC (metadata cluster), the leader
> brick of DHT2 subvolume is responsible for some actions, like in-memory
> inode locking (say), which would otherwise be a cross subvolume lock
> (costlier).

Unfortunately I haven't had time to read the details of the DHT2
implementation, so I cannot say much here.

>
> We also need transactions when we are going to update 2 different
> objects with contents (simplest example is creating the inode for the
> file and linking its name into the parent directory), IOW when we have a
> composite operation.
>
> The above xaction needs recording, which is a lesser evil when dealing
> with a local brick, but will suffer performance penalties when dealing
> with replication or EC. I am looking at ways where this xaction
> recording can be compounded with the first real operation that needs to
> be performed on the subvolume, but that may not always work.
>
> So what are your thoughts in regard to improving the client side
> coordination problem that you are facing?

My point of view is that *any* coordination will work much better on the
server side. Additionally, one of the features of the transaction
framework was that multiple xlators could share a single transaction on
the same inode, reducing the number of operations needed in the general
case (currently, if two xlators need an exclusive lock, each of them
needs to issue an independent inodelk/entrylk fop). I know this is
evolving towards the leader/follower pattern, and towards having data
and metadata separated in gluster. I'm not a big fan of this approach,
though.

Independently of all these changes, improving network performance will 
benefit *all* approaches.

Regards,

Xavi

>
> Thanks,
> Shyam