[Gluster-devel] RDMA: Patch to make use of pre registered memory

Mon Feb 9 06:16:26 UTC 2015

Avati, I'm all for your zero-copy RDMA API proposal, but I have a concern about your proposed zero-copy fop below...

----- Original Message -----
> From: "Anand Avati" <avati at gluster.org>
> To: "Mohammed Rafi K C" <rkavunga at redhat.com>, "Gluster Devel" <gluster-devel at gluster.org>
> Cc: "Raghavendra Gowdappa" <rgowdapp at redhat.com>, "Ben Turner" <bturner at redhat.com>, "Ben England"
> <bengland at redhat.com>, "Suman Debnath" <sdebnath at redhat.com>
> Sent: Saturday, January 24, 2015 1:15:52 AM
> Subject: Re: RDMA: Patch to make use of pre registered memory
> 
> Couple of comments -
> 
> ...
> 4. Next step for zero-copy would be introduction of a new fop readto()
> where the destination pointer is passed from the caller (gfapi being the
> primary use case). In this situation RDMA ought to register that memory if
> necessary and request server to RDMA_WRITE into the pointer provided by
> gfapi caller.

The readto() API emulates the Linux/Unix read() system call, where the caller passes in the address of the read buffer.  This API was created half a century ago in a non-distributed world.  IMHO the caller should not specify where the read data should arrive; instead, the read API should tell the caller where the data arrived.  There should be a pre-registered pool of buffers that both the sender and receiver *already* know about and that can be used for RDMA reads, and one of these will be passed to the caller as part of the read "event" or completion.  This seems related to the performance results that Rafi KC posted earlier this month.

Why does it matter?  With RDMA, the read transfer cannot begin until the OTHER END of the RDMA connection knows where the data will land, and it cannot know this soon enough if we wait until the read API call to specify the target address.  An API where the caller specifies the buffer address *blocks* the sender, introduces latency (transmitting the RDMA-able address to the sender) and prevents pipelined, overlapping activity by sender and receiver.

So a read FOP for RDMA should be more like read_completion_event(buffer ** read_data_delivered).   It is possible to change libgfapi to support this, since it does not have to conform rigidly to POSIX.  Could this work in the Gluster translator API?  In the RPC interface?

So then how would the remote sender find out when it is OK to re-use this buffer to service another RDMA read request?  Is there an interface, something like read_buffer_consumed(buffer * available_buf), on the read API side that tells RDMA that the caller has consumed the buffer and that it is ready for re-use, without the added expense of unregistering and re-registering?
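
To make the two calls above concrete, here is a rough Python sketch of how such a pool might behave.  It is only an illustration under my own assumptions: the class PreRegisteredPool and the method sender_deliver() are hypothetical names, nothing here touches real RDMA or Gluster code, and "registration" is simulated by allocating the buffers once up front.

```python
from collections import deque


class PreRegisteredPool:
    """Toy stand-in for buffers registered once, at connection setup."""

    def __init__(self, count, size):
        # A real transport would register each buffer with the RDMA
        # device here, exactly once. This sketch only allocates memory.
        self.size = size
        self.buffers = [bytearray(size) for _ in range(count)]
        self.available = deque(range(count))  # indices the sender may target
        self.completed = deque()              # (index, length) awaiting the reader

    def sender_deliver(self, payload):
        """Sender side (hypothetical): write into any available buffer.

        Returns False when no buffer is free, so the sender has to wait.
        """
        if len(payload) > self.size:
            raise ValueError("payload larger than a pool buffer")
        if not self.available:
            return False
        idx = self.available.popleft()
        self.buffers[idx][:len(payload)] = payload
        self.completed.append((idx, len(payload)))
        return True

    def read_completion_event(self):
        """Reader side: learn where the data arrived instead of choosing."""
        if not self.completed:
            return None
        idx, length = self.completed.popleft()
        return idx, memoryview(self.buffers[idx])[:length]

    def read_buffer_consumed(self, idx):
        """Reader side: hand the buffer back for re-use, no re-registration."""
        self.available.append(idx)
```

The point of the sketch is that the sender never waits for the reader to name an address: it can fill any buffer in the available set, and it only stalls when the reader has not yet returned any.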

If so, then you have a pipeline of buffers, each in one of 4 states:

1. in transmission by sender to reader
2. being consumed by reader
3. being returned to sender for re-use
4. available to sender

After state 4 a buffer goes back to state 1.
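
The buffer life cycle listed above could be modelled as a small cyclic state machine.  This is just a sketch to pin down the idea; the names BufState, NEXT_STATE, advance() and count_by_state() are mine and do not correspond to anything in the Gluster source.

```python
from enum import Enum


class BufState(Enum):
    IN_TRANSMISSION = 1  # in transmission by sender to reader
    BEING_CONSUMED = 2   # being consumed by reader
    BEING_RETURNED = 3   # being returned to sender for re-use
    AVAILABLE = 4        # available to sender


# Each state has exactly one legal successor; the last wraps to the first.
NEXT_STATE = {
    BufState.IN_TRANSMISSION: BufState.BEING_CONSUMED,
    BufState.BEING_CONSUMED: BufState.BEING_RETURNED,
    BufState.BEING_RETURNED: BufState.AVAILABLE,
    BufState.AVAILABLE: BufState.IN_TRANSMISSION,
}


def advance(state):
    """Move one buffer to the next state in its life cycle."""
    return NEXT_STATE[state]


def count_by_state(states):
    """Summarize a pool: how many buffers sit in each state right now."""
    counts = {s: 0 for s in BufState}
    for s in states:
        counts[s] += 1
    return counts
```

If count_by_state() ever shows zero buffers in AVAILABLE while the sender has data queued, the pool is too small for the link.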

By increasing the number of buffers sufficiently (roughly the link's bandwidth-delay product divided by the buffer size), we can avoid a situation where round-trip latency prevents us from filling the gigantic 40-Gbps (56-Gbps for FDR IB) RDMA pipeline.
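
As a back-of-the-envelope check on "sufficiently", the buffer count can be estimated from the bandwidth-delay product.  In the sketch below the function name buffers_needed() is mine, and the 100-microsecond round trip and 128-KiB buffer size in the usage note are made-up illustrative inputs, not measurements; only the 40-Gbps and 56-Gbps link rates come from the discussion above.

```python
import math


def buffers_needed(link_gbps, round_trip_us, buffer_bytes):
    """Estimate how many buffers must be in flight to keep the link full.

    link_gbps     -- nominal link rate in gigabits per second
    round_trip_us -- time for a buffer to cycle back to the sender (assumed)
    buffer_bytes  -- size of each pre-registered buffer (assumed)
    """
    bytes_per_second = link_gbps * 1e9 / 8
    bytes_in_flight = bytes_per_second * round_trip_us * 1e-6
    return max(1, math.ceil(bytes_in_flight / buffer_bytes))
```

With those assumed inputs, buffers_needed(40, 100, 128 * 1024) comes to 4 and buffers_needed(56, 100, 128 * 1024) comes to 6; smaller buffers or a longer round trip push the count up proportionally.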

I'm also interested in how writes work - how do we avoid copies on the write path and also avoid having to re-register buffers with each write?

BTW, none of these concerns, or the concerns discussed by Rafi KC, are addressed in the Gluster 3.6 RDMA feature page.

-ben (e)