[Gluster-devel] RDMA: Patch to make use of pre registered memory

Anand Avati avati at gluster.org
Mon Feb 9 18:50:16 UTC 2015


On Sun Feb 08 2015 at 10:16:27 PM Ben England <bengland at redhat.com> wrote:

> Avati, I'm all for your zero-copy RDMA API proposal, but I have a concern
> about your proposed zero-copy fop below...
>
> ----- Original Message -----
> > From: "Anand Avati" <avati at gluster.org>
> > To: "Mohammed Rafi K C" <rkavunga at redhat.com>,
> >     "Gluster Devel" <gluster-devel at gluster.org>
> > Cc: "Raghavendra Gowdappa" <rgowdapp at redhat.com>,
> >     "Ben Turner" <bturner at redhat.com>,
> >     "Ben England" <bengland at redhat.com>,
> >     "Suman Debnath" <sdebnath at redhat.com>
> > Sent: Saturday, January 24, 2015 1:15:52 AM
> > Subject: Re: RDMA: Patch to make use of pre registered memory
> >
> > Couple of comments -
> >
> > ...
> > 4. The next step for zero-copy would be the introduction of a new fop
> > readto() where the destination pointer is passed in by the caller (gfapi
> > being the primary use case). In this situation RDMA ought to register
> > that memory if necessary and request the server to RDMA_WRITE into the
> > pointer provided by the gfapi caller.
>
> The readto() API emulates the Linux/Unix read() system call, where the
> caller passes in the address of the read buffer.  That API was created half
> a century ago in a non-distributed world.  IMHO the caller should not
> specify where the read data should arrive; instead, the read API should
> tell the caller where the data arrived.  There should be a pre-registered
> pool of buffers that both the sender and receiver *already* know about,
> which can be used for RDMA reads, and one of these would be passed to the
> caller as part of the read "event" or completion.  This seems related to
> the performance results that Rafi KC posted earlier this month.
>
> Why does it matter?  With RDMA, the read transfer cannot begin until the
> OTHER END of the RDMA connection knows where the data will land, and it
> cannot know this soon enough if we wait until the read API call to specify
> what address to target.  An API where the caller specifies the buffer
> address *blocks* the sender, introduces latency (transmitting the
> RDMA-able address to the sender), and prevents pipelined, overlapping
> activity by the sender and receiver.
>

If I understand your question right, you are expressing concern that
read-ahead cannot be done with readto() semantics. That is true in a sense,
but generally not a concern. The typical use case is qemu, where we ideally
want the gluster server to RDMA_WRITE the read() RPC reply straight into
the page cache of the guest. QEMU always hands its block layer a pointer to
fill data into (just like the half-century-old Unix read()), so this is a
given constraint under which we need to work. The reason this is not as
grave a problem as it appears is that all of this happens behind the
read-ahead of the guest VM. The guest VM's read-ahead (on Linux, and
probably other OSes too) is typically asynchronous, and as long as gluster
can handle multiple RDMA/RPC requests in parallel/pipelined fashion (which
it can), the "blocks the sender" problem does not really arise.

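To be concrete, the readto() shape I have in mind is roughly the following
(glfs_readto() is only a proposed name; nothing with this exact signature
exists in gfapi today, and the on-demand registration behaviour is assumed):

#include <sys/types.h>          /* ssize_t, size_t, off_t */

struct glfs_fd;                 /* opaque handle, as in gfapi today */

/* Proposed shape only.  The caller owns 'dst'; the RDMA transport is
 * expected to register that memory on demand (or find it in a
 * registration cache) and ask the server to RDMA_WRITE the read reply
 * straight into it. */
ssize_t glfs_readto (struct glfs_fd *fd, void *dst, size_t size,
                     off_t offset, int flags);

/* qemu-style usage: the block layer hands us the guest page to fill */
static ssize_t
read_into_guest_page (struct glfs_fd *fd, void *guest_page,
                      size_t len, off_t off)
{
        return glfs_readto (fd, guest_page, len, off, 0);
}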


> So a read FOP for RDMA should be more like read_completion_event(buffer **
> read_data_delivered).  It is possible to change libgfapi to support this,
> since it does not have to conform rigidly to POSIX.  Could this work in
> the Gluster translator API?  In the RPC interface?
>
> So then how would the remote sender find out when it is OK to re-use this
> buffer to service another RDMA read request?  Is there an interface,
> something like read_buffer_consumed(buffer * available_buf), on the read
> API side that indicates to RDMA that the caller has consumed the buffer
> and it is ready for re-use, without the added expense of unregistering and
> re-registering?
>
> If so, then you have a pipeline of buffers, each cycling through 4 states:
>
> - in transmission from the sender to the reader
> - being consumed by the reader
> - being returned to the sender for re-use
> - available to the sender, after which it goes back to state 1
>
> By increasing the number of buffers sufficiently, we can avoid a situation
> where round-trip latency prevents you from filling the gigantic 40-Gbps
> (56-Gbps for FDR IB) RDMA pipeline.
>
> I'm also interested in how writes work - how do we avoid copies on the
> write path and also avoid having to re-register buffers with each write?
>
> BTW, none of these concerns, nor the concerns discussed by Rafi KC, are
> addressed in the Gluster 3.6 RDMA feature page.
>
>
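
For concreteness, the interface you describe would look roughly like the
following (illustrative only, reusing your names; none of these symbols
exist in gluster today):

#include <stddef.h>              /* size_t */

struct glfs_fd;                  /* opaque handle, as in gfapi */
struct glfs_rdma_buf;            /* one buffer from the pre-registered pool */

/* wait until the sender has filled one of the pre-registered buffers,
 * then hand that buffer to the caller */
int    read_completion_event (struct glfs_fd *fd,
                              struct glfs_rdma_buf **read_data_delivered,
                              size_t *bytes);

/* pointer to the payload inside the buffer */
void  *read_buffer_data (struct glfs_rdma_buf *buf);

/* return the buffer to the pool so the sender can target it again,
 * without unregistering/re-registering the memory */
void   read_buffer_consumed (struct glfs_rdma_buf *available_buf);
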
We had a very similar API in the previous incarnation of libgfapi (which
was called libglusterfsclient), where read would just return the iobuf,
which could possibly have been read-ahead into or io-cache'd in the past.
It had the equivalent of read_buffer_consumed() etc. as well. This is a
fine approach in terms of efficiency, but in practice you would need to
create an app from scratch designed around this style of API. The reality
is that applications like to keep control of memory and of what lands
where, and do not like memory to be dictated and managed by underlying
layers. What I mean is, even if we provide an API like the one you suggest,
the caller would most likely just memcpy() the data from the iobuf into its
own managed buffer anyway. The application also has to be very careful to
call read_buffer_consumed() instead of free() depending on the specific
buffer. This can be very tricky, as free() could be called in so many
places and deeply nested layers of the application, and all of those places
would now need to be aware that a buffer could either be malloc()ed or
originate from gluster. That model was a big failure; we saw it in
libglusterfsclient.
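
Concretely, even with such an interface the typical caller ends up doing
something like the below anyway, which defeats the zero-copy and still has
to remember that the buffer is released with read_buffer_consumed() rather
than free() (illustrative names from the sketch above; my_own_buffer stands
in for whatever memory the application manages itself):

#include <string.h>             /* memcpy */

static ssize_t
app_read (struct glfs_fd *fd, void *my_own_buffer)
{
        struct glfs_rdma_buf *buf   = NULL;
        size_t                bytes = 0;

        if (read_completion_event (fd, &buf, &bytes) != 0)
                return -1;

        /* the copy we were trying to avoid in the first place */
        memcpy (my_own_buffer, read_buffer_data (buf), bytes);

        /* release back to the pool -- NOT free() */
        read_buffer_consumed (buf);

        return (ssize_t) bytes;
}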

So it is always best to let read-ahead happen in a layer as close to the
client as possible. In the case of qemu/gfapi it is best left to the guest
VM to do read-ahead, with all the layers below doing their best to avoid
memcpy(). In other gfapi use cases the read-ahead xlator makes sense (when
the app needs to be simple), and in those cases a memcpy() is unavoidable
(the cost one pays for having a simple app while still expecting
performance).

Thanks
Avati