[Gluster-devel] zero-copy readv

Anand Avati aavati at redhat.com
Thu Jan 10 06:25:15 UTC 2013

(cc'ing gluster-devel)

Bharata, I got back from vacation only today. Apologies for the delay. 
Please find my reply inline.

On 01/09/2013 07:21 PM, Bharata B Rao wrote:
> Avati,
> I have some time to work on this item now and I would appreciate any quick
> inputs from you on this.
> Regards,
> Bharata.
> On Thu, Jan 03, 2013 at 12:27:33PM +0530, Bharata B Rao wrote:
>> Hi,
>> Wish you all a happy new year!
>> Avati and I had a brief chat regarding zero-copy readv last year and I would
>> like to spend some effort now on this. To be sure that we aren't duplicating
>> efforts, I would like to ask if anybody in your org is already working on it.
>> During my first glance through the code I see that the read data coming in
>> from the rpc socket is put into an iov which traverses via several translators
>> before being presented as @iov in glfs_preadv. I can see the read data being
>> copied onto the user supplied iov in glfs_preadv via iov_copy. So there are
>> after all not many copies happening as I had assumed earlier. There is only
>> one copy in glfs_readv and rest of the xlators (I haven't looked at cluster
>> xlators though) just work on the iov generated in the rpc layer.
>> So Avati, when you discussed about zero-copy, did you mean the data from
>> the rpc socket should be read directly into user supplied iov buffer ? I guess
>> that's not such an easy thing to do and I am not sure if that is even
>> preferred.

That is correct. Currently we have one copy (not two or more) which we 
need to bring down to zero. The extra memory copy can have adverse 
effects on the CPU caches (L1/L2/L3): it evicts useful data and defeats 
the caching optimizations performed by the hardware.

>> Given that I don't see any avenues to reduce the number of iov/buffer copies
>> in the read path, can you throw some light on any other places in the read
>> path, where there are redundant copies that could be removed/optimized ?

Implementing zero-copy is required both in the read path and the write 
path. The underlying principles for implementing zero-copy are actually 
very similar to those followed by the kernel for implementing O_DIRECT, i.e.,

- provide special variants of the read/write fops (new fops?), one which 
follows the zero copy (direct) path with special iobufs holding pointers 
to user provided memory and another with regular iobufs into/out of 
which user data is copied. The new fop does NOT require a change in the 
protocol or affect version compatibility between client and server. Both 
versions can interoperate (think of how the NFS protocol is agnostic to 
O_DIRECT behavior on both the client and the server side).

- On the write side, things are relatively simpler. Use special iobufs 
around user provided memory, make it a synchronous write in write-behind 
(i.e., act like a barrier for previous incomplete writes on the 
overlapping region and avoid the small_write_collapse() optimization on it).

- On the read side things are a little more complicated. In 
rpc-transport/socket, there is a call to iobuf_get() to create a new 
iobuf for reading in the readv reply data from the server. We will need 
a framework change where, if the readv request (of the xid for which the 
readv reply is being handled) happens to be a "direct" variant (i.e., 
zero-copy), the "special iobuf around user's memory" gets picked up 
and read() from the socket is performed directly into the user's memory. 
Equivalent changes will have to be made in RDMA 
(Raghavendra on CC can help). Since the goal is to avoid memory copy,
this data will be bypassing io-cache (and purging pre-cached data of 
those regions along the way).
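
To make the read-side idea concrete, here is a rough, standalone sketch 
in C. It is not GlusterFS code: struct pending_read, fill_iov(), 
recv_payload() and demo_direct_read() are hypothetical names, malloc() 
merely stands in for the buffer iobuf_get() would hand out, and the 
lookup of the request by xid is assumed to have happened already. It 
only illustrates the branch between reading the reply payload into a 
staging buffer and reading it straight into the caller's iovec.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Hypothetical record of an outstanding readv request, keyed by RPC xid. */
struct pending_read {
    uint32_t      xid;
    int           direct;     /* non-zero: "direct" (zero-copy) variant */
    struct iovec *user_iov;   /* caller's memory, used only when direct */
    int           user_cnt;
};

/* Read up to len bytes from fd into the iov array, filling entries in
 * order. Returns the number of bytes stored, or -1 on a read error. */
static ssize_t fill_iov(int fd, const struct iovec *iov, int cnt, size_t len)
{
    size_t done = 0;

    for (int i = 0; i < cnt && done < len; i++) {
        size_t want = iov[i].iov_len;
        size_t got = 0;

        if (want > len - done)
            want = len - done;
        while (got < want) {
            ssize_t n = read(fd, (char *)iov[i].iov_base + got, want - got);
            if (n < 0)
                return -1;
            if (n == 0)                 /* peer closed early */
                return (ssize_t)(done + got);
            got += (size_t)n;
        }
        done += got;
    }
    return (ssize_t)done;
}

/* Receive a readv reply payload of len bytes for request req.
 * Direct: bytes land in the caller's memory and *staging stays NULL.
 * Regular: bytes land in a freshly allocated staging buffer, which the
 * caller would later copy into user memory (the copy we want to avoid). */
static ssize_t recv_payload(int fd, const struct pending_read *req,
                            size_t len, void **staging)
{
    *staging = NULL;
    if (req->direct)
        return fill_iov(fd, req->user_iov, req->user_cnt, len);

    struct iovec tmp = { .iov_base = malloc(len), .iov_len = len };
    if (tmp.iov_base == NULL)
        return -1;
    ssize_t n = fill_iov(fd, &tmp, 1, len);
    if (n < 0) {
        free(tmp.iov_base);
        return -1;
    }
    *staging = tmp.iov_base;
    return n;
}

/* Self-check: a socketpair stands in for the connection to the server. */
int demo_direct_read(void)
{
    int sv[2];
    char a[5] = {0}, b[6] = {0};
    struct iovec user[2] = { { a, sizeof a }, { b, sizeof b } };
    struct pending_read req = { .xid = 42, .direct = 1,
                                .user_iov = user, .user_cnt = 2 };
    void *staging = NULL;
    ssize_t n = -1;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0)
        return -1;
    if (write(sv[1], "helloworld!", 11) == 11)
        n = recv_payload(sv[0], &req, 11, &staging);
    close(sv[0]);
    close(sv[1]);
    return (n == 11 && staging == NULL &&
            memcmp(a, "hello", 5) == 0 && memcmp(b, "world!", 6) == 0) ? 0 : -1;
}
```

Calling demo_direct_read() returns 0 when the payload ended up in the 
two user buffers without passing through a staging buffer. In the real 
transport the decision would have to be made while the reply header is 
being parsed, before the payload is read, which is why the xid lookup 
matters.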

Hope that helps,
