[Gluster-devel] RDMA: Patch to make use of pre registered memory

Ben England bengland at redhat.com
Fri Jan 23 15:06:47 UTC 2015


Rafi, great results, thanks.   Your "io-cache off" columns are read tests with the io-cache translator disabled, correct?  Two things jump out at me from your numbers:

- io-cache translator destroys RDMA read performance. 
- approach 2i) "register iobuf pool" is the best approach.
-- on reads with io-cache off, 32% better than baseline and 21% better than 1) "separate buffer" 
-- on writes, 22% better than baseline and 14% better than 1)

Can someone explain to me why the typical Gluster site wants to use the io-cache translator, given that FUSE now caches file data?  Should we just turn it off by default at this point?  That would buy us time to change the io-cache implementation to be compatible with RDMA (see option 2ii below).

remaining comments inline

-ben

----- Original Message -----
> From: "Mohammed Rafi K C" <rkavunga at redhat.com>
> To: gluster-devel at gluster.org
> Cc: "Raghavendra Gowdappa" <rgowdapp at redhat.com>, "Anand Avati" <avati at gluster.org>, "Ben Turner"
> <bturner at redhat.com>, "Ben England" <bengland at redhat.com>, "Suman Debnath" <sdebnath at redhat.com>
> Sent: Friday, January 23, 2015 7:43:45 AM
> Subject: RDMA: Patch to make use of pre registered memory
> 
> Hi All,
> 
> As I pointed out earlier, for the rdma protocol we need to register the
> memory used during rdma reads and writes with the rdma device, and that
> registration is a costly operation. To avoid registering memory in the
> i/o path, we came up with two solutions.
> 
> 1) Use a separate pre-registered iobuf_pool for rdma. This approach
> needs an extra level of copying in rdma for each read/write request,
> i.e. we need to copy the content of the memory given by the application
> into rdma's buffers in the rdma code.
> 

copying data defeats the whole point of RDMA, which is to *avoid* copying data.   
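
To make the cost concrete: with (1), every request has to bounce through a pre-registered buffer, roughly like the sketch below (illustrative C only, not the actual patch; rdma_bounce_buf and rdma_stage_payload are made-up names):

#include <stddef.h>
#include <string.h>
#include <infiniband/verbs.h>

struct rdma_bounce_buf {
    void          *addr;   /* memory registered once at init */
    size_t         size;
    struct ibv_mr *mr;     /* result of ibv_reg_mr() on addr */
};

/* Stage application data for posting; the memcpy below is the
 * per-request overhead that approach (2) is trying to eliminate. */
static int
rdma_stage_payload(struct rdma_bounce_buf *buf, const void *payload,
                   size_t len)
{
    if (len > buf->size)
        return -1;
    memcpy(buf->addr, payload, len);   /* extra copy in the i/o path */
    return 0;
}

So we pay a memcpy per request to save an ibv_reg_mr() per request, which is the wrong trade for large transfers.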

> 2) Register the default iobuf_pool in glusterfs_ctx with the rdma
> device during rdma initialization. Since we are registering buffers
> from the default pool for read/write, we require neither registration
> in the i/o path nor copying.

This makes far more sense to me.
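
For my own understanding, the one-time registration in (2) would look roughly like this, done once when the transport comes up (simplified C sketch; sketch_arena / sketch_iobuf_pool are stand-ins for the real libglusterfs iobuf structures, not their actual layout):

#include <stddef.h>
#include <infiniband/verbs.h>

/* Simplified stand-ins for the libglusterfs iobuf structures. */
struct sketch_arena {
    void                *mem_base;    /* start of the arena's allocation */
    size_t               arena_size;
    struct ibv_mr       *mr;          /* filled in below */
    struct sketch_arena *next;
};

struct sketch_iobuf_pool {
    struct sketch_arena *arenas;
};

/* Register every arena of the default pool with the protection domain
 * once, at transport init, so iobufs handed to rdma are already usable
 * for RDMA reads/writes without a per-request ibv_reg_mr() call. */
static int
rdma_register_iobuf_pool(struct ibv_pd *pd, struct sketch_iobuf_pool *pool)
{
    struct sketch_arena *a;

    for (a = pool->arenas; a != NULL; a = a->next) {
        a->mr = ibv_reg_mr(pd, a->mem_base, a->arena_size,
                           IBV_ACCESS_LOCAL_WRITE |
                           IBV_ACCESS_REMOTE_READ |
                           IBV_ACCESS_REMOTE_WRITE);
        if (a->mr == NULL)
            return -1;   /* caller should deregister what succeeded */
    }
    return 0;
}

All of the cost is paid at init, and the hot path only has to look up the arena's MR.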

> But the problem comes when the io-cache translator is turned on: for
> each page fault, io-cache takes a ref on the iobuf of the response
> buffer in order to cache it, so all of the pre-allocated buffers soon
> end up locked in io-cache. Eventually all new requests get iobufs from
> new iobuf_pools that are not registered with rdma, and we are back to
> registering every iobuf. To address this issue, we can:
> 
>              i)  Turn off io-cache
>                  (we chose this for testing)
>             ii)  Use a separate buffer for io-cache, and offload copying
>                  from the default pool to the io-cache buffer
>                  (new thread to offload)


I think this makes sense, because if you get an io-cache translator cache hit, then you don't need to go out to the network, so io-cache memory doesn't have to be registered with RDMA.
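
i.e. something along these lines on the cache-fill path, so the registered iobuf can be unref'd right away instead of being pinned by the cache (hypothetical names, not the actual io-cache code; the copy could be pushed to the offload thread you mention):

#include <stddef.h>
#include <stdlib.h>
#include <string.h>

struct cache_page {
    void   *data;    /* io-cache's own allocation; never registered */
    size_t  len;
};

/* Copy the response data into a buffer the cache owns outright, so the
 * registered iobuf can be unref'd immediately and the pre-registered
 * pool is not drained by cached pages. */
static struct cache_page *
ioc_absorb_iobuf(const void *iobuf_data, size_t len)
{
    struct cache_page *pg = malloc(sizeof(*pg));
    if (pg == NULL)
        return NULL;

    pg->data = malloc(len);
    if (pg->data == NULL) {
        free(pg);
        return NULL;
    }

    memcpy(pg->data, iobuf_data, len);
    pg->len = len;
    return pg;
}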

>            iii)  Dynamically register each newly created arena with
>                  rdma; this requires bringing the libglusterfs code and
>                  the transport-layer code together
>                  (will need packaging changes and may introduce hard
>                  dependencies on the rdma libs)
>             iv)  Increase the default pool size
>                  (will increase the memory footprint of the glusterfs
>                  process)
> 

registration with RDMA only makes sense to me when data is going to be sent/received over the RDMA network.  Is it hard to tell in advance which buffers will need to be transmitted?
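
If it is hard to know in advance, one fallback in the spirit of your option iii) is to register lazily at post time and cache the MRs, so only buffers that actually hit the wire pay the registration cost.  Purely a hypothetical sketch, not a proposal for the patch:

#include <stddef.h>
#include <stdlib.h>
#include <infiniband/verbs.h>

struct mr_cache_entry {
    void                  *base;
    size_t                 len;
    struct ibv_mr         *mr;
    struct mr_cache_entry *next;
};

/* Return an MR covering [buf, buf+len), registering on demand the first
 * time a region is seen; later requests from the same region reuse it. */
static struct ibv_mr *
mr_cache_get(struct mr_cache_entry **head, struct ibv_pd *pd,
             void *buf, size_t len)
{
    struct mr_cache_entry *e;

    /* hit: the buffer lies inside an already-registered region */
    for (e = *head; e != NULL; e = e->next) {
        if ((char *)buf >= (char *)e->base &&
            (char *)buf + len <= (char *)e->base + e->len)
            return e->mr;
    }

    /* miss: register now and remember the region for later requests */
    e = malloc(sizeof(*e));
    if (e == NULL)
        return NULL;
    e->mr = ibv_reg_mr(pd, buf, len,
                       IBV_ACCESS_LOCAL_WRITE |
                       IBV_ACCESS_REMOTE_READ |
                       IBV_ACCESS_REMOTE_WRITE);
    if (e->mr == NULL) {
        free(e);
        return NULL;
    }
    e->base = buf;
    e->len  = len;
    e->next = *head;
    *head   = e;
    return e->mr;
}

The catch with any such cache is invalidation when the underlying memory is freed, which may be why you flagged iii) as needing libglusterfs and transport changes together.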

> We implemented two of the approaches, (1) and (2i), to get some
> performance numbers. The setup was a 4x2 distributed-replicated volume
> using ram disks as bricks to avoid a hard-disk bottleneck. The numbers
> are attached to this mail.
> 
> 
> Please share your thoughts on these approaches.
> 
> Regards
> Rafi KC
> 
> 
> 
-------------- next part --------------
Results per run (presumably MB/s, as reported by dd):

Run      Separate buffer for rdma (1)    No change (baseline)            Register default iobuf pool (2i)
         write   read   io-cache off     write   read   io-cache off     write   read   io-cache off
1        373     527    656              343     483    532              446     512    696
2        380     528    668              347     485    540              426     525    715
3        376     527    594              346     482    540              422     526    720
4        381     533    597              348     484    540              413     526    710
5        372     527    479              347     482    538              422     519    719
Average  376.4   528.4  598.8            346.2   483.2   538             425.8   521.6  712

Note: varying results.
								
commands:
  read:   echo 3 > /proc/sys/vm/drop_caches; dd if=/home/ram0/mount0/foo.txt of=/dev/null bs=1024K count=1000;
  write:  echo 3 > /proc/sys/vm/drop_caches; dd of=/home/ram0/mount0/foo.txt if=/dev/zero bs=1024K count=1000 conv=sync;

							
vol info:
Volume Name: xcube
Type: Distributed-Replicate
Volume ID: 84cbc80f-bf93-4b10-9865-79a129efe2f5
Status: Started
Snap Volume: no
Number of Bricks: 4 x 2 = 8
Transport-type: rdma
Bricks:
Brick1: 192.168.44.105:/home/ram0/b0
Brick2: 192.168.44.106:/home/ram0/b0
Brick3: 192.168.44.107:/brick/0/b0
Brick4: 192.168.44.108:/brick/0/b0
Brick5: 192.168.44.105:/home/ram1/b1
Brick6: 192.168.44.106:/home/ram1/b1
Brick7: 192.168.44.107:/brick/1/b1
Brick8: 192.168.44.108:/brick/1/b1
Options Reconfigured:
performance.io-cache: on
performance.readdir-ahead: on
snap-max-hard-limit: 256
snap-max-soft-limit: 90
auto-delete: disable	

