[Gluster-devel] RDMA: Patch to make use of pre-registered memory

Mohammed Rafi K C rkavunga at redhat.com
Fri Jan 23 12:43:45 UTC 2015


Hi All,

As I pointed out earlier, for the rdma protocol we need to register the
memory used during rdma reads and writes with the rdma device, and this
registration is a costly operation. To avoid registering memory in the
i/o path, we came up with two solutions.
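For context, this is a minimal sketch of what per-request registration
looks like with libibverbs; the pd/buf arguments stand in for the
protection domain and the application's iobuf, and none of this is the
actual transport code -- it only shows the cost we want out of the i/o
path:

#include <infiniband/verbs.h>
#include <stdio.h>

/*
 * Sketch of the costly path: registering the buffer with the rdma device
 * for every single read/write.  'pd' and 'buf' are placeholders for the
 * protection domain and the iobuf handed down by the application.
 */
static struct ibv_mr *
register_per_request (struct ibv_pd *pd, void *buf, size_t len)
{
        /* ibv_reg_mr() pins the pages and builds the HCA mappings --
         * this is the expensive step. */
        struct ibv_mr *mr = ibv_reg_mr (pd, buf, len,
                                        IBV_ACCESS_LOCAL_WRITE |
                                        IBV_ACCESS_REMOTE_READ |
                                        IBV_ACCESS_REMOTE_WRITE);
        if (!mr)
                perror ("ibv_reg_mr");
        return mr;
}

static void
complete_request (struct ibv_mr *mr)
{
        /* ...and it has to be deregistered again once the request
         * completes. */
        if (mr)
                ibv_dereg_mr (mr);
}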

1) Use a separate pre-registered iobuf_pool for rdma. This approach needs
an extra level of copying in rdma for each read/write request, i.e. we
need to copy the contents of the memory given by the application into
rdma's pre-registered buffers inside the rdma code.
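
Roughly, the write side of approach (1) would look something like the
sketch below; rdma_write_with_copy() and its reg_buf/reg_mr parameters are
illustrative names, not the real transport code, but they show where the
extra memcpy sits:

#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

/* Approach (1): the transport keeps a pool of buffers registered with
 * ibv_reg_mr() once, at pool creation time.  Each write copies the
 * application's data into one of those buffers and posts the work request
 * with that buffer's lkey -- no registration in the i/o path, at the price
 * of one memcpy per request.  'reg_buf'/'reg_mr' stand for a buffer taken
 * from that pre-registered pool; the pool itself is not shown. */
static int
rdma_write_with_copy (struct ibv_qp *qp,
                      void *reg_buf, struct ibv_mr *reg_mr,
                      const void *app_data, size_t len,
                      uint64_t remote_addr, uint32_t rkey)
{
        memcpy (reg_buf, app_data, len);          /* the extra copy */

        struct ibv_sge sge = {
                .addr   = (uintptr_t) reg_buf,
                .length = (uint32_t) len,
                .lkey   = reg_mr->lkey,           /* key from the one-time
                                                     registration */
        };
        struct ibv_send_wr wr = {
                .sg_list    = &sge,
                .num_sge    = 1,
                .opcode     = IBV_WR_RDMA_WRITE,
                .send_flags = IBV_SEND_SIGNALED,
        };
        wr.wr.rdma.remote_addr = remote_addr;
        wr.wr.rdma.rkey        = rkey;

        struct ibv_send_wr *bad_wr = NULL;
        return ibv_post_send (qp, &wr, &bad_wr);
}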

2) Register the default iobuf_pool in glusterfs_ctx with the rdma device
during rdma initialization. Since the buffers used for read/write then come
from the default pool, we need neither per-request registration nor
copying. The problem comes when the io-cache translator is turned on: for
each page fault, io-cache takes a ref on the iobuf of the response buffer
in order to cache it, so all the pre-allocated buffers soon end up locked
by io-cache. Eventually new requests get iobufs from newly created
iobuf_pools which are not registered with rdma, and we are back to
registering every iobuf. To address this issue, we can:

             i)   Turn off io-cache.
                  (we chose this for testing)
             ii)  Use a separate buffer for io-cache, and offload from the
                  default pool to the io-cache buffer.
                  (needs a new thread to do the offloading)
             iii) Dynamically register each newly created arena with rdma;
                  for this we need to bring the libglusterfs code and the
                  transport layer code together. A rough sketch of this
                  arena-level registration follows the list.
                  (will need packaging changes and may bring hard
                  dependencies on the rdma libs)
             iv)  Increase the default pool size.
                  (will increase the memory footprint of the glusterfs
                  process)
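
As a rough illustration of what (2)/(iii) boil down to, the sketch below
registers a whole arena once and then resolves an iobuf's lkey with a
simple lookup in the i/o path. struct arena_reg, register_arena() and
lookup_lkey() are made-up names standing in for the real libglusterfs
iobuf structures, so treat this only as a sketch of the idea:

#include <infiniband/verbs.h>
#include <stdint.h>
#include <stddef.h>

/* Minimal stand-ins for the iobuf arena bookkeeping; the field and
 * function names are illustrative, not the actual libglusterfs ones. */
struct arena_reg {
        void             *mem_base;   /* start of the arena's memory block */
        size_t            arena_size; /* size of that block */
        struct ibv_mr    *mr;         /* filled in once, below */
        struct arena_reg *next;
};

/* Register one arena's whole memory block with the rdma device.  Done at
 * pool creation (approach 2) or whenever a new arena appears (option iii),
 * never in the i/o path. */
static int
register_arena (struct ibv_pd *pd, struct arena_reg *arena)
{
        arena->mr = ibv_reg_mr (pd, arena->mem_base, arena->arena_size,
                                IBV_ACCESS_LOCAL_WRITE |
                                IBV_ACCESS_REMOTE_READ |
                                IBV_ACCESS_REMOTE_WRITE);
        return arena->mr ? 0 : -1;
}

/* In the i/o path, an iobuf that came from a registered arena only needs a
 * lookup to find its lkey -- no ibv_reg_mr() and no copy. */
static uint32_t
lookup_lkey (struct arena_reg *arenas, void *iobuf_ptr)
{
        for (struct arena_reg *a = arenas; a; a = a->next) {
                char *base = a->mem_base;
                if ((char *) iobuf_ptr >= base &&
                    (char *) iobuf_ptr < base + a->arena_size)
                        return a->mr->lkey;
        }
        return 0;   /* not from a registered arena: fall back to
                       per-request registration */
}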

We implemented two of these approaches, (1) and (2i), to get some
performance numbers. The setup was a 4x2 distributed-replicated volume
using ram disks as bricks to avoid a hard-disk bottleneck. The numbers
are attached to this mail.


Please provide your thoughts on these approaches.

Regards
Rafi KC


-------------- next part --------------
        Separate buffer for rdma (1)    No change                       Register default iobuf pool (2i)
Run     write   read    io-cache off    write   read    io-cache off    write   read    io-cache off
1       373     527     656             343     483     532             446     512     696
2       380     528     668             347     485     540             426     525     715
3       376     527     594             346     482     540             422     526     720
4       381     533     597             348     484     540             413     526     710
5       372     527     479             347     482     538             422     519     719
Average 376.4   528.4   598.8           346.2   483.2   538             425.8   521.6   712

Note: results varied between runs.
								
commands:
  read:  echo 3 > /proc/sys/vm/drop_caches; dd if=/home/ram0/mount0/foo.txt of=/dev/null bs=1024K count=1000;
  write: echo 3 > /proc/sys/vm/drop_caches; dd of=/home/ram0/mount0/foo.txt if=/dev/zero bs=1024K count=1000 conv=sync;

							
vol info:
Volume Name: xcube
Type: Distributed-Replicate
Volume ID: 84cbc80f-bf93-4b10-9865-79a129efe2f5
Status: Started
Snap Volume: no
Number of Bricks: 4 x 2 = 8
Transport-type: rdma
Bricks:
Brick1: 192.168.44.105:/home/ram0/b0
Brick2: 192.168.44.106:/home/ram0/b0
Brick3: 192.168.44.107:/brick/0/b0
Brick4: 192.168.44.108:/brick/0/b0
Brick5: 192.168.44.105:/home/ram1/b1
Brick6: 192.168.44.106:/home/ram1/b1
Brick7: 192.168.44.107:/brick/1/b1
Brick8: 192.168.44.108:/brick/1/b1
Options Reconfigured:
performance.io-cache: on
performance.readdir-ahead: on
snap-max-hard-limit: 256
snap-max-soft-limit: 90
auto-delete: disable	

