[Gluster-devel] RDMA: Patch to make use of pre registered memory
Mohammed Rafi K C
rkavunga at redhat.com
Fri Jan 23 12:43:45 UTC 2015
Hi All,
As I pointed out earlier, for the RDMA protocol we need to register the memory
used during RDMA reads and writes with the RDMA device, and this registration
is a costly operation.
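For context, the per-buffer registration we are trying to keep out of the hot
path looks roughly like the standalone libibverbs sketch below. This is only an
illustration (the buffer size and access flags are made up for the example),
not the actual code in the rdma transport:

#include <stdio.h>
#include <stdlib.h>
#include <infiniband/verbs.h>

int main(void)
{
    struct ibv_device **devs = ibv_get_device_list(NULL);
    if (!devs || !devs[0]) {
        fprintf(stderr, "no RDMA device found\n");
        return 1;
    }

    struct ibv_context *ctx = ibv_open_device(devs[0]);
    struct ibv_pd *pd = ctx ? ibv_alloc_pd(ctx) : NULL;
    if (!pd) {
        fprintf(stderr, "failed to open device / allocate PD\n");
        return 1;
    }

    size_t len = 128 * 1024;             /* stand-in for an iobuf page */
    void *buf = malloc(len);

    /* This is the cost we want out of the I/O path: a syscall plus
     * pinning and translation of every page of the buffer. */
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_READ |
                                   IBV_ACCESS_REMOTE_WRITE);
    if (!mr) {
        perror("ibv_reg_mr");
        return 1;
    }

    /* ... post RDMA read/write work requests that use mr->lkey ... */

    ibv_dereg_mr(mr);
    free(buf);
    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    return 0;
}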
To avoid registering memory in the I/O path, we came up with two solutions.

1) Use a separate, pre-registered iobuf_pool for RDMA. This approach needs an
extra level of copying in RDMA for each read/write request, i.e. we have to
copy the contents of the memory given by the application into RDMA's own
buffers inside the RDMA code.
2) Register the default iobuf_pool in glusterfs_ctx with the RDMA device during
RDMA initialization. Since the buffers used for read/write then come from the
default pool, we need neither registration nor copying. The problem comes when
the io-cache translator is turned on: for each page fault, io-cache takes a ref
on the iobuf of the response buffer in order to cache it, so all the
pre-allocated buffers get locked up by io-cache very soon. Eventually all new
requests get iobufs from newly created iobuf_pools that are not registered with
RDMA, and we are back to registering every iobuf. (A toy model of this
exhaustion is sketched below.)
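To make the exhaustion argument concrete, here is a toy model in plain C, not
gluster code: a fixed pool stands in for the pre-registered default iobuf_pool,
and a consumer that never releases what it takes stands in for io-cache holding
refs on response iobufs:

#include <stdio.h>

/* Toy model only. POOL_SIZE plays the role of the number of
 * pre-registered iobufs in the default pool. */
#define POOL_SIZE 8

struct buf { int in_use; };

static struct buf pool[POOL_SIZE];

static struct buf *pool_get(void)
{
    for (int i = 0; i < POOL_SIZE; i++)
        if (!pool[i].in_use) {
            pool[i].in_use = 1;   /* "io-cache" keeps this ref forever */
            return &pool[i];
        }
    return NULL;   /* pool exhausted: caller falls back to a new,
                      unregistered allocation */
}

int main(void)
{
    int fallbacks = 0;

    /* After POOL_SIZE reads, every further request misses the
     * pre-registered pool. */
    for (int req = 0; req < 20; req++)
        if (!pool_get())
            fallbacks++;

    printf("%d of 20 requests needed per-request registration\n", fallbacks);
    return 0;
}

With a pool of 8 buffers and 20 requests, 12 of them miss the pool; in the real
setup a miss means an iobuf from a fresh pool whose memory is unknown to the
RDMA device.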
To address this issue, we can:

i) Turn off io-cache.
(We chose this for testing.)

ii) Use a separate buffer for io-cache, and offload data from the default pool
into that buffer.
(Needs a new thread to do the offloading.)

iii) Dynamically register each newly created arena with RDMA; for this we need
to bring the libglusterfs code and the transport-layer code together. A sketch
of the idea follows this list.
(Will need packaging changes and may introduce a hard dependency on the RDMA
libs.)

iv) Increase the default pool size.
(Will increase the memory footprint of the glusterfs process.)
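As a rough illustration of option (iii), and of approach (2) in general, the
sketch below registers a whole arena once with ibv_reg_mr and then only looks
up the memory region in the I/O path. The names here (register_arena,
lookup_mr, the registry array) are hypothetical; the real change would live in
the rdma transport and would be driven from iobuf arena creation in
libglusterfs (walking glusterfs_ctx->iobuf_pool at init and hooking new
arenas):

#include <stddef.h>
#include <infiniband/verbs.h>

/* Hypothetical registry: one entry per iobuf arena. In a real patch this
 * would hang off the rdma transport's private structure. */
struct arena_mr {
    void          *base;
    size_t         len;
    struct ibv_mr *mr;
};

#define MAX_ARENAS 64
static struct arena_mr registry[MAX_ARENAS];
static int registry_count;

/* Register a whole arena once, outside the I/O path. */
int register_arena(struct ibv_pd *pd, void *base, size_t len)
{
    if (registry_count >= MAX_ARENAS)
        return -1;

    struct ibv_mr *mr = ibv_reg_mr(pd, base, len,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_READ |
                                   IBV_ACCESS_REMOTE_WRITE);
    if (!mr)
        return -1;

    registry[registry_count++] = (struct arena_mr){ base, len, mr };
    return 0;
}

/* Hot path: an iobuf's pointer falls inside one of the registered arenas,
 * so we only look up the lkey/rkey -- no ibv_reg_mr here. */
struct ibv_mr *lookup_mr(void *ptr)
{
    for (int i = 0; i < registry_count; i++) {
        char *base = registry[i].base;
        if ((char *)ptr >= base && (char *)ptr < base + registry[i].len)
            return registry[i].mr;
    }
    return NULL;   /* iobuf from an unregistered pool: fall back to
                      per-request registration, which is exactly what
                      option (iii) tries to eliminate */
}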
We implemented two of the approaches, (1) and (2i), to get some performance
numbers. The setup was a 4x2 distributed-replicated volume using RAM disks as
bricks, to avoid a hard-disk bottleneck. The numbers are attached to this mail.

Please share your thoughts on these approaches.
Regards
Rafi KC
-------------- next part --------------
Run      | Separate buffer for rdma (1)  | No change                     | Register default iobuf pool (2i)
         | write   read   io-cache off   | write   read   io-cache off   | write   read   io-cache off
1        |  373     527      656         |  343     483      532         |  446     512      696
2        |  380     528      668         |  347     485      540         |  426     525      715
3        |  376     527      594         |  346     482      540         |  422     526      720
4        |  381     533      597         |  348     484      540         |  413     526      710
5        |  372     527      479         |  347     482      538         |  422     519      719
Average  |  376.4   528.4    598.8       |  346.2   483.2    538         |  425.8   521.6    712

Note: (varying result for run 5)
Commands used:
read:  echo 3 > /proc/sys/vm/drop_caches; dd if=/home/ram0/mount0/foo.txt of=/dev/null bs=1024K count=1000
write: echo 3 > /proc/sys/vm/drop_caches; dd of=/home/ram0/mount0/foo.txt if=/dev/zero bs=1024K count=1000 conv=sync
Volume info:
Volume Name: xcube
Type: Distributed-Replicate
Volume ID: 84cbc80f-bf93-4b10-9865-79a129efe2f5
Status: Started
Snap Volume: no
Number of Bricks: 4 x 2 = 8
Transport-type: rdma
Bricks:
Brick1: 192.168.44.105:/home/ram0/b0
Brick2: 192.168.44.106:/home/ram0/b0
Brick3: 192.168.44.107:/brick/0/b0
Brick4: 192.168.44.108:/brick/0/b0
Brick5: 192.168.44.105:/home/ram1/b1
Brick6: 192.168.44.106:/home/ram1/b1
Brick7: 192.168.44.107:/brick/1/b1
Brick8: 192.168.44.108:/brick/1/b1
Options Reconfigured:
performance.io-cache: on
performance.readdir-ahead: on
snap-max-hard-limit: 256
snap-max-soft-limit: 90
auto-delete: disable