[Gluster-devel] RDMA: Patch to make use of pre registered memory
Ben England
bengland at redhat.com
Fri Jan 23 15:06:47 UTC 2015
Rafi, great results, thanks. Your "io-cache off" columns are read tests with the io-cache translator disabled, correct? Two things jump out at me from your numbers:
- io-cache translator destroys RDMA read performance.
- approach 2i) "register iobuf pool" is the best approach.
-- on reads with io-cache off, 32% better than baseline and 21% better than 1) "separate buffer"
-- on writes, 22% better than baseline and 14% better than 1)
Can someone explain to me why the typical Gluster site wants to use the io-cache translator, given that FUSE now caches file data? Should we just turn it off by default at this point? That would buy us time to change the io-cache implementation to be compatible with RDMA (see option 2ii below).
remaining comments inline
-ben
----- Original Message -----
> From: "Mohammed Rafi K C" <rkavunga at redhat.com>
> To: gluster-devel at gluster.org
> Cc: "Raghavendra Gowdappa" <rgowdapp at redhat.com>, "Anand Avati" <avati at gluster.org>, "Ben Turner"
> <bturner at redhat.com>, "Ben England" <bengland at redhat.com>, "Suman Debnath" <sdebnath at redhat.com>
> Sent: Friday, January 23, 2015 7:43:45 AM
> Subject: RDMA: Patch to make use of pre registered memory
>
> Hi All,
>
> As I pointed out earlier, for the rdma protocol we need to register the
> memory used during rdma reads and writes with the rdma device, and that
> registration is a costly operation. To avoid registering memory in the
> i/o path, we came up with two solutions.
>
> 1) Use a separate pre-registered iobuf_pool for rdma. This approach
> needs an extra copy in rdma for each read/write request, i.e. we need
> to copy the content of the memory given by the application into rdma's
> own buffers inside the rdma code.
>
copying data defeats the whole point of RDMA, which is to *avoid* copying data.
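To make that objection concrete, here is a minimal sketch of what approach (1) amounts to, assuming a dedicated bounce buffer registered once at pool setup. The names rdma_bounce_buf, rdma_bounce_buf_init and rdma_write_via_bounce are illustrative, not the actual gluster rdma-transport symbols; only the ibv_* calls and IBV_ACCESS_* flags are real libibverbs.

/*
 * Sketch of approach (1): a dedicated, pre-registered bounce buffer
 * plus an extra copy on every request.
 */
#include <string.h>
#include <infiniband/verbs.h>

struct rdma_bounce_buf {
        void          *base;   /* memory owned by the dedicated rdma pool */
        size_t         size;
        struct ibv_mr *mr;     /* registered once, reused for every request */
};

static int
rdma_bounce_buf_init(struct rdma_bounce_buf *buf, struct ibv_pd *pd,
                     void *base, size_t size)
{
        buf->base = base;
        buf->size = size;
        /* one-time registration when the dedicated pool is created */
        buf->mr = ibv_reg_mr(pd, base, size,
                             IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_READ);
        return buf->mr ? 0 : -1;
}

/* write path: the application's iobuf is not registered, so its payload
 * has to be copied into the bounce buffer before the RDMA operation */
static int
rdma_write_via_bounce(struct rdma_bounce_buf *buf,
                      const void *app_payload, size_t len)
{
        if (len > buf->size)
                return -1;
        memcpy(buf->base, app_payload, len);   /* the extra copy */
        /* ... post the RDMA operation using buf->mr->lkey ... */
        return 0;
}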
> 2) Register the default iobuf_pool in glusterfs_ctx with the rdma device
> during rdma initialization. Since the buffers used for read/write then
> already come from a registered pool, we need neither per-request
> registration nor copying.
This makes far more sense to me.
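As a sanity check on my understanding of (2), something like the sketch below: walk the default iobuf pool once at rdma transport init and register every arena with the device's protection domain. The arena struct here is a simplified stand-in, not the real libglusterfs struct iobuf_arena; ibv_reg_mr() and the IBV_ACCESS_* flags are the real libibverbs API.

/*
 * Sketch of approach (2): register every arena of the default iobuf pool
 * at transport init, so the i/o path needs neither copies nor registration.
 */
#include <stddef.h>
#include <infiniband/verbs.h>

struct arena {
        void          *mem_base;    /* start of the arena's memory region */
        size_t         arena_size;
        struct ibv_mr *mr;          /* cached registration used on the i/o path */
        struct arena  *next;
};

static int
register_pool_with_rdma(struct ibv_pd *pd, struct arena *arenas)
{
        for (struct arena *a = arenas; a != NULL; a = a->next) {
                a->mr = ibv_reg_mr(pd, a->mem_base, a->arena_size,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_READ |
                                   IBV_ACCESS_REMOTE_WRITE);
                if (a->mr == NULL)
                        return -1;  /* caller falls back to per-request reg */
        }
        return 0;
}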
> But the problem comes when the io-cache translator is turned on: for each
> page fault, io-cache takes a ref on the iobuf of the response buffer in
> order to cache it, so all of the pre-allocated buffers soon end up pinned
> by io-cache.
> Eventually all new requests get iobufs from newly created iobuf_pools that
> are not registered with rdma, and we are back to registering every iobuf.
> To address this issue, we can:
>
> i) Turn off io-cache.
>    (we chose this for testing)
> ii) Use a separate buffer for io-cache, and offload data from the
>     default pool to the io-cache buffer.
>     (New thread to offload)
> iii) Dynamically register each newly created arena with rdma; for
>      this we need to bring the libglusterfs code and the transport
>      layer code together.
>      (Will need changes in packaging and may bring hard dependencies
>      on rdma libs)
> iv) Increase the default pool size.
>     (Will increase the footprint of the glusterfs process)
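What I picture for (2ii) is roughly the following: io-cache copies the page into its own unregistered memory (possibly on the proposed offload thread) and immediately releases the registered iobuf back to the pool, instead of holding a ref on it. Names such as ioc_page, ioc_cache_copy and iobuf_unref_cb are illustrative only; the real io-cache and iobuf interfaces differ.

/*
 * Sketch of option (2ii): copy the response page into io-cache's own,
 * never-registered memory and release the registered iobuf right away.
 */
#include <stdlib.h>
#include <string.h>

struct ioc_page {
        void   *data;          /* plain malloc'd memory, never registered */
        size_t  len;
};

typedef void (*iobuf_unref_cb)(void *iobuf);

/* could run on the proposed offload thread so the copy stays off the
 * rdma completion path */
static struct ioc_page *
ioc_cache_copy(const void *payload, size_t len,
               void *registered_iobuf, iobuf_unref_cb unref)
{
        struct ioc_page *page = malloc(sizeof(*page));
        if (page == NULL)
                return NULL;

        page->data = malloc(len);
        if (page->data == NULL) {
                free(page);
                return NULL;
        }

        memcpy(page->data, payload, len);
        page->len = len;

        unref(registered_iobuf);   /* registered buffer goes back to the pool */
        return page;
}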
>
registration with RDMA only makes sense to me when data is going to be sent/received over the RDMA network. Is it hard to tell in advance which buffers will need to be transmitted?
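If it is hard to know in advance, option (iii) could also be done lazily: the first time an iobuf from an unregistered arena reaches the transport, register its whole arena once and cache the ibv_mr for later requests. A rough sketch, repeating the simplified arena stand-in from the earlier sketch (again, only ibv_reg_mr() is the real API; a real implementation would need the libglusterfs/transport layering and locking that Rafi mentions).

/*
 * Sketch of option (iii) done lazily: register an arena the first time one
 * of its iobufs hits the rdma path, then cache the ibv_mr on the arena.
 */
#include <stddef.h>
#include <infiniband/verbs.h>

struct arena {
        void          *mem_base;
        size_t         arena_size;
        struct ibv_mr *mr;          /* NULL until first use on the rdma path */
};

static struct ibv_mr *
arena_mr_get(struct ibv_pd *pd, struct arena *a)
{
        if (a->mr == NULL)          /* first iobuf from this arena */
                a->mr = ibv_reg_mr(pd, a->mem_base, a->arena_size,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_READ |
                                   IBV_ACCESS_REMOTE_WRITE);
        return a->mr;               /* NULL => fall back to copy or per-iobuf reg */
}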
> We implemented two of the approaches, (1) and (2i), to get some
> performance numbers. The setup was a 4x2 distributed-replicated volume
> using ram disks as bricks to avoid a hard disk bottleneck. The numbers
> are attached to this mail.
>
>
> Please provide your thoughts on these approaches.
>
> Regards
> Rafi KC
>
>
>
Run       Separate buffer for rdma (1)      No change (baseline)              Register default iobuf pool (2i)
          write   read   io-cache off       write   read   io-cache off       write   read   io-cache off
1         373     527    656                343     483    532                446     512    696
2         380     528    668                347     485    540                426     525    715
3         376     527    594                346     482    540                422     526    720
4         381     533    597                348     484    540                413     526    710
5         372     527    479                347     482    538                422     519    719
Average   376.4   528.4  598.8              346.2   483.2  538                425.8   521.6  712

Note: varying result (run 5)
command (read):  echo 3 > /proc/sys/vm/drop_caches; dd if=/home/ram0/mount0/foo.txt of=/dev/null bs=1024K count=1000;
command (write): echo 3 > /proc/sys/vm/drop_caches; dd of=/home/ram0/mount0/foo.txt if=/dev/zero bs=1024K count=1000 conv=sync
vol info "Volume Name: xcube
Type: Distributed-Replicate
Volume ID: 84cbc80f-bf93-4b10-9865-79a129efe2f5
Status: Started
Snap Volume: no
Number of Bricks: 4 x 2 = 8
Transport-type: rdma
Bricks:
Brick1: 192.168.44.105:/home/ram0/b0
Brick2: 192.168.44.106:/home/ram0/b0
Brick3: 192.168.44.107:/brick/0/b0
Brick4: 192.168.44.108:/brick/0/b0
Brick5: 192.168.44.105:/home/ram1/b1
Brick6: 192.168.44.106:/home/ram1/b1
Brick7: 192.168.44.107:/brick/1/b1
Brick8: 192.168.44.108:/brick/1/b1
Options Reconfigured:
performance.io-cache: on
performance.readdir-ahead: on
snap-max-hard-limit: 256
snap-max-soft-limit: 90
auto-delete: disable