[Gluster-devel] Gluster 3.6.2 On Xeon Phi

Thu Feb 12 08:54:55 UTC 2015

On 02/12/2015 08:32 AM, Rudra Siva wrote:
> Rafi,
>
> I'm preparing the Phi RDMA patch for submission

If you can send a patch to support iWARP, that will be a great addition
to gluster rdma. 

>  - definitely
> performance is better with the buffer pre-registration fixes. My patch
> will be without your fixes and doesn't rely on your enhancements so it
> can come in at any time. There are two default values that Phi
> generally has a problem with:
>
> options->send_count = 4096;
> options->recv_count = 4096;
>
> Is there anything that relies on these values being at 4096?
These counts are used for:

1) The maximum number of work requests.
2) The number of posts created. Posts are used to send the RDMA request
to the remote side.
3) Sizing the completion queue (the number of completion queue elements).
> Presently
> have it set to 256 for the Phi to initialize quickly -

It initializes quickly because we are registering 256 posts instead of
4096.

>  has been
> working fine.
>
>
> On Mon, Feb 9, 2015 at 7:23 AM, Rudra Siva <rudrasiva11 at gmail.com> wrote:
>> In rdma.c : gf_rdma_do_reads : pthread_mutex_lock
>> (&priv->write_mutex); - lock guards against what?

The shared variables are priv->connected and priv->device. To avoid a
device removal in the middle of the operation, we have to take the lock on priv.

Rafi KC

>>
>>
>> On Mon, Feb 9, 2015 at 1:10 AM, Mohammed Rafi K C <rkavunga at redhat.com> wrote:
>>> On 02/08/2015 07:52 PM, Rudra Siva wrote:
>>>> Thanks for trying and sending the changes - finally got it all working
>>>> ... it turned out to be a problem with my changes (in
>>>> gf_rdma_post_unref - goes back to lack of SRQ on the interface)
>>>>
>>>> You may be able to simulate the crash if you set volume parameters to
>>>> something like the following (it would be purely academic):
>>>>
>>>> gluster volume set data_volume diagnostics.brick-log-level TRACE
>>>> gluster volume set data_volume diagnostics.client-log-level TRACE
>>>>
>>>> Had those because stuff began from communication problems (queue size,
>>>> lack of SRQ) so things have come a long way from there - will test for
>>>> some more time and make my small changes available.
>>>>
>>>> The transfer speed of the default VE (Virtual Ethernet) that Intel
>>>> ships with it is ~6 MB/sec - presently with Gluster I see around 80
>>>> MB/sec on the virtual IB (there is no real infiniband card) and with a
>>>> stable gluster mount. The interface benchmarks show it can give 5000
>>>> MB/sec so there looks to be more room for improvement - stable gluster
>>>> mount is required first though for doing anything.
>>>>
>>>> Questions:
>>>>
>>>> 1. ctx is shared between posts - parts of code with locks and without
>>>> - intentional/oversight?
>>> I didn't get your question properly. If you are talking about the ctx
>>> inside the post variable, it is not shared.
>>>
>>>> 2.  iobuf_pool->default_page_size  = 128 * GF_UNIT_KB - why is 128 KB
>>>> chosen and not higher?
>>> The default page size for glusterfs is 128 KB, maybe because FUSE is
>>> limited to 128 KB. I'm not sure of the exact reason.
>>>
>>>> -Siva
>>>>
>>>>
>>>> On Fri, Feb 6, 2015 at 6:12 AM, Mohammed Rafi K C <rkavunga at redhat.com> wrote:
>>>>> On 02/06/2015 05:31 AM, Rudra Siva wrote:
>>>>>> Rafi,
>>>>>>
>>>>>> Sorry it took me some time - I had to merge these with some of my
>>>>>> changes - the scif0 (iWARP) does not support SRQ (max_srq : 0) so have
>>>>>> changed some of the code to use QP instead - can provide those if
>>>>>> there is interest after this is stable.
>>>>>>
>>>>>> Here's the good -
>>>>>>
>>>>>> The performance with the patches is better than without (esp.
>>>>>> http://review.gluster.org/#/c/9327/).
>>>>> Good to hear. My thought was that http://review.gluster.org/#/c/9506/ will
>>>>> give much better performance than the others :-) . A rebase is needed
>>>>> if it is applied on top of the other patches.
>>>>>
>>>>>> The bad - glusterfsd crashes for large files so it's difficult to get
>>>>>> some decent benchmark numbers
>>>>> Thanks for raising the bug. I tried to reproduce the problem on version
>>>>> 3.6.2 plus the four patches with a simple distributed volume, but I
>>>>> couldn't reproduce it and am still trying. (We are using Mellanox IB
>>>>> cards.)
>>>>>
>>>>> If possible, can you please share the volume info and the workload used
>>>>> for large files?
>>>>>
>>>>>
>>>>>> - small ones look good - trying to
>>>>>> understand the patch at this time. Looks like this code comes from
>>>>>> 9327 as well.
>>>>>>
>>>>>> Can you please review the reset of mr_count?
>>>>> Yes, the problem could be a wrong value in mr_count. I guess we
>>>>> failed to reset the value to zero, so for some I/O mr_count gets
>>>>> incremented a couple of times and the variable may have overflowed. Can
>>>>> you apply the patch attached to this mail and try with it?
>>>>>
>>>>>> Info from gdb is as follows - if you need more or something jumps out
>>>>>> please feel free to let me know.
>>>>>>
>>>>>> (gdb) p *post
>>>>>> $16 = {next = 0x7fffe003b280, prev = 0x7fffe0037cc0, mr =
>>>>>> 0x7fffe0037fb0, buf = 0x7fffe0096000 "\005\004", buf_size = 4096, aux
>>>>>> = 0 '\000',
>>>>>>   reused = 1, device = 0x7fffe00019c0, type = GF_RDMA_RECV_POST, ctx =
>>>>>> {mr = {0x7fffe0003020, 0x7fffc8005f20, 0x7fffc8000aa0, 0x7fffc80030c0,
>>>>>>       0x7fffc8002d70, 0x7fffc8008bb0, 0x7fffc8008bf0, 0x7fffc8002cd0},
>>>>>> mr_count = -939493456, vector = {{iov_base = 0x7ffff7fd6000,
>>>>>>         iov_len = 112}, {iov_base = 0x7fffbf140000, iov_len = 131072},
>>>>>> {iov_base = 0x0, iov_len = 0} <repeats 14 times>}, count = 2,
>>>>>>     iobref = 0x7fffc8001670, hdr_iobuf = 0x61d710, is_request = 0
>>>>>> '\000', gf_rdma_reads = 1, reply_info = 0x0}, refcount = 1, lock = {
>>>>>>     __data = {__lock = 0, __count = 0, __owner = 0, __nusers = 0,
>>>>>> __kind = 0, __spins = 0, __list = {__prev = 0x0, __next = 0x0}},
>>>>>>     __size = '\000' <repeats 39 times>, __align = 0}}
>>>>>>
>>>>>> (gdb) bt
>>>>>> #0  0x00007fffe7142681 in __gf_rdma_register_local_mr_for_rdma
>>>>>> (peer=0x7fffe0001800, vector=0x7fffe003b108, count=1,
>>>>>> ctx=0x7fffe003b0b0)
>>>>>>     at rdma.c:2255
>>>>>> #1  0x00007fffe7145acd in gf_rdma_do_reads (peer=0x7fffe0001800,
>>>>>> post=0x7fffe003b070, readch=0x7fffe0096010) at rdma.c:3609
>>>>>> #2  0x00007fffe714656e in gf_rdma_recv_request (peer=0x7fffe0001800,
>>>>>> post=0x7fffe003b070, readch=0x7fffe0096010) at rdma.c:3859
>>>>>> #3  0x00007fffe714691d in gf_rdma_process_recv (peer=0x7fffe0001800,
>>>>>> wc=0x7fffceffcd20) at rdma.c:3967
>>>>>> #4  0x00007fffe7146e7d in gf_rdma_recv_completion_proc
>>>>>> (data=0x7fffe0002b30) at rdma.c:4114
>>>>>> #5  0x00007ffff72cfdf3 in start_thread () from /lib64/libpthread.so.0
>>>>>> #6  0x00007ffff6c403dd in clone () from /lib64/libc.so.6
>>>>>>
>>>>>> On Fri, Jan 30, 2015 at 7:11 AM, Mohammed Rafi K C <rkavunga at redhat.com> wrote:
>>>>>>> On 01/29/2015 06:13 PM, Rudra Siva wrote:
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> Have been able to get Gluster running on Intel's MIC platform. The
>>>>>>>> only code change to Gluster source was an unresolved yylex (I am not
>>>>>>>> really sure why that was coming up - maybe someone more familiar with
>>>>>>>> its use in Gluster can answer).
>>>>>>>>
>>>>>>>> At the step for compiling the binaries (glusterd, glusterfsd,
>>>>>>>> glusterfs, glfsheal)  build breaks with an unresolved yylex error.
>>>>>>>>
>>>>>>>> For now have a routine yylex that simply calls graphyylex - I don't
>>>>>>>> know if this is even correct however mount functions.
>>>>>>>>
>>>>>>>> GCC - 4.7 (it's an oddity, latest GCC is missing the Phi patches)
>>>>>>>>
>>>>>>>> flex --version
>>>>>>>> flex 2.5.39
>>>>>>>>
>>>>>>>> bison --version
>>>>>>>> bison (GNU Bison) 3.0
>>>>>>>>
>>>>>>>> I'm still working on testing the RDMA and Infiniband support and can
>>>>>>>> make notes, numbers available when that is complete.
>>>>>>> There are a couple of RDMA performance-related patches under review. If
>>>>>>> you could make use of those patches, I hope they will give a performance
>>>>>>> improvement.
>>>>>>>
>>>>>>> [1] : http://review.gluster.org/#/c/9329/
>>>>>>> [2] : http://review.gluster.org/#/c/9321/
>>>>>>> [3] : http://review.gluster.org/#/c/9327/
>>>>>>> [4] : http://review.gluster.org/#/c/9506/
>>>>>>>
>>>>>>> Let me know if you need any clarification.
>>>>>>>
>>>>>>> Regards!
>>>>>>> Rafi KC
>>>>
>>
>>
>> --
>> -Siva
>
>