[Gluster-users] tips/best practices for gluster rdma?

Wed Jul 10 18:49:45 UTC 2013

Well, first of all, thanks for the responses. The volume WAS falling back to
TCP just as predicted, though WHY is unclear, as the fabric is known to be
working (it has about 28K compute cores on it, all doing heavy MPI testing),
and the OFED/verbs stack is consistent across all client/storage systems
(actually, the OS image is identical).
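
One way to check which path a mount is really using is to watch the byte
counters of the IPoIB network device while writing to the mount: native RDMA
traffic bypasses the ib0 netdev, so its counters should barely move. A minimal
sketch, assuming the IPoIB interface is ib0 as in this setup; the helper name
iface_bytes and the probe file name are made up for illustration and are not
part of Gluster:

```shell
#!/bin/sh
# Print rx+tx bytes seen by a network device, read from sysfs.
# $1: interface name (e.g. ib0)
# $2: sysfs net root (optional, defaults to /sys/class/net)
iface_bytes() {
    root="${2:-/sys/class/net}"
    rx=$(cat "$root/$1/statistics/rx_bytes")
    tx=$(cat "$root/$1/statistics/tx_bytes")
    echo $((rx + tx))
}

# Usage sketch (commented out because it writes to a real mount):
# before=$(iface_bytes ib0)
# dd if=/dev/zero of=/n/holyscratch.rdma/probe.tmp bs=1M count=1024
# after=$(iface_bytes ib0)
# echo "bytes through ib0 netdev: $((after - before))"
```

If roughly the full 1GB shows up in the ib0 delta, the data went over
TCP/IPoIB rather than native RDMA. Other traffic on ib0 will add noise, so
this is best run on an otherwise idle client.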

It's quite sad that RDMA isn't going to make 3.4. We put a good deal of hope
and effort into planning around 3.4 for this storage system, specifically
for RDMA support (well, with warnings to the team that it wasn't in/tested
for 3.3 and that all we could do was HOPE it was in 3.4 and in time for
when we want to go live). We're getting "okay" performance out of IPoIB
right now, and our bottleneck actually seems to be the fabric
design/layout, as we're peaking at about 4.2GB/s writing 10TB over 160
threads to this distributed volume.
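
To put those numbers in per-thread and per-brick terms, here is a rough
back-of-the-envelope sketch. It uses the figures from this thread (160 threads,
64GB per thread, 4.2GB/s peak, 10 bricks) and assumes the writes spread evenly
across bricks and that the peak rate is sustained, neither of which is
guaranteed; the function name throughput_summary is made up for illustration:

```shell
#!/bin/sh
# Break an aggregate write rate down per thread and per brick.
# $1: threads  $2: GB written per thread  $3: aggregate GB/s  $4: bricks
throughput_summary() {
    awk -v t="$1" -v g="$2" -v r="$3" -v b="$4" 'BEGIN {
        total = t * g    # total GB written across all threads
        printf "total_gb=%d seconds=%.0f per_thread_mbs=%.2f per_brick_mbs=%.0f\n",
               total, total / r, r * 1000 / t, r * 1000 / b
    }'
}

throughput_summary 160 64 4.2 10
```

With these inputs that works out to about 10TB total, roughly 40 minutes at
the peak rate, around 26MB/s per dd thread, and around 420MB/s per brick.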

When it IS ready and in 3.4.1 (hopefully!), having good docs around it, and
maybe even a simple printf when a mount falls back to TCP, would be huge for
us.
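
For anyone else hitting this, the check Ryan describes below (kernel module
loaded, then an RDMA-only volume that cannot fall back) can be sketched roughly
as follows. The volume name rdmatest, its brick path, and the mount point are
hypothetical; checking rdma_ucm in addition to rdma_cm is my assumption, since
Ryan only mentions an "RDMA_CM" module:

```shell
#!/bin/sh
# Return 0 if the named kernel module appears in the loaded-module list.
# $1: module name
# $2: module list file (optional, defaults to /proc/modules)
module_loaded() {
    grep -q "^$1 " "${2:-/proc/modules}" 2>/dev/null
}

# Usage sketch (commented out because it changes cluster state):
# for m in rdma_cm rdma_ucm; do
#     module_loaded "$m" || echo "$m is NOT loaded"
# done
#
# Create and mount a volume with RDMA as its ONLY transport. If RDMA is
# broken, this mount should fail outright instead of quietly using TCP.
# gluster volume create rdmatest transport rdma holyscratch01-ib:/holyscratch01/rdmatest
# gluster volume start rdmatest
# mkdir -p /mnt/rdmatest
# mount -t glusterfs -o transport=rdma holyscratch01-ib:/rdmatest /mnt/rdmatest
#
# If the mount fails, look at the client log (default location assumed):
# tail -n 50 /var/log/glusterfs/mnt-rdmatest.log
```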

--
Matthew Nicholson
Research Computing Specialist
Harvard FAS Research Computing
matthew_nicholson at harvard.edu

On Wed, Jul 10, 2013 at 3:18 AM, Justin Clift <jclift at redhat.com> wrote:

> Hi guys,
>
> As an FYI, from discussion on gluster-devel IRC yesterday, the RDMA code
> still isn't in a good enough state for production usage with 3.4.0. :(
>
> There are still outstanding bugs with it, and I'm working to make the
> Gluster Test Framework able to work with RDMA so we can help shake out
> more of them:
>
>
> http://www.gluster.org/community/documentation/index.php/Using_the_Gluster_Test_Framework
>
> Hopefully RDMA will be ready for 3.4.1, but don't hold me to that at
> this stage. :)
>
> Regards and best wishes,
>
> Justin Clift
>
>
> On 09/07/2013, at 8:36 PM, Ryan Aydelott wrote:
> > Matthew,
> >
> > Personally - I have experienced this same problem (even with the mount
> being something.rdma). Running 3.4beta4, if I mounted a volume via RDMA
> that also had TCP configured as a transport option (which obviously you do
> based on the mounts you gave below), if there is ANY issue with RDMA not
> working the mount will silently fall back to TCP. This problem is described
> here: https://bugzilla.redhat.com/show_bug.cgi?id=982757
> >
> > The way to test for this behavior is to create a new volume specifying ONLY
> RDMA as the transport. If you mount this and your RDMA is broken for
> whatever reason - it will simply fail to mount.
> >
> > Assuming this test fails, I would then tail the logs for the volume to
> get a hint of what's going on. In my case there was an RDMA_CM kernel
> module that was not loaded which started to matter as of 3.4beta2 IIRC as
> they did a complete rewrite for this based on poor performance in prior
> releases. The clue in my volume log file was "no such file or directory"
> preceded with an rdma_cm.
> >
> > Hope that helps!
> >
> >
> > -ryan
> >
> >
> > On Jul 9, 2013, at 2:03 PM, Matthew Nicholson <
> matthew_nicholson at harvard.edu> wrote:
> >
> >> Hey guys,
> >>
> >> So, we're testing Gluster RDMA storage, and are having some issues.
> Things are working...just not as we expected them to. There isn't a whole lot
> in the way of docs for gluster rdma that I've found, aside from basically
> "install gluster-rdma", create a volume with transport=rdma, and mount w/
> transport=rdma....
> >>
> >> I've done that...and the IB fabric is known to be good...however, a
> volume created with transport=rdma,tcp and mounted w/ transport=rdma, still
> seems to go over tcp?
> >>
> >> A little more info about the setup:
> >>
> >> we've got 10 storage nodes/bricks, each of which has a single 1Gb NIC
> and an FDR IB port. Likewise for the test clients. Now, the 1Gb NIC is for
> management only, and we have all of the systems on this fabric configured
> with IPoIB, so there is eth0 and ib0 on each node.
> >>
> >> All storage nodes are peered using the ib0 interface, i.e.:
> >>
> >> gluster peer probe storage_node01-ib
> >> etc
> >>
> >> That's all well and good.
> >>
> >> Volume was created:
> >>
> >> gluster volume create holyscratch transport rdma,tcp
> holyscratch01-ib:/holyscratch01/brick
> >> for i in `seq -w 2 10` ; do gluster volume add-brick holyscratch
> holyscratch${i}-ib:/holyscratch${i}/brick; done
> >>
> >> yielding:
> >>
> >> Volume Name: holyscratch
> >> Type: Distribute
> >> Volume ID: 788e74dc-6ae2-4aa5-8252-2f30262f0141
> >> Status: Started
> >> Number of Bricks: 10
> >> Transport-type: tcp,rdma
> >> Bricks:
> >> Brick1: holyscratch01-ib:/holyscratch01/brick
> >> Brick2: holyscratch02-ib:/holyscratch02/brick
> >> Brick3: holyscratch03-ib:/holyscratch03/brick
> >> Brick4: holyscratch04-ib:/holyscratch04/brick
> >> Brick5: holyscratch05-ib:/holyscratch05/brick
> >> Brick6: holyscratch06-ib:/holyscratch06/brick
> >> Brick7: holyscratch07-ib:/holyscratch07/brick
> >> Brick8: holyscratch08-ib:/holyscratch08/brick
> >> Brick9: holyscratch09-ib:/holyscratch09/brick
> >> Brick10: holyscratch10-ib:/holyscratch10/brick
> >> Options Reconfigured:
> >> nfs.disable: on
> >>
> >>
> >> For testing, we wanted to see how rdma stacked up vs tcp using IPoIB,
> so we mounted this like:
> >>
> >> [root at holy2a01202 holyscratch.tcp]# df -h |grep holyscratch
> >> holyscratch:/holyscratch
> >>                       273T  4.1T  269T   2% /n/holyscratch.tcp
> >> holyscratch:/holyscratch.rdma
> >>                       273T  4.1T  269T   2% /n/holyscratch.rdma
> >>
> >> so, 2 mounts, same volume different transports. fstab looks like:
> >>
> >> holyscratch:/holyscratch        /n/holyscratch.tcp      glusterfs
> transport=tcp,fetch-attempts=10,gid-timeout=2,acl,_netdev       0       0
> >> holyscratch:/holyscratch        /n/holyscratch.rdma     glusterfs
> transport=rdma,fetch-attempts=10,gid-timeout=2,acl,_netdev      0       0
> >>
> >> where holyscratch is an RRDNS entry for all the IPoIB interfaces, used for
> fetching the volfile (something that, it seems, just like peering, MUST be tcp?)
> >>
> >> but, again, when running just dumb, dumb, dumb tests (160 threads of dd
> over 8 nodes w/ each thread writing 64GB, so a 10TB throughput test), I'm
> seeing all the traffic on the IPoIB interface for both RDMA and TCP
> transports...when I really shouldn't be seeing ANY tcp traffic, aside from
> volfile fetches/management on the IPoIB interface when using RDMA as a
> transport...right? As a result, from early tests (the bigger 10TB ones are
> running now), the tcp and rdma speeds were basically the same...when I
> would expect the RDMA one to be at least slightly faster...
> >>
> >>
> >> Oh, and this is all 3.4beta4, on both the clients and storage nodes.
> >>
> >> So, I guess my questions are:
> >>
> >> Is this expected/normal?
> >> Is peering/volfile fetching always tcp based?
> >> How should one peer nodes in an RDMA setup?
> >> Should this be tried with only RDMA as a transport on the volume?
> >> Are there more detailed docs for RDMA gluster coming w/ the 3.4 release?
> >>
> >>
> >> --
> >> Matthew Nicholson
> >> Research Computing Specialist
> >> Harvard FAS Research Computing
> >> matthew_nicholson at harvard.edu
> >>
> >> _______________________________________________
> >> Gluster-users mailing list
> >> Gluster-users at gluster.org
> >> http://supercolony.gluster.org/mailman/listinfo/gluster-users
> >
>
> --
> Open Source and Standards @ Red Hat
>
> twitter.com/realjustinclift
>
>